Before Imagen, DALL-E, created by the commercial AI lab OpenAI, was the leader in the field. Google's Imagen has now overtaken DALL-E in output quality, and human raters also prefer Imagen over competing models. According to Google, Imagen combines an unprecedented degree of photorealism with a deep level of language understanding.

Imagen is an AI system that generates photorealistic images from text supplied by the user. It uses a large frozen T5-XXL encoder to encode the input text into embeddings; large pretrained frozen text encoders turn out to be quite effective for the text-to-image task. A conditional diffusion model maps the text embedding to a 64×64 image, and text-conditional super-resolution diffusion models then upsample it to 256×256 and 1024×1024. Google's view is that scaling the text encoder matters more than scaling the diffusion model. It has also introduced a new thresholding diffusion sampler, which makes it possible to use very large classifier-free guidance weights, and a new Efficient U-Net architecture that is more compute-efficient, more memory-efficient, and faster to converge. Imagen attains a new state-of-the-art COCO FID of 7.27, and human raters find its samples on par with the reference images in terms of image-text alignment.

Google is not making Imagen public yet. It believes text-to-image generation has creative potential, but also the potential to cause harm in society, for example by spreading fake news and enabling harassment through generated images. Let's see when Google releases it after sorting out these shortcomings.
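For readers curious how classifier-free guidance and the thresholding sampler fit together, here is a minimal sketch in plain Python. It is not Google's implementation: the function names `guided_noise` and `dynamic_threshold` are hypothetical, the image is treated as a flat list of pixel values assumed to live in [-1, 1], and the 99.5 percentile and the nearest-rank percentile rule are illustrative choices.

```python
import math

def guided_noise(eps_cond, eps_uncond, weight):
    """Classifier-free guidance: start from the unconditional noise
    estimate and move toward the text-conditioned one, scaled by weight."""
    return [u + weight * (c - u) for c, u in zip(eps_cond, eps_uncond)]

def dynamic_threshold(x0_pred, percentile=99.5):
    """Clip predicted pixels to [-s, s] and divide by s, where s is a high
    percentile of the absolute pixel values (never below 1)."""
    magnitudes = sorted(abs(v) for v in x0_pred)
    # Nearest-rank percentile (an assumption made for this sketch).
    rank = max(1, math.ceil(percentile / 100 * len(magnitudes)))
    s = max(magnitudes[rank - 1], 1.0)
    return [max(-s, min(s, v)) / s for v in x0_pred]

# A large guidance weight can push predictions far outside [-1, 1];
# thresholding pulls them back into range at each sampling step.
saturated = [-3.0, 0.5, 1.5, 3.0]
print(dynamic_threshold(saturated, percentile=100))
```

A weight of 1 reproduces the text-conditioned estimate, while larger weights extrapolate past it; that extrapolation is what makes a rescaling step like the one above useful.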