Image Captioning

Image Captioning is a Computer Vision and Natural Language Processing task where the goal is to objectively describe the content of an image in a few words. Model architectures for this problem therefore mix components from image processing architectures (such as convolution layers) and text processing architectures (such as transformer decoder layers). Image Captioning datasets can also be used for the reverse task, that is, generating synthetic images from text descriptions.