Authors
Gelbukh Alexander
Title Generating image captions through multimodal embedding
Type Journal
Sub-type JCR
Description Journal of Intelligent & Fuzzy Systems
Abstract Caption generation requires the best of both Computer Vision and Natural Language Processing, and recent advances in both fields have produced many efficient models. Automatic image captioning can be used to describe website content or to generate frame-by-frame descriptions of video for the vision-impaired, among many other applications. This work describes a model that generates novel captions for previously unseen images using a multimodal architecture combining a Recurrent Neural Network (RNN) and a Convolutional Neural Network (CNN). The model is trained on the Microsoft Common Objects in Context (MSCOCO) image captioning dataset and aligns captions and images in the same representation space, so that an image lies close to its relevant captions in that space and far from dissimilar captions and dissimilar images. A ResNet-50 architecture extracts features from the images, while GloVe embeddings together with a Gated Recurrent Unit (GRU) based RNN provide the text representation. The MSCOCO evaluation server is used to evaluate the machine-generated caption for a given image.
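The abstract outlines the architecture: ResNet-50 image features and a GloVe + GRU caption encoder projected into a shared embedding space where matching image-caption pairs lie close together and mismatched pairs lie apart. The sketch below is a minimal PyTorch illustration of that arrangement; the embedding size, margin, hinge-style alignment loss, and all class and variable names are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn
import torchvision.models as models

EMBED_DIM = 512  # assumed size of the joint image-caption space

class ImageEncoder(nn.Module):
    """ResNet-50 features projected into the joint embedding space."""
    def __init__(self, embed_dim=EMBED_DIM):
        super().__init__()
        resnet = models.resnet50()                                    # CNN feature extractor
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):                    # images: (B, 3, 224, 224)
        feats = self.backbone(images).flatten(1)  # (B, 2048) pooled CNN features
        return nn.functional.normalize(self.fc(feats), dim=-1)

class CaptionEncoder(nn.Module):
    """GloVe-initialised embeddings fed through a GRU, projected to the same space."""
    def __init__(self, vocab_size, glove_dim=300, hidden=512, embed_dim=EMBED_DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, glove_dim)  # load pretrained GloVe weights here
        self.gru = nn.GRU(glove_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, embed_dim)

    def forward(self, tokens):              # tokens: (B, T) word indices
        _, h = self.gru(self.embed(tokens)) # h: (1, B, hidden), final GRU state
        return nn.functional.normalize(self.fc(h[-1]), dim=-1)

def alignment_loss(img_emb, cap_emb, margin=0.2):
    """Hinge loss pulling matching image/caption pairs together and pushing
    mismatched ones apart; one common way to realise the 'close to relevant
    captions, far from dissimilar ones' objective (assumed, not from the paper)."""
    scores = img_emb @ cap_emb.t()      # cosine similarities, (B, B)
    pos = scores.diag().unsqueeze(1)    # similarity of each matching pair
    cost = (margin + scores - pos).clamp(min=0)
    cost.fill_diagonal_(0)              # do not penalise the matching pairs themselves
    return cost.mean()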
Notes DOI 10.3233/JIFS-179027
Place Amsterdam
Country Netherlands
Pages 4787-4796
Vol. / Chap. v. 36 no. 5
Start 2019-05-14
End
ISBN/ISSN