RGTranCNet: Effective image captioning model using cross-attention and semantic knowledge
DOI: https://doi.org/10.15625/2525-2518/22381
Keywords: Image captioning, Cross-attention mechanism, Transformer, ConceptNet knowledge base
Abstract
Image captioning is an important task that bridges computer vision and natural language processing. However, methods based on long short-term memory (LSTM) networks and traditional attention mechanisms are limited in modeling complex relationships and offer poor parallelization. Moreover, accurately describing objects that do not appear in the training set poses a significant challenge. This study proposes a novel image captioning model that combines a Transformer with cross-attention mechanisms and semantic knowledge from ConceptNet to address these issues. The model adopts an encoder-decoder framework: the encoder extracts object region features and constructs a relational graph to represent the image, while the decoder integrates visual and semantic features through cross-attention to generate precise and diverse captions. Integrating ConceptNet knowledge enhances accuracy, particularly for objects not present in the training set. Experimental results on the MS COCO benchmark dataset demonstrate that the model outperforms recent state-of-the-art approaches. Furthermore, this study's semantic knowledge integration method can easily be applied to other image captioning models.
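To make the decoder design described in the abstract concrete, the PyTorch sketch below shows one plausible decoder layer that applies masked self-attention to the partial caption and then two cross-attention passes, one over visual region features and one over semantic (knowledge-derived) features. This is a minimal illustrative sketch only: the class name DualCrossAttentionDecoderLayer, the dimensions, and the feature shapes are assumptions for illustration, not the paper's actual implementation.

import torch
import torch.nn as nn

class DualCrossAttentionDecoderLayer(nn.Module):
    """Hypothetical caption-decoder layer: masked self-attention over the
    partial caption, then cross-attention over visual region features,
    then cross-attention over semantic (e.g. ConceptNet-derived) features."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.semantic_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, words, visual, semantic, causal_mask=None):
        # Masked self-attention over the caption tokens generated so far.
        x = self.norms[0](words + self.self_attn(words, words, words,
                                                 attn_mask=causal_mask)[0])
        # Cross-attention: caption tokens query the object-region features.
        x = self.norms[1](x + self.visual_attn(x, visual, visual)[0])
        # Cross-attention: caption tokens query the semantic-knowledge features.
        x = self.norms[2](x + self.semantic_attn(x, semantic, semantic)[0])
        return self.norms[3](x + self.ffn(x))

# Toy usage with random tensors standing in for real features.
layer = DualCrossAttentionDecoderLayer()
words = torch.randn(2, 12, 512)     # partial caption embeddings
visual = torch.randn(2, 36, 512)    # e.g. 36 detected-region features
semantic = torch.randn(2, 20, 512)  # e.g. ConceptNet concept embeddings
out = layer(words, visual, semantic)  # -> shape (2, 12, 512)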
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Vietnam Journal of Science and Technology (VJST) is an open-access, peer-reviewed journal. All published articles are free for everyone to read and download. Articles are published under the terms of the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA) License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited and the ShareAlike terms are followed.
Copyright on any research article published in VJST is retained by the respective author(s) without restriction. Authors grant the VAST Journals System a license to publish the article and to identify itself as the original publisher. By giving VJST permission to publish their work, whether via the VJST journal portal or another channel, authors agree to all the terms and conditions of the https://creativecommons.org/licenses/by-sa/4.0/ license and to the terms and conditions set by VJST.
Authors are responsible for securing all necessary copyright permissions for the use of third-party materials in their manuscripts.