RGTranCNet: Effective image captioning model using cross-attention and semantic knowledge

Authors

  • Nguyen Van Thinh, Institute of Mechanics and Applied Informatics, Vietnam Academy of Science and Technology (VAST), 291 Dien Bien Phu Street, District 3, Ho Chi Minh City, Viet Nam, https://orcid.org/0000-0002-7543-5207
  • Tran Van Lang, Journal Editorial Department, HCMC University of Foreign Languages and Information Technology (HUFLIT), 828 Su Van Hanh, District 10, Ho Chi Minh City, Viet Nam, https://orcid.org/0000-0002-8925-5549
  • Van The Thanh, Faculty of Information Technology, HCMC University of Education (HCMUE), 280 An Duong Vuong, District 5, Ho Chi Minh City, Viet Nam

DOI:

https://doi.org/10.15625/2525-2518/22381

Keywords:

Image captioning, Cross-attention mechanism, Transformer, ConceptNet knowledge base

Abstract

Image captioning is an important task that bridges computer vision and natural language processing. However, methods based on long short-term memory (LSTM) networks and traditional attention mechanisms struggle to model complex relationships and are difficult to parallelize. Moreover, accurately describing objects that do not appear in the training set remains a significant challenge. This study proposes a novel image captioning model that combines a Transformer with cross-attention mechanisms and semantic knowledge from ConceptNet to address these issues. The model adopts an encoder-decoder framework: the encoder extracts object-region features and constructs a relational graph to represent the image, while the decoder integrates visual and semantic features through cross-attention to generate precise and diverse captions. Integrating ConceptNet knowledge improves accuracy, particularly for objects absent from the training set. Experimental results on the MS COCO benchmark dataset show that the model outperforms recent state-of-the-art approaches. Furthermore, the proposed semantic knowledge integration method can be readily applied to other image captioning models.
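
To make the cross-attention fusion described in the abstract concrete, the minimal PyTorch sketch below shows one way caption-token queries could attend first to object-region features and then to ConceptNet-derived concept embeddings. This is an illustrative sketch under our own assumptions, not the authors' RGTranCNet implementation; the module name, dimensions, and the assumption that region features and concept embeddings are already extracted are hypothetical.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Caption-token queries attend to image regions, then to ConceptNet concepts."""
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.semantic_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, tokens, regions, concepts):
        # tokens:   (B, T, d) embeddings of the partially generated caption (queries)
        # regions:  (B, R, d) object-region features produced by the visual encoder
        # concepts: (B, C, d) embeddings of related concepts retrieved from ConceptNet
        v, _ = self.visual_attn(tokens, regions, regions)    # cross-attention over image regions
        x = self.norm1(tokens + v)                           # residual connection + layer norm
        s, _ = self.semantic_attn(x, concepts, concepts)     # cross-attention over external knowledge
        return self.norm2(x + s)                             # fused visual-semantic representation

# Toy usage with random features (2 images, 12 caption tokens, 36 regions, 10 concepts)
fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 12, 512), torch.randn(2, 36, 512), torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 12, 512])

In this sketch, the fused output would feed the decoder's feed-forward layers and the final word classifier; keeping the visual and semantic passes separate makes it easy to drop the ConceptNet branch when no related concepts are retrieved.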

References

1. Jamil A., et al. - Deep learning approaches for image captioning: Opportunities, challenges and future potential, IEEE Access, 2024.

2. Verma A., et al. - Automatic image caption generation using deep learning, Multimedia Tools and Applications 83 (2) (2024) 5309-5325.

3. Kavitha R., et al. - Deep learning-based image captioning for visually impaired people, In: E3S Web of Conferences, EDP Sciences, 2023.

4. Pavlopoulos J., Kougia V., and Androutsopoulos I. - A survey on biomedical image captioning, In: Proceedings of the Second Workshop on Shortcomings in Vision and Language, 2019.

5. Szafir D. and Szafir D. A. - Connecting human-robot interaction and data visualization, In: Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, 2021.

6. Lin Y. J., Tseng C. S., and Kao H. Y. - Relation-aware image captioning with hybrid-attention for explainable visual question answering, Journal of Information Science & Engineering 40 (3) (2024).

7. Vinyals O., et al. - Show and tell: A neural image caption generator, In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

8. Huang L., et al. - Attention on attention for image captioning, In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.

9. Xu K., et al. - Show, attend and tell: Neural image caption generation with visual attention, In: International Conference on Machine Learning, PMLR, 2015.

10. Thinh N. V., Lang T. V., and Thanh V. T. - Automatic image captioning based on object detection and attention mechanism, In: The 16th National Conference on Fundamental and Applied IT Research (FAIR'2023), Natural Science and Technology Publishing House, Da Nang, 2023, pp. 395-404.

11. Anderson P., et al. - Bottom-up and top-down attention for image captioning and visual question answering, In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

12. Thinh N. V., Lang T. V., and Thanh V. T. - OD-VR-CAP: Image captioning based on detecting and predicting relationships between objects, Journal of Computer Science and Cybernetics 40 (4) (2024).

13. Xu N., et al. - Scene graph captioner: Image captioning based on structural visual representation, Journal of Visual Communication and Image Representation 58 (2019) 477-485.

14. Thinh N. V., Lang T. V., and Thanh V. T. - A method of automatic image captioning based on scene graph and LSTM network, In: The 15th National Conference on Fundamental and Applied IT Research (FAIR'2022), Natural Science and Technology Publishing House, Ha Noi, 2022, pp. 431-439.

15. Li Z., et al. - Modeling graph-structured contexts for image captioning, Image and Vision Computing 129 (2023) 104591.

16. Vaswani A., et al. - Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).

17. Hendricks L. A., et al. - Deep compositional captioning: Describing novel object categories without paired training data, In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

18. Zhou Y., Sun Y., and Honavar V. G. - Improving image captioning by leveraging knowledge graphs, In: IEEE Winter Conference on Applications of Computer Vision (WACV 2019), Waikoloa Village, HI, USA, IEEE, January 7-11, 2019.

19. Hafeth D. A., Kollias S., and Ghafoor M. - Semantic representations with attention networks for boosting image captioning, IEEE Access 11 (2023) 40230-40239.

20. Patwari N. and Naik D. - En-de-cap: An encoder decoder model for image captioning, In: 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), IEEE, 2021.

21. Xie T., et al. - Bi-LS-AttM: A bidirectional LSTM and attention mechanism model for improving image captioning, Applied Sciences 13 (13) (2023) 7916.

22. Chen S., et al. - Say as you wish: Fine-grained control of image caption generation with abstract scene graphs, In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

23. Yan J., et al. - Caption TLSTMs: Combining transformer with LSTMs for image captioning, International Journal of Multimedia Information Retrieval 11 (2) (2022) 111-121.

24. Ramos L., et al. - A study of ConvNeXt architectures for enhanced image captioning, IEEE Access, 2024.

25. Wang Y., Xu J., and Sun Y. - End-to-end transformer based model for image captioning, In: Proceedings of the AAAI Conference on Artificial Intelligence, 2022.

26. Yang X., et al. - Context-aware transformer for image captioning, Neurocomputing 549 (2023) 126440.

27. Li Z., Su Q., and Chen T. - External knowledge-assisted Transformer for image captioning, Image and Vision Computing 140 (2023) 104864.

28. Hamilton W., Ying Z., and Leskovec J. - Inductive representation learning on large graphs, Advances in Neural Information Processing Systems 30 (2017).

29. Speer R., Chin J., and Havasi C. - ConceptNet 5.5: An open multilingual graph of general knowledge, In: Proceedings of the AAAI Conference on Artificial Intelligence, 2017.

30. Lin T. Y., et al. - Microsoft COCO: Common objects in context, In: European Conference on Computer Vision, Springer, 2014.

31. Karpathy A. and Fei-Fei L. - Deep visual-semantic alignments for generating image descriptions, In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

32. Papineni K., et al. - BLEU: A method for automatic evaluation of machine translation, In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.

33. Banerjee S. and Lavie A. - METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.

34. Lin C. Y. - ROUGE: A package for automatic evaluation of summaries, In: Text Summarization Branches Out, 2004.

35. Vedantam R., Lawrence Zitnick C., and Parikh D. - CIDEr: Consensus-based image description evaluation, In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

Published

15-07-2025

How to Cite

[1] N. V. Thinh, T. V. Lang, and V. T. Thanh, “RGTranCNet: Effective image captioning model using cross-attention and semantic knowledge”, Vietnam J. Sci. Technol., vol. 63, no. 5, Jul. 2025.

Issue

Vol. 63 No. 5 (2025)

Section

Electronics - Telecommunication