[1] N. V. Thinh, T. V. Lang, and V. T. Thanh, “RGTranCNet: Effective image captioning model using cross-attention and semantic knowledge”, Vietnam J. Sci. Technol., vol. 64, no. 1, pp. 123–138, Jul. 2025.