Enhancing object detection efficiency with transformers through multi-level feature integration

Dung Nguyen; Van-Dung Hoang; Van-Tuong-Lan Le

doi:10.15625/1813-9663/21114

Author affiliations

Authors

Dung Nguyen Hue University of Sciences, Hue University, No. 77, Nguyen Hue Street, Thuan Hoa Ward, Hue City, Viet Nam https://orcid.org/0009-0000-4510-7504
Van-Dung Hoang Ho Chi Minh City University of Technology and Engineering, No. 1, Vo Van Ngan Street, Linh Chieu Ward, Thu Duc City, Ho Chi Minh City https://orcid.org/0000-0001-7554-1707
Van-Tuong-Lan Le Department of Academic and Students’ Affairs, Hue University, No. 3, Le Loi Street, Thuan Hoa Ward, Hue City, Viet Nam https://orcid.org/0009-0008-2687-5346


Keywords:

High resolution, multi-level features, object detection, transformer.

Abstract

This paper presents a novel approach to enhancing object detection efficiency by integrating multi-level features within a transformer architecture. Traditional object detection methods often rely on single-level feature representations, which may limit their ability to accurately detect objects of varying sizes and complexities. By leveraging multi-level feature integration within the transformer framework, our method captures a richer set of spatial and semantic information, leading to more precise and robust object detection. The attention mechanisms of transformers are used to combine these features effectively, improving detection accuracy and localization. The proposed approach is evaluated on the PASCAL VOC benchmark dataset, where it outperforms conventional single-level feature-based methods. Experimental results show that our model achieves an mAP@0.5 of 87% on PASCAL VOC, outperforming recent state-of-the-art methods while remaining computationally efficient. These findings highlight the potential of multi-level feature integration within transformers for advancing the field of object detection.
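
For readers unfamiliar with the metric, mAP@0.5 counts a predicted box as a correct detection when its intersection over union (IoU) with a ground-truth box reaches a threshold of 0.5. The following is a minimal sketch of that matching criterion, not the authors' evaluation code. It assumes boxes are given as (x_min, y_min, x_max, y_max); the function names `iou` and `is_true_positive` are illustrative, and whether the comparison is strict or inclusive varies between evaluation toolkits.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp to zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """Assumed convention: a match requires IoU >= threshold."""
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    gt = (0, 0, 10, 10)
    print(is_true_positive((0, 0, 10, 8), gt))   # large overlap
    print(is_true_positive((5, 0, 15, 10), gt))  # half-shifted box
```

Average precision for a class is then computed from the precision-recall curve obtained by ranking detections by confidence, and mAP is the mean of that value over all classes.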
