Small open vocabulary object detection from drone images using OWL-VIT combined with SAHI

Nguyet Nguyen; Cong Tran; Michael Neff; Cuong Pham

doi:10.15625/1813-9663/21899

Author affiliations

Authors

Nguyet Nguyen (ORCID: https://orcid.org/0009-0004-4910-473X)
Cong Tran (Posts and Telecommunications Institute of Technology)
Michael Neff (University of California, Davis, One Shields Avenue, Davis, CA 95616, United States)
Cuong Pham (Posts and Telecommunications Institute of Technology; ORCID: https://orcid.org/0000-0003-0973-0889)

Keywords:

Closed-set object detection, open-vocabulary object detection, drone imagery, vision transformer, small object detection.

Abstract

The demand for precise and efficient object detection in aerial imagery has surged, driven by applications in agriculture, surveillance, disaster management, and environmental monitoring. However, detecting small objects in drone-captured images remains challenging because of low resolution, occlusion, and varying object scales. This research explores a novel approach to small, open-vocabulary object detection that combines the OWL-ViT (Vision Transformer for Open-World Localization) model with the SAHI (Slicing Aided Hyper Inference) technique. OWL-ViT handles open-vocabulary object detection and is leveraged for its robust feature extraction and its generalization across diverse object categories. SAHI addresses the small-object challenge by slicing high-resolution drone images into smaller patches, enabling more focused and detailed inference on each patch. In a comprehensive evaluation on the VisDrone dataset, the combined method improves mAP@50 for small-scale object detection by an average of +6.8%.
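The slicing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the SAHI library's API: the function names, the 512-pixel tile size, and the 0.2 overlap ratio are assumptions chosen only for illustration. The sketch computes overlapping tile windows over an image and shifts a box predicted inside a tile back to full-image coordinates. Running a detector on each tile and merging duplicate detections across overlapping tiles are omitted.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def slice_starts(length: int, tile: int, overlap: float) -> List[int]:
    """Start offsets of overlapping tiles that together cover [0, length)."""
    if tile >= length:
        return [0]  # the image is smaller than one tile along this axis
    stride = max(1, int(tile * (1.0 - overlap)))
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the image edge
    return starts


def slice_windows(width: int, height: int,
                  tile: int = 512, overlap: float = 0.2) -> List[Box]:
    """Tile windows in row-major order (hypothetical defaults)."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in slice_starts(height, tile, overlap)
        for x in slice_starts(width, tile, overlap)
    ]


def to_full_image(box: Box, window: Box) -> Box:
    """Shift a box from tile coordinates to full-image coordinates."""
    ox, oy = window[0], window[1]
    return (box[0] + ox, box[1] + oy, box[2] + ox, box[3] + oy)


if __name__ == "__main__":
    windows = slice_windows(1000, 700, tile=400, overlap=0.25)
    for w in windows:
        print(w)
    # A box found at (10, 20, 30, 40) inside the last tile, in image coordinates:
    print(to_full_image((10, 20, 30, 40), windows[-1]))
```

Overlap between neighbouring tiles reduces the chance that a small object is cut at a tile border and missed in both halves, at the cost of running inference on more tiles.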

Published

07-05-2026

How to Cite

[1] N. Nguyen, C. Tran, M. Neff, and C. Pham, “Small open vocabulary object detection from drone images using OWL-VIT combined with SAHI,” J. Comput. Sci. Cybern., vol. 42, no. 2, pp. 153–165, May 2026.

Issue

Vol. 42 No. 2 (2026)

Section

Articles

License

1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.
2. This assignment of copyright to the Journal of Computer Science and Cybernetics is made on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and a reference to the original journal publication.
3. We warrant that the Work is our own, has not been published before in its current or a substantially similar form, is not under consideration for publication elsewhere, does not contain any unlawful statements, and does not infringe any existing copyright.
4. We also warrant that we have obtained the necessary permission from the copyright holder/s to reproduce in the article any materials, including tables, diagrams or photographs, not owned by me/us.