
Adversarially Resilient Multi-Label Object Detection: An Ensemble of ViT, EfficientNetV2-L, and YOLO12 for Forensic Imagery

  • Leila Rzayeva *
  • Perizat Tazhibayeva
  • Yonghao Wang

* Corresponding author for this work

Astana IT University

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Detecting prohibited objects in images requires models that combine high accuracy with resilience to adversarial perturbations. This study presents a comparative evaluation of three state-of-the-art architectures, Vision Transformer (ViT B16), EfficientNetV2-L, and YOLO12, and of an ensemble of these models, for multi-label detection of five content classes (weapons, drugs, nudity, violence, and benign) in forensic imagery. A dataset of 13,556 images was split 60/20/20 into training, validation, and test sets. Each model was trained using binary cross-entropy (for classification heads) and YOLO loss (for detection heads) and evaluated on clean data and on adversarial examples generated by FGSM (ϵ = 8/255) and PGD (α = 2/255, 10 iterations). On clean images, ViT B16 and EfficientNetV2-L achieved 93.5% and 92.8% accuracy, respectively, while YOLO12 reached 84.0%. Under PGD attack, the accuracies of the three models dropped to 81.0%, 79.5%, and 68.0%, respectively. The averaging ensemble attained 95.2% clean accuracy, 85.5% adversarial accuracy, and 96.8% recall. The results indicate that architectural diversity and ensemble fusion significantly enhance both detection performance and adversarial robustness in security-critical applications.
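
The attack settings and the averaging ensemble named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `ToyLinearModel` is a hypothetical stand-in for a trained network (a linear layer with per-label sigmoid outputs), chosen only because its input gradient has a closed form, and the random weights and input are placeholders. The abstract gives ϵ = 8/255 for FGSM and α = 2/255 with 10 iterations for PGD; reusing ϵ = 8/255 as the PGD budget is an assumption made here.

```python
import numpy as np

EPS = 8 / 255    # L-infinity budget (stated for FGSM; assumed here for PGD too)
ALPHA = 2 / 255  # PGD step size from the abstract
STEPS = 10       # PGD iterations from the abstract


class ToyLinearModel:
    """Hypothetical stand-in for a trained network: sigmoid(W @ x + b) per label."""

    def __init__(self, weights, bias):
        self.weights = weights  # shape (n_labels, n_pixels)
        self.bias = bias        # shape (n_labels,)

    def logits(self, x):
        return self.weights @ x + self.bias

    def predict_proba(self, x):
        return 1.0 / (1.0 + np.exp(-self.logits(x)))

    def bce_loss(self, x, y):
        # Summed binary cross-entropy, written in a numerically stable form.
        z = self.logits(x)
        return float(np.sum(np.logaddexp(0.0, z) - y * z))

    def bce_input_grad(self, x, y):
        # For sigmoid + BCE, dL/dz = p - y; chain rule through z = W x + b.
        return self.weights.T @ (self.predict_proba(x) - y)


def fgsm(model, x, y, eps=EPS):
    """One signed-gradient step of size eps, clipped to the valid pixel range."""
    return np.clip(x + eps * np.sign(model.bce_input_grad(x, y)), 0.0, 1.0)


def pgd(model, x, y, eps=EPS, alpha=ALPHA, steps=STEPS):
    """Repeated signed-gradient steps, projected back into the eps-ball around x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(model.bce_input_grad(x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the L-inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep pixel values valid
    return x_adv


def ensemble_proba(models, x):
    """Averaging ensemble: mean of the per-label probabilities across models."""
    return np.mean([m.predict_proba(x) for m in models], axis=0)


# Placeholder data: three random "models", one flattened 8x8 image, five labels.
rng = np.random.default_rng(0)
n_labels, n_pixels = 5, 64
models = [
    ToyLinearModel(rng.normal(size=(n_labels, n_pixels)), rng.normal(size=n_labels))
    for _ in range(3)
]
x = rng.uniform(size=n_pixels)
y = np.array([1.0, 0.0, 0.0, 1.0, 0.0])  # multi-label target, e.g. two classes present

x_fgsm = fgsm(models[0], x, y)
x_pgd = pgd(models[0], x, y)
avg = ensemble_proba(models, x_pgd)  # ensemble scores on the PGD example
```

In this sketch the adversarial example is crafted against a single model and then scored by the whole ensemble; whether the study attacked individual models or the ensemble as a whole is not stated in the abstract.
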
Original language: English
Title of host publication: 2025 IEEE/ACIS 29th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD)
Publisher: IEEE
ISBN (Print): 9798331512583
Publication status: Published (VoR) - 28 Nov 2025
