Abstract
Detecting prohibited objects in images requires models that combine high accuracy with resilience to adversarial perturbations. This study presents a comparative evaluation of three state-of-the-art architectures, namely Vision Transformer (ViT B16), EfficientNetV2-L, and YOLO12, together with an averaging ensemble of these models, for multi-label detection in forensic imagery across five categories (weapons, drugs, nudity, violence, and benign). A dataset of 13,556 images was split 60/20/20 into training, validation, and test sets. Models were trained with binary cross-entropy for classification heads and the YOLO loss for detection heads, then evaluated on clean data and on adversarial examples generated by FGSM (ϵ = 8/255) and PGD (α = 2/255, 10 iterations). On clean images, ViT B16 and EfficientNetV2-L achieved 93.5% and 92.8% accuracy, respectively, while YOLO12 reached 84.0%. Under PGD attack, these accuracies dropped to 81.0%, 79.5%, and 68.0%, respectively. The averaging ensemble attained 95.2% clean accuracy, 85.5% adversarial accuracy, and 96.8% recall. These results indicate that architectural diversity and ensemble fusion improve both detection performance and adversarial robustness in security-critical applications.
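The two attacks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a hypothetical toy logistic model in place of the real networks so that the input gradient of binary cross-entropy can be written analytically, and all function names are illustrative. The FGSM budget (8/255), PGD step size (2/255), and iteration count (10) come from the abstract. The PGD budget is not stated there, so reusing 8/255 is an assumption, as is the [0, 1] pixel range.

```python
import math

EPS = 8 / 255    # FGSM budget from the abstract; reused for PGD as an assumption
ALPHA = 2 / 255  # PGD step size from the abstract
STEPS = 10       # PGD iteration count from the abstract


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Probability from a toy logistic model (stand-in for a real network)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def bce_input_grad(x, y, w, b):
    """Gradient of binary cross-entropy with respect to the input x."""
    p = predict(x, w, b)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def fgsm(x, y, w, b, eps=EPS):
    """Single step of size eps along the sign of the loss gradient."""
    g = bce_input_grad(x, y, w, b)
    return [clip(xi + eps * sign(gi), 0.0, 1.0) for xi, gi in zip(x, g)]


def pgd(x, y, w, b, eps=EPS, alpha=ALPHA, steps=STEPS):
    """Iterated signed-gradient steps, projected back into the eps-ball."""
    x_adv = list(x)
    for _ in range(steps):
        g = bce_input_grad(x_adv, y, w, b)
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Project onto the L-infinity ball around x, then onto valid pixels.
        x_adv = [clip(clip(xa, xi - eps, xi + eps), 0.0, 1.0)
                 for xa, xi in zip(x_adv, x)]
    return x_adv
```

In both functions the perturbation stays within an L-infinity distance of `eps` from the clean input. PGD differs from FGSM only in taking several smaller steps with a projection after each one.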
| Original language | English |
|---|---|
| Title of host publication | 2025 IEEE/ACIS 29th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD) |
| Publisher | IEEE |
| ISBN (Print) | 9798331512583 |
| Publication status | Published (VoR) - 28 Nov 2025 |
| Title | Adversarially Resilient Multi-Label Object Detection: An Ensemble of ViT, EfficientNetV2-L, and YOLO12 for Forensic Imagery |