A Benchmark Study of Hybrid CNN-Transformer Architectures in Vision-Language Tasks
The intersection of computer vision and natural language processing has led to the rapid development of vision-language models capable of performing complex multimodal tasks such as image captioning, visual question answering (VQA), and image-text retrieval. In this context, hybrid architectures that combine Convolutional Neural Networks (CNNs) for visual feature extraction with Transformer-based encoders for multimodal fusion have become a dominant paradigm. However, with the emergence of fully Transformer-based models, particularly those leveraging Vision Transformers (ViT) and contrastive learning frameworks, the performance, efficiency, and scalability of hybrid models are increasingly under scrutiny.
This research presents a comprehensive benchmark study comparing hybrid CNN-Transformer architectures with CNN-only and Transformer-only models across three core vision-language tasks: image captioning (MS COCO), visual question answering (VQAv2), and image-text retrieval (Flickr30k). We evaluate leading models, including ViLBERT, VisualBERT, OSCAR, VinVL, BLIP, CLIP, METER, and ViLT, using widely adopted metrics such as BLEU, METEOR, CIDEr, Recall@K, and VQA accuracy. Beyond accuracy, we assess each model's computational efficiency in terms of inference time, parameter count, and suitability for real-time deployment.
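To make the retrieval metric concrete, the following minimal Python sketch computes Recall@K from an image-to-caption similarity matrix; the function name and the simplifying assumption that caption i is the ground-truth match for image i are ours for illustration, not drawn from the benchmark code.

import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    # similarity[i, j] scores image i against caption j; the ground-truth
    # caption for image i is assumed to sit at index i (a simplification).
    ranks = (-similarity).argsort(axis=1)   # captions sorted best-first for each image
    hits = (ranks[:, :k] == np.arange(len(similarity))[:, None]).any(axis=1)
    return float(hits.mean())               # fraction of images whose match appears in the top k

# Example: Recall@1 and Recall@5 on a random 5x5 similarity matrix.
sims = np.random.rand(5, 5)
print(recall_at_k(sims, 1), recall_at_k(sims, 5))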
The experimental results show that although hybrid CNN-Transformer models have historically achieved state-of-the-art accuracy on vision-language benchmarks, benefiting from explicit object-level representations and multimodal fusion, this gap is narrowing. Recent Transformer-only models such as METER and BLIP not only match or exceed hybrid models in accuracy but also outperform them substantially in inference speed, often by a factor of 5 to 60 depending on the hardware configuration. In addition, dual-encoder models such as CLIP demonstrate strong zero-shot capabilities and efficient retrieval without any cross-attention fusion.
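As an illustration of the dual-encoder retrieval pattern referenced above, the sketch below scores a set of captions against an image using the publicly released CLIP weights via Hugging Face Transformers; the checkpoint name, captions, and image path are assumptions made for the example, not artifacts of this study's evaluation pipeline.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Publicly available CLIP checkpoint (assumed here for illustration).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a dog playing in the snow", "a plate of pasta", "a city skyline at night"]
image = Image.open("example.jpg")  # hypothetical local image

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Images and texts are encoded independently; retrieval reduces to a
# similarity lookup, with no cross-attention fusion step.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)  # relative relevance of each caption to the image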
This study underscores a critical shift in vision-language modeling: a movement away from complex hybrid architectures toward streamlined, scalable Transformer-based solutions. The results offer practical insight into model design trade-offs, emphasizing architectural efficiency, pretraining strategy, and deployment constraints. Finally, the paper outlines open research challenges and future directions, including lightweight vision-language models for edge devices, improved multimodal alignment techniques, and broader generalization across modalities and domains.
Copyright (c) 2025 Xin NIE, Yuan CHEN

This work is licensed under a Creative Commons Attribution 4.0 International License.