A Benchmark Study of Hybrid CNN-Transformer Architectures in Vision-Language Tasks

Hybrid Models, Vision-Language Tasks, CNN-Transformer, Image Captioning, VQA, Deep Learning, CLIP, Benchmarking

Authors

  • Xin NIE, School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan, Hubei, China
  • Yuan CHEN, School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan, Hubei, China
Volume 2025
June 24, 2025

The intersection of computer vision and natural language processing has led to the rapid development of vision-language models capable of performing complex multimodal tasks such as image captioning, visual question answering (VQA), and image-text retrieval. In this context, hybrid architectures that combine Convolutional Neural Networks (CNNs) for visual feature extraction with Transformer-based encoders for multimodal fusion have become a dominant paradigm. However, with the emergence of fully Transformer-based models, particularly those leveraging Vision Transformers (ViT) and contrastive learning frameworks, the performance, efficiency, and scalability of hybrid models are increasingly under scrutiny.
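To make the hybrid paradigm concrete, the PyTorch sketch below shows the general structure of a CNN-Transformer model: a convolutional backbone produces grid features that are projected into the same width as the text embeddings and fused by a Transformer encoder. The class name, layer sizes, and the simplified text embedding (in place of a pretrained tokenizer and language encoder) are illustrative assumptions, not the implementation of any specific model benchmarked here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridVisionLanguageEncoder(nn.Module):
    """Illustrative hybrid CNN-Transformer encoder (hypothetical, untrained)."""
    def __init__(self, vocab_size=30522, d_model=768, n_heads=8, n_layers=4):
        super().__init__()
        # CNN backbone: keep the convolutional stages, drop the pooling/classifier head
        backbone = resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, H/32, W/32)
        self.visual_proj = nn.Linear(2048, d_model)   # project grid features to model width
        self.text_embed = nn.Embedding(vocab_size, d_model)  # stand-in for a pretrained text encoder
        fusion_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=n_layers)

    def forward(self, images, token_ids):
        # images: (B, 3, 224, 224); token_ids: (B, L)
        grid = self.cnn(images)                        # (B, 2048, 7, 7)
        vis_tokens = grid.flatten(2).transpose(1, 2)   # (B, 49, 2048)
        vis_tokens = self.visual_proj(vis_tokens)      # (B, 49, d_model)
        txt_tokens = self.text_embed(token_ids)        # (B, L, d_model)
        fused = self.fusion(torch.cat([vis_tokens, txt_tokens], dim=1))
        return fused                                   # joint representation for captioning/VQA heads

model = HybridVisionLanguageEncoder()
out = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 16)))
print(out.shape)  # torch.Size([2, 65, 768]): 49 visual tokens + 16 text tokens
```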

This research presents a comprehensive benchmark study comparing hybrid CNN-Transformer architectures with CNN-only and Transformer-only models across three core vision-language tasks: image captioning (MS COCO), visual question answering (VQAv2), and image-text retrieval (Flickr30k). We evaluate leading models such as ViLBERT, VisualBERT, OSCAR, VinVL, BLIP, CLIP, METER, and ViLT, analyzing their performance using widely adopted metrics including BLEU, METEOR, CIDEr, Recall@K, and VQA accuracy. In addition to performance metrics, we assess models in terms of computational efficiency, inference time, parameter count, and real-time deployment potential.
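For reference, Recall@K for image-text retrieval is computed from the image-text similarity matrix produced by a model. The NumPy sketch below assumes a simplified one-to-one image-caption correspondence (Flickr30k in fact provides five captions per image, for which the standard protocol counts a hit if any ground-truth caption appears in the top K).

```python
import numpy as np

def recall_at_k(similarity, k):
    """Recall@K for image-to-text retrieval.

    similarity: (N, N) matrix where similarity[i, j] scores image i against text j,
    assuming the matching text for image i sits at index i (one caption per image).
    """
    ranks = np.argsort(-similarity, axis=1)[:, :k]      # top-k text indices per image
    targets = np.arange(similarity.shape[0])[:, None]   # ground-truth index per image
    hits = (ranks == targets).any(axis=1)
    return hits.mean()

sim = np.random.randn(100, 100)       # placeholder similarity scores
print(recall_at_k(sim, k=5))          # fraction of images whose caption ranks in the top 5
```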

The experimental results reveal that while hybrid CNN-Transformer models have historically achieved state-of-the-art accuracy on vision-language benchmarks by benefiting from explicit object-level representations and multimodal fusion, the gap is narrowing. Recent Transformer-only models like METER and BLIP not only match or exceed hybrid models in accuracy but also significantly outperform them in inference speed, often by a factor of 5 to 60 depending on hardware configurations. Additionally, dual-encoder models such as CLIP demonstrate remarkable zero-shot capabilities and efficient retrieval performance without cross-attention fusion.
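As a concrete illustration of the dual-encoder setup, the snippet below scores an image against a handful of candidate captions using the publicly available openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers; the image path and captions are placeholders. The image and text are encoded independently and compared by similarity, with no cross-attention fusion involved.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a dog playing in the snow", "a plate of sushi", "a red sports car"]
image = Image.open("example.jpg")  # placeholder path

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each caption from the two independent encoders
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```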

This study underscores a critical shift in vision-language modeling: the movement away from complex hybrid architectures toward streamlined, scalable Transformer-based solutions. The results provide valuable insights into model design trade-offs, emphasizing the importance of architectural efficiency, pretraining strategies, and deployment constraints. Finally, the paper outlines open research challenges and future directions, including the development of lightweight vision-language models for edge devices, improved multimodal alignment techniques, and broader generalization across modalities and domains.