A Benchmark Study of Hybrid CNN-Transformer Architectures in Vision-Language Tasks

Hybrid Models, Vision-Language Tasks, CNN-Transformer, Image Captioning, VQA, Deep Learning, CLIP, Benchmarking

Authors

  • Xin NIE, School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan, Hubei, China
  • Yuan CHEN, School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan, Hubei, China
Volume 2025
June 24, 2025

The intersection of computer vision and natural language processing has led to the rapid development of vision-language models capable of performing complex multimodal tasks such as image captioning, visual question answering (VQA), and image-text retrieval. In this context, hybrid architectures that combine Convolutional Neural Networks (CNNs) for visual feature extraction with Transformer-based encoders for multimodal fusion have become a dominant paradigm. However, with the emergence of fully Transformer-based models, particularly those leveraging Vision Transformers (ViT) and contrastive learning frameworks, the performance, efficiency, and scalability of hybrid models are increasingly under scrutiny.
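To make the hybrid paradigm concrete, the PyTorch sketch below shows the general structure of a CNN-Transformer model: a convolutional backbone produces grid features that are projected into the same width as the text embeddings and fused by a Transformer encoder. The class name, layer sizes, and the simplified text embedding (in place of a pretrained tokenizer and language encoder) are illustrative assumptions, not the implementation of any specific model benchmarked here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridVisionLanguageEncoder(nn.Module):
    """Illustrative hybrid CNN-Transformer encoder (hypothetical, untrained)."""
    def __init__(self, vocab_size=30522, d_model=768, n_heads=8, n_layers=4):
        super().__init__()
        # CNN backbone: keep the convolutional stages, drop the pooling/classifier head
        backbone = resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, H/32, W/32)
        self.visual_proj = nn.Linear(2048, d_model)   # project grid features to model width
        self.text_embed = nn.Embedding(vocab_size, d_model)  # stand-in for a pretrained text encoder
        fusion_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=n_layers)

    def forward(self, images, token_ids):
        # images: (B, 3, 224, 224); token_ids: (B, L)
        grid = self.cnn(images)                        # (B, 2048, 7, 7)
        vis_tokens = grid.flatten(2).transpose(1, 2)   # (B, 49, 2048)
        vis_tokens = self.visual_proj(vis_tokens)      # (B, 49, d_model)
        txt_tokens = self.text_embed(token_ids)        # (B, L, d_model)
        fused = self.fusion(torch.cat([vis_tokens, txt_tokens], dim=1))
        return fused                                   # joint representation for captioning/VQA heads

model = HybridVisionLanguageEncoder()
out = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 16)))
print(out.shape)  # torch.Size([2, 65, 768]): 49 visual tokens + 16 text tokens
```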

This research presents a comprehensive benchmark study comparing hybrid CNN-Transformer architectures with CNN-only and Transformer-only models across three core vision-language tasks: image captioning (MS COCO), visual question answering (VQAv2), and image-text retrieval (Flickr30k). We evaluate leading models such as ViLBERT, VisualBERT, OSCAR, VinVL, BLIP, CLIP, METER, and ViLT, analyzing their performance using widely adopted metrics including BLEU, METEOR, CIDEr, Recall@K, and VQA accuracy. In addition to performance metrics, we assess models in terms of computational efficiency, inference time, parameter count, and real-time deployment potential.
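For reference, Recall@K for image-text retrieval is computed from the image-text similarity matrix produced by a model. The NumPy sketch below assumes a simplified one-to-one image-caption correspondence (Flickr30k in fact provides five captions per image, for which the standard protocol counts a hit if any ground-truth caption appears in the top K).

```python
import numpy as np

def recall_at_k(similarity, k):
    """Recall@K for image-to-text retrieval.

    similarity: (N, N) matrix where similarity[i, j] scores image i against text j,
    assuming the matching text for image i sits at index i (one caption per image).
    """
    ranks = np.argsort(-similarity, axis=1)[:, :k]      # top-k text indices per image
    targets = np.arange(similarity.shape[0])[:, None]   # ground-truth index per image
    hits = (ranks == targets).any(axis=1)
    return hits.mean()

sim = np.random.randn(100, 100)       # placeholder similarity scores
print(recall_at_k(sim, k=5))          # fraction of images whose caption ranks in the top 5
```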

The experimental results reveal that while hybrid CNN-Transformer models have historically achieved state-of-the-art accuracy on vision-language benchmarks by benefiting from explicit object-level representations and multimodal fusion, the gap is narrowing. Recent Transformer-only models like METER and BLIP not only match or exceed hybrid models in accuracy but also significantly outperform them in inference speed, often by a factor of 5 to 60 depending on hardware configurations. Additionally, dual-encoder models such as CLIP demonstrate remarkable zero-shot capabilities and efficient retrieval performance without cross-attention fusion.
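As a concrete illustration of the dual-encoder setup, the snippet below scores an image against a handful of candidate captions using the publicly available openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers; the image path and captions are placeholders. The image and text are encoded independently and compared by similarity, with no cross-attention fusion involved.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a dog playing in the snow", "a plate of sushi", "a red sports car"]
image = Image.open("example.jpg")  # placeholder path

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each caption from the two independent encoders
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```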

This study underscores a critical shift in vision-language modeling: the movement away from complex hybrid architectures toward streamlined, scalable Transformer-based solutions. The results provide valuable insights into model design trade-offs, emphasizing the importance of architectural efficiency, pretraining strategies, and deployment constraints. Finally, the paper outlines open research challenges and future directions, including the development of lightweight vision-language models for edge devices, improved multimodal alignment techniques, and broader generalization across modalities and domains.