A Benchmark Study of Hybrid CNN-Transformer Architectures in Vision-Language Tasks