Heterogeneous Ensemble Learning for Context-Aware Image Captioning with Transformers

Authors

  • Mr Kothakonda Chandhar, Department of Computer Science & Engineering, Kakatiya University, Warangal https://orcid.org/0009-0000-3795-237X
  • Manchala Sadanandam Department of Computer Science & Engineering, Kakatiya University, Warangal, Telangana, India

DOI:

https://doi.org/10.56042/jsir.v84i12.23517

Keywords:

Attention mechanism, Multimodal fusion, Natural language generation, Vision–language modeling, Visual semantics

Abstract

Image captioning remains a central challenge in multimodal artificial intelligence, requiring systems to jointly reason over visual content and natural language. Despite remarkable progress from deep vision–language transformers, single-model architectures often face a trade-off: they excel in either syntactic fluency or semantic grounding but rarely achieve both. This work introduces a heterogeneous ensemble learning framework that unifies convolutional, hierarchical, and self-attention–based encoders (ConvNeXt, ResNet-101, ViT) with advanced language decoders (T5 and BLIP). Unlike prior captioning ensembles, the proposed approach integrates attention-guided feature fusion with a consensus re-ranking mechanism, enabling the system to adaptively combine the complementary strengths of diverse models. The framework is evaluated on two challenging benchmarks, MS COCO 2017 and Flickr30K, achieving state-of-the-art results that improve over strong baselines, with BLEU-4 = 37.2, CIDEr = 124.5, and SPICE = 22.3 on COCO, and BLEU-4 = 30.8, CIDEr = 98.7, and SPICE = 19.6 on Flickr30K. Beyond quantitative gains, qualitative analysis shows that the ensemble produces captions that are both contextually faithful and semantically rich. These results establish ensemble learning as a scalable paradigm for vision–language generation, with implications for multilingual captioning, real-time accessibility tools, and future general-purpose multimodal reasoning systems.
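
The abstract does not say how the consensus re-ranking step is computed. As a rough illustration only, the sketch below assumes one common form of consensus: each candidate caption (for example, one per encoder–decoder pair in the ensemble) is scored by its mean n-gram overlap with the other candidates, and the caption that agrees most with the rest is ranked first. The function names `ngrams`, `overlap`, and `consensus_rerank`, and the choice of unigram and bigram F1 as the similarity measure, are hypothetical and are not taken from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(a, b, max_n=2):
    """Mean n-gram F1 between two captions for n = 1..max_n.

    Assumption: a simple lexical similarity stands in for whatever
    measure the paper's re-ranking mechanism actually uses.
    """
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        grams_a, grams_b = ngrams(tokens_a, n), ngrams(tokens_b, n)
        shared = sum((grams_a & grams_b).values())  # clipped n-gram matches
        if shared == 0:
            scores.append(0.0)
            continue
        precision = shared / sum(grams_a.values())
        recall = shared / sum(grams_b.values())
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)


def consensus_rerank(candidates):
    """Order captions by mean similarity to the other candidates, best first."""
    if len(candidates) < 2:
        return list(candidates)
    scored = []
    for i, caption in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(overlap(caption, o) for o in others) / len(others)
        scored.append((score, caption))
    # sorted() is stable, so tied captions keep their input order
    return [caption for _, caption in sorted(scored, key=lambda p: p[0], reverse=True)]
```

Given three candidates where two describe a dog on a beach and one describes a cat on a sofa, the two mutually consistent captions rank above the outlier. A real system would likely combine such a consensus score with model likelihoods or learned features; this sketch covers only the voting idea.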

Author Biography

  • Manchala Sadanandam, Department of Computer Science & Engineering, Kakatiya University, Warangal, Telangana, India

    Machine Learning, Deep Learning, Computer Vision, Natural Language Processing

Published

09.04.2026

Section

Computer Sciences, Communication and Information Technology

How to Cite

Heterogeneous Ensemble Learning for Context-Aware Image Captioning with Transformers. (2026). Journal of Scientific & Industrial Research (JSIR), 84(12), 1322-1330. https://doi.org/10.56042/jsir.v84i12.23517
