Heterogeneous Ensemble Learning for Context-Aware Image Captioning with Transformers

Authors

  • Mr Kothakonda Chandhar, Department of Computer Science & Engineering, Kakatiya University, Warangal, https://orcid.org/0009-0000-3795-237X
  • Manchala Sadanandam Department of Computer Science & Engineering, Kakatiya University, Warangal, Telangana, India

DOI:

https://doi.org/10.56042/jsir.v84i12.23517

Keywords:

Attention mechanism, Multimodal fusion, Natural language generation, Vision–language modeling, Visual semantics

Abstract

Image captioning remains a central challenge in multimodal artificial intelligence because it requires systems to reason jointly over visual content and natural language. Despite remarkable progress from deep vision–language transformers, single-model architectures often face a trade-off: they excel at either syntactic fluency or semantic grounding but rarely achieve both. This work introduces a heterogeneous ensemble learning framework that unifies convolutional, hierarchical, and self-attention–based encoders (ConvNeXt, ResNet-101, ViT) with advanced language decoders (T5 and BLIP). Unlike prior captioning ensembles, the proposed approach integrates attention-guided feature fusion with a consensus re-ranking mechanism, enabling the system to adaptively combine the complementary strengths of diverse models. The framework is evaluated on two challenging benchmarks, MS COCO 2017 and Flickr30K, achieving state-of-the-art results that improve on strong baselines: BLEU-4 = 37.2, CIDEr = 124.5, and SPICE = 22.3 on COCO, and BLEU-4 = 30.8, CIDEr = 98.7, and SPICE = 19.6 on Flickr30K. Beyond these quantitative gains, qualitative analysis shows that the ensemble produces captions that are both contextually faithful and semantically rich. These results establish ensemble learning as a scalable paradigm for vision–language generation, with implications for multilingual captioning, real-time accessibility tools, and future general-purpose multimodal reasoning systems.
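
The abstract names two mechanisms, attention-guided feature fusion and consensus re-ranking, without giving their formulation. The sketch below is one plausible reading of each, not the authors' implementation. It assumes that fusion is a softmax-weighted sum of per-encoder feature vectors scored against a query vector, and that consensus is the mean n-gram overlap of a candidate caption with the other candidates. The names fuse_features, overlap, and consensus_rerank, the query vector, and the choice of unigram and bigram F1 as the agreement measure are all hypothetical.

```python
import math
from collections import Counter


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, query):
    """Attention-weighted sum of per-encoder feature vectors.

    Assumption: every encoder output is already projected to the same
    dimension as `query`, a hypothetical learned vector.
    """
    dim = len(query)
    # Scaled dot-product score of each encoder's vector against the query.
    scores = [
        sum(f_d * q_d for f_d, q_d in zip(feat, query)) / math.sqrt(dim)
        for feat in features
    ]
    weights = softmax(scores)
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return fused, weights


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(a, b, max_n=2):
    """Mean F1 overlap of 1..max_n-grams between two captions (assumed measure)."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    per_order = []
    for n in range(1, max_n + 1):
        grams_a, grams_b = ngrams(tokens_a, n), ngrams(tokens_b, n)
        shared = sum((grams_a & grams_b).values())
        total = sum(grams_a.values()) + sum(grams_b.values())
        per_order.append(2 * shared / total if total else 0.0)
    return sum(per_order) / len(per_order)


def consensus_rerank(candidates):
    """Sort candidates by their mean overlap with all other candidates."""
    scored = []
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(overlap(cand, o) for o in others) / len(others) if others else 0.0
        scored.append((score, cand))
    # Highest agreement first; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In this reading, each decoder in the ensemble would contribute one or more candidate captions, and the caption that agrees most with the rest is ranked first. A learned re-ranker or a CIDEr-style consensus score could replace the overlap function without changing the overall structure.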

Author Biography

  • Manchala Sadanandam, Department of Computer Science & Engineering, Kakatiya University, Warangal, Telangana, India

    Machine Learning, Deep Learning, Computer Vision, Natural Language Processing

Published

09-04-2026

Section

Computer Sciences, Communication and Information Technology

How to Cite

Heterogeneous Ensemble Learning for Context-Aware Image Captioning with Transformers. (2026). Journal of Scientific & Industrial Research (JSIR), 84(12), 1322-1330. https://doi.org/10.56042/jsir.v84i12.23517
