Heterogeneous Ensemble Learning for Context-Aware Image Captioning with Transformers
DOI: https://doi.org/10.56042/jsir.v84i12.23517

Keywords: Attention mechanism, Multimodal fusion, Natural language generation, Vision–language modeling, Visual semantics

Abstract
Image captioning remains a central challenge in multimodal artificial intelligence, requiring systems to jointly reason over visual content and natural language. Despite remarkable progress from deep vision–language transformers, single-model architectures often face a trade-off: they excel in either syntactic fluency or semantic grounding but rarely achieve both. This work introduces a heterogeneous ensemble learning framework that unifies convolutional, hierarchical, and self-attention–based encoders (ConvNeXt, ResNet-101, ViT) with advanced language decoders (T5 and BLIP). Unlike prior captioning ensembles, the proposed approach integrates attention-guided feature fusion with a consensus re-ranking mechanism, enabling the system to adaptively combine the complementary strengths of diverse models. The framework is evaluated on two challenging benchmarks—MS COCO 2017 and Flickr30K—achieving state-of-the-art results that improve over strong baselines, with BLEU-4 = 37.2, CIDEr = 124.5, SPICE = 22.3 on COCO, and BLEU-4 = 30.8, CIDEr = 98.7, SPICE = 19.6 on Flickr30K. Beyond quantitative gains, qualitative analysis shows that the ensemble produces captions that are both contextually faithful and semantically rich. These results establish ensemble learning as a scalable paradigm for vision–language generation, with implications for multilingual captioning, real-time accessibility tools, and future general-purpose multimodal reasoning systems.
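To make the two mechanisms named in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released code: it assumes each encoder yields a global feature vector, uses a hypothetical learned gating network for attention-guided fusion, and substitutes unigram Jaccard overlap for the paper's (unspecified) consensus measure in re-ranking. Dimensions and the scoring formula are illustrative assumptions only.

```python
# Illustrative sketch of attention-guided feature fusion and consensus
# re-ranking, under assumptions stated above (NOT the paper's exact method).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGuidedFusion(nn.Module):
    """Fuse features from heterogeneous encoders via learned attention weights."""
    def __init__(self, feat_dims, fused_dim=512):
        super().__init__()
        # Project each encoder's features into a shared embedding space.
        self.projections = nn.ModuleList(
            [nn.Linear(d, fused_dim) for d in feat_dims]
        )
        # Hypothetical gate: one attention logit per encoder stream.
        self.gate = nn.Linear(fused_dim, 1)

    def forward(self, feats):
        # feats: list of (batch, feat_dim_i) global feature vectors.
        projected = torch.stack(
            [proj(f) for proj, f in zip(self.projections, feats)], dim=1
        )                                                 # (batch, n, fused_dim)
        weights = F.softmax(self.gate(projected), dim=1)  # (batch, n, 1)
        return (weights * projected).sum(dim=1)           # (batch, fused_dim)

def consensus_rerank(candidates, decoder_scores):
    """Re-rank candidate captions by decoder score plus cross-model agreement.

    candidates: caption strings, one per decoder/beam.
    decoder_scores: each candidate's log-probability under its own decoder.
    Agreement is unigram Jaccard overlap -- a stand-in consensus measure.
    """
    token_sets = [set(c.lower().split()) for c in candidates]
    reranked = []
    for i, (cap, score) in enumerate(zip(candidates, decoder_scores)):
        overlaps = [
            len(token_sets[i] & token_sets[j]) / max(len(token_sets[i] | token_sets[j]), 1)
            for j in range(len(candidates)) if j != i
        ]
        consensus = sum(overlaps) / max(len(overlaps), 1)
        reranked.append((score + consensus, cap))
    reranked.sort(reverse=True)
    return [cap for _, cap in reranked]

if __name__ == "__main__":
    # Dummy stand-ins for ConvNeXt / ResNet-101 / ViT global features.
    feats = [torch.randn(2, 1024), torch.randn(2, 2048), torch.randn(2, 768)]
    fused = AttentionGuidedFusion([1024, 2048, 768])(feats)
    print(fused.shape)  # torch.Size([2, 512])

    caps = ["a dog runs on the beach",
            "a dog running on a beach",
            "a cat sits indoors"]
    print(consensus_rerank(caps, [-0.9, -1.1, -0.8]))
```

In this sketch the fusion weights are computed per image, so the ensemble can lean on whichever encoder is most informative for a given input, while re-ranking rewards captions that multiple decoders independently agree on.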