1- RIV Lab., Dept. of Computer Engineering, Faculty of Engineering, Bu-Ali Sina University, Hamedan, Iran
2- RIV Lab., Dept. of Computer Engineering, Faculty of Engineering, Bu-Ali Sina University, Hamedan, Iran, khotanlou@basu.ac.ir
3- CLASP, Department of Philosophy, Linguistics and Theory of Science (FLoV), University of Gothenburg, Gothenburg, Sweden
Abstract:
Visual Question Answering (VQA) is a complex task that requires models to jointly analyze visual and textual inputs in order to generate accurate answers. Reasoning and inference are critical for answering questions that involve relationships, spatial arrangements, and contextual details within an image. In this study, we propose a generative model based on the BLIP framework that enhances contextual understanding by incorporating dense captions (detailed textual descriptions generated for specific regions of an image) together with spatial information extracted from the image. The model emphasizes visual information and extracts additional context to improve answer accuracy. Experimental results on the GQA dataset demonstrate that the proposed approach achieves competitive performance compared to state-of-the-art methods.
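To make the general idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how dense captions and region coordinates could be supplied as additional textual context to an off-the-shelf BLIP VQA model from Hugging Face Transformers. The checkpoint name, caption strings, and bounding boxes are assumptions chosen for demonstration; the paper's actual mechanism for fusing dense captions and spatial information into BLIP may differ.

```python
# Illustrative sketch: prepend dense captions + region coordinates to the question
# and query a pretrained BLIP VQA model. Assumptions: the "Salesforce/blip-vqa-base"
# checkpoint, the example image path, and the dense captions/boxes below.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("example.jpg").convert("RGB")
question = "What is to the left of the red chair?"

# Hypothetical dense captions with (x1, y1, x2, y2) region coordinates; in practice
# these would come from a dense-captioning model run on the same image.
dense_captions = [
    ("a red wooden chair", (120, 80, 260, 300)),
    ("a small table with a lamp", (10, 90, 110, 280)),
]

# Simple text-level fusion: serialize captions and boxes, then append the question.
context = " ".join(f"{cap} at region {box}" for cap, box in dense_captions)
text_input = f"{context} question: {question}"

inputs = processor(images=image, text=text_input, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs)
answer = processor.decode(generated_ids[0], skip_special_tokens=True)
print(answer)
```

In this sketch the spatial information is injected purely as text, which keeps the BLIP architecture unchanged; richer integrations (e.g., encoding region features separately) would require modifying the model itself.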