1- RIV Lab., Dept. of Computer Engineering, Faculty of Engineering, Bu-Ali Sina University, Hamedan, Iran
2- RIV Lab., Dept. of Computer Engineering, Faculty of Engineering, Bu-Ali Sina University, Hamedan, Iran, khotanlou@basu.ac.ir
3- CLASP, Department of Philosophy, Linguistics and Theory of Science (FLoV), University of Gothenburg, Gothenburg, Sweden
Abstract:
Visual Question Answering (VQA) is a complex task that requires models to jointly analyze visual and textual inputs in order to generate accurate answers. Reasoning and inference are critical for answering questions that involve relationships, spatial arrangements, and contextual details within an image. In this study, we propose a generative model based on the BLIP framework that enhances contextual understanding by incorporating dense captions (detailed textual descriptions generated for specific regions of an image) together with spatial information extracted from the image. The model emphasizes visual information and extracts additional context to improve answer accuracy. Experimental results on the GQA dataset demonstrate that the proposed approach achieves competitive performance compared to state-of-the-art methods.
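To make the general idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how dense captions and region coordinates could be supplied as additional textual context to an off-the-shelf BLIP VQA model from Hugging Face Transformers. The checkpoint name, caption strings, and bounding boxes are assumptions chosen for demonstration; the paper's actual mechanism for fusing dense captions and spatial information into BLIP may differ.

```python
# Illustrative sketch: prepend dense captions + region coordinates to the question
# and query a pretrained BLIP VQA model. Assumptions: the "Salesforce/blip-vqa-base"
# checkpoint, the example image path, and the dense captions/boxes below.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("example.jpg").convert("RGB")
question = "What is to the left of the red chair?"

# Hypothetical dense captions with (x1, y1, x2, y2) region coordinates; in practice
# these would come from a dense-captioning model run on the same image.
dense_captions = [
    ("a red wooden chair", (120, 80, 260, 300)),
    ("a small table with a lamp", (10, 90, 110, 280)),
]

# Simple text-level fusion: serialize captions and boxes, then append the question.
context = " ".join(f"{cap} at region {box}" for cap, box in dense_captions)
text_input = f"{context} question: {question}"

inputs = processor(images=image, text=text_input, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs)
answer = processor.decode(generated_ids[0], skip_special_tokens=True)
print(answer)
```

In this sketch the spatial information is injected purely as text, which keeps the BLIP architecture unchanged; richer integrations (e.g., encoding region features separately) would require modifying the model itself.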