Visual Question Answering (VQA) System for Real-World Image Understanding

Project Overview

This project implements a Visual Question Answering (VQA) system in PyTorch using multimodal transformers. Given an image and a natural-language question about it, the system extracts features from both inputs, fuses them, and predicts an answer. Two approaches, classification and generation, are explored on the DAQUAR dataset.

Contents

Abstract
- Overview of the project goals, methods, and key findings.
Introduction
- Background, objectives, and scope of the project.
Methodology
- Feature extraction techniques, multimodal fusion, and implementation tools.
Datasets
- Description of the DAQUAR dataset used in this project.
Assessment Methodology
- Metrics and evaluation techniques.
Literature Review
- Thematic and comparative analyses of existing approaches.
Critical Analysis
- Gaps, limitations, and implications of the study.
Conclusion
- Summary of findings and future directions.
References
- Cited sources and resources.

Key Features

Models and Techniques

Image Feature Extraction: Vision Transformer (ViT), which splits each image into fixed-size patches and encodes them as a sequence of token embeddings.
Text Feature Extraction: BERT for encoding natural language questions.
Multimodal Fusion: Late fusion, bilinear pooling, and attention mechanisms to integrate visual and textual features.
Generation Model: BERT and ViT encoders combined with a GPT-2 decoder to generate the answer as a token sequence.
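
The three fusion strategies named above can be sketched in PyTorch as follows. This is a minimal illustration, not the project's actual implementation: the class names, the hidden size, and the number of answer classes are hypothetical, and random tensors stand in for the ViT and BERT outputs. The feature dimension is kept at 64 so the example runs quickly; the base ViT and BERT checkpoints produce 768-dimensional features.

```python
import torch
import torch.nn as nn


class ConcatFusion(nn.Module):
    """Late fusion: concatenate the pooled image and text vectors, then project."""

    def __init__(self, img_dim, txt_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(img_dim + txt_dim, hidden_dim), nn.ReLU())

    def forward(self, img_vec, txt_vec):
        return self.proj(torch.cat([img_vec, txt_vec], dim=-1))


class BilinearFusion(nn.Module):
    """Bilinear pooling: each output unit is a learned bilinear form x^T W y + b."""

    def __init__(self, img_dim, txt_dim, hidden_dim):
        super().__init__()
        self.bilinear = nn.Bilinear(img_dim, txt_dim, hidden_dim)

    def forward(self, img_vec, txt_vec):
        return torch.relu(self.bilinear(img_vec, txt_vec))


class CrossAttentionFusion(nn.Module):
    """Attention fusion: question tokens attend over image patch tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        attended, _ = self.attn(query=txt_tokens, key=img_tokens, value=img_tokens)
        return attended.mean(dim=1)  # average over question tokens


torch.manual_seed(0)
batch, dim, hidden, num_answers = 4, 64, 32, 10  # hypothetical sizes

# Random stand-ins for encoder outputs: 197 image tokens, 12 question tokens.
img_tokens = torch.randn(batch, 197, dim)
txt_tokens = torch.randn(batch, 12, dim)
img_vec, txt_vec = img_tokens[:, 0], txt_tokens[:, 0]  # first ([CLS]) token of each

concat_out = ConcatFusion(dim, dim, hidden)(img_vec, txt_vec)
bilinear_out = BilinearFusion(dim, dim, hidden)(img_vec, txt_vec)
attn_out = CrossAttentionFusion(dim)(img_tokens, txt_tokens)

# A classification head maps any fused vector to answer logits.
logits = nn.Linear(hidden, num_answers)(concat_out)
```

Concatenation is the cheapest option. A full bilinear layer grows with the product of the two input dimensions, so at 768-dimensional inputs it would usually need a projection to a smaller size first.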

Regularization Techniques

Dropout layers to mitigate overfitting.
Gradient clipping to stabilize backpropagation.
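
A minimal sketch of how the two techniques fit into a PyTorch training step is shown below. The layer sizes, dropout rate, learning rate, and clipping threshold are assumptions chosen for illustration, and the small feed-forward model is a stand-in for the project's answer head.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical answer head; sizes and dropout rate are illustrative only.
model = nn.Sequential(
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Dropout(p=0.3),  # zeroes 30% of activations at random while training
    nn.Linear(32, 10),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
MAX_GRAD_NORM = 1.0  # assumed clipping threshold


def train_step(features, labels):
    model.train()  # enables dropout
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    # Rescale all gradients so their global L2 norm is at most MAX_GRAD_NORM.
    # The return value is the norm measured before clipping.
    norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    optimizer.step()
    return loss.item(), float(norm_before)


features = torch.randn(8, 64) * 50  # large inputs to provoke large gradients
labels = torch.randint(0, 10, (8,))
loss, norm_before = train_step(features, labels)
```

Calling `model.eval()` before validation switches dropout off so that predictions are deterministic.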

Tools and Frameworks

PyTorch
Hugging Face Transformers
Scikit-learn
NLTK

Dataset

DAQUAR (DAtaset for QUestion Answering on Real-world images):

Size: About 12,500 question-answer pairs (12,468 in the full release) over 1,449 indoor images.
Focus: Indoor scenes and basic object recognition.
Applications: Well suited to models that predict a single word or short phrase as the answer.
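
For the classification approach, each distinct answer becomes a class. The sketch below shows one way to parse question-answer pairs and build that answer vocabulary. It assumes the plain-text layout in which question and answer lines alternate, the image is named at the end of the question, and multiple answers are comma-separated; check this against the files you download. The sample lines are made up for illustration and are not taken from the dataset.

```python
import re
from collections import Counter

# Made-up lines in the assumed layout: question line, then answer line.
SAMPLE = """what is on the left side of the bed in the image1 ?
lamp
what are the objects on the table in the image2 ?
cup, book
"""

IMAGE_ID = re.compile(r"image\d+")


def parse_qa(text):
    """Pair up alternating question/answer lines into example dicts."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    examples = []
    for question, answer in zip(lines[0::2], lines[1::2]):
        match = IMAGE_ID.search(question)
        examples.append({
            "question": question,
            "image_id": match.group(0) if match else None,
            "answers": [a.strip() for a in answer.split(",")],
        })
    return examples


examples = parse_qa(SAMPLE)

# Answer vocabulary: one class index per distinct answer word.
answer_counts = Counter(a for ex in examples for a in ex["answers"])
answer_to_index = {a: i for i, a in enumerate(sorted(answer_counts))}
```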

Evaluation Metrics

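Accuracy: Fraction of questions whose predicted answer exactly matches the ground truth.
Macro F1 Score: Unweighted mean of the per-class F1 scores, so rare answers count as much as frequent ones.
Wu-Palmer Similarity Score (WUPS): Gives partial credit to predictions that are semantically close to the ground truth, based on Wu-Palmer similarity in the WordNet taxonomy.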
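
The three metrics can be sketched in plain Python for single-word answers. The tiny taxonomy below is a hypothetical stand-in for WordNet so the example runs without extra data; in practice Scikit-learn and NLTK provide library versions of these computations. The 0.9 threshold, under which scores are scaled by 0.1, follows the common WUPS@0.9 convention and is an assumption about the setting used here.

```python
def accuracy(preds, golds):
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def macro_f1(preds, golds):
    """Unweighted mean of per-class F1 over every class that occurs."""
    scores = []
    for c in sorted(set(golds) | set(preds)):
        tp = sum(p == c and g == c for p, g in zip(preds, golds))
        fp = sum(p == c and g != c for p, g in zip(preds, golds))
        fn = sum(p != c and g == c for p, g in zip(preds, golds))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy taxonomy (child -> parent); stands in for WordNet.
PARENT = {
    "entity": None,
    "furniture": "entity", "chair": "furniture", "table": "furniture",
    "appliance": "entity", "lamp": "appliance",
}


def path_to_root(word):
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path  # word first, root last; its length is the depth (root = 1)


def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    lcs = next(node for node in path_a if node in ancestors_b)
    return 2 * len(path_to_root(lcs)) / (len(path_a) + len(path_b))


def wups(preds, golds, threshold=0.9):
    """Mean thresholded Wu-Palmer score for single-word answers."""
    total = 0.0
    for p, g in zip(preds, golds):
        score = wup(p, g)
        total += score if score >= threshold else 0.1 * score
    return total / len(golds)


preds = ["chair", "table", "lamp", "chair"]
golds = ["chair", "chair", "lamp", "table"]
print(accuracy(preds, golds), macro_f1(preds, golds), wups(preds, golds))
```

Here "chair" and "table" share the parent "furniture", so a chair/table confusion earns a small amount of WUPS credit, while accuracy and F1 treat it as a plain error.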

Ablation Studies

Input Dimensions: Effect of image patch and token embedding sizes.
Pre-processing: Analysis of normalization, resizing, and tokenization methods.
Fusion Mechanisms: Comparing concatenation and bilinear pooling.
Attention Mechanisms: Evaluating different attention models.
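
To see why the input-dimension ablation matters, note that the patch size sets how many tokens ViT has to process. The helper below is a hypothetical illustration; the 224-pixel image size and the patch sizes are example values, not necessarily those used in the project's experiments.

```python
def vit_sequence_length(image_size, patch_size, add_cls_token=True):
    """Number of tokens ViT sees for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if add_cls_token else 0)


for patch_size in (8, 16, 32):
    print(patch_size, vit_sequence_length(224, patch_size))
```

Halving the patch size roughly quadruples the sequence length, and self-attention cost grows quadratically with that length, which ties this ablation to the computational constraints noted in the findings.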

Findings

Classification Model: BERT + ViT achieved a WUPS score of 0.26.
Generation Model: BERT + ViT + GPT-2 achieved a slightly higher WUPS score of 0.27.
Challenges: Limited dataset size and high computational requirements.
Future Directions:
- Transfer learning for diverse datasets.
- Integration of external knowledge graphs.
- Optimization for computational efficiency.

Setup and Installation

Prerequisites

Python 3.8+
PyTorch
Hugging Face Transformers
Scikit-learn

Installation

Clone the repository:

git clone https://github.com/RobuRishabh/Multimodal-Visual-Question-Answering-VQA-with-Generative-AI-utilizing-LLM-and-Vision-Language-Model.git

Install dependencies:
```
pip install -r requirements.txt
```
Download the DAQUAR dataset and place it in the data/ folder.

Running the Project

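Both models are provided as Jupyter notebooks, so they must be run with Jupyter rather than the Python interpreter.

Training the Classification Model:
```
jupyter nbconvert --to notebook --execute VQA_Classification.ipynb
```
Training the Generation Model:
```
jupyter nbconvert --to notebook --execute VQA_Generation.ipynb
```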

Author

Rishabh Singh

Course: CS 6120 (Natural Language Processing)
Instructor: Prof. Uzair Ahmad
