dc.contributor.advisor: Nguyễn, Thị Thuý Loan
dc.contributor.author: Mai, Đặng Nhật Anh
dc.date.accessioned: 2025-02-14T03:53:51Z
dc.date.available: 2025-02-14T03:53:51Z
dc.date.issued: 2024
dc.identifier.uri: http://keep.hcmiu.edu.vn:8080/handle/123456789/6598
dc.description.abstract: Building models that can quickly adapt to new tasks from only a few annotated examples is an open challenge in multimodal machine learning research. Meeting it demands creativity and a clear understanding of the task, and it motivates algorithms that connect modalities effectively while maintaining high accuracy. This thesis proposes an architecture that connects pre-trained language-only and vision-only models, processes interleaved sequences of text and image data, and accepts images as input. Because language models are flexible and widely available, they can be trained on datasets of alternating image-text pairs, which also points toward new training methods for multimodal models. The resulting models can perform tasks such as visual question answering, where the model is prompted with a question that it must answer based on the input image. This thesis focuses on the method for connecting the two modalities. [en_US]
dc.subject: Deep Learning [en_US]
dc.subject: Visual Question Answering (VQA) [en_US]
dc.subject: Vietnamese Application [en_US]
dc.title: Deep Learning Approaches for Visual Question Answering (VQA): Vietnamese Application [en_US]
dc.type: Thesis [en_US]
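
To make the connection method described in the abstract concrete, below is a minimal PyTorch sketch of the general idea: a trainable bridge projects features from a (normally frozen) pre-trained vision encoder into the language model's token embedding space, so image tokens can be interleaved with question tokens for visual question answering. Everything here is illustrative and assumed, not the thesis's actual architecture: the VQAConnector class, the linear stand-ins for the pre-trained vision encoder and language model, and all dimensions are hypothetical placeholders.

import torch
import torch.nn as nn

class VQAConnector(nn.Module):
    """Connects a vision encoder to a language model by projecting image
    features into the text embedding space, so one sequence can hold
    interleaved image and text tokens. Illustrative sketch only."""

    def __init__(self, vision_dim=768, text_dim=512, vocab_size=32000):
        super().__init__()
        # Stand-ins for pre-trained components (frozen in practice).
        self.vision_encoder = nn.Linear(3 * 224 * 224, vision_dim)  # placeholder encoder
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        # The trainable bridge: maps vision features into the text token space.
        self.bridge = nn.Linear(vision_dim, text_dim)
        # Placeholder language-model body and answer head.
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(text_dim, vocab_size)

    def forward(self, image, question_ids):
        # Encode the image, collapse it to one "visual token", project to text space.
        img_feat = self.vision_encoder(image.flatten(1))      # (B, vision_dim)
        img_tok = self.bridge(img_feat).unsqueeze(1)          # (B, 1, text_dim)
        txt_tok = self.text_embed(question_ids)               # (B, T, text_dim)
        # Interleave: prepend the visual token to the question tokens.
        seq = torch.cat([img_tok, txt_tok], dim=1)
        out = self.lm(seq)
        # Score answer tokens from the final position.
        return self.head(out[:, -1])

model = VQAConnector()
image = torch.randn(2, 3, 224, 224)
question_ids = torch.randint(0, 32000, (2, 12))
logits = model(image, question_ids)   # (2, 32000): scores over answer tokens

In Flamingo- or BLIP-2-style systems built on this pattern, only the bridge is trained while both pre-trained backbones stay frozen, which is what makes adaptation to new tasks from few examples comparatively cheap.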
