dc.description.abstract | Building models that can quickly adapt to new tasks from just a few annotated
examples is an open challenge for multimodal machine learning research. Addressing it requires a clear
understanding of the target tasks and favors algorithms that connect modalities while achieving
high accuracy. This thesis proposes an architecture that bridges pre-trained language-only and
vision-only models, processes sequences of interleaved text and image data, and accepts images
directly as input. Thanks to the flexibility and widespread adoption of language models, the combined
model can be trained on datasets containing interleaved image-text pairs, which also opens a path
toward new training methods for multimodal models. The resulting model can perform tasks such as
visual question answering, where it is prompted with a question that it must answer based on an
input image.
This thesis focuses on the method used to connect the two data modalities. | en_US |