dc.description.abstract | Generating high-quality images from text descriptions is a challenging task in computer vision with many practical applications. Existing text-to-image (T2I) approaches can generate images that roughly match a given description, including Stacked Generative Adversarial Networks (StackGAN), Attentional Generative Adversarial Networks (AttnGAN), Conditional Generative Adversarial Networks (cGAN), MirrorGAN, and Variational Autoencoders (VAEs). Although these models have achieved significant results, there is still room for improvement in generating important details, producing realistic object features, and fully understanding the text description. To address this problem, I propose an experimental study of a new text embedding technique and investigate whether it can improve the original StackGAN model.
The methodology proposed in this thesis uses RoBERTa as the text embedding technique for the original StackGAN model. To do this, the RoBERTa model was fine-tuned for the text-to-image synthesis task on a dataset comparable to that used in the original StackGAN paper. In addition, the StackGAN model was modified to accept RoBERTa embeddings as input instead of the traditional character-level CNN-RNN embedding. The performance of the improved StackGAN model, with RoBERTa incorporated in the preprocessing stage, was then evaluated on a set of standard metrics and compared to that of the baseline StackGAN model.
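To make this modification concrete, the following is a minimal sketch, not the thesis implementation, of how a RoBERTa sentence embedding could stand in for StackGAN's char-CNN-RNN text embedding. It assumes the Hugging Face transformers library and PyTorch; the roberta-base checkpoint, the mean-pooling strategy, and the 1024-dimensional projection (chosen only to mirror the embedding size used by the original StackGAN) are illustrative assumptions.

    # Sketch: RoBERTa-based text encoder replacing StackGAN's char-CNN-RNN embedding.
    # Assumptions: Hugging Face `transformers`, PyTorch, roberta-base, 1024-dim output.
    import torch.nn as nn
    from transformers import RobertaModel, RobertaTokenizer

    class RobertaTextEncoder(nn.Module):
        def __init__(self, out_dim=1024, model_name="roberta-base"):
            super().__init__()
            self.tokenizer = RobertaTokenizer.from_pretrained(model_name)
            self.roberta = RobertaModel.from_pretrained(model_name)
            # Project RoBERTa's hidden size (768) to the embedding size StackGAN expects.
            self.proj = nn.Linear(self.roberta.config.hidden_size, out_dim)

        def forward(self, captions):
            tokens = self.tokenizer(captions, padding=True, truncation=True,
                                    return_tensors="pt")
            outputs = self.roberta(**tokens)
            # Mean-pool token embeddings into one sentence embedding per caption.
            mask = tokens["attention_mask"].unsqueeze(-1).float()
            pooled = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
            return self.proj(pooled)

    # Usage: the pooled, projected embedding takes the place of the char-CNN-RNN
    # vector that StackGAN's Stage-I conditioning augmentation would normally receive.
    encoder = RobertaTextEncoder()
    embedding = encoder(["a small bird with a red head and a white belly"])
    print(embedding.shape)  # torch.Size([1, 1024])

In such a setup, fine-tuning would update the RoBERTa weights jointly with the projection layer so that the sentence embedding adapts to the conditioning interface of the generator.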
Finally, this thesis presents a new approach to assessing the performance of the StackGAN model when RoBERTa is added. My experimental results address the question of whether incorporating RoBERTa into the text-to-image pipeline can improve the original StackGAN model. This study has practical implications for the development of more accurate and descriptive text-to-image generation models, which could have applications in disciplines such as computer vision and natural language processing. | en_US |