Vietnamese Optical Character Recognition Based On Transformer
Abstract
Text recognition is key to solving many of the problems associated with document digitization
in the Industry 4.0 era. Most techniques first extract character features with a Convolutional
Neural Network (CNN) and then pass those features to a Recurrent Neural Network (RNN) to
generate character-level predictions. This thesis focuses on applying a different model to the
Vietnamese Optical Character Recognition (OCR) problem and comparing it with the models
currently in use.
According to a recent paper, the Transformer model has surpassed well-known CNN models on the
image classification task. It achieves this by treating an image as a sequence, much as a
sentence is a sequence of words, and the resulting model is considered state-of-the-art. In
addition, with the advancement of Natural Language Processing (NLP) for human languages in
general and Vietnamese in particular, a research team from VinAI Research has built an NLP
model for Vietnamese called PhoBERT. PhoBERT is derived from RoBERTa, a well-known and widely
used model. It is superior to RNN models in several respects, including efficiency and
training time.
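The idea of treating an image as a sequence can be illustrated with a minimal sketch. It
assumes non-overlapping square patches that are flattened into vectors, as is common in Vision
Transformer style models; the function name, the patch size, and the toy image are hypothetical
and are not taken from this thesis.

```python
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def image_to_patch_sequence(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch tiles and
    flatten each tile into a vector, scanning left to right, top to bottom."""
    height = len(image)
    width = len(image[0]) if height else 0
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one tile row by row into a single "token" vector.
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            sequence.append(tile)
    return sequence


# Hypothetical 4 x 8 "text line" image whose pixel value encodes its position.
toy = [[float(r * 8 + c) for c in range(8)] for r in range(4)]
tokens = image_to_patch_sequence(toy, patch=4)
print(len(tokens), len(tokens[0]))  # number of tokens, values per token
```

In a full model, each patch vector would typically be projected to an embedding and combined
with a positional encoding before entering the Transformer encoder; those steps are omitted
here.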
This study combines the two models described above to solve the Vietnamese OCR task. The
results are generally consistent, with an accuracy of up to 96.2 percent, which demonstrates
that the technique is effective. However, the technique also has many drawbacks, ranging from
data preparation to training.