Facial Expression Recognition Using Deep Learning
Abstract
Facial expressions play a vital role in communication, conveying subtle cues about a
person's emotional state and enriching social interaction. Accurate recognition
and interpretation of facial expressions are therefore crucial for a wide range of applications. Facial Expression
Recognition (FER) is used to assess candidate suitability for client-facing roles,
refine video games during beta testing, enhance marketing research, improve AI-human interaction, support mental health care, and evaluate audience engagement at
events. Although humans are proficient at FER, automating the task with computational
methods is difficult because of the intricate and variable nature of facial expressions. Deep
learning has emerged as a promising approach to this challenge, significantly improving
the accuracy and efficiency of FER systems. This thesis presents a deep-learning
approach built on the EfficientViT-M5 model, an efficient variant of the Vision Transformer (ViT)
architecture. ViTs have achieved notable success in computer vision tasks by using
self-attention mechanisms to capture complex patterns and relationships within images.
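For context, the self-attention that standard ViTs rely on is scaled dot-product attention, where Q, K, and V are the query, key, and value projections of the patch embeddings and d_k is the key dimension (EfficientViT replaces this with a more computationally efficient attention variant):

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V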
EfficientViT improves upon this design by providing a more computationally efficient variant
that maintains high accuracy, making it well suited to real-time applications.
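As a minimal sketch of how such a backbone can be instantiated (the timm registry name, ImageNet pretraining, and the seven-class output head below are illustrative assumptions, not necessarily the exact configuration used in this thesis):

    import timm
    import torch

    # Instantiate an EfficientViT-M5 backbone via timm (assumed registry
    # name) with a 7-way head, e.g., for seven basic emotion classes.
    model = timm.create_model("efficientvit_m5", pretrained=True, num_classes=7)

    # Forward pass on a dummy batch of 224x224 RGB images.
    x = torch.randn(1, 3, 224, 224)
    logits = model(x)   # shape: (1, 7)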
The proposed approach entails training the EfficientViT-M5 model on three well-known facial
expression recognition datasets: FER2013+, AffectNet, and RAF-DB. To increase the variety
of the training data and strengthen the model's resilience, a thorough data augmentation
pipeline is used, incorporating random horizontal and vertical flipping, additive Gaussian
noise, Gaussian blur, and normalization.
These augmentations help the model generalize more effectively by emulating a
diverse array of real-world variation in facial expressions; a sketch of such a pipeline is given below.
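A minimal sketch of this kind of pipeline, assuming torchvision transforms (the flip probabilities, noise level, blur kernel, and normalization statistics are illustrative placeholders rather than the thesis's exact settings):

    import torch
    from torchvision import transforms

    class AddGaussianNoise:
        """Add zero-mean Gaussian noise to a tensor image (custom transform;
        the classic torchvision transforms API has no built-in equivalent)."""
        def __init__(self, std=0.05):
            self.std = std

        def __call__(self, img):
            return img + torch.randn_like(img) * self.std

    # Illustrative augmentation pipeline: flips, blur, noise, normalization.
    train_transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomVerticalFlip(p=0.5),
        transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),
        transforms.ToTensor(),          # PIL image -> float tensor in [0, 1]
        AddGaussianNoise(std=0.05),     # additive Gaussian noise
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])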
To further regularize training and avoid overfitting over the 30-epoch schedule, the model is
first trained for 15 epochs on a randomly chosen 80% subset of the training data, ensuring
that it is exposed to novel characteristics in each epoch. Afterward, the model is
trained on the whole training set to reinforce its learning; this schedule is sketched below.
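A minimal sketch of this two-stage schedule, assuming PyTorch and a subset redrawn each epoch (the batch size, optimizer, and resampling frequency are illustrative assumptions):

    import torch
    from torch.utils.data import DataLoader, Subset

    def make_subset_loader(dataset, fraction=0.8, batch_size=64):
        # Draw a fresh random 80% subset (redrawn each epoch here, so the
        # model sees a different sample of the data every pass).
        n = int(len(dataset) * fraction)
        idx = torch.randperm(len(dataset))[:n].tolist()
        return DataLoader(Subset(dataset, idx), batch_size=batch_size, shuffle=True)

    def train(model, train_set, criterion, optimizer, device, epochs=30):
        full_loader = DataLoader(train_set, batch_size=64, shuffle=True)
        for epoch in range(epochs):
            # Epochs 1-15: random 80% subset; epochs 16-30: full training set.
            loader = make_subset_loader(train_set) if epoch < 15 else full_loader
            model.train()
            for images, labels in loader:
                images, labels = images.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()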
The training regimen is designed to exploit the capabilities of the EfficientViT-M5 architecture,
enabling it to learn discriminative features and patterns indicative of different facial emotions.
The trained model achieved accuracies of 94.28%, 94.69%, and 97.76% on
the FER2013+, AffectNet, and RAF-DB datasets, respectively. These findings underscore the model's
robustness and effectiveness in recognizing facial emotions across varied datasets and highlight
its potential for practical use in emotion-aware computing, security, and health diagnostics. This work advances FER by presenting a dependable, practical approach to
identifying emotions with state-of-the-art deep learning methods. The results point toward
richer, more adaptable interaction between humans and computers,
demonstrating the effectiveness of efficient transformer models such as EfficientViT-M5 in
tackling intricate computer vision tasks.