ASL Recognition

Localization and Classification of ASL Hand Signs from Images

Problem Statement

Design a model that learns the features of human hand gestures and uses them to detect and classify American Sign Language signs, so that they can be interpreted without the help of a human translator.

Dataset

The dataset was taken from Kaggle (https://www.kaggle.com/bsashank/asl-recognition). It consists of hand gestures depicting the letters of the American Sign Language alphabet, divided into 26 folders, one per class. The 26 classes correspond to the 26 letters of the English alphabet.

The dataset has 1584 images, which were already annotated. The data was split into train and test files.
An example image and its properties are attached.

Data Augmentation

To increase the number of images per class, RoboFlow was used for data augmentation. After augmentation, most classes had more than 100 images and the dataset grew from 1573 images to 3828 images. The augmentations applied were rotation by up to 45 degrees and flipping about the vertical axis.
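When an image is rotated or flipped, its bounding-box annotation has to be transformed in the same way. The sketch below illustrates this under several assumptions: boxes are in YOLO-style normalized form (x_center, y_center, width, height), the image is square, and "flipping about the vertical axis" means mirroring left to right. The function names are illustrative and are not part of RoboFlow.

```python
import math

def flip_box_horizontally(box):
    """Mirror a normalized (xc, yc, w, h) box left to right."""
    xc, yc, w, h = box
    return (1.0 - xc, yc, w, h)

def rotate_box(box, degrees):
    """Rotate a normalized box about the image centre (0.5, 0.5) and
    return the axis-aligned box enclosing the rotated corners.
    Assumes a square image, so normalized coordinates rotate uniformly."""
    xc, yc, w, h = box
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = [
        (xc - w / 2, yc - h / 2),
        (xc + w / 2, yc - h / 2),
        (xc + w / 2, yc + h / 2),
        (xc - w / 2, yc + h / 2),
    ]
    xs, ys = [], []
    for x, y in corners:
        dx, dy = x - 0.5, y - 0.5
        # Standard 2D rotation, then clip to the image bounds.
        xs.append(min(max(0.5 + dx * cos_t - dy * sin_t, 0.0), 1.0))
        ys.append(min(max(0.5 + dx * sin_t + dy * cos_t, 0.0), 1.0))
    return (
        (min(xs) + max(xs)) / 2,
        (min(ys) + max(ys)) / 2,
        max(xs) - min(xs),
        max(ys) - min(ys),
    )

flipped = flip_box_horizontally((0.25, 0.5, 0.2, 0.4))
rotated = rotate_box((0.5, 0.5, 0.2, 0.4), 45)
```

Note that the enclosing box after a rotation is usually larger than the original box, so strongly rotated samples carry looser annotations.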

SSD

SSD-Resnet Architecture Comparison with Plain (or VGG) Architecture

It is important to understand the architecture before working with the model. The intuition behind ResNet (Residual Network) is to avoid the vanishing or exploding gradient problem that occurs in plain architectures such as VGG, especially when they are deep. As the figure above shows, a deeper plain network can end up with a higher loss: the 56-layer network has a higher loss than the 20-layer network on both training and test data, which is clearly undesirable. One might attribute this to overfitting, but an overfitted deeper model would show a lower training loss than the shallower one, which is not the case.

To address this optimization problem in VGG-style plain models, ResNet introduces skip connections, as shown in the figure below. The output of an earlier layer is added to the output of a deeper layer, so during backpropagation the gradient has an additional, direct path back to the earlier layers. Even when the gradient through the stacked layers becomes very small, this identity path keeps the signal flowing, which largely mitigates the vanishing or exploding gradient problem.
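The effect of the skip path on the gradient can be illustrated with a deliberately simplified toy model. The sketch below assumes scalar "layers" of the form y = w * x (plain) and y = x + w * x (residual); it is not the actual ResNet block, and the weight value is an arbitrary choice for illustration.

```python
def plain_gradient(weight, depth):
    """d(output)/d(input) for `depth` stacked layers y = weight * x.
    By the chain rule this is weight ** depth."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight
    return grad

def residual_gradient(weight, depth):
    """Same stack, but each layer is y = x + weight * x.
    The identity path adds 1 to every per-layer derivative."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + weight
    return grad

# Compare the two stacks at the depths discussed above.
for depth in (20, 56):
    print(depth, plain_gradient(0.01, depth), residual_gradient(0.01, depth))
```

With a small weight, the plain gradient shrinks towards zero as depth grows, while the residual gradient stays on the order of 1 because each factor is 1 + w rather than w.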

SSD-Resnet 50 Base model without GPU

Model name: ssd_resnet50_v1_fpn_640x640_coco17_tpu-8 (trained through transfer learning)
Number of steps: 200
Batch size: 10
Per-step time: 289.906 s

The graph above shows the loss curve for 200 steps without a GPU. The loss has not yet converged, but with so little compute we could not continue training without a GPU: each step took around 289.906 seconds, so the 200 steps took about 16 hours in total.
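The total training time follows directly from the per-step time and the number of steps reported above, as this short check shows:

```python
SECONDS_PER_STEP = 289.906  # per-step time reported for CPU training
NUM_STEPS = 200

total_seconds = SECONDS_PER_STEP * NUM_STEPS
total_hours = total_seconds / 3600

print(f"{total_hours:.1f} hours")  # prints "16.1 hours"
```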

SSD-Resnet 50 Base model with GPU

Once a GPU became available, we could train for many more steps on the larger, augmented image dataset.
The graphs below show the training loss curves for 10 000 steps on the augmented dataset, with 4 278 images for training and 159 for testing.

In practice, this did not give good results. At inference time the bounding-box regression was mostly correct, with some exceptions, but the classification was poor.
For example, in the images below the bounding box is predicted well, but the same, incorrect class is predicted each time.

The image below shows that the model sometimes also predicts the bounding box incorrectly.

YOLOv5

During this project, the YOLOV5 model structure was not modified. As can be seen in Figure: YOLOV5 Model Structure, the model has 270 layers and 7,089,751 parameters and gradients.

To run YOLOV5 on a local computer, the folder structure shown in Figure: YOLOV5 Folder Setup was created. At the top level there are two folders, Images and Labels. Each of them contains three subfolders: Test, Validation, and Train. The Images subfolders hold the JPG files of ASL signs, and the Labels subfolders hold the TXT files with the annotation class and box size.
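A minimal sketch of how such a layout could be created is shown below. The lowercase folder names, the helper names and the sample file name are assumptions made for illustration; the label line follows the usual YOLO text format of one object per line, "class x_center y_center width height", with coordinates normalized to the range 0 to 1.

```python
import tempfile
from pathlib import Path

SPLITS = ("train", "validation", "test")

def create_yolo_layout(root):
    """Create images/<split> and labels/<split> folders under `root`."""
    root = Path(root)
    for kind in ("images", "labels"):
        for split in SPLITS:
            (root / kind / split).mkdir(parents=True, exist_ok=True)
    return root

def write_label(root, split, stem, class_id, box):
    """Write one YOLO-format label file next to the matching image stem."""
    xc, yc, w, h = box
    label_path = Path(root) / "labels" / split / f"{stem}.txt"
    label_path.write_text(f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}\n")
    return label_path

# Demo in a temporary directory: class 0 stands for the letter "A" here.
demo_root = create_yolo_layout(tempfile.mkdtemp())
demo_label = write_label(demo_root, "train", "A_001", 0, (0.5, 0.5, 0.3, 0.4))
print(demo_label.read_text().strip())
```

Each image file is paired with a label file of the same stem, which is how the image and its annotation are matched during training.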

Results

YOLOV5 Training Without Data Augmentation

The YOLOV5 model was trained for 150 epochs with a batch size of 2. Training was initially meant to run on Google Colab, but Colab kept crashing close to the midpoint of training, so training was moved to a local computer, where it took 2 full days. The results can be seen in Figure: Epochs Vs Training Metrics. Precision, Recall and Mean Average Precision at 0.5 and at 0.5 to 0.95 stabilize at close to 100 epochs; beyond that point the model appears to be simply overfitting.

As can be seen in Figure: YOLOV5 Performance at 150 Epochs, the performance of the model on the validation data was excellent. Precision, Recall and Mean Average Precision at 0.5 and at 0.5 to 0.95 came out above 80% for most classes.
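For reference, the metrics above rest on the intersection over union (IoU) between a predicted box and a ground-truth box: "at 0.5" means a detection counts as correct when the class matches and the IoU is at least 0.5. The sketch below is a simplified illustration with boxes given as (x_min, y_min, x_max, y_max); it is not the evaluation code used by YOLOV5, and the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, pred_cls, gt_box, gt_cls, threshold=0.5):
    """A detection is correct if the class matches and IoU >= threshold."""
    return pred_cls == gt_cls and iou(pred_box, gt_box) >= threshold

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A box that is well placed but carries the wrong letter is therefore a false positive, which is why good localization alone does not produce high precision.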

Video YOLOV5 Testing 1: (https://youtu.be/IYlTAkk7jn4)

Video YOLOV5 Testing 1 shows that the accuracy of our model is extremely poor: out of the 26 letters tested, only 7 were correctly annotated. However, the model was able to locate the hand and draw an annotation around it that was neither too large nor too small.

YOLOV5 Training With Data Augmentation

Training YOLOV5 with data augmentation took twice as long as training on the dataset without data augmentation. This was expected, since the augmented dataset was roughly twice the size.

When comparing the two models after training, Precision, Recall and Mean Average Precision at 0.5 and at 0.5 to 0.95 were considerably lower for the model trained with data augmentation than for the model trained without it. One possible cause is the number of epochs the augmented model was trained for, which was limited by the time restrictions of this project. However, as seen in Figure: YOLOV5 Training Metrics With Data Augmentation, the training metrics were stabilizing, so it is reasonable to assume that the model had trained for enough epochs.

Video: YOLOV5 Testing With Data Augmentation (https://youtu.be/o_SXytxkJqw)

Testing showed that the YOLOV5 model trained with data augmentation performed much worse than the model trained without it. As seen in Video: YOLOV5 Testing With Data Augmentation, only 4 of the 26 letters tested were annotated correctly.
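Expressed as accuracy over the 26 letters, the two test videos compare as follows. The counts are the ones reported in this section; the dictionary keys are just labels chosen for this sketch.

```python
LETTERS_TESTED = 26

# Correctly annotated letters reported for each test video.
correct = {"without augmentation": 7, "with augmentation": 4}

accuracy = {name: n / LETTERS_TESTED for name, n in correct.items()}

for name, value in accuracy.items():
    print(f"{name}: {value:.1%}")
# prints "without augmentation: 26.9%" and "with augmentation: 15.4%"
```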

Faster RCNN

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Faster R-CNN consists of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions.

Our dataset consisted of 3,828 images with annotations, at an image size of 416 × 416. We used 70% of the images as training data (2.7k), 20% as validation data (756) and 10% as test data (385).
We used Roboflow for data preparation and Detectron2 (a computer vision model zoo written in PyTorch by the Facebook AI Research (FAIR) group) with the COCO JSON format for the Faster RCNN model.
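To make the COCO JSON format concrete, the sketch below builds a minimal COCO-style dictionary from YOLO-style normalized boxes. In COCO, a bounding box is stored as [x_min, y_min, width, height] in pixels. The function names, the record layout and the assumption of exactly one annotation per image are illustrative; in this project the export was produced by Roboflow, not by code like this.

```python
import json

def yolo_to_coco_bbox(box, img_w, img_h):
    """Convert normalized (xc, yc, w, h) to COCO [x_min, y_min, w, h] pixels."""
    xc, yc, w, h = box
    return [(xc - w / 2) * img_w, (yc - h / 2) * img_h, w * img_w, h * img_h]

def make_coco_dataset(records, class_names, img_size=416):
    """records: iterable of (file_name, class_id, normalized_box).
    Assumes one annotation per image and square images."""
    images, annotations = [], []
    for image_id, (file_name, class_id, box) in enumerate(records, start=1):
        images.append({"id": image_id, "file_name": file_name,
                       "width": img_size, "height": img_size})
        bbox = yolo_to_coco_bbox(box, img_size, img_size)
        annotations.append({"id": image_id, "image_id": image_id,
                            "category_id": class_id, "bbox": bbox,
                            "area": bbox[2] * bbox[3], "iscrowd": 0})
    categories = [{"id": i, "name": name} for i, name in enumerate(class_names)]
    return {"images": images, "annotations": annotations,
            "categories": categories}

coco = make_coco_dataset([("A_001.jpg", 0, (0.5, 0.5, 0.5, 0.25))], ["A", "B"])
coco_text = json.dumps(coco, indent=2)
```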

Demo

An API was chosen for the project demo. A Django backend application was developed with a prediction endpoint that accepts an image as input and returns the predicted sign letter, the bounding box and the prediction score.
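The shape of such a response can be sketched independently of the web framework. The helper below is hypothetical: the field names ("letter", "bounding_box", "score") and the rule of returning only the highest-scoring detection are assumptions, not the project's actual endpoint. In a Django view, the resulting dictionary could be returned with django.http.JsonResponse.

```python
import json

def build_prediction_response(detections):
    """Pick the highest-scoring detection and format it for the API.
    `detections` is a list of dicts with "letter", "box" and "score" keys."""
    if not detections:
        return {"letter": None, "bounding_box": None, "score": None}
    best = max(detections, key=lambda d: d["score"])
    return {
        "letter": best["letter"],
        "bounding_box": [float(v) for v in best["box"]],
        "score": round(float(best["score"]), 4),
    }

# Made-up detections, used only to show the response shape.
example = build_prediction_response([
    {"letter": "A", "box": (10, 20, 110, 140), "score": 0.41},
    {"letter": "B", "box": (12, 18, 108, 150), "score": 0.87},
])
print(json.dumps(example))
```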

Conclusion

Our team was able to build an ASL alphabet recognition system using pre-trained SSD, Faster RCNN and YOLOV5 models. The SSD model was taken from the TF model zoo, https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md, and the Faster RCNN model was trained with Detectron2.

Data augmentation increased the number of images and gave us better results, although not for YOLOV5, where the model trained with augmentation performed worse in testing.