An Empirical Evaluation of Convolutional and Recurrent Neural Networks for Lip Reading

Document Type

Master Thesis

License

CC-BY-NC-ND

Abstract

The 3D convolutional neural network (3DCNN) and the long short-term memory (LSTM) network are both suited to video classification because they can take temporal information into account. However, the two models do so in very different ways. The aim of this work is to investigate which of the two models is better suited to automatic lip reading, and which is better suited to transfer learning. We conducted two groups of experiments. In the first group, both models were trained from scratch and tested under several conditions. In the second group, we took a pretrained 3DCNN and a pretrained LSTM from the first group and verified whether the accuracy of a model trained on a different dataset improved compared to when it was trained from scratch. From the first group of experiments, we conclude that the 3DCNN is better suited to automatic lip reading because it achieves a higher test set accuracy than the LSTM, although it takes considerably longer to train. From the second group of experiments, we conclude that the 3DCNN is, overall, better suited to transfer learning. On the basis of all the experiments conducted, we conclude that the 3DCNN seems to be better suited to automatic lip reading across many different conditions.

Keywords

automatic lip reading, neural network, convolutional neural network, recurrent neural network
