Facial expression recognition from videos in the wild is a challenging task due to the scarcity of labeled training data. Large deep neural network (DNN) architectures and ensemble methods have improved performance, but they quickly reach saturation because of this data inadequacy. This thesis presents a cost-effective, video-based facial expression recognition system capable of recognizing the basic facial expressions. Our method addresses three fundamental issues: (i) it isolates different regions of the face and processes them independently using a multi-level attention mechanism, achieving good performance at low computational cost; (ii) it uses a self-training method that iteratively combines a labeled dataset with an unlabeled dataset (the Body Language Dataset, BoLD), mitigating the shortage of labeled data; and (iii) it generates a large-scale facial expression dataset, Affect-Net-Vid, using a proposed generative network, StarGAN-EgVA. Our results show that, compared to other single models, the proposed method achieves state-of-the-art performance on the benchmark datasets Affect-Net-Vid and AFEW 8.0.