Computer vision, powered by deep learning and image processing, is at the forefront of bringing the benefits of AI into the real world. For example, Facebook's automatic alt text uses image captioning technology to give blind and visually impaired people a text description of a photo.
Machine-generated image captioning has several potential applications, including describing the contents of an image to visually impaired users, image-based search for search engines, and real-time anomaly detection, among others.
Image captioning combines two fields of machine learning: computer vision, for image recognition and object detection, and natural language processing. An image recognition model first extracts coherent features from an image; a language model then uses those features to describe the image in natural language.
Why Nepali image captioning?
While image captioning had been implemented for languages like English and Mandarin, there was little research on captioning images in Nepali. Among the languages for which image captioning had been attempted, Hindi and Bhojpuri are some of the closest to Nepali in grammatical structure. Our AI engineers addressed this gap with a machine learning model that combines computer vision and natural language processing to describe an image in the Nepali language.
Building machine-generated image captioning
When humans look at an image, the eyes and the brain's neural pathways work together to analyze, identify, and finally register its content. An artificial neural network, a type of machine learning model, is loosely modeled on this structure: an algorithm that allows a computer to learn by incorporating new data.
For this research, the data set was built on top of the existing Microsoft COCO data set, which consists of 100,000+ image-caption pairs: 82,783 in the training set, 40,504 in the validation set, and 40,775 in the test set.
For each English caption, we used the Google Translate service to produce a Nepali caption. The grammatical correctness of the model's captions is therefore bounded by the quality of Google Translate's output; replacing the machine-translated captions with captions written manually in Nepali for all training targets should improve the models' performance by a wide margin. In short, we used the following methodology to build a Nepali image captioning model:
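As a rough illustration of this preprocessing step, the sketch below maps English COCO captions to Nepali ones. The `translate_to_nepali` function is a hypothetical stand-in for the Google Translate call used in the original work, stubbed here with a tiny lookup table.

```python
def translate_to_nepali(english_caption: str) -> str:
    """Hypothetical stand-in for a call to the Google Translate service."""
    lookup = {"a dog runs on grass": "एउटा कुकुर घाँसमा दौडन्छ"}
    return lookup.get(english_caption, english_caption)

def build_nepali_pairs(coco_pairs):
    """Map (image_id, english_caption) pairs to (image_id, nepali_caption)."""
    return [(img_id, translate_to_nepali(cap)) for img_id, cap in coco_pairs]

pairs = build_nepali_pairs([("img_001", "a dog runs on grass")])
```

In practice the training targets would come from the full COCO annotation files, with the translation service called in batches.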
- Automatic image recognition using deep learning
- A convolutional neural network to extract high-level features
- Feed the features to a recurrent neural network that generates a caption
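The steps above can be sketched in PyTorch as a single pipeline in which the image feature vector seeds an LSTM that predicts the caption tokens. This is a minimal sketch with illustrative layer sizes, not the paper's configuration; the small convolutional encoder stands in for the pre-trained ResNet used in the actual work.

```python
import torch
import torch.nn as nn

class CaptionPipeline(nn.Module):
    """Sketch: a CNN encoder feeds its image feature into an LSTM decoder."""
    def __init__(self, vocab_size=1000, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Stand-in CNN encoder (the actual work uses a pre-trained ResNet).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image feature acts as the first "token" of the sequence.
        feats = self.encoder(images).unsqueeze(1)          # (B, 1, E)
        seq = torch.cat([feats, self.embed(captions)], 1)  # (B, 1+T, E)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                            # (B, 1+T, vocab)

model = CaptionPipeline()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 5)))
```

At inference time the decoder would instead generate one token at a time, feeding each predicted word back in until an end-of-sentence token is produced.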
We experimented with two encoder-decoder machine learning architectures, one with visual attention and another without visual attention.
- Model A uses a plain encoder-decoder architecture, where the encoder is a pre-trained ResNet model and the decoder is an LSTM network.
- Model B uses the Show, Attend, and Tell model with a visual attention architecture.
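To make the distinction concrete, the sketch below shows the soft visual attention mechanism at the heart of a Show, Attend, and Tell-style decoder: at each decoding step, the decoder's hidden state is used to weight a spatial grid of image features. Dimensions and layer names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Sketch of soft visual attention over a spatial feature grid."""
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, L, feat_dim) grid of image regions; hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats)
                                  + self.hid_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)        # (B, L, 1) attention weights
        context = (alpha * feats).sum(dim=1)   # (B, feat_dim) weighted context
        return context, alpha.squeeze(-1)

attn = SoftAttention()
ctx, alpha = attn(torch.randn(2, 49, 512), torch.randn(2, 512))
```

The weighted context vector, rather than a single global image feature, is what the attention decoder consumes at each word-generation step.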
The research showed that Model A, without visual attention, performed better than Model B, with visual attention, hinting that visual attention is not always the optimal method for image captioning. The captions at the top were produced by Model A, while the figure below used Model B.
Fig: Captions Produced by Model A
Fig: Captions Produced by Model B
Table 1: Result of Model A
Table 2: Result of Model B
Language modeling is a task whose performance improves with more data and a sound training regime, so a larger data set should help here. With an extensive training set, manually written Nepali training captions, and proper fine-tuning, researchers should be able to produce more compositionally valid captions for the test inputs to compare against the baseline presented in this paper.
Further, the models presented here were trained with a single loss function; studying how their performance varies with the choice of loss function would be an interesting extension of this work.
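For instance, the standard cross-entropy objective over the decoder's vocabulary logits could be swapped for a label-smoothed variant, a common and low-effort alternative. The sketch below is illustrative only and does not reflect the loss actually used in the paper.

```python
import torch
import torch.nn as nn

# Illustrative decoder outputs and targets: (batch, vocab_size) logits
# against one gold token index per example.
logits = torch.randn(4, 1000)
targets = torch.randint(0, 1000, (4,))

plain = nn.CrossEntropyLoss()                       # standard objective
smoothed = nn.CrossEntropyLoss(label_smoothing=0.1) # one possible variant

loss_plain = plain(logits, targets)
loss_smooth = smoothed(logits, targets)
```

Because smoothing spreads a little probability mass across the vocabulary, it discourages overconfident predictions, which can matter when the training captions are themselves noisy machine translations.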
Also, researchers have been investigating more powerful techniques, such as using reinforcement learning to build end-to-end deep learning systems, or using a Transformer or BERT model as the language model, that could yield better results.
This study opens the door to similar but more challenging problems, including video description generation in the Nepali language, and establishes a baseline model against which new research can be compared and improved. The different models are described in greater detail in the research paper “Nepali Image Captioning.”