Nepali Image Captioning Using Machine Learning

Computer vision, powered by deep learning and image processing, is at the forefront of bringing the benefits of AI to the real world. For example, Facebook uses image captioning technology to generate automatic alt text that gives blind and visually impaired people a text description of a photo.

Machine-generated image captions have several potential applications, including describing the contents of an image to visually impaired users, image-based search for search engines, and real-time anomaly detection.

Image captioning combines two fields of machine learning: computer vision (image recognition and object detection) and natural language processing. An image recognition model extracts coherent features from the image, and a language model then uses those features to describe the image in natural language.

Why Nepali image captioning?

While image captioning had been implemented for languages such as English and Mandarin, there had been little research on captioning images in Nepali. Among the languages for which image captioning had been attempted, Hindi and Bhojpuri were the closest to Nepali in grammatical structure. Our AI engineers addressed this gap with a machine learning model that combines computer vision and natural language processing to describe an image in the Nepali language.

Building machine-generated image captioning

When humans look at an image, their eyes and the brain's network of neurons work together to analyze, identify, and finally register the image content. Similarly, an artificial neural network, a type of machine learning model, is loosely modeled on the human brain and learns by incorporating new data.

For this research, the data set was built on top of the existing Microsoft COCO data set. The original data set consists of 100,000+ image-caption pairs: 82,783 in the training set, 40,504 in the validation set, and 40,775 in the test set.

For each English caption, we used the Google Translate service to produce a Nepali caption. The grammatical quality of the model's captions therefore depends on the correctness of the machine-translated captions. Replacing them with captions written manually in Nepali for all training samples should improve the performance of the models presented here by a large margin. In short, we used the following methodology to build a Nepali image captioning model:

  • Automatic image recognition using deep learning
  • A convolutional neural network to extract high-level features
  • Feed the features to a recurrent neural network that generates a caption
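
The last step above, generating a caption from extracted features, can be sketched as a greedy decoding loop. This is a minimal illustration, not the models from this research: `greedy_caption`, `toy_step`, the `<start>`/`<end>` markers, and the feature values are all hypothetical, and `toy_step` is a fixed lookup table standing in for a trained recurrent decoder (it ignores the image features entirely).

```python
from typing import Callable, Dict, List, Sequence

START, END = "<start>", "<end>"  # hypothetical sentence boundary markers


def greedy_caption(
    features: Sequence[float],
    step: Callable[[Sequence[float], List[str]], Dict[str, float]],
    max_len: int = 20,
) -> List[str]:
    """Build a caption one token at a time, always taking the top-scoring token."""
    tokens = [START]
    for _ in range(max_len):
        scores = step(features, tokens)            # decoder scores for the next token
        next_token = max(scores, key=scores.get)   # greedy choice
        if next_token == END:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


def toy_step(features: Sequence[float], tokens: List[str]) -> Dict[str, float]:
    """Stand-in for a trained decoder: a fixed lookup keyed on the last token."""
    table = {
        START: {"कुकुर": 0.9, "बिरालो": 0.1},
        "कुकुर": {"दौडिरहेको": 0.8, END: 0.2},
        "दौडिरहेको": {"छ": 0.95, END: 0.05},
        "छ": {END: 1.0},
    }
    return table[tokens[-1]]


# Made-up feature vector; a real system would take this from the CNN encoder.
caption = greedy_caption([0.1, 0.7, 0.2], toy_step)
# caption == ["कुकुर", "दौडिरहेको", "छ"]  ("the dog is running")
```

In a real system the `step` function would be the recurrent network conditioned on the image features, and beam search is a common alternative to the greedy choice shown here.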

We experimented with two encoder-decoder machine learning architectures, one with visual attention and another without visual attention.

  • Model A uses a plain encoder-decoder architecture in which the encoder is a pre-trained ResNet model and the decoder is an LSTM network.
  • Model B uses the Show, Attend and Tell architecture with visual attention.
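
The difference between the two approaches can be sketched in a few lines. Without attention, the decoder sees one fixed summary of the image; with soft visual attention, it recomputes a weighted summary of image regions at every decoding step. The sketch below is a simplification under stated assumptions: mean pooling stands in for the single global feature vector, and a plain dot product scores each region, whereas Show, Attend and Tell learns that scoring with a small network. The function names and numbers are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(regions: Sequence[Sequence[float]]) -> List[float]:
    """No attention: average all region features into one fixed vector."""
    n = len(regions)
    return [sum(column) / n for column in zip(*regions)]


def attend(
    regions: Sequence[Sequence[float]], hidden: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Soft attention: weight regions by similarity to the decoder state."""
    scores = [sum(r * h for r, h in zip(region, hidden)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, regions))
        for d in range(dim)
    ]
    return context, weights


# Three image regions with 2-dimensional features (made-up numbers).
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hidden = [2.0, 0.0]  # hypothetical decoder state at one time step

global_vector = mean_pool(regions)          # same for every decoding step
context, weights = attend(regions, hidden)  # changes as the decoder state changes
```

Here the first region receives the largest weight because it is most similar to the decoder state, so the context vector leans toward that region instead of treating all regions equally.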

The research showed that Model A, without visual attention, performed better than Model B, with visual attention, hinting that visual attention is not always the optimal method for image captioning. The first figure below shows captions produced by Model A, and the second shows captions produced by Model B.


Fig: Captions Produced by Model A



Fig: Captions Produced by Model B


Optimizers  Loss      Perplexity
Adam        0.09300   1.09412429
RAdam       0.0218    0.88853
ASGD        0.049287


Table 1: Result of Model A


Optimizers  Loss      Perplexity
Adam        0.06777   1.0069894
RAdam       0.07793   1.0078235

Table 2: Result of Model B
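
For readers unfamiliar with the two metrics, the sketch below shows one common convention: the loss is the mean negative log-probability the model assigns to the reference tokens, and perplexity is the exponential of that loss. This post does not state how its own figures were computed, so treat this as the textbook definition, not necessarily the one used for the tables above. The probabilities are made up for illustration.

```python
import math
from typing import Sequence


def cross_entropy(token_probs: Sequence[float]) -> float:
    """Mean negative log-probability assigned to the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs: Sequence[float]) -> float:
    """Exponential of the mean cross-entropy (natural log convention)."""
    return math.exp(cross_entropy(token_probs))


# Probabilities a hypothetical model gave to each correct token of one caption.
probs = [0.9, 0.8, 0.95]
loss = cross_entropy(probs)
ppl = perplexity(probs)
```

Under this convention a perfect model has a loss of 0 and a perplexity of 1, and a model that guesses uniformly among four tokens has a perplexity of about 4.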

Further Research 

Language modeling is a task where performance improves with more data and a sound training regime, so a larger data set could improve these results. With an extensive training data set, manually written Nepali training captions, and proper fine-tuning, researchers should be able to produce more compositionally valid captions for the test inputs and compare them against the baseline presented in this paper.

Further, the models presented here were trained using a single type of loss function. Studying how model performance varies with the choice of loss function would be an interesting extension to this work.

There are also newer techniques that researchers have been investigating, such as using reinforcement learning to build end-to-end deep learning systems, or using a Transformer or BERT model as the language model, which may yield better results.

Next Steps

This study opens the door to similar but more challenging problems, including video description generation in the Nepali language, and provides a baseline model against which new research can be compared and improved. To learn more, see the research paper “Nepali Image Captioning”, which describes the different models in greater detail.


Sushil Ghimire

Sushil Ghimire is a Machine Learning Engineer at Leapfrog Technology. He is proficient in natural language processing and equally interested in deep learning algorithms and data structures.
