Real-time Image Captioning with Attention Mechanism

Imagine you’re taking a picture of something in front of you. You pause to look at the picture and process the visual information in it to understand what it shows. What if we wanted to give a picture a context or a tag automatically? We can train an algorithm that recognizes the objects or features in a picture and then assigns the picture a sentence describing what is happening in it. This is known as Image Captioning.

For my MSc thesis at the University of Mauritius, we tried to do something along the lines of AI4Good. We developed an Android application, EyeSee, that helps people with poor eyesight navigate by telling them, via text-to-speech, what is in front of them. While there are many applications that take pictures and caption them, there weren’t many that did it in real time.

The EyeSee model

The architecture of the model implemented can be summarized as:

\begin{split} F&=\mathrm{encoder}(I)\\ c_{t=0}&=F\\ O_{t}&=\mathrm{decoder}(c_{t:0\rightarrow t}) \end{split}

Here, F represents the convolved features of an image I, c_{t=0} represents the initial context, and O_t represents the caption generated at time t. The encoder consists of a CNN followed by a fully-connected layer, while the decoder is made up of the attention mechanism and GRU cells.
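To make the decoder's attention step more concrete, here is a minimal sketch in plain Python. It is not the EyeSee implementation: the post does not say which scoring function the model uses, so the dot-product `dot_score` below is only a stand-in (captioning models often use a learned additive score instead), and the names `attend`, `softmax` and `dot_score` are hypothetical. The idea it illustrates is that each image location gets a weight, and the context passed to the GRU is the weighted sum of the encoder features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_score(feature, hidden):
    # Assumption: plain dot-product scoring, chosen only to keep the sketch short.
    return sum(a * b for a, b in zip(feature, hidden))


def attend(features, hidden, score_fn=dot_score):
    """One soft-attention step.

    features: list of feature vectors, one per image location (encoder output F)
    hidden:   current decoder (GRU) hidden state
    Returns (context, weights), where context is the weighted sum of features.
    """
    weights = softmax([score_fn(f, hidden) for f in features])
    dim = len(features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
    ]
    return context, weights


# Toy example: three image locations with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]
context, weights = attend(features, hidden)
```

In this toy example the first and third locations score equally against the hidden state, so they receive the same weight, and the second location receives less.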

How the EyeSee model works to generate a caption

We built the model in TensorFlow. We used two datasets: the MSCOCO2014 dataset and our own dataset, made from pictures of Tech Avenue on the campus of the University of Mauritius.

Testing the EyeSee model

Below is a random image from the MSCOCO2014 dataset and to its right, the true caption and its predicted one.

Now an image from Tech Avenue and to its right the predicted caption.

Running the model on the Android application

We built the application in Android Studio. Whenever the Camera API in the Android framework captures a new frame, the frame is preprocessed, normalised and then run through our model. Based on the features extracted from that frame, a caption is generated and displayed in the application. The caption is also read out by a TextToSpeech module.
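The per-frame flow described above can be sketched roughly as follows. This is an illustration in Python rather than the Android code, and several details are assumptions: the normalisation range (here 0..255 mapped to [-1, 1]), the `<start>` and `<end>` tokens, greedy word-by-word decoding, and the `encode` and `decode_step` callables, which stand in for the real encoder and decoder.

```python
def normalise(pixels, mean=127.5, scale=127.5):
    """Map 8-bit pixel values (0..255) to [-1, 1]. The range is an assumption."""
    return [(p - mean) / scale for p in pixels]


def greedy_caption(frame_pixels, encode, decode_step,
                   start="<start>", end="<end>", max_len=20):
    """Caption one camera frame by picking one word per decoder step.

    encode:      hypothetical callable, normalised pixels -> image features
    decode_step: hypothetical callable, (features, previous token, state)
                 -> (next token, new state)
    """
    features = encode(normalise(frame_pixels))
    words, token, state = [], start, None
    for _ in range(max_len):
        token, state = decode_step(features, token, state)
        if token == end:
            break
        words.append(token)
    return " ".join(words)


# Stand-ins for the real model, used only to exercise the loop.
def fake_encode(pixels):
    return pixels


def fake_decode_step(features, token, state):
    script = ["a", "street", "with", "trees", "<end>"]
    index = 0 if state is None else state
    return script[index], index + 1


caption = greedy_caption([0, 128, 255], fake_encode, fake_decode_step)
```

In the application, the resulting string is what gets displayed on screen and handed to the text-to-speech module.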

Overview of the android application, EyeSee
Demo of the EyeSee application for English captions

We wanted to take it up a notch. Up to this point, the application took in an image and produced English captions. This time around, we annotated the Tech Avenue dataset with Mauritian Kreol sentences, retrained the model and updated the application with the new model.

Demo of the EyeSee application for Kreol captions

However, the lack of a TextToSpeech module for Mauritian Kreol hampered the most crucial part of the application: the Kreol caption could only be read on screen, not heard. While there are ways we could improve the application or make the model more accurate, we were able to see AI applied in the real, day-to-day world. Someone with bad eyesight or a blind person could now know what’s in Tech Avenue using EyeSee.

If you wish to know more about the model, we published the work here. This paper and thesis would not have been possible without Tarini.
