[ML] ‘Bottom-Up and Top-Down Attention for Image Captioning and VQA’ Paper One-page Summary
1. Abstract
Image captioning and VQA sit at the intersection of CV and NLP. The top-down approach (left side of the figure) has mainly been used for the visual side of these tasks, but it has an intuitive limitation: it attends over the image in a grid-wise way rather than object-wise.
If we instead add a proper bottom-up approach (right side of the figure), the model can understand and use image information at the level of objects and salient regions. This paper shows how to improve image captioning and VQA performance by combining attention in a top-down and bottom-up manner.
2. Structure
a) Bottom-Up Attention Model
In this paper, Faster R-CNN serves as the bottom-up mechanism to be combined with a top-down one. It uses ResNet-101 as its pre-trained backbone. Faster R-CNN proposes candidate regions and labels them to detect the objects in a given image, and the authors add an attribute predictor so that each candidate region is described by attributes in addition to its object class. This bottom-up mechanism is then combined with a different top-down module for each of image captioning and VQA.
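As a rough sketch of the interface, the top-down models only ever consume a variable-sized set of region feature vectors. The helper `region_detector` and `extract_bottom_up_features` below are hypothetical names, and the cap of 36 regions is illustrative; this is not the authors' exact implementation.

```python
# Minimal sketch of the bottom-up interface, assuming a hypothetical
# `region_detector` that wraps Faster R-CNN (ResNet-101 backbone) and
# returns one mean-pooled 2048-d feature per detected region.
import torch

def extract_bottom_up_features(image: torch.Tensor, region_detector, max_regions: int = 36):
    """Return a (k, 2048) tensor of region features with k <= max_regions."""
    boxes, features = region_detector(image)   # assumed: (k, 4) boxes, (k, 2048) features
    return features[:max_regions]

# The top-down captioning and VQA models consume only this set of region
# vectors; a global image descriptor is obtained by simple mean pooling:
#   v_mean = region_features.mean(dim=0)
```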
b) Captioning Model
The captioning model uses an LSTM that treats the partial output sequence generated so far as context, and soft top-down attention is used to calculate the feature weights at each caption-generation step.
It consists of two LSTM layers, each responsible for a different part. The first layer is the top-down attention LSTM, whose input is the previous hidden state of the language LSTM, the mean-pooled image feature, and the previously generated word. The output y_t at each step is the word produced by the language LSTM, and the probability of the full output sentence is obtained by multiplying the conditional distributions at each time step.
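The PyTorch sketch below shows one decoding step of this two-layer design: the attention LSTM conditions the soft attention over region features, and the language LSTM predicts the next word from the attended feature. `TopDownCaptioner`, the layer sizes, and the single-image batch are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownCaptioner(nn.Module):
    """Minimal sketch of the two-layer captioning decoder; sizes are illustrative."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=1000, hidden_dim=1000, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Layer 1: top-down attention LSTM
        self.attn_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        # Soft attention over the k region features
        self.W_v = nn.Linear(feat_dim, attn_dim)
        self.W_h = nn.Linear(hidden_dim, attn_dim)
        self.w_a = nn.Linear(attn_dim, 1)
        # Layer 2: language LSTM
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_word, v, state):
        # prev_word: (1,) id of the previously generated word
        # v: (k, feat_dim) bottom-up region features for one image
        (h1, c1), (h2, c2) = state
        v_mean = v.mean(dim=0, keepdim=True)                      # mean-pooled image feature
        x1 = torch.cat([h2, v_mean, self.embed(prev_word)], dim=1)
        h1, c1 = self.attn_lstm(x1, (h1, c1))
        # Normalised attention weights over regions, conditioned on h1
        scores = self.w_a(torch.tanh(self.W_v(v) + self.W_h(h1))).squeeze(1)
        alpha = F.softmax(scores, dim=0)
        v_hat = (alpha.unsqueeze(1) * v).sum(dim=0, keepdim=True)  # attended image feature
        x2 = torch.cat([v_hat, h1], dim=1)
        h2, c2 = self.lang_lstm(x2, (h2, c2))
        word_logits = self.classifier(h2)                          # distribution over y_t
        return word_logits, ((h1, c1), (h2, c2))
```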
c) VQA Model
In the VQA model, the question is used as the context, and, just as in the captioning model, soft attention is used.
It is a joint multimodal embedding structure that uses both the image and the question. The question guides the attention over the image features, and the predicted scores for the candidate answers are then computed from the combined question and attended image representations (an element-wise product in the paper).
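A comparable sketch of the VQA branch follows, again with assumed names (`TopDownVQA`) and illustrative sizes; the paper's gated tanh layers are simplified here to plain linear + tanh layers for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownVQA(nn.Module):
    """Minimal sketch of the VQA branch; dimensions are illustrative."""
    def __init__(self, vocab_size, num_answers, feat_dim=2048, embed_dim=300, q_dim=512, joint_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_gru = nn.GRU(embed_dim, q_dim, batch_first=True)    # question encoder
        # Soft attention over region features, conditioned on the question
        self.W_v = nn.Linear(feat_dim, joint_dim)
        self.W_q = nn.Linear(q_dim, joint_dim)
        self.w_a = nn.Linear(joint_dim, 1)
        # Joint embedding and answer classifier
        self.f_v = nn.Linear(feat_dim, joint_dim)
        self.f_q = nn.Linear(q_dim, joint_dim)
        self.classifier = nn.Linear(joint_dim, num_answers)

    def forward(self, question_ids, v):
        # question_ids: (1, n_words) token ids; v: (k, feat_dim) region features
        _, q = self.q_gru(self.embed(question_ids))                # final hidden state
        q = q.squeeze(0)                                           # (1, q_dim)
        scores = self.w_a(torch.tanh(self.W_v(v) + self.W_q(q))).squeeze(1)
        alpha = F.softmax(scores, dim=0)                           # attention over regions
        v_hat = (alpha.unsqueeze(1) * v).sum(dim=0, keepdim=True)  # attended image feature
        joint = torch.tanh(self.f_q(q)) * torch.tanh(self.f_v(v_hat))  # element-wise fusion
        return self.classifier(joint)                              # scores for candidate answers
```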
3. Conclusion
The experiments were carried out on the MS-COCO and VQA v2.0 datasets and, as expected, the model outperforms the existing approaches. Qualitative analysis also shows that the model has a good understanding of the image, attending to the relevant regions.
All credit to Anderson et al., “Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering,” CVPR 2018.