Results and Conclusion

Figure 8 shows an example of attention maps for a single caption generated by each model. The attention maps are overlaid as heatmaps on the resized input images, where lighter cells indicate higher average attention scores. The first set of attention maps is from the DenseNet visual attention model, and the second set is from the ViT visual attention model. The difference in detail between the two sets of attention maps reflects the different image feature map sizes of each model (7x7 for DenseNet, 14x14 for ViT).

For the image on the top left, the predicted caption was "a young man is bowling at a bowling alley." In the maps associated with the words describing the person in the image (young man is bowling), the cells in the center of the image, where the person is located, have higher scores. Conversely, the words describing the setting (at a bowling alley) have higher attention scores in the cells that border the image. In the second set of attention maps, for the caption "a red and white vehicle is on the beach," there is a similarly clear distinction between attention to the foreground (a red and white vehicle) and the background (on the beach).

The visual attention model architecture showed a substantial improvement in performance over the CNN-LSTM model architecture, with the visual attention model using ViT achieving the highest performance of all the candidate models. With the exception of the Custom CNN-LSTM and ResNet50-LSTM models, most of the models were able to generate coherent captions of varying accuracy and quality. The CNN-LSTM models generally produced accurate captions for easily identifiable images, but the visual attention model variations showed an increased capacity, with room for improvement, for accurately captioning more ambiguous, less interpretable images.
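As a minimal sketch of the kind of visualization described above, the snippet below upsamples a per-word attention grid to the input image size and overlays it as a semi-transparent heatmap. It assumes a 224x224 resized input and a flat attention vector over the feature-map grid (49 values for a 7x7 DenseNet map, 196 values for a 14x14 ViT map); the function name, image path, and sizes are illustrative and not taken from the project code.

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image


def overlay_attention(image: Image.Image, attn_weights: np.ndarray,
                      word: str, input_size: int = 224) -> None:
    """Upsample a per-word attention vector to the image size and plot it."""
    grid = int(np.sqrt(attn_weights.size))  # 7 for DenseNet, 14 for ViT
    attn_map = attn_weights.reshape(grid, grid)

    # Nearest-neighbour upsampling keeps the coarse grid cells visible,
    # matching the blocky heatmaps described for Figure 8.
    attn_img = Image.fromarray(attn_map.astype(np.float32))
    attn_img = attn_img.resize((input_size, input_size), resample=Image.NEAREST)

    resized = image.resize((input_size, input_size))
    plt.imshow(resized)
    plt.imshow(np.asarray(attn_img), cmap="gray", alpha=0.6)  # lighter = higher attention
    plt.title(word)
    plt.axis("off")
    plt.show()


# Example usage with random weights standing in for real decoder attention:
# overlay_attention(Image.open("bowling.jpg"), np.random.rand(196), "bowling")

In practice, one such overlay would be produced for each generated word, giving a sequence of heatmaps per caption like the sets shown in Figure 8.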