M.S. AAI Capstone Chronicles 2024

Conclusion and Next Steps

From the small set of images we tested, the performance of the DenseNet121 Visual Attention model appears comparable to that of ViT GPT2, although ViT GPT2 has an advantage in generating more diverse descriptions than our model. Since the secondary goal of this project was to see whether we could build an image captioning model much smaller than contemporary models but with comparable performance, we also compared model sizes (measured by file size). The DenseNet121 Visual Attention model is 97 MB, less than one tenth the size of ViT GPT2, which is 982 MB.

Looking to the future, there are many interesting avenues for improving and expanding this project. To start, we could add a temperature value to the model's prediction logic to make caption generation non-deterministic. With our current methods, the model always predicts either the token with the highest probability (for greedy search) or the set of tokens with the highest joint probability (for beam search); introducing temperature would rescale the predicted probabilities so that slightly less probable tokens can occasionally be selected. This could improve the diversity of the predicted captions and make them more human-like.

If feasible, it would also be beneficial to train the same models on a bigger dataset, such as MS COCO. Typically, generative machine learning models are trained on enormous amounts of data, and the fact that all of our models overfit slightly to moderately suggests that they have more learning capacity than the available data supports. Additionally, when testing the web app that we built as a demo for one of the visual attention models, we observed that the model tends to produce poorer-quality captions for pictures captured with smartphones or for images that would be uncommon in a collection of personal photography, such as wildlife scenes. We suspect this is because the dataset, curated in 2014, primarily consists of images from Flickr, which at the
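The temperature idea described above can be sketched as follows. This is a minimal illustration, not code from our project: the function name and inputs are hypothetical, and the logits would in practice come from the captioning model's decoder at each step.

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Pick the next token id from raw logits.

    temperature == 0 reproduces greedy search (argmax);
    higher temperatures flatten the distribution, letting
    less probable tokens be chosen and making captions
    non-deterministic and more diverse.
    """
    if temperature == 0:
        return int(np.argmax(logits))
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

With a temperature near zero the sampler almost always agrees with greedy search, while values around 1.0 or above trade some fidelity for variety; the same scaling could also be applied to the per-step scores inside beam search.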

