M.S. AAI Capstone Chronicles 2024

Introduction

The goal of this project is to build an image captioning model that accurately generates text descriptions of images. Image captioning combines Natural Language Processing and Computer Vision methods to learn semantic relationships between images and language and produce captions for images. Its applications range from image search tools to accessibility services for people who are blind or visually impaired; it can even assist in medical diagnosis by generating medical reports from diagnostic imaging (Beddiar & Oussalah, 2023).

The dataset we use in this project is Flickr30k, introduced in the 2014 paper "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions" (Young et al., 2014). Flickr30k is an open-source dataset containing thousands of images with reference captions, curated for computer vision problems such as image captioning and image retrieval. Each image in the dataset has five associated captions describing it, obtained by crowdsourcing.

Many state-of-the-art machine learning models for captioning images already exist; however, they tend to be very large, making them expensive to train and deploy. As a secondary goal of this project, we aim to build an image captioning model with reasonable performance but a much smaller size than contemporary models. We foresee the end users of this model being developers who want to integrate image captioning into a system with strict memory or latency constraints, such as a wearable device that describes their surroundings to people with low or no vision.
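To make the dataset's structure concrete, the sketch below groups reference captions by image filename, as a captioning pipeline would before training or evaluation. This is a minimal illustration, not our pipeline: the inline sample data and the two-column `image,caption` CSV layout are assumptions for demonstration, since Flickr30k is distributed in several file formats.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed "image,caption" CSV layout; the actual
# Flickr30k distribution may use a different delimiter or column names.
# In the full dataset, every image appears with five reference captions.
SAMPLE = """image,caption
1000092795.jpg,Two young guys with shaggy hair look at their hands.
1000092795.jpg,Two young men are outside near many bushes.
10002456.jpg,Several men in hard hats operate a pulley system.
"""

def load_captions(text):
    """Group reference captions by image filename."""
    captions = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        captions[row["image"]].append(row["caption"].strip())
    return dict(captions)

captions = load_captions(SAMPLE)
print(sorted(captions))            # image filenames in the sample
print(len(captions["1000092795.jpg"]))  # number of captions for one image
```

Grouping all five references under each image matters downstream: caption-quality metrics such as BLEU score a generated caption against the whole set of references, not a single one.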

