the images, but for the model that did not use pre-computed image features, we resized the images to 64x64 pixels and normalized the pixel values to the range [0, 1]. Initially, we experimented with randomly selecting one caption per image to reduce the memory and time needed to train the models, but after the improvements described above, we were able to train all of the models on all of the image-caption pairs, for a total of 155,070 training examples.

Background Information

Image captioning is a popular deep learning problem and has been studied extensively. One of the most common architectures for this problem is the encoder-decoder pattern (Papers with Code - Image Captioning, n.d.). At a high level, this pattern can be broken into two general parts: (1) an input image and its caption are each encoded into dense feature representations, and (2) those representations are decoded into an output sequence of words and phrases to produce an image caption. The encoder can be any kind of neural network that is well suited to image data, and the decoder can be any kind of neural network well suited to sequential data. One common method is to use a CNN (Convolutional Neural Network) as the encoder for the images and an LSTM (Long Short-Term Memory) network either as an encoder for the captions or as the decoder. In one such architecture, the feature maps extracted by the CNN are used as input to the LSTM decoder along with the captions as a single input sequence; when generating captions, the image information is included in the initial state of the LSTM (Hossain et al., 2018). Alternatively, the LSTM can encode the caption independently of the CNN, and the combined image and caption representations can then be decoded through a feed-forward network, with each next word generated from that combined representation.
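The resizing and normalization step described above can be sketched in a few lines. The following is a minimal illustration only: the use of TensorFlow, the JPEG format, and the preprocess_image helper name are assumptions for this sketch rather than the project's exact pipeline.

import tensorflow as tf

def preprocess_image(path):
    """Load an image, resize it to 64x64, and scale pixel values to [0, 1].

    A minimal sketch of the preprocessing described in the text; loading
    details (file paths, image format) are assumptions, not project code.
    """
    raw = tf.io.read_file(path)
    img = tf.io.decode_jpeg(raw, channels=3)    # decode to an RGB tensor
    img = tf.image.resize(img, (64, 64))        # resize to 64x64 pixels
    img = tf.cast(img, tf.float32) / 255.0      # normalize to [0, 1]
    return img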
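The first CNN+LSTM variant, in which the image information is included in the initial state of the LSTM decoder, might look roughly like the Keras sketch below. The layer sizes, the 2048-dimensional pre-computed feature vector, and the functional-API structure are illustrative assumptions, not the architecture actually used in this project.

from tensorflow.keras import layers, Model

# Hypothetical sizes for illustration only.
VOCAB_SIZE, MAX_LEN, EMBED_DIM, UNITS, FEAT_DIM = 10000, 30, 256, 256, 2048

# Image branch: a pre-computed CNN feature vector is projected to the
# LSTM's state size and used as the decoder's initial state.
image_features = layers.Input(shape=(FEAT_DIM,), name="image_features")
init_state = layers.Dense(UNITS, activation="relu")(image_features)

# Caption branch: token ids are embedded and fed to the LSTM decoder.
caption_in = layers.Input(shape=(MAX_LEN,), name="caption_tokens")
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)

# The LSTM starts from the image-derived state, so each predicted word is
# conditioned on the image as well as the preceding caption words.
decoder_out = layers.LSTM(UNITS)(embedded, initial_state=[init_state, init_state])
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(decoder_out)

inject_model = Model([image_features, caption_in], next_word)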
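The alternative arrangement, in which the LSTM encodes the caption independently of the CNN and a feed-forward network decodes the combined representations, could be sketched as follows; again, the layer sizes and Keras API choices are assumptions for illustration.

from tensorflow.keras import layers, Model

# Hypothetical sizes for illustration only.
VOCAB_SIZE, MAX_LEN, EMBED_DIM, UNITS, FEAT_DIM = 10000, 30, 256, 256, 2048

# Image encoder: project the pre-computed CNN feature vector.
image_features = layers.Input(shape=(FEAT_DIM,), name="image_features")
image_enc = layers.Dense(UNITS, activation="relu")(image_features)

# Caption encoder: the LSTM summarizes the partial caption on its own,
# independently of the image.
caption_in = layers.Input(shape=(MAX_LEN,), name="caption_tokens")
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)
caption_enc = layers.LSTM(UNITS)(embedded)

# Combine the two dense representations and decode them with a
# feed-forward network that predicts the next word of the caption.
merged = layers.concatenate([image_enc, caption_enc])
hidden = layers.Dense(UNITS, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

merge_model = Model([image_features, caption_in], next_word)

In both sketches the model predicts a single next word at a time, so a full caption would be produced by repeatedly feeding the words generated so far back in through the caption input.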