
wearing) and so on. Figure 5 shows a full example of this process. Breaking the captions up like this allows the encoder to capture both short-range dependencies between individual words and longer-range dependencies that reflect the context of the full caption. As with the image encoder, the features extracted by the LSTM are compressed into a dense representation.

Figure 6

Example of Training Data Structure for CNN-LSTM Models
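The structure in Figure 6 can be produced by expanding each caption into (image, partial caption) → next-word training pairs. The report does not include the preprocessing code, so the following is a minimal sketch assuming a Keras-style pipeline; the function name, the padding scheme, and the use of one-hot targets are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def make_training_pairs(image_features, caption_token_ids, max_len, vocab_size):
    """Expand one tokenized caption (with start/end tokens) into training samples.

    For a caption of N tokens this yields N-1 samples: each sample pairs the
    image features with the words seen so far and asks the model to predict
    the next word, matching the structure shown in Figure 6.
    """
    X_img, X_seq, y = [], [], []
    for i in range(1, len(caption_token_ids)):
        partial = caption_token_ids[:i]              # words seen so far
        next_word = caption_token_ids[i]             # word to be predicted
        partial = pad_sequences([partial], maxlen=max_len)[0]
        X_img.append(image_features)
        X_seq.append(partial)
        y.append(to_categorical(next_word, num_classes=vocab_size))
    return np.array(X_img), np.array(X_seq), np.array(y)
```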

The image representation and the caption representation are then combined into a single representation that is fed as input to the decoder, a feedforward network. The decoder predicts the next word in a caption, one word at a time. During inference, captions are built incrementally, starting from the start-sequence token. The following table shows the variations in the architecture of each candidate CNN-LSTM model. All of the pre-trained CNNs used in this project use weights from training on ImageNet.
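As a rough sketch of this merge-and-decode step, the code below builds a small Keras model that combines the image and caption representations and decodes the next word, followed by a greedy inference loop that grows a caption from the start token. The layer widths, dropout rates, and the "startseq"/"endseq" token names are assumptions for illustration; the actual candidate models vary as described in the table that follows.

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences

def build_caption_model(feature_dim, vocab_size, max_len, hidden=256):
    # Image branch: compress pre-trained CNN features into a dense representation
    img_in = Input(shape=(feature_dim,))
    img_vec = Dense(hidden, activation="relu")(Dropout(0.5)(img_in))

    # Caption branch: embed the partial caption and encode it with an LSTM
    seq_in = Input(shape=(max_len,))
    seq_emb = Embedding(vocab_size, hidden, mask_zero=True)(seq_in)
    seq_vec = LSTM(hidden)(Dropout(0.5)(seq_emb))

    # Merge the two representations; a feedforward decoder predicts the next word
    merged = add([img_vec, seq_vec])
    out = Dense(vocab_size, activation="softmax")(Dense(hidden, activation="relu")(merged))
    return Model(inputs=[img_in, seq_in], outputs=out)

def greedy_caption(model, image_features, tokenizer, max_len,
                   start="startseq", end="endseq"):
    # Build the caption incrementally, starting from the start-sequence token
    caption = [start]
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([" ".join(caption)])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        probs = model.predict([image_features[None, :], seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)), end)
        if word == end:
            break
        caption.append(word)
    return " ".join(caption[1:])
```

The merge design keeps the CNN and LSTM as separate encoders whose outputs are combined only at the decoder, which is why each candidate model can swap in a different pre-trained CNN without changing the caption branch.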

