M.S. AAI Capstone Chronicles 2024
Experimental Methods
Training Protocol
We built and tested six models: four CNN-LSTM models and two visual attention models. To train them, the data was split into training, validation, and testing sets. The Flickr30k dataset ships with predefined splits used for benchmarking, so those splits were used rather than random sampling: the training set contains about 94% of the data (29,000 samples), and the validation and testing sets each contain about 3% (1,000 samples each).

All of the models were trained with categorical cross-entropy loss, with accuracy logged as an additional metric. Although categorical cross-entropy is typically used to evaluate multi-class classification models, it is well suited to this task: the models output a single word at each time step, so each step can be treated as a multi-class classification problem whose class labels are the tokenizer vocabulary.

The models were trained for 5 to 15 epochs, with callbacks to reduce the learning rate and to stop training early if performance on the validation set stopped improving. Another callback displayed a predicted caption for a test image after every epoch. Each model followed a similar training protocol but was tuned to the particulars of its architecture.
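To make the loss and callback behavior concrete, the sketch below shows (1) categorical cross-entropy computed per caption time step, treating each step as a classification over the vocabulary, and (2) a simplified loop mimicking the learning-rate-reduction and early-stopping callbacks. This is an illustrative NumPy sketch, not the project's actual training code; the function names and the `patience`/`factor` parameters are hypothetical.

```python
import numpy as np

def caption_cross_entropy(logits, targets):
    """Mean categorical cross-entropy over the time steps of one caption.

    logits  : (T, V) unnormalized scores, one vocabulary-sized row per time step
    targets : (T,)   integer token ids from the tokenizer vocabulary
    Each time step is an independent multi-class classification over V words.
    """
    # Numerically stable log-softmax per time step.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-probability of the correct token at each step.
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll.mean()

def train_with_callbacks(val_losses, patience=2, lr=1e-3, factor=0.5):
    """Sketch of the callback logic (hypothetical parameters): decay the
    learning rate when validation loss stops improving, and stop training
    after `patience` epochs without improvement."""
    best, stale = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            lr *= factor           # ReduceLROnPlateau-style decay
            if stale >= patience:  # EarlyStopping-style halt
                return epoch + 1, lr
    return len(val_losses), lr
```

With uniform logits the per-step loss is log V, the entropy of a uniform guess over the vocabulary, which is why a shrinking cross-entropy directly tracks how confidently the model picks the right next word.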
CNN-LSTM Architecture
Four of the candidate models are variations of the CNN-LSTM architecture. Each model has an encoder, split into two parts – an image encoder and a caption (or language) encoder – and a decoder. Figure 5 shows a high-level diagram of this architecture.
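The data flow through the two encoders and the decoder can be sketched with plain NumPy shapes. This is a simplified illustration under assumed sizes, not the actual model: a single tanh recurrence stands in for the LSTM (a real LSTM carries gates and a cell state), and all dimensions (`V`, `T`, `D`, `H`, `F`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
V, T, D, H = 5000, 20, 256, 512   # vocab size, caption length, embed dim, hidden dim
F = 2048                          # pooled CNN feature size

# Image encoder: a pretrained CNN's pooled features, projected to the hidden size.
cnn_features = rng.standard_normal(F)           # stand-in for the CNN output
W_img = rng.standard_normal((F, H)) * 0.01
image_encoding = np.tanh(cnn_features @ W_img)  # (H,)

# Caption (language) encoder: token ids -> embeddings -> recurrent summary.
token_ids = rng.integers(0, V, size=T)
E = rng.standard_normal((V, D)) * 0.01
caption_embeddings = E[token_ids]               # (T, D)

W_x = rng.standard_normal((D, H)) * 0.01
W_h = rng.standard_normal((H, H)) * 0.01
h = np.zeros(H)
for x in caption_embeddings:                    # simplified stand-in for the LSTM
    h = np.tanh(x @ W_x + h @ W_h)              # (H,) running caption state

# Decoder: merge the two encodings and score the next word over the vocabulary.
W_out = rng.standard_normal((H, V)) * 0.01
logits = (image_encoding + h) @ W_out           # (V,) one score per vocabulary word
next_word = int(np.argmax(logits))
```

The key structural point the sketch preserves is that the image and caption paths are encoded separately into a common hidden size before the decoder combines them to predict the next token.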