M.S. AAI Capstone Chronicles 2024

Table 1

Candidate Models

Model   Image Encoder   Caption Encoder   Decoder
A       Custom CNN      LSTM              Feed Forward Network
B       MobileNetV3     LSTM              Feed Forward Network
C       VGG16           LSTM              Feed Forward Network
D       ResNet50        LSTM              Feed Forward Network

Custom CNN-LSTM

For this model, a simple CNN was built from scratch for the image encoder rather than using a pre-trained CNN to extract image features as a preprocessing step. The CNN has 3 convolutional layers, each with a stride of 2, and 16, 32, and 64 filters, respectively. The number of filters doubles at every layer so that the deeper layers can capture more complex features, and a stride of 2 was chosen because the input images for this model are kept relatively small (64x64) to stay within memory constraints. This model stacks two LSTMs in the language encoder, each with 256 units. Dropout layers are used in the language encoder and the decoder to reduce overfitting; however, increasing the dropout rate above 0.2 hurt the performance of the model without improving overfitting. The model was trained for up to 10 epochs with a learning rate of 1e-3; early stopping ended training at epoch 8, by which point the learning rate had been reduced to 1e-5.
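The architecture described above can be sketched in Keras as follows. The layer sizes (3 conv layers with stride 2 and 16/32/64 filters, two stacked 256-unit LSTMs, 0.2 dropout, a feed-forward decoder) follow the text; the vocabulary size, caption length, embedding width, and the additive merge of image and caption features are placeholder assumptions, not details from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 5000  # assumed vocabulary size (not stated in the text)
MAX_LEN = 30       # assumed maximum caption length (not stated in the text)

# Image encoder: 3 conv layers, stride 2, 16/32/64 filters, 64x64 input.
img_in = layers.Input(shape=(64, 64, 3))
x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(img_in)
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
img_feat = layers.Dense(256, activation="relu")(x)

# Language encoder: embedding plus two stacked LSTMs with 256 units each,
# with dropout (0.2) to reduce overfitting.
cap_in = layers.Input(shape=(MAX_LEN,))
e = layers.Embedding(VOCAB_SIZE, 256, mask_zero=True)(cap_in)
e = layers.Dropout(0.2)(e)
e = layers.LSTM(256, return_sequences=True)(e)
cap_feat = layers.LSTM(256)(e)

# Decoder: feed-forward network over the merged features, predicting the
# next word of the caption.
merged = layers.add([img_feat, cap_feat])
d = layers.Dense(256, activation="relu")(merged)
d = layers.Dropout(0.2)(d)
out = layers.Dense(VOCAB_SIZE, activation="softmax")(d)

model = Model([img_in, cap_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy")
```

In this sketch the decoder runs once per word: at inference time the caption generated so far is fed back through `cap_in` to predict the next token.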

MobileNetV3-LSTM

This model is very similar to the Custom CNN-LSTM model, but uses image features extracted from MobileNetV3-Small (Howard et al., 2019) as input to the image encoder.
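Using a pre-trained backbone as a frozen feature extractor can be sketched as below. The choice of MobileNetV3-Small follows the text; the input resolution and pooling mode are assumptions, and `weights=None` is used here only so the sketch runs without downloading pre-trained weights (the actual model would load ImageNet weights).

```python
import tensorflow as tf

# MobileNetV3-Small as a fixed feature extractor (assumed 224x224 input,
# global average pooling over the final feature map).
base = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3),
    include_top=False,
    pooling="avg",
    weights=None,  # assumption for the sketch; use weights="imagenet" in practice
)
base.trainable = False  # frozen: features are extracted as a preprocessing step

images = tf.random.uniform((2, 224, 224, 3))  # dummy batch of 2 images
features = base(images, training=False)       # one feature vector per image
```

The resulting feature vectors would then replace the custom CNN's output as input to the image encoder, with the rest of the pipeline unchanged.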
