M.S. AAI Capstone Chronicles 2024
Table 1
Candidate Models

Model   Image Encoder   Caption Encoder   Decoder
A       Custom CNN      LSTM              Feed Forward Network
B       MobileNetV3     LSTM              Feed Forward Network
C       VGG16           LSTM              Feed Forward Network
D       ResNet50        LSTM              Feed Forward Network
Custom CNN-LSTM
For this model, a simple CNN was built from scratch for the image encoder instead of using a pre-trained CNN to extract image features as a preprocessing step. The CNN has three convolutional layers, each with a stride of 2 and 16, 32, and 64 filters, respectively. The number of filters doubles at each layer so that the deeper layers can capture richer features, and a stride of 2 was chosen because the input images are kept relatively small (64x64) to stay within memory constraints. The language encoder stacks two LSTMs, each with 256 units. Dropout layers in the language encoder and the decoder reduce overfitting; however, raising the dropout rate above 0.2 hurt model performance without further reducing overfitting. The model was trained with an initial learning rate of 1e-3 for up to 10 epochs; early stopping ended training at epoch 8, by which point the learning rate had been reduced to 1e-5.
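A minimal sketch of the two encoders described above, written in PyTorch (the paper does not name its framework, so this is an assumption); the layer counts, strides, filter sizes, LSTM units, and dropout rate come from the text, while the kernel size, padding, embedding dimension, and projection layer are illustrative choices:

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Custom CNN: 3 conv layers, stride 2, filters doubling 16 -> 32 -> 64."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
        )
        # Project flattened feature map to a fixed-size embedding (assumed detail).
        self.project = nn.Linear(64 * 8 * 8, embed_dim)

    def forward(self, images):              # images: (B, 3, 64, 64)
        x = self.features(images).flatten(1)
        return self.project(x)              # (B, embed_dim)

class CaptionEncoder(nn.Module):
    """Two stacked LSTMs with 256 units each; dropout 0.2 between the layers."""
    def __init__(self, vocab_size, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, 256, num_layers=2,
                            dropout=0.2, batch_first=True)

    def forward(self, tokens):              # tokens: (B, T) integer word IDs
        out, _ = self.lstm(self.embed(tokens))
        return out[:, -1, :]                # last time step, (B, 256)
```

Both encoders emit a 256-dimensional vector, so their outputs can be combined (e.g. concatenated or added) before the feed-forward decoder predicts the next caption word.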
MobileNetV3-LSTM
This model is very similar to the Custom CNN-LSTM model, but uses image features extracted from MobileNetV3-Small (Howard et al., 2019) as input to the image encoder.