M.S. AAI Capstone Chronicles 2024

MobileNetV3 is part of a family of lightweight CNN models optimized to run on a wide range of hardware, including phone CPUs, making it markedly smaller and more efficient than other pre-trained CNNs. The feature vectors extracted with MobileNetV3-Small have a shape of (576,), which is relatively small. The MobileNetV3-LSTM model was trained for up to 15 epochs with an initial learning rate of 1e-3; early stopping ended training at epoch 11, by which point the learning rate had been reduced to 1e-7.

VGG16-LSTM

This model uses VGG16 (Simonyan & Zisserman, 2015) as the image feature extractor. VGG16 is an image classification network that achieves 92.7% top-5 accuracy across the 1000 ImageNet categories. It remains a popular image classification model due to its architecture: a deep neural network built from relatively small (3x3) convolutional filters. The feature vectors extracted with VGG16 have a shape of (512,), making it the smallest pre-trained CNN tested. A single LSTM is used in this model to encode the caption, and a single dense layer is used in the hidden layers of the decoder. This model was trained for 5 epochs with a learning rate of 1e-3.

ResNet50-LSTM

The final variation of the CNN-LSTM architecture uses ResNet50 (He et al., 2015). The ResNet CNN family uses residual learning, in which layers of the network learn residuals (the difference between a layer's desired output and its input) instead of the full transformation, allowing substantially deeper networks than other CNN architectures. The feature vectors for ResNet50 have the shape (2048,), making this the largest pre-trained CNN used. This model originally shared a very similar architecture with the other CNN-LSTM models discussed, but was modified to concatenate the encoded image and embedded caption into one
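The feature-vector shapes quoted above can be reproduced with the off-the-shelf Keras Applications backbones. This is a minimal sketch, not the authors' exact extraction pipeline; `weights=None` is used here only so the shapes can be inspected without downloading the ImageNet weights (a real extractor would use `weights="imagenet"`).

```python
import numpy as np
import tensorflow as tf

# Each backbone with its classifier head removed and global average
# pooling applied, so the output is a single feature vector per image.
backbones = {
    "MobileNetV3-Small": tf.keras.applications.MobileNetV3Small(
        include_top=False, pooling="avg", weights=None),
    "VGG16": tf.keras.applications.VGG16(
        include_top=False, pooling="avg", weights=None),
    "ResNet50": tf.keras.applications.ResNet50(
        include_top=False, pooling="avg", weights=None),
}

batch = np.random.rand(1, 224, 224, 3).astype("float32")
# Feature dimension per backbone:
# MobileNetV3-Small -> 576, VGG16 -> 512, ResNet50 -> 2048
shapes = {name: int(model(batch).shape[-1]) for name, model in backbones.items()}
```

Note the spread in feature size: ResNet50's 2048-dimensional vectors give the decoder roughly four times as much input as VGG16's 512.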
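The residual-learning idea behind ResNet can be sketched in a few lines: the convolutional branch learns F(x) and the block outputs F(x) + x, so the layers fit only the residual rather than the full mapping. This is an illustrative basic block (after He et al., 2015), not ResNet50's exact bottleneck design.

```python
import tensorflow as tf

def residual_block(x, filters):
    # Skip connection carries the input straight through;
    # the conv branch learns the residual F(x).
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    # Block output is F(x) + x, followed by the activation.
    return tf.keras.layers.ReLU()(tf.keras.layers.Add()([y, shortcut]))

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs, 64)
model = tf.keras.Model(inputs, outputs)
```

Because the shortcut is an identity, a block can default to passing its input through unchanged, which is what makes very deep stacks of such blocks trainable.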

