
sequence. As mentioned earlier, transformers “see” the entire sequence at once, so causal masking is applied to prevent tokens from attending to tokens later in the sequence. The updated sequential embedding and the image features are then fed to the cross-attention layer, where the sequential embedding serves as the query and the image features serve as the key and value. This allows tokens in the sequential embedding to attend to the image, enriching the sequential embedding again, this time with contextual information about the relationships between parts of the image and tokens in the sequence. The output of the cross-attention layer then passes through a feed-forward network to further refine the sequential embedding. At this point, the updated embedding is either fed to the next decoder block as input or passed through the output layer to make a prediction.

The first variation of the visual attention architecture uses DenseNet121 (Huang et al., 2018) as the image feature extractor. DenseNet is a family of CNNs pre-trained on ImageNet. The DenseNet models differ from other pre-trained CNNs in their use of dense convolutional blocks, in which every layer is directly connected to every other layer in the block rather than only to the layer that follows it. This improves feature propagation and mitigates the vanishing gradient problem.

This model uses two decoder layers and was trained with a dropout rate of 0.5. Like the CNN-LSTM models, it was trained with early stopping and reduce-learning-rate-on-plateau callbacks, as well as a callback that generates a prediction from the model after every epoch. It was configured to train for up to 15 epochs with a learning rate of 1e-4, and early stopping ended training after 10 epochs.

DenseNet121 Visual Attention
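To make the block structure described above concrete, the following is a minimal sketch of one decoder block, assuming a TensorFlow/Keras implementation (consistent with the Keras-style callbacks mentioned for training); the layer sizes, head count, and class name DecoderBlock are illustrative assumptions, not the authors' code.

import tensorflow as tf
from tensorflow.keras import layers

class DecoderBlock(layers.Layer):
    def __init__(self, embed_dim=256, num_heads=8, ff_dim=512, dropout=0.5):
        super().__init__()
        self.self_attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim, dropout=dropout)
        self.cross_attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim, dropout=dropout)
        self.ffn = tf.keras.Sequential([
            layers.Dense(ff_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()
        self.norm3 = layers.LayerNormalization()

    def call(self, token_embeddings, image_features, training=False):
        # Causal self-attention: the causal mask keeps each token from
        # attending to tokens later in the sequence.
        x = self.self_attn(query=token_embeddings, value=token_embeddings,
                           key=token_embeddings, use_causal_mask=True,
                           training=training)
        x = self.norm1(token_embeddings + x)

        # Cross-attention: the sequential embedding is the query; the image
        # features are the key and value, so tokens can attend to the image.
        y = self.cross_attn(query=x, value=image_features, key=image_features,
                            training=training)
        y = self.norm2(x + y)

        # Feed-forward network further refines the sequential embedding; the
        # result goes to the next block or to the output layer.
        return self.norm3(y + self.ffn(y))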
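Similarly, a sketch of how the frozen DenseNet121 feature extractor and the training callbacks could be set up under the same Keras assumption; names such as caption_model, train_ds, val_ds, sample_image, and generate_caption are hypothetical placeholders, and the patience values and 224x224 input size are illustrative.

import tensorflow as tf
from tensorflow.keras import layers, callbacks
from tensorflow.keras.applications import DenseNet121

# ImageNet-pretrained DenseNet121 with the classification head removed; its
# final 7x7x1024 feature map is flattened into 49 image "tokens" that the
# decoder's cross-attention can attend to. Inputs are assumed to be already
# preprocessed for DenseNet.
base = DenseNet121(include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False

image_input = tf.keras.Input(shape=(224, 224, 3))
features = base(image_input)                    # (batch, 7, 7, 1024)
features = layers.Reshape((-1, 1024))(features)  # (batch, 49, 1024)
feature_extractor = tf.keras.Model(image_input, features)

class CaptionPreview(callbacks.Callback):
    """Generates a sample prediction after every epoch (hypothetical helper)."""
    def __init__(self, sample_image):
        super().__init__()
        self.sample_image = sample_image

    def on_epoch_end(self, epoch, logs=None):
        print(generate_caption(self.model, self.sample_image))  # hypothetical

cb = [
    callbacks.EarlyStopping(patience=3, restore_best_weights=True),
    callbacks.ReduceLROnPlateau(factor=0.5, patience=2),
    # CaptionPreview(sample_image),
]

# caption_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=...)
# caption_model.fit(train_ds, validation_data=val_ds, epochs=15, callbacks=cb)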
