sequence and run this sequence through the LSTM instead of feeding only the caption sequences to the LSTM. The goal was to see whether better captions could be generated by including the image features as context for the LSTM; however, this approach was not effective, and the model loss did not decrease past the first epoch.
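As a point of reference, the sketch below illustrates this variant in PyTorch: the pooled image feature vector is projected and prepended to the caption embeddings, and the combined sequence is passed through the LSTM. The report does not include code, so the framework, class names, and dimensions here are illustrative assumptions.

```python
# Minimal sketch (assumed PyTorch implementation, not the authors' code):
# the image feature vector acts as the first "token" of the input sequence.
import torch
import torch.nn as nn

class ImagePrefixLSTM(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(feat_dim, embed_dim)   # map CNN features to embedding size
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):
        # img_feats: (batch, feat_dim) pooled CNN features
        # captions:  (batch, seq_len) caption token ids
        img_token = self.img_proj(img_feats).unsqueeze(1)   # (batch, 1, embed_dim)
        cap_embed = self.embed(captions)                    # (batch, seq_len, embed_dim)
        seq = torch.cat([img_token, cap_embed], dim=1)      # image prepended to caption sequence
        out, _ = self.lstm(seq)
        return self.fc(out)                                 # per-step vocabulary logits
```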
Visual Attention Models
The second architecture explored was a visual attention architecture in which a multi-layer transformer block is used as the decoder. A high-level visualization of this architecture is presented in the diagram below. The dashed lines encapsulate the transformer block, which can be stacked.
Figure 7
Visual Attention Architecture Diagram
Similar to the CNN-LSTM architecture, the visual attention architecture uses a
pre-trained model, either a CNN or a vision transformer, to extract features from the images. For the captions, a sequential embedding is created. Because transformers process sequences in parallel, they are permutation invariant by default (Vaswani et al., 2023). This means the order of the caption tokens must be supplied explicitly by adding positional information to the token embeddings.
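A minimal sketch of this decoder, again assuming a PyTorch implementation with illustrative names and dimensions, is shown below: caption tokens receive learned token and positional embeddings to restore order information, and stacked transformer decoder blocks cross-attend to the image features produced by the pre-trained backbone.

```python
# Minimal sketch (assumed PyTorch implementation, not the authors' code) of a
# transformer decoder that attends over extracted image features.
import torch
import torch.nn as nn

class CaptionTransformerDecoder(nn.Module):
    def __init__(self, vocab_size, max_len=40, d_model=256, nhead=8,
                 num_layers=2, feat_dim=2048):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)           # learned positional embedding
        self.feat_proj = nn.Linear(feat_dim, d_model)             # project image features to d_model
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)   # stackable transformer blocks
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, captions, img_feats):
        # captions:  (batch, seq_len) caption token ids
        # img_feats: (batch, num_regions, feat_dim) spatial features from the backbone
        seq_len = captions.size(1)
        positions = torch.arange(seq_len, device=captions.device)
        x = self.token_embed(captions) + self.pos_embed(positions)    # order-aware embeddings
        memory = self.feat_proj(img_feats)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(captions.device)
        out = self.decoder(x, memory, tgt_mask=causal_mask)           # cross-attention to image features
        return self.fc(out)                                           # per-step vocabulary logits
```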