Figure 5. CNN-LSTM Architecture Diagram
In the image encoder, a CNN is used to extract features from the images. Each convolutional layer of the CNN learns to capture progressively more abstract features: earlier layers might learn to identify corners or edges, while deeper layers may learn to identify more intricate patterns such as shapes or textures. These features are then compressed into a dense representation. In the language encoder, a similar process is carried out for the captions. Each caption is tokenized into words, embedded to capture semantic context, and passed to an LSTM, which learns the temporal relationships across the sequence. For example, the caption “A man is wearing a hat” would become multiple training examples: (a, man), (a man, is), (a man is, wearing), and so on, with each growing prefix paired with the next word the model should predict.
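To make the prefix-to-next-word expansion concrete, the short Python sketch below builds the training pairs for a single caption. It is a minimal, framework-free illustration of the idea described above; the function name make_training_pairs and the plain whitespace tokenization are assumptions made for this example, not the project's actual preprocessing code.

def make_training_pairs(caption):
    """Expand one caption into (prefix, next-word) pairs, mirroring
    the (a, man), (a man, is), ... pattern described above."""
    tokens = caption.lower().split()  # simple whitespace tokenization (assumed)
    pairs = []
    for i in range(1, len(tokens)):
        prefix = " ".join(tokens[:i])  # words generated so far
        next_word = tokens[i]          # word the LSTM learns to predict
        pairs.append((prefix, next_word))
    return pairs

print(make_training_pairs("A man is wearing a hat"))
# [('a', 'man'), ('a man', 'is'), ('a man is', 'wearing'),
#  ('a man is wearing', 'a'), ('a man is wearing a', 'hat')]

During training, each of these pairs would typically also be associated with the image's CNN feature vector, so the model learns to predict the next word conditioned on both the image and the caption prefix.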