
current time step and the previous state of the input. This process continues until the model produces the end-of-sentence token. As an alternative to a CNN image encoder, a vision transformer (ViT) can be used. Vision transformers adapt the standard transformer architecture to vision by splitting an image into fixed-size patches that are treated as a sequence of tokens (Dosovitskiy et al., 2021). Vision transformers require less compute than CNNs but need a substantial amount of training data to reach equivalent performance.

Another approach is to use a transformer for the decoder (Vision Encoder Decoder Models, n.d.). In this approach, a transformer or pre-trained language model such as BERT or GPT-2 decodes the encoded image into a caption. One popular pre-trained model that follows this pattern is ViT-GPT2, which pairs a vision transformer encoder with a GPT-2 decoder (nlpconnect, 2023).

A distinctive aspect of training models to generate text is that, during training, the model is shown the correct caption for each image both as input and as the ground truth against which its predictions are compared. Feeding the reference caption to the decoder as input is a technique known as teacher forcing (Wong, 2019). At each sequential step, teacher forcing supplies the correct previous token to the decoder instead of the model's own prediction from the preceding step. This makes training faster and more stable by preventing the accumulation of errors that occurs when the model makes poor predictions early in training.
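As an illustration of the vision encoder-decoder pattern described above, the following sketch loads the pre-trained ViT-GPT2 captioning model from the Hugging Face Hub and generates a caption for a single image. The image path and the generation settings (beam size, maximum length) are placeholder choices, not values from this project.

import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Load the ViT encoder + GPT-2 decoder captioning model (nlpconnect, 2023).
model_id = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_id)
processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# "example.jpg" is a placeholder path for any RGB image.
image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# The ViT encoder turns the image patches into token embeddings;
# GPT-2 decodes them autoregressively into a caption.
with torch.no_grad():
    output_ids = model.generate(pixel_values, max_length=16, num_beams=4)

caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)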
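To make the teacher-forcing idea concrete, the following PyTorch-style sketch shows one training step for a simple LSTM caption decoder; the class and variable names (SimpleDecoder, image_features, captions) are illustrative assumptions rather than code from this project. The key point is that the decoder input is the ground-truth caption shifted right by one position, and the model's own predictions are only used at inference time.

import torch
import torch.nn as nn

class SimpleDecoder(nn.Module):
    """Toy caption decoder conditioned on a single image feature vector."""
    def __init__(self, vocab_size, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, image_features):
        # Initialize the LSTM state from the image features (batch, hidden).
        h0 = image_features.unsqueeze(0)
        c0 = torch.zeros_like(h0)
        x = self.embed(tokens)                   # (batch, seq_len, hidden)
        hidden_states, _ = self.lstm(x, (h0, c0))
        return self.out(hidden_states)           # (batch, seq_len, vocab)

def teacher_forced_step(decoder, image_features, captions, criterion):
    # Teacher forcing: the decoder sees the ground-truth tokens
    # <start> w1 ... w_{n-1} as input, not its own previous predictions.
    decoder_input = captions[:, :-1]
    targets = captions[:, 1:]                    # w1 ... w_n, ending in <end>
    logits = decoder(decoder_input, image_features)
    loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return loss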

