
ViT Visual Attention

The second variation used a vision transformer instead of a CNN as the image feature extractor. The specific vision transformer used was the google/vit-base-patch16-224-in21k model available on Hugging Face (Wightman, 2019). This variation also uses two decoder layers, but was trained with a dropout rate of 0.3. Training was configured for a maximum of 15 epochs with a learning rate of 1e-4, and early stopping ended training after 11 epochs.
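
The encoder-decoder wiring described above can be illustrated with a short sketch. The class name, vocabulary size, attention-head count, and the use of PyTorch's built-in transformer decoder are illustrative assumptions rather than the exact implementation, and positional encodings for the caption tokens are omitted for brevity.

import torch
import torch.nn as nn
from transformers import ViTModel

class ViTCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=768, num_decoder_layers=2, dropout=0.3):
        super().__init__()
        # Pretrained ViT backbone used purely as the image feature extractor.
        self.encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=8, dropout=dropout, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_decoder_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, pixel_values, caption_ids):
        # The ViT patch embeddings serve as the cross-attention "memory".
        memory = self.encoder(pixel_values=pixel_values).last_hidden_state
        tgt = self.embed(caption_ids)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(
            caption_ids.size(1)
        ).to(caption_ids.device)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.out(hidden)  # logits over the vocabulary at each position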

Inference

Image captioning is an autoregressive task, meaning that during inference the caption is generated sequentially, with each word predicted based on the words predicted in prior steps. To predict a caption for a new image, the model is called with the image and the start-sequence token as input. The model predicts a word, that word is appended to the input sequence, and the model is called again. This continues until either the model predicts the end-sequence token, indicating the end of the caption, or the caption reaches a specified maximum length. This approach is a greedy search, which is fast to compute but can sometimes produce suboptimal results because it only considers the single most likely word at each time step.

An alternative approach for generating captions is beam search. With beam search using k beams, multiple candidate sequences are maintained; at each time step, the candidate sequences are sorted by their joint probabilities and only the top k sequences are kept. Once all of the candidate sequences are complete, the sequence (caption) with the highest joint probability is chosen.

A benefit of the visual attention approach is that the attention scores from the cross-attention layers in the decoder can be mapped onto the input image to visualize which parts of the image the model attended to when generating each word of the caption.
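
The greedy search described above reduces to a short decoding loop. The following is a minimal sketch, assuming a captioner with the interface from the earlier sketch and hypothetical start_id and end_id token ids.

import torch

@torch.no_grad()
def greedy_caption(model, pixel_values, start_id, end_id, max_len=30):
    model.eval()
    # Begin with the start-sequence token only.
    seq = torch.tensor([[start_id]], device=pixel_values.device)
    for _ in range(max_len):
        logits = model(pixel_values, seq)        # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax().item()  # keep only the single best word
        seq = torch.cat([seq, torch.tensor([[next_id]], device=seq.device)], dim=1)
        if next_id == end_id:                    # end-sequence token ends the caption
            break
    return seq[0].tolist()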
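
Beam search can be sketched in the same setting. Log-probabilities are summed so that the cumulative score corresponds to the joint probability of the sequence; the function and variable names are illustrative assumptions.

import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search_caption(model, pixel_values, start_id, end_id, k=3, max_len=30):
    model.eval()
    device = pixel_values.device
    beams = [([start_id], 0.0)]        # (token ids, cumulative log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_id:   # a finished beam is set aside as a result
                completed.append((tokens, score))
                continue
            logits = model(pixel_values, torch.tensor([tokens], device=device))
            log_probs = F.log_softmax(logits[0, -1], dim=-1)
            top_lp, top_ids = log_probs.topk(k)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp))
        if not candidates:             # every beam has emitted the end token
            break
        # Sort candidates by joint (log) probability and keep only the top k.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    completed.extend(b for b in beams if b[0][-1] != end_id)  # beams cut off at max_len
    # Choose the caption with the highest joint probability.
    return max(completed, key=lambda c: c[1])[0]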
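
The attention overlay can also be sketched, assuming the cross-attention weights for one generated word are available as a float tensor over the ViT's [CLS] token plus its 14 x 14 grid of patch tokens; the variable names and the 224 x 224 input size are assumptions tied to the patch16-224 model rather than details from the report.

import torch.nn.functional as F
import matplotlib.pyplot as plt

def show_attention(image, attn_weights, patch_grid=14):
    # attn_weights: (1 + patch_grid**2,) attention over [CLS] + patch tokens.
    patch_attn = attn_weights[1:].reshape(1, 1, patch_grid, patch_grid)
    # Upsample the patch-level map to the 224 x 224 input resolution.
    heatmap = F.interpolate(patch_attn, size=(224, 224), mode="bilinear",
                            align_corners=False)[0, 0]
    plt.imshow(image)  # image: 224 x 224 x 3 array
    plt.imshow(heatmap.detach().cpu().numpy(), alpha=0.5, cmap="jet")
    plt.axis("off")
    plt.show()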

