
For example, in the figure on the previous page, none of the reference captions describe the uniforms the dirt bikers are wearing, while the generated caption focuses on their uniforms. This would result in a lower accuracy score, even though the generated caption is correct. That said, accuracy is simple and efficient to compute, so it can be used as a proxy for how well the model is learning from the data during training. The loss function, cross-entropy loss, is less susceptible to this issue because it evaluates the predicted probabilities for all tokens rather than exact matches. Although dropout layers and other regularization techniques were used in the models, they showed signs of slight to moderate overfitting during training, depending on the model. This was especially prominent for the visual attention model variations. Since adding regularization to the models did not make much of a difference in the overfitting behavior, this could indicate that the learning capacity of the models is greater than the available amount of data.

For the final evaluation of the models, we computed BLEU and METEOR scores for predictions generated by each model on the test set. BLEU (Bilingual Evaluation Understudy) is a commonly used set of metrics for natural language processing tasks that evaluates the quality of generated text by computing the proportion of n-gram overlap (Doshi, 2021). BLEU-1, for instance, compares unigram overlap between a predicted caption and a set of references, while BLEU-4 compares 4-gram overlap. BLEU is a popular choice of metric because it is easy to compute and interpret, but it is limited by its inability to capture semantic similarity. Similar to accuracy, BLEU scores would penalize the caption in Figure 9 for using words that do not appear in the reference captions, even though they correctly describe the image.
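As a rough illustration of how n-gram overlap is scored, the sketch below computes BLEU-1 and BLEU-4 for a single caption using NLTK's sentence_bleu. The captions are invented for the example and are not taken from the project's dataset; the smoothing choice is also an assumption, not part of the original evaluation.

```python
# Minimal sketch of BLEU-1 / BLEU-4 scoring with NLTK (hypothetical captions).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Reference captions and one generated caption, tokenized into words.
references = [
    "a group of dirt bikers ride down a trail".split(),
    "several people riding dirt bikes on a dirt path".split(),
]
candidate = "dirt bikers in matching uniforms ride along a trail".split()

# Smoothing avoids zero scores when a higher-order n-gram has no overlap at all.
smooth = SmoothingFunction().method1

# BLEU-1 uses only unigram precision; BLEU-4 averages 1- through 4-gram precision.
bleu1 = sentence_bleu(references, candidate, weights=(1, 0, 0, 0),
                      smoothing_function=smooth)
bleu4 = sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)

print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```

Because words such as "uniforms" never appear in the reference captions, they lower the score even when they correctly describe the image, which is exactly the limitation of n-gram overlap metrics discussed above.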

