METEOR (Metric for Evaluation of Translation with Explicit ORdering) also measures the alignment of generated text against a set of references, but it is a more sophisticated metric than BLEU because it treats semantically similar words as matches and combines precision and recall in a weighted average to capture both accuracy and fluency (Avinash, 2024). This makes it a more expressive measure, but also more expensive to compute. The table below shows the scores for each model (ResNet50-LSTM was not evaluated).
Table 2
Evaluation Scores for Each Model, Normalized to 0-100 Scale
Model                            BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR
Visual Attention (ViT) (B)           65      47      37      23      42
Visual Attention (ViT) (G)           65      46      36      22      41
Visual Attention (DenseNet) (B)      59      41      32      19      38
Visual Attention (DenseNet) (G)      61      42      32      19      38
MobileNetV3-LSTM (G)                 57      39      30      17      35
VGG16-LSTM (G)                       36      22      17       9      30
Custom CNN-LSTM (G)                  46      27      20      11      26
Note. (G) indicates greedy search. (B) indicates beam search.
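As a rough illustration of how metrics of this kind are computed, the sketch below scores a single candidate caption against two references with NLTK's sentence_bleu and meteor_score. The caption strings are made-up examples, and the choice of smoothing function is an assumption to avoid zero scores on short sentences; this is not the evaluation pipeline used for the models above.

```python
# Minimal sketch: corpus-agnostic BLEU and METEOR for one caption (assumed setup).
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching

# Hypothetical reference captions and model output, pre-tokenized.
references = [
    "a brown dog runs across a grassy field".split(),
    "a dog is running through a field".split(),
]
candidate = "a dog runs in a field".split()

smooth = SmoothingFunction().method1  # assumption: smoothing for short captions
bleu_1 = sentence_bleu(references, candidate, weights=(1, 0, 0, 0),
                       smoothing_function=smooth)
bleu_4 = sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25),
                       smoothing_function=smooth)
meteor = meteor_score(references, candidate)

# Report on the 0-100 scale used in Table 2.
print(f"BLEU-1: {bleu_1 * 100:.0f}  BLEU-4: {bleu_4 * 100:.0f}  METEOR: {meteor * 100:.0f}")
```

Because METEOR credits synonym and stem matches while BLEU-4 requires exact four-gram overlap, a caption that paraphrases the references will tend to score noticeably higher on METEOR than on BLEU-4, which is the pattern discussed next.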
The METEOR scores for the visual attention models are strong relative to the size and complexity of the Flickr30k dataset; however, the BLEU-4 scores range from passable for the visual attention models to poor for the CNN-LSTM models. The fact that the METEOR scores are consistently higher than the BLEU-4 scores likely indicates that the captions are accurate but tend to use different words than the reference captions. The top three models registered on Papers with Code evaluated against the Flickr30k test set have the BLEU-4