M.S. AAI Capstone Chronicles 2024

Figure 3 Word Cloud for Flickr30k Captions
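
The audit of non-alphabetic characters discussed in the next paragraph could be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the sample captions, the `audit_non_alpha` helper, and the definition of "non-alphabetic" as anything other than an ASCII letter or whitespace are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample standing in for the Flickr30k captions
captions = [
    "A clock on the wall shows 2:20 PM.",
    "A man walking reads a wall asking 'Where are you?'",
    "Two dogs play in the grass",
    "A sign reads #1 Dad & Mom",
]

# Assumption: anything that is not an ASCII letter or whitespace is "non-alphabetic"
NON_ALPHA = re.compile(r"[^A-Za-z\s]")


def audit_non_alpha(captions, max_examples=3):
    """Count each non-alphabetic character and keep a few captions as context."""
    counts = Counter()
    examples = defaultdict(list)
    for caption in captions:
        found = NON_ALPHA.findall(caption)
        counts.update(found)
        # Store each caption at most once per character so context stays readable
        for ch in set(found):
            if len(examples[ch]) < max_examples:
                examples[ch].append(caption)
    return counts, dict(examples)


counts, examples = audit_non_alpha(captions)
for ch, n in counts.most_common():
    print(repr(ch), n, "|", examples[ch][0])
```

Keeping a handful of example captions per character makes it possible to judge by eye whether a character is a typo or carries meaning before deciding to strip it.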

We further explored the text by checking for non-alphabetic characters, examining the contexts in which they appear and how frequently they occur. Removing non-alphabetic characters is a common text-cleaning step because they can add noise that hurts model performance, but in some cases these characters carry important information. For example, a caption describing an image of a clock would most likely give the time in numeric form (e.g., 2:20 PM) rather than in words (e.g., two twenty PM). Removing these characters would make such a caption incoherent, which could in turn make the captions generated by the model incoherent. Most of the non-alphabetic characters we identified in the captions occurred very infrequently. Numeric characters appeared most frequently but were still uncommon overall (1,776 instances across 155,070 total captions). This infrequency made it easy to analyze how these characters were used in the captions. Some special characters, such as “@” and “=”, appeared only as typos, whereas others, such as “#” and “&”, were always meaningful. Question marks presented the most ambiguity; they were sometimes used literally (e.g., “A man walking reads a wall asking ‘Where are you?’”), and sometimes used
