M.S. AAI Capstone Chronicles 2024
by the annotator to express doubt (e.g., “A crowd admiring modern art?”). We used this information to inform our text cleaning strategy.

Data Cleaning

Common text preprocessing steps such as stop word removal and stemming can cause a text generation model to produce incoherent text, so we applied only minimal text cleaning operations to the captions: converting the text to lowercase, removing punctuation, removing special characters that carry no meaning, removing excess whitespace, and replacing characters that translate directly to a word, such as “%” (percent) or “&” (and), with that word.

After cleaning the captions, we applied tokenization and encoding to convert them into a numerical representation. We used either the tokenizer from TensorFlow’s Keras API or the Text Vectorization layer from the same API, which can be used as a tokenizer once it has been adapted to the vocabulary of the data (Abadi et al., 2015). Both tokenization methods split each caption into substrings, usually at the word level.

The text preprocessing steps were applied to the entire dataset; however, this was not feasible for the images. The biggest challenge presented by this dataset is its size. Preprocessing all the images at once would require loading all of them into memory, which can quickly exceed available memory. To get around this issue, we designed a data generator that dynamically loads the images and their captions in batches, loading only one batch into memory at a time. This greatly reduces memory usage.

Most of the candidate models we built use pre-trained models to extract features from the images, so we pre-computed the image features for each model and served these features from the data generators for those models instead of the full images, further reducing memory usage and training time. Each of these pre-trained models has its own preprocessing method that scales and normalizes