M.S. AAI Capstone Chronicles 2024
A.S.LINGUIST
pronouns, auxiliaries, verbs and articles are the most frequent words in the conversations
dataset.
Figure 3
Boxplots of the number of words and characters in the questions and answers of the
conversations dataset
During text preprocessing, we applied several modifications to both questions and answers.
We first expanded contractions, replacing expressions like “i’ve” with “I have” and “i’m” with
“I am”, then removed punctuation, and finally lowercased the text. Tokenization, a common
preprocessing step when building a natural language model, was performed with the tokenizer
of the pre-trained model we chose for the project, Flan-T5-Base.
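
The cleaning steps described above can be illustrated with a minimal sketch. It is not the project's actual code: the contraction map `CONTRACTIONS` and the function names `expand_contractions`, `remove_punctuation` and `preprocess` are hypothetical, and the map covers only a handful of forms for illustration.

```python
import re
import string

# Hypothetical, deliberately small contraction map (assumption: the
# source does not list the mapping that was actually used).
CONTRACTIONS = {
    "i've": "I have",
    "i'm": "I am",
    "don't": "do not",
    "can't": "cannot",
}

# One case-insensitive pattern that matches any key as a whole word.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace known contractions with their expanded forms."""
    # Normalize the typographic apostrophe so that "i’ve" matches "i've".
    text = text.replace("\u2019", "'")
    return _CONTRACTION_RE.sub(
        lambda match: CONTRACTIONS[match.group(0).lower()], text
    )


def remove_punctuation(text):
    """Delete ASCII punctuation characters (string.punctuation only)."""
    return text.translate(str.maketrans("", "", string.punctuation))


def preprocess(text):
    """Apply the three steps in the order given in the text."""
    text = expand_contractions(text)
    text = remove_punctuation(text)
    text = text.lower()
    # Collapse any whitespace runs left behind by removed characters.
    return " ".join(text.split())


if __name__ == "__main__":
    print(preprocess("I’ve been sad, and i'm tired!"))
```

Contractions are expanded before punctuation is removed because stripping the apostrophe first would turn “i’m” into “im”, which the map could no longer match. The cleaned strings would then be passed to the Flan-T5-Base tokenizer; with the Hugging Face transformers library this is typically `AutoTokenizer.from_pretrained("google/flan-t5-base")`, although the source does not state which library was used.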