M.S. AAI Capstone Chronicles 2024


A.S.LINGUIST

pronouns, auxiliaries, verbs and articles are the most frequent words in the conversations datasets.

Figure 3

Boxplots of the number of words and characters in the questions and answers of the conversations dataset

During text preprocessing, we applied several modifications to both questions and answers.
We first expanded contractions, replacing expressions such as “i’ve” with “I have” and “i’m” with
“I am”; we then removed punctuation and, finally, lowercased the text. Tokenization, a common
preprocessing step when building a natural language model, was performed with the tokenizer of
the pre-trained model we chose for the project, Flan-T5-Base.
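The three cleaning steps described above can be illustrated with a minimal sketch. It is not the project's actual code: the contraction map `CONTRACTIONS` and the function names are hypothetical, and the map is deliberately tiny, since the source does not say which contraction list or library was used.

```python
import re
import string

# Hypothetical, deliberately small contraction map (an assumption for
# illustration; the source does not list the mapping that was used).
CONTRACTIONS = {
    "i've": "i have",
    "i'm": "i am",
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
}

# One case-insensitive pattern that matches any key of the map as a whole word.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace known contractions with their expanded forms."""
    # Normalize the typographic apostrophe so that "i’ve" and "i've" both match.
    text = text.replace("\u2019", "'")
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def remove_punctuation(text):
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def preprocess(text):
    """Apply the steps in the order given in the text:
    expand contractions, remove punctuation, lowercase."""
    text = expand_contractions(text)
    text = remove_punctuation(text)
    text = text.lower()
    # Collapse any runs of whitespace left behind by the removals.
    return " ".join(text.split())


if __name__ == "__main__":
    print(preprocess("I\u2019ve been tired, and I'm stressed!"))
```

Contractions are expanded before punctuation is removed because the apostrophe is itself a punctuation character; stripping it first would turn “i’ve” into “ive” and leave nothing to expand. The cleaned strings could then be passed to the Flan-T5-Base tokenizer, for example one loaded with Hugging Face's `AutoTokenizer.from_pretrained("google/flan-t5-base")`; that loading route is an assumption, since the source does not state how the tokenizer was obtained.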

