AAI_2025_Capstone_Chronicles_Combined


Exploratory analysis revealed clear, topic-specific vocabulary patterns at different math levels within the dataset. For instance, terms like "sin", "theta", and "pi" are prominent in trigonometry dialogues, whereas "derivatives", "series", and "ln" are prevalent in calculus conversations. The ‘Math Level’ variable serves as valuable metadata that contextualizes each conversation entry. Additionally, newly engineered features, including conversational turns, total word count, reading grade level, and sentiment, provide further descriptive characteristics.

Figure 1 shows the distribution of Math Level across all conversations in the CoMTA dataset, that is, the number of conversational entries available for each educational level, such as Elementary, Algebra, Geometry, Trigonometry, and Calculus. This visualization clarifies the composition of the dataset’s mathematical content: it highlights which levels are more heavily represented and shows whether the unsupervised clustering model is trained on a balanced or an intentionally imbalanced foundation.

Figure 1. Distribution of mathematical proficiency levels in the CoMTA dataset.
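The descriptive statistics discussed above can be illustrated with a minimal sketch. It assumes each conversation is stored as a record with a math level and a list of turns; the field names (`math_level`, `turns`), the sample texts, and the helper functions `tokenize` and `describe` are hypothetical and are not taken from the CoMTA schema or from our pipeline. Reading grade level and sentiment are omitted because they require additional libraries.

```python
import re
from collections import Counter, defaultdict

# Hypothetical records: field names and texts are illustrative only,
# not the actual CoMTA schema or data.
conversations = [
    {"math_level": "Trigonometry",
     "turns": ["What is sin of theta when theta is pi over 2?",
               "sin of pi over 2 equals 1."]},
    {"math_level": "Calculus",
     "turns": ["How do I take derivatives of ln x?",
               "The derivative of ln x is 1 over x."]},
    {"math_level": "Calculus",
     "turns": ["Does this series converge?",
               "Apply the ratio test to the series."]},
]

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def describe(conv):
    """Return two of the engineered features: turn count and word count."""
    words = [w for turn in conv["turns"] for w in tokenize(turn)]
    return {
        "math_level": conv["math_level"],
        "n_turns": len(conv["turns"]),
        "n_words": len(words),
    }

# Number of conversations per math level (the quantity plotted in Figure 1).
level_counts = Counter(c["math_level"] for c in conversations)

# Term frequencies per math level, to surface topic-specific vocabulary.
term_counts = defaultdict(Counter)
for conv in conversations:
    for turn in conv["turns"]:
        term_counts[conv["math_level"]].update(tokenize(turn))

features = [describe(c) for c in conversations]

if __name__ == "__main__":
    print(level_counts)
    for level, counts in term_counts.items():
        print(level, counts.most_common(3))
    print(features)
```

In practice, common function words would be removed before ranking terms so that level-specific vocabulary such as "theta" or "series" stands out.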

The length of student-chatbot conversations, measured by both the number of turns and the total word count, provides insight into the complexity of the interaction. As shown in

