AAI_2025_Capstone_Chronicles_Combined
Alternative Models Optimization
A range of LLMs was tested for the LLM-as-classifier model using human evaluation. Tested models included Gemini Flash 2.5, Claude Sonnet 4, and OpenAI GPT-5 (standard, mini, and nano variants). One notable outcome of the testing process was that many of the larger, more advanced models produced worse human evaluation results than the smaller models. This appeared to stem from the larger models' tendency to fixate on highly nuanced readings during the labeling phase, resulting in ambiguous or unintuitive cluster labels. GPT-5 Nano was chosen for the final comparison, as it presented the best balance of speed, cost, and cluster quality.
Results
The final autoencoder showed strong performance, achieving an MSE reconstruction loss of 0.36 on the validation dataset. Training and validation loss decreased rapidly over the first 20 epochs (see Figure 4), then declined gradually before plateauing around epoch 120; early stopping ended training at epoch 160. The two losses tracked each other closely until the training loss diverged from the validation loss in later epochs. The plateau in validation loss combined with the continued decline in training loss indicates overfitting, which could potentially be addressed with more aggressive regularization or with data augmentation to increase the size of the dataset.
As shown in Figure 5, the DEC phase of combined clustering and autoencoder training converged after 35 iterations, with fully stable cluster assignments after this point. The four generated clusters were generally balanced in size, containing 51, 31, 50, and 56 conversations. To improve cluster interpretability, the conversations in each cluster were provided to the Claude Sonnet 4 large language model to interpret the common themes present in each cluster.
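The early-stopping behavior described above (training halting at epoch 160 after the validation loss plateaued) can be sketched with patience-based monitoring. This is a minimal illustration, not the report's actual implementation; the `patience` value and function name are assumptions.

```python
def train_with_early_stopping(val_losses, patience=40):
    """Return the 1-indexed epoch at which training stops.

    Stops once validation loss has failed to improve for `patience`
    consecutive epochs; otherwise runs through all epochs.
    """
    best = float("inf")  # best validation loss seen so far
    stale = 0            # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch  # early stop triggered
    return len(val_losses)
```

In a real training loop, the same counter logic would wrap each epoch's validation pass, restoring the best checkpoint when the stop triggers.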
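The DEC iterations referenced above alternate between computing soft cluster assignments in the autoencoder's latent space and sharpening them toward a target distribution. A minimal NumPy sketch of the standard DEC formulation (Student's t kernel with one degree of freedom, and the squared-and-renormalized target distribution) is shown below; the report's exact hyperparameters are not given, so this is illustrative only.

```python
import numpy as np

def soft_assign(z, centers):
    """Soft assignment q[i, j]: similarity of embedding z[i] to center j,
    using a Student's t kernel (alpha = 1), normalized over clusters."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target p: square q and normalize by cluster frequency,
    emphasizing high-confidence assignments."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def assignment_change(q_new, q_old):
    """Fraction of points whose hard (argmax) cluster changed --
    DEC training is typically stopped once this falls below a tolerance."""
    return float((q_new.argmax(axis=1) != q_old.argmax(axis=1)).mean())
```

During training, the encoder and centers are updated to minimize KL(p || q), with p refreshed periodically; convergence (as in the report's iteration 35) corresponds to `assignment_change` reaching zero between refreshes.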
This technique found that the DEC model clustered the conversations largely along pedagogical boundaries, creating groupings for “Step-By-Step Guided Practice”, “Addressing Errors and Building Understanding”, “Extended Practice Across Topics”, and “Diverse Problem-Solving Approaches”. The full LLM interpretation is provided in Appendix A. The generated cluster