
To support the optimization work, the SCT was implemented as a configurable hyper-model, with a suite of parameters that could be specified to determine the model's characteristics. These included the number and size of the attention heads in each transformer encoder, the number of encoders to create, and the depth and width of the fully connected network, as well as training hyperparameters such as the learning rate. An optimization job was executed using the Keras Random Search Tuner (Keras Team, n.d.-c). For each configurable parameter, an acceptable range of values was provided to the tuner, which then performed 25 trials, each consisting of randomly generated values for every parameter within its defined range. Within each trial, a model was compiled with the generated parameters and trained on the dataset. The tuner monitored the progress of each training run, searching for the lowest validation loss across all trials. The model configuration that achieved the lowest validation loss was selected as the best model and used for final performance validation against the held-out test dataset (an illustrative code sketch of this setup appears below).

Results

Through experimentation with varying depths and configurations, the best of the standalone CNN and Transformer-based models achieved good performance in both precision and recall. However, both model types tended to overfit on the training dataset and did not converge smoothly, as illustrated in the example training/validation loss curves shown in the two left-hand graphs in Figure 1 below. The condition did not improve much with the addition of regularization techniques, including dropout, L1 regularization, and batch normalization. The move to the SCT architecture effectively eliminated the overfitting problem, as illustrated in the far-right graph in Figure 1.
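As a concrete illustration of the tuning setup described above, the following sketch shows how such a search could be configured with the Keras Tuner RandomSearch API. It is a minimal sketch only: the hyper-model builder, the assumed input shape, the CNN front end, the binary classification head, and the specific parameter ranges are illustrative assumptions, not the exact configuration used in this work.

# Hedged sketch: builder, input shape, CNN front end, and search ranges are
# illustrative assumptions, not the paper's exact SCT configuration.
import keras
from keras import layers
import keras_tuner as kt

EMBED_DIM = 64  # assumed token embedding size produced by the CNN front end


def build_sct(hp):
    """Build a CNN + stacked-transformer-encoder model from tuner-chosen parameters."""
    # Architecture parameters exposed to the tuner, as described in the text.
    num_heads = hp.Int("num_heads", min_value=2, max_value=8, step=2)
    key_dim = hp.Int("key_dim", min_value=16, max_value=64, step=16)
    num_encoders = hp.Int("num_encoders", min_value=1, max_value=4)
    dense_width = hp.Int("dense_width", min_value=64, max_value=256, step=64)
    dense_depth = hp.Int("dense_depth", min_value=1, max_value=3)
    # Training hyperparameter exposed to the tuner.
    learning_rate = hp.Float("learning_rate", min_value=1e-4, max_value=1e-2, sampling="log")

    inputs = keras.Input(shape=(128, 128, 3))  # assumed input shape
    x = layers.Conv2D(32, 3, activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(EMBED_DIM, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Reshape((-1, EMBED_DIM))(x)  # flatten the spatial grid into a token sequence

    # Stack of transformer encoder blocks: self-attention plus feed-forward,
    # each with a residual connection and layer normalization.
    for _ in range(num_encoders):
        attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
        x = layers.LayerNormalization()(x + attn)
        ff = layers.Dense(EMBED_DIM, activation="relu")(x)
        x = layers.LayerNormalization()(x + ff)

    # Fully connected head with tunable depth and width.
    x = layers.GlobalAveragePooling1D()(x)
    for _ in range(dense_depth):
        x = layers.Dense(dense_width, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # assumed binary classification task

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=[keras.metrics.Precision(), keras.metrics.Recall()],
    )
    return model


# Random search: 25 trials, each sampling parameter values from the ranges above,
# with validation loss as the objective to minimize.
tuner = kt.RandomSearch(
    build_sct,
    objective="val_loss",
    max_trials=25,
    directory="tuning",
    project_name="sct_random_search",
)
# tuner.search(x_train, y_train, validation_data=(x_val, y_val), epochs=20)
# best_model = tuner.get_best_models(num_models=1)[0]
# best_model.evaluate(x_test, y_test)  # final check against the held-out test set

Calling tuner.search() with the training and validation splits runs the 25 trials, and tuner.get_best_models() then returns the configuration with the lowest validation loss for evaluation against the held-out test set.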

