AAI_2025_Capstone_Chronicles_Combined
ResolveAI
After this, the text is tokenized using a custom tokenizer that includes an out-of-vocabulary token and filters punctuation and special characters. The tokenized sequences are padded to the chosen maximum length, and the padded sequences are concatenated with additional numerical features (reshaped appropriately) to form a comprehensive feature set. The labels are then converted into a NumPy array for use during training. The data is split into training and test sets using an 80/20 ratio with stratification so that the class distribution is maintained in both subsets; this stratified, random split is essential for a representative evaluation of the model's performance.

Training is configured to run for up to 30 epochs with a small batch size of 32, allowing more granular updates to the model weights. An early stopping mechanism monitors the validation loss and halts training if the loss does not improve for 10 consecutive epochs, preventing overfitting. In parallel, a model checkpointing strategy saves the best-performing model based on validation loss, ensuring that only the model with the lowest observed validation loss is preserved. This combination of early stopping and checkpointing makes training both efficient and robust: the process terminates when improvements plateau, and the optimal model configuration is retained.

Model performance is evaluated using standard metrics such as accuracy, precision, and recall, which offer insight into both the overall and class-specific effectiveness of the model. Additionally, the F1-score, derived from precision and recall, provides a balanced measure of performance on this imbalanced binary classification task.
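The tokenization and feature-assembly step described above can be sketched as follows. This is a minimal illustration, not the project's actual tokenizer: the vocabulary-building rules, the example texts, and the numeric feature (a hypothetical priority score) are all assumptions; the shape of the pipeline (OOV token, punctuation filtering, post-padding, concatenation with reshaped numeric features) follows the description.

```python
import re
import numpy as np

def build_vocab(texts, oov_token="<OOV>"):
    # Index 0 is reserved for padding; index 1 for the out-of-vocabulary token.
    vocab = {oov_token: 1}
    for text in texts:
        # Filter punctuation/special characters, lowercase, split on whitespace.
        for word in re.sub(r"[^\w\s]", " ", text.lower()).split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def tokenize_and_pad(texts, vocab, max_len):
    # Map words to ids (unknown words -> OOV id 1), truncate, and post-pad with 0.
    out = np.zeros((len(texts), max_len), dtype=np.int64)
    for i, text in enumerate(texts):
        ids = [vocab.get(w, 1) for w in re.sub(r"[^\w\s]", " ", text.lower()).split()]
        ids = ids[:max_len]
        out[i, :len(ids)] = ids
    return out

texts = ["Server crashed again!", "Login works fine."]   # illustrative inputs
numeric = np.array([[3.0], [0.0]])                       # hypothetical extra numeric feature
vocab = build_vocab(texts)
padded = tokenize_and_pad(texts, vocab, max_len=6)
# Concatenate padded token ids with the reshaped numeric features.
features = np.hstack([padded, numeric.reshape(len(texts), -1)])
```

The resulting matrix has one row per document: the token ids followed by the numeric columns.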
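The stratified 80/20 split can be expressed compactly; in practice this is typically `sklearn.model_selection.train_test_split` with `stratify=y`, but the underlying logic is sketched here framework-free, with an invented imbalanced label array for demonstration:

```python
import numpy as np

def stratified_split(y, test_frac=0.2, seed=42):
    # Shuffle and split the indices of each class separately, so both
    # subsets preserve the original class distribution.
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        n_test = int(round(test_frac * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

y = np.array([0] * 80 + [1] * 20)   # hypothetical imbalanced labels
tr, te = stratified_split(y)
```

Both subsets keep the 80/20 class ratio, which is what makes the held-out evaluation representative.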
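The early-stopping and checkpointing behaviour (patience 10, up to 30 epochs, keep the best model by validation loss) corresponds to Keras's `EarlyStopping` and `ModelCheckpoint(save_best_only=True)` callbacks; the control flow they implement can be sketched framework-free. The simulated loss curve below is invented for illustration:

```python
import math

def train_with_early_stopping(val_losses, patience=10, max_epochs=30):
    # Track the best validation loss seen so far; "checkpoint" the epoch that
    # achieved it, and stop once `patience` epochs pass without improvement.
    best_loss = math.inf
    best_epoch = None
    wait = 0
    for epoch, val_loss in enumerate(val_losses[:max_epochs]):
        if val_loss < best_loss:
            best_loss = val_loss
            best_epoch = epoch   # checkpoint: keep this epoch's weights
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                break            # early stop: plateau detected
    return best_epoch, best_loss

# Hypothetical curve: improvement for 3 epochs, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.75] * 12
best_epoch, best_loss = train_with_early_stopping(losses)
```

Even though training runs past the plateau, the retained checkpoint is the epoch with the lowest observed validation loss.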
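The evaluation metrics named above reduce to the four confusion-matrix counts; a minimal implementation (the example labels are invented) makes the relationship between precision, recall, and F1 explicit:

```python
def binary_metrics(y_true, y_pred):
    # Accuracy, precision, recall, and F1 from raw binary predictions.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0],
                                    [1, 1, 0, 1, 0, 0, 0, 0])
```

On an imbalanced task, accuracy alone can look high while the minority class is poorly predicted, which is why F1 is reported alongside it.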