AAI_2025_Capstone_Chronicles_Combined


structures is a known failure mode that can prevent the model from producing detections even when losses decrease normally (an issue evident in early project results). This pipeline remains an active optimization target, alongside anchor-scale tuning for small fractures and longer training schedules supported by the literature (Pike et al., 2024).

5 Results

5.1 CNN

The initial simple CNN baseline model was designed for the binary classification task of “fracture” or “normal.” The model was trained for ten epochs and achieved a final training accuracy of 84%, as shown in Figure 3. In parallel, the training loss decreased consistently from start to finish. The training F1-score rose to approximately 0.67, suggesting that the model had begun to recognize certain features associated with fractures. Validation performance, however, was less stable. Validation accuracy remained within a narrow band in the mid-70% range, but the validation F1-score shifted considerably from epoch to epoch.

Figure 4 Test set confusion matrix of simple CNN model.

The model achieved a test accuracy of 75.7% and a test loss of 2.827, but it failed to detect fractures. It was severely biased toward the majority class: on the fracture class it achieved a recall of 0.079, meaning it found only 7.9% of actual fractures, and an F1-score of 0.140. On the majority (“normal”) class, the model achieved a recall of 0.983 and an F1-score of 0.859. This performance indicates that the model almost always predicts “normal,” which is visually confirmed in Figure 4. Out of 972 actual fracture cases in the test set, the model correctly identified only 77. On the other hand, the model missed only 49 normal cases. Together with the unstable validation F1-score, this result reinforces the influence of the dataset’s imbalance; it also suggests that subtle fracture indicators visible in only a fraction of slices are difficult for a basic 2D network to capture without additional architectural support.

Overall, the simple CNN baseline establishes that the model can learn meaningful anatomical information, yet it struggles to detect cervical fractures reliably. This baseline serves as a reference point for future modeling, and its behavior highlights the need for more capable architectures and richer contextual information.
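As a rough cross-check, the fracture-class metrics can be recomputed from the counts quoted in the text. The sketch below is illustrative only and is not the project's evaluation code. It assumes that the 49 "missed" normal cases are normal scans predicted as fracture (false positives); the function name `fracture_metrics` and the resulting precision value are introduced here and do not appear in the original results.

```python
# Counts quoted in the text for the fracture (positive) class.
ACTUAL_FRACTURES = 972   # actual fracture cases in the test set
TRUE_POSITIVES = 77      # fractures the model correctly identified

# Assumption: the 49 "missed" normal cases were predicted as fracture,
# i.e. they are false positives for the fracture class.
FALSE_POSITIVES = 49


def fracture_metrics(tp, fp, actual_pos):
    """Return recall, precision and F1 for the positive class."""
    fn = actual_pos - tp                      # fractures predicted as normal
    recall = tp / (tp + fn)                   # share of true fractures found
    precision = tp / (tp + fp)                # share of fracture calls that are right
    f1 = 2 * precision * recall / (precision + recall)
    return {"fn": fn, "recall": recall, "precision": precision, "f1": f1}


metrics = fracture_metrics(TRUE_POSITIVES, FALSE_POSITIVES, ACTUAL_FRACTURES)
print(f"missed fractures: {metrics['fn']}")
print(f"recall:    {metrics['recall']:.3f}")
print(f"precision: {metrics['precision']:.3f}")
print(f"F1-score:  {metrics['f1']:.3f}")
```

Under this reading of the counts, the recomputed recall and F1-score round to 0.079 and 0.140, which agrees with the reported fracture-class values. The precision figure is derived from the stated assumption rather than reported in the results.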

Figure 3 Simple CNN training and validation accuracy curves.
