AAI_2025_Capstone_Chronicles_Combined
Figure 6 DETR model validation report metrics during training.

We trained the model for 40 epochs, with the best F1-score at epoch 34. We optimized the prediction confidence threshold by evaluating the best model on the validation set. The model was trained with a threshold of 0.1, and we tested thresholds from 0.01 to 0.50. We found the optimal confidence threshold to be 0.01, which achieved the best final test set results.

Fracture detection in X-rays is challenging because the features are subtle and hard to detect. Missed diagnoses are a critical issue, so this model would be applied as a high-sensitivity assistant for radiologists. The final model achieved 99% fracture recall, meaning that it flagged almost every patient with a fracture. Its bounding box recall was 61%, meaning that it successfully located the fracture region in 61% of cases. This supports the idea that transformer-based models like DETR can learn global context to identify subtle medical anomalies like fractures. With 99% sensitivity and 61% localization accuracy, the model is better suited as a screening assistant than as an automatic diagnostic tool.

The model has not trained long enough to gain high confidence in its predictions. For example, Figure 7 shows a sample image with its predicted bounding box. Although the model captured the location of the fracture, it did so with only 6% confidence. To further combat overfitting in the future, methods to explore include more advanced data augmentation, regularization, K-fold cross-validation, or increasing the subset size for greater variability.
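The confidence threshold search and the two recall figures discussed above can be illustrated with a small sketch. It is not the project's evaluation code: the `Sample` container, the one-ground-truth-box-per-image simplification, the IoU cutoff of 0.5 for counting a fracture as located, and the rule of ranking thresholds by recall first and F1-score second are all assumptions made for illustration, and the toy data is invented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Sample:
    """Hypothetical container for one validation image."""
    gt_box: Optional[Box]                 # None when the image has no fracture
    predictions: List[Tuple[float, Box]]  # (confidence, predicted box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def evaluate(samples: List[Sample], threshold: float,
             iou_threshold: float = 0.5) -> Dict[str, float]:
    """Image-level recall, precision, F1, and box recall at one threshold."""
    tp = fp = fn = located = 0
    for s in samples:
        # Keep only predictions at or above the confidence threshold.
        kept = [box for conf, box in s.predictions if conf >= threshold]
        if s.gt_box is not None:
            if kept:
                tp += 1  # image correctly flagged as containing a fracture
                if any(iou(box, s.gt_box) >= iou_threshold for box in kept):
                    located += 1  # a kept box overlaps the true region
            else:
                fn += 1  # fracture missed entirely
        elif kept:
            fp += 1  # healthy image flagged
    positives = tp + fn
    recall = tp / positives if positives else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    box_recall = located / positives if positives else 0.0
    return {"threshold": threshold, "recall": recall, "precision": precision,
            "f1": f1, "box_recall": box_recall}


def sweep(samples: List[Sample], thresholds: List[float]) -> Dict[str, float]:
    """Pick the threshold with the highest recall, breaking ties by F1 (assumed rule)."""
    results = [evaluate(samples, t) for t in thresholds]
    return max(results, key=lambda r: (r["recall"], r["f1"]))


# Invented toy data: two fracture images and two healthy images.
toy_samples = [
    Sample(gt_box=(10, 10, 50, 50), predictions=[(0.06, (12, 12, 48, 48))]),
    Sample(gt_box=(0, 0, 20, 20), predictions=[(0.30, (100, 100, 120, 120))]),
    Sample(gt_box=None, predictions=[(0.02, (5, 5, 15, 15))]),
    Sample(gt_box=None, predictions=[]),
]

best = sweep(toy_samples, [0.01, 0.05, 0.10, 0.50])
print(best)
```

In the toy data, the first image mirrors the low-confidence case described in the text: the box is well placed, but a threshold of 0.10 would discard it and the fracture would count as missed. Lowering the threshold recovers recall at the cost of admitting more false positives.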
Figure 7 Sample ground truth (left) with the best model’s predicted bounding box (right).

5.4 Model Results Comparison

Here we compare the final test set results for all three models, using the classification report metrics of fracture recall, precision, and F1-score, as well as the detection metric of bounding box recall (box accuracy). These results are outlined in Table 1. The Simple CNN missed too many fractures and had the lowest recall. The Faster R-CNN provided balanced performance with the highest precision and F1-score, but only 58% recall. In a real-world clinical setting, this equates to missing roughly 4 out of 10 fractures. On the other hand, DETR achieved almost perfect fracture recall with higher box accuracy, but at the cost of precision. This means that the DETR model is highly sensitive and produces many more false positives than Faster R-CNN. When it comes to choosing the best model, the trade-offs need to be weighed. Would you want a model that generates