AAI_2025_Capstone_Chronicles_Combined

validation performance continued to improve throughout, which justified the extended fine-tuning.

After applying per-class threshold tuning (based on validation-set F1 scores), the model achieved a macro-averaged F1 of 0.43, a micro-averaged F1 of 0.47, and a micro-averaged recall of 0.70 on the final test set, substantially outperforming the untuned baseline (macro-F1 near 0.23). These metrics, shown in Figure 8, suggest the model learned clinically meaningful patterns while maintaining generalization.
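The per-class threshold tuning described above can be sketched as a grid search that, for each class independently, picks the cutoff maximizing validation F1. This is an illustrative sketch, not the project's actual code; the function names and the 0.05-step grid are assumptions:

```python
import numpy as np

def f1(y_true, y_pred):
    """Binary F1 for one class from 0/1 arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def tune_thresholds(y_val, probs, grid=np.arange(0.05, 0.95, 0.05)):
    """For each class, choose the threshold with the best validation F1.

    y_val, probs: (samples x classes) arrays of labels and predicted
    probabilities. Returns one threshold per class.
    """
    n_classes = y_val.shape[1]
    best = np.empty(n_classes)
    for c in range(n_classes):
        scores = [f1(y_val[:, c], (probs[:, c] >= t).astype(int))
                  for t in grid]
        best[c] = grid[int(np.argmax(scores))]  # first threshold at the max
    return best
```

Because each class is tuned independently, rare classes can receive a much lower cutoff than common ones, which is how recall on minority labels is traded against precision.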

Fig. 5 EfficientNetB0 fine-tuning results

Performance was strongest for categories with larger class support:

●​ Fluid Related Issues: F1 = 0.60 at threshold 0.40
●​ Lung Structure Issues: F1 = 0.57 at threshold 0.40
●​ Infection/Infiltration: F1 = 0.46 at threshold 0.35
●​ No Finding: F1 = 0.56 at threshold 0.40
●​ Nodule/Mass: F1 = 0.42 at threshold 0.35
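The gap between the macro (0.43) and micro (0.47) scores reflects how the two averages treat class support: macro F1 weights every class equally, so low-support classes like Hernia pull it down, while micro F1 pools counts across classes and is dominated by the high-support categories above. A minimal NumPy sketch of the two averages (not the project's evaluation code, which would more likely use a library such as scikit-learn):

```python
import numpy as np

def macro_micro_f1(y_true, y_pred):
    """Macro and micro F1 over multi-label (samples x classes) 0/1 arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)
    # Macro: average per-class F1, so rare classes count as much as common ones.
    denom = 2 * tp + fp + fn
    per_class = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    macro = float(per_class.mean())
    # Micro: pool the counts first, so high-support classes dominate.
    TP, FP, FN = tp.sum(), fp.sum(), fn.sum()
    micro = 2 * TP / (2 * TP + FP + FN) if (2 * TP + FP + FN) else 0.0
    return macro, float(micro)
```

In a toy case where one well-predicted class has triple the support of a poorly predicted one, micro F1 exceeds macro F1, mirroring the pattern in the results above.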

While underrepresented conditions like Hernia yielded very low precision (0.01), threshold tuning helped raise recall to 0.57. This tradeoff reflects our deliberate design choice to

