AAI_2025_Capstone_Chronicles_Combined
Reliability diagram analysis shows that the elevated ECE scores on the out-location dataset are driven primarily by miscalibration at higher confidence levels. Figure 14 shows substantial overconfidence in both uncalibrated models (red curves), with Model 2 slightly better calibrated at baseline (its red curve lies closer to the diagonal). TvA histogram binning markedly improved Model 2's reliability in the higher confidence bins, whereas Model 1 showed minimal response to calibration. Although the two models achieve comparable classification accuracy and baseline calibration on both the in-distribution and out-of-distribution datasets, Model 2 responds better to TvA histogram binning. We therefore selected Model 2 for deployment in our active learning framework.
We simulated the deployment of Model 2 with periodic performance monitoring at 3-month, 6-month, and 12-month intervals on the remaining in-location and out-location datasets. At each time point, we generated predictions with the most recently fine-tuned model and then applied TvA histogram binning to the maximum softmax probabilities. This calibration method preserves the original predicted labels while improving confidence reliability (Le Coz et al., 2024). For active learning sample selection, we flagged predictions as uncertain, and therefore requiring human annotation, using a confidence threshold of 0.70 (Chen et al., 2023). In our experimental setup, the ground truth labels of these uncertain samples stood in for human annotation. Model fine-tuning was performed by freezing all backbone parameters except the final residual block (layer4) of the ResNet18 component, a common approach that preserves low-level feature representations while allowing higher-level features to adapt to the target domain (Yosinski et al., 2014). The calibration dataset for each subsequent interval incorporated both the newly annotated uncertain samples and the calibration data from the previous time period.
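One monitoring step of the procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study. It reads TvA as a top-versus-all formulation, in which the calibrator is fit on the maximum softmax probability against whether the top-1 prediction was correct. The bin count of 10, the midpoint fallback for empty bins, the strict "below 0.70" comparison applied to calibrated confidences, and all function names are hypothetical choices made for the example.

```python
def bin_index(confidence, n_bins):
    """Map a confidence in [0, 1] to an equal-width bin index."""
    return min(int(confidence * n_bins), n_bins - 1)


def fit_histogram_binning(confidences, correct, n_bins=10):
    """Fit one value per bin: the observed top-1 accuracy inside that bin."""
    totals = [0] * n_bins
    hits = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        b = bin_index(conf, n_bins)
        totals[b] += 1
        hits[b] += int(ok)
    # Assumption: a bin with no calibration samples falls back to its midpoint.
    return [hits[b] / totals[b] if totals[b] else (b + 0.5) / n_bins
            for b in range(n_bins)]


def calibrate(confidences, bin_values):
    """Replace each confidence by its bin's fitted value; labels are untouched."""
    return [bin_values[bin_index(c, len(bin_values))] for c in confidences]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    totals = [0] * n_bins
    hits = [0] * n_bins
    conf_sums = [0.0] * n_bins
    for conf, ok in zip(confidences, correct):
        b = bin_index(conf, n_bins)
        totals[b] += 1
        hits[b] += int(ok)
        conf_sums[b] += conf
    n = len(confidences)
    return sum(totals[b] / n * abs(hits[b] / totals[b] - conf_sums[b] / totals[b])
               for b in range(n_bins) if totals[b])


def select_uncertain(calibrated, threshold=0.70):
    """Indices of predictions whose calibrated confidence is below the threshold."""
    return [i for i, c in enumerate(calibrated) if c < threshold]


def extend_calibration_set(cal_conf, cal_correct, new_conf, new_correct, selected):
    """Carry the previous calibration data forward and add the annotated samples."""
    return (cal_conf + [new_conf[i] for i in selected],
            cal_correct + [new_correct[i] for i in selected])


# Toy data: max softmax probabilities and whether the top-1 label was correct.
cal_conf = [0.95, 0.92, 0.97, 0.91, 0.65, 0.62]
cal_correct = [1, 1, 0, 0, 1, 0]
bin_values = fit_histogram_binning(cal_conf, cal_correct)

# New interval: calibrate, flag uncertain samples, grow the calibration set.
new_conf = [0.96, 0.85, 0.66]
new_correct = [0, 1, 1]  # ground truth stands in for human annotation
calibrated = calibrate(new_conf, bin_values)
selected = select_uncertain(calibrated)
cal_conf, cal_correct = extend_calibration_set(
    cal_conf, cal_correct, new_conf, new_correct, selected)
```

Because the calibrator only rescales the confidence attached to the top-1 prediction, the predicted label itself never changes, which is the property the text attributes to this method. The sketch stores raw confidences in the calibration set for simplicity. In a real pipeline they would presumably be recomputed after each fine-tuning round, since the model that produced them has changed.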