AAI_2025_Capstone_Chronicles_Combined


examined along two axes: the learning dynamics during training and the quality of segmentation and volumetric predictions on unseen cases. The learning curves indicate that the network did learn structure from the data. Training and validation losses both decreased over time, with the training loss exhibiting a smooth monotonic decline and the validation loss following a similar trend with greater variability from epoch to epoch. The most favorable validation loss (–0.2631) occurred at epoch 230. At this point, the corresponding single-epoch validation pseudo Dice reached 0.4086, which was the highest observed value. When the same metric was smoothed with an exponential moving average, the curve rose more gradually and reached 0.3166 near the end of training (Figure 5.1). Over the final 20 epochs, the validation pseudo Dice values had a mean of approximately 0.30 and a median of 0.31, suggesting that the underlying level of performance was improving, even though individual epochs remained noisy.
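The exponential moving average used to smooth the noisy per-epoch validation pseudo Dice can be sketched as follows. This is an illustrative implementation, not the project's actual training code; the smoothing weight of 0.1 on the new value and the synthetic data are assumptions chosen only to demonstrate the effect.

```python
import numpy as np

def ema(values, new_weight=0.1):
    """Exponential moving average: s_t = (1 - w) * s_{t-1} + w * x_t,
    initialized with the first observation."""
    smoothed = np.empty(len(values))
    s = values[0]
    for i, x in enumerate(values):
        s = (1 - new_weight) * s + new_weight * x
        smoothed[i] = s
    return smoothed

# Illustrative noisy per-epoch pseudo Dice values (not the run's actual data).
rng = np.random.default_rng(42)
raw = np.clip(0.30 + 0.05 * rng.standard_normal(263), 0.0, 1.0)
smoothed = ema(raw)
```

Because the smoothed curve weights history heavily, it lags the raw metric and rises gradually, which is consistent with the EMA ending near 0.3166 while the best single-epoch pseudo Dice reached 0.4086.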

Figure 5.1 Training and validation loss, with the corresponding pseudo Dice and its exponential moving average.

From a systems perspective, the training run behaved predictably. Early epochs required roughly 612–726 seconds each, stabilizing to approximately 437 seconds per epoch as data loading and caching effects settled. The learning rate followed the expected polynomial decay schedule, falling from about 0.0096 around epoch 50 to approximately 0.0076 by epoch 263. There was no evidence of loss explosions, mode collapse, or other pathological behavior. Taken together, these curves suggest that, at the point training was stopped for compute and time reasons, the model remained in an under-trained regime: it was still improving but had not fully converged.

The picture is less favorable on the held-out test set. Dice coefficients for the 20 test cases were low and widely dispersed, with a mean of about 0.16, a median of 0.09, and a maximum of 0.55; many cases had Dice values close to zero. Jaccard indices followed the same pattern, with a mean of roughly 0.10 and a median of 0.05. Because the central goal of this project is tumor volumetry, these segmentation errors are best interpreted through their impact on volume estimates. Ground-truth tumor volumes in the test set ranged from approximately 7 mL to 601 mL, with a median around 50 mL; predicted volumes ranged from approximately 22 mL to 430 mL, with a median around 111 mL. On average, the model produced larger volumes than the RTSTRUCT-defined ground truth: the mean absolute difference between predicted and true volume was approximately 95 mL, with a median of about 42 mL. Relative volume errors were substantial, with a mean of about 206%, a median of 131%, a 25th percentile of approximately 12%, and a 75th percentile of roughly 269%. The distribution of these relative errors, shown in Figure 5.2, is clearly skewed toward positive values.
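The overlap and volume metrics reported here can be computed from binary masks along the following lines. This is a minimal sketch: the mask shapes and the 2 mm isotropic voxel spacing are hypothetical values for illustration, not properties of the project's data.

```python
import numpy as np

def dice_jaccard(pred, gt):
    """Dice and Jaccard (IoU) for two binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    dice = 2.0 * inter / denom if denom else 1.0
    jacc = inter / union if union else 1.0
    return dice, jacc

def volume_ml(mask, spacing_mm):
    """Mask volume in millilitres given voxel spacing in mm (1 mL = 1000 mm^3)."""
    voxel_mm3 = float(np.prod(spacing_mm))
    return mask.sum() * voxel_mm3 / 1000.0

# Toy example: two 20-voxel cubes offset by 2 voxels, 2 mm isotropic spacing.
gt = np.zeros((40, 40, 40), dtype=bool)
gt[10:30, 10:30, 10:30] = True
pred = np.zeros_like(gt)
pred[12:32, 12:32, 12:32] = True

d, j = dice_jaccard(pred, gt)
v_pred, v_gt = volume_ml(pred, (2, 2, 2)), volume_ml(gt, (2, 2, 2))
abs_err = abs(v_pred - v_gt)
rel_err = abs_err / v_gt
```

In this toy case the predicted and ground-truth volumes agree exactly even though Dice is only about 0.73, which is exactly why volumetric error is reported separately from overlap: a model can be spatially misaligned yet volumetrically accurate, or vice versa.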

