AAI_2025_Capstone_Chronicles_Combined
detection. The model we selected has a ResNet-50 convolutional backbone followed by a transformer encoder-decoder architecture. We used a pre-trained version from the Hugging Face transformers library (Carion et al., 2020). Our images were preprocessed to normalize pixel intensities, producing 1-channel grayscale images. Because the pre-trained ResNet expects 3-channel RGB input, we adapted the input pipeline to duplicate the grayscale channel three times. Since the CNN backbone was originally trained on the COCO dataset with 91 classes, we overrode the number of target classes to 2 (fracture or no-fracture). To reduce training time and encourage learning in the later encoder layers for detection, the weights of the early backbone CNN layers were frozen. A batch size of 100 was used to further reduce training time. The optimizer was AdamW, with a learning rate of 1e-4 for the transformer parameters and a weight decay of 1e-3. Mixed-precision training with a GradScaler was also applied for faster, more memory-efficient training (Micikevicius et al., 2018). We fine-tuned the model over 40 epochs, saving the best model based on the fracture F1-score evaluated on the validation set, with an early stopping patience of 20 epochs on the same metric. After training, we evaluated the best model on the test set. For optimization, we tuned the prediction confidence threshold on the validation set and then evaluated on the test set using the resulting optimal threshold. Results are discussed in the following section.

4.3 Faster R-CNN

Our Faster R-CNN implementation follows the standard two-stage detection pipeline introduced by Ren et al. (2015). This architecture is well suited for detecting small or irregular targets (common patterns in cervical fractures)
because its proposal-based pipeline often outperforms single-stage detectors on fine-grained medical abnormalities (Paik et al., 2024; Zhao et al., 2022). We use a ResNet-50 backbone with a Feature Pyramid Network (FPN), a configuration shown to improve multi-scale medical image detection while maintaining computational efficiency (Le et al., 2020). The Region Proposal Network (RPN) produces candidate regions, and the second-stage head classifies proposals and refines bounding boxes. We use PyTorch’s default anchor scales and aspect ratios, along with the built-in RPN and ROI heads, to preserve the canonical design of Faster R-CNN and reduce the risk of training instability. Training is conducted with a batch size of 8 over an initial 10-epoch schedule. Optimization follows the standard Faster R-CNN multi-task objective (RPN objectness, RPN regression, ROI classification, and Smooth L1 ROI regression), with losses computed directly through PyTorch. We use AdamW (learning rate 1e-4; weight decay 1e-4), a configuration commonly adopted in medical imaging tasks for improved convergence stability (Zhao et al., 2022). Evaluation relies on validation loss and custom precision/recall at IoU ≥ 0.5, with mAP@0.5 and mAP@0.75 planned for later stages. These evaluation procedures mirror prior CT-based fracture detection workflows (Paik et al., 2024; Lin et al., 2023). A critical methodological requirement for Faster R-CNN is proper dataset formatting. Unlike architectures that predict normalized bounding boxes, Faster R-CNN requires absolute [x1, y1, x2, y2] coordinates, scaled to the resized image and packaged into target dictionaries containing labels, area, iscrowd, and image_id fields (Ren et al., 2015). Misalignment in these
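The target-dictionary packaging described above can be sketched as a small helper. This is a minimal illustration, not our full data pipeline; the function name and inputs are hypothetical, but the dictionary fields and tensor dtypes match what torchvision's Faster R-CNN expects per image.

```python
import torch


def make_target(boxes_xyxy, labels, image_id):
    """Package absolute [x1, y1, x2, y2] boxes (already scaled to the
    resized image) into the per-image target dict used by torchvision's
    Faster R-CNN. `labels` holds integer class IDs (0 is background)."""
    boxes = torch.as_tensor(boxes_xyxy, dtype=torch.float32)
    # Box area in pixels, derived from the corner coordinates.
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return {
        "boxes": boxes,
        "labels": torch.as_tensor(labels, dtype=torch.int64),
        "image_id": torch.tensor([image_id]),
        "area": area,
        # No crowd annotations in this dataset.
        "iscrowd": torch.zeros(len(boxes), dtype=torch.int64),
    }
```

A fracture box such as `[10, 20, 30, 60]` with label `1` would then be passed, alongside its image tensor, as one element of the targets list given to the model during training.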