AAI_2025_Capstone_Chronicles_Combined
classification in a single step. YOLOv11 incorporates advanced feature processing that allows it to preserve fine-grained details (e.g., hairline fractures). It can also generate a pixel-level mask of the fracture, which supports highly accurate predictions. Beyond spinal fracture detection, the YOLO family has been applied to tasks such as smart city traffic management (Darabi, 2024), construction site safety, and advanced drone technology (Bhattacharya & Nowak, 2025). The Detection Transformer (DETR), introduced by Carion et al. (2020), brought a new approach to object detection. Unlike traditional CNN-based detectors, DETR uses a Transformer encoder-decoder architecture, which simplifies the detection pipeline by producing final predictions (bounding box and class) in a single pass (Carion et al., 2020; Yu et al., 2025). While the original DETR struggled with slow convergence and small-object detection, later variants such as Deformable DETR (Zhu et al., 2021) address these issues with deformable attention modules that attend to a small set of key sampling points. In clinical settings, DETR-style models have shown strong potential for tasks such as lesion localization and multi-organ segmentation in high-resolution CT and MRI scans (Carion et al., 2020; Yu et al., 2025). Open-source pre-trained models such as Facebook's DETR use a CNN backbone, typically ResNet-50 (Carion et al., 2020), trained on the Common Objects in Context (COCO) dataset (Lin et al., 2014); they have therefore already learned rich visual feature representations from hundreds of thousands of annotated images. This pre-trained knowledge can be leveraged through transfer learning, where the model is fine-tuned on a new task such as cervical spine fracture detection. This approach would allow us to achieve high performance with a smaller
medical dataset compared to training a large transformer model from scratch (He et al., 2019).

4 Methodology

In preparation for model implementation, we created a data loading pipeline that each of our models follows. This includes loading the same curated data splits for carefully balanced train, validation, and test sets. Out of our total sample of 28,868 images, we created a split of 70% train (20,207 images), 15% validation (4,330 images), and 15% test (4,331 images). The models we chose to explore are a baseline CNN built from scratch, a DETR model, and a Faster R-CNN model.

4.1 CNN

To establish a clear starting point for the cervical spine fracture detection task, we implemented a simple convolutional neural network. The model takes grayscale CT images that are resized to 256 by 256 pixels. These images pass through three convolutional layers, each followed by a rectified linear activation and a max pooling step. Together, these layers allow the model to learn important visual features such as edges, bone contours, and changes in texture that may indicate a fracture. As the image moves deeper through the network, the extracted features become more abstract and informative. After this feature extraction stage, the output is flattened and passed into two fully connected layers, with a dropout layer included to reduce overfitting. The final layer produces two values that correspond to the model's confidence in predicting whether the image represents a normal cervical spine or one with a fracture.

4.2 DETR

We implement a pre-trained Detection Transformer (DETR) model for fracture
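As a hedged illustration of the pre-trained setup described above, the COCO-trained DETR checkpoint can be loaded through the Hugging Face transformers library; the paper does not specify its loading mechanism, so this is a sketch under that assumption. Swapping the COCO classification head for a two-class head (fracture vs. normal) prepares the model for fine-tuning on the curated splits:

```python
from transformers import DetrForObjectDetection, DetrImageProcessor

# Load the COCO-pre-trained DETR with a ResNet-50 backbone.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")

# Re-initialize the classification head for a two-class task
# (fracture vs. normal); ignore_mismatched_sizes discards the
# original 91-class COCO head weights so a new head can be trained.
model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50",
    num_labels=2,
    ignore_mismatched_sizes=True,
)
```

The backbone and Transformer weights retain their COCO-learned features, so only the new head (and, optionally, the later layers) needs substantial updating during fine-tuning, which is what makes the smaller medical dataset workable.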