AAI_2025_Capstone_Chronicles_Combined

Capstone Chronicles
2025 Selections
MS-Applied Artificial Intelligence
University of San Diego

Image generated with OpenAI's DALL·E, facilitated by ChatGPT.

Dear Reader,

It is my great pleasure to welcome you to the 2025 edition of Capstone Chronicles, our second annual publication showcasing exemplary projects from the MS in Applied Artificial Intelligence program at the University of San Diego. This volume reflects not only another year of exceptional student achievement, but also the continued evolution of a program dedicated to preparing AI professionals for meaningful, responsible, and impactful work.

The University of San Diego’s online Master of Science in Applied Artificial Intelligence is committed to training current and future leaders in this transformative field. Our program places strong emphasis on real-world applications, ethical responsibility, and the pursuit of social good in the design and deployment of AI-enabled systems. Developed by AI experts in close collaboration with industry and government stakeholders, the curriculum provides rigorous technical preparation grounded in practical implementation. Each graduating cohort represented in this edition—Spring, Summer, and Fall 2025—includes approximately 30-35 students.

In our Capstone course, students synthesize the knowledge and skills acquired throughout the program to design and build AI-enabled systems that address real-world problems. Projects may be completed individually or in teams of up to four, offering flexibility while maintaining the high standards of collaboration, innovation, and accountability that define the Capstone experience. Throughout the course, students identify a meaningful problem, develop a formal project proposal, implement a technical solution, and rigorously evaluate their results. Every project must demonstrate original work, including data identification and preparation, thoughtful selection of tools and algorithms, and the development and training of at least one neural network or deep learning model from scratch using frameworks such as PyTorch, Keras, or TensorFlow.
Beyond technical excellence, students are expected to integrate ethical, moral, and social considerations directly into their design process—an essential hallmark of our program.

We extend our sincere appreciation to the students whose hard work and dedication fill these pages, as well as to the faculty mentors who guide them and the industry partners who help ensure our curriculum remains relevant and forward-looking. Thank you for joining us in celebrating the achievements of our 2025 graduates. We hope this edition of Capstone Chronicles inspires current and future students, collaborators, and leaders in artificial intelligence.

Sincerely, The 2025 Capstone Chronicles Editorial Team


Anna Marbut

Ebrahim Tarshizi

This letter was composed with the assistance of OpenAI’s ChatGPT.


Table of Contents

Spring 2025

NIH Chest X-rays Classifier with Deep Learning .......... 6
Daniel Arday, Will Kencel, Ksenia Kold

Draw, Detect, Navigate: Transforming Doodles into Actionable Navigation Plans and Beyond .......... 28
Elan Wilkinson, Parker Christenson, Gabriel Emanuel Colón, Dominic Fanucchi

ResolveAI: AI-Driven IT Support Ticket Resolution .......... 49
Arin DeLoatch, Kenneth DeVoe, Jabali Shah

Evaluating Deep Learning Model Convergence in Chess via Nash Equilibria .......... 74
Philip Felizarta

GAN and CNN Models to Improve X-Ray Diagnostic Accuracy .......... 94
Ned Kost, Pawan Tahiliani, Kim Vierczhalek

TurbaNet: Efficient Parallel Training of Lightweight Neural Networks .......... 125
Ethan Schmitt

Summer 2025

Rapid ICU Mortality Prediction .......... 147
Jeevan Gullinkala, Laxmi Sulakshana Rapolu, Subhabrata Ganguli

Cinema Analytics and Prediction System .......... 168
Rene Ortiz, Seema Mittal

Surveying the Landscape of Mental Health: Machine Learning for Early Risk Detection .......... 201
Prema Mallikarjunan, Aaron Ramirez, and Outhai Xayavongsa

Unsupervised Learning in Education: Deep Embedded Clustering for Student-Chatbot Conversation Analysis .......... 225
Vivian Perng, Brett Payton, and Douglas Code

Breast Tumor Classification Using Quantum Neural Networks .......... 254
Matt Purkeypile

WildScan: A Semi-Automated AI Pipeline for Wildlife Detection, Classification, and Continuous Learning .......... 279
Tyler Clinscales, Geoffrey Fadera, Edwin Merchan

Fall 2025

Cervical Spine Fracture Detection Using Computer Vision .......... 304
Andy Malinsky, Christopher Alleyne, Devin Eror, Jory Hamilton


Automated Triage of Disaster Communications: Leveraging NLP for Real-Time Emergency Message Categorization .......... 317
Gurleen Virk, Victor Hsu

SoundSearch: A Machine Learning System for Timbre Based Audio Retrieval .......... 333
Kevin Pooler

Deepfake Detection with Convolutional Models and Vision Transformers .......... 358
Priscilla Marquez

BitePulse AI: Real-Time Eating-Pace Feedback from Meal Video .......... 383
Aktham Almomani

Enhancing Lung Tumor Volume Accuracy on CT: A 3D Deep Segmentation and Reconstruction Pipeline with Clinical RTSTRUCT Integration .......... 398
Laurentius von Liechti


Spring 2025

Image generated with OpenAI's DALL·E, facilitated by ChatGPT.

​ NIH Chest X-rays Classifier with Deep Learning

Daniel Arday, Will Kencel, Ksenia Kold
Shiley-Marcos School of Engineering, University of San Diego
AAI 590: Capstone Project
Mar 31, 2025


Introduction

Chest radiography is one of the most widely used tools in modern medicine for screening thoracic conditions. In this project, we developed two deep learning–based classifiers for NIH chest X-rays with the goal of assisting clinicians in identifying common lung diseases more efficiently.

Our primary research question was whether a convolutional neural network (CNN), trained on the NIH Chest X-ray dataset, could accurately detect multiple thoracic pathologies in new, unseen images. A successful system would need to deliver two core capabilities: (1) consistent, explainable classification of thoracic pathologies, and (2) visual tools that help clinicians interpret model predictions. We hypothesized that, with sufficient data and careful preprocessing, CNN-based models could reach or even exceed benchmark performance on the NIH dataset for multi-label classification.

The intended end users of this AI system include radiologists, medical specialists, and healthcare administrators. In a production setting, this model would process chest X-ray images from clinical workflows, return probabilistic predictions for each condition, and integrate with Picture Archiving and Communication Systems (PACS) or other imaging infrastructure.

We ultimately built, tested, and compared two distinct model architectures. The first was a hybrid custom CNN designed to incorporate both image and tabular data. The second was built around EfficientNet, a state-of-the-art CNN pretrained on ImageNet (Tan & Le, 2019). Rather than
attempting to automate diagnosis, our intent was to design a recall-focused support tool that flags potentially important findings for clinical review. Because missed diagnoses can have serious consequences, especially in fast-paced or resource-limited environments, we tuned the models to be deliberately overinclusive, prioritizing sensitivity over precision. We believe this approach allows the system to act as a second set of eyes, helping healthcare professionals avoid overlooking subtle or ambiguous abnormalities.

Dataset Summary

Our dataset is sourced from the NIH ChestX-ray repository, which contains over 100,000 frontal-view chest radiographs labeled with up to 14 distinct thoracic pathologies (Wang et al., 2017). Each image is associated with patient metadata including patient ID, age, gender, and original image dimensions. The labels were generated through automated keyword extraction from radiology reports, which introduced some known noise into the dataset. The 14 diagnostic labels are Atelectasis, Cardiomegaly, Consolidation, Edema, Effusion, Emphysema, Fibrosis, Hernia, Infiltration, Mass, Nodule, Pleural Thickening, Pneumonia, and Pneumothorax. The dataset also includes a 15th label, “No Finding,” used to indicate the absence of all other conditions.

After inspecting the label distribution and performing exploratory data analysis (see Fig. 1–3), we identified major class imbalance as a significant challenge. “No Finding” accounted for nearly 40% of the dataset, while rarer conditions such as “Hernia” had fewer than 200 labeled images. To address this, we downsampled “No Finding” to 10,000 images to reduce its overwhelming influence on the model and mitigate the class imbalance. We also grouped the original 15 labels into 7 broader diagnostic categories based on clinical similarity. This label
generalization helped reduce sparsity, balance class frequencies, and improve learnability. The categories are as follows:

● Cardiac Issues: Cardiomegaly
● Fluid-Related Issues: Edema, Effusion, Pleural Thickening
● Hernia: Hernia
● Infection/Infiltration: Pneumonia, Consolidation, Infiltration
● Lung Structure Issues: Atelectasis, Pneumothorax, Fibrosis, Emphysema
● Nodule/Mass: Nodule, Mass
● No Finding: No Finding
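As an illustration, the grouping above can be captured with a small Python mapping that converts each image's original NIH labels into a 7-dimensional multi-hot target vector. The category ordering and function names here are our own, not taken from the project code:

```python
# Illustrative sketch of the 15-to-7 label grouping described above.
# The category order is an assumption for this example.
CATEGORIES = [
    "Cardiac Issues", "Fluid-Related Issues", "Hernia",
    "Infection/Infiltration", "Lung Structure Issues", "Nodule/Mass", "No Finding",
]

LABEL_TO_CATEGORY = {
    "Cardiomegaly": "Cardiac Issues",
    "Edema": "Fluid-Related Issues",
    "Effusion": "Fluid-Related Issues",
    "Pleural Thickening": "Fluid-Related Issues",
    "Hernia": "Hernia",
    "Pneumonia": "Infection/Infiltration",
    "Consolidation": "Infection/Infiltration",
    "Infiltration": "Infection/Infiltration",
    "Atelectasis": "Lung Structure Issues",
    "Pneumothorax": "Lung Structure Issues",
    "Fibrosis": "Lung Structure Issues",
    "Emphysema": "Lung Structure Issues",
    "Nodule": "Nodule/Mass",
    "Mass": "Nodule/Mass",
    "No Finding": "No Finding",
}

def to_multi_hot(finding_labels):
    """Map a list of original NIH labels to a 7-dim multi-hot vector."""
    vec = [0] * len(CATEGORIES)
    for label in finding_labels:
        vec[CATEGORIES.index(LABEL_TO_CATEGORY[label])] = 1
    return vec
```

For example, `to_multi_hot(["Effusion", "Atelectasis"])` sets both the Fluid-Related and Lung Structure bits, preserving the multi-label nature of the task.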

This 7-class mapping was informed by both clinical rationale and co-occurrence trends in the data. For example, conditions like effusion and pleural thickening often appear together and are related to fluid accumulation within the chest cavity, thereby justifying their grouping. Our co-occurrence matrix (Fig. 4) confirmed several label pairs that commonly overlap, underscoring the need for a multi-label prediction strategy. In terms of data quality, we found and removed thousands of exact or near-duplicate image entries to avoid inflating model performance. We also allowed for limited missingness in metadata fields like age or gender, since these variables were not critical for our image-based models. Demographic variables were primarily used in our hybrid CNN (see Experimental Methods), which could optionally incorporate tabular input.

We observed wide variability in image resolution, with most images far exceeding the 1024×1024 range (see Fig. 3). To ensure consistent training speed and adequate compute
resources, we tested different sizing configurations for the images and ultimately resized all the images to 1024×1024 pixels prior to training.

By standardizing our label space and curating a cleaner, more balanced dataset, we created more favorable learning conditions for both of our CNN models. These steps helped mitigate overfitting to dominant classes, improved recall on rare categories, and reduced noise introduced by uncertain or ambiguous label annotations.

Fig 1. Multilabel classifications (after reducing to 7 classes)


Fig 2. Least common labels

Fig 3. Image size distribution


Fig 4. Co-Occurrence Matrix

Finally, our correlation plots and disease frequency analyses revealed strong associations between certain pathologies; for example, pleural effusion frequently co-occurred with atelectasis. These patterns confirmed the need to treat the task as a multi-label classification problem, since many chest X-rays present with more than one condition. As a result, both of our CNN architectures were designed with a single output layer using sigmoid activation, enabling the model to predict multiple conditions in parallel. This approach was important because it reflects the clinical reality that different diseases often appear together, and it allows the system to flag co-occurring pathologies within a single inference, which a single-label classifier could not do.
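To make the distinction concrete, here is a minimal NumPy sketch (with invented logits) showing why independent sigmoid outputs can flag several conditions at once, while a softmax head forces the classes to compete for a single label:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative logits from a 7-unit output layer for one radiograph.
logits = np.array([-2.1, 1.3, -4.0, 0.2, 0.9, -1.5, -3.0])

# Sigmoid scores each class independently, so several can exceed threshold.
probs = sigmoid(logits)
flagged = probs >= 0.5  # per-class decision; thresholds can also be tuned per class

# A softmax head would normalize the scores to sum to 1, making classes
# mutually exclusive, which cannot represent co-occurring pathologies.
softmax = np.exp(logits) / np.exp(logits).sum()
```

With these logits, three classes clear the 0.5 sigmoid threshold simultaneously, which a softmax layer could never report.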


Background Information

Automating pathology detection in chest X-rays is a well-established focus within medical AI, with both academic and commercial groups actively developing deep learning–based classifiers for this purpose. Many of these efforts have leveraged the NIH ChestX-ray14 dataset as a benchmark, applying convolutional neural networks (CNNs) such as ResNet, DenseNet, CheXNet, and more recently EfficientNet. These studies consistently demonstrate that CNN-based models, particularly when fine-tuned with domain-specific techniques, can match or exceed radiologist-level accuracy on more common thoracic pathologies (Wang et al., 2017; Rajpurkar et al., 2018; Kufel et al., 2023).

CNNs are especially well-suited to this task because of their ability to hierarchically learn spatial features in medical images. Low-level layers extract general patterns like edges and textures, while deeper layers encode increasingly abstract features, such as the structure and density patterns seen in pathological lung tissue. When paired with transfer learning from models pretrained on ImageNet, CNNs can generalize effectively even when the training data has class imbalance or noisy labels, both of which are common in ChestX-ray14.

To explore the design space, we implemented two distinct model families: a custom hybrid CNN that accepts both image and tabular data, and a transfer learning–based model using EfficientNet. The hybrid CNN allowed us to incorporate demographic information (such as patient age or gender), which some studies have shown to improve classification in edge cases (Baltruschat et al., 2019). The EfficientNet model, on the other hand, builds on recent advances in CNN scaling. Its compound scaling strategy jointly optimizes depth, width, and resolution, which allows it to outperform deeper architectures like ResNet while maintaining faster inference
and fewer parameters. Studies like those by Kufel et al. (2023) and Nawaz et al. (2023) have shown that EfficientNet variants deliver competitive or superior results in multi-label chest X-ray classification tasks. Previous projects like CheXNet (Rajpurkar et al., 2018) and ChestNet (Pham et al., 2020) have demonstrated that DenseNet-121–based architectures can also achieve high AUCs (~0.80–0.82) for many thoracic conditions. Our work builds on these foundations by applying more modern architectures and by explicitly modeling multi-label correlations, recognizing that real-world chest radiographs often present multiple co-occurring pathologies. In addition, this dual-model approach allowed us to explore the trade-offs between handcrafted feature inclusion (in the hybrid model) and state-of-the-art image-based generalization (in EfficientNet). Together, these strategies reflect current best practices in academic and applied machine learning for radiographic classification.

Experimental Methods

This project implemented and compared two distinct machine learning pipelines for multi-label classification of chest pathologies. The first was a fine-tuned EfficientNetB0 model leveraging transfer learning. The second was a custom-built convolutional neural network (CNN) that incorporated both image and tabular metadata in a hybrid architecture. Both models were trained on the same curated version of the NIH Chest X-ray dataset, using the same initial preprocessing steps and seven consolidated diagnostic labels.


EfficientNet Model Training Methodology

The EfficientNet-based pipeline used the EfficientNetB0 architecture with pretrained ImageNet weights as the feature extractor. To adapt it for multi-label classification, we removed the original classification head and added a global average pooling layer, a 128-unit dense layer with ReLU activation, a dropout layer (rate = 0.3), and a final sigmoid-activated output layer with seven neurons, one for each consolidated pathology category. This design allowed the model to independently predict the presence of multiple conditions within the same chest radiograph. (Fig. A1 shows the full EfficientNet architecture with the added layers.)

All input images were resized to 1024×1024 to balance resolution fidelity with memory constraints. Though the original dataset contained variable image sizes (as high as 3000×3000 pixels), resizing enabled the use of smaller batch sizes and accelerated training while preserving key clinical features. Images were normalized and augmented with horizontal flips, contrast shifts, and brightness jitter to improve generalizability. The data was split into training (72%), validation (18%), and testing (10%) subsets using stratified sampling, and labels were encoded as binary vectors.

The model was trained using a phased approach over five total stages:

● Phase 1 (1 epoch): With the EfficientNet base frozen, we trained the classification head using binary cross-entropy loss with label smoothing (0.05) to prevent the model from getting stuck in a local minimum where it predicted all-zero outputs. Without this warm-up phase, we initially found the model would predict all zeros, so it was essential to our training procedure.
● Phases 2–5 (60 epochs total): All EfficientNet layers were unfrozen. We switched to focal loss (α = 0.8) to address class imbalance, with γ set to 2.0 initially and then reduced to 1.5 in later phases to soften the penalty on confident predictions and stabilize convergence.

We trained for 61 total epochs using the Adam optimizer with a learning rate of 2e-6 and a batch size of 8. Each phase used early stopping based on validation binary accuracy. Performance metrics included binary accuracy and recall, with validation scores improving well into the final epochs. During model optimization, we tuned the focal loss hyperparameters (γ and α), froze and unfroze layers progressively, and introduced per-class threshold calibration based on validation F1 scores. This threshold-tuning stage proved especially impactful for recall, allowing us to shift model sensitivity on underrepresented classes like hernia and cardiac conditions.

Hybrid CNN Model Training Methodology

We developed a custom hybrid CNN architecture capable of processing both image data and tabular metadata from the NIH Chest X-ray dataset. The tabular features, such as patient age and gender, were included based on clinical relevance, as these variables can meaningfully influence radiographic presentation. For example, women tend to have smaller lungs on average, which can complicate diagnostic interpretation. Our training process followed an iterative approach, experimenting with multiple model configurations and evaluation routines to identify a design that yielded the most reliable performance across all diagnostic categories.


Multi-Task vs. Single-Task Combination

After initial iterations of training a simple hybrid classifier, we recognized the challenge of class imbalance in the dataset. Despite generating class weights, the model underperformed on minority classes such as “Hernia,” likely because those classes lacked sufficient data representation. We therefore trained a model composed of seven single-task models to determine whether addressing class imbalance more robustly would improve performance. Each training task used data prepared specifically for that task, with class imbalance fully addressed. An important distinction is that the negative class did not mean “No Finding”; for a given task, the negative class comprised images belonging to any class other than the target class. All models shared the same architecture and were trained separately.

Figure A2 shows the hybrid classifier architecture. The model consisted of two branches. The image branch processed grayscale images through four convolutional blocks, each consisting of Conv2D layers with ReLU activation followed by batch normalization, max pooling, and dropout layers. We increased the number of filters progressively (32 → 64 → 128 → 256) to capture both low- and high-level features. A global average pooling layer reduced the spatial dimensions, followed by a fully connected dense layer with 128 units and L2 regularization to extract high-level features. We applied a dropout layer to prevent overfitting. The tabular branch consisted of two dense layers with 32 units each and ReLU activation, with batch normalization applied after the first layer to stabilize training. Finally, a fusion layer concatenated the outputs of the image and tabular branches. The fusion layer was followed by a dense layer with 128 units and dropout for joint feature learning. The final output layer used a sigmoid activation function to predict probabilities for the binary multi-label classification tasks.


The model was compiled with the Adam optimizer (learning rate 1e-4), binary cross-entropy loss, and metrics including accuracy, AUC, precision, and recall. This architecture was well suited to the task of medical diagnosis, with both image and tabular data contributing to predictions.
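As a rough sketch of the fusion idea, assuming illustrative dimensions matching the description above (a 128-unit image embedding and a 32-unit tabular embedding), the forward pass of the fusion head can be written in NumPy as follows. The weights here are random stand-ins, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in branch outputs: in the real model these come from the
# convolutional image branch (GAP + dense-128) and the dense tabular branch.
img_features = rng.standard_normal(128)  # image-branch embedding
tab_features = rng.standard_normal(32)   # tabular-branch embedding

# Fusion: concatenate, pass through a joint ReLU dense layer,
# then a 7-unit sigmoid output head for multi-label prediction.
fused = np.concatenate([img_features, tab_features])   # shape (160,)
W_joint = rng.standard_normal((128, 160)) * 0.05
joint = np.maximum(W_joint @ fused, 0.0)               # ReLU dense-128
W_out = rng.standard_normal((7, 128)) * 0.05
probs = 1.0 / (1.0 + np.exp(-(W_out @ joint)))         # 7 sigmoid outputs
```

The key design point is that fusion happens after each branch has produced its own embedding, so the joint layer can learn interactions between imaging features and demographics.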

Results/Conclusion

We evaluated the performance of both the EfficientNet-based model and the custom hybrid CNN using a validation set of chest radiographs labeled across seven consolidated diagnostic categories. Both approaches were designed to prioritize recall over precision, with the goal of minimizing false negatives in clinical decision support scenarios. This tradeoff reflects a core priority in medical imaging: missing a diagnosis, such as failing to flag a nodule or mass that could indicate early-stage lung cancer, can result in delayed intervention and significantly worsened outcomes. In contrast, a false positive is far less consequential, as it typically leads to additional review by a trained clinician rather than direct harm. As Oakden-Rayner (2020) explains, radiology AI systems should err on the side of caution by ensuring potentially abnormal cases are surfaced for review, even if some prove to be benign, because the clinical cost of a missed finding is far higher than the cost of an unnecessary follow-up. This stance also aligns with our ethical considerations for the system, since a human will always perform the final review before a patient diagnosis.

EfficientNet Results

The EfficientNetB0 model was trained over five progressive phases, beginning with a one-epoch warm-up using binary cross-entropy with label smoothing, followed by 60 epochs of fine-tuning using focal loss. Gamma was initially set to 2.0 and later reduced to 1.5 to ease the penalization of confident predictions. Despite the extended training cycle (61 total epochs),
validation performance continued to improve throughout, which justified the continued fine-tuning.
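For reference, a NumPy sketch of the binary focal loss in the form of Lin et al., using the α value quoted above; the example predictions are invented to show that the loss concentrates on hard examples and that lowering γ from 2.0 to 1.5 softens the down-weighting of confident predictions:

```python
import numpy as np

def binary_focal_loss(y_true, y_pred, alpha=0.8, gamma=2.0, eps=1e-7):
    """Binary focal loss: modulating factors (1-p)^gamma and p^gamma
    down-weight easy examples so hard (often minority-class) predictions
    dominate the loss."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    pos = -alpha * y_true * (1.0 - p) ** gamma * np.log(p)
    neg = -(1.0 - alpha) * (1.0 - y_true) * p ** gamma * np.log(1.0 - p)
    return float(np.mean(pos + neg))

y = np.array([1.0, 0.0])
easy = np.array([0.9, 0.1])  # already well-classified predictions
hard = np.array([0.3, 0.7])  # misclassified predictions

# Lowering gamma (2.0 -> 1.5) weakens the down-weighting of confident,
# correct predictions, matching the softening described in Phases 2-5.
```

With these toy values, the hard example incurs a much larger loss than the easy one, and the same easy example is penalized slightly more at γ = 1.5 than at γ = 2.0.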

After applying per-class threshold tuning (based on validation-set F1 scores), the model achieved a macro-averaged F1 score of 0.43, a micro-averaged F1 of 0.47, and a micro recall of 0.70 on the final test set, substantially outperforming the untuned baseline (which had a macro-F1 near 0.23). These metrics, shown in Figure 8, suggest the model learned clinically meaningful patterns while maintaining generalization.
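The per-class threshold calibration can be sketched as a simple grid search over validation F1; the helper names and toy data below are illustrative, not from the project code:

```python
import numpy as np

def f1(y_true, y_pred):
    """F1 score for one class from binary label and prediction arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def tune_threshold(y_true, y_prob, grid=np.round(np.arange(0.05, 0.95, 0.05), 2)):
    """Pick the decision threshold that maximizes validation F1 for one class."""
    scores = [f1(y_true, (y_prob >= t).astype(int)) for t in grid]
    return float(grid[int(np.argmax(scores))])

# Toy validation data for one underrepresented class: positives receive
# systematically low scores, so the best threshold falls below 0.5.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_prob = np.array([0.45, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05])
best = tune_threshold(y_true, y_prob)
```

This mirrors why the tuned thresholds reported below sit at 0.35 to 0.40 rather than the default 0.5: sweeping each class independently recovers recall on classes whose scores are compressed toward zero.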

Fig. 5 EfficientNetB0 Fine-Tuning results

Performance was strongest for categories with larger class support:

● Fluid-Related Issues: F1 = 0.60 at threshold 0.40
● Lung Structure Issues: F1 = 0.57 at threshold 0.40
● Infection/Infiltration: F1 = 0.46 at threshold 0.35
● No Finding: F1 = 0.56 at threshold 0.40
● Nodule/Mass: F1 = 0.42 at threshold 0.35

While underrepresented conditions like Hernia yielded very low precision (0.01), threshold tuning helped raise recall to 0.57. This tradeoff reflects our deliberate design choice to
prioritize sensitivity, especially in medical imaging tasks where catching all possible positives is more valuable than reducing false alarms.

Hybrid CNN Results

The hybrid CNN model, which fused grayscale chest X-ray images with tabular metadata (e.g., patient age and gender), was trained using both a multi-task architecture and a collection of seven single-task classifiers. Each model targeted one of the consolidated diagnostic categories, and training was guided by class-balanced data splits to mitigate label imbalance. AUC scores above 0.5 and steadily decreasing loss curves (see Figure 5) confirmed that the models were learning meaningful patterns rather than memorizing noise. The single-task classifier combination consistently outperformed the multi-task approach in terms of accuracy, recall, and overall stability. This outcome suggests that training independent models allowed each classifier to specialize more effectively in its respective task. In contrast, the multi-task model appeared to struggle with shared representation learning, showing slower and noisier convergence (see Figure 6).

Fig. 6 Lung Structure Issues classifier's loss


Final performance metrics (see Figure 7) reflected this distinction. The single-task models achieved strong recall values—often above 0.80—for well-represented categories like Fluid-Related Issues (F1 = 0.52), Infection/Infiltration (F1 = 0.43), and Lung Structure Issues (F1 = 0.40). For low-support classes like Cardiac Issues and Hernia, recall remained high, but precision dropped significantly (as low as 0.00), resulting in high false-positive rates. This tradeoff aligned with our recall-first objective: in clinical applications, it is often safer to raise false alarms than to miss a true pathology. One notable limitation of the hybrid model was its inability to enforce mutual exclusivity of the No Finding label, which occasionally co-occurred with other diagnoses despite being intended as a stand-alone class. Still, the system proved effective as a sensitive screening mechanism, surfacing even subtle or borderline findings for clinical review.

Fig. 7 Multitask classifier loss


Fig. 8 Classification report for Single Tasks Combination

Model Challenges and Unexpected Findings

During early experimentation, we encountered a surprising failure mode in which the EfficientNet model predicted all-zero outputs across most batches. This behavior persisted for multiple epochs and produced deceptively high binary accuracy (~81%): because of the imbalance in the label distribution, many X-rays had no findings or only one active label. We interpreted this as the model falling into a local minimum, optimizing too aggressively for the majority class by minimizing false positives. To overcome this, we introduced a warm-up phase using binary cross-entropy with label smoothing, which effectively nudged the model out of its lazy baseline. Once we transitioned to focal loss, the model began making confident, non-zero predictions that aligned more closely with true pathology distributions.
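The effect of label smoothing on the "lazy" all-zero predictor can be reproduced with a few lines of NumPy (values illustrative). Assuming smoothing is applied symmetrically as in Keras, targets move from 0/1 to 0.025/0.975 for smoothing = 0.05, so a model that confidently predicts near-zero everywhere pays a higher loss than it would against hard targets:

```python
import numpy as np

def smoothed_bce(y_true, y_pred, smoothing=0.05, eps=1e-7):
    """Binary cross-entropy with label smoothing (Keras convention:
    targets become y*(1-s) + 0.5*s, i.e. 0 -> 0.025 and 1 -> 0.975)."""
    y = y_true * (1.0 - smoothing) + 0.5 * smoothing
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

# One positive label out of seven, as in a typical sparse radiograph.
y_true = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
all_zero = np.full(7, 0.01)  # the "lazy" all-negative prediction

loss_lazy = smoothed_bce(y_true, all_zero)                 # smoothed targets
loss_hard = smoothed_bce(y_true, all_zero, smoothing=0.0)  # hard targets
```

Because the smoothed negative targets are 0.025 rather than 0, confident near-zero predictions are penalized more than under hard targets, which keeps a gradient pushing the model away from the all-zero minimum.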

Clinical Implications and Application

Both models—EfficientNet and the hybrid CNN—showed clear potential for clinical use as decision-support tools. Their ability to flag likely positive radiographs with high recall means they could function as triage assistants, helping radiologists prioritize studies more efficiently. In
high-throughput or under-resourced environments, where radiologist fatigue or time constraints increase the risk of overlooked abnormalities, these systems could serve as a valuable second read. While neither model was perfectly calibrated, both were effective at identifying pathologies warranting additional attention and could be integrated into PACS systems with human oversight. Importantly, false positives in this context are acceptable and expected, as they feed into an existing workflow of clinical verification rather than acting autonomously.

Future Work and Production Considerations

If we were to continue this project, our next steps would focus on model interpretability and broader generalization. For interpretability, integrating techniques like Grad-CAM could offer visual explanations for model predictions, making the system more transparent and trustworthy for clinical users. On the modeling side, we would consider ensemble techniques, combining EfficientNet with other backbone architectures or even fusing predictions with the hybrid CNN outputs. Additionally, augmenting the dataset with external sources or conducting domain adaptation for other hospital systems could improve robustness. To prepare the system for production, we would also need to conduct more extensive validation on unseen clinical data, assess latency and scalability constraints, and address ethical and legal compliance issues related to medical AI deployment.


Appendix A

Fig. A1 EfficientNet with added-layers Fine-Tuning Model Architecture


Fig. A2 Hybrid Classifier Model Architecture


References

Baltruschat, I. M., Nickisch, H., Grass, M., Knopp, T., & Saalbach, A. (2019). Comparison of deep learning approaches for multi-label chest X-ray classification. Scientific Reports, 9(1), 6381. https://doi.org/10.1038/s41598-019-42521-2

Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR).

Kufel, J., Bielówka, M., Rojek, M., Mitręga, A., Lewandowski, P., & Nawrat, Z. (2023). Multi-label classification of chest X-ray abnormalities using transfer learning techniques. Journal of Personalized Medicine, 13(10), 1426. https://doi.org/10.3390/jpm13101426

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.

Nawaz, M., Nazir, T., Baili, J., Khan, M. A., Kim, Y. J., & Cha, J. H. (2023). CXray-EffDet: Chest disease detection and classification from X-ray images using the EfficientDet model. Diagnostics, 13(2), 248. https://doi.org/10.3390/diagnostics13020248

Oakden-Rayner, L. (2020). Exploring large-scale public medical image datasets. Academic Radiology, 27(1), 106–112. https://doi.org/10.1016/j.acra.2019.09.013

Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., ... & Ng, A. Y. (2017). CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225.

Rajpurkar, P., Irvin, J., Ball, R. L., Zhu, K., Yang, B., Mehta, H., ... & Lungren, M. P. (2018). Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLOS Medicine, 15(11), e1002686. https://doi.org/10.1371/journal.pmed.1002686

Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML) (pp. 6105–6114).

Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., & Summers, R. M. (2017). ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3462–3471. https://doi.org/10.1109/CVPR.2017.369


Draw, Detect, Navigate:

Transforming Doodles into Actionable Navigation Plans and Beyond

Elan Wilkinson, Parker Christenson, Gabriel Emanuel Colón, Dominic Fanucchi

Shiley-Marcos School of Engineering, University of San Diego

AAI-590-02: Capstone Project

Roozbeh Sadeghian, Ph.D.

April 14, 2025


Abstract

This capstone project extends traditional sketch classification by incorporating real-time object detection and augmented reality (AR) integration. While convolutional neural networks (CNNs) have demonstrated high accuracy in identifying hand-drawn pictograms, most approaches omit spatial localization and real-time interaction. Our system addresses these gaps by using bounding box predictions and edge-optimized models capable of identifying multiple doodles in real time within an AR environment. To overcome limitations in existing datasets, we developed a synthetic data generation pipeline using Unity and Python, producing randomized, annotated images that mirror real-world drawing variability. We trained and evaluated both Faster R-CNN and YOLOv8 variants, ultimately selecting the YOLOv8 nano model for deployment due to its speed, size, and high accuracy. The model achieved an F1 score of 0.96 and processed images at 28 frames per second (FPS), enabling seamless AR integration. The final application uses a webcam feed and ArUco marker tracking to detect hand-drawn symbols, anchor 3D models to the drawings, and compute navigation routes using the A* algorithm. The system combines symbolic vision with spatial reasoning to support interactive use cases such as route planning, adversarial simulations, and strategic modeling. The model’s real-time performance on new inputs, along with its successful deployment through Unity, supports its use in live scenarios. These results show that sketch-based simulations can support quick decision making in settings where fast, visual input is needed.


Table of Contents

Abstract
Introduction
Background Information
Data Summary
Experimental Methods
Results
Conclusion
References


Introduction

The recognition of hand-drawn symbolic representations plays a pivotal role in domains such as data capture, note taking, architecture and design planning, document markup, wargaming, and strategy simulation. Translating abstract, low-fidelity pictograms into digital elements is typically done manually or through static classification models, and these approaches lack the spatial awareness and interactivity needed for dynamic computer vision applications. This project addresses that gap by developing a system capable of detecting, localizing, and classifying multiple hand-drawn symbols in real time.

This project builds on the success of classification models like Sketchnet (Zhang et al., 2016) and transformer-based architectures that use vectorized stroke data (Xu et al., 2022). While these models perform well at identifying individual symbols, they typically neither predict bounding boxes for the symbols within an image nor analyze the spatial relationships between symbols. Our hypothesis is that a lightweight object detection model, supported by synthetically generated data, can provide real-time sketch recognition with spatial awareness on readily available devices.

The intended users of this system are individuals or teams who would benefit from quick, visual input, such as emergency responders mapping navigation paths. The approach also opens new opportunities for interactive augmented reality (AR) applications, including AI-enabled pathfinding and adversarial simulations. The output of the model is meant to support quick decision making by interpreting doodles as functional elements in a digital environment.

This project used a subset of “doodles” from the publicly available Quick, Draw! dataset (Jongejan et al.), which contains millions of user-submitted doodles labeled by class. From this subset, we created synthetic data by combining multiple randomly placed doodles with bounding box annotations. When the application is deployed, live input comes from a webcam or USB-connected camera.

The final product is an AR application that uses a YOLO model to detect doodles, place 3D representations on each symbol, and compute a navigation path using the A* algorithm. Using a sheet of paper printed with an Augmented Reality University of Córdoba (ArUco) marker, detected via OpenCV, the user draws doodles of different objects, and the application calculates the best path for the starting doodle to reach the goal doodle (OpenCV: Detection of ArUco Markers, n.d.). For example, a helicopter doodle routes to a hospital while avoiding obstacles, which are the other remaining classes in our dataset.

Background Information

The original drawing data used in this project comes from the website game Quick, Draw!, where users made small, rapid drawings of a given prompt within 20 seconds for the purpose of training a classification model with labeled data. While Google Creative Labs has not disclosed the architecture of the classifier currently live on the site, they do link to TensorFlow guidance on constructing a recurrent neural network for this purpose (Tensorflow, 2024). Niu et al. had robots use hand-drawn maps and object recognition to navigate unfamiliar environments (Niu et al., 2019). While these and other projects have demonstrated high levels of accuracy in classification, bounding boxes are not used, real-time continuous performance is not prioritized, the live drawing surface is not incorporated, and in most implementations the classification is the end goal itself rather than a tool that feeds into subsequent tasks.

While prior pictogram classification work and similar projects primarily focus on image recognition, our project incorporates multiple layers of complexity by combining CNN-based object classification with bounding box identification and dynamic path optimization using A*. The A* algorithm is a graph traversal and path search algorithm that estimates the most efficient path by using a heuristic function to guide the search toward the destination. The algorithm calculates a score that combines the distance traveled from the starting point with an estimate of the remaining distance to the destination, allowing it to prioritize nodes that are closer to the goal. This approach makes A* particularly effective in applications where efficient route planning is required, such as robotics and video games, where scenarios often demand rapid routing between two points, as well as flight path planning in challenging environments (Luo et al., 2023; Rubio, 2023).
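The scoring described above, the distance traveled so far plus a heuristic estimate of the remaining distance, can be sketched in a few lines. The paper's actual routing runs inside the Unity application; the grid layout, 4-connected movement, and Manhattan heuristic below are illustrative assumptions, not the project's implementation.

```python
import heapq

def astar(grid, start, goal):
    """A* over a 4-connected grid; cells equal to 1 are obstacles.
    Expands the node with the lowest f = g (cost so far) + h (heuristic)."""
    def h(a, b):  # Manhattan distance: admissible for 4-connected movement
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    rows, cols = len(grid), len(grid[0])
    open_set = [(h(start, goal), 0, start, [start])]
    best_g = {start: 0}
    while open_set:
        _, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    heapq.heappush(open_set, (ng + h(nxt, goal), ng, nxt, path + [nxt]))
    return None  # goal unreachable

# A "helicopter" at (0, 0) routing to a "hospital" at (2, 3) around obstacles:
grid = [[0, 0, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0]]
route = astar(grid, (0, 0), (2, 3))
```

Because the Manhattan heuristic never overestimates the remaining cost on this grid, the route returned is a true shortest path, matching the Dijkstra comparison discussed below while exploring fewer nodes.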

CNNs are a standard choice for many image classification tasks. One widely used dataset is MNIST, which evaluates CNN performance in classifying handwritten digits. While MNIST demonstrates the potential of CNNs in classification tasks, our project involves recognizing freehand doodles from Quick, Draw!, which presents a more complex classification challenge. CNN-based models have proliferated across image classification research, but the YOLO and Faster R-CNN architectures have dominated applications targeting real-time use due to their speed and performance.

The Faster Regional Convolutional Neural Network (Faster R-CNN) creates region proposals from the image, computes CNN features, and then classifies those regions (Ren et al., 2017). Correctly identified regions, that is, bounding boxes coupled with labels, are simplistic outputs, yet they open a dense space of possibilities once labeled objects are placed in proximity to one another: strategic or rescue planning, rapid spatial conveyance, wargaming, or concept conveyance. As a demonstration of the capabilities that pictogram positioning coupled with identification makes available, a pathfinding implementation using the A* algorithm was developed to route a starting object, a helicopter, to an objective, a hospital, recommending the most efficient path while avoiding detected obstacle pictograms. Other approaches evaluated during exploratory phases include YOLO and custom convolutional neural network implementations.


Figure 1

A* algorithm

This project uses A*, and numerous projects have leveraged the algorithm in three-dimensional spaces to determine the best route for an agent. Research comparing A* and Dijkstra’s algorithm highlights that A* achieves the same shortest-path outcome while drastically minimizing the search space by guiding the search with a heuristic, making it a more efficient choice for our application (Malacad, 2022). A key example is video games, which often use the algorithm for bot-like characters in three-dimensional environments. Games like “Call of Duty” feature bot-like characters in certain game modes that navigate toward specific points based on user input or objective. While Activision considers its “bot logic” a trade secret and has not open-sourced the code, there are examples of this algorithm being used in other games (git-amend, 2025) and in practical examples for game developers.


Data Summary

To train for prediction based on human-drawn symbolic representations, the Google Quick, Draw! doodle dataset was used (Jongejan et al.). The dataset consisted of human-drawn representations of a word prompt, drawn under a 20-second time constraint to force simplicity, with 3,000 samples per class. All samples included were those that were correctly identified, and all were closely cropped to the area of the drawing. Each labeled grayscale .png image was accompanied by the country code of the user who generated it, a vectorized version of the image preserving stroke data, and a unique identifier. The dataset was created through self-selected user participation in the Quick, Draw! game, where users volunteered their drawings as training data (Jongejan et al.). The dataset contained no missing values except for country codes; while the cause was not disclosed in the source of the dataset, the codes are detected automatically from the user’s internet protocol (IP) address, so factors such as virtual private network (VPN) usage or IP detection failures are plausible explanations. In the chosen modeling approach, the country code was discarded as non-relevant data.

The 345 image classes present in the dataset cover a wide assortment of subjects known to a general audience, ranging from “bat” to “The Eiffel Tower” to “diving board”. Images were all stretched to squares if not already in that form, with an even 36 pixels of white-space padding on each side. For the purposes of computational resource management and scope, this project uses twelve of the 345 categories; the number of classes needed in a business context would likely depend heavily on the domain and usage. Data augmentation was performed to further pad images for bounding box prediction, giving the model bounding boxes that shifted from image to image, but this was found to be insufficient to help all models learn accurate box predictions.
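The augmentation idea, keeping the doodle's pixels unchanged while its bounding box shifts from sample to sample, can be illustrated with a minimal sketch. The helper name, canvas size, and doodle size below are hypothetical, not values from the project's pipeline.

```python
import random

def pad_with_random_offset(doodle_w, doodle_h, canvas_w, canvas_h, rng=random):
    """Place a doodle of the given size at a random offset inside a larger
    white canvas and return its shifted bounding box (x_min, y_min, x_max, y_max).
    Every call yields a differently positioned, identically sized box."""
    x0 = rng.randrange(canvas_w - doodle_w + 1)
    y0 = rng.randrange(canvas_h - doodle_h + 1)
    return (x0, y0, x0 + doodle_w, y0 + doodle_h)

# One augmented sample: a 128x128 doodle somewhere on a 640x640 canvas.
box = pad_with_random_offset(128, 128, 640, 640)
```

Repeating this over the training set forces a detector to localize the drawing rather than memorize a fixed position, which is the behavior the padding augmentation was meant to encourage.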


Experimental Methods

Prior drawing classification work has historically employed a number of different approaches, including stroke-based and contour-based methods, transformer architectures, and deep and convolutional neural networks, but the requirement for real-time, continuous use limits which of these can be embedded in an augmented reality application. Convolutional neural networks with large pooling layers capture some of the less defined features within the doodles, but do not support bounding box detection in their standard implementation. Pictograms also pose a unique classification challenge, being typically simple, abstract, and limited in features. Unlike characters or digits, there is no single, standardized, agreed-upon way in which two people draw a firetruck in twenty seconds. Model selection therefore had to balance drawing classification, bounding box detection, inference speed, and support for deployment in lightweight, real-time AR systems.

There are multiple architectures a team can pick for image detection and classification, and models built from the ground up may combine hidden layers and pooling layers in different ways. After evaluating multiple models, including a convolutional neural network (CNN), a Faster Regional Convolutional Neural Network, and YOLO models, YOLOv8 nano was chosen for its performance on the task. Layers were not frozen, enabling the model to better learn the unique features of the doodles given their substantial difference from photograph-quality images.

When training the CNN and Faster R-CNN models, a 60% random shuffle of the available training data was used, but even with image augmentation to pad around the image to aid bounding box detection, performance on larger images composed of multiple doodles was poor. Synthetically generated images were therefore created, both manually by the researchers and through a synthetic data generation pipeline built in the Unity game engine (Unity 6, n.d.). 400 images were created manually for training and labeled with the tool Label Studio (Label Studio, n.d.). The automatically generated images from Unity were created by adding an alpha channel that turns the white portions of each drawing into a transparency mask, then using C# to place a randomized number of drawings at randomized positions and scales over a randomized background. Start and end drawings, in this case helicopters and hospitals, were prioritized, with all other obstacle classes randomized. Drawings were selected by randomly pulling from the first 2,200 example drawings of each class, with the remaining 800 preserved for later testing. 10,000 images were generated for training in the span of approximately 30 minutes, and an additional 1,000 images were created for validation and testing. The pipeline would support exchanging class types or increasing class counts with minimal modification. When either training images or live images are processed and passed to the model, images are resized to the expected 640x640 input and converted to tensors. The choice to avoid further preprocessing was deliberate, prioritizing performance speed.
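The pipeline's logic can be sketched outside Unity as well. The version below generates one scene's annotations in YOLO-style normalized coordinates (class id, center x, center y, width, height); the function name, class ids, and size ranges are illustrative assumptions, the real pipeline composites actual doodle pixels in C#.

```python
import random

CANVAS = 640  # matches the 640x640 model input described above

def make_annotations(obstacle_classes, max_obstacles=5, rng=random):
    """Emit labels (class_id, cx, cy, w, h, all normalized to [0, 1]) for one
    synthetic scene: start and goal drawings are always present, a random
    subset of obstacle classes fills the rest, each at a random position
    and scale. Class ids 0 and 1 (helicopter, hospital) are assumed here."""
    labels = []
    classes = [0, 1]  # guaranteed start (helicopter) and goal (hospital)
    classes += rng.sample(obstacle_classes, rng.randint(1, max_obstacles))
    for cls in classes:
        size = rng.randint(64, 160)        # randomized scale in pixels
        x0 = rng.randrange(CANVAS - size)  # randomized top-left corner
        y0 = rng.randrange(CANVAS - size)
        cx, cy = (x0 + size / 2) / CANVAS, (y0 + size / 2) / CANVAS
        labels.append((cls, cx, cy, size / CANVAS, size / CANVAS))
    return labels

# Ten hypothetical obstacle class ids for the remaining doodle categories.
labels = make_annotations(obstacle_classes=list(range(2, 12)))
```

Swapping in different class lists or larger counts only changes the arguments, which reflects the paper's claim that the pipeline supports exchanging class types with minimal modification.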

Initial modeling approaches used convolutional neural networks created with TorchVision on the original single-class images. Two CNNs were developed. The first used two convolution layers, Rectified Linear Unit (ReLU) activations, 2x2 max pooling layers, and a flattened linear layer to predict one of ten classes. An overall F1 score of 0.94 was achieved with generally even performance across the classes. The second model used twice as many convolutional layers, batch normalization, dropout, and Leaky ReLU on twelve classes, and achieved a validation accuracy of 89%. Neither CNN supported bounding box predictions; however, the classification results validate that the selected classes are learnable from Quick, Draw! images. In preliminary tests on hand-drawn inputs, the models correctly classified most samples, motivating the transition to detection architectures for localization.
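The sizing of a two-conv network like the first CNN above can be worked through with the standard shape formula. The input resolution, kernel size, padding, and channel counts below are assumptions for illustration, the paper does not report the exact values.

```python
def conv2d_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution: floor((W - K + 2P) / S) + 1."""
    return (size - kernel + 2 * padding) // stride + 1

def pool_out(size, kernel=2):
    """Spatial size after non-overlapping max pooling."""
    return size // kernel

# Hypothetical sizing: 28x28 grayscale input, two conv -> ReLU -> 2x2 pool
# stages with assumed channel widths of 32 and 64.
size, channels = 28, 1
for out_channels in (32, 64):
    size = pool_out(conv2d_out(size))  # 28 -> 28 -> 14, then 14 -> 14 -> 7
    channels = out_channels

flat_features = channels * size * size  # width of the final flattened linear layer
num_classes = 10                        # ten doodle classes, per the text
```

Under these assumptions the flatten step feeds 64 x 7 x 7 = 3,136 features into the linear classifier; the same arithmetic shows why halving the spatial size twice keeps even a small network's final layer manageable.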


Figure 2

Original training image

Figure 3

Synthetically generated combinations of doodles from data

