AAI_2025_Capstone_Chronicles_Combined


Draw, Detect, Navigate

Experimental Methods

Prior drawing classification work has employed a variety of approaches, including stroke-based and contour-based methods, transformer architectures, and deep convolutional neural networks, but the requirement for real-time, continuous operation limits which of these can be embedded in an augmented reality application.

Convolutional neural networks with large pooling layers capture some of the less well-defined features within the doodles, but their standard implementations do not produce bounding boxes. Pictograms also pose a unique classification challenge: they are typically simple, abstract, and contain few distinguishing features. Unlike handwritten characters or digits, there is no single, standardized way that two people will draw a firetruck in twenty seconds. Model selection therefore had to balance drawing classification accuracy, bounding box detection, inference speed, and support for deployment in lightweight, real-time AR systems.

Multiple architectures are available for image detection and classification, ranging from models built from the ground up with custom hidden and pooling layers to pretrained detection frameworks. After evaluating several candidates, including a Convolutional Neural Network (CNN), a Faster Region-based Convolutional Neural Network (Faster R-CNN), and YOLO models, YOLOv8 nano was chosen for its performance on the task.

No layers were frozen during training, allowing the model to better learn the unique features of the doodles, which differ substantially from photograph-quality images.
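Fine-tuning YOLOv8 nano with all layers trainable could be sketched with the Ultralytics API as follows; the dataset file `doodles.yaml`, image path, and hyperparameters are illustrative assumptions, not the configuration actually used in this work.

```python
# Sketch: fine-tuning YOLOv8 nano on a custom doodle dataset with all
# layers left trainable. Requires the `ultralytics` package; dataset
# paths and hyperparameters below are placeholders, not the paper's values.
from ultralytics import YOLO

# Start from the COCO-pretrained nano weights (downloaded on first use).
model = YOLO("yolov8n.pt")

# freeze=None (the default) leaves every layer trainable, so the backbone
# can adapt to sketch-like inputs that differ from photographic images.
model.train(
    data="doodles.yaml",  # hypothetical dataset config (class names, paths)
    epochs=100,
    imgsz=640,
    freeze=None,
)

# Inference on a captured frame returns class labels with bounding boxes,
# which is what the AR overlay needs.
results = model.predict("frame.png")
```

Leaving `freeze` unset trades longer training for a backbone that is no longer biased toward photographic texture, which matters for sparse line drawings.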

When training the CNN and Faster R-CNN models, a random 60% shuffle of the available training data was used, but even with padding augmentation around each image to aid bounding box detection, performance on larger images composed of multiple doodles was poor. Synthetic images were therefore created both manually by the researchers and through a synthetic data generation pipeline built in the Unity game engine (Unity 6, n.d.). 400 images were created manually for training, labeled

