M.S. AAI Capstone Chronicles 2024

First page Table of contents Previous page 121 Next page Last page

Alternatively, computer vision methods can be used as a solution to the SAA problem outlined.

A convolutional neural network (CNN) is a computer vision method most commonly applied to the

analysis of images and is inspired by the visual process used by humans. Using this method a model can

be developed to determine whether helicopters, airplanes, birds, airborne, drones, or flocks are present

in the images it is fed from the UAV.

The Vision Transformer (ViT) pretrained model is pretrained on the ImageNet and ImageNet-21k

datasets which are large-scale object detection, segmentation, and captioning datasets designed for

research in a wide variety of object categories and is considered a benchmark for computer vision

models (Dosovitskiy, et al., 2021). The dataset consists of over 200,000 categories including people, cars,

animals, and food each with several hundred images. The ViT model was introduced in 2021 as a more

computational efficient and accurate model than the CNN. The transformer is designed for computer

vision by breaking down an input image into a series of patches which are flattened into vectors and

mapped into a smaller dimension. Transfer learning allows the knowledge gained from the pretrained

ViT model to be applied to the SAA task without the need to train a new model from scratch. This

process further reduces the computational efforts and increases the accuracy of the performance.

A paper published in the 2023 IEEE International Conference on Sensors, Electronics and

Computer Engineering compared the performance of ViT and CNN based models on a classification task

using a UAV dataset composed of 1359 images (Zhang, 2023). The CNN based models used were a

traditional CNN with 14 layers, ResNet50, and VGG16 which were evaluated against the ViT-b16

pretrained model. The study concluded that the ViT model outperformed the CNN based models in the

classification and object detection tasks. However, the training process was more elaborate for the ViT

model and played a critical role in the overall performance. It should be noted that the dataset used

contained one object per image and the results achieved may not be generalized to a task with multi

object detection and classification.

121

Made with FlippingBook - professional solution for displaying marketing and sales documents online