M.S. AAI Capstone Chronicles 2024
Alternatively, computer vision methods can be applied to the SAA problem outlined.
A convolutional neural network (CNN) is a deep learning architecture most commonly applied to the
analysis of images and is inspired by the human visual system. Using this method, a model can
be developed to determine whether helicopters, airplanes, birds, drones, flocks, or other airborne
objects are present in the images it receives from the UAV.
The Vision Transformer (ViT) pretrained model is pretrained on the ImageNet and ImageNet-21k
datasets, which are large-scale labeled image datasets designed for research across a wide variety
of object categories; ImageNet is widely considered a benchmark for computer vision
models (Dosovitskiy et al., 2021). ImageNet-21k consists of roughly 21,000 categories and 14 million
images, while ImageNet contains 1,000 categories and about 1.3 million images, with categories
including people, cars, animals, and food. The ViT model was published in 2021 and, when pretrained
on sufficiently large datasets, was reported to match or exceed state-of-the-art CNNs while requiring
fewer computational resources to train. The transformer is adapted to computer
vision by breaking down an input image into a series of fixed-size patches, which are flattened into
vectors and linearly projected into embeddings of a fixed dimension. Transfer learning allows the
knowledge gained from the pretrained ViT model to be applied to the SAA task without the need to train
a new model from scratch. This process further reduces the computational effort required and
typically improves accuracy when task-specific data is limited.
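To make the patch step concrete, the following is a minimal NumPy sketch rather than the ViT
implementation itself. It assumes a 224 x 224 RGB input, 16 x 16 patches, and a 768-dimensional
embedding (the ViT-B/16 configuration). The function names patchify and embed_patches are
hypothetical, and the projection weights are random placeholders for weights that would be learned
during pretraining.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (H, W, C) -> (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows*cols, P*P*C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


def embed_patches(patches, weight, bias):
    """Linearly project each flattened patch to the embedding dimension."""
    return patches @ weight + bias


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # stand-in for one UAV camera frame
patch_size, embed_dim = 16, 768     # assumed ViT-B/16 settings

patches = patchify(image, patch_size)
# Random placeholder weights; in a real ViT these are learned during pretraining.
weight = rng.normal(0.0, 0.02, size=(patches.shape[1], embed_dim))
bias = np.zeros(embed_dim)
tokens = embed_patches(patches, weight, bias)

print(patches.shape)  # (196, 768)
print(tokens.shape)   # (196, 768)
```

With these settings the image yields 14 x 14 = 196 patches, each flattened to 16 x 16 x 3 = 768
values. The full model also prepends a class token and adds position embeddings before the
Transformer encoder; both are omitted here for brevity.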
A paper published in the 2023 IEEE International Conference on Sensors, Electronics and
Computer Engineering compared the performance of ViT and CNN-based models on a classification task
using a UAV dataset composed of 1359 images (Zhang, 2023). The CNN-based models used were a
traditional CNN with 14 layers, ResNet50, and VGG16, which were evaluated against the ViT-b16
pretrained model. The study concluded that the ViT model outperformed the CNN-based models in the
classification and object detection tasks. However, the training process was more elaborate for the ViT
model and played a critical role in the overall performance. It should be noted that the dataset used
contained one object per image, so the results achieved may not generalize to a task requiring
multi-object detection and classification.