M.S. Applied Data Science - Capstone Chronicles 2025


Analyzing YouTube Trends Using Metadata and NLP ∗

Jose Guarneros † Applied Data Science Master’s Program Shiley Marcos School of Engineering / University of San Diego jguarnerospadilla@sandiego.edu

Tysir Shehadey Applied Data Science Master’s Program Shiley Marcos School of Engineering / University of San Diego tshehadey@sandiego.edu

ABSTRACT

This study investigates whether YouTube video popularity can be predicted using only metadata and text-based features available at the time of upload. Metadata, including video title, channel name, description, publish time, and engagement ratios, was collected via the YouTube Data API for 3,046 videos across various categories. Feature engineering produced temporal variables, textual measures, and normalized engagement metrics. Exploratory data analysis revealed strong correlations among views, likes, and comments, but only a limited relationship between view counts and purely temporal or textual features. Four supervised learning algorithms (linear regression, Support Vector Regression, XGBoost, and CatBoost) were implemented to predict total view counts. The two tree-based gradient boosting models outperformed the linear and kernel-based models in root mean square error. However, none of the four models achieved a positive R-squared value, indicating that the models failed to explain variance better than a naive mean predictor. These results suggest that surface-level metadata alone offers limited predictive power for video popularity, likely due to the influence of platform algorithms, audience networks, and content quality factors not captured in metadata.

While predictive performance was poor, the analysis provided descriptive insights. Videos published on Sundays and Mondays showed slightly higher average views, and normalized engagement metrics were more informative than raw counts. Future work should integrate richer pre-upload features such as channel statistics, thumbnail attributes, and advanced text embeddings, alongside broader datasets and cross-validation, to improve predictive capability. This research highlights both the constraints and the potential of using publicly available metadata for early-stage content performance forecasting.

KEYWORDS

machine learning, feature engineering, YouTube, API, natural language processing, algorithm, data analysis, metadata, correlation, feature space, non-linearity, linearity, linear regression, support vector regressor, XGBoost, CatBoost

1 Introduction

In the current digital age, video platforms such as YouTube have become a primary way for people to upload their own videos and engage with the videos of others. Although YouTube attracts a large number of active users every day, it remains difficult for creators to drive engagement with their videos. If popularity patterns can be predicted using only metadata, content creators can be more strategic about the videos they
