M.S. Applied Data Science - Capstone Chronicles 2025
target variable significantly better than a naive mean-based predictor. This points to limitations in the predictive power of the selected features (title, keyword, published time, and derived metrics like title length and likes per view). The likely contributing factors are the high variance in view counts, the limited feature scope, and the sparse text features.

Table 1
Test Set Performance Metrics for Each Model

Model                       RMSE          (unlabeled)   R²
Linear Regression           1463792.84    793849.72     -0.1345
Support Vector Regressor    1449451.64    526671.64     -0.1124
XGBoost                     1413322.12    752190.86     -0.0576
CatBoost                    1385687.20    748777.30     -0.0167

6 Discussion

The results of the predictive modeling process indicate that metadata alone is insufficient for accurately forecasting YouTube video popularity. Although CatBoost and XGBoost achieved the lowest root mean squared error values of the four models, none achieved a positive R-squared score. This suggests that the models failed to explain variation in view counts beyond what could be achieved with a simple average-based prediction.
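To make the interpretation of a negative R-squared concrete, the following is a minimal sketch rather than the evaluation code used in this project. The view counts and the constant model prediction are hypothetical values chosen only for illustration. It shows that predicting the mean of the evaluated targets yields an R-squared of zero, while any predictor with a larger squared error than that baseline yields a negative score.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical, heavily skewed view counts (not data from this study).
views = [1_000, 2_000, 3_000, 5_000, 1_000_000]

# Naive baseline: predict the mean of the evaluated targets for every video.
mean_value = sum(views) / len(views)
mean_baseline = [mean_value] * len(views)

# Hypothetical model output: a constant that differs from the mean,
# standing in for a model that has learned nothing useful from its features.
model_preds = [300_000] * len(views)

print("baseline RMSE:", rmse(views, mean_baseline))
print("baseline R^2: ", r_squared(views, mean_baseline))
print("model RMSE:   ", rmse(views, model_preds))
print("model R^2:    ", r_squared(views, model_preds))
```

In this sketch the hypothetical model has a higher RMSE than the mean baseline, so its R-squared falls below zero. The same relationship holds for the scores in Table 1: each negative value indicates a squared error on the test set larger than that of a predictor that always outputs the test-set mean.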
These findings highlight the limited predictive power of features such as title length, publish time, and normalized engagement ratios when used in isolation. Viewer behavior, algorithmic recommendations, social influence, and video content are likely to play a more significant role in driving engagement. Temporal analysis revealed that videos published on Sundays and Mondays received higher average view counts, but these differences were not strong enough to serve as reliable predictors. Among the engineered features, normalized metrics such as likes per view and comments per view showed greater relevance than raw counts. This supports earlier research, which has shown that ratios like likes per view are more useful than total counts when evaluating how videos perform.

Although the models did not perform well in predicting view counts, basic and publicly available metadata still provides value for identifying patterns and developing early ideas. It offers a starting point for understanding engagement trends without needing access to private or complex data. Future studies could improve prediction accuracy by including additional sources, such as viewer demographics, social media activity, or features pulled directly from the video content.

6.1 Conclusion

This project set out to determine whether YouTube video engagement could be predicted using only metadata and text-based features available at the time of upload. After collecting and analyzing data from over 3,000 videos, several models were trained to estimate view counts based on features such as title, publish time, and engagement ratios. Although the models did not achieve high predictive accuracy, the process highlighted important patterns in how