M.S. Applied Data Science - Capstone Chronicles 2025
thorough comparison of predictive performance across different algorithmic approaches.

4.4.4.1 Dataset Preparation. The dataset used for training and evaluation was the processed version of the original YouTube metadata. After transformation, it included the following features: likes, comments, publish hour, likes per view, and comments per view. The target variable for all models was the number of views, which represented the total video views and served as the basis for prediction.

4.4.4.2 Feature Scaling. For models sensitive to feature scale, such as SVR, the predictor variables were standardized using StandardScaler. Scaling was performed in parallel so that scaled and unscaled versions of the data could be passed to models as needed. Tree-based models such as XGBoost and CatBoost do not require feature scaling, so the original values were retained in those pipelines.

4.4.4.3 Training and Validation Datasets. The full dataset was split 80/20 into training and test sets using a fixed random seed for reproducibility. An 80/20 split is a common choice in machine learning because it provides enough data to train the model effectively while still reserving a meaningful portion for evaluating performance on unseen data. This balance helps ensure that the model generalizes well without overfitting to the training set. No cross-validation was used at this stage, as the initial runs were intended for baseline comparison and performance screening. Future work should include k-fold cross-validation to improve the reliability of the results and reduce the variance introduced by a single train-test split.

5 Results and Findings

CatBoost yielded the best overall performance in terms of root mean squared error and the R-squared value. This suggests CatBoost had the
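The preparation steps above (fixed-seed 80/20 split, with standardization applied only for scale-sensitive models) can be sketched as follows. The feature matrix here is synthetic stand-in data, not the paper's actual YouTube dataset, and the seed value 42 is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the five engineered features:
# likes, comments, publish hour, likes per view, comments per view.
rng = np.random.default_rng(42)
X = rng.random((1000, 5))
y = rng.random(1000) * 1_000_000  # stand-in view counts (target)

# 80/20 train/test split with a fixed random seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize only for scale-sensitive models (e.g., SVR): fit the
# scaler on the training set, then apply the same transform to the
# test set. Tree-based models use X_train / X_test unscaled.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into the model.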
closest approximation to the actual view counts among all models. XGBoost followed second, suggesting both gradient boosting models benefited from the parameter tuning done via grid search. Both models converged on an optimal learning rate of 0.05 and relatively shallow tree depths: 3 for XGBoost and 4 for CatBoost. The support vector regression model, although not tuned, showed surprisingly strong results in terms of mean absolute error, outperforming all other models on this particular metric. However, the support vector regression's root mean squared error and R-squared values were still below expectations, indicating that this model struggled with larger errors and variance. The linear regression model performed the worst overall, which was expected, as it was intended only to serve as a baseline. It had a root mean squared error of 1,463,793 and an R-squared value of -0.1345. This result reinforces the need for non-linear modeling techniques to capture the complex relationships present in our data, and is likely also influenced by the feature engineering performed on the title.
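The grid search over learning rate and tree depth described above can be illustrated as follows. This is a minimal sketch: it uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost/CatBoost (the same `GridSearchCV` pattern applies to both), and the synthetic data and grid values beyond 0.05 and depths 3–4 are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data as a stand-in for the video metadata.
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = X @ rng.random(5) + rng.normal(scale=0.1, size=300)

# Grid over learning rate and tree depth, the two hyperparameters
# the paper reports tuning (optimal values: 0.05 and depth 3 or 4).
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 4, 5],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

With the XGBoost or CatBoost libraries installed, `GradientBoostingRegressor` would simply be replaced by `XGBRegressor` or `CatBoostRegressor` inside the same search.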
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \quad (1)

MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \quad (2)

R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \quad (3)

5.1 Model Performance
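The three evaluation metrics can be computed directly with scikit-learn. The numbers below are illustrative stand-ins, not results from this study:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative actual vs. predicted view counts.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 380.0])

# Root mean squared error: penalizes large errors quadratically.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# Mean absolute error: average magnitude of errors.
mae = mean_absolute_error(y_true, y_pred)
# R-squared: fraction of variance explained; negative values mean
# the model predicts worse than the mean of y_true.
r2 = r2_score(y_true, y_pred)
```

A negative R², as reported for the linear regression baseline, indicates predictions worse than simply predicting the mean view count.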
Although CatBoost and XGBoost demonstrated the best performance in absolute terms, none of the models achieved a positive R-squared value, indicating they failed to explain variance in the