AAI_2025_Capstone_Chronicles_Combined

Cinema Analytics and Prediction System


can be beneficial when handling long-tail or underrepresented genres, such as the Documentary and Foreign films in our dataset.

Together, these methods form a hierarchy of NLP-based modeling, from sparse lexical matching to deep semantic understanding, enabling more personalized and context-aware movie recommendations and classification systems.

In a separate component of the project, revenue prediction and success classification were addressed with machine learning models for regression and classification, respectively. Specifically, XGBRegressor and XGBClassifier from the XGBoost library were employed; both implement the gradient boosting framework. Gradient boosting builds an ensemble of decision trees sequentially, with each tree trained to correct the residual errors of the trees before it. XGBoost is well suited to these tasks because it captures complex, nonlinear relationships and because tree-based splits are relatively insensitive to extreme feature values, an important property when modeling movie data, where extreme values are common. Furthermore, its compatibility with both numerical and encoded categorical features, along with its built-in feature importance metrics, aids model interpretability.
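The residual-correction idea described above can be illustrated with a minimal sketch. It is not the project's pipeline and not XGBoost itself: it is a plain-Python gradient boosting loop for squared-error regression that uses depth-1 trees (stumps) on a single feature. XGBoost additionally applies regularization, second-order gradient information, and many engineering optimizations that are omitted here. The `budgets` and `revenues` values, the function names, and the `n_rounds` and `learning_rate` settings are all hypothetical choices made for illustration.

```python
# Toy data, made up for illustration only (not taken from the project's dataset):
# a single feature (budget, in $M) and a target (revenue, in $M).
budgets = [1, 2, 5, 10, 20, 40, 80, 120, 160, 200]
revenues = [0.5, 1, 4, 15, 30, 90, 250, 300, 700, 900]


def fit_stump(x, residuals):
    """Fit a depth-1 tree: pick the split threshold that minimizes squared error."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All feature values are identical, so no split is possible.
        mean = sum(residuals) / len(residuals)
        return x[0], mean, mean
    return best[1], best[2], best[3]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_boosting(x, y, n_rounds=50, learning_rate=0.3):
    """Sequentially fit stumps, each one to the residuals left by the ensemble so far."""
    base = sum(y) / len(y)  # initial prediction: the mean of the target
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
    return base, stumps, learning_rate


def predict(model, xi):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def mse(model, x, y):
    return sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


if __name__ == "__main__":
    baseline = fit_boosting(budgets, revenues, n_rounds=0)  # mean-only model
    boosted = fit_boosting(budgets, revenues, n_rounds=50)
    print("training MSE, mean only:", mse(baseline, budgets, revenues))
    print("training MSE, 50 rounds:", mse(boosted, budgets, revenues))
```

Each round lowers the training error because the new stump is fit to whatever the current ensemble still gets wrong. In practice, a library model such as XGBRegressor would be evaluated on held-out data rather than on training error.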

Several studies support the use of gradient boosting and other nonlinear models in similar applications. For example, Lee et al. (2020) demonstrated that gradient boosting outperforms linear models for revenue prediction by capturing intricate interactions among predictors. Earlier, Sharda and Delen (2006) showed that a multi-layer perceptron neural network surpassed logistic regression and decision trees in predicting box office success. Additionally, competitions like the Kaggle TMDB Box Office Prediction challenge have showcased the practical utility of XGBoost in forecasting
