AAI_2025_Capstone_Chronicles_Combined
Cinema Analytics and Prediction System
12
can be beneficial when handling long-tail or underrepresented genres such as Documentary or
Foreign films as we have in our dataset.
Together, these methods form a hierarchy of NLP-based modeling, from sparse lexical
matching to deep semantic understanding, enabling more personalized and context-aware
movie recommendations and classification systems.
In a separate component of the project, revenue prediction and success classification
were addressed using machine learning models designed for regression and classification tasks.
Specifically, XGBRegressor and XGBClassifier from the XGBoost library were employed, which
implement the gradient boosting framework. Gradient boosting builds an ensemble of decision
trees in a sequential manner, with each tree trained to correct the residual errors of the
previous ones. XGBoost is particularly well-suited for these tasks due to its ability to capture
complex, nonlinear relationships and its robustness to outliers — an essential characteristic
when modeling movie revenue, where extreme values are common. Furthermore, its
compatibility with both numerical and encoded categorical features, along with built-in feature
importance metrics, enhances model interpretability.
Numerous studies have validated the use of gradient boosting in similar applications.
For example, Lee et al. (2020) demonstrated that gradient boosting outperforms linear models
for revenue prediction by capturing intricate interactions among predictors. Sharda and Delen
(2006) showed that a multi-layer perceptron neural network surpassed logistic regression and
decision trees in predicting box office success. Additionally, competitions like the Kaggle TMDB
Box Office Prediction challenge have showcased the practical utility of XGBoost in forecasting
179
Internal
Made with FlippingBook - Share PDF online