AAI_2025_Capstone_Chronicles_Combined
Cinema Analytics and Prediction System
4
while genres like Western or War were
underrepresented. These insights not only guided
downstream feature engineering decisions but
also informed our strategy to handle genre
imbalance, as we can see in the following plot
(Figure 2), using techniques like resampling and
Figure 2: Genre distribution
class weighting during model training.
We then performed a series of preprocessing steps to prepare the data for modeling.
Text-based fields such as genres and keywords were initially stored as stringified JSON and
required parsing to extract meaningful lists of terms. We also normalized metadata like
popularity, budget, and revenue using min-max scaling to ensure compatibility with deep
learning models. A new feature that combined free text features was engineered by merging
each film’s overview, genres, and keywords (Figure 3).
Figure 3: Combined text feature
This became the core input for multiple modeling approaches, including TF-IDF, BERT
embeddings, and LSTM with GloVe vectors. For classification, the same text features were
paired with normalized metadata and used in both unsupervised and supervised settings. The
BERT-based models consistently demonstrated superior semantic understanding in both
171
Internal
Made with FlippingBook - Share PDF online