AAI_2025_Capstone_Chronicles_Combined
Figure 2 Label Frequencies (Positive Examples)
The message text required transformation into numerical representations suitable for machine learning algorithms. It was vectorized using TF-IDF restricted to the top 5,000 unigrams and bigrams, capturing both individual words and common two-word phrases that carry semantic meaning in disaster contexts. The genre variable, which provides useful context about a message's source (direct, social, or news), was one-hot encoded, adding three binary features to the model. While semantic embeddings such as Word2Vec were considered for their ability to capture word relationships, TF-IDF was selected because XGBoost is more effective at exploiting the high-dimensional, sparse representations produced by frequency-based vectors than dense embeddings (Illa et al., 2024). This preprocessing pipeline yielded approximately 5,003 features per message. The resulting high dimensionality, combined with a relatively small training set, calls for strong regularization to prevent overfitting. Furthermore, the severe class imbalance shown in Table 1 requires mitigation strategies such as class-weight adjustments in the loss function and per-label threshold optimization to ensure the model remains sensitive to rare labels.
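The described preprocessing can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names `message` and `genre` and the toy data are assumptions, and the real pipeline would fit on the full training set.

```python
# Sketch of the preprocessing pipeline: TF-IDF over unigrams and bigrams
# (capped at 5,000 terms) plus one-hot encoding of the genre column.
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the disaster-message data (illustrative only).
df = pd.DataFrame({
    "message": ["need water and food",
                "storm damaged the bridge",
                "medical supplies needed urgently"],
    "genre": ["direct", "social", "news"],
})

# TF-IDF restricted to the top 5,000 unigrams and bigrams.
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X_text = tfidf.fit_transform(df["message"])

# One-hot encode genre (direct / social / news) -> three binary features.
ohe = OneHotEncoder()
X_genre = ohe.fit_transform(df[["genre"]])

# Concatenate into one sparse matrix; on the full corpus this is the
# ~5,003-feature representation described above.
X = hstack([X_text, X_genre])
print(X.shape[1] == X_text.shape[1] + 3)  # True
```

Keeping everything sparse matters here: XGBoost and most scikit-learn estimators accept the sparse matrix directly, avoiding a dense 5,003-column array.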
Figure 1 Labels Correlation Matrix for top 20 labels
Many messages carry more than one label, as Figure 2 shows. This strongly suggests that methods capable of modeling label dependencies, such as Classifier Chains, will likely outperform independent Binary Relevance approaches.