AAI_2025_Capstone_Chronicles_Combined


challenging (0.365) due to extreme data scarcity (77 validation examples), though the optimized threshold of 0.75 provides the best available precision-recall balance. Table 5

Threshold tuning delivered a 9.1% improvement in micro F1 at minimal computational cost. Category-specific hyperparameter tuning improved individual critical categories but did not substantially change aggregate metrics. The final optimized model was evaluated on the held-out test set of 2,629 messages to assess generalization to unseen data. Test set performance closely matched validation results, with micro F1=0.677 (vs. validation 0.682) and macro F1=0.518 (vs. validation 0.499), confirming that the optimization process did not overfit to the validation set. Notably, macro F1 improved by 3.8% on the test set, suggesting that the cost-sensitive learning and threshold optimization strategies generalize effectively to novel disaster communications. Critical category performance on the test set demonstrated strong results for basic needs: food achieved F1=0.84 (exceeding the 0.80 target), water F1=0.81 (also exceeding the target), and shelter F1=0.74. However, medical help performance decreased to F1=0.47 (from validation 0.513), and search and rescue dropped to F1=0.22 (from validation 0.365), confirming that extreme data scarcity for these rare but critical categories fundamentally limits model performance even with aggressive optimization. Table 6 presents the complete test set performance comparison.

Table 6

Optimization Stage               | F1 Micro | F1 Macro | Precision Micro | Recall Micro | Dataset
Baseline (threshold 0.5)         | 0.625    | 0.461    | 0.582           | 0.674        | Validation
+Threshold Tuning                | 0.682    | 0.499    | 0.657           | 0.708        | Validation
+Category Hyperparameter Tuning  | 0.682    | 0.499    | 0.657           | 0.708        | Validation
Final Model (Test Set)           | 0.677    | 0.518    | 0.621           | 0.743        | Test

Test set performance confirms that basic needs categories meet or approach the F1 > 0.80 target, while rare critical categories remain challenging due to limited training examples.
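The per-category threshold tuning described above can be sketched as a simple grid search over validation probabilities. This is an illustrative reconstruction, not the authors' actual code: the function and variable names (`tune_thresholds`, `val_probs`, `val_labels`) are hypothetical, and the synthetic data stands in for the real model outputs.

```python
# Minimal sketch of per-category decision-threshold tuning, assuming a trained
# multi-label model has produced a probability matrix over the validation set.
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(val_probs, val_labels, grid=np.arange(0.05, 0.95, 0.05)):
    """For each category, pick the threshold that maximizes validation F1."""
    n_categories = val_labels.shape[1]
    thresholds = np.full(n_categories, 0.5)  # baseline: 0.5 everywhere
    for c in range(n_categories):
        scores = [f1_score(val_labels[:, c], val_probs[:, c] >= t, zero_division=0)
                  for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]
    return thresholds

# Demonstration with synthetic probabilities for two categories.
rng = np.random.default_rng(0)
val_labels = rng.integers(0, 2, size=(200, 2))
val_probs = np.clip(val_labels + rng.normal(0, 0.4, size=(200, 2)), 0, 1)

thresholds = tune_thresholds(val_probs, val_labels)
preds = (val_probs >= thresholds).astype(int)  # broadcast per-category cutoffs
print("tuned thresholds:", thresholds)
print("micro F1:", f1_score(val_labels, preds, average="micro"))
```

Because rare categories such as search and rescue have few positives, their optimal cutoffs (e.g., the 0.75 reported above) can sit far from the default 0.5, which is why tuning each category independently can lift micro F1 without retraining the model.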

Qualitative examination of the optimized XGBoost model's misclassified messages revealed four systematic error patterns. First, messages with ambiguous language, where identical surface-level words could indicate different underlying needs (e.g., "people are stuck" appearing in both search and rescue contexts and transportation contexts), presented significant challenges, as the TF-IDF bag-of-words representation lacks the contextual awareness to disambiguate based on surrounding semantics. Second, messages containing multiple

