The rigorous feature engineering process produced a dataset enriched with valuable indicators, well positioned for time series modeling and forecasting. These carefully crafted features enhanced the predictive power of the models, yielding more accurate and reliable forecasts.

5.1 Model and Evaluation of Findings

The modeling phase followed a structured approach designed to produce precise and insightful results. The dataset was first divided into training and test sets while maintaining temporal order. This split was crucial for evaluating the models on unseen data and preserving the integrity of the time series analysis. Specifically, the training set comprised 8,584 rows and 24 columns, and the test set 2,180 rows and 24 columns. The training set covered July 16, 2021, through December 2, 2023, while the test set ranged from December 3, 2023, to July 8, 2024.

5.1.1 Data Preprocessing

Data preprocessing is critical to preparing the dataset for modeling and forecasting. It involved several stages: handling multicollinearity, addressing missing values, splitting the data, scaling, and analyzing feature importance. Together, these steps ensured high data quality for the modeling and forecasting tasks.

5.1.2 Handling Multicollinearity

Preprocessing began with an examination of potential multicollinearity in the dataset. A correlation heatmap was created for each product ID to visually identify pairs of highly correlated features, and features with a pairwise correlation greater than 0.90 were considered for removal. The Variance Inflation Factor (VIF) was then calculated for the numerical features, and features with VIF values greater than 10 were iteratively removed, while crucial features such as 'volume', 'pct_change', 'time', 'day_of_week', 'price_change', and 'volatility' were retained. A sketch of this iterative filtering appears below.
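The following sketch shows one way the iterative VIF filtering described above could be implemented with pandas and statsmodels. The drop_high_vif helper, the dropna() step, and the exact protected-column spellings are illustrative assumptions rather than the project's verbatim code.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Features the text says must survive the filtering (assumed spellings).
PROTECTED = ["volume", "pct_change", "time", "day_of_week",
             "price_change", "volatility"]

def drop_high_vif(df: pd.DataFrame, threshold: float = 10.0,
                  protected=PROTECTED) -> pd.DataFrame:
    """Iteratively drop the non-protected numeric feature with the
    highest VIF until all remaining VIFs are at or below `threshold`."""
    num = df.select_dtypes(include=np.number).dropna().copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(num.values, i)
             for i in range(num.shape[1])],
            index=num.columns,
        )
        # Only non-protected columns are eligible for removal.
        candidates = vifs.drop([c for c in protected if c in vifs.index])
        if candidates.empty or candidates.max() <= threshold:
            return num
        num = num.drop(columns=candidates.idxmax())
```

Recomputing the VIFs after every single removal, rather than dropping all high-VIF columns at once, matters because removing one collinear feature can sharply lower the VIFs of the others.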
5.1.3 Addressing Missing Values

Removing features during the multicollinearity and VIF analysis introduced additional missing values. A threshold of 35% missing values was set, and columns exceeding it were dropped. For the remaining columns, numerical gaps were imputed by interpolation, while forward and backward fill were applied to categorical values, preserving continuity and integrity by estimating missing entries from the surrounding data points.

5.1.4 Data Splitting

The dataset was split into training and testing sets following the 80% training and 20% testing split that is standard for time series data. The training set spanned July 16, 2021, to December 2, 2023, while the test set ranged from December 3, 2023, to July 8, 2024. Splitting chronologically helps prevent data leakage and ensures an accurate evaluation of model performance.

5.1.5 Scaling and Normalization

Scaling and normalization were applied with the TimeSeriesScalerMeanVariance tool from the tslearn library, which standardizes each time series to a mean of zero and a variance of one. This preserves the statistical properties of each series while supporting consistent model performance.

5.1.6 Feature Importance Analysis

Before modeling, feature importance was assessed using Ridge Regression. A grid search identified the best alpha value for each product ID, tuning the model for optimal performance. This analysis highlighted the most influential features and guided the decision to retain all of them. Taken together, these preprocessing steps, from handling multicollinearity and missing values to splitting, scaling, and feature importance analysis, prepared the dataset for modeling; two sketches of these steps follow.
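As a concrete illustration of the steps in 5.1.3 through 5.1.5, the sketch below combines the 35% missing-value threshold, interpolation with forward/backward fill, the chronological 80/20 split, and tslearn scaling. Function and variable names are assumptions; note that TimeSeriesScalerMeanVariance normalizes each series it is given independently, so under this sketch the training and test sets are each standardized to their own mean and variance.

```python
import numpy as np
import pandas as pd
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

def preprocess(df: pd.DataFrame):
    # 5.1.3: drop columns with more than 35% missing values.
    df = df.loc[:, df.isna().mean() <= 0.35].copy()
    num_cols = df.select_dtypes(include=np.number).columns
    cat_cols = df.columns.difference(num_cols)
    df[num_cols] = df[num_cols].interpolate()    # numeric: interpolation
    df[cat_cols] = df[cat_cols].ffill().bfill()  # categorical: fill

    # 5.1.4: chronological 80/20 split (df assumed sorted by timestamp).
    cut = int(len(df) * 0.8)
    train, test = df.iloc[:cut], df.iloc[cut:]

    # 5.1.5: standardize each series to zero mean and unit variance.
    # tslearn expects a 3-D array (n_series, n_timesteps, n_features).
    scaler = TimeSeriesScalerMeanVariance(mu=0.0, std=1.0)
    train_scaled = scaler.fit_transform(train[num_cols].to_numpy()[np.newaxis])[0]
    test_scaled = scaler.fit_transform(test[num_cols].to_numpy()[np.newaxis])[0]
    return train_scaled, test_scaled
```

The Ridge-based feature importance in 5.1.6 could then take the following shape. The alpha grid and the use of TimeSeriesSplit for cross-validation are assumptions, and the text indicates the search was repeated for each product ID.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

def ridge_importance(X_train: pd.DataFrame, y_train: pd.Series) -> pd.Series:
    """Grid-search alpha, then rank features by absolute coefficient."""
    search = GridSearchCV(
        Ridge(),
        param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},  # assumed grid
        cv=TimeSeriesSplit(n_splits=5),  # folds respect temporal order
        scoring="neg_mean_squared_error",
    )
    search.fit(X_train, y_train)
    coefs = pd.Series(search.best_estimator_.coef_, index=X_train.columns)
    return coefs.abs().sort_values(ascending=False)
```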