ADS Capstone Chronicles Revised
4.2 Data Cleaning

Comprehensive data cleaning and preprocessing were conducted to ensure the accuracy and reliability of the subsequent analyses. The initial dataset comprised the following columns: ‘time,’ ‘low,’ ‘high,’ ‘open,’ ‘close,’ ‘volume,’ ‘price_change,’ ‘average_price,’ ‘product_id,’ and ‘load_dt.’ The first step was to convert the ‘time’ column to datetime format, facilitating accurate time series analysis. Additionally, the ‘day_of_week’ column was mapped to the appropriate day names, ensuring the correct representation of each day (e.g., 0 for Monday, 1 for Tuesday).

4.2.1 Time Conversion

The columns were then separated into numerical and categorical types, and their data types were corrected accordingly. This step was crucial to maintaining the accuracy of further analyses and feature engineering. Because the dataset contained ten unique product IDs, each product was handled separately in all subsequent steps to preserve data integrity and avoid potential biases.

4.2.2 Handling Missing Values

Handling missing values was a significant challenge. Forward and backward fill methods were employed, which are particularly suitable for time series data: forward fill propagates the last observed value forward, while backward fill propagates the next observed value backward. This approach maintains the continuity and trends in the dataset, which are crucial for accurate analysis and modeling. For categorical columns, forward and backward fill was likewise applied to ensure no missing values remained, treating each product ID independently to preserve its unique characteristics.
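The cleaning steps described above can be sketched in pandas as follows; the sample data, column subset, and variable names are illustrative assumptions, not taken from the original code.

```python
import pandas as pd

# Illustrative sample with a subset of the dataset's columns.
df = pd.DataFrame({
    "time": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "close": [100.0, None, 104.5],
    "product_id": ["BTC-USD", "BTC-USD", "BTC-USD"],
})

# Convert the time column to datetime for time series analysis.
df["time"] = pd.to_datetime(df["time"])

# Derive day-of-week integers (0 = Monday) and map them to day names.
day_names = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday",
             4: "Friday", 5: "Saturday", 6: "Sunday"}
df["day_of_week"] = df["time"].dt.dayofweek
df["day_name"] = df["day_of_week"].map(day_names)

# Fill missing values per product: forward fill first, then backward
# fill, so each product's series is filled only from its own history.
df = df.sort_values("time")
df["close"] = df.groupby("product_id")["close"].transform(
    lambda s: s.ffill().bfill()
)
```

Grouping by ‘product_id’ before filling is what keeps the ten products independent: a gap in one product's series is never filled with a value from another product.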
4.2.3 Replacement of Zero Values

Zero values in numerical columns, excluding ‘day_of_week,’ were replaced with a small value close to zero. This replacement was necessary to avoid potential issues in subsequent modeling steps, as zero values could lead to inaccuracies in calculations and model predictions. Columns such as ‘day_name’ were dropped, since ‘day_of_week’ provided sufficient information.

4.2.4 Final Dataset Preparation

The rigorous data-cleaning process ensured that the dataset was free from inconsistencies, making it suitable for sophisticated analyses and predictive modeling. Efficiently addressing missing and zero values laid a solid foundation for accurate and reliable analysis and prepared the dataset for the subsequent phases.

4.3 Data Preparation

Data preparation involves transforming the cleaned dataset into a format suitable for analysis. This step includes creating new features, normalizing or scaling data, and splitting the dataset into training and testing sets if necessary. For example, additional metrics such as moving averages or percentage changes based on the ‘close’ prices might be calculated, or products might be categorized based on their volatility. Data preparation also involves encoding categorical variables, such as ‘product_id,’ into numerical formats if required for modeling. This step ensures the dataset is ready for in-depth exploratory analysis, visualization, and subsequent machine-learning tasks.

4.4 Heat Map (Correlation)

Understanding the correlation between the different variables extracted and created is imperative to avoid issues during modeling. The
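A rough sketch of the zero-value replacement (Section 4.2.3) and the example features mentioned in Section 4.3 is shown below; the epsilon value, window size, and sample data are assumptions for illustration, not parameters from the original study.

```python
import pandas as pd

# Illustrative per-product slice of the cleaned data.
df = pd.DataFrame({
    "close": [100.0, 0.0, 104.5, 106.0, 103.2],
    "volume": [10.0, 12.0, 0.0, 9.0, 11.0],
})

# Replace zeros in numerical columns with a small value close to zero,
# avoiding division-by-zero and log-of-zero issues downstream.
eps = 1e-6
num_cols = ["close", "volume"]
df[num_cols] = df[num_cols].replace(0, eps)

# Example engineered features: a 3-period moving average and the
# period-over-period percentage change of the close price.
df["close_ma3"] = df["close"].rolling(window=3, min_periods=1).mean()
df["close_pct_change"] = df["close"].pct_change()
```

In a multi-product dataset these feature calculations would be applied per ‘product_id,’ mirroring the per-product handling used during cleaning.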
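The correlation matrix underlying such a heat map might be computed as below; the sample data and the 0.9 threshold are assumptions, and the actual plotting step (e.g., with seaborn.heatmap) is omitted.

```python
import pandas as pd

# Illustrative numeric columns from the prepared dataset.
df = pd.DataFrame({
    "open":   [100.0, 102.0, 101.5, 103.0],
    "close":  [101.0, 101.5, 102.5, 104.0],
    "volume": [10.0, 12.0, 9.0, 11.0],
})

# Pairwise Pearson correlations between numeric columns.
corr = df.corr(numeric_only=True)

# Flag highly correlated pairs as candidates to drop before modeling,
# which helps avoid multicollinearity issues.
high_corr_pairs = [
    (a, b)
    for a in corr.columns
    for b in corr.columns
    if a < b and abs(corr.loc[a, b]) > 0.9
]
```

Inspecting `high_corr_pairs` (or the heat map itself) before modeling makes it easy to spot redundant engineered features, such as a moving average that tracks the raw price almost exactly.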