
A target variable, ‘potential_fwa,’ was created to identify potential cases of fraud, waste, and abuse in healthcare service utilization data. It was derived by applying logical conditions that flag anomalies in ‘number_of_fee_for_service_beneficiaries_dual_color’ and unusually high values of ‘total_payment_dual_color_values,’ marking instances that warrant further investigation for potentially fraudulent activity. Non-fraudulent cases greatly outweighed fraudulent ones (75% vs. 25%). After the data were split into training and test sets, the training data were rebalanced using the Synthetic Minority Oversampling Technique (SMOTE) to a 50/50 split of non-fraudulent and fraudulent cases. In addition to the target variable, other features were engineered and selected based on their relevance to the analysis objectives, ensuring that the model was trained on the most informative and impactful attributes. One-hot encoding was used to convert categorical variables into numerical format suitable for machine learning models, while binary encoding was applied to categorical variables that would otherwise have introduced a plethora of columns; both encodings were applied before the training/test split. The RobustScaler method was used to standardize the numerical features so that each feature contributed equally to model performance; scaling was implemented after the training/test split.
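The code below is a minimal sketch of this preprocessing pipeline in Python using pandas, scikit-learn, imbalanced-learn, and category_encoders. The 95th-percentile anomaly thresholds used to derive ‘potential_fwa,’ the 80/20 split ratio, and the column lists passed to the encoders (‘provider_type,’ ‘state’) are illustrative assumptions, since the exact flagging rules and column assignments are not spelled out here.

```python
# Minimal sketch of the preprocessing steps, assuming a pandas DataFrame `df`
# containing the utilization columns named in the text.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import SMOTE
import category_encoders as ce

# 1. Derive the target: flag rows with anomalous beneficiary counts or
#    unusually high total payments (hypothetical 95th-percentile rule).
bene = df['number_of_fee_for_service_beneficiaries_dual_color']
pay = df['total_payment_dual_color_values']
df['potential_fwa'] = ((bene > bene.quantile(0.95)) |
                       (pay > pay.quantile(0.95))).astype(int)

# 2. Encode categoricals before the split: one-hot for low-cardinality
#    columns, binary encoding for high-cardinality columns that would
#    otherwise explode into many one-hot columns (column names assumed).
df = pd.get_dummies(df, columns=['provider_type'], drop_first=True)
df = ce.BinaryEncoder(cols=['state']).fit_transform(df)

# 3. Split, then rebalance only the training data with SMOTE to 50/50.
X = df.drop(columns='potential_fwa')
y = df['potential_fwa']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, y_train = SMOTE(sampling_strategy=1.0,
                         random_state=42).fit_resample(X_train, y_train)

# 4. Scale numeric features after the split, fitting on the training data
#    only so no test-set statistics leak into the model.
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```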

Chi-square tests were employed to identify statistically significant features in the dataset. Features with a p-value of 1.00, such as ‘refrence_period’ and ‘moratorium,’ were excluded from further analysis because they showed no statistically significant association with the target, whereas columns with low p-values were retained because they exhibited a meaningful association with ‘potential_fwa.’ To handle multicollinearity, the variance inflation factor (VIF) was calculated for each feature; features with high VIF scores were removed from the data frame, and VIF scores were compared before and after this transformation. The correlation matrix (see Figure 3) served as another feature-selection tool: features that were highly correlated with one another (> 0.70) were removed from the data frame. After feature engineering, the data contained 1,044,354 rows and 62 features, 18 features more than before. Dimensionality reduction was then considered to reduce computational complexity and mitigate potential overfitting in the modeling process. During multivariate graphical analysis, principal component analysis (PCA) was performed to visualize the principal components of the scaled training dataset and reduce its dimensionality. A scree plot was used to determine the optimal number of principal components (see Figure 6).
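The sketch below illustrates how these feature-selection steps could be implemented, continuing from the `df` and training split in the preprocessing sketch above. Only the 0.70 correlation cutoff comes from the text; the 0.05 p-value threshold, the VIF cutoff of 10, and the ‘categorical_cols’ list are assumptions for illustration.

```python
# Minimal sketch of the feature-selection steps described above.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

# 1. Chi-square test of independence between each categorical feature and
#    the target; features with high p-values (e.g., 1.00) show no association.
def chi2_pvalue(feature, target):
    return chi2_contingency(pd.crosstab(feature, target))[1]

pvals = {c: chi2_pvalue(df[c], df['potential_fwa']) for c in categorical_cols}
significant = [c for c, p in pvals.items() if p < 0.05]

# 2. Multicollinearity: iteratively drop the feature with the highest VIF
#    until every remaining VIF falls below the threshold (10 assumed).
def drop_high_vif(X, threshold=10.0):
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns)
        if vifs.max() <= threshold:
            return X
        X = X.drop(columns=vifs.idxmax())

X_train = drop_high_vif(X_train)

# 3. Correlation filter: drop one member of each feature pair whose absolute
#    pairwise correlation exceeds 0.70.
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X_train = X_train.drop(columns=[c for c in upper.columns
                                if (upper[c] > 0.70).any()])

# 4. PCA scree plot on the scaled training data: explained variance per
#    component guides how many principal components to retain.
pca = PCA().fit(RobustScaler().fit_transform(X_train))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.title('Scree plot')
plt.show()
```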

