ADS Capstone Chronicles Revised
10
allow240requestsperminutewithamaxof 120,000 requests per day per key. The data isupdatedquarterlyandstoredinJavaScript Object Notation (J SON) format. The exact date of data updates have historically not landed on the exact same dates and willbe monitored to keep the trigger system up-to-date. The FAERS database is accessed via the OpenFDA Drug Adverse Event API (>28 million records; FDA,2023b).Thereare42 fields - numerical, dates, categorical, free text - and some of these fields are nested dictionary lists that were expanded into additional dataframes. The API request contained parameters to return datawithout anymissingvaluesforage,sex,andweight, and only reports from healthcare professionals, thereby returning a complete dataset for the latest three months of data (January 23, 2024 to April 23, 2024 at the time of this project). The historical record API contains press releasesandpublicannouncementsfromthe FDA and its predecessors (3 fields), the OpenFDA pharmaceutical drug label API contains drug marketing and label information (140 fields), the National Drug CodeAPIcontainsactivedrugmanufacturer information and actively marketing drugs which is updated daily (FDA, n.d.a). The Data.Medicaid API was used for national average drug acquisition costs and is updated weekly (12 variables; CMMS, 2024). RxNorm is a database system that contains standardized information on drug compounds and their respective classifications (NIH, n.d.). openFDA and Data.Medicaidusedifferentversionsofdrug codes. Thus, to link information between openFDAandData.Medicaid,alldrugcodes
fromcompoundsreportedFAERSdatawere requested from RxNorm’s ndcproperties API endpoint. 4.2 Preprocessing The raw data from the API endpoints went throughaseriesofcleaningstepsdepending on data types. First,theJSONoutputsfrom the API requests were converted to dataframes.Thedataframeswereexamined for data structure, quality, and redundancy. 4.2.1 Functions. Multiple custom functions (“Functions.ipynb”) were created to aid in general preprocessing (Staggs & van der Wagt, n.d.). These include functions for natural language processing (NLP) pipelines, adding index columns, exploring and handling null fields, removing duplicates, standardizing age based on unit of measurement, imputing missing value, examining text token length outliers, and calculating summary statistics for model performance and input features. Each text variable was processed with different cleaning steps based on the underlying structure and inconsistencies. 4.2.1.1 Optimization Wrapper Function . A wrapper function was created to allow for parallel processing on all available CPU cores for all custom functions. This improved processingtimesfortextcleaning and text transformation functions. 4.2.2 Cleaning and Data Quality by Table 4.2.2.1DocumentsTable. Thedataextracted from the historical documents API released by the FDA contained no missing values. The text contents of the documents were scannedtofindanydrugnamesandadverse drug reactions. Drug names were matched against brand and generic names from the labels dataframe, and drug reactions were matched against all reactions in the patient
160
Made with FlippingBook - Online Brochure Maker