ADS Capstone Chronicles Revised
to generate binary reaction indicator columns.

4.3.1 Binary Reaction Indicator Columns. Binary reaction indicator columns were added to both the documents and events dataframes. Each unique value in the reactions field of the events dataframe was stored in a list. This list was filtered to limit the character length of each reaction to 64 (MySQL’s column name character limit). Additionally, only the 1000 most frequent reactions were kept in the list to aid in dimensionality reduction. Each of these values was then added as a new binary column to the documents and events dataframes, with a default value of 0. In the documents dataframe, if any text matched a unique drug from the labels dataframe, it was extracted and saved as a new dataframe row in a newly created drugs column. If the associated article contained text matching any of the unique reactions, the corresponding binary value was changed to 1.

4.3.2 National Drug Code Standardizing. National Drug Codes (NDCs) come in several related formats that vary in structure (Table 1). The first set of numbers refers to the manufacturer code, the middle set refers to the drug formulation code, and the last two numbers in versions 11 and 10 refer to the specific packaging version. Version 9 does not contain information on packaging and is the most common version found across data sources. To match NDCs across data sources, string cleaning functions were used to convert version 11 into version 9 when needed, if both numbers were not available from the same source. Version 10 was not used.

Table 1
National Drug Code Standardization

Version   String Format   Standardized
11        12345006789     Remove “89”
10        12345-067-89    Not used
9         12345-067       Replace “-” with “0”
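The standardization rules in Table 1 can be sketched as a single string-cleaning function. This is a minimal illustration, not the project's actual code: the function name standardize_ndc is hypothetical, and it handles only the three string formats shown in Table 1, returning None for version 10 and for anything unrecognized.

```python
import re


def standardize_ndc(ndc):
    """Return the 9-digit standardized NDC string, or None if the format is not used.

    Hypothetical helper: covers only the formats listed in Table 1.
    """
    ndc = ndc.strip()
    if re.fullmatch(r"\d{11}", ndc):
        # Version 11: drop the trailing two-digit packaging code.
        return ndc[:9]
    if re.fullmatch(r"\d{5}-\d{3}", ndc):
        # Version 9: replace the hyphen with "0".
        return ndc.replace("-", "0")
    # Version 10 (e.g. 12345-067-89) and any other format are not used.
    return None


from_v11 = standardize_ndc("12345006789")
from_v9 = standardize_ndc("12345-067")
from_v10 = standardize_ndc("12345-067-89")
```

With the example values from Table 1, the version 11 and version 9 strings both reduce to the same nine digits, which is what allows records from different sources to be joined on the code.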
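The binary indicator step described in Section 4.3.1 can be sketched in the same way. The example below uses plain Python lists and dictionaries as a stand-in for the dataframes, and it rests on several assumptions not stated in the text: reaction names longer than 64 characters are dropped rather than truncated, matching is a case-insensitive substring test, and the function names select_reactions and add_indicators are hypothetical.

```python
from collections import Counter

MAX_NAME_LEN = 64     # MySQL column name character limit cited in the text
MAX_REACTIONS = 1000  # number of most frequent reactions kept


def select_reactions(event_reactions, max_len=MAX_NAME_LEN, top_n=MAX_REACTIONS):
    """Return the top_n most frequent reaction names no longer than max_len.

    Assumption: over-long names are dropped, not truncated.
    """
    counts = Counter(
        reaction
        for reactions in event_reactions
        for reaction in reactions
        if len(reaction) <= max_len
    )
    return [name for name, _ in counts.most_common(top_n)]


def add_indicators(rows, reactions, text_key="text"):
    """Add one 0/1 entry per reaction to each row; 1 when the text mentions it.

    Assumption: a naive case-insensitive substring match.
    """
    for row in rows:
        text = row[text_key].lower()
        for reaction in reactions:
            row[reaction] = 1 if reaction.lower() in text else 0
    return rows


# Toy data: one list of reactions per adverse event.
events = [["nausea", "headache"], ["nausea", "rash"], ["nausea", "rash"]]
kept = select_reactions(events, top_n=2)
docs = add_indicators([{"text": "Patient reported nausea after dosing."}], kept)
```

Capping the list before creating columns keeps the table width bounded, which is the dimensionality reduction the section refers to.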
4.4 Database Creation

A new database called pharma_db was created using a local SQL connection (Figure 6). Table parameters were specified based on the data type characteristics of each respective processed dataframe. Each data table has a unique primary index:

- doc_id
- event_id
- patient_reaction_id
- patient_drug_id
- label_id
- price_id
- manu_id

Additional indices were created for rxcui and ndc codes across the prices, patient drugs, and labels tables. Constraints were placed on the patient_drugs and patient_reactions tables whereby any updates made to the parent table, adverse_events, would also update the two child tables. VARCHAR limits for text data and BIGINT limits were determined based on the text lengths and NDC code lengths.
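The parent and child relationship described above can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the local MySQL connection, and it assumes the constraint was a foreign key with ON UPDATE CASCADE, which the text implies but does not name. Only the table and column names mentioned in this section are taken from the source; the index names and the reduced column set are illustrative.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the local pharma_db connection.
conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript(
    """
    CREATE TABLE adverse_events (
        event_id INTEGER PRIMARY KEY
    );
    CREATE TABLE patient_drugs (
        patient_drug_id INTEGER PRIMARY KEY,
        event_id INTEGER NOT NULL,
        ndc BIGINT,
        rxcui BIGINT,
        -- Assumed constraint: key updates on the parent propagate to the child.
        FOREIGN KEY (event_id) REFERENCES adverse_events (event_id)
            ON UPDATE CASCADE
    );
    -- Additional indices on the code columns used for joins.
    CREATE INDEX idx_patient_drugs_ndc ON patient_drugs (ndc);
    CREATE INDEX idx_patient_drugs_rxcui ON patient_drugs (rxcui);
    """
)

conn.execute("INSERT INTO adverse_events (event_id) VALUES (1)")
conn.execute(
    "INSERT INTO patient_drugs (patient_drug_id, event_id, ndc) "
    "VALUES (10, 1, 123450067)"
)

# Changing the parent key should cascade to the child row.
conn.execute("UPDATE adverse_events SET event_id = 2 WHERE event_id = 1")
child_event_id = conn.execute(
    "SELECT event_id FROM patient_drugs WHERE patient_drug_id = 10"
).fetchone()[0]
```

The same pattern would apply to patient_reactions as the second child table. In MySQL the equivalent clause is also written ON UPDATE CASCADE, although column types and engine settings differ from SQLite.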