Chapter 2 Data sources

2.1 Datasets

The datasets used for the project are:

https://www.datafiles.samhsa.gov/dataset/treatment-episode-data-set-admissions-2000-2019-teds-2000-2019-ds0001. This dataset was published by the U.S. Department of Health & Human Services and is available through the Substance Abuse & Mental Health Data Archive. It contains records of people admitted for alcohol or drug abuse treatment in various states. It serves as a good proxy for understanding the underlying trends in drug abuse.
https://www2.census.gov/programs-surveys/popest/datasets/2010/2010-eval-estimates/sc-est2010-alldata5.csv. This dataset was published by the United States Census Bureau. It contains information about each state's population across dimensions such as age, gender, and race. The data covers 2000 to 2010.
https://www2.census.gov/programs-surveys/popest/datasets/2010/2010-eval-estimates/sc-est2010-alldata5.csv. This dataset was published by the United States Census Bureau. It contains information about each state's population across dimensions such as age, gender, and race. The data covers 2010 to 2020.

We explored other data sources as well. No source provides well-documented drug abuse trends directly. Two other notable data sources are:

Monitoring the Future survey data (https://www.icpsr.umich.edu/web/NAHDAP/series/35). This survey was designed to explore changes in important values, behaviors, and lifestyle orientations of contemporary American youth. The surveys began in 1975 with 12th-grade students only. Eighth- and 10th-grade student surveys were added in 1991. It is an ongoing survey.

Issue: The dataset is organized around 3000 questions, each represented by a column. Given our limited time, we could not use it, as we would have had to understand all 3000 questions to make use of it.

CDC (Centers for Disease Control and Prevention) Underlying Cause of Death data (https://wonder.cdc.gov/controller/datarequest/D76).

Issue: This data has few variables, so we would not be able to understand the underlying trends between demographics and drug abuse.

2.2 Data Description

All the data collected was provided in CSV format.

The treatment dataset initially contained 37558015 rows and 62 columns. After preprocessing, it was reduced to 9493737 rows and 14 columns. The data transformations done to achieve this can be found here. The data dictionary along with an understanding of variables can be found here.

The combined population data contained 394740 rows and 21 columns. The data dictionary along with an understanding of variables can be found here.

2.3 Issues with the data

The treatment data does not cover all admissions. It is a huge dataset meant to be used to understand the characteristics of substance abuse, form hypotheses, and understand the underlying trends.
Some of the data has missing values. Fortunately, because data is abundant, we can drop these rows and proceed with the rows that are complete. This strategy does not work for columns with a large share of missing data, however, so those columns are kept as they are in our analysis.
Because the dataset is enormous, preprocessing takes a long time. We could not work directly with the delimited files and had to convert them to feather format to ensure fast reading and writing.
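
The two workarounds described in this section, dropping rows with missing values and converting the delimited files to feather, could look roughly like the sketch below. It is written in Python with pandas and is not the project's actual code. The function names, the toy column names, and the 20% cutoff that separates "mostly complete" columns from columns with a large share of missing data are all assumptions made for illustration. It also assumes missing values are already encoded as NaN, and that the pyarrow package is installed, which pandas needs for feather files.

```python
import pandas as pd


def drop_rows_missing_in_dense_columns(df, max_missing_share=0.2):
    """Drop rows that have gaps in mostly complete columns.

    Columns whose share of missing values exceeds max_missing_share
    (an assumed cutoff) are left untouched, gaps included.
    """
    missing_share = df.isna().mean()  # fraction of missing values per column
    dense_cols = missing_share[missing_share <= max_missing_share].index.tolist()
    return df.dropna(subset=dense_cols)


def csv_to_feather(csv_path, feather_path, usecols=None):
    """Parse a delimited file once and store it as feather (requires pyarrow)."""
    frame = pd.read_csv(csv_path, usecols=usecols)
    frame.to_feather(feather_path)
    return frame.shape


# Toy data with hypothetical column names, not the real treatment variables.
toy = pd.DataFrame({
    "age_group": [1, 2, None, 4, 5, 6, 7, 8, 9, 10],
    "primary_substance": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    "rarely_recorded": [None] * 8 + [1, 2],
})

# The row with a missing age_group is dropped; rarely_recorded keeps its gaps.
cleaned = drop_rows_missing_in_dense_columns(toy)
print(cleaned.shape)
```

After a one-time call such as csv_to_feather("treatment.csv", "treatment.feather") (hypothetical file names), later runs can load the data with pd.read_feather instead of parsing the delimited file again.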