Date

Data Gathering

In this notebook, we'll

  • create the directory structure for our project
  • download the raw data for our project
  • write functions and classes to transform raw data (e.g., in XML, JSON, etc.) into data frames
  • explain what we are doing

Collection of HIV Data

At first we used R functions to download the data, but the website generates the data dynamically with POST requests. Run against the original site, the following section of code would download empty files, but it shows the URLs we used. We ended up downloading the CSV files manually and serving them from our own files on Dropbox.

In [40]:
%reload_ext rmagic
In [41]:
%%R
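# Project layout: raw downloads in ./data/raw, cleaned tables in ./data/cleaned.
data_dir = './data'
raw_dir = file.path(data_dir, 'raw')
cleaned_dir = file.path(data_dir, 'cleaned')

# showWarnings=FALSE keeps reruns quiet when a directory already exists.
dir.create(data_dir, showWarnings=FALSE)
dir.create(raw_dir, showWarnings=FALSE)
dir.create(cleaned_dir, showWarnings=FALSE)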
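
For readers who prefer to stay in Python, the same layout can be created without the R magic. This is only a sketch, assuming Python 3.2 or later (for `exist_ok`); the function name `create_project_dirs` is ours and not part of the notebook.

```python
import os


def create_project_dirs(root="."):
    """Create data/raw and data/cleaned under root; return the two paths."""
    data_dir = os.path.join(root, "data")
    subdirs = [os.path.join(data_dir, name) for name in ("raw", "cleaned")]
    for path in subdirs:
        # makedirs also creates the parent data directory;
        # exist_ok=True makes the call safe to rerun.
        os.makedirs(path, exist_ok=True)
    return subdirs
```

Calling `create_project_dirs()` from the notebook's working directory should produce the same `./data/raw` and `./data/cleaned` folders as the R cell.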

First, we gathered the rate of HIV diagnoses per 100,000 population among adults and adolescents. This data came from http://kff.org/state-category/hivaids/.
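
As a reminder of what a "rate per 100,000" means, here is a small worked example. The function name and the numbers are hypothetical and are not taken from the data set.

```python
def rate_per_100k(diagnoses, population):
    """Return the number of diagnoses per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply first by a float so the result is exact for simple inputs
    # and does not fall back to integer division on older Pythons.
    return diagnoses * 100000.0 / population


# Hypothetical numbers, for illustration only.
print(rate_per_100k(250, 2000000))  # 12.5
```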

In [42]:
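# Build the URL of our Dropbox copy of the HIV diagnoses file.
protocol = 'https://'
website = 'dl.dropboxusercontent.com/u/30426053/hiv_data/'

csv_key = 'HIV_Diagnoses.csv'
csv_url = protocol + website + csv_key
print(csv_url)
# Fetch the file with curl and save it in the raw data directory.
!curl $csv_url > './data/raw/HIV_Diagnoses.csv'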
https://dl.dropboxusercontent.com/u/30426053/hiv_data/HIV_Diagnoses.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1147  100  1147    0     0   1747      0 --:--:-- --:--:-- --:--:--  1745
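
The same download could be done without shelling out to curl. The sketch below assumes Python 3 and uses the standard library's `urllib.request.urlretrieve`; the helper names `build_csv_url` and `download_csv` are ours, and the Dropbox link may no longer be live.

```python
import urllib.request

# Base URL taken from the cell above.
BASE_URL = "https://dl.dropboxusercontent.com/u/30426053/hiv_data/"


def build_csv_url(csv_key, base_url=BASE_URL):
    """Join the base URL with a CSV file name."""
    return base_url + csv_key


def download_csv(csv_key, raw_dir="./data/raw"):
    """Fetch one CSV into raw_dir and return the local path (needs network)."""
    dest = raw_dir + "/" + csv_key
    urllib.request.urlretrieve(build_csv_url(csv_key), dest)
    return dest
```

For example, `download_csv('HIV_Diagnoses.csv')` would write to the same path as the curl call, provided `./data/raw` already exists.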

Supplementary data was available at another URL.

In [1]:
protocol = 'https://'
website = 'dl.dropboxusercontent.com/u/30426053/hiv_data/'

# Supplementary files; each is saved under ./data/raw with the same name.
csv_keys = [
    'Citizenship_Status.csv',    # Citizenship status
    'fpl.csv',                   # Population distribution by the Federal Poverty Level
    'Median_Annual_Income.csv',  # Median annual income
    'unemployment_rate.csv',     # Unemployment rate
    'Has_Insurance.csv',         # Total population that is insured
    'Hospital_Owners.csv',       # Hospital ownership type
    'Federal_Funds.csv',         # Federal funding for AIDS prevention
    'Test_Rate.csv',             # HIV testing rate
    'Sex_Ed_6_8.csv',            # Sex ed, grades 6-8
    'Sex_Ed_9_12.csv',           # Sex ed, grades 9-12
    'chlamydia_cases.csv',       # Number of reported chlamydia cases
    'gonorrhea_cases.csv',       # Number of reported gonorrhea cases
    'syphilis_cases.csv',        # Number of reported syphilis cases
]

# IPython expands $csv_url and $csv_key before handing the line to the shell.
for csv_key in csv_keys:
    csv_url = protocol + website + csv_key
    !curl $csv_url > ./data/raw/$csv_key
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  2099  100  2099    0     0   2234      0 --:--:-- --:--:-- --:--:--  2235
100  2099  100  2099    0     0   2194      0 --:--:-- --:--:-- --:--:--  2195
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  4023  100  4023    0     0   5961      0 --:--:-- --:--:-- --:--:--  5960
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1045  100  1045    0     0   1611      0 --:--:-- --:--:-- --:--:--  1610
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1394  100  1394    0     0   1464      0 --:--:-- --:--:-- --:--:--  1464
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  4514  100  4514    0     0   5837      0 --:--:-- --:--:-- --:--:--  5847
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1816  100  1816    0     0   2840      0 --:--:-- --:--:-- --:--:--  2864
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  3335  100  3335    0     0   4350      0 --:--:-- --:--:-- --:--:--  4353
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1122  100  1122    0     0   1579      0 --:--:-- --:--:-- --:--:--  1582
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1553  100  1553    0     0   2563      0 --:--:-- --:--:-- --:--:--  2566
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1563  100  1563    0     0   2408      0 --:--:-- --:--:-- --:--:--  2412
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1071  100  1071    0     0   1682      0 --:--:-- --:--:-- --:--:--  1686
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1022  100  1022    0     0   1193      0 --:--:-- --:--:-- --:--:--  1193
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   963  100   963    0     0   1542      0 --:--:-- --:--:-- --:--:--  1543
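
Because the original site returned empty files, it is worth confirming that nothing in the raw directory is zero bytes before moving on. The check below is a minimal sketch, assuming Python 3; the function name `find_empty_files` is ours.

```python
import os


def find_empty_files(raw_dir):
    """Return the sorted names of .csv files in raw_dir that are zero bytes long."""
    empty = []
    for name in sorted(os.listdir(raw_dir)):
        path = os.path.join(raw_dir, name)
        if name.endswith(".csv") and os.path.getsize(path) == 0:
            empty.append(name)
    return empty
```

An empty list from `find_empty_files('./data/raw')` means every CSV has at least some content. It does not check that the content is valid.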

Loading the HIV Data

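In this step, we define a reusable script that loads each of the CSV files into a data frame. In every file, row 4 holds the column names and the data starts on row 5. Each file gets its own function because the files differ in how many data rows and columns they contain.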

In [44]:
%%file ./scripts/hiv_loader.R

options(stringsAsFactors=FALSE)

# Shared loader. In every raw file, row 4 holds the column names and the
# data starts on row 5; only the last data row and the column count differ.
load_raw_csv = function(path, last_row, n_cols) {
    raw = read.csv(path, header = FALSE)
    loaded = raw[5:last_row, 1:n_cols]
    names(loaded) = unlist(raw[4, 1:n_cols])
    row.names(loaded) = 1:nrow(loaded)

    return(loaded)
}

get_hiv_diagnoses_df = function() {
    return(load_raw_csv('./data/raw/HIV_Diagnoses.csv', 60, 2))
}

get_citizenship_status_df = function() {
    return(load_raw_csv('./data/raw/Citizenship_Status.csv', 55, 4))
}

get_fpl_df = function() {
    return(load_raw_csv('./data/raw/fpl.csv', 56, 7))
}

get_income_df = function() {
    return(load_raw_csv('./data/raw/Median_Annual_Income.csv', 56, 2))
}

get_unemployment_df = function() {
    return(load_raw_csv('./data/raw/unemployment_rate.csv', 57, 3))
}

get_insured_df = function() {
    return(load_raw_csv('./data/raw/Has_Insurance.csv', 56, 8))
}

get_hospital_df = function() {
    return(load_raw_csv('./data/raw/Hospital_Owners.csv', 56, 5))
}

get_federal_funds_df = function() {
    return(load_raw_csv('./data/raw/Federal_Funds.csv', 64, 7))
}

get_test_rate_df = function() {
    return(load_raw_csv('./data/raw/Test_Rate.csv', 58, 2))
}

get_sex_ed_6_8_df = function() {
    return(load_raw_csv('./data/raw/Sex_Ed_6_8.csv', 56, 3))
}

get_sex_ed_9_12_df = function() {
    return(load_raw_csv('./data/raw/Sex_Ed_9_12.csv', 56, 3))
}

get_chlamydia_df = function() {
    return(load_raw_csv('./data/raw/chlamydia_cases.csv', 59, 2))
}

get_gonorrhea_df = function() {
    return(load_raw_csv('./data/raw/gonorrhea_cases.csv', 59, 2))
}

get_syphilis_df = function() {
    return(load_raw_csv('./data/raw/syphilis_cases.csv', 59, 2))
}
Overwriting ./scripts/hiv_loader.R
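
The row slicing used by the R functions (column names on row 4, data from row 5 up to a file-specific last row) can be illustrated in Python as well. This is only a sketch, assuming Python 3 and the standard `csv` module; the function name `load_raw_rows` and its parameters are ours. It returns plain lists rather than a data frame, and it drops blank lines, which roughly mirrors the default behaviour of R's `read.csv`.

```python
import csv


def load_raw_rows(path, last_row, n_cols, header_row=4, first_row=5):
    """Return (column_names, data_rows) using 1-based row numbers, as in R."""
    with open(path, newline="") as handle:
        # Skip completely blank lines so row numbers count non-blank rows only.
        rows = [row for row in csv.reader(handle) if row]
    names = rows[header_row - 1][:n_cols]
    # R's 5:last_row is inclusive, so the Python slice ends at last_row.
    data = [row[:n_cols] for row in rows[first_row - 1:last_row]]
    return names, data
```

For instance, `load_raw_rows('./data/raw/HIV_Diagnoses.csv', 60, 2)` would select the same rows and columns as `get_hiv_diagnoses_df`. Unlike R, a file with fewer rows than `last_row` yields a shorter list instead of rows of `NA`.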