Date

Data Analysis

In this notebook, we'll perform the actual analysis of the data stored in the directory

    ./data/cleaned

We store our simulated data into the directory

    ./data/simulated

The graphics produced as a result of our analysis are stored in the directory:

    ./visualization

Setup

To run the maps in this notebook, there are dependencies for the IPyD3 module we load. To install them, run:

sudo pip install titlecase
sudo apt-get install phantomjs

NumPy should already be installed.

In [31]:
%load_ext rmagic
The rmagic extension is already loaded. To reload it, use:
  %reload_ext rmagic

Creating a Data Frame combining all the Variables

We take the relevant variables from our previous cleaning process and place them all in a single data frame.

In [9]:
%%R
source('./scripts/hiv_loader_clean.R')

all_data = get_all_data()
DF_Analysis = data.frame(
    uninsured=all_data$insured$"Uninsured",
    has_sex_ed_6_9=all_data$sex_ed_6_9$"Required.course.covers.how.to.prevent.HIV..other.STDs..and.pregnancy",
    has_sex_ed_9_12=all_data$sex_ed_9_12$"Required.course.covers.how.to.prevent.HIV..other.STDs..and.pregnancy",
    citizenship=all_data$citizenship$"Citizen",
    federal_funds=all_data$federal_funds$"Total",
    test_rate=all_data$test_rate$"HIV.Testing.Rate...Ever.Tested",
    under_fpl=all_data$fpl$"Under.100.",
    income=all_data$income$"Median.Annual.Income",
    unemployment=all_data$unemployment$"October.2013",
    hospital_for_profit=all_data$hospital$"For.Profit",
    chlamydia=all_data$chlamydia$"Chlamydia.Cases", #These following are all per-100,000 numbers
    syphilis=all_data$syphilis$"Syphilis.Cases",
    gonorrhea=all_data$gonorrhea$"Gonorrhea.Cases",
    
    target=all_data$hiv_diagnoses$"Total"
)

row.names(DF_Analysis) = all_data$hiv_diagnoses$Location

print(head(DF_Analysis))
all_data = DF_Analysis
[1] "Loaded: hiv_diagnoses, citizenship_status, fpl, income, unemployment, insured, hospital, federal_funds, test_rate, sex_ed_6_9, sex_ed_9_12, chlamydia, gonorrhea, syphilis"
   uninsured has_sex_ed_6_9 has_sex_ed_9_12 citizenship federal_funds test_rate
AL 0.1395612          0.758           0.827   0.9802152    0.43885230     0.420
AK 0.1884869          0.446           0.659   0.9586563    0.03095699     0.443
AZ 0.1774917          0.323           0.716   0.9140282    0.37920656     0.341
AR 0.1794378          0.819           0.969   0.9656608    0.13691627     0.311
CA 0.1883828          0.760           0.933   0.8606448    4.33469436     0.392
CO 0.1470530             NA              NA   0.9442321    0.40883331     0.361
   under_fpl income unemployment hospital_for_profit chlamydia syphilis
AL 0.2196644  42245        0.065               0.382     29626      758
AK 0.1981696  60566        0.065               0.087      5739       11
AZ 0.2269776  48319        0.082               0.271     29251      906
AR 0.2431548  39806        0.075               0.274     16052      464
CA 0.2312397  56074        0.087               0.214    166773     6782
CO 0.1662836  59803        0.068               0.207     21811      367
   gonorrhea target
AL      9132   20.9
AK       984    4.6
AZ      4564   13.3
AR      4687   10.0
CA     27516   19.2
CO      2363    9.6

It's difficult to glean much information from the scatterplots, but it's interesting to note that there are some correlations. In particular, our variable of interest, the number of HIV diagnoses per 100,000 people, does seem to have a correlation with a few of the variables we've indicated.

In [10]:
%%R -r 110 -w 2000 -h 2000
plot(DF_Analysis[3:14])

Linear Regression Analysis

To begin our analysis, we remove the rows (states) with NA values. The 40 states left will serve as our training data for the modeler, and we'll test the results against the 10 removed states.

In [11]:
%%R
removeNA = function(data){
  for(i in 3:ncol(data)){
    ind = which(is.na(data[,i]))

    if(length(ind)>0){                        
      data = data[-ind,]
    }
  }
  
  for(k in 3:ncol(data)){                     
    ind2 = which(is.nan(data[,k]))     

    if(length(ind2)>0){
      data = data[-ind2,]                     
    }
  }

  print(dim(data))                            
  return(data)          
}

DF_Analysis = removeNA(DF_Analysis)

abbreviations = c('AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY')
removed_states = abbreviations[!(abbreviations %in% row.names(DF_Analysis))]
print(removed_states)
[1] 40 14
 [1] "CO" "GA" "HI" "IL" "LA" "NM" "NY" "RI" "VT" "WY"

We've removed Colorado, Georgia, Hawaii, Illinois, Louisiana, New Mexico, New York, Rhode Island, Vermont, and Wyoming from creating the model because of NA values in their rows.

Next, we run the regression, settling on 10 iterations. The function does a linear regression, and removes the variable with the highest P value greater than 0.05. After 10 regressions, or if no variables have P value greater than 0.05, we consider the algorithm terminated having determined the variables with the greatest correlation to the target, or HIV Diagnoses.

In [12]:
%%R
run_regression = function(DF_Analysis, remove_vec) {
    for(i in 1:10){
        lm.out = lm(DF_Analysis$target~., data = DF_Analysis[,-remove_vec])
        stat_temp = summary(lm.out)
        stat = data.frame(stat_temp[[4]])
        stat = stat[-1,]
        if(max(stat$Pr...t..)>0.05) {
            stat = stat[order(stat$Pr...t..),]
            least_exp = rownames(stat)[nrow(stat)]
            cat(sprintf("The least explanatory variable in linear model fitting test %d is %s\n", i, least_exp))
            col.num = which(colnames(DF_Analysis)==least_exp)
            remove_vec = c(remove_vec, col.num)
        }
        else{break}
    }

    return(remove_vec)
}

removed = run_regression(DF_Analysis, 14)

cat("\n\n")

lm.out8 = lm(DF_Analysis$target~., data = DF_Analysis[,-removed])
stat_temp8 = summary(lm.out8)
stat8 = data.frame(stat_temp8[[4]]) 
cat("The result of final model fitting test\n\n")
print(stat8)

model_str = sprintf("target = %f", stat8$Estimate[1])
for (i in 2:(15-length(removed))) {
    model_str = paste(model_str, sprintf(' + %f*%s',stat8$Estimate[i],rownames(stat8)[i]))
}

cat("\n\n")
cat("The model that we found is\n\n")
cat(model_str)
The least explanatory variable in linear model fitting test 1 is has_sex_ed_6_9
The least explanatory variable in linear model fitting test 2 is syphilis
The least explanatory variable in linear model fitting test 3 is income
The least explanatory variable in linear model fitting test 4 is has_sex_ed_9_12
The least explanatory variable in linear model fitting test 5 is unemployment
The least explanatory variable in linear model fitting test 6 is under_fpl
The least explanatory variable in linear model fitting test 7 is citizenship


The result of final model fitting test

                         Estimate   Std..Error   t.value     Pr...t..
(Intercept)         -1.151656e+01 4.631387e+00 -2.486632 1.812971e-02
uninsured           -5.202591e+01 2.052507e+01 -2.534750 1.617754e-02
federal_funds        5.499537e+00 1.846258e+00  2.978748 5.393396e-03
test_rate            8.020801e+01 1.291732e+01  6.209339 5.217136e-07
hospital_for_profit  1.339964e+01 5.959283e+00  2.248533 3.134033e-02
chlamydia           -2.416097e-04 8.926969e-05 -2.706514 1.068060e-02
gonorrhea            8.811796e-04 3.079856e-04  2.861106 7.271932e-03


The model that we found is

target = -11.516558  + -52.025914*uninsured  + 5.499537*federal_funds  + 80.208011*test_rate  + 13.399644*hospital_for_profit  + -0.000242*chlamydia  + 0.000881*gonorrhea