Data Analysis¶
In this notebook, we perform the analysis of the cleaned data stored in the directory:
./data/cleaned
We store our simulated data in the directory:
./data/simulated
The graphics produced as a result of our analysis are stored in the directory:
./visualization
Setup¶
The maps in this notebook are drawn with the IPyD3 module, which has its own dependencies. To install them, run:
sudo pip install titlecase
sudo apt-get install phantomjs
NumPy should already be installed.
%load_ext rmagic
Creating a Data Frame combining all the Variables¶
We take the relevant variables from our previous cleaning process and place them all in a single data frame.
%%R
source('./scripts/hiv_loader_clean.R')
all_data = get_all_data()
DF_Analysis = data.frame(
    uninsured=all_data$insured$"Uninsured",
    has_sex_ed_6_9=all_data$sex_ed_6_9$"Required.course.covers.how.to.prevent.HIV..other.STDs..and.pregnancy",
    has_sex_ed_9_12=all_data$sex_ed_9_12$"Required.course.covers.how.to.prevent.HIV..other.STDs..and.pregnancy",
    citizenship=all_data$citizenship$"Citizen",
    federal_funds=all_data$federal_funds$"Total",
    test_rate=all_data$test_rate$"HIV.Testing.Rate...Ever.Tested",
    under_fpl=all_data$fpl$"Under.100.",
    income=all_data$income$"Median.Annual.Income",
    unemployment=all_data$unemployment$"October.2013",
    hospital_for_profit=all_data$hospital$"For.Profit",
    chlamydia=all_data$chlamydia$"Chlamydia.Cases", # This and the remaining columns are per-100,000 figures
    syphilis=all_data$syphilis$"Syphilis.Cases",
    gonorrhea=all_data$gonorrhea$"Gonorrhea.Cases",
    target=all_data$hiv_diagnoses$"Total"
)
row.names(DF_Analysis) = all_data$hiv_diagnoses$Location
print(head(DF_Analysis))
all_data = DF_Analysis
It's difficult to glean much from the scatterplot matrix below, but some correlations are visible. In particular, our variable of interest, the number of HIV diagnoses per 100,000 people (target), appears to be correlated with a few of the other variables.
%%R -r 110 -w 2000 -h 2000
plot(DF_Analysis[3:14])
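To put numbers on what the scatterplot matrix suggests, one option is to compute the Pearson correlation between each variable and target. The sketch below is written in Python, the notebook's host language, and uses only the standard library. The helpers `pearson` and `rank_by_correlation` and the toy values are illustrative assumptions, not part of the analysis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rank_by_correlation(columns, target_name="target"):
    """Return (name, r) pairs sorted by |r| against the target, strongest first."""
    target = columns[target_name]
    pairs = [(name, pearson(values, target))
             for name, values in columns.items() if name != target_name]
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)

# Made-up toy values, for illustration only.
toy = {
    "chlamydia": [1.0, 2.0, 3.0, 4.0],
    "income": [4.0, 1.0, 3.0, 2.0],
    "target": [2.0, 4.0, 6.0, 8.0],
}
ranking = rank_by_correlation(toy)
```

Sorting by absolute value keeps strong negative relationships near the top of the ranking alongside strong positive ones.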
Linear Regression Analysis¶
To begin our analysis, we remove the rows (states) that contain NA values. The 40 remaining states serve as training data for the model, and we'll test its results against the 10 removed states.
%%R
removeNA = function(data){
    # Keep only the rows with no missing values in columns 3 onward.
    # is.na() is also TRUE for NaN, so no separate is.nan() pass is needed.
    keep = complete.cases(data[, 3:ncol(data), drop = FALSE])
    data = data[keep, ]
    print(dim(data))
    return(data)
}
DF_Analysis = removeNA(DF_Analysis)
abbreviations = c('AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY')
removed_states = abbreviations[!(abbreviations %in% row.names(DF_Analysis))]
print(removed_states)
We've excluded Colorado, Georgia, Hawaii, Illinois, Louisiana, New Mexico, New York, Rhode Island, Vermont, and Wyoming from model fitting because their rows contain NA values.
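The split into complete and incomplete rows can be sketched in Python as follows. This is a minimal illustration, assuming rows are stored as plain lists keyed by a label; `is_missing`, `split_complete_rows`, and the toy rows are hypothetical. As in removeNA, only columns from the third onward are checked, so a row with a gap in one of the first two columns is still kept.

```python
import math

def is_missing(value):
    """True for None or a float NaN, the cases R's is.na() would flag."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def split_complete_rows(rows, skip_first=2):
    """Split {label: values} into (complete, incomplete) dicts.

    Only values from position `skip_first` onward are checked, matching the
    3:ncol(data) range in removeNA (R indexes from 1).
    """
    complete, incomplete = {}, {}
    for label, values in rows.items():
        if any(is_missing(v) for v in values[skip_first:]):
            incomplete[label] = values
        else:
            complete[label] = values
    return complete, incomplete

# Made-up rows, for illustration only.
rows = {
    "AA": [0.1, 1.0, 5.0, 7.0],
    "BB": [0.2, 0.0, None, 3.0],
    "CC": [None, 1.0, 2.0, 4.0],          # gap only in an unchecked column
    "DD": [0.3, 1.0, float("nan"), 1.0],
}
train, held_out = split_complete_rows(rows)
```

Here `train` plays the role of the states used for fitting and `held_out` the role of the removed states.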
Next, we run the regression with backward elimination, capped at 10 iterations. On each iteration, the function fits a linear regression and, if any variable has a p-value greater than 0.05, removes the variable with the highest p-value. After 10 regressions, or once no variable has a p-value greater than 0.05, the algorithm terminates. The variables that remain are the ones most strongly associated with target, the HIV diagnosis rate.
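The control flow of this elimination procedure, separated from the regression itself, can be sketched in Python. The regression fit is replaced here by a hypothetical callable, `fit_p_values`, and the p-values are made up and held fixed; in a real run they change after every refit, which is why the model is fitted again on each iteration.

```python
def backward_eliminate(variables, fit_p_values, alpha=0.05, max_iter=10):
    """Drop the least significant variable until all p-values are <= alpha.

    `fit_p_values` is a hypothetical callable: given a list of variable
    names, it refits the model and returns {name: p_value} for them.
    """
    kept = list(variables)
    removed = []
    for _ in range(max_iter):
        if not kept:
            break
        p_values = fit_p_values(kept)
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:
            break  # every remaining variable is significant at alpha
        kept.remove(worst)
        removed.append(worst)
    return kept, removed

# Stand-in for a real regression fit, using fixed, made-up p-values.
FAKE_P = {"income": 0.40, "chlamydia": 0.01, "unemployment": 0.20, "test_rate": 0.03}

def fake_fit(names):
    return {name: FAKE_P[name] for name in names}

kept, removed = backward_eliminate(list(FAKE_P), fake_fit)
```

With these stand-in values the loop stops after two removals, well before the cap of 10 iterations is reached.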
%%R
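run_regression = function(DF_Analysis, remove_vec) {
    # Backward elimination: at most 10 fits, dropping one variable per fit.
    for(i in 1:10){
        # Fit on every column whose index is not in remove_vec.
        lm.out = lm(DF_Analysis$target~., data = DF_Analysis[,-remove_vec])
        stat_temp = summary(lm.out)
        # Element 4 of the summary is the coefficient table; drop the intercept row.
        stat = data.frame(stat_temp[[4]])
        stat = stat[-1,]
        # data.frame() renames the "Pr(>|t|)" column to "Pr...t..".
        if(max(stat$Pr...t..)>0.05) {
            # Sort by p-value; the last row is the least explanatory variable.
            stat = stat[order(stat$Pr...t..),]
            least_exp = rownames(stat)[nrow(stat)]
            cat(sprintf("The least explanatory variable in linear model fitting test %d is %s\n", i, least_exp))
            col.num = which(colnames(DF_Analysis)==least_exp)
            remove_vec = c(remove_vec, col.num)
        } else {
            break
        }
    }
    return(remove_vec)
}
# Column 14 is target, so it is excluded from the predictors from the start.
removed = run_regression(DF_Analysis, 14)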
cat("\n\n")
lm.out8 = lm(DF_Analysis$target~., data = DF_Analysis[,-removed])
stat_temp8 = summary(lm.out8)
stat8 = data.frame(stat_temp8[[4]])
cat("The result of final model fitting test\n\n")
print(stat8)
model_str = sprintf("target = %f", stat8$Estimate[1])
# Append one term per remaining coefficient (row 1 is the intercept).
for (i in seq_len(nrow(stat8))[-1]) {
    model_str = paste(model_str, sprintf(' + %f*%s',stat8$Estimate[i],rownames(stat8)[i]))
}
cat("\n\n")
cat("The model that we found is\n\n")
cat(model_str)
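The printed equation is an intercept plus a weighted sum of the remaining variables. As a minimal Python sketch of how such an equation could be evaluated for one state, for example when comparing predictions against the held-out states: the `predict` helper, the coefficients, and the observation below are made up for illustration and are not results of this analysis. "(Intercept)" is the row name R gives the intercept in its coefficient table.

```python
def predict(coefficients, row, intercept_name="(Intercept)"):
    """Evaluate intercept + sum(coefficient * value) for one observation.

    Variables in `row` that have no coefficient are ignored; a variable
    with a coefficient but no value in `row` raises KeyError.
    """
    total = coefficients.get(intercept_name, 0.0)
    for name, coef in coefficients.items():
        if name != intercept_name:
            total += coef * row[name]
    return total

# Made-up coefficients and observation, for illustration only.
coefs = {"(Intercept)": 1.5, "chlamydia": 0.02, "test_rate": -0.5}
obs = {"chlamydia": 400.0, "test_rate": 4.0, "income": 50000.0}
prediction = predict(coefs, obs)
```

A state can only be scored this way if it has a value for every variable that remains in the model.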