The Report¶
In this notebook, we summarize our findings and present our results.
The HIV Pandemic¶
As of 2012, approximately 35.3 million people were living with HIV globally, and approximately 3.4 million of them were children under 15 years old. According to the Centers for Disease Control and Prevention, about 1.2 million people in the U.S. are living with HIV, and this number is still increasing. To get a handle on the HIV pandemic, it is important to understand which factors are associated with high levels of HIV. The goal of our project is to determine which factors are correlated with low levels of HIV diagnosis and which factors are associated with high levels of HIV diagnosis.
Using data provided by the Kaiser Family Foundation, we will analyze how each of these risk factors correlates with the HIV diagnosis rate (per 100,000): household income, gender, race, citizenship status, the amount of federal funding given for HIV, the type of medical insurance individuals have, the types of hospital available, self-reported testing rates, and the availability of sexual education at the middle and high school levels. We will use R to run a linear regression analysis to measure the correlation between each factor and the HIV diagnosis rate (per 100,000). We will then compare states and rank them according to a weighted model. To create the score used to rank the states, we will take the factors that are highly correlated with the HIV diagnosis rate (per 100,000) and include them in the model. The equation for our model is "HIV_Risk_Value = w1*x1 + w2*x2 + w3*x3 + ...", where each w is a weight and each x is an HIV incidence or quantified population feature. To determine appropriate weights, we will use a regression package in R.
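The weighted score described above is just a dot product of weights and features. As a quick Python sketch (the function name, weights, and feature values here are hypothetical illustrations, not the coefficients our regression later produces):

```python
# Hypothetical sketch of the weighted risk score: HIV_Risk_Value = w1*x1 + w2*x2 + ...
def hiv_risk_value(weights, features):
    """Combine quantified population features into a single risk score."""
    if len(weights) != len(features):
        raise ValueError("need one weight per feature")
    return sum(w * x for w, x in zip(weights, features))

# Made-up weights and features (e.g. uninsured rate, test rate, ...)
score = hiv_risk_value([2.0, -0.5, 1.5], [0.12, 0.40, 0.08])
print(score)  # 2.0*0.12 - 0.5*0.40 + 1.5*0.08 = 0.16
```

States can then be ranked by sorting on this score once the weights are fitted.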
The Data¶
Provided by the Kaiser Family Foundation: http://kff.org/statedata/
The Kaiser Family Foundation provides us with data about citizenship status, racial demography, gender, median income, the insurance type, the ownership of the hospital, and the percentage of schools that mandate sexual education for middle school and high school.
One of the major problems we faced was loading the data into the IPython notebook for analysis. The data is easy to download through a browser, but the website does not lend itself to programmatic download from a notebook. After some creative scripting, we were able to obtain the data and use it for our analysis.
This is the aggregate data frame for our Cleaned Data.
%load_ext rmagic
%%R
source('./scripts/hiv_loader_clean.R')
all_data = get_all_data()
DF_Analysis = data.frame(
uninsured=all_data$insured$"Uninsured",
has_sex_ed_6_9=all_data$sex_ed_6_9$"Required.course.covers.how.to.prevent.HIV..other.STDs..and.pregnancy",
has_sex_ed_9_12=all_data$sex_ed_9_12$"Required.course.covers.how.to.prevent.HIV..other.STDs..and.pregnancy",
citizenship=all_data$citizenship$"Citizen",
federal_funds=all_data$federal_funds$"Total",
test_rate=all_data$test_rate$"HIV.Testing.Rate...Ever.Tested",
under_fpl=all_data$fpl$"Under.100.",
income=all_data$income$"Median.Annual.Income",
unemployment=all_data$unemployment$"October.2013",
hospital_for_profit=all_data$hospital$"For.Profit",
chlamydia=all_data$chlamydia$"Chlamydia.Cases", # this and the next two STD counts are per-100,000 rates
syphilis=all_data$syphilis$"Syphilis.Cases",
gonorrhea=all_data$gonorrhea$"Gonorrhea.Cases",
target=all_data$hiv_diagnoses$"Total"
)
row.names(DF_Analysis) = all_data$hiv_diagnoses$Location
print(head(DF_Analysis))
This is a plot that captures the correlation of each of the variables to each other.
%%R -r 110 -w 1600 -h 1600
plot(DF_Analysis[3:14])
Analysis¶
To begin our regression analysis, we remove the rows of states with NA.
%%R
removeNA = function(data){
# drop any state (row) that has an NA in columns 3 onward
for(i in 3:ncol(data)){
ind = which(is.na(data[,i]))
if(length(ind)>0){
data = data[-ind,]
}
}
# repeat for NaN values
for(k in 3:ncol(data)){
ind2 = which(is.nan(data[,k]))
if(length(ind2)>0){
data = data[-ind2,]
}
}
print(dim(data))
return(data)
}
DF_Analysis = removeNA(DF_Analysis)
print(head(DF_Analysis))
We removed 10 states that contained NA values by subsetting with which(). We will test our model against the remaining 40 states.
Next, we discard the least explanatory variables. We wrote a function that performs backward elimination with linear regression: it repeatedly fits the model and drops the variable with the largest p-value until every remaining variable is significant at the 0.05 level.
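The idea behind the R function in the next cell can also be sketched in Python. This is a hedged illustration, not our pipeline: `ols_pvalues` and `backward_eliminate` are hypothetical helpers using only NumPy and SciPy, mirroring the same fit-then-drop loop with a 0.05 cutoff:

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Fit y ~ X (with intercept) by OLS; return two-sided t-test p-values
    for each column of X (the intercept's p-value is dropped)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])           # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]
    sigma2 = resid @ resid / dof                   # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)          # coefficient covariance
    t = beta / np.sqrt(np.diag(cov))
    p = 2 * stats.t.sf(np.abs(t), dof)
    return p[1:]

def backward_eliminate(X, names, y, cutoff=0.05):
    """Repeatedly drop the least significant predictor until all p-values <= cutoff."""
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        p = ols_pvalues(X, y)
        worst = int(np.argmax(p))
        if p[worst] <= cutoff:
            break                                   # everything left is significant
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names

# Tiny synthetic example: y depends on x1; x2 is an irrelevant pattern
k = np.arange(20.0)
X = np.column_stack([k, np.cos(k)])
y = 2.0 * k + 0.01 * np.sin(k)
print(backward_eliminate(X, ["x1", "x2"], y))
```

The R version below does the same thing via `lm()` and the coefficient table from `summary()`.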
%%R
run_regression = function(DF_Analysis, remove_vec) {
for(i in 1:10){
lm.out = lm(DF_Analysis$target~., data = DF_Analysis[,-remove_vec])
stat_temp = summary(lm.out)
stat = data.frame(stat_temp[[4]])
stat = stat[-1,]
if(max(stat$Pr...t..)>0.05) {
stat = stat[order(stat$Pr...t..),]
least_exp = rownames(stat)[nrow(stat)]
cat(sprintf("The least explanatory variable in linear model fitting test %d is %s\n", i, least_exp))
col.num = which(colnames(DF_Analysis)==least_exp)
remove_vec = c(remove_vec, col.num)
}
else{break}
}
return(remove_vec)
}
removed = run_regression(DF_Analysis, 14)
cat("\n\n")
lm.out8 = lm(DF_Analysis$target~., data = DF_Analysis[,-removed])
stat_temp8 = summary(lm.out8)
stat8 = data.frame(stat_temp8[[4]])
cat("The result of final model fitting test\n\n")
print(stat8)
model_str = sprintf("target = %f", stat8$Estimate[1])
for (i in 2:(15-length(removed))) {
model_str = paste(model_str, sprintf(' + %f*%s',stat8$Estimate[i],rownames(stat8)[i]))
}
cat("\n\n")
cat("The model that we found is\n\n")
cat(model_str)
Below are two bar plots: one showing the p-values for each variable we tested, and one showing the p-values for each variable used in the final model.
To plot our graphs, we used the ggplot2 library in R. To install it, run: install.packages('ggplot2')
%%R
library(ggplot2)
lm.out1 = lm(DF_Analysis$target~., data = DF_Analysis[,-c(14)]) # the first linear model fitting is executed
stat_temp1 = summary(lm.out1) # use command summary to see important statistics
stat1 = data.frame(stat_temp1[[4]]) # we want to focus on the coefficients table
stat1 = stat1[-1,] # drop the row named (Intercept)
stat1 = stat1[order(stat1$Pr...t..),] # ordering the dataframe according to p-values
pv=data.frame(variables=rownames(stat1), p_values=stat1$Pr...t..)
x <- ggplot(data=pv, aes(x=variables, y=p_values)) +
geom_bar(colour="black", fill="#3BB9FF", width=.7, stat="identity") +
guides(fill=FALSE) +
xlab("Explanatory Variables") + ylab("p-values") +
ggtitle("p-values in the first model fitting") + coord_flip()
print(x)
%%R
library(ggplot2)
lm.out8 = lm(DF_Analysis$target~., data = DF_Analysis[,-removed])
stat_temp8 = summary(lm.out8)
stat8 = data.frame(stat_temp8[[4]])
stat8_ = stat8[2:(15-length(removed)),]
pv=data.frame(variables=rownames(stat8_), p_values=stat8_$Pr...t..)
x <- ggplot(data=pv, aes(x=variables, y=p_values)) +
geom_bar(colour="black", fill="#3BB9FF", width=.7, stat="identity") +
guides(fill=FALSE) +
xlab("Explanatory Variables") + ylab("p-values") +
ggtitle("p-values in the eighth model fitting")
print(x)
We were not surprised to see that the HIV diagnosis rate was correlated with the diagnosis rates of other STDs, the amount of federal funds, the number of people living without insurance, and the number of people who self-reported HIV testing. We were surprised to find a significant correlation between the HIV diagnosis rate and the number of for-profit hospitals. A possible explanation is that these hospitals have the resources to perform HIV tests and to treat individuals who have been diagnosed with HIV. Further analysis could explore this relationship.
D3 Mapping of HIV Data¶
Below are two D3 maps of HIV diagnosis rates. The bluer a state is, the higher its rate of HIV diagnosis. The first map plots the actual HIV diagnosis rate as reported by the Kaiser Family Foundation; the second plots the values generated by our model.
To run the maps in this notebook, there are dependencies for the IPyD3 module we load. To install them, run:
sudo pip install titlecase
sudo apt-get install phantomjs
NumPy should already be installed.
First, we need to import a few things.
from pandas import *
from urllib import urlopen
from numpy import *
# Importing IPyD3
exec(urlopen('https://raw.githubusercontent.com/z-m-k/ipyD3/master/ipyD3.py').read())
Now, we map the actual, measured values for HIV Diagnoses per 100,000 people in the US, based on the KFF data. We'll be able to compare that map with the calculated values from our model!
df = read_csv("./data/cleaned/HIV_Diagnoses.csv")
d3 = d3object(width=960,
height=500,
style='JFFigure',
number=1,
title='HIV Diagnosis')
d3.addCss('''
path.domain { display: none; } /* hide the black domain bar that otherwise shows up on the scale */
''')
The mapping code uses D3.js. We create a color scale, then grab XML from Wikipedia to create an SVG map of the US. Finally, we color the map based on state ID attributes and the color scale.
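The `colorScale` in the D3 code is just linear interpolation between white and blue over the data's range. A rough Python analogue (a hypothetical helper, simplified to RGB space, whereas D3's `interpolateHcl` blends in HCL space):

```python
def linear_color_scale(vmin, vmax, low=(255, 255, 255), high=(0, 0, 255)):
    """Return a function mapping a value in [vmin, vmax] to an RGB tuple
    interpolated between `low` (white) and `high` (blue) -- a simplified,
    RGB-space analogue of d3.scale.linear() over the data's extent."""
    span = float(vmax - vmin)
    def scale(v):
        t = min(1.0, max(0.0, (v - vmin) / span))  # clamp to [0, 1]
        return tuple(round(l + t * (h - l)) for l, h in zip(low, high))
    return scale

scale = linear_color_scale(0, 100)
print(scale(0))    # (255, 255, 255) -> white for the minimum
print(scale(100))  # (0, 0, 255)     -> blue for the maximum
```

Each state's fill is then just the scale applied to its diagnosis rate.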
d3.addJs('''
var svg = d3.select("#"+d3ObjId)
.append("svg")
.attr("width", width)
.attr("height", height)
.append("g")
.attr("id", "#"+d3ObjId+'InnerG')
var umap = [];
data.map(function(d) {umap[d[1]]=Number(d[2])});
var v = Object.keys(umap).map(function(k){return umap[k]});
var colorScale = d3.scale.linear()
.domain(d3.extent(v))
.interpolate(d3.interpolateHcl)
.range(["white", "blue"]);
svg.selectAll("rect")
.data(d3.range(d3.min(v), d3.max(v), (d3.max(v)-d3.min(v))/50.0))
.enter()
.append("rect")
.attr({width: 25,
height: 5,
y: function(d,i) { return height-25-i*5 },
x: width-50,
fill: function(d,i) { return colorScale(d); } });
svg.append("g")
.attr("transform", "translate(" + (width-25) + "," + (height-25-5*49) + ")")
.call(d3.svg.axis()
.scale(d3.scale.linear().domain(d3.extent(v)).range([5*50,0]))
.orient("right"));
d3.xml("http://upload.wikimedia.org/wikipedia/commons/3/32/Blank_US_Map.svg",
"image/svg+xml", function(xml) {
svg.append("g").attr("transform", "scale(" + (width/1200.0) + ")")
.node().appendChild(document.importNode(xml.documentElement, true));
d3.selectAll(".state").each(function() {
d3.select(this).style("fill",colorScale(umap[this.id]))
});
});
''')
d3.addVar(data=array(df.to_records().tolist()))
html=d3.render(mode=('show','html'))
from IPython.display import display
display(html)
Now, let's see what our model predicted.
%%R
#Model: -11.516558 + -52.025914*uninsured + 5.499537*federal_funds + 80.208011*test_rate + 13.399644*hospital_for_profit + -0.000242*chlamydia + 0.000881*gonorrhea
coeff = c(-52.025914, 5.499537, 80.208011, 13.399644, -0.000242, 0.000881, -11.516558)
modeled_data = DF_Analysis[, c(1, 5, 6, 10, 11, 13)] # uninsured, federal_funds, test_rate, hospital_for_profit, chlamydia, gonorrhea
modeled_data$constant = rep(1, nrow(modeled_data))
# row-wise dot product: multiply each row of dataf by the coefficient vector and sum
dotproduct = function(dataf, v2) {
apply(t(t(as.matrix(dataf)) * v2),1,sum)
}
calculated = dotproduct(modeled_data, coeff)
%%R
# save the model's predictions, keeping state names as row names
write.csv(data.frame(predicted = calculated), './data/predicted.csv')
df = read_csv('./data/predicted.csv')
d3 = d3object(width=960,
height=500,
style='JFFigure',
number=2,
title='Calculated HIV Diagnoses per 100,000 people, from our model')
d3.addCss('''
path.domain { display: none; } /* hide the black domain bar that otherwise shows up on the scale */
''')
d3.addJs('''
var svg = d3.select("#"+d3ObjId)
.append("svg")
.attr("width", width)
.attr("height", height)
.append("g")
.attr("id", "#"+d3ObjId+'InnerG')
var umap = [];
data.map(function(d) {umap[d[1]]=Number(d[2])});
var v = Object.keys(umap).map(function(k){return umap[k]});
var colorScale = d3.scale.linear()
.domain(d3.extent(v))
.interpolate(d3.interpolateHcl)
.range(["white", "blue"]);
svg.selectAll("rect")
.data(d3.range(d3.min(v), d3.max(v), (d3.max(v)-d3.min(v))/50.0))
.enter()
.append("rect")
.attr({width: 25,
height: 5,
y: function(d,i) { return height-25-i*5 },
x: width-50,
fill: function(d,i) { return colorScale(d); } });
svg.append("g")
.attr("transform", "translate(" + (width-25) + "," + (height-25-5*49) + ")")
.call(d3.svg.axis()
.scale(d3.scale.linear().domain(d3.extent(v)).range([5*50,0]))
.orient("right"));
d3.xml("http://upload.wikimedia.org/wikipedia/commons/3/32/Blank_US_Map.svg",
"image/svg+xml", function(xml) {
svg.append("g").attr("transform", "scale(" + (width/1200.0) + ")")
.node().appendChild(document.importNode(xml.documentElement, true));
d3.selectAll(".state").each(function() {
d3.select(this).style("fill",colorScale(umap[this.id]))
});
});
''')
d3.addVar(data=array(df.to_records().tolist()))
html=d3.render(mode=('show','html'))
from IPython.display import display
display(html)
As we can see, there is very little difference between the calculated map and the one measured by KFF! As a quick check, we'll calculate the errors.
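The error metric computed in the next cell is the per-state relative error, |measured − predicted| / measured. In Python terms (the numbers here are made up for illustration):

```python
def relative_error(measured, predicted):
    """Absolute difference as a fraction of the measured value."""
    return abs(measured - predicted) / measured

# e.g. a state with a measured rate of 20.0 and a model estimate of 18.0
print(relative_error(20.0, 18.0))  # 0.1
```

A relative error near zero means the model's estimate closely matches the KFF figure for that state.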
%%R
#testing our model against each US state
# select the model variables plus the target, keeping state names as row names
states_df <- DF_Analysis[, c(1, 5, 6, 10, 11, 13, 14)]
colnames(states_df) <- c("uninsured", "federal_funds", "test_rate", "hospital_for_profit", "chlamydia", "gonorrhea", "target")
states_df$target = as.numeric(as.character(states_df$target))
#TEST RESULTS
states_df["estimation"] = -11.516558 + -52.025914*DF_Analysis$uninsured + 0.000000*DF_Analysis$federal_funds + 80.208011*DF_Analysis$test_rate + 13.399644*DF_Analysis$hospital_for_profit + -0.000242*DF_Analysis$chlamydia + 0.000881*DF_Analysis$gonorrhea
states_df["error"] = abs(states_df$target - states_df$estimation) / states_df$target
states_df = states_df[order(states_df$error),]
states_df["rank"] = 1:nrow(states_df)
print(head(states_df))
%%R
errorplot_hist <- ggplot(data=states_df, aes(x=rownames(states_df), y=error)) +
geom_bar(colour="black", fill="#7FFFD4", width=.7, stat="identity") +
guides(fill=FALSE) +
xlab("States") + ylab("Relative Error") +
ggtitle("Relative error of the model, by state") + coord_flip()
print(errorplot_hist)
Looking at the D3 maps and the relative-error bar chart, our model is a good predictor for the majority of the states. However, it does not work as well for Alaska.
Summary¶
In summary, after gathering and cleaning data on 14 potential explanatory variables for the 50 states and assembling them into a single data frame, we wrote a function that performs a linear regression, removes the least significant variables, and identifies the variables significantly correlated with the target, HIV diagnoses. We tested the model on 40 states (after removing 10 states for their NA values) and concluded that it is a good predictor of the HIV diagnosis rate.