Date

The Report

In this notebook, we will sumarize our findings and present our results.

The HIV Pandemic

As of 2012, approximately 35.3 million people are living with HIV globally and of these, approximately 3.4 million of these people are children less than 15 years old. According to the Center of Disease Control, as of today in the U.S., 1.2 million people are infected with HIV and this number is still increasing. In order to get a handle on the HIV pandemic, it is important to understand what factors are associated with high levels of HIV. The goal of our project is to determine what factors are correlated with low levels of HIV diagnosis and what factors lead to high levels of HIV diagnosis.

Using data provided by the Kaiser Family Foundation, we will analyze each of these risk factors and analyze how it correlates with HIV diagnosis rate (per 100,000). We will look at household income, gender, race, citizenship status, the amount of federal funding given for HIV, the type of medical insurance individuals have, the types of hospital available, self-reported testing rates, and the availability of sexual education at middle and high school levels. We will then use R to run linear regression analysis to determine the correlation between each factor and HIV diagnosis rate (per 100,000). Then, we will also do a comparison between state and rank them according to a weighted model. To create a score which we can use to rank the states, we will take the factors that are highly correlated to HIV diagnosis rate (per 100,000) and include them in the model. The equation for our model will be “HIV_Risk_Value = w1x1 + w2x2 + w3*x3 +…” where each w represents weights for inputs, and each x can be HIV incidences and quantified population features. To determine appropriate weights, we will utilize logistic regression package on R.

The Data

Provided by the Kaiser Family Foundation: http://kff.org/statedata/

The Kaiser Family Foundation provides us with data about citizenship status, racial demography, gender, median income, the insurance type, the ownership of the hospital, and the percentage of schools that mandate sexual education for middle school and high school.

One of the major problems we faced was uploading the data to ipython notebook for analysis. The data is easily downloaded through a browser. However, the website does not facilitate easy download with ipython notebook. After using a creative solution, we were able to obtain the data and use it for our analysis.

This is the aggregate data frame for our Cleaned Data.

In [11]:
%load_ext rmagic
In [12]:
%%R
source('./scripts/hiv_loader_clean.R')

all_data = get_all_data()
DF_Analysis = data.frame(
    uninsured=all_data$insured$"Uninsured",
    has_sex_ed_6_9=all_data$sex_ed_6_9$"Required.course.covers.how.to.prevent.HIV..other.STDs..and.pregnancy",
    has_sex_ed_9_12=all_data$sex_ed_9_12$"Required.course.covers.how.to.prevent.HIV..other.STDs..and.pregnancy",
    citizenship=all_data$citizenship$"Citizen",
    federal_funds=all_data$federal_funds$"Total",
    test_rate=all_data$test_rate$"HIV.Testing.Rate...Ever.Tested",
    under_fpl=all_data$fpl$"Under.100.",
    income=all_data$income$"Median.Annual.Income",
    unemployment=all_data$unemployment$"October.2013",
    hospital_for_profit=all_data$hospital$"For.Profit",
    chlamydia=all_data$chlamydia$"Chlamydia.Cases", #These following are all per-100,000 numbers
    syphilis=all_data$syphilis$"Syphilis.Cases",
    gonorrhea=all_data$gonorrhea$"Gonorrhea.Cases",
    
    target=all_data$hiv_diagnoses$"Total"
)

row.names(DF_Analysis) = all_data$hiv_diagnoses$Location

print(head(DF_Analysis))
[1] "Loaded: hiv_diagnoses, citizenship_status, fpl, income, unemployment, insured, hospital, federal_funds, test_rate, sex_ed_6_9, sex_ed_9_12, chlamydia, gonorrhea, syphilis"
   uninsured has_sex_ed_6_9 has_sex_ed_9_12 citizenship federal_funds test_rate
AL 0.1395612          0.758           0.827   0.9802152    0.43885230     0.420
AK 0.1884869          0.446           0.659   0.9586563    0.03095699     0.443
AZ 0.1774917          0.323           0.716   0.9140282    0.37920656     0.341
AR 0.1794378          0.819           0.969   0.9656608    0.13691627     0.311
CA 0.1883828          0.760           0.933   0.8606448    4.33469436     0.392
CO 0.1470530             NA              NA   0.9442321    0.40883331     0.361
   under_fpl income unemployment hospital_for_profit chlamydia syphilis
AL 0.2196644  42245        0.065               0.382     29626      758
AK 0.1981696  60566        0.065               0.087      5739       11
AZ 0.2269776  48319        0.082               0.271     29251      906
AR 0.2431548  39806        0.075               0.274     16052      464
CA 0.2312397  56074        0.087               0.214    166773     6782
CO 0.1662836  59803        0.068               0.207     21811      367
   gonorrhea target
AL      9132   20.9
AK       984    4.6
AZ      4564   13.3
AR      4687   10.0
CA     27516   19.2
CO      2363    9.6

This is a plot that captures the correlation of each of the variables to each other.

In [13]:
%%R -r 110 -w 1600 -h 1600
plot(DF_Analysis[3:14])

Analysis

To begin our regression analysis, we remove the rows of states with NA.

In [14]:
%%R
removeNA = function(data){
  for(i in 3:ncol(data)){
    ind = which(is.na(data[,i]))

    if(length(ind)>0){                        
      data = data[-ind,]
    }
  }
  
  for(k in 3:ncol(data)){                     
    ind2 = which(is.nan(data[,k]))     

    if(length(ind2)>0){
      data = data[-ind2,]                     
    }
  }

  print(dim(data))                            
  return(data)          
}

DF_Analysis = removeNA(DF_Analysis)
print(head(DF_Analysis))
[1] 40 14
   uninsured has_sex_ed_6_9 has_sex_ed_9_12 citizenship federal_funds test_rate
AL 0.1395612          0.758           0.827   0.9802152    0.43885230     0.420
AK 0.1884869          0.446           0.659   0.9586563    0.03095699     0.443
AZ 0.1774917          0.323           0.716   0.9140282    0.37920656     0.341
AR 0.1794378          0.819           0.969   0.9656608    0.13691627     0.311
CA 0.1883828          0.760           0.933   0.8606448    4.33469436     0.392
CT 0.0833760          0.740           0.952   0.9320703    0.42966408     0.341
   under_fpl income unemployment hospital_for_profit chlamydia syphilis
AL 0.2196644  42245        0.065               0.382     29626      758
AK 0.1981696  60566        0.065               0.087      5739       11
AZ 0.2269776  48319        0.082               0.271     29251      906
AR 0.2431548  39806        0.075               0.274     16052      464
CA 0.2312397  56074        0.087               0.214    166773     6782
CT 0.1471107  67165        0.079               0.029     13649      189
   gonorrhea target
AL      9132   20.9
AK       984    4.6
AZ      4564   13.3
AR      4687   10.0
CA     27516   19.2
CT      2449   14.2

We removed 10 states by subsetting using which(). We will use 40 states against which we will test our model.

Now we want to discard variables that don't fit, or the least explanatory variables. We wrote a function using the linear regression.

In [15]:
%%R
run_regression = function(DF_Analysis, remove_vec) {
    for(i in 1:10){
        lm.out = lm(DF_Analysis$target~., data = DF_Analysis[,-remove_vec])
        stat_temp = summary(lm.out)
        stat = data.frame(stat_temp[[4]])
        stat = stat[-1,]
        if(max(stat$Pr...t..)>0.05) {
            stat = stat[order(stat$Pr...t..),]
            least_exp = rownames(stat)[nrow(stat)]
            cat(sprintf("The least explanatory variable in linear model fitting test %d is %s\n", i, least_exp))
            col.num = which(colnames(DF_Analysis)==least_exp)
            remove_vec = c(remove_vec, col.num)
        }
        else{break}
    }
    
    return(remove_vec)
}

removed = run_regression(DF_Analysis, 14)

cat("\n\n")
lm.out8 = lm(DF_Analysis$target~., data = DF_Analysis[,-removed])
stat_temp8 = summary(lm.out8)
stat8 = data.frame(stat_temp8[[4]]) 
cat("The result of final model fitting test\n\n")
print(stat8)

model_str = sprintf("target = %f", stat8$Estimate[1])
for (i in 2:(15-length(removed))) {
    model_str = paste(model_str, sprintf(' + %f*%s',stat8$Estimate[i],rownames(stat8)[i]))
}

cat("\n\n")
cat("The model that we found is\n\n")
cat(model_str)
The least explanatory variable in linear model fitting test 1 is has_sex_ed_6_9
The least explanatory variable in linear model fitting test 2 is syphilis
The least explanatory variable in linear model fitting test 3 is income
The least explanatory variable in linear model fitting test 4 is has_sex_ed_9_12
The least explanatory variable in linear model fitting test 5 is unemployment
The least explanatory variable in linear model fitting test 6 is under_fpl
The least explanatory variable in linear model fitting test 7 is citizenship


The result of final model fitting test

                         Estimate   Std..Error   t.value     Pr...t..
(Intercept)         -1.151656e+01 4.631387e+00 -2.486632 1.812971e-02
uninsured           -5.202591e+01 2.052507e+01 -2.534750 1.617754e-02
federal_funds        5.499537e+00 1.846258e+00  2.978748 5.393396e-03
test_rate            8.020801e+01 1.291732e+01  6.209339 5.217136e-07
hospital_for_profit  1.339964e+01 5.959283e+00  2.248533 3.134033e-02
chlamydia           -2.416097e-04 8.926969e-05 -2.706514 1.068060e-02
gonorrhea            8.811796e-04 3.079856e-04  2.861106 7.271932e-03


The model that we found is

target = -11.516558  + -52.025914*uninsured  + 5.499537*federal_funds  + 80.208011*test_rate  + 13.399644*hospital_for_profit  + -0.000242*chlamydia  + 0.000881*gonorrhea

Below are two bar plots - one for the p-values for each variable we tested and one for the p-values for each variable we used in the model

To plot our graphs, we used the ggplot2 library in R. To install it, run: install.packages('ggplot2')

In [16]:
%%R
library(ggplot2)
lm.out1 = lm(DF_Analysis$target~., data = DF_Analysis[,-c(14)])     # the first linear model fitting is executed 
stat_temp1 = summary(lm.out1)                                       # use command summary to see important statistics  
stat1 = data.frame(stat_temp1[[4]])                                 # We want to fous on coefficients part 
stat1 = stat1[-1,]                                                  # getting rid of row named (Intecept) 
stat1 = stat1[order(stat1$Pr...t..),]                               # ordering the dataframe according to p-values       
pv=data.frame(variables=rownames(stat1), p_values=stat1$Pr...t..)

x <- ggplot(data=pv, aes(x=variables, y=p_values)) + 
        geom_bar(colour="black", fill="#3BB9FF", width=.7, stat="identity") + 
        guides(fill=FALSE) +
        xlab("Explanatory Variables") + ylab("p-values") +
        ggtitle("p-values in the first model fitting") + coord_flip()
print(x)    
Error in library(ggplot2) : there is no package called ‘ggplot2’
In addition: Warning message:
In all_data$hiv_diagnoses = hiv_diagnoses : Coercing LHS to a list
Error in library(ggplot2) : there is no package called ‘ggplot2’

In [17]:
%%R
library(ggplot2)
lm.out8 = lm(DF_Analysis$target~., data = DF_Analysis[,-removed])
stat_temp8 = summary(lm.out8)
stat8 = data.frame(stat_temp8[[4]]) 
stat8_ = stat8[2:(15-length(removed)),]
pv=data.frame(variables=rownames(stat8_), p_values=stat8_$Pr...t..)

x <- ggplot(data=pv, aes(x=variables, y=p_values)) + 
        geom_bar(colour="black", fill="#3BB9FF", width=.7, stat="identity") + 
        guides(fill=FALSE) +
        xlab("Explanatory Variables") + ylab("p-values") +
        ggtitle("p-values in the eighth model fitting")
print(x)
Error in library(ggplot2) : there is no package called ‘ggplot2’

We were not surprised to see HIV diagnosis rate was correlated to the diagnosis rates of other STDs, the amount of Federal Funds, the amount of persons living without insurance, and number of people who self reported HIV testing. We were surprised to fine a significant correlation between HIV Diagnosis rate and the numbers Hospitals for Profit. A possible explanation for the correlation is that these Hopsitals could have the resources for performing HIV tests and treat individuals who have been diagnosed with HIV. Further anaylsis can be done to explore this relationship.

D3 Mapping of HIV Data

Below are two D3 maps of HIV diagnosis rates. The more blue a state is the higher the rate of HIV diagnosis is. The first map is a plot using the actual HIV diagnosis rate as reported by the Kaiser Family Foundation. The second map is a plot using the values generated by our model.

To run the maps in this notebook, there are dependencies for the IPyD3 module we load. To install them, run:

sudo pip install titlecase
sudo apt-get install phantomjs

NumPy should already be installed.

First, we need to import a few things.

In [18]:
from pandas import *
from urllib import urlopen
from numpy import *
# Importing IPyD3
exec(urlopen('https://raw.githubusercontent.com/z-m-k/ipyD3/master/ipyD3.py').read())

Now, we map the actual, measured values for HIV Diagnoses per 100,000 people in the US, based on the KFF data. We'll be able to compare that map with the calculated values from our model!

In [19]:
df = read_csv("./data/cleaned/HIV_Diagnoses.csv")
In [20]:
d3 = d3object(width=960,
              height=500,
              style='JFFigure',
              number=1,
              title='HIV Diagnosis')
In [21]:
d3.addCss('''
        path.domain { display: none; } // This is the ugly black bar that shows up on the scale otherwise
''')

The mapping code uses D3.js. We create a color scale, then grab XML from Wikipedia to create an SVG map of the US. Finally, we color the map based on state ID attributes and the color scale.

In [22]:
d3.addJs('''
    var svg = d3.select("#"+d3ObjId)
        .append("svg")
        .attr("width", width)
        .attr("height", height)
        .append("g")
        .attr("id", "#"+d3ObjId+'InnerG')

    var umap = [];
    data.map(function(d) {umap[d[1]]=Number(d[2])});
    var v = Object.keys(umap).map(function(k){return umap[k]});

    var colorScale = d3.scale.linear()
        .domain(d3.extent(v))
        .interpolate(d3.interpolateHcl)
        .range(["white", "blue"]);

    svg.selectAll("rect")
        .data(d3.range(d3.min(v), d3.max(v), (d3.max(v)-d3.min(v))/50.0))
      .enter()
        .append("rect")
        .attr({width: 25,
              height: 5,
              y: function(d,i) { return height-25-i*5 },
              x: width-50,
              fill: function(d,i) { return colorScale(d); } });

    svg.append("g")
        .attr("transform", "translate(" + (width-25) + "," + (height-25-5*49) + ")")
        .call(d3.svg.axis()
                .scale(d3.scale.linear().domain(d3.extent(v)).range([5*50,0]))
                .orient("right"));

    d3.xml("http://upload.wikimedia.org/wikipedia/commons/3/32/Blank_US_Map.svg",
            "image/svg+xml", function(xml) {
        svg.append("g").attr("transform", "scale(" + (width/1200.0) + ")")
            .node().appendChild(document.importNode(xml.documentElement, true));

        d3.selectAll(".state").each(function() {
            d3.select(this).style("fill",colorScale(umap[this.id]))
        });
    });
''')
In [23]:
d3.addVar(data=array(df.to_records().tolist()))
html=d3.render(mode=('show','html'))
from IPython.display import display
display(html)
5101520253035
Figure 1. HIV Diagnosis.

Now, let's see what our model predicted.

In [24]:
%%R
#Model: -11.516558  + -52.025914*uninsured  + 5.499537*federal_funds  + 80.208011*test_rate  + 13.399644*hospital_for_profit  + -0.000242*chlamydia  + 0.000881*gonorrhea
coeff = c(-52.025914, 5.499537, 80.208011, 13.399644, -0.000242, 0.000881, -11.516558)
modeled_data = all_data[,c(1, 5, 6, 10, 11, 13)]
modeled_data$constant = rep(1, 50)
dotproduct = function(dataf, v2) { 
    apply(t(t(as.matrix(dataf)) * v2),1,sum)
}
calculated = dotproduct(modeled_data, coeff)
Error in all_data[, c(1, 5, 6, 10, 11, 13)] : 
  incorrect number of dimensions

In [25]:
%%R
file.create('./data/predicted.csv')
write.csv(calculated, './data/predicted.csv', row.names=F)
Error in is.data.frame(x) : object 'calculated' not found

In [26]:
df = read_csv('./data/cleaned/HIV_Diagnoses.csv')
In [27]:
d3 = d3object(width=960,
              height=500,
              style='JFFigure',
              number=2,
              title='Calculated HIV Diagnoses per 100,000 people, from KFF')
In [28]:
d3.addCss('''
        path.domain { display: none; } // This is the ugly black bar that shows up on the scale otherwise
''')
In [29]:
d3.addJs('''
    var svg = d3.select("#"+d3ObjId)
        .append("svg")
        .attr("width", width)
        .attr("height", height)
        .append("g")
        .attr("id", "#"+d3ObjId+'InnerG')

    var umap = [];
    data.map(function(d) {umap[d[1]]=Number(d[2])});
    var v = Object.keys(umap).map(function(k){return umap[k]});

    var colorScale = d3.scale.linear()
        .domain(d3.extent(v))
        .interpolate(d3.interpolateHcl)
        .range(["white", "blue"]);

    svg.selectAll("rect")
        .data(d3.range(d3.min(v), d3.max(v), (d3.max(v)-d3.min(v))/50.0))
      .enter()
        .append("rect")
        .attr({width: 25,
              height: 5,
              y: function(d,i) { return height-25-i*5 },
              x: width-50,
              fill: function(d,i) { return colorScale(d); } });

    svg.append("g")
        .attr("transform", "translate(" + (width-25) + "," + (height-25-5*49) + ")")
        .call(d3.svg.axis()
                .scale(d3.scale.linear().domain(d3.extent(v)).range([5*50,0]))
                .orient("right"));

    d3.xml("http://upload.wikimedia.org/wikipedia/commons/3/32/Blank_US_Map.svg",
            "image/svg+xml", function(xml) {
        svg.append("g").attr("transform", "scale(" + (width/1200.0) + ")")
            .node().appendChild(document.importNode(xml.documentElement, true));

        d3.selectAll(".state").each(function() {
            d3.select(this).style("fill",colorScale(umap[this.id]))
        });
    });
''')
In [30]:
d3.addVar(data=array(df.to_records().tolist()))
html=d3.render(mode=('show','html'))
from IPython.display import display
display(html)
5101520253035
Figure 2. Calculated HIV Diagnoses Per 100,000 People, From KFF.

As we can see, there is very little difference between the calculated graph and the one measured by KFF! As quick justification, we'll calculate the errors.

In [31]:
%%R
#testing our model against each US state
states_df <- as.data.frame(cbind(DF_Analysis[,1], DF_Analysis[,5], DF_Analysis[,6], DF_Analysis[,10], DF_Analysis[,11], DF_Analysis[,13], DF_Analysis[,14]))
colnames(states_df) <- c("uninsured", "federal_funds", "test_rate", "hospital_for_profit", "chlamydia", "gonorrhea", "target")
rownames(states_df) <- row.names(DF_Analysis)

states_df$target = as.numeric(as.character(states_df$target))
#TEST RESULTS
states_df["estimation"] = -11.516558  + -52.025914*DF_Analysis$uninsured  + 0.000000*DF_Analysis$federal_funds + 80.208011*DF_Analysis$test_rate + 13.399644*DF_Analysis$hospital_for_profit + -0.000242*DF_Analysis$chlamydia  + 0.000881*DF_Analysis$gonorrhea
states_df["error"] = abs(states_df$target - states_df$estimation) / states_df$target
states_df = states_df[order(states_df$error),]
states_df["rank"] = 1:40
print(head(states_df))
   uninsured federal_funds test_rate hospital_for_profit chlamydia gonorrhea
AL 0.1395612    0.43885230     0.420               0.382     29626      9132
WI 0.1005970    0.19222320     0.298               0.032     24619      4789
ID 0.1643142    0.04395342     0.286               0.100      4699       162
MO 0.1410797    0.40624203     0.332               0.183     27887      7802
VA 0.1304832    0.57236415     0.413               0.225     36314      6518
KY 0.1509814    0.20139443     0.314               0.189     16629      4521
   target estimation        error rank
AL   20.9  20.904473 0.0002140138    1
WI    5.7   5.841876 0.0248904509    2
ID    3.0   3.219865 0.0732884213    3
MO   11.2  10.349742 0.0759159272    4
VA   16.2  14.790135 0.0870287268    5
KY    9.4   8.305126 0.1164759667    6

In [32]:
%%R
errorplot_hist <- ggplot(data=states_df, aes(x=rownames(states_df),y=states_df$error)) + 
        geom_bar(colour="black", fill="#7FFFD4", width=.7, stat="identity") + 
        guides(fill=FALSE) +
        xlab("Errors") + ylab("States") +
        ggtitle("Error") + coord_flip()

print(errorplot_hist)
Error in eval(expr, envir, enclos) : could not find function "ggplot"

After looking at the D3 Maps and the Standard Error Bar Chart, our model is a good predicter for the majority of the states. However, we find that our model does not work as well for Alaska.

Summary

In summary, after gathering and cleaning data for 14 potential explanatory variables of the 50 states, and putting them in a comprehensible data frame, we created a function that does a a linear regression and removes the column vectors of variables that were least significant and determined the variables with significant correlation to the target, or HIV Diagnoses. We tested the model on 40 states (after removing 10 states for their NA values), and concluded the model is a good predictor of HIV diagnosis rate.

In [32]: