Election Data Summary

 

The Florida presidential election data can be downloaded here.

 

The columns in the data include:

 

Field Name      Description
county          Florida county name
b2000           Raw votes for Bush, 2000
g2000           Raw votes for Gore, 2000
b2004           Raw votes for Bush, 2004
k2004           Raw votes for Kerry, 2004
etouch          Dummy variable: 1 if the county used electronic voting; 0 otherwise
median_income   Median income in the county
hispanic        Hispanic population in the county (expressed as a proportion of county population)
b00pc           % voted for Bush in 2000 (b2000/votes00)
b04pc           % voted for Bush in 2004 (b2004/votes04)
b_change        Change in % voted for Bush between 2000 and 2004 (b04pc - b00pc)
b00pc_sq        % voted for Bush in 2000, squared
b00pc_e         Interaction between b00pc and etouch (b00pc*etouch)
b00pcsq_e       Interaction between b00pc_sq and etouch (b00pc_sq*etouch)
votes00         Total votes cast in 2000
votes04         Total votes cast in 2004
v_change        Change in voter turnout from 2000 to 2004 (votes04 - votes00)

 

# read the data in election.txt into an R data frame object called election

> election <- read.table("election.txt", header = TRUE)

 

Consider “Change in % voted for Bush between 2000 and 2004” (b_change) to be the dependent variable in the data. The change in votes may be attributed to the following independent variables: etouch, median_income, hispanic, v_change.


A summary of the election data for the variables of interest is shown below.

> summary(election[, c("median_income", "hispanic", "b_change", "v_change")])

 median_income      hispanic          b_change           v_change     
 Min.   :26032   Min.   :0.01500   Min.   :-0.02957   Min.   :   663  
 1st Qu.:30029   1st Qu.:0.02700   1st Qu.: 0.01833   1st Qu.:  2250  
 Median :33779   Median :0.04900   Median : 0.03506   Median : 11056  
 Mean   :35385   Mean   :0.08528   Mean   : 0.03702   Mean   : 24236  
 3rd Qu.:40249   3rd Qu.:0.09400   3rd Qu.: 0.05388   3rd Qu.: 35385  
 Max.   :52244   Max.   :0.57300   Max.   : 0.10710   Max.   :116327  

 

 

Plotting the raw distributions of these variables gives the following results:
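These plots can be generated with base graphics; a minimal sketch, one histogram per variable of interest, assuming only the election data frame read in above:

par(mfrow = c(2, 2))                    # 2 x 2 grid of panels
hist(election$median_income, main = "median_income", xlab = "median_income")
hist(election$hispanic,      main = "hispanic",      xlab = "hispanic")
hist(election$b_change,      main = "b_change",      xlab = "b_change")
hist(election$v_change,      main = "v_change",      xlab = "v_change")
par(mfrow = c(1, 1))                    # reset the plotting layout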

 

 

 

 

 


------------------------------------------------------------------------------------------------------------

 

Title:

 

Computing the mean with confidence intervals

Definition and Motivation:

 

The arithmetic mean of a collection is the sum of the elements in that collection divided by the number of elements in that collection (including duplicates). The goal of computing the mean of a randomly selected subset S of a larger set S' is to estimate the mean of S'. By computing confidence intervals on the mean of S, we attempt to bound the interval in which the mean of S' is likely to lie.

Pseudo-Code:
 

Let n be the size of S (including duplicates)
BagOfResults := empty collection         // note that duplicates will be allowed

do LARGENUMBER times                     // LARGENUMBER should be something like 1000 or more
    E := take n elements of S with replacement
    add mean(E) to BagOfResults
end

SequenceOfResults := sort BagOfResults from smallest to largest
low  := the element at position 0.025*LARGENUMBER of SequenceOfResults   // low end of 95% confidence interval
high := the element at position 0.975*LARGENUMBER of SequenceOfResults   // high end of 95% confidence interval

return (low, high, mean(S))
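Before turning to the boot package, the pseudo-code above can be translated almost line for line into base R. A minimal sketch, using b_change as the sample S and LARGENUMBER = 1000:

x <- election$b_change                                    # the sample S
boot_means <- replicate(1000,                             # LARGENUMBER = 1000
                        mean(sample(x, length(x), replace = TRUE)))
quantile(sort(boot_means), c(0.025, 0.975))               # low and high ends of the 95% interval
mean(x)                                                   # observed sample mean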

 

 

R-Code:

 

# define a mean function to be used by the call to boot

> mean_fun <- function(data, indices){
+   mean(data[indices])
+ }

# make the call to boot; boot resamples with replacement from the data and
# applies the mean_fun function to each sample
# but first load the boot library

> library(boot)

> boot_b_change_mean <- boot(data = election[, "b_change"], statistic = mean_fun, R = 999)

> boot_b_change_mean

 

 

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:

boot(data = election[, "b_change"], statistic = mean_fun, R = 999)

Bootstrap Statistics :

      original          bias    std. error
t1* 0.03702145 -0.0001289764   0.003587013

 

In the bootstrap output, "original" is the mean of b_change computed on the observed data, 0.037021; the bias and standard error summarise how the bootstrap means vary around it. To visualize the results we can use R's plot function, which accepts the result of the boot call.

 

>plot(boot_b_change_mean)

 

Finally, we compute the confidence interval of the mean.

> boot.ci(boot.out = boot_b_change_mean, conf = 0.95, type = "perc")

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS

Based on 999 bootstrap replicates

CALL :

boot.ci(boot.out = boot_b_change_mean, conf = 0.95, type = "perc")

Intervals :

Level     Percentile    

95%   ( 0.0297,  0.0442 ) 

Calculations and Intervals on Original Scale


------------------------------------------------------------------------------------------------------------

 

Title:

 

Calculating the difference between the means of two groups

Definition and Motivation:

 

In this case we can split the data set into two groups: the counties that used electronic voting and those that did not (i.e. etouch = 1 vs. etouch = 0). We then calculate the bootstrap distribution of the difference in the means of the two groups. The 95% confidence interval gives a bound on the difference of the means.


Pseudo-Code:
 

Let n1 be the size of S1                 // S1 is the group with etouch = 0
Let n2 be the size of S2                 // S2 is the group with etouch = 1
BagOfResults := empty collection

do LARGENUMBER times                     // LARGENUMBER should be something like 1000 or more
    E := take n1 elements of S1 with replacement
    F := take n2 elements of S2 with replacement
    add mean(E) - mean(F) to BagOfResults
end

SequenceOfResults := sort BagOfResults from smallest to largest
low  := the element at position 0.025*LARGENUMBER of SequenceOfResults   // low end of 95% confidence interval
high := the element at position 0.975*LARGENUMBER of SequenceOfResults   // high end of 95% confidence interval

return (low, high, mean(S1) - mean(S2))
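As above, the pseudo-code has a direct base-R counterpart. A minimal sketch, splitting b_change by etouch in the election data frame:

g0 <- election$b_change[election$etouch == 0]             # S1: counties without electronic voting
g1 <- election$b_change[election$etouch == 1]             # S2: counties with electronic voting
boot_diffs <- replicate(1000,
  mean(sample(g0, length(g0), replace = TRUE)) -
  mean(sample(g1, length(g1), replace = TRUE)))
quantile(boot_diffs, c(0.025, 0.975))                     # 95% percentile interval for the difference
mean(g0) - mean(g1)                                       # observed difference in means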

 

 

R-Code:

 

# for ease of computation and analysis, create a data set containing only the dependent and
# independent variables (i.e. etouch, median_income, hispanic, v_change and b_change)

> election_subset <- election[, c("etouch", "median_income", "hispanic", "v_change", "b_change")]

 

# define a function that calculates the difference between the means of the two groups;
# this function can then be passed to the call to boot

> mean_diff_fun <- function(data, indices){
+   m1 <- mean(data[indices, "b_change"][data[indices, "etouch"] == 0])
+   m2 <- mean(data[indices, "b_change"][data[indices, "etouch"] == 1])
+   m1 - m2
+ }

>

 

# finally, make the call to boot

> boot_mean_diff <- boot(data=election_subset, statistic=mean_diff_fun, R=999)

> boot_mean_diff

 

 

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:

boot(data = election_subset, statistic = mean_diff_fun, R = 999)

Bootstrap Statistics :

      original       bias    std. error

t1* 0.01646996 0.0001910999 0.006707707

 

> plot(boot_mean_diff)

 

The first panel of the plot shows the distribution of the difference of the means across the 999 bootstrap samples. The observed difference in the means is 0.01646996. Now we calculate the confidence interval as follows:

 

> boot_mean_diff_conf <- boot.ci(boot_mean_diff, conf=0.95, type="perc")

> boot_mean_diff_conf

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS

Based on 999 bootstrap replicates

 

CALL :

boot.ci(boot.out = boot_mean_diff, conf = 0.95, type = "perc")

 

Intervals :

Level     Percentile    

95%   ( 0.0035,  0.0295 ) 

Calculations and Intervals on Original Scale

 

 

This says that with a confidence of 95%, the difference between the means is between 0.0035 and 0.0295.

 

 


------------------------------------------------------------------------------------------------------------

 

Title:

 

Calculating the bootstrap-based and permutation-based empirical p-value for the difference of the means of two groups


Definition and Motivation:

 

The problem of finding the difference between the means of two groups can also be described using a hypothesis testing system. The null hypothesis is that the difference between the means of the two groups (etouch =0 and etouch =1) is zero. The alternative hypothesis is that the difference is greater than zero. i.e.

H0: diff = 0

Ha: diff > 0

The probability value (p-value) of a statistical hypothesis test is the probability, if the null hypothesis H0 is true, of getting a value of the test statistic (in this case the difference between the means) as extreme as or more extreme than the one observed, by chance alone. Small p-values suggest that the null hypothesis is unlikely to be true; the smaller the p-value, the more convincing the rejection of the null hypothesis.

Pseudo-Code (Bootstrap-based version):
 

Let n1 be the size of S1                 // S1 is the group with etouch = 0
Let n2 be the size of S2                 // S2 is the group with etouch = 1
BagOfResults := empty collection
count := 0

do LARGENUMBER times                     // LARGENUMBER should be something like 1000 or more
    E := take n1 elements of S1 with replacement
    F := take n2 elements of S2 with replacement
    add mean(E) - mean(F) to BagOfResults
end

for each value in BagOfResults
    if value <= 0
        count := count + 1

pvalue := count / LARGENUMBER

return (pvalue)

 

 

R-Code (Bootstrap-based version):

 

> boot_mean_diff <- boot(data=election_subset, statistic=mean_diff_fun, R=999)

> boot_mean_diff

 

 

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:

boot(data = election_subset, statistic = mean_diff_fun, R = 999)

Bootstrap Statistics :

      original       bias    std. error

t1* 0.01646996 0.0001910999 0.006707707

 

Calculating the p-value

 

> sum(boot_mean_diff$t <= 0)/999

[1] 0.008008008

 

The above p-value is very small. This suggests that there is very strong evidence against the null hypothesis that the difference in the means of the two groups (etouch=0 and etouch=1) is zero.

 

 

Pseudo-Code (Permutation-based version):

 

Let n be the total number of counties (both groups pooled)
observed := mean of b_change where etouch = 0  -  mean of b_change where etouch = 1
count := 0

do LARGENUMBER times                     // LARGENUMBER should be something like 1000 or more
    permute the etouch column            // equivalent to resampling the labels without replacement;
                                         // b_change is kept fixed, so any relationship between
                                         // b_change and etouch is broken
    diff := mean of b_change where permuted etouch = 0  -  mean of b_change where permuted etouch = 1
    if diff >= observed
        count := count + 1
end

pvalue := count / LARGENUMBER

return (pvalue)

 

 

R-Code (Permutation version):

 

# The following function does all the work and returns the result of the permutation experiment

> mean_diff_perm_fun <- function(data, m){
+   # observed difference in the group means
+   mean_diff <- mean(data[,"b_change"][data[,"etouch"]==0]) - mean(data[,"b_change"][data[,"etouch"]==1])
+   # first permutation: shuffle the etouch labels while keeping b_change fixed
+   pdata <- data.frame("b_change"=data[,"b_change"], "etouch"=sample(data[,"etouch"], length(data[,"etouch"]), replace = FALSE))
+   pmean_diff <- mean(pdata[,"b_change"][pdata[,"etouch"]==0]) - mean(pdata[,"b_change"][pdata[,"etouch"]==1])
+   count <- (pmean_diff >= mean_diff)
+   # remaining m-1 permutations
+   for (i in 1:(m-1))
+   {
+     newpdata <- data.frame("b_change"=data[,"b_change"], "etouch"=sample(data[,"etouch"], length(data[,"etouch"]), replace = FALSE))
+     newpmean_diff <- mean(newpdata[,"b_change"][newpdata[,"etouch"]==0]) - mean(newpdata[,"b_change"][newpdata[,"etouch"]==1])
+     count <- count + (newpmean_diff >= mean_diff)
+     pmean_diff <- c(pmean_diff, newpmean_diff)
+   }
+   pvalue <- count/m
+   list("permuted_mean_diff"=pmean_diff, "mean_diff"=mean_diff, "pvalue"=pvalue)
+ }

 

> # call the function above with m = 999 (i.e. 999 permutations)

> mean_diff_perm_output <- mean_diff_perm_fun(election_subset,999)

> # the observed difference in the means is

> mean_diff_perm_output$mean_diff

[1] 0.01646996

> # the p-value is

> mean_diff_perm_output$pvalue

[1] 0.03003003

 

The permutation p-value is still small (although not as small as the bootstrap-based value of 0.008008008). This again suggests strong evidence against the null hypothesis that the difference in the means of the two groups (etouch = 0 and etouch = 1) is zero, i.e. that the observed difference is unlikely to be due to chance alone.

 

# a plot of the null distribution follows

> hist(mean_diff_perm_output$permuted_mean_diff, breaks=30, freq=FALSE, plot=TRUE)
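For comparison, the same permutation distribution and p-value can be generated more compactly with replicate(). A sketch under the same assumptions (election_subset with b_change and etouch columns):

obs_diff <- mean(election_subset$b_change[election_subset$etouch == 0]) -
            mean(election_subset$b_change[election_subset$etouch == 1])
perm_diffs <- replicate(999, {
  shuffled <- sample(election_subset$etouch)              # permute the labels without replacement
  mean(election_subset$b_change[shuffled == 0]) -
    mean(election_subset$b_change[shuffled == 1])
})
mean(perm_diffs >= obs_diff)                              # one-sided empirical p-value
hist(perm_diffs, breaks = 30, freq = FALSE)               # null distribution, as above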

 


------------------------------------------------------------------------------------------------------------

 

Title:

 

Analysis of Variance (ANOVA)

Definition and Motivation:

 

 


Pseudo-Code:
 

 

 

R-Code:


------------------------------------------------------------------------------------------------------------

 

Title:

 

Partial Correlation

Definition and Motivation:

 

 

 

 


> pcor_fun <- function(data){
+   conc <- solve(var(data))                 # concentration (inverse covariance) matrix
+   resid.sd <- 1/sqrt(diag(conc))
+   partialcor <- sweep(sweep(conc, 1, resid.sd, "*"), 2, resid.sd, "*")
+   # note: the conventional partial correlation is the negative of each
+   # off-diagonal entry of this scaled concentration matrix
+   return(partialcor)
+ }

> pcor_fun(election_subset)

 

 

                   etouch  median_income    hispanic  v_changepc    b_change
etouch          1.0000000    -0.29885709 -0.13328091   0.1322710  0.07162380
median_income  -0.2988571     1.00000000  0.06206307  -0.2267372  0.36928376
hispanic       -0.1332809     0.06206307   1.0000000   0.2444096  0.25903921
v_changepc      0.1322710    -0.22673721  0.24440957   1.0000000  0.04788487
b_change        0.0716238     0.36928376  0.25903921  0.04788487  1.00000000

 

 


Pseudo-Code:
 

 

 

R-Code:

 

# storage for 1000 bootstrapped partial correlations of each predictor with b_change

> pcor.etouch <- vector(mode="numeric", length=1000)

> pcor.median_income <- vector(mode="numeric", length=1000)

> pcor.hispanic <- vector(mode="numeric", length=1000)

> pcor.v_changepc <- vector(mode="numeric", length=1000)

# resample the 67 counties with replacement and recompute the partial correlations

> for(b in 1:1000) {

+   data.curr <- election_subset[sample(1:67, 67, replace=T), ]

+   tempres <- pcor_fun(data.curr)

+   pcor.etouch[b] <- tempres["b_change", "etouch"]

+   pcor.median_income[b] <- tempres["b_change", "median_income"]

+   pcor.hispanic[b] <- tempres["b_change", "hispanic"]

+   pcor.v_changepc[b] <- tempres["b_change", "v_changepc"]

+ }

 

Now we have four vectors (one per independent variable), each holding 1000 bootstrapped partial correlations with b_change (the dependent variable). Printing out the first 10 values in the pcor.etouch vector:

 

> pcor.etouch[1:10]

 [1]  0.124255924 -0.006567401  0.009424238  0.048729694  0.101077880  0.177484922  0.016398538  0.098933451

 [9]  0.273896994  0.027145799

 

> hist(pcor.etouch,probability=T)

> lines(density(pcor.etouch))

> quantile(pcor.etouch,c(.025,.975))

      2.5%      97.5%

-0.1108112  0.2565338

> quantile(pcor.etouch,c(0,0.95))

        0%        95%

-0.2495015  0.2328568

 

> hist(pcor.median_income,probability=T)

> lines(density(pcor.median_income))

> quantile(pcor.median_income,c(0,0.95))

        0%        95%

-0.1403044  0.5454804

> quantile(pcor.median_income,c(.025,.975))

     2.5%     97.5%

0.1167792 0.5900816

 

> hist(pcor.hispanic,probability=T)

> lines(density(pcor.hispanic))

> quantile(pcor.hispanic,c(.025,.975))

       2.5%       97.5%

-0.02675132  0.48105924

> quantile(pcor.hispanic,c(0,0.95))

        0%        95%

-0.2855233  0.4537965

 

> hist(pcor.v_changepc,probability=T)

> lines(density(pcor.v_changepc))

> quantile(pcor.v_changepc,c(.025,.975))

      2.5%      97.5%

-0.2385491  0.3350396

> quantile(pcor.v_changepc,c(0,0.95))

        0%        95%

-0.4213978  0.2973756
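The four sets of 2.5% and 97.5% limits above can also be collected in a single step; a minimal sketch, assuming the pcor.* vectors have been filled by the loop above:

pcor_boot <- cbind(etouch = pcor.etouch, median_income = pcor.median_income,
                   hispanic = pcor.hispanic, v_changepc = pcor.v_changepc)
apply(pcor_boot, 2, quantile, probs = c(0.025, 0.975))    # 95% percentile intervals

An interval that excludes zero points to a partial correlation with b_change that is unlikely to be due to resampling noise alone.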


------------------------------------------------------------------------------------------------------------

 

Title:

 

Linear Regression

Definition and Motivation:

 

> election_lm <- lm(b_change ~ etouch + median_income + hispanic + v_changepc, data = election_subset)

> summary(election_lm)

 

Call:

lm(formula = b_change ~ etouch + median_income + hispanic + v_changepc,

    data = election_subset)

 

Residuals:

      Min        1Q    Median        3Q       Max
-0.075622 -0.013817  0.001171  0.015295  0.061787

 

Coefficients:

                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.113e-01  2.196e-02   5.069 3.87e-06 ***
etouch         -4.692e-03  8.298e-03  -0.565  0.57383    
median_income  -1.733e-06  5.539e-07  -3.129  0.00267 ** 
hispanic       -7.116e-02  3.370e-02  -2.112  0.03874 *  
v_changepc     -2.532e-04  6.708e-04  -0.377  0.70711    

---

Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

 

Residual standard error: 0.02596 on 62 degrees of freedom

Multiple R-Squared: 0.2438,     Adjusted R-squared: 0.195

F-statistic: 4.997 on 4 and 62 DF, p-value: 0.001481

 

 

> par(mfrow=c(2,2)); plot(election_lm); par(mfrow=c(1,1))

 

 

> as.vector(election_lm$coefficients)

[1]  1.112965e-01 -4.691730e-03 -1.733254e-06 -7.116254e-02 -2.532048e-04

 

> summary(election_lm)$r.squared

[1] 0.2437796



Pseudo-Code:
 

 

 

R-Code:

 

> regr_lm_fun <- function(data, indices) {

+   data <- data[indices, ]                  # bootstrap sample of rows (counties)

+   model <- lm(b_change ~ etouch + median_income + hispanic + v_changepc, data = data)

+   coefficients(model)

+ }

> regr_lm_boot <- boot(election_subset, regr_lm_fun, R = 999)

> regr_lm_boot

 

ORDINARY NONPARAMETRIC BOOTSTRAP

 

 

Call:

boot(data = election_subset, statistic = regr_lm_fun, R = 999)

 

 

Bootstrap Statistics :

         original  bias    std. error

t1*  1.112965e-01       0           0

t2* -4.691730e-03       0           0

t3* -1.733254e-06       0           0

t4* -7.116254e-02       0           0

t5* -2.532048e-04       0           0
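As in the earlier sections, percentile confidence intervals for the individual coefficients can then be requested from boot.ci(), where the index argument selects the coefficient in the order shown above (e.g. index = 2 for etouch); a sketch:

boot.ci(boot.out = regr_lm_boot, conf = 0.95, type = "perc", index = 2)   # 95% interval for the etouch coefficient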