13 Regression Analysis
Regression analysis forms a major part of the statistician's toolbox. This section discusses statistical inference for the regression coefficients.
13.1 Simple linear regression model
R can be used to study the linear relationship between two
numerical variables. Such a study is called linear regression for
historical reasons.
The basic model for linear regression is that pairs of data, $(x_i, y_i)$, are related through the equation
$$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i $$
The values of $\beta_0$ and $\beta_1$ are unknown and will be estimated from the data. The value of $\epsilon_i$ is the amount the $y$ observation differs from the straight-line model.
In order to estimate $\beta_0$ and $\beta_1$, the method of least squares is employed. That is, one finds the values of $(b_0, b_1)$ which make the differences $b_0 + b_1 x_i - y_i$ as small as possible (in some sense). To streamline notation, define $\hat y_i = b_0 + b_1 x_i$ and let $e_i = \hat y_i - y_i$ be the residual amount of difference for the $i$th observation. Then the method of least squares finds $(b_0, b_1)$ to make $\sum e_i^2$ as small as possible. This mathematical problem can be solved and yields values of
$$ b_1 = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}, \qquad \bar y = b_0 + b_1 \bar x $$
Note the latter says the line goes through the point $(\bar x, \bar y)$ and has slope $b_1$.
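These formulas are easy to check directly in R. A minimal sketch follows; the data vectors here are made up purely for illustration and the results can be compared against what lm finds:
> x = c(1, 3, 5, 7, 9)          # illustrative data, not from any study
> y = c(2.2, 3.9, 6.1, 8.2, 9.8)
> b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
> b0 = mean(y) - b1 * mean(x)   # the line passes through (x-bar, y-bar)
> c(b0, b1)                     # should agree with coef(lm(y ~ x))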
In order to make statistical inference about these values, one needs to make assumptions about the errors $\epsilon_i$. The standard assumptions are that these errors are independent normal random variables with mean 0 and common variance $\sigma^2$. If these assumptions are valid, then various statistical tests can be made, as will be illustrated below.
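One way to get a feel for these assumptions is to simulate data for which they hold exactly and see that lm recovers the true values. A small sketch, with made-up parameter values chosen just for illustration:
> x = 1:100
> e = rnorm(100, mean=0, sd=5)  # independent normal errors, common variance
> y = 2 + 3*x + e               # true intercept 2 and slope 3 (illustrative)
> coef(lm(y ~ x))               # the estimates should be near 2 and 3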
Example: Linear Regression with R
The maximum heart rate of a person is often said to be related to age by the equation
Max = 220 - Age.
Suppose this is to be empirically tested and 15 people of varying ages are tested for their maximum heart rate. The following data is found:
Age       18  23  25  35  65  54  34  56  72  19  23  42  18  39  37
Max Rate 202 186 187 180 156 169 174 172 153 199 193 174 198 183 178
In a previous section, it was shown how to use lm to fit a linear model, and the commands plot and abline to plot the data and the regression line. Recall, this could also be done with the simple.lm function. To review, we can plot the regression line as follows:
> x = c(18,23,25,35,65,54,34,56,72,19,23,42,18,39,37)
> y = c(202,186,187,180,156,169,174,172,153,199,193,174,198,183,178)
> plot(x,y) # make a plot
> abline(lm(y ~ x)) # plot the regression line
> lm(y ~ x) # the basic values of the regression analysis
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x  
   210.0485      -0.7977  
Figure 50: Regression of max heart rate on age
Or with:
> lm.result=simple.lm(x,y)
> summary(lm.result)
Call:
lm(formula = y ~ x)
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
x            -0.79773    0.06996  -11.40 3.85e-08 ***
...
The result of the lm function is of class lm, and so the plot and summary commands adapt themselves to that. The variable lm.result contains the result. We used summary to view the entire thing. To view parts of it, you can treat it like a list or, better still, use the following methods: resid for the residuals, coef for the coefficients and predict for prediction. Here are a few examples: the first gives the coefficients $b_0$ and $b_1$, the second returns the residuals, which are then summarized.
> coef(lm.result) # or use lm.result[['coef']]
(Intercept)           x 
210.0484584  -0.7977266 
> lm.res = resid(lm.result) # or lm.result[['resid']]
> summary(lm.res)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-8.926e+00 -2.538e+00  3.879e-01 -1.628e-16  3.187e+00  6.624e+00 
Once we know the model is appropriate for the data, we will begin to
identify the meaning of the numbers.
13.2 Testing the assumptions of the model
The validity of the model can be checked graphically via EDA. The assumption that the errors are i.i.d. normal random variables translates into the residuals being approximately normally distributed. They are not independent, as they add to 0, and their variance is not uniform, but they should show no serial correlations.
We can check for normality with EDA tricks: histograms, boxplots and normal plots. We can check for correlations by looking for trends in the data, with plots of the residuals vs. time and order. We can check the assumption that the errors have the same variance with plots of residuals vs. time order and fitted values.
The plot command will do these checks for us if we give it the result of the regression:
> plot(lm.result)
(It will produce 4 separate graphs unless you first tell R to place 4 on one graph with the command par(mfrow=c(2,2)).)
Figure 51: Example of plot(lm(y ~ x))
Note, this is different from plot(x,y), which produces a scatter plot. These plots have:

Residuals vs. fitted
    This plots the fitted values ($\hat y$) against the residuals. Look for spread around the line $y=0$ and no obvious trend.
Normal Q-Q plot
    The residuals are normal if this graph falls close to a straight line.
Scale-Location
    This plot shows the square root of the standardized residuals. The tallest points are the largest residuals.
Cook's distance
    This plot identifies points which have a lot of influence on the regression line.
Other ways to investigate the data could be done with the
exploratory data analysis techniques mentioned previously.
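For instance, here is a minimal sketch of such checks done by hand from the residuals, using the lm.result computed above:
> e = resid(lm.result)
> hist(e)                    # normality: look for a roughly bell-shaped histogram
> qqnorm(e); qqline(e)       # normality: the points should hug the line
> plot(fitted(lm.result), e) # equal variance: even spread, no trend
> abline(h=0)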
13.3 Statistical inference
If you are satisfied that the model fits the data, then statistical
inferences can be made. There are 3 parameters in the model:
$\sigma$, $\beta_0$ and $\beta_1$.
13.4 Estimating $\sigma^2$
Recall, $\sigma$ is the standard deviation of the error terms. If we had the exact regression line, then the error terms and the residuals would be the same, so we expect the residuals to tell us about the value of $\sigma$.
What is true is that
$$ s^2 = \frac{1}{n-2} \sum (\hat y_i - y_i)^2 = \frac{1}{n-2} \sum e_i^2 $$
is an unbiased estimator of $\sigma^2$. That is, its sampling distribution has mean $\sigma^2$. The division by $n-2$ makes this correct, so this is not quite the sample variance of the residuals. The $n-2$ intuitively comes from the fact that there are two values estimated for this problem: $\beta_0$ and $\beta_1$.
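In R this estimate is easy to compute by hand from the residuals and, assuming the heart-rate regression is stored in lm.result as above, to check against the "residual standard error" that summary reports, which it stores as sigma:
> e = resid(lm.result)
> n = length(e)              # 15 observations
> s = sqrt(sum(e^2)/(n - 2)) # divide by n-2, not n
> s
> summary(lm.result)$sigma   # the same value, as found by lm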
13.5 Inferences about $\beta_1$
The estimator $b_1$ for $\beta_1$, the slope of the regression line, is also an unbiased estimator. Its standard error is given by
$$ SE(b_1) = \frac{s}{\sqrt{\sum (x_i - \bar x)^2}} $$
where $s$ is as above.
The normalized value of $b_1$,
$$ t = \frac{b_1 - \beta_1}{SE(b_1)}, $$
has the $t$-distribution with $n-2$ degrees of freedom. Because of this, it is easy to do a hypothesis test for the slope of the regression line. If the null hypothesis is $H_0: \beta_1 = a$ against the alternative hypothesis $H_A: \beta_1 \neq a$, then one calculates the $t$ statistic
$$ t = \frac{b_1 - a}{SE(b_1)} $$
and finds the $p$-value from the $t$-distribution.
Example: Max heart rate (cont.)
Continuing our heart-rate example, we can do a test to see if the slope of $-1$ is correct. Let $H_0$ be that $\beta_1 = -1$, and $H_A$ be that $\beta_1 \neq -1$. Then we can create the test statistic and find the $p$-value by hand as follows:
> es = resid(lm.result)             # the residuals of lm.result
> b1 = (coef(lm.result))[['x']]     # the x part of the coefficients
> n = length(x)                     # the number of observations (15)
> s = sqrt( sum( es^2 ) / (n - 2) )
> SE = s/sqrt(sum((x - mean(x))^2))
> t = (b1 - (-1))/SE                # of course - (-1) = +1
> pt(t, 13, lower.tail=FALSE)       # find the right tail for this value of t
                                    # and 15 - 2 = 13 d.f.
[1] 0.0023620
The actual $p$-value is twice this, as the problem is two-sided. We see that it is unlikely that, for this data, the slope is $-1$ (which is the slope predicted by the formula 220 - Age).
R will automatically do a hypothesis test of the assumption $\beta_1 = 0$, which means no slope. See how the $p$-value is included in the output of the summary command in the column Pr(>|t|):
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
x            -0.79773    0.06996  -11.40 3.85e-08 ***
13.6 Inferences about $\beta_0$
As well, a statistical test for $\beta_0$ can be made (and is). Again, R includes the test for $\beta_0 = 0$, which tests whether the line goes through the origin. To do other tests requires a familiarity with the details.
The estimator $b_0$ for $\beta_0$ is also unbiased and has standard error given by
$$ SE(b_0) = s \sqrt{\frac{\sum x_i^2}{n \sum (x_i - \bar x)^2}} = s \sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum (x_i - \bar x)^2}} $$
Given this, the statistic
$$ t = \frac{b_0 - \beta_0}{SE(b_0)} $$
has a $t$-distribution with $n-2$ degrees of freedom.
Example: Max heart rate (cont.)
Let's check if the data supports the intercept of 220. Formally, we will test $H_0: \beta_0 = 220$ against $H_A: \beta_0 < 220$. We need to compute the value of the test statistic and then look up the one-sided $p$-value. It is similar to the previous example, and we use the previous value of $s$:
> SE = s * sqrt( sum(x^2)/( n*sum((x - mean(x))^2)))
> b0 = 210.04846               # copy from summary, or use
> t = (b0 - 220)/SE            # (coef(lm.result))[['(Intercept)']]
> pt(t, 13, lower.tail=TRUE)   # use lower tail (220 or less)
[1] 0.002015734
We would reject the value of 220 based on this $p$-value as well.
13.7 Confidence intervals
Visually, one is interested in confidence intervals. The regression line is used to predict the value of $y$ for a given $x$, or the average value of $y$ for a given $x$, and one would like to know how accurate this prediction is. This is the job of a confidence interval.
There is a subtlety between the two points above. The mean value of $y$ is subject to less variability than an individual value of $y$, and so the confidence intervals will be different, although they are both of the same form:
$$ b_0 + b_1 x \pm t^* \, SE. $$
The mean or average value of $y$ for a given $x$ is often denoted $\mu_{y|x}$ and has a standard error of
$$ SE = s \sqrt{\frac{1}{n} + \frac{(x - \bar x)^2}{\sum (x_i - \bar x)^2}} $$
where $s$ is the sample standard deviation of the residuals $e_i$.
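This SE can be computed by hand and checked against what predict reports when asked for standard errors. A sketch, reusing s and n from the example above and an age of 50 chosen just for illustration:
> x0 = 50
> SE.mean = s * sqrt(1/n + (x0 - mean(x))^2/sum((x - mean(x))^2))
> SE.mean
> predict(lm.result, data.frame(x = x0), se.fit=TRUE)$se.fit # should match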
If we are trying to predict a single value of $y$, then the SE changes ever so slightly to
$$ SE = s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar x)^2}{\sum (x_i - \bar x)^2}} $$
But this makes a big difference. The plotting of confidence intervals in R is aided by the predict function. For convenience, the function simple.lm will plot both confidence intervals if you ask it by setting show.ci=TRUE.
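With the low-level commands, predict can return either interval directly, as detailed in the R Basics section below. A minimal sketch, where the ages 30 and 50 are chosen just for illustration:
> new = data.frame(x = c(30, 50))                # column name must match the predictor
> predict(lm.result, new, interval="confidence") # for the mean response
> predict(lm.result, new, interval="prediction") # wider: for an individual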
Example: Max heart rate (cont.)
Continuing our example, to find simultaneous confidence intervals for the mean and an individual, we proceed as follows:
## call simple.lm again
> simple.lm(x,y,show.ci=TRUE,conf.level=0.90)
This produces the graph (figure 52) with both 90% confidence bands drawn. The wider set of bands is for the individual.
Figure 52: simple.lm with show.ci=TRUE
R Basics: The low-level R commands
The function
simple.lm will do a lot of the work for you,
but to really get at the regression model, you need to learn how to
access the data found by the
lm command. Here is a short
list.

To make a lm object
    First, you need to use the lm function and store the results. Suppose x and y are as above. Then
> lm.result = lm(y ~ x)
will store the results into the variable lm.result.
To view the results
    As usual, the summary method will show you most of the details.
> summary(lm.result)
... not shown ...
To plot the regression line
    You make a plot of the data, and then add a line with the abline command
> plot(x,y)
> abline(lm.result)
To access the residuals
    You can use the resid method
> resid(lm.result)
... output is not shown ...
To access the coefficients
    The coef function will return a vector of coefficients.
> coef(lm.result)
(Intercept)           x 
210.0484584  -0.7977266 
To get at the individual ones, you can refer to them by number, or name as with:
> coef(lm.result)[1]
(Intercept) 
   210.0485 
> coef(lm.result)['x']
         x 
-0.7977266 
To get the fitted values
    That is, to find $\hat y_i = b_0 + b_1 x_i$ for each $i$, we use the fitted.values command
> fitted(lm.result) # you can abbreviate to just fitted
... output is not shown ...
To get the standard errors
    The values of $s$ and $SE(b_0)$ and $SE(b_1)$ appear in the output of summary. To access them individually is possible with the right know-how. The key is that the coefficients method returns all the numbers in a matrix if you use it on the results of summary:
> coefficients(lm.result)
(Intercept)           x 
210.0484584  -0.7977266 
> coefficients(summary(lm.result))
               Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 210.0484584 2.86693893  73.26576 0.000000e+00
x            -0.7977266 0.06996281 -11.40215 3.847987e-08
To get the standard error for x, then, is as easy as taking the 2nd row and 2nd column. We can do this by number or name:
> coefficients(summary(lm.result))[2,2]
[1] 0.06996281
> coefficients(summary(lm.result))['x','Std. Error']
[1] 0.06996281
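Once the estimate and its standard error can be pulled out like this, a confidence interval for the slope of the form $b_1 \pm t^* SE(b_1)$ is a one-liner. A sketch, for a 95% interval:
> est = coefficients(summary(lm.result))['x','Estimate']
> se = coefficients(summary(lm.result))['x','Std. Error']
> tstar = qt(0.975, df = 13)   # 15 observations - 2 = 13 d.f.
> est + c(-1,1) * tstar * se   # (lower, upper)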
To get the predicted values
    We can use the predict function to get predicted values, but it is a little clunky to call. We need a data frame with column names matching the predictor or explanatory variable. In this example this is x, so to get predictions for a 50 and a 60 year old we have
> predict(lm.result,data.frame(x= c(50,60)))
       1        2 
170.1621 162.1849 
To find the confidence bands
    The confidence bands would be a chore to compute by hand. Unfortunately, it is a bit of a chore to get with the low-level commands as well. The predict method also has the ability to find the confidence bands if we learn how to ask. Generally speaking, for each value of x we want a point to plot. This is done as before with a data frame containing all the x values we want. In addition, we need to ask for the interval. There are two types: confidence, or prediction. The confidence will be for the mean, and the prediction for the individual. Let's see the output, and then go from there. This is for a 90% confidence level.
> predict(lm.result,data.frame(x=sort(x)), # as before
+ level=.9, interval="confidence") # what is new
       fit      lwr      upr
1 195.6894 192.5083 198.8705
2 195.6894 192.5083 198.8705
3 194.8917 191.8028 197.9805
... skipped ...
We see we get 3 numbers back for each value of x. (Note we sorted x first to get the proper order for plotting.) To plot the lower band, we just need the second column, which is accessed with [,2]. So the following will plot just the lower band. Notice, we make a scatterplot with the plot command, but add the confidence band with points.
> plot(x,y)
> abline(lm.result)
> ci.lwr = predict(lm.result,data.frame(x=sort(x)),
+ level=.9,interval="confidence")[,2]
> points(sort(x), ci.lwr,type="l") # or use lines()
Alternatively, we could plot a band with the curve function as follows (here [,3] picks off the upper band):
> curve(predict(lm.result,data.frame(x=x),
+ interval="confidence")[,3],add=T)
This is conceptually easier, but harder to break up, as the curve function requires a function of x to plot.
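As one more alternative (a convenience, not something simple.lm uses), the matlines function will draw the fit and both band columns returned by predict in a single call. A sketch:
> plot(x,y)
> pred = predict(lm.result, data.frame(x=sort(x)),
+ level=.9, interval="confidence")
> matlines(sort(x), pred, lty=c(1,2,2), col="black") # fit solid, bands dashed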
13.8 Problems

13.1
    The cost of a home depends on the number of bedrooms in the house. Suppose the following data is recorded for homes in a given town:

price (in thousands)  300 250 400 550 317 389 425 289 389 559
No. bedrooms            3   3   4   5   4   3   6   3   4   5

Make a scatterplot, and fit the data with a regression line on the same graph. Test the hypothesis that an extra bedroom costs $60,000 against the alternative that it costs more.
13.2
    It is well known that the more beer you drink, the more your blood alcohol level rises. Suppose we have the following data on student beer consumption:

Student    1    2    3    4    5     6    7    8    9   10
Beers      5    2    9    8    3     7    3    5    3    5
BAL     0.10 0.03 0.19 0.12 0.04 0.095 0.07 0.06 0.02 0.05

Make a scatterplot and fit the data with a regression line. Test the hypothesis that another beer raises your BAL by 0.02 percent against the alternative that it is less.
13.3
    For the same blood alcohol data, do a hypothesis test that the intercept is 0 with a two-sided alternative.
13.4
    The lapse rate is the rate at which temperature drops as you increase elevation. Some hardy students were interested in checking empirically if the lapse rate of 9.8 degrees C/km was accurate for their hiking. To investigate, they grabbed their thermometers and their Suunto wrist altimeters and found the following data on their hike:

elevation (ft)   600 1000 1250 1600 1800 2100 2500 2900
temperature (F)   56   54   56   50   47   49   47   45

Draw a scatterplot with regression line, and investigate if the lapse rate is 9.8 degrees C/km. (First, it helps to convert to the rate of change in Fahrenheit per foot, which is 5.34 degrees per 1000 feet.) Test the hypothesis that the lapse rate is 5.34 degrees per 1000 feet against the alternative that it is less than this.
Copyright © John Verzani, 2001-2. All rights reserved.