Chapter 4

Making Predictions

Correlation

Measuring the relation between two variables

In the last chapter, we began developing a method that would allow us to estimate the birth weight of babies based on the number of gestational days. In this chapter, we develop a more general approach to measuring the relation between two variables and estimating the value of one variable given the value of another.

Our proposed estimate in the example of birth weights was simple:

  • Find the ratio of the birth weight to gestational days for each baby in the sample.
  • Find the median of the ratios.
  • For a new baby, multiply the number of gestational days by that median. This is our estimate of the baby's birth weight.

Here are the data; the column r_bwt_gd contains the ratio of birth weight to gestational days.

baby = Table.read_table('baby.csv')
baby['r_bwt_gd'] = baby['birthwt']/baby['gest_days']
baby
birthwt gest_days mat_age mat_ht mat_pw m_smoker r_bwt_gd
120 284 27 62 100 0 0.422535
113 282 33 64 135 0 0.400709
128 279 28 64 115 1 0.458781
108 282 23 67 125 1 0.382979
136 286 25 62 93 0 0.475524
138 244 33 62 178 0 0.565574
132 245 23 65 140 0 0.538776
120 289 25 62 125 0 0.415225
143 299 30 66 136 1 0.478261
140 351 27 68 120 0 0.39886

... (1164 rows omitted)

The median of the ratios is just about 0.429 ounces per day:

np.median(baby['r_bwt_gd'])
0.42907801418439717

So we can construct our proposed estimate by multiplying the number of gestational days of each baby by 0.429. The column est_bwt contains these estimates.

baby['est_bwt'] = 0.429*baby['gest_days']
baby.drop('r_bwt_gd')
birthwt gest_days mat_age mat_ht mat_pw m_smoker est_bwt
120 284 27 62 100 0 121.836
113 282 33 64 135 0 120.978
128 279 28 64 115 1 119.691
108 282 23 67 125 1 120.978
136 286 25 62 93 0 122.694
138 244 33 62 178 0 104.676
132 245 23 65 140 0 105.105
120 289 25 62 125 0 123.981
143 299 30 66 136 1 128.271
140 351 27 68 120 0 150.579

... (1164 rows omitted)
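
The whole recipe can be packaged into a single function. This is just a sketch based on the columns above; the name estimate_birth_weight is ours and is not used elsewhere in these notes.

# Sketch: the median-ratio method, packaged as a function

def estimate_birth_weight(new_gest_days):
    ratios = baby['birthwt']/baby['gest_days']
    return np.median(ratios)*new_gest_days

estimate_birth_weight(280)    # roughly 0.429 * 280, or about 120 ounces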

Because of the way they are constructed – by multiplying the gestational days by a constant – the estimates all lie on a straight line.

bb0 = baby.select(['gest_days', 'est_bwt'])
bb0.scatter('gest_days')
plots.ylim(40,200)
plots.xlabel('gestational days')
plots.ylabel('estimated birth weight')

A natural question to ask is, "How good are these estimates?" To get a sense of the answer, let us visualize the data by drawing a scatter plot of birth weight versus gestational days. The scatter plot consists of one point for each of the 1,174 babies in the sample. The horizontal axis represents the number of gestational days and the vertical axis represents the birth weight.

The plot is generated by using a Table method called scatter. The argument is the label of the column that contains the values to be plotted on the horizontal axis. For each of the other columns in the table, a scatter diagram is produced with the corresponding variable on the vertical axis. To generate just one scatter plot, therefore, we start by selecting only the variables that we need.

gd_bwt = baby.select(['gest_days', 'birthwt'])
gd_bwt.scatter('gest_days')
plots.xlabel('gestational days')
plots.ylabel('birth weight')

How good are the estimates that we calculated? To get a sense of this, we will re-draw the scatter plot and overlay the line of estimates. This can be done by using the argument overlay=True when we call scatter.

gd_bwt_est = baby.select(['gest_days', 'birthwt', 'est_bwt'])
gd_bwt_est.scatter('gest_days', overlay=True)
plots.xlabel('gestational days')

The line appears to be roughly in the center of the scatter diagram; if we use the line for estimation, the amount of over-estimation will be comparable to the amount of under-estimation, roughly speaking. This is a natural criterion for determining a "good" straight line of estimates.

Let us see if we can formalize this idea and create a "best" straight line of estimates.

To see roughly where such a line must lie, we will start by attempting to estimate maternal pregnancy weight based on the mother's height. The heights have been measured to the nearest inch, resulting in the vertical stripes in the scatter plot.

ht_pw = baby.select(['mat_ht', 'mat_pw'])
plots.scatter(ht_pw['mat_ht'], ht_pw['mat_pw'], s=5, color='gold')
plots.xlabel('height (inches)')
plots.ylabel('pregnancy weight (pounds)')

Suppose we know that one of the women is 65 inches tall. What would be our estimate for her pregnancy weight?

We know that the point corresponding to this woman must be on the vertical strip at 65 inches. A natural estimate of her pregnancy weight is the average of the weights in that strip; the rough size of the error in this estimate will be the SD of the weights in the strip.

For a woman who is 60 inches tall, our estimate of pregnancy weight would be the average of the weights in the vertical strip at 60 inches. And so on.
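
For example, here is the estimate for the strip at 65 inches, computed directly with NumPy boolean indexing on the columns. This is just a sketch; the calculation below uses the Table method group instead.

# Sketch: estimate and rough error size for the vertical strip at 65 inches

in_strip = ht_pw['mat_ht'] == 65           # True for mothers who are 65 inches tall
np.mean(ht_pw['mat_pw'][in_strip])         # the estimate: average weight in the strip
np.std(ht_pw['mat_pw'][in_strip])          # rough size of the error in the estimate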

Here are the average pregnancy weights for all the values of the heights.

v_means = ht_pw.group('mat_ht', np.mean)
v_means
mat_ht mat_pw mean
53 110
54 156
56 93
57 98
58 112.8
59 109.208
60 109.556
61 118.556
62 120.756
63 125.894

... (9 rows omitted)

In the figure below, these averages are overlaid on the scatter plot of pregnancy weight versus height, and appear as green dots at the averages of the vertical strips. The graph of green dots is called the graph of averages. A graph of averages shows the average of the variable on the vertical axis, for each value of the variable on the horizontal axis. It can be used for estimating the variable on the vertical axis, given the variable on the horizontal.

plots.scatter(baby['mat_ht'], baby['mat_pw'], s=5, color='gold')
plots.scatter(v_means['mat_ht'], v_means['mat_pw mean'], color='g')
plots.xlabel('height (inches)')
plots.ylabel('pregnancy weight (pounds)')
plots.title('Graph of Averages')

For these two variables, the graph of averages looks roughly linear: the green dots are fairly close to a straight line for much of the scatter. That straight line is the "best" straight line for estimating pregnancy weight based on height.

To identify the line more precisely, let us examine the oval or football shaped scatter plot.

First, we note that the position of the line is a property of the shape of the scatter, and is not affected by the units in which the heights and weights were measured. Therefore, we will measure both variables in standard units.
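
Converting a variable to standard units means subtracting its average and dividing by its SD. A small helper function of our own (not used in the code that follows, which performs the conversion inline) makes the idea concrete:

# Convert an array to standard units:
# how many SDs above or below average each value is

def standard_units(arr):
    return (arr - np.mean(arr))/np.std(arr)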

Identifying the line that goes through the graph of averages

x_demo = np.random.normal(0, 1, 10000)
z_demo = np.random.normal(0, 1, 10000)
y_demo = 0.5*x_demo + np.sqrt(.75)*z_demo
plots.scatter(x_demo, y_demo, s=10)
plots.xlim(-4, 4)
plots.ylim(-4, 4)
plots.axes().set_aspect('equal')
plots.plot([-4, 4], [-4*0.5, 4*0.5], color='g', lw=1)    # green line with slope r = 0.5
plots.plot([-4,4],[-4,4], lw=1, color='r')
plots.plot([1.5,1.5], [-4,4], lw=1, color='k')
plots.xlabel('x in standard units')
plots.ylabel('y in standard units')

Here is a football shaped scatter plot. We will follow the usual convention of calling the variable along the horizontal axis $x$ and the variable on the vertical axis $y$. Both variables are in standard units. This implies that the center of the football, corresponding to the point where both variables are at their average, is the origin (0, 0).

Because of the symmetry of the figure, resulting from both variables being measured in standard units, it is natural to see whether the 45 degree line is the best line for estimation. The 45 degree line has been drawn in red. For points on the red line, the value of $x$ in standard units is equal to the value of $y$ in standard units.

Suppose the given value of $x$ is at the average, in other words at 0 on the standard units scale. The points corresponding to that value of $x$ are on the vertical strip at $x$ equal to 0 standard units. The average of the values of $y$ in that strip can be seen to be at 0, by symmetry. So the straight line of estimates should pass through (0, 0).

A careful look at the vertical strips shows that the red line does not work for estimating $y$ based on other values of $x$. For example, suppose the value of $x$ is 1.5 standard units. The black vertical line corresponds to this value of $x$. The points on the scatter whose value of $x$ is 1.5 standard units are the blue dots on the black line; their values on the vertical scale range from about -2 to about 3. It is clear from the figure that the red line does not pass through the average of these points; the red line is too steep.

To get to the average of the vertical strip at $x$ equal to 1.5 standard units, we have to come down from the red line to the center of the strip. The green line is at that point; it has been drawn by connecting the center of the strip to the point (0, 0) and then extending the line on both sides.

Because the scatter is football shaped, the graph of averages is roughly linear, and the green line picks off the centers of the vertical strips. It is the line that should be used to estimate $y$ based on $x$ when both variables are in standard units.

The slope of the red line is equal to 1. The green line is less steep, and so its slope is less than 1. Because it is sloping upwards in the figure, we have established that its slope is between 0 and 1.

Summary. When both variables are measured in standard units, the best line for estimating $y$ based on $x$ is less steep than the 45 degree line and thus has slope less than one. The discussion above was based on a scatter plot sloping upwards, and so the slope of the best line is a number between 0 and 1. For scatter diagrams sloping downwards, the slope would be a number between 0 and -1. For the slope to be close to -1 or 1, the red and green lines must be close, or in other words, the scatter diagram must be tightly clustered around a straight line.

The Correlation Coefficient

This number between -1 and 1 is called the correlation coefficient and is said to measure the correlation between the two variables.

  • The correlation coefficient $r$ is a number between $-1$ and 1.
  • When both the variables are measured in standard units, $r$ is the slope of the "best straight line" for estimating $y$ based on $x$.
  • $r$ measures the extent to which the scatter plot clusters around a straight line of positive or negative slope.

The function football takes a value of $r$ as its argument and generates a football shaped scatter plot with correlation roughly $r$. The red line is the 45 degree line $y=x$, corresponding to points that are the same number of standard units in both variables. The green line, $y = rx$, is a smoothed version of the graph of averages. Points on the green line correspond to our estimates of the variable on the vertical axis, given values on the horizontal.

Call football a few times, with different values of $r$ as the argument, and see how the football changes. Positive $r$ corresponds to positive association: above-average values of one variable are associated with above-average values of the other, and the scatter plot slopes upwards.

Notice also that the bigger the absolute value of $r$, the more clustered the points are around the green line of averages, and the closer the green line is to the red line of equal standard units.

When $r=1$ the scatter plot is perfectly linear and slopes upward. When $r=-1$, the scatter plot is perfectly linear and slopes downward. When $r=0$, the scatter plot is a formless cloud around the horizontal axis, and the variables are said to be uncorrelated.
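
The definition of football is not shown in these notes. A minimal sketch of how such a function could be written, using the same construction as the simulated scatter above, might look like this:

# Sketch of a football-style function (an assumed implementation, not the one provided)

def football(r):
    x = np.random.normal(0, 1, 10000)
    z = np.random.normal(0, 1, 10000)
    y = r*x + np.sqrt(1 - r**2)*z                  # x and y have correlation roughly r
    plots.figure(figsize=(5, 5))
    plots.scatter(x, y, s=10)
    red_slope = 1 if r >= 0 else -1
    plots.plot([-4, 4], [-4*red_slope, 4*red_slope], color='r', lw=1)   # line of equal standard units
    plots.plot([-4, 4], [-4*r, 4*r], color='g', lw=1)                   # smoothed graph of averages
    plots.xlim(-4, 4)
    plots.ylim(-4, 4)
    plots.xlabel('x in standard units')
    plots.ylabel('y in standard units')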

# Football shaped scatter, both axes in standard units
# Argument of function: correlation coefficient r
# red line: slope = 1 (or -1)
# green line: smoothed graph of averages, slope = r
# Use the green line for estimating y based on x

football(0.6)
football(0.2)
football(0)
football(-0.7)

Calculating $r$

The formula for $r$ is not apparent from our observations so far; it has a mathematical basis that is outside the scope of this class. However, the calculation is straightforward and helps us understand several of the properties of $r$.

Formula for $r$:

  • $r$ is the average of the products of the two variables, when both variables are measured in standard units.

Here are the steps in the calculation. We will apply the steps to a simple table of values of $x$ and $y$.

t = Table([np.arange(1,7,1), [2, 3, 1, 5, 2, 7]], ['x','y'])
t
x y
1 2
2 3
3 1
4 5
5 2
6 7

Based on the scatter plot, we expect that $r$ will be positive but not equal to 1.

plots.scatter(t['x'], t['y'], s=30, color='r')
plots.xlim(0, 8)
plots.ylim(0, 8)
plots.xlabel('x')
plots.ylabel('y', rotation=0)

Step 1. Convert each variable to standard units.

t['x_su'] = (t['x'] - np.mean(t['x']))/np.std(t['x'])
t['y_su'] = (t['y'] - np.mean(t['y']))/np.std(t['y'])
t
x y x_su y_su
1 2 -1.46385 -0.648886
2 3 -0.87831 -0.162221
3 1 -0.29277 -1.13555
4 5 0.29277 0.811107
5 2 0.87831 -0.648886
6 7 1.46385 1.78444

Step 2. Multiply each pair of standard units.

t['su_product'] = t['x_su']*t['y_su']
t
x y x_su y_su su_product
1 2 -1.46385 -0.648886 0.949871
2 3 -0.87831 -0.162221 0.142481
3 1 -0.29277 -1.13555 0.332455
4 5 0.29277 0.811107 0.237468
5 2 0.87831 -0.648886 -0.569923
6 7 1.46385 1.78444 2.61215

Step 3. $r$ is the average of the products computed in Step 2.

# r is the average of the products of standard units

r = np.mean(t['su_product'])
r
0.61741639718977093

As expected, $r$ is positive but not equal to 1.

The calculation shows that:

  • $r$ is a pure number; it has no units. This is because $r$ is based on standard units.
  • $r$ is unaffected by changing the units on either axis. This too is because $r$ is based on standard units.
  • $r$ is unaffected by switching the axes. Algebraically, this is because the product of standard units does not depend on which variable is called $x$ and which $y$. Geometrically, switching axes reflects the scatter plot about the line $y=x$, but does not change the amount of clustering nor the sign of the association.
plots.scatter(t['y'], t['x'], s=30, color='r')
plots.xlim(0, 8)
plots.ylim(0, 8)
plots.xlabel('y')
plots.ylabel('x', rotation=0)

The NumPy function np.corrcoef can be used to calculate $r$. The arguments are an array containing the values of $x$ and another containing the corresponding values of $y$. The call evaluates to a correlation matrix, which in this case is a 2x2 matrix indexed by $x$ and $y$. The top left element is the correlation between $x$ and $x$, and hence is 1. The top right element is the correlation between $x$ and $y$, which is equal to the correlation between $y$ and $x$ displayed on the bottom left. The bottom right element is 1, the correlation between $y$ and $y$.

np.corrcoef(t['x'], t['y'])
array([[ 1.       ,  0.6174164],
       [ 0.6174164,  1.       ]])

For the purposes of this class, correlation matrices are unnecessary. We will define our own function corr to compute $r$, based on the formula that we used above. The arguments are the name of the table and the labels of the columns containing the two variables. The function returns the mean of the products of standard units, which is $r$.

def corr(table, column_A, column_B):
    x = table[column_A]
    y = table[column_B]
    return np.mean(((x-np.mean(x))/np.std(x))*((y-np.mean(y))/np.std(y)))

Here are examples of calling the function. Notice that it gives the same answer for the correlation between $x$ and $y$ as we got by using np.corrcoef, and earlier by direct application of the formula for $r$. Notice also that the correlation between maternal age and birth weight is very low, confirming the lack of any upward or downward trend in the scatter diagram.

corr(t, 'x', 'y')
0.61741639718977093
corr(baby, 'birthwt', 'gest_days')
0.40754279338885108
corr(baby, 'birthwt', 'mat_age')
0.026982911002929499
plots.scatter(baby['mat_age'], baby['birthwt'])
plots.xlabel("mother's age")
plots.ylabel('birth weight')

Properties of Correlation

Correlation is a simple and powerful concept, but it is sometimes misused. Before using $r$, it is important to be aware of the following points.

  • Correlation only measures association. Correlation does not imply causation. Though the correlation between the weight and the math ability of children in a school district may be positive, that does not mean that doing math makes children heavier or that putting on weight improves the children's math skills. Age is a confounding variable: older children are both heavier and better at math than younger children, on average.

  • Correlation measures linear association. Variables that have strong non-linear association might have very low correlation. Here is an example of variables that have a perfect quadratic relation $y = x^2$ but have correlation equal to 0.

tsq = Table([np.arange(-4, 4.1, 0.5), np.arange(-4, 4.1, 0.5)**2],
           ['x','y'])
plots.scatter(tsq['x'], tsq['y'])
plots.xlabel('x')
plots.ylabel('y', rotation=0)
corr(tsq, 'x', 'y')
0.0
  • Outliers can have a big effect on correlation. Here is an example where a scatter plot for which $r$ is equal to 1 is turned into a plot for which $r$ is equal to 0, by the addition of just one outlying point.
t = Table([[1,2,3,4],[1,2,3,4]], ['x','y'])
plots.scatter(t['x'], t['y'], s=30, color='r')
plots.xlim(0, 6)
plots.ylim(-0.5,6)
corr(t, 'x', 'y')
1.0
t_outlier = Table([[1,2,3,4,5],[1,2,3,4,0]], ['x','y'])
plots.scatter(t_outlier['x'], t_outlier['y'], s=30, color='r')
plots.xlim(0, 6)
plots.ylim(-0.5,6)
corr(t_outlier, 'x', 'y')
0.0
  • Correlations based on aggregated data can be misleading. As an example, here are data on the Critical Reading and Math SAT scores in 2014. There is one point for each of the 50 states and one for Washington, D.C. The column Participation Rate contains the percent of high school seniors who took the test. The next three columns show the average score in the state on each portion of the test, and the final column is the average of the total scores on the test.
sat2014 = Table.read_table('sat2014.csv').sort('State')
sat2014
State Participation Rate Critical Reading Math Writing Combined
Alabama 6.7 547 538 532 1617
Alaska 54.2 507 503 475 1485
Arizona 36.4 522 525 500 1547
Arkansas 4.2 573 571 554 1698
California 60.3 498 510 496 1504
Colorado 14.3 582 586 567 1735
Connecticut 88.4 507 510 508 1525
Delaware 100 456 459 444 1359
District of Columbia 100 440 438 431 1309
Florida 72.2 491 485 472 1448

... (41 rows omitted)

The scatter diagram of Math scores versus Critical Reading scores is very tightly clustered around a straight line; the correlation is close to 0.985.

plots.scatter(sat2014['Critical Reading'], sat2014['Math'])
plots.xlabel('Critical Reading')
plots.ylabel('Math')
corr(sat2014, 'Critical Reading', 'Math')
0.98475584110674341

It is important to note that this does not reflect the strength of the relation between the Math and Critical Reading scores of students. States don't take tests – students do. The data in the table have been created by lumping all the students in each state into a single point at the average values of the two variables in that state. But not all students in the state will be at that point, as students vary in their performance. If you plot a point for each student instead of just one for each state, there will be a cloud of points around each point in the figure above. The overall picture will be more fuzzy. The correlation between the Math and Critical Reading scores of the students will be lower than the value calculated based on state averages.

Correlations based on aggregates and averages are called ecological correlations and are frequently reported. As we have just seen, they must be interpreted with care.
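
The effect is easy to reproduce in a small simulation. The numbers below are entirely made up for illustration: 50 groups of 200 "students" whose two scores share a common group effect plus individual noise.

# Synthetic illustration: student-level correlation versus group-average correlation

groups = np.repeat(np.arange(50), 200)                 # 50 groups of 200 students each
group_effect = np.random.normal(0, 1, 50)[groups]      # effect shared by everyone in a group
reading = group_effect + np.random.normal(0, 1, 10000)
math_score = group_effect + np.random.normal(0, 1, 10000)

np.corrcoef(reading, math_score)[0, 1]                 # student level: roughly 0.5

reading_means = np.array([np.mean(reading[groups == g]) for g in np.arange(50)])
math_means = np.array([np.mean(math_score[groups == g]) for g in np.arange(50)])
np.corrcoef(reading_means, math_means)[0, 1]           # group averages: close to 1

The group averages are far more tightly related than the individual scores, just as the state averages above are more tightly related than the scores of individual students.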

Serious or tongue-in-cheek?

In 2012, a paper in the respected New England Journal of Medicine examined the relation between chocolate consumption and Nobel Prizes in a group of countries. The Scientific American responded seriously; others were more relaxed. The paper included the following graph:

[Figure: scatter plot from the paper of countries' chocolate consumption against Nobel laureates per capita]

See what you think about the analysis and also about the Reuters report that said, "The p-value Messerli calculated was 0.0001. 'This means that the odds of this being due to chance are less than 1 in 10,000,' he said."

Regression

The regression line

The concepts of correlation and the "best" straight line through a scatter plot were developed in the late 1800's and early 1900's. The pioneers in the field were Sir Francis Galton, who was a cousin of Charles Darwin, and Galton's protégé Karl Pearson. Galton was interested in eugenics, and was a meticulous observer of the physical traits of parents and their offspring. Pearson, who had greater expertise than Galton in mathematics, helped turn those observations into the foundations of mathematical statistics.

The scatter plot below is of a famous dataset collected by Pearson and his colleagues in the early 1900's. It consists of the heights, in inches, of 1,078 pairs of fathers and sons.

The data are contained in the table heights, in the columns father and son respectively. In the previous section, we saw how to use the Table method scatter to draw scatter plots. Here, the method scatter in the pyplot module of matplotlib (imported as plots) is used for the same purpose. The first argument of scatter is an array containing the variable on the horizontal axis. The second argument contains the variable on the vertical axis. The optional argument s=10 sets the size of the points; the default value is s=20.

# Scatter plot using matplotlib method
heights = Table.read_table('heights.csv')
plots.scatter(heights['father'], heights['son'], s=10)
plots.xlabel("father's height")
plots.ylabel("son's height")

Notice the familiar football shape. This is characteristic of variable pairs that follow a bivariate normal distribution: the scatter plot is oval, the distribution of each variable is roughly normal, and the distribution of the variable in each vertical and horizontal strip is roughly normal as well.
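
One quick, informal check of the last point (our own, not part of the analysis that follows) is to look at the distribution of sons' heights in a single vertical strip. If the description above is accurate, the histogram will look roughly bell shaped.

# Sketch: sons' heights for fathers who are roughly 68 inches tall (one vertical strip)

strip = heights['son'][np.abs(heights['father'] - 68) < 0.5]
plots.hist(strip, bins=np.arange(58, 80, 1))
plots.xlabel("son's height, fathers about 68 inches tall")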

The correlation between the heights of the fathers and sons is about 0.5.

r = corr(heights, 'father', 'son')
r
0.50116268080759108

The regression effect

The figure below shows the scatter plot of the data when both variables are measured in standard units. As we saw earlier, the red line of equal standard units is too steep to serve well as the line of estimates of $y$ based on $x$. Rather, the estimates are on the green line, which is flatter and picks off the centers of the vertical strips.

This flattening was noticed by Galton, who had been hoping that exceptionally tall fathers would have sons who were just as exceptionally tall. However, the data were clear, and Galton realized that exceptionally tall fathers had sons who were not quite as exceptionally tall, on average. Frustrated, Galton called this phenomenon "regression to mediocrity." Because of this, the line of best fit through a scatter plot is called the regression line.

Galton also noticed that exceptionally short fathers had sons who were somewhat taller relative to their generation, on average. In general, individuals who are away from average on one variable are expected to be not quite as far away from average on the other. This is called the regression effect.

# The regression effect

f_su = (heights['father']-np.mean(heights['father']))/np.std(heights['father'])
s_su = (heights['son']-np.mean(heights['son']))/np.std(heights['son'])
plots.scatter(f_su, s_su, s=10)
plots.plot([-4, 4], [-4, 4], color='r', lw=1)
plots.plot([-4, 4], [-4*r, 4*r], color='g', lw=1)
plots.axes().set_aspect('equal')

The Table method scatter can be used with the option fit_line=True to draw the regression line through a scatter plot.

# Plotting the regression line, using Table

heights.scatter('father', fit_line=True)
plots.xlabel("father's height")
plots.ylabel("son's height")

Karl Pearson used the observation of the regression effect in the data above, as well as in other data provided by Galton, to develop the formal calculation of the correlation coefficient $r$. That is why $r$ is sometimes called Pearson's correlation.

The equation of the regression line

As we saw in the last section for football shaped scatter plots, when the variables $x$ and $y$ are measured in standard units, the best straight line for estimating $y$ based on $x$ has slope $r$ and passes through the origin. Thus the equation of the regression line can be written as:

$~~~~~~~~~~$ estimate of $y$, in $y$-standard units $~=~$ $r ~ \times$ (the given $x$, in $x$-standard units)

That is, $$ \frac{\mbox{estimate of}~y ~-~\mbox{average of}~y}{\mbox{SD of}~y} ~=~ r \times \frac{\mbox{the given}~x ~-~\mbox{average of}~x}{\mbox{SD of}~x} $$

The equation can be converted into the original units of the data, either by rearranging this equation algebraically, or by labeling some important features of the line both in standard units and in the original units.
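
For example, multiplying both sides of the equation above by the SD of $y$, then adding the average of $y$ to both sides and collecting the terms involving $x$, gives the equation of the line in the original units:

$$ \mbox{estimate of}~y ~=~ \frac{r \cdot \mbox{SD of}~y}{\mbox{SD of}~x} \cdot x ~+~ \left( \mbox{average of}~y ~-~ \frac{r \cdot \mbox{SD of}~y}{\mbox{SD of}~x} \cdot \mbox{average of}~x \right) $$

The coefficient of $x$ and the constant term are exactly the slope and the intercept given below.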

[Figure: the regression line, with its slope and key points labeled both in standard units and in the original units]

It is a remarkable fact of mathematics that what we have observed to be true for football shaped scatter plots turns out to be true for all scatter plots, no matter what they look like.

Regardless of the shape of the scatter plot:

$$ \mbox{slope of the regression line} ~=~ \frac{r \cdot \mbox{SD of}~y}{\mbox{SD of}~x} ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ $$$$ \mbox{intercept of the regression line} ~=~ \mbox{average of}~y ~-~ \mbox{slope} \cdot \mbox{(average of}~x\mbox{)} $$

Calculation of the slope and intercept

The NumPy function np.polyfit takes as its first argument an array containing the values of the given variable, as its second argument an array containing the corresponding values of the variable to be estimated, and as its third argument deg=1, which specifies that we are fitting a straight line, that is, a polynomial of degree 1. It evaluates to an array consisting of the slope and the intercept of the regression line.

# Slope and intercept by NumPy method

np.polyfit(heights['father'], heights['son'], deg=1)
array([  0.51400591,  33.89280054])

It is worth noting that the intercept of approximately 33.89 inches is not intended as an estimate of the height of a son whose father is 0 inches tall. There is no such son and no such father. The intercept is merely a geometric or algebraic quantity that helps define the line. In general, it is not a good idea to extrapolate, that is, to make estimates outside the range of the available data. It is certainly not a good idea to extrapolate as far away as 0 is from the heights of the fathers in the study.

The calculations of the slope and intercept of the regression line are straightforward, so we will set np.polyfit aside and write our own function to compute the two quantities. The function regress takes as its arguments the name of the table, the column label of the given variable, and the column label of the variable to be estimated; it evaluates to an array containing the slope and the intercept of the regression line.

# Slope and intercept of regression line

def regress(table, column_x, column_y):
    r = corr(table, column_x, column_y)
    reg_slope = r*np.std(table[column_y])/np.std(table[column_x])
    reg_int = np.mean(table[column_y]) - reg_slope*np.mean(table[column_x])
    return np.array([reg_slope, reg_int])

A call to regress yields the same results as the call to np.polyfit made above.

slope_int_h = regress(heights, 'father', 'son')
slope_int_h
array([  0.51400591,  33.89280054])

Fitted values

We can use the regression line to get an estimate of the height of every son in the data. The estimated values of $y$ are called the fitted values. They all lie on a straight line. To calculate them, take the father's height, multiply it by the slope of the regression line, and add the intercept. In other words, calculate the height of the regression line at the given value of $x$.

# Estimates of sons' heights are on the regression line

heights['fitted value'] = slope_int_h[0]*heights['father'] + slope_int_h[1]
heights
father son fitted value
65 59.8 67.3032
63.3 63.2 66.4294
65 63.3 67.3032
65.8 62.8 67.7144
61.1 64.3 65.2986
63 64.2 66.2752
65.4 64.1 67.5088
64.7 64 67.149
66.1 64.6 67.8686
67 64 68.3312

... (1068 rows omitted)

Residuals

The amount of error in each of these regression estimates is the difference between the son's height and its estimate. These errors are called residuals. Some residuals are positive. These correspond to points that are above the regression line – points for which the regression line under-estimates $y$. Negative residuals correspond to the line over-estimating values of $y$.

# Error in the regression estimate: Distance between observed value and fitted value
# "Residual" 

heights['residual'] = heights['son'] - heights['fitted value']
heights
father son fitted value residual
65 59.8 67.3032 -7.50318
63.3 63.2 66.4294 -3.22937
65 63.3 67.3032 -4.00318
65.8 62.8 67.7144 -4.91439
61.1 64.3 65.2986 -0.998562
63 64.2 66.2752 -2.07517
65.4 64.1 67.5088 -3.40879
64.7 64 67.149 -3.14898
66.1 64.6 67.8686 -3.26859
67 64 68.3312 -4.3312

... (1068 rows omitted)

As with deviations from average, the positive and negative residuals exactly cancel each other out. So the average (and sum) of the residuals is 0.
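
A quick check on the data:

# The residuals average out to 0 (apart from tiny rounding error in the computation)

np.mean(heights['residual'])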

Error in the regression estimate

Though the average residual is 0, each individual residual is not. Some residuals might be quite far from 0. To get a sense of the amount of error in the regression estimate, we will start with a graphical description of the sense in which the regression line is the "best".

Our example is a dataset that has one point for every chapter of the novel "Little Women." The goal is to estimate the number of characters (that is, letters, punctuation marks, and so on) based on the number of periods. Recall that we attempted to do this in the very first lecture of this course.

lw = Table.read_table('little_women.csv')
# One point for each chapter
# Horizontal axis: number of periods
# Vertical axis: number of characters (as in a, b, ", ?, etc; not people in the book)

plots.scatter(lw['Periods'], lw['Characters'])
plots.xlabel('Periods')
plots.ylabel('Characters')
corr(lw, 'Periods', 'Characters')
0.92295768958548163

The scatter plot is remarkably close to linear, and the correlation is more than 0.92.

a = [131, 14431]
b = [231, 20558]
c = [392, 40935]
d = [157, 23524]
def lw_errors(slope, intercept):
    xlims = np.array([50, 450])
    plots.scatter(lw['Periods'], lw['Characters'])
    plots.plot(xlims, slope*xlims + intercept, lw=2)
    plots.plot([a[0],a[0]], [a[1], slope*a[0] + intercept], color='r', lw=2)
    plots.plot([b[0],b[0]], [b[1], slope*b[0] + intercept], color='r', lw=2)
    plots.plot([c[0],c[0]], [c[1], slope*c[0] + intercept], color='r', lw=2)
    plots.plot([d[0],d[0]], [d[1], slope*d[0] + intercept], color='r', lw=2)
    plots.xlabel('Periods')
    plots.ylabel('Characters')

The figure below shows the scatter plot and regression line, with four of the errors marked in red.

# Residuals: Deviations from the regression line

slope_int_lw = regress(lw, 'Periods', 'Characters')
lw_errors(slope_int_lw[0], slope_int_lw[1])

Had we used a different line to create our estimates, the errors would have been different. The picture below shows how big the errors would be if we were to use a particularly silly line for estimation.

# Errors: Deviations from a different line

lw_errors(-100, 50000)

Below is a line that we have used before without saying that we were using a line to create estimates. It is the horizontal line at the value "average of $y$." Suppose you were asked to estimate $y$ and were not told the value of $x$; then you would use the average of $y$ as your estimate, regardless of the chapter. In other words, you would use the flat line below.

Each error that you would make would then be a deviation from average. The rough size of these deviations is the SD of $y$.

In summary, if we use the flat line at the average of $y$ to make our estimates, the estimates will be off by the SD of $y$.

# Errors: Deviations from the flat line at the average of y

lw_errors(0, np.mean(lw['Characters']))

The Method of Least Squares

If you use any arbitrary line as your line of estimates, then some of your errors are likely to be positive and others negative. To avoid cancellation when measuring the rough size of the errors, we take the mean of the squared errors rather than the mean of the errors themselves. This is exactly analogous to our reason for looking at squared deviations from average, when we were learning how to calculate the SD.

The mean squared error of estimation using a straight line is a measure of roughly how big the squared errors are; taking the square root yields the root mean square error, which is in the same units as $y$.

Here is the second remarkable fact of mathematics in this section: the regression line minimizes the mean squared error of estimation (and hence also the root mean squared error) among all straight lines. That is why the regression line is sometimes called the "least squares line."
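
To see this numerically, here is a small function of our own (the name lw_rmse is not used elsewhere in these notes) that computes the root mean squared error of any line's estimates on the Little Women data. The regression line's error is smaller than that of the other lines drawn above.

# Root mean squared error of the estimates made by the line with the given slope and intercept

def lw_rmse(slope, intercept):
    estimates = slope*lw['Periods'] + intercept
    errors = lw['Characters'] - estimates
    return np.sqrt(np.mean(errors**2))

lw_rmse(slope_int_lw[0], slope_int_lw[1])      # the regression line: the smallest possible rmse
lw_rmse(-100, 50000)                           # the silly line drawn earlier: much larger
lw_rmse(0, np.mean(lw['Characters']))          # the flat line at the average: equal to the SD of y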

Computing the "best" line.

  • To get estimates of $y$ based on $x$, you can use any line you want.
  • Every line has a mean squared error of estimation.
  • "Better" lines have smaller errors.
  • The regression line is the unique straight line that minimizes the mean squared error of estimation among all straight lines.

Regression Functions

Regression is one of the most commonly used methods in statistics, and we will be using it frequently in the next few sections. It will be helpful to have functions that compute the various quantities connected with regression. The first two, corr and regress, were defined earlier; they are repeated below so that all of the definitions are in one place.

  • corr: the correlation coefficient
  • regress: the slope and intercept of the regression line
  • fit: the fitted value at one given value of $x$
  • fitted_values: the fitted values at all the values of $x$ in the data
  • residuals: the residuals
  • scatter_fit: scatter plot and regression line
  • residual_plot: plot of residuals versus $x$

# Correlation coefficient

def corr(table, column_A, column_B):
    x = table[column_A]
    y = table[column_B]
    x_su = (x-np.mean(x))/np.std(x)
    y_su = (y-np.mean(y))/np.std(y)
    return np.mean(x_su*y_su)

# Slope and intercept of regression line

def regress(table, column_x, column_y):
    r = corr(table, column_x, column_y)
    reg_slope = r*np.std(table[column_y])/np.std(table[column_x])
    reg_int = np.mean(table[column_y]) - reg_slope*np.mean(table[column_x])
    return np.array([reg_slope, reg_int])

# Fitted value; the regression estimate at x=new_x

def fit(table, column_x, column_y, new_x):
    slope_int = regress(table, column_x, column_y)
    return slope_int[0]*new_x + slope_int[1]

# Fitted values; the regression estimates lie on a straight line

def fitted_values(table, column_x, column_y):
    slope_int = regress(table, column_x, column_y)
    return slope_int[0]*table[column_x] + slope_int[1]

# Residuals: Deviations from the regression line

def residuals(table, column_x, column_y):
    fitted = fitted_values(table, column_x, column_y)
    return table[column_y] - fitted

# Scatter plot with fitted (regression) line

def scatter_fit(table, column_x, column_y):
    plots.scatter(table[column_x], table[column_y], s=10)
    plots.plot(table[column_x], fitted_values(table, column_x, column_y), lw=1, color='green')
    plots.xlabel(column_x)
    plots.ylabel(column_y)

# A residual plot

def residual_plot(table, column_x, column_y):
    plots.scatter(table[column_x], residuals(table, column_x, column_y), s=10)
    xm = np.min(table[column_x])
    xM = np.max(table[column_x])
    plots.plot([xm, xM], [0, 0], color='k', lw=1)
    plots.xlabel(column_x)
    plots.ylabel('residual')

Residual plots

Suppose you have carried out the regression of sons' heights on fathers' heights.

scatter_fit(heights, 'father', 'son')

It is a good idea to then draw a residual plot. This is a scatter plot of the residuals versus the values of $x$. The residual plot of a good regression looks like the one below: a formless cloud with no pattern, centered around the horizontal axis. It shows that there is no discernible non-linear pattern in the original scatter plot.

residual_plot(heights, 'father','son')

Residual plots can be useful for spotting non-linearity in the data, or other features that weaken the regression analysis. For example, consider the SAT data of the previous section, and suppose you try to estimate the Combined score based on Participation Rate.

sat2014 = Table.read_table('sat2014.csv')
sat2014
State Participation Rate Critical Reading Math Writing Combined
North Dakota 2.3 612 620 584 1816
Illinois 4.6 599 616 587 1802
Iowa 3.1 605 611 578 1794
South Dakota 2.9 604 609 579 1792
Minnesota 5.9 598 610 578 1786
Michigan 3.8 593 610 581 1784
Wisconsin 3.9 596 608 578 1782
Missouri 4.2 595 597 579 1771
Wyoming 3.3 590 599 573 1762
Kansas 5.3 591 596 566 1753

... (41 rows omitted)

plots.scatter(sat2014['Participation Rate'], sat2014['Combined'], s=10)
plots.xlabel('Participation Rate')
plots.ylabel('Combined')

The relation between the variables is clearly non-linear, but you might be tempted to fit a straight line anyway, especially if you did not first draw the scatter diagram of the data.

scatter_fit(sat2014, 'Participation Rate', 'Combined')
plots.title("A bad idea")

The points in the scatter plot start out above the regression line, then are consistently below the line, then above, then below. This pattern of non-linearity is more clearly visible in the residual plot.

residual_plot(sat2014, 'Participation Rate', 'Combined')
plots.title('Residual plot of the bad regression')

This residual plot shows a non-linear pattern, and is a signal that linear regression should not have been used for these data.

The rough size of the residuals

Let us return to the heights of the fathers and sons, and compare the estimates based on using the regression line and the flat line (in yellow) at the average height of the sons. As noted above, the rough size of the errors made using the flat line is the SD of $y$. Clearly, the regression line does a better job of estimating sons' heights than the flat line does; indeed, it minimizes the mean squared error among all lines. Thus, the rough size of the errors made using the regression line must be smaller than that of the errors made using the flat line. In other words, the SD of the residuals must be smaller than the overall SD of $y$.

ave_y = np.mean(heights['son'])
scatter_fit(heights, 'father', 'son')
plots.plot([np.min(heights['father']), np.max(heights['father'])], [ave_y, ave_y], lw=2, color='gold')

Here, once again, are the residuals in the estimation of sons' heights based on fathers' heights. Each residual is the difference between the height of a son and his estimated (or "fitted") height.

heights
father son fitted value residual
65 59.8 67.3032 -7.50318
63.3 63.2 66.4294 -3.22937
65 63.3 67.3032 -4.00318
65.8 62.8 67.7144 -4.91439
61.1 64.3 65.2986 -0.998562
63 64.2 66.2752 -2.07517
65.4 64.1 67.5088 -3.40879
64.7 64 67.149 -3.14898
66.1 64.6 67.8686 -3.26859
67 64 68.3312 -4.3312

... (1068 rows omitted)

The average of the residuals is 0. All the negative errors exactly cancel out all the positive errors.

The SD of the residuals is about 2.4 inches, while the SD of the sons' heights is about 2.8 inches. As expected, the SD of the residuals is smaller than the overall SD of $y$.

np.std(heights['residual'])
2.4358716091393409
np.std(heights['son'])
2.8148875...

Smaller by what factor? Another remarkable fact of mathematics is that no matter what the data look like, the SD of the residuals is $\sqrt{1-r^2}$ times the SD of $y$.

np.std(heights['residual'])/np.std(heights['son'])
0.86535308826267487
np.sqrt(1 - r**2)
0.86535308826267476

Average and SD of the Residuals

Regardless of the shape of the scatter plot:

average of the residuals = 0

SD of the residuals $~=~ \sqrt{1 - r^2} \cdot \mbox{SD of}~y$

The residuals are equal to the values of $y$ minus the fitted values. Since the average of the residuals is 0, the average of the fitted values must be equal to the average of $y$.
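
On the heights data, the difference between the two averages is essentially 0:

# The average of the fitted values equals the average of y (apart from rounding)

np.mean(heights['fitted value']) - np.mean(heights['son'])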

In the figure below, the fitted values are all on the green line segment. The center of that segment is at the point of averages, consistent with our calculation of the average of the fitted values.

scatter_fit(heights, 'father', 'son')

The SD of the fitted values is visibly smaller than the overall SD of $y$. The fitted values range from about 64 to about 73, whereas the values of $y$ range from about 58 to 77.

So if we take the ratio of the SD of the fitted values to the SD of $y$, we expect to get a number between 0 and 1. And indeed we do: a very special number between 0 and 1.

np.std(heights['fitted value'])/np.std(heights['son'])
0.50116268080759108
r
0.50116268080759108

Here is the final remarkable fact of mathematics in this section:

Average and SD of the Fitted Values

Regardless of the shape of the scatter plot:

average of the fitted values = the average of $y$

SD of the fitted values $~=~ |r| \cdot$ SD of $y$

Notice the absolute value of $r$ in the formula above. For the heights of fathers and sons, the correlation is positive and so there is no difference between using $r$ and using its absolute value. However, the result is true for variables that have negative correlation as well, provided we are careful to use the absolute value of $r$ instead of $r$.
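
Here is a small check using a synthetic dataset with negative association; the table neg is made up purely for this illustration.

# Synthetic check: SD of fitted values = |r| times SD of y, even when r is negative

neg = Table([np.arange(10), 50 - 3*np.arange(10) + np.random.normal(0, 2, 10)], ['x', 'y'])
r_neg = corr(neg, 'x', 'y')                               # close to -1
np.std(fitted_values(neg, 'x', 'y'))/np.std(neg['y'])     # agrees with abs(r_neg)
abs(r_neg)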

Bootstrap for Regression

Assumptions of randomness: a "regression model"

In the last section, we developed the concepts of correlation and regression as ways to describe data. We will now see how these concepts can become powerful tools for inference, when used appropriately.

Questions of inference may arise if we believe that a scatter plot reflects the underlying relation between the two variables being plotted but does not specify the relation completely. For example, a scatter plot of the heights of fathers and sons shows us the precise relation between the two variables in one particular sample of men; but we might wonder whether that relation holds true, or almost true, among all fathers and sons in the population from which the sample was drawn, or indeed among fathers and sons in general.

As always, inferential thinking begins with a careful examination of the assumptions about the data. Sets of assumptions are known as models. Sets of assumptions about randomness in roughly linear scatter plots are called regression models.

In brief, such models say that the underlying relation between the two variables is perfectly linear; this straight line is the signal that we would like to identify. However, we are not able to see the line clearly. What we see are points that are scattered around the line. In each of the points, the signal has been contaminated by random noise. Our inferential goal, therefore, is to separate the signal from the noise.

In greater detail, the regression model specifies that the points in the scatter plot are generated at random as follows.

  • The relation between $x$ and $y$ is perfectly linear. We cannot see this "true line" but Tyche can. She is the Goddess of Chance.
  • Tyche creates the scatter plot by taking points on the line and pushing them off the line vertically, either above or below, as follows:
    • For each $x$, Tyche finds the corresponding point on the true line, and then adds an error.
    • The errors are drawn at random with replacement from a population of errors that has a normal distribution with mean 0.
    • Tyche creates a point whose horizontal coordinate is $x$ and whose vertical coordinate is "the height of the true line at $x$, plus the error".
  • Finally, Tyche erases the true line from the scatter, and shows us just the scatter plot of her points.

Based on this scatter plot, how should we estimate the true line? The best line that we can put through a scatter plot is the regression line. So the regression line is a natural estimate of the true line.

The simulation below shows how close the regression line is to the true line. The first panel shows how Tyche generates the scatter plot from the true line; the second shows the scatter plot that we get to see; the third shows the regression line through the plot; and the fourth shows both the regression line and the true line.

Run the simulation a few times, with different values for the slope and intercept of the true line, and varying sample sizes. You will see that the regression line is a good estimate of the true line if the sample size is moderately large.

def draw_and_compare(true_slope, true_int, sample_size):
    x = np.random.normal(50, 5, sample_size)
    xlims = np.array([np.min(x), np.max(x)])
    eps = np.random.normal(0, 6, sample_size)
    y = (true_slope*x + true_int) + eps
    tyche = Table([x,y],['x','y'])

    plots.figure(figsize=(6, 16))
    plots.subplot(4, 1, 1)
    plots.scatter(tyche['x'], tyche['y'], s=10)
    plots.plot(xlims, true_slope*xlims + true_int, lw=1, color='gold')
    plots.title('What Tyche draws')

    plots.subplot(4, 1, 2)
    plots.scatter(tyche['x'],tyche['y'], s=10)
    plots.title('What we get to see')

    plots.subplot(4, 1, 3)
    scatter_fit(tyche, 'x', 'y')
    plots.xlabel("")
    plots.ylabel("")
    plots.title('Regression line: our estimate of true line')

    plots.subplot(4, 1, 4)
    scatter_fit(tyche, 'x', 'y')
    xlims = np.array([np.min(tyche['x']), np.max(tyche['x'])])
    plots.plot(xlims, true_slope*xlims + true_int, lw=1, color='gold')
    plots.title("Regression line and true line")

# Tyche's true line,
# the points she creates,
# and our estimate of the true line.
# Arguments: true slope, true intercept, number of points

draw_and_compare(3, -5, 20)