### Chapter 4

# Making Predictions

# Correlation

## Measuring the relation between two variables

In the last chapter, we began developing a method that would allow us to estimate the birth weight of babies based on the number of gestational days. In this chapter, we develop a more general approach to measuring the relation between two variables and estimating the value of one variable given the value of another.

Our proposed estimate in the example of birth weights was simple:

- Find the ratio of the birth weight to gestational days for each baby in the sample.
- Find the median of the ratios.
- For a new baby, multiply the number of gestational days by that median. This is our estimate of the baby's birth weight.
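The three steps can be sketched with NumPy. The numbers below are made up for illustration; they are not the chapter's `baby` data.

```python
import numpy as np

# Hypothetical sample: gestational days and birth weights (ounces)
gest_days = np.array([260, 270, 280, 290, 300])
birth_wt = np.array([110, 116, 121, 125, 128])

# Step 1: ratio of birth weight to gestational days for each baby
ratios = birth_wt / gest_days

# Step 2: median of the ratios
median_ratio = np.median(ratios)

# Step 3: estimate for a new baby, say with 285 gestational days
estimate = median_ratio * 285
```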

Here are the data; the column `r_bwt_gd` contains the ratio of birth weight to gestational days.

```
baby = Table.read_table('baby.csv')
baby['r_bwt_gd'] = baby['birthwt']/baby['gest_days']
baby
```

The median of the ratios is just about 0.429 ounces per day:

```
np.median(baby['r_bwt_gd'])
```

So we can construct our proposed estimate by multiplying the number of gestational days of each baby by 0.429. The column `est_bwt` contains these estimates.

```
baby['est_bwt'] = 0.429*baby['gest_days']
baby.drop('r_bwt_gd')
```

Because of the way they are constructed – by multiplying the gestational days by a constant – the estimates all lie on a straight line.

```
bb0 = baby.select(['gest_days', 'est_bwt'])
bb0.scatter('gest_days')
plots.ylim(40,200)
plots.xlabel('gestational days')
plots.ylabel('estimated birth weight')
```

A natural question to ask is, "How good are these estimates?" To get a sense of the answer, let us visualize the data by drawing a *scatter plot* of birth weight versus gestational days. The scatter plot consists of one point for each of the 1,174 babies in the sample. The horizontal axis represents the number of gestational days and the vertical axis represents the birth weight.

The plot is generated by using a `Table` method called `scatter`. The argument is the label of the column that contains the values to be plotted on the horizontal axis. For each of the other columns in the table, a scatter diagram is produced with the corresponding variable on the vertical axis. To generate just one scatter plot, therefore, we start by selecting only the variables that we need.

```
gd_bwt = baby.select(['gest_days', 'birthwt'])
gd_bwt.scatter('gest_days')
plots.xlabel('gestational days')
plots.ylabel('birth weight')
```

How good are the estimates that we calculated? To get a sense of this, we will re-draw the scatter plot and overlay the line of estimates. This can be done by using the argument `overlay=True` when we call `scatter`.

```
gd_bwt_est = baby.select(['gest_days', 'birthwt', 'est_bwt'])
gd_bwt_est.scatter('gest_days', overlay=True)
plots.xlabel('gestational days')
```

The line appears to be roughly in the center of the scatter diagram; if we use the line for estimation, the amount of over-estimation will be comparable to the amount of under-estimation, roughly speaking. This is a natural criterion for determining a "good" straight line of estimates.

Let us see if we can formalize this idea and create a "best" straight line of estimates.

To see roughly where such a line must lie, we will start by attempting to estimate maternal pregnancy weight based on the mother's height. The heights have been measured to the nearest inch, resulting in the vertical stripes in the scatter plot.

```
ht_pw = baby.select(['mat_ht', 'mat_pw'])
```

```
plots.scatter(ht_pw['mat_ht'], ht_pw['mat_pw'], s=5, color='gold')
plots.xlabel('height (inches)')
plots.ylabel('pregnancy weight (pounds)')
```

Suppose we know that one of the women is 65 inches tall. What would be our estimate for her pregnancy weight?

We know that the point corresponding to this woman must be on the vertical strip at 65 inches. A natural estimate of her pregnancy weight is the average of the weights in that strip; the rough size of the error in this estimate will be the SD of the weights in the strip.

For a woman who is 60 inches tall, our estimate of pregnancy weight would be the average of the weights in the vertical strip at 60 inches. And so on.
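Strip-based estimation can be sketched directly with NumPy. The heights and weights below are made up for illustration, not the chapter's data.

```python
import numpy as np

# Hypothetical heights (nearest inch) and pregnancy weights (pounds)
ht = np.array([60, 60, 62, 65, 65, 65, 68])
pw = np.array([115., 125., 130., 135., 140., 145., 150.])

# Estimate for a woman 65 inches tall: the average weight in her vertical strip
in_strip = ht == 65
strip_mean = np.mean(pw[in_strip])   # the estimate
strip_sd = np.std(pw[in_strip])      # rough size of the error in the estimate
```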

Here are the average pregnancy weights for all the values of the heights.

```
v_means = ht_pw.group('mat_ht', np.mean)
v_means
```

In the figure below, these averages are overlaid on the scatter plot of pregnancy weight versus height, and appear as green dots at the averages of the vertical strips. The graph of green dots is called the *graph of averages*. A graph of averages shows the average of the variable on the vertical axis, for each value of the variable on the horizontal axis. It can be used for estimating the variable on the vertical axis, given the variable on the horizontal.

```
plots.scatter(baby['mat_ht'], baby['mat_pw'], s=5, color='gold')
plots.scatter(v_means['mat_ht'], v_means['mat_pw mean'], color='g')
plots.xlabel('height (inches)')
plots.ylabel('pregnancy weight (pounds)')
plots.title('Graph of Averages')
```

For these two variables, the graph of averages looks roughly linear: the green dots are fairly close to a straight line for much of the scatter. That straight line is the "best" straight line for estimating pregnancy weight based on height.

To identify the line more precisely, let us examine the oval or *football shaped* scatter plot.

First, we note that the position of the line is a property of the shape of the scatter, and is not affected by the units in which the heights and weights were measured. Therefore, we will measure both variables in standard units.

## Identifying the line that goes through the graph of averages

```
x_demo = np.random.normal(0, 1, 10000)
z_demo = np.random.normal(0, 1, 10000)
y_demo = 0.5*x_demo + np.sqrt(.75)*z_demo
plots.scatter(x_demo, y_demo, s=10)
plots.xlim(-4, 4)
plots.ylim(-4, 4)
plots.axes().set_aspect('equal')
plots.plot([-4, 4], [-4*0.6,4*0.6], color='g', lw=1)
plots.plot([-4,4],[-4,4], lw=1, color='r')
plots.plot([1.5,1.5], [-4,4], lw=1, color='k')
plots.xlabel('x in standard units')
plots.ylabel('y in standard units')
```

Here is a football shaped scatter plot. We will follow the usual convention of calling the variable along the horizontal axis $x$ and the variable on the vertical axis $y$. Both variables are in standard units. This implies that the center of the football, corresponding to the point where both variables are at their average, is the origin (0, 0).

Because of the symmetry of the figure, resulting from both variables being measured in standard units, it is natural to see whether the 45 degree line is the best line for estimation. The 45 degree line has been drawn in red. For points on the red line, the value of $x$ in standard units is equal to the value of $y$ in standard units.

Suppose the given value of $x$ is at the average, in other words at 0 on the standard units scale. The points corresponding to that value of $x$ are on the vertical strip at $x$ equal to 0 standard units. The average of the values of $y$ in that strip can be seen to be at 0, by symmetry. So the straight line of estimates should pass through (0, 0).

A careful look at the vertical strips shows that the red line does not work for estimating $y$ based on other values of $x$. For example, suppose the value of $x$ is 1.5 standard units. The black vertical line corresponds to this value of $x$. The points on the scatter whose value of $x$ is 1.5 standard units are the blue dots on the black line; their values on the vertical scale range from about 2 to about 3. It is clear from the figure that the red line does not pass through the average of these points; the red line is too steep.

To get to the average of the vertical strip at $x$ equal to 1.5 standard units, we have to come down from the red line to the center of the strip. The green line is at that point; it has been drawn by connecting the center of the strip to the point (0, 0) and then extending the line on both sides.

The scatter plot is linear, so the green line picks off the centers of the vertical strips. It is the line that should be used to estimate $y$ based on $x$ when both variables are in standard units.

The slope of the red line is equal to 1. The green line is less steep, and so its slope is less than 1. Because it is sloping upwards in the figure, we have established that its slope is between 0 and 1.

**Summary.** When both variables are measured in standard units, the best line for estimating $y$ based on $x$ is less steep than the 45 degree line and thus has slope less than one. The discussion above was based on a scatter plot sloping upwards, and so the slope of the best line is a number between 0 and 1. For scatter diagrams sloping downwards, the slope would be a number between 0 and -1. For the slope to be close to -1 or 1, the red and green lines must be close, or in other words, the scatter diagram must be tightly clustered around a straight line.

## The Correlation Coefficient

This number between -1 and 1 is called the *correlation coefficient* and is said to measure the *correlation* between the two variables.

- The correlation coefficient $r$ is a number between $-1$ and 1.
- When both the variables are measured in standard units, $r$ is the slope of the "best straight line" for estimating $y$ based on $x$.
- $r$ measures the extent to which the scatter plot clusters around a straight line of positive or negative slope.

The function `football` takes a value of $r$ as its argument and generates a football shaped scatter plot with correlation roughly $r$. The red line is the 45 degree line $y=x$, corresponding to points that are the same number of standard units in both variables. The green line, $y = rx$, is a smoothed version of the graph of averages. Points on the green line correspond to our estimates of the variable on the vertical axis, given values on the horizontal.

Call `football` a few times, with different values of $r$ as the argument, and see how the football changes. Positive $r$ corresponds to positive association: above-average values of one variable are associated with above-average values of the other, and the scatter plot slopes upwards.

Notice also that the bigger the absolute value of $r$, the more clustered the points are around the green line of averages, and the closer the green line is to the red line of equal standard units.

When $r=1$ the scatter plot is perfectly linear and slopes upward. When $r=-1$, the scatter plot is perfectly linear and slopes downward. When $r=0$, the scatter plot is a formless cloud around the horizontal axis, and the variables are said to be *uncorrelated*.
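The two extreme cases can be checked numerically with NumPy's `np.corrcoef`, which computes $r$ (the data below are synthetic, made up to be perfectly linear):

```python
import numpy as np

x = np.arange(10)

# A perfectly linear, upward-sloping relation has r = 1
r_up = np.corrcoef(x, 3*x + 7)[0, 1]

# A perfectly linear, downward-sloping relation has r = -1
r_down = np.corrcoef(x, -2*x + 5)[0, 1]
```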

```
# Football shaped scatter, both axes in standard units
# Argument of function: correlation coefficient r
# red line: slope = 1 (or -1)
# green line: smoothed graph of averages, slope = r
# Use the green line for estimating y based on x
football(0.6)
```

```
football(0.2)
```

```
football(0)
```

```
football(-0.7)
```

### Calculating $r$

The formula for $r$ is not apparent from our observations so far; it has a mathematical basis that is outside the scope of this class. However, the calculation is straightforward and helps us understand several of the properties of $r$.

**Formula for $r$**:

- $r$ is the average of the products of the two variables, when both variables are measured in standard units.

Here are the steps in the calculation. We will apply the steps to a simple table of values of $x$ and $y$.

```
t = Table([np.arange(1,7,1), [2, 3, 1, 5, 2, 7]], ['x','y'])
t
```

Based on the scatter plot, we expect that $r$ will be positive but not equal to 1.

```
plots.scatter(t['x'], t['y'], s=30, color='r')
plots.xlim(0, 8)
plots.ylim(0, 8)
plots.xlabel('x')
plots.ylabel('y', rotation=0)
```

**Step 1.** Convert each variable to standard units.

```
t['x_su'] = (t['x'] - np.mean(t['x']))/np.std(t['x'])
t['y_su'] = (t['y'] - np.mean(t['y']))/np.std(t['y'])
t
```

**Step 2.** Multiply each pair of standard units.

```
t['su_product'] = t['x_su']*t['y_su']
t
```

**Step 3.** $r$ is the average of the products computed in Step 2.

```
# r is the average of the products of standard units
r = np.mean(t['su_product'])
r
```

As expected, $r$ is positive but not equal to 1.

The calculation shows that:

- $r$ is a pure number; it has no units. This is because $r$ is based on standard units.
- $r$ is unaffected by changing the units on either axis. This too is because $r$ is based on standard units.
- $r$ is unaffected by switching the axes. Algebraically, this is because the product of standard units does not depend on which variable is called $x$ and which $y$. Geometrically, switching axes reflects the scatter plot about the line $y=x$, but does not change the amount of clustering nor the sign of the association.
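These three properties can be verified numerically. The helper `corr_arrays` below is a hypothetical array-based version of the chapter's `corr`; the rescaling factors stand in for a change of units (e.g., inches to centimeters, pounds to ounces).

```python
import numpy as np

def corr_arrays(x, y):
    # r: the mean of the products of standard units
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    return np.mean(x_su * y_su)

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 3, 1, 5, 2, 7], dtype=float)

r_original = corr_arrays(x, y)
r_rescaled = corr_arrays(2.54 * x, 16 * y)  # change of units on both axes
r_swapped = corr_arrays(y, x)               # axes switched
```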

```
plots.scatter(t['y'], t['x'], s=30, color='r')
plots.xlim(0, 8)
plots.ylim(0, 8)
plots.xlabel('y')
plots.ylabel('x', rotation=0)
```

The NumPy function `corrcoef` can be used to calculate $r$. The arguments are an array containing the values of $x$ and another containing the corresponding values of $y$. The call evaluates to a *correlation matrix*, which in this case is a 2×2 table indexed by $x$ and $y$. The top left element is the correlation between $x$ and $x$, and hence is 1. The top right element is the correlation between $x$ and $y$, which is equal to the correlation between $y$ and $x$ displayed on the bottom left. The bottom right element is 1, the correlation between $y$ and $y$.

```
np.corrcoef(t['x'], t['y'])
```

For the purposes of this class, correlation matrices are unnecessary. We will define our own function `corr` to compute $r$, based on the formula that we used above. The arguments are the name of the table and the labels of the columns containing the two variables. The function returns the mean of the products of standard units, which is $r$.

```
def corr(table, column_A, column_B):
    x = table[column_A]
    y = table[column_B]
    return np.mean(((x - np.mean(x))/np.std(x)) * ((y - np.mean(y))/np.std(y)))
```

Here are examples of calling the function. Notice that it gives the same answer for the correlation between $x$ and $y$ as we got by using `np.corrcoef`, and earlier by direct application of the formula for $r$. Notice also that the correlation between maternal age and birth weight is very low, confirming the lack of any upward or downward trend in the scatter diagram.

```
corr(t, 'x', 'y')
```

```
corr(baby, 'birthwt', 'gest_days')
```

```
corr(baby, 'birthwt', 'mat_age')
```

```
plots.scatter(baby['mat_age'], baby['birthwt'])
plots.xlabel("mother's age")
plots.ylabel('birth weight')
```

### Properties of Correlation

Correlation is a simple and powerful concept, but it is sometimes misused. Before using $r$, it is important to be aware of the following points.

Correlation only measures association. Correlation does not imply causation. Though the correlation between the weight and the math ability of children in a school district may be positive, that does not mean that doing math makes children heavier or that putting on weight improves the children's math skills. Age is a confounding variable: older children are both heavier and better at math than younger children, on average.

Correlation measures **linear** association. Variables that have strong non-linear association might have very low correlation. Here is an example of variables that have a perfect quadratic relation $y = x^2$ but have correlation equal to 0.

```
tsq = Table([np.arange(-4, 4.1, 0.5), np.arange(-4, 4.1, 0.5)**2],
            ['x', 'y'])
plots.scatter(tsq['x'], tsq['y'])
plots.xlabel('x')
plots.ylabel('y', rotation=0)
```

```
corr(tsq, 'x', 'y')
```

- Outliers can have a big effect on correlation. Here is an example in which a scatter plot with $r$ equal to 1 is turned into one with $r$ equal to 0 by the addition of just one outlying point.

```
t = Table([[1,2,3,4],[1,2,3,4]], ['x','y'])
plots.scatter(t['x'], t['y'], s=30, color='r')
plots.xlim(0, 6)
plots.ylim(-0.5,6)
```

```
corr(t, 'x', 'y')
```

```
t_outlier = Table([[1,2,3,4,5],[1,2,3,4,0]], ['x','y'])
plots.scatter(t_outlier['x'], t_outlier['y'], s=30, color='r')
plots.xlim(0, 6)
plots.ylim(-0.5,6)
```

```
corr(t_outlier, 'x', 'y')
```

- Correlations based on aggregated data can be misleading. As an example, here are data on the Critical Reading and Math SAT scores in 2014. There is one point for each of the 50 states and one for Washington, D.C. The column `Participation Rate` contains the percent of high school seniors who took the test. The next three columns show the average score in the state on each portion of the test, and the final column is the average of the total scores on the test.

```
sat2014 = Table.read_table('sat2014.csv').sort('State')
sat2014
```

The scatter diagram of Math scores versus Critical Reading scores is very tightly clustered around a straight line; the correlation is close to 0.985.

```
plots.scatter(sat2014['Critical Reading'], sat2014['Math'])
plots.xlabel('Critical Reading')
plots.ylabel('Math')
```

```
corr(sat2014, 'Critical Reading', 'Math')
```

It is important to note that this does not reflect the strength of the relation between the Math and Critical Reading scores of students. States don't take tests – students do. The data in the table have been created by lumping all the students in each state into a single point at the average values of the two variables in that state. But not all students in the state will be at that point, as students vary in their performance. If you plot a point for each student instead of just one for each state, there will be a cloud of points around each point in the figure above. The overall picture will be more fuzzy. The correlation between the Math and Critical Reading scores of the students will be lower than the value calculated based on state averages.

Correlations based on aggregates and averages are called *ecological correlations* and are frequently reported. As we have just seen, they must be interpreted with care.
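The flattening effect of aggregation can be illustrated with a small simulation. All the numbers below are made up: 50 hypothetical "states", each with 200 "students" whose two scores share a state-level component plus individual variation.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 hypothetical states, 200 students in each
n_states, n_students = 50, 200
state_level = rng.normal(500, 40, size=n_states)  # each state's overall level
reading = state_level[:, None] + rng.normal(0, 60, size=(n_states, n_students))
math = state_level[:, None] + rng.normal(0, 60, size=(n_states, n_students))

# Correlation student by student, versus on state averages
r_students = np.corrcoef(reading.ravel(), math.ravel())[0, 1]
r_states = np.corrcoef(reading.mean(axis=1), math.mean(axis=1))[0, 1]
```

Averaging within each state removes most of the student-to-student variation, so `r_states` comes out much larger than `r_students`.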

### Serious or tongue-in-cheek?

In 2012, a paper in the respected New England Journal of Medicine examined the relation between chocolate consumption and Nobel Prizes in a group of countries. The Scientific American responded seriously; others were more relaxed. The paper included the following graph:

See what you think about the analysis, and also about the Reuters report that said, "The p-value Messerli calculated was 0.0001. 'This means that the odds of this being due to chance are less than 1 in 10,000,' he said."

# Regression

## The regression line

The concepts of correlation and the "best" straight line through a scatter plot were developed in the late 1800's and early 1900's. The pioneers in the field were Sir Francis Galton, who was a cousin of Charles Darwin, and Galton's protégé Karl Pearson. Galton was interested in eugenics, and was a meticulous observer of the physical traits of parents and their offspring. Pearson, who had greater expertise than Galton in mathematics, helped turn those observations into the foundations of mathematical statistics.

The scatter plot below is of a famous dataset collected by Pearson and his colleagues in the early 1900's. It consists of the heights, in inches, of 1,078 pairs of fathers and sons.

The data are contained in the table `heights`, in the columns `father` and `son` respectively. In the previous section, we saw how to use the table method `scatter` to draw scatter plots. Here, the method `scatter` in the `pyplot` module of `matplotlib` (imported as `plots`) is used for the same purpose. The first argument of `scatter` is an array containing the variable on the horizontal axis. The second argument contains the variable on the vertical axis. The optional argument `s=10` sets the size of the points; the default value is `s=20`.

```
# Scatter plot using matplotlib method
heights = Table.read_table('heights.csv')
plots.scatter(heights['father'], heights['son'], s=10)
plots.xlabel("father's height")
plots.ylabel("son's height")
```

Notice the familiar football shape. This is characteristic of variable pairs that follow a *bivariate normal* distribution: the scatter plot is oval, the distribution of each variable is roughly normal, and the distribution of the variable in each vertical and horizontal strip is roughly normal as well.

The correlation between the heights of the fathers and sons is about 0.5.

```
r = corr(heights, 'father', 'son')
r
```

### The regression effect

The figure below shows the scatter plot of the data when both variables are measured in standard units. As we saw earlier, the red line of equal standard units is too steep to serve well as the line of estimates of $y$ based on $x$. Rather, the estimates are on the green line, which is flatter and picks off the centers of the vertical strips.

This flattening was noticed by Galton, who had been hoping that exceptionally tall fathers would have sons who were just as exceptionally tall. However, the data were clear, and Galton realized that the tall fathers have sons who are not quite as exceptionally tall, on average. Frustrated, Galton called this phenomenon "regression to mediocrity." Because of this, the line of best fit through a scatter plot is called the *regression line*.

Galton also noticed that exceptionally short fathers had sons who were somewhat taller relative to their generation, on average. In general, individuals who are away from average on one variable are expected to be not quite as far away from average on the other. This is called the *regression effect*.

```
# The regression effect
f_su = (heights['father']-np.mean(heights['father']))/np.std(heights['father'])
s_su = (heights['son']-np.mean(heights['son']))/np.std(heights['son'])
plots.scatter(f_su, s_su, s=10)
plots.plot([-4, 4], [-4, 4], color='r', lw=1)
plots.plot([-4, 4], [-4*r, 4*r], color='g', lw=1)
plots.axes().set_aspect('equal')
```

The `Table` method `scatter` can be used with the option `fit_line=True` to draw the regression line through a scatter plot.

```
# Plotting the regression line, using Table
heights.scatter('father', fit_line=True)
plots.xlabel("father's height")
plots.ylabel("son's height")
```

Karl Pearson used the observation of the regression effect in the data above, as well as in other data provided by Galton, to develop the formal calculation of the correlation coefficient $r$. That is why $r$ is sometimes called *Pearson's correlation*.

### The equation of the regression line

As we saw in the last section for football shaped scatter plots, when the variables $x$ and $y$ are measured in standard units, the best straight line for estimating $y$ based on $x$ has slope $r$ and passes through the origin. Thus the equation of the regression line can be written as:

$~~~~~~~~~~$ estimate of $y$, in $y$-standard units $~=~$ $r ~ \times$ (the given $x$, in $x$-standard units)

That is, $$ \frac{\mbox{estimate of}~y ~-~\mbox{average of}~y}{\mbox{SD of}~y} ~=~ r \times \frac{\mbox{the given}~x ~-~\mbox{average of}~x}{\mbox{SD of}~x} $$

The equation can be converted into the original units of the data, either by rearranging this equation algebraically, or by labeling some important features of the line both in standard units and in the original units.
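Rearranging the standard-units equation algebraically gives the slope and intercept in the original units: the slope is $r$ times the ratio of the SDs, and the intercept follows from the fact that the line passes through the point of averages. The sketch below checks this against NumPy's least-squares fit on synthetic data (the arrays and their parameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(66, 3, 500)                  # hypothetical fathers' heights
y = 35 + 0.5 * x + rng.normal(0, 2, 500)    # hypothetical sons' heights

# Slope and intercept derived from the standard-units equation
r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

# The least-squares fit gives the same line
fit = np.polyfit(x, y, deg=1)
```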

It is a remarkable fact of mathematics that what we have observed to be true for football shaped scatter plots turns out to be true for all scatter plots, no matter what they look like.

**Regardless of the shape of the scatter plot**:

- The slope of the regression line is $r \times \frac{\mbox{SD of}~y}{\mbox{SD of}~x}$.
- The intercept of the regression line is (average of $y$) $-$ slope $\times$ (average of $x$).

### Calculation of the slope and intercept

The NumPy function `np.polyfit` takes as its first argument an array consisting of the values of the given variable, and as its second argument an array consisting of the values of the variable to be estimated; its third argument, `deg=1`, specifies that we are fitting a straight line, that is, a polynomial of degree 1. It evaluates to an array consisting of the slope and the intercept of the regression line.

```
# Slope and intercept by NumPy method
np.polyfit(heights['father'], heights['son'], deg=1)
```

It is worth noting that the intercept of approximately 33.89 inches is *not* intended as an estimate of the height of a son whose father is 0 inches tall. There is no such son and no such father. The intercept is merely a geometric or algebraic quantity that helps define the line. In general, it is not a good idea to *extrapolate*, that is, to make estimates outside the range of the available data. It is certainly not a good idea to extrapolate as far away as 0 is from the heights of the fathers in the study.

The calculations of the slope and intercept of the regression line are straightforward, so we will set `np.polyfit` aside and write our own function to compute the two quantities. The function `regress` takes as its arguments the name of the table, the column label of the given variable, and the column label of the variable to be estimated; it evaluates to an array containing the slope and the intercept of the regression line.

```
# Slope and intercept of regression line
def regress(table, column_x, column_y):
    r = corr(table, column_x, column_y)
    reg_slope = r*np.std(table[column_y])/np.std(table[column_x])
    reg_int = np.mean(table[column_y]) - reg_slope*np.mean(table[column_x])
    return np.array([reg_slope, reg_int])
```

A call to `regress` yields the same results as the call to `np.polyfit` made above.

```
slope_int_h = regress(heights, 'father', 'son')
slope_int_h
```

## Fitted values

We can use the regression line to get an estimate of the height of every son in the data. The estimated values of $y$ are called the *fitted values*. They all lie on a straight line. To calculate them, take a father's height, multiply it by the slope of the regression line, and add the intercept. In other words, calculate the height of the regression line at the given value of $x$.

```
# Estimates of sons' heights are on the regression line
heights['fitted value'] = slope_int_h[0]*heights['father'] + slope_int_h[1]
heights
```

### Residuals

The amount of error in each of these regression estimates is the difference between the son's height and its estimate. These errors are called *residuals*. Some residuals are positive. These correspond to points that are above the regression line – points for which the regression line under-estimates $y$. Negative residuals correspond to the line over-estimating values of $y$.

```
# Error in the regression estimate: Distance between observed value and fitted value
# "Residual"
heights['residual'] = heights['son'] - heights['fitted value']
heights
```

As with deviations from average, the positive and negative residuals exactly cancel each other out. So the average (and sum) of the residuals is 0.
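This cancellation can be verified numerically. The data below are synthetic stand-ins for the heights table, and `np.polyfit` plays the role of `regress`:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(68, 2.7, 1000)                 # hypothetical fathers' heights
y = 35 + 0.5 * x + rng.normal(0, 2.3, 1000)   # hypothetical sons' heights

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

mean_residual = np.mean(residuals)   # essentially zero
```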

### Error in the regression estimate

Though the average residual is 0, each individual residual is not. Some residuals might be quite far from 0. To get a sense of the amount of error in the regression estimate, we will start with a graphical description of the sense in which the regression line is the "best".

Our example is a dataset that has one point for every chapter of the novel "Little Women." The goal is to estimate the number of characters (that is, letters, punctuation marks, and so on) based on the number of periods. Recall that we attempted to do this in the very first lecture of this course.

```
lw = Table.read_table('little_women.csv')
```

```
# One point for each chapter
# Horizontal axis: number of periods
# Vertical axis: number of characters (as in a, b, ", ?, etc; not people in the book)
plots.scatter(lw['Periods'], lw['Characters'])
plots.xlabel('Periods')
plots.ylabel('Characters')
```

```
corr(lw, 'Periods', 'Characters')
```

The scatter plot is remarkably close to linear, and the correlation is more than 0.92.

```
a = [131, 14431]
b = [231, 20558]
c = [392, 40935]
d = [157, 23524]
def lw_errors(slope, intercept):
    xlims = np.array([50, 450])
    plots.scatter(lw['Periods'], lw['Characters'])
    plots.plot(xlims, slope*xlims + intercept, lw=2)
    plots.plot([a[0], a[0]], [a[1], slope*a[0] + intercept], color='r', lw=2)
    plots.plot([b[0], b[0]], [b[1], slope*b[0] + intercept], color='r', lw=2)
    plots.plot([c[0], c[0]], [c[1], slope*c[0] + intercept], color='r', lw=2)
    plots.plot([d[0], d[0]], [d[1], slope*d[0] + intercept], color='r', lw=2)
    plots.xlabel('Periods')
    plots.ylabel('Characters')
```

The figure below shows the scatter plot and regression line, with four of the errors marked in red.

```
# Residuals: Deviations from the regression line
slope_int_lw = regress(lw, 'Periods', 'Characters')
lw_errors(slope_int_lw[0], slope_int_lw[1])
```

Had we used a different line to create our estimates, the errors would have been different. The picture below shows how big the errors would be if we were to use a particularly silly line for estimation.

```
# Errors: Deviations from a different line
lw_errors(-100, 50000)
```

Below is a line that we have used before without saying that we were using a line to create estimates. It is the horizontal line at the value "average of $y$." Suppose you were asked to estimate $y$ and *were not told the value of $x$*; then you would use the average of $y$ as your estimate, regardless of the chapter. In other words, you would use the flat line below.

Each error that you would make would then be a deviation from average. The rough size of these deviations is the SD of $y$.

In summary, if we use the flat line at the average of $y$ to make our estimates, the estimates will be off by the SD of $y$.

```
# Errors: Deviations from the flat line at the average of y
lw_errors(0, np.mean(lw['Characters']))
```
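The claim that the flat line's errors have rough size equal to the SD of $y$ is in fact exact for the root mean squared error, since the errors are just the deviations from average. A quick check on made-up character counts:

```python
import numpy as np

# Hypothetical character counts for five chapters
y = np.array([14431., 20558., 40935., 23524., 18000.])

errors = y - np.mean(y)                    # errors made by the flat line
rmse_flat = np.sqrt(np.mean(errors ** 2))  # root mean squared error

# rmse_flat is exactly the SD of y, by the definition of the SD
```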

### The Method of Least Squares

If you use any arbitrary line as your line of estimates, then some of your errors are likely to be positive and others negative. To avoid cancellation when measuring the rough size of the errors, we take the mean of the squared errors rather than the mean of the errors themselves. This is exactly analogous to our reason for looking at squared deviations from average, when we were learning how to calculate the SD.

The mean squared error of estimation using a straight line is a measure of roughly how big the squared errors are; taking the square root yields the root mean square error, which is in the same units as $y$.

Here is the second remarkable fact of mathematics in this section: the regression line minimizes the mean squared error of estimation (and hence also the root mean squared error) among all straight lines. That is why the regression line is sometimes called the "least squares line."

**Computing the "best" line.**

- To get estimates of $y$ based on $x$, you can use any line you want.
- Every line has a mean squared error of estimation.
- "Better" lines have smaller errors.

**The regression line is the unique straight line that minimizes the mean squared error of estimation among all straight lines.**
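The minimizing property is easy to check numerically: the regression line's mean squared error beats that of any other line we try, including the "silly" line used earlier. The data below are synthetic stand-ins for the periods and characters columns.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(50, 450, 47)                   # hypothetical periods per chapter
y = 87 * x + 4000 + rng.normal(0, 3000, 47)    # hypothetical character counts

def mse(slope, intercept):
    # Mean squared error of estimation for the line y = slope*x + intercept
    return np.mean((y - (slope * x + intercept)) ** 2)

reg_slope, reg_int = np.polyfit(x, y, deg=1)
mse_reg = mse(reg_slope, reg_int)

# Nearby lines, and the silly line, all do worse
```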

### Regression Functions

Regression is one of the most commonly used methods in statistics, and we will be using it frequently in the next few sections. It will be helpful to be able to call functions to compute the various quantities connected with regression. The first two of the functions below have already been defined; the rest are defined below.

- `corr`: the correlation coefficient
- `regress`: the slope and intercept of the regression line
- `fit`: the fitted value at one given value of $x$
- `fitted_values`: the fitted values at all the values of $x$ in the data
- `residuals`: the residuals
- `scatter_fit`: scatter plot and regression line
- `residual_plot`: plot of residuals versus $x$

```
# Correlation coefficient
def corr(table, column_A, column_B):
    x = table[column_A]
    y = table[column_B]
    x_su = (x - np.mean(x))/np.std(x)
    y_su = (y - np.mean(y))/np.std(y)
    return np.mean(x_su*y_su)
```

```
# Slope and intercept of regression line
def regress(table, column_x, column_y):
    r = corr(table, column_x, column_y)
    reg_slope = r*np.std(table[column_y])/np.std(table[column_x])
    reg_int = np.mean(table[column_y]) - reg_slope*np.mean(table[column_x])
    return np.array([reg_slope, reg_int])
```

```
# Fitted value; the regression estimate at x=new_x
def fit(table, column_x, column_y, new_x):
    slope_int = regress(table, column_x, column_y)
    return slope_int[0]*new_x + slope_int[1]
```

```
# Fitted values; the regression estimates lie on a straight line
def fitted_values(table, column_x, column_y):
    slope_int = regress(table, column_x, column_y)
    return slope_int[0]*table[column_x] + slope_int[1]
```

```
# Residuals: Deviations from the regression line
def residuals(table, column_x, column_y):
    fitted = fitted_values(table, column_x, column_y)
    return table[column_y] - fitted
```

```
# Scatter plot with fitted (regression) line
def scatter_fit(table, column_x, column_y):
    plots.scatter(table[column_x], table[column_y], s=10)
    plots.plot(table[column_x], fitted_values(table, column_x, column_y), lw=1, color='green')
    plots.xlabel(column_x)
    plots.ylabel(column_y)
```

```
# A residual plot
def residual_plot(table, column_x, column_y):
    plots.scatter(table[column_x], residuals(table, column_x, column_y), s=10)
    xm = np.min(table[column_x])
    xM = np.max(table[column_x])
    plots.plot([xm, xM], [0, 0], color='k', lw=1)
    plots.xlabel(column_x)
    plots.ylabel('residual')
```

## Residual plots¶

Suppose you have carried out the regression of sons' heights on fathers' heights.

```
scatter_fit(heights, 'father', 'son')
```

It is a good idea to then draw a residual plot. This is a scatter plot of the residuals versus the values of $x$. The residual plot of a good regression looks like the one below: a formless cloud with no pattern, centered around the horizontal axis. It shows that there is no discernible non-linear pattern in the original scatter plot.

```
residual_plot(heights, 'father','son')
```

Residual plots can be useful for spotting non-linearity in the data, or other features that weaken the regression analysis. For example, consider the SAT data of the previous section, and suppose you try to estimate the `Combined`

score based on `Participation Rate`

.

```
sat2014 = Table.read_table('sat2014.csv')
sat2014
```

```
plots.scatter(sat2014['Participation Rate'], sat2014['Combined'], s=10)
plots.xlabel('Participation Rate')
plots.ylabel('Combined')
```

The relation between the variables is clearly non-linear, but you might be tempted to fit a straight line anyway, especially if you did not first draw the scatter diagram of the data.

```
scatter_fit(sat2014, 'Participation Rate', 'Combined')
plots.title("A bad idea")
```

The points in the scatter plot start out above the regression line, then are consistently below the line, then above, then below. This pattern of non-linearity is more clearly visible in the residual plot.

```
residual_plot(sat2014, 'Participation Rate', 'Combined')
plots.title('Residual plot of the bad regression')
```

This residual plot shows a clear non-linear pattern, a signal that linear regression should not have been used for these data.
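The same diagnosis can be made numerically. Here is a sketch with synthetic quadratic data (plain NumPy, not the SAT data): fitting the regression line to a curved scatter leaves residuals that are positive at both ends of the $x$-range and negative in the middle, rather than a formless cloud.

```python
import numpy as np

# Synthetic, clearly non-linear data: y is a quadratic function of x.
x = np.arange(-3.0, 3.5, 0.5)
y = x**2

# Regression line via the chapter's formulas.
r = np.mean(((x - np.mean(x))/np.std(x)) * ((y - np.mean(y))/np.std(y)))
slope = r * np.std(y)/np.std(x)
intercept = np.mean(y) - slope*np.mean(x)
resid = y - (slope*x + intercept)

# Systematic sign pattern: positive at the ends, negative in the middle.
print(resid[0] > 0, resid[len(resid)//2] < 0, resid[-1] > 0)
```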

### The rough size of the residuals¶

Let us return to the heights of the fathers and sons, and compare the estimates based on the regression line with those based on the flat line (in gold) at the average height of the sons. As noted above, the rough size of the errors made using the flat line is the SD of $y$. Clearly, the regression line does a better job of estimating sons' heights than the flat line does; indeed, it minimizes the mean squared error among all lines. Thus the rough size of the errors made using the regression line must be smaller than that made using the flat line. In other words, the SD of the residuals must be smaller than the overall SD of $y$.

```
ave_y = np.mean(heights['son'])
scatter_fit(heights, 'father', 'son')
plots.plot([np.min(heights['father']), np.max(heights['father'])], [ave_y, ave_y], lw=2, color='gold')
```

Here, once again, are the residuals in the estimation of sons' heights based on fathers' heights. Each residual is the difference between the height of a son and his estimated (or "fitted") height.

```
heights
```

The average of the residuals is 0. All the negative errors exactly cancel out all the positive errors.

The SD of the residuals is about 2.4 inches, while the overall SD of the sons' heights is about 2.7 inches. As expected, the SD of the residuals is smaller than the overall SD of $y$.

```
np.std(heights['residual'])
```

```
np.std(heights['son'])
```

Smaller by what factor? Another remarkable fact of mathematics is that no matter what the data look like, the SD of the residuals is $\sqrt{1-r^2}$ times the SD of $y$.

```
np.std(heights['residual'])/np.std(heights['son'])
```

```
np.sqrt(1 - r**2)
```
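This identity holds for any data set. The sketch below (plain NumPy; the seed, sample size, and line parameters are arbitrary choices) computes the SD of the residuals directly and via $\sqrt{1-r^2}$ times the SD of $y$, and also confirms that the residuals average to 0.

```python
import numpy as np

# Randomly generated data with some linear relation plus noise.
np.random.seed(0)
x = np.random.normal(0, 1, 500)
y = 2*x + np.random.normal(0, 3, 500)

# Regression quantities via the chapter's formulas.
r = np.mean(((x - np.mean(x))/np.std(x)) * ((y - np.mean(y))/np.std(y)))
slope = r*np.std(y)/np.std(x)
intercept = np.mean(y) - slope*np.mean(x)
resid = y - (slope*x + intercept)

# The residuals average to 0, and the two SD computations agree.
print(np.mean(resid))
print(np.std(resid))
print(np.sqrt(1 - r**2) * np.std(y))
```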

### Average and SD of the Residuals¶

**Regardless of the shape of the scatter plot:**

average of the residuals = 0

SD of the residuals $~=~ \sqrt{1 - r^2} \cdot \mbox{SD of}~y$

The residuals are equal to the values of $y$ minus the fitted values. Since the average of the residuals is 0, the average of the fitted values must be equal to the average of $y$.

In the figure below, the fitted values are all on the green line segment. The center of that segment is at the point of averages, consistent with our calculation of the average of the fitted values.

```
scatter_fit(heights, 'father', 'son')
```

The SD of the fitted values is visibly smaller than the overall SD of $y$. The fitted values range from about 64 to about 73, whereas the values of $y$ range from about 58 to 77.

So if we take the ratio of the SD of the fitted values to the SD of $y$, we expect to get a number between 0 and 1. And indeed we do: a very special number between 0 and 1.

```
np.std(heights['fitted value'])/np.std(heights['son'])
```

```
r
```

Here is the final remarkable fact of mathematics in this section:

### Average and SD of the Fitted Values¶

**Regardless of the shape of the scatter plot:**

average of the fitted values = the average of $y$

SD of the fitted values $~=~ |r| \cdot$ SD of $y$

Notice the absolute value of $r$ in the formula above. For the heights of fathers and sons, the correlation is positive and so there is no difference between using $r$ and using its absolute value. However, the result is true for variables that have negative correlation as well, provided we are careful to use the absolute value of $r$ instead of $r$.
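To see the absolute value at work, here is a sketch with synthetic negatively correlated data (plain NumPy; the seed and parameters are arbitrary choices). The SD of the fitted values is $|r|$ times the SD of $y$, and the average of the fitted values equals the average of $y$, even though $r$ itself is negative.

```python
import numpy as np

# Negatively correlated synthetic data.
np.random.seed(1)
x = np.random.normal(0, 1, 500)
y = -1.5*x + np.random.normal(0, 2, 500)

# Regression quantities via the chapter's formulas.
r = np.mean(((x - np.mean(x))/np.std(x)) * ((y - np.mean(y))/np.std(y)))
slope = r*np.std(y)/np.std(x)
intercept = np.mean(y) - slope*np.mean(x)
fitted = slope*x + intercept

print(r < 0)                              # the correlation is negative
print(np.std(fitted), abs(r)*np.std(y))   # these agree
print(np.mean(fitted), np.mean(y))        # these agree
```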

# Bootstrap for Regression

## Assumptions of randomness: a "regression model"¶

In the last section, we developed the concepts of correlation and regression as ways to describe data. We will now see how these concepts can become powerful tools for inference, when used appropriately.

Questions of inference may arise if we believe that a scatter plot reflects the underlying relation between the two variables being plotted but does not specify the relation completely. For example, a scatter plot of the heights of fathers and sons shows us the precise relation between the two variables in one particular sample of men; but we might wonder whether that relation holds true, or almost true, among all fathers and sons in the population from which the sample was drawn, or indeed among fathers and sons in general.

As always, inferential thinking begins with a careful examination of the assumptions about the data. Sets of assumptions are known as *models*. Sets of assumptions about randomness in roughly linear scatter plots are called *regression models*.

In brief, such models say that the underlying relation between the two variables is perfectly linear; this straight line is the *signal* that we would like to identify. However, we are not able to see the line clearly. What we see are points scattered around the line: at each point, the signal has been contaminated by *random noise*. Our inferential goal, therefore, is to separate the signal from the noise.

In greater detail, the regression model specifies that the points in the scatter plot are generated at random as follows.

- The relation between $x$ and $y$ is perfectly linear. We cannot see this "true line" but Tyche can. She is the Goddess of Chance.
- Tyche creates the scatter plot by taking points on the line and pushing them off the line vertically, either above or below, as follows:
  - For each $x$, Tyche finds the corresponding point on the true line, and then adds an error.
  - The errors are drawn at random with replacement from a population of errors that has a normal distribution with mean 0.
  - Tyche creates a point whose horizontal coordinate is $x$ and whose vertical coordinate is "the height of the true line at $x$, plus the error".
- Finally, Tyche erases the true line from the scatter, and shows us just the scatter plot of her points.

Based on this scatter plot, how should we estimate the true line? The best line that we can put through a scatter plot is the regression line. So the regression line is a natural estimate of the true line.

The simulation below shows how close the regression line is to the true line. The first panel shows how Tyche generates the scatter plot from the true line; the second shows the scatter plot that we see; the third shows the regression line through the plot; and the fourth shows both the regression line and the true line.

Run the simulation a few times, with different values for the slope and intercept of the true line, and varying sample sizes. You will see that the regression line is a good estimate of the true line if the sample size is moderately large.

```
def draw_and_compare(true_slope, true_int, sample_size):
    x = np.random.normal(50, 5, sample_size)
    xlims = np.array([np.min(x), np.max(x)])
    eps = np.random.normal(0, 6, sample_size)
    y = (true_slope*x + true_int) + eps
    tyche = Table([x, y], ['x', 'y'])
    plots.figure(figsize=(6, 16))
    plots.subplot(4, 1, 1)
    plots.scatter(tyche['x'], tyche['y'], s=10)
    plots.plot(xlims, true_slope*xlims + true_int, lw=1, color='gold')
    plots.title('What Tyche draws')
    plots.subplot(4, 1, 2)
    plots.scatter(tyche['x'], tyche['y'], s=10)
    plots.title('What we get to see')
    plots.subplot(4, 1, 3)
    scatter_fit(tyche, 'x', 'y')
    plots.xlabel('')
    plots.ylabel('')
    plots.title('Regression line: our estimate of true line')
    plots.subplot(4, 1, 4)
    scatter_fit(tyche, 'x', 'y')
    plots.plot(xlims, true_slope*xlims + true_int, lw=1, color='gold')
    plots.title('Regression line and true line')
```

```
# Tyche's true line,
# the points she creates,
# and our estimate of the true line.
# Arguments: true slope, true intercept, number of points
draw_and_compare(3, -5, 20)
```
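The same experiment can be run without plots. This sketch (plain NumPy; the seed, noise level, and sample size are arbitrary choices mirroring the simulation above) generates points from a known true line and checks that the regression estimates come close to the true slope and intercept.

```python
import numpy as np

# Generate points from a known "true line", as Tyche would.
np.random.seed(2)
true_slope, true_int, n = 3, -5, 1000
x = np.random.normal(50, 5, n)
y = true_slope*x + true_int + np.random.normal(0, 6, n)

# Estimate the line by regression, via the chapter's formulas.
r = np.mean(((x - np.mean(x))/np.std(x)) * ((y - np.mean(y))/np.std(y)))
est_slope = r*np.std(y)/np.std(x)
est_int = np.mean(y) - est_slope*np.mean(x)

print(est_slope, est_int)  # roughly 3 and -5 for a sample this large
```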