
Single Variable Regression

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import warnings
warnings.simplefilter("ignore")
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import patches
from datascience import *
%matplotlib inline
from sklearn.metrics import mean_squared_error
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

Suppose we have some data that looks like this on a scatter plot:

[Figure: scatter plot of the data, $x$ vs. $y$]

In Data 8 we learned that $x$ and $y$ above have some correlation coefficient $r$, which is a measure of the strength of the linear relationship between the two variables.

It looks like there is some positive linear association between $x$ and $y$ such that larger values of $x$ correspond to larger values of $y$. We therefore expect $r$ to be some positive number between 0 and 1, but not exactly 0 or exactly 1.

First, let’s convert the data (stored in the arrays x and y) to standard units. To convert a set of data points to standard units, we subtract the mean and divide by the standard deviation. This has the effect of rescaling the data so that, in standard units, they have mean 0 and standard deviation 1. Below we construct a function that does this.

$$x_{su} = \dfrac{x - \mu_x}{\sigma_x}$$
def standard_units(array):
    return (array - np.mean(array)) / np.std(array)
# Generate the example data (we will revisit how y was constructed below).
np.random.seed(42)
x = np.random.uniform(0, 10, 100)
noise = np.random.randn(100) * 4
y = 1.5 * x + noise
x_standard = standard_units(x)
y_standard = standard_units(y)
plt.figure(figsize=(8,6))
plt.scatter(x_standard, y_standard)
plt.xlabel('$x$_standard', fontsize = 14)
plt.ylabel('$y$_standard', fontsize = 14);

The scatter plot looks the same as before, except now the axes are scaled such that we measure $x$ and $y$ in standard units. Now recall that $r$ is calculated as the average of the product of two variables, when the variables are measured in standard units. Below we define a function that calculates $r$, assuming that the inputs have already been converted to standard units.

def correlation(array1, array2):
    return np.mean(array1 * array2)

What is the correlation between these two variables?

correlation(x_standard, y_standard)
0.73513640760529786

Recall from Data 8 that we use $r$ to form a line called the regression line, which makes predictions for $y$ given some $x$. Our prediction for $y$ in standard units is $r \cdot x$, where $x$ is also measured in standard units. If we want to fit this regression line in the original units, recall that the slope of this line is given by

$$\text{slope} = r \cdot \dfrac{\hat{\sigma}_y}{\hat{\sigma}_x}$$

and the intercept is given by

$$\text{intercept} = \hat{\mu}_y - \text{slope} \cdot \hat{\mu}_x$$

where $\hat{\sigma}_x$ is the observed standard deviation of a variable $x$ and $\hat{\mu}_x$ is the observed mean. Our regression line will have the form

$$y = \hat{\alpha} + \hat{\beta} x$$

where $\hat{\alpha}$ is the intercept from above and $\hat{\beta}$ the slope.
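As a quick sanity check, the two ways of writing the prediction are equivalent. The sketch below (using the arrays already defined above; it is not part of the original walkthrough) predicts in standard units with $r \cdot x_{su}$, converts back to the original units, and compares against the slope and intercept formulas:

r_check = correlation(x_standard, y_standard)

# (1) Predict in standard units, then convert the predictions back to original units.
pred_from_su = r_check * x_standard * np.std(y) + np.mean(y)

# (2) Use the slope and intercept formulas directly in original units.
slope_check = r_check * np.std(y) / np.std(x)
intercept_check = np.mean(y) - slope_check * np.mean(x)
pred_direct = slope_check * x + intercept_check

# The two sets of predictions agree (up to floating-point error).
print(np.allclose(pred_from_su, pred_direct))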

Below we plot this line on the same scatter plot from before.

r = correlation(x_standard, y_standard)
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
plt.figure(figsize=(8,6))
plt.scatter(x, y)
plt.plot(np.linspace(0, 10), slope * np.linspace(0, 10) + intercept, color='tab:orange')
plt.xlabel('$x$', fontsize = 14)
plt.ylabel('$y$', fontsize = 14);

Let’s take a closer look at the slope we found.

print('Slope: ', slope)
Slope:  1.31609070915

To generate the data above, we started with some range of $x$ values, and generated $y$ as a linear function of $x$ with some random noise added in. Take a look:

np.random.seed(42)
x = np.random.uniform(0, 10, 100)
noise = np.random.randn(100) * 4
y = 1.5 * x + noise

Notice how we defined $y$:

$$y = 1.5 \cdot x + u$$

where $u$ is some normally-distributed random noise whose average is 0. So, while there is some randomness to the data, on average the “true” slope of the relationship is 1.5. Yet we estimated it to be roughly 1.3!

This highlights the following fact: Suppose we have some random data that we believe has a linear relationship. The least-squares slope we generate from the data is an estimate of the “true” slope of that data. Because of this, the estimated slope is a random variable that depends on the data we happen to have.

To highlight this fact, let’s repeat the procedure above but with a different random seed, in order to get data with the same underlying relationship but different values.

np.random.seed(189)
x = np.random.uniform(0, 10, 100)
noise = np.random.randn(100) * 4
y = 1.5 * x + noise

# Recompute standard units for the new sample before computing r.
x_standard = standard_units(x)
y_standard = standard_units(y)
r = correlation(x_standard, y_standard)
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
plt.figure(figsize=(8,6))
plt.scatter(x, y)
plt.plot(np.linspace(0, 10), slope * np.linspace(0, 10) + intercept, color='tab:orange')
plt.xlabel('$x$', fontsize = 14)
plt.ylabel('$y$', fontsize = 14);
print('Slope: ', slope)
Slope:  1.55901512653

Now the estimated slope is roughly 1.6, even though the underlying data was still generated using a slope of 1.5. This is a very important concept that we will revisit soon.
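To see this sampling variability more directly, here is a minimal sketch (not part of the original demonstration) that regenerates the data many times from the same model and records the estimated slope each time; the estimates cluster around the true slope of 1.5 but vary from sample to sample:

np.random.seed(0)
simulated_slopes = make_array()
for _ in np.arange(1000):
    # Draw a fresh dataset from the same model: y = 1.5 * x + noise.
    x_sim = np.random.uniform(0, 10, 100)
    y_sim = 1.5 * x_sim + np.random.randn(100) * 4
    # Estimate the slope for this dataset using the Data 8 formula.
    r_sim = correlation(standard_units(x_sim), standard_units(y_sim))
    simulated_slopes = np.append(simulated_slopes, r_sim * np.std(y_sim) / np.std(x_sim))

print('Mean of simulated slopes: ', np.mean(simulated_slopes))
print('SD of simulated slopes:   ', np.std(simulated_slopes))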

Keep in mind, however, that correlation in data does not imply causation. In this example we know the true causal relationship between $x$ and $y$ because we defined it ourselves. However, when using real data you do not see the “true” relationship and thus cannot conclude causality from correlation. It could simply be that both of your variables depend on an unseen third variable and have no causal effect on one another. Or, worse, it could be (however unlikely) that a slight linear trend in two variables is a complete coincidence.

Root-Mean-Squared Error

While we could pick $\hat{\alpha}$ and $\hat{\beta}$ arbitrarily, we want to pick the values whose predictions $\hat{y}$ are as close as possible to the actual $y$ values. To achieve this, we minimize a loss function that quantifies how far off our prediction $\hat{y}$ is from $y$ for some known data points. One of the most common loss functions is the root-mean-squared error, defined as

$$\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^n \left ( y_i - \hat{y}_i \right ) ^2 }$$

where $n$ is the number of observations. The effect of this is to take the mean of the squared distance of each prediction $\hat{y}_i$ from its corresponding value $y_i$; squaring keeps the errors positive, and taking the square root at the end restores the original units of $y$.
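As a minimal sketch, the formula translates directly into NumPy (the rmse helper defined a bit later uses sklearn instead, but computes the same quantity):

# RMSE computed directly from the definition above.
def rmse_by_hand(y_actual, y_predicted):
    return np.sqrt(np.mean((y_actual - y_predicted) ** 2))

# For example, using the simulated data and the line fitted above:
print(rmse_by_hand(y, slope * x + intercept))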

Plugging the formula for $\hat{y}$ into the RMSE formula, we get

$$\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^n \left ( y_i - (\hat{\alpha} + \hat{\beta}x_i) \right ) ^2 }$$

By doing a bit of calculus to minimize the RMSE, we get the following formulas for $\hat{\alpha}$ and $\hat{\beta}$:

$$\hat{\beta} = r\frac{\hat{\sigma}_y}{\hat{\sigma}_x} \qquad \qquad \hat{\alpha} = \hat{\mu}_y - \hat{\beta}\hat{\mu}_x$$

where $r$ is the correlation between $x$ and $y$, $\hat{\sigma}_y$ is the standard deviation of $y$, $\hat{\sigma}_x$ is the standard deviation of $x$, $\hat{\mu}_y$ is the average of all our $y$ values, and $\hat{\mu}_x$ is the average of all our $x$ values. (As an aside, note the hats on our $\sigma$’s and $\mu$’s; this is because these are empirical estimates of the parameters of these distributions, rather than the true values.) These are the same values we had above!
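As a quick check (a sketch, not part of the original lecture), NumPy’s np.polyfit fits a least-squares line directly, and it agrees with the closed-form slope and intercept:

# Closed-form estimates from the formulas above.
r_hat = correlation(standard_units(x), standard_units(y))
beta_hat = r_hat * np.std(y) / np.std(x)
alpha_hat = np.mean(y) - beta_hat * np.mean(x)

# Least-squares fit of a degree-1 polynomial; returns (slope, intercept).
fit_slope, fit_intercept = np.polyfit(x, y, 1)
print(np.allclose([beta_hat, alpha_hat], [fit_slope, fit_intercept]))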

Note that our formula for $\hat{\beta}$ involves the correlation coefficient $r$ of $x$ and $y$. The correlation coefficient of two variables is a measure of the strength of a linear relationship between them. $r$ goes from -1 to 1, where $|r| = 1$ is a perfect linear relationship and $r = 0$ is no linear relationship. The formula for $r$ is

$$r = \frac{1}{n}\sum^n_{i=1} \left ( \frac{x_i - \hat{\mu}_x}{\hat{\sigma}_x} \right ) \left ( \frac{y_i - \hat{\mu}_y}{\hat{\sigma}_y} \right )$$

(Note: the form $\frac{x_i - \hat{\mu}_x}{\hat{\sigma}_x}$ of a variable $x$ is its representation in standard units, as mentioned above.)
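This is exactly what the correlation function above computes. As a small sketch, it agrees with NumPy’s built-in correlation coefficient:

r_ours = correlation(standard_units(x), standard_units(y))
r_numpy = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
print(r_ours, r_numpy)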

To calculate the RMSE, we will write an rmse function that makes use of sklearn’s mean_squared_error function.

def rmse(target, pred):
    return np.sqrt(mean_squared_error(target, pred))

To get a better idea of what the RMSE represents, the figures below show a small dataset, a proposed regression line, and the squared error that we are summing in the RMSE. The data points are

 x     y
 0     1
 1     0.5
 2    -1
 3     2
 4    -3

Here are the proposed regression lines and their errors:

[Figures se_slope_1.png and se_slope_-1.png: squared errors for proposed lines with slope 1 and slope -1]
d = Table().with_columns(
    'x', make_array(0,  1,  2,  3,  4),
    'y', make_array(1, .5, -1,  2, -3))

def plot_line_and_errors(slope, intercept):
    print("RMSE:", rmse(slope * d.column('x') + intercept, d.column('y')))
    plt.figure(figsize=(5,5))
    points = make_array(-2, 7)
    p = plt.plot(points, slope*points + intercept, color='orange', label='Proposed line')
    ax = p[0].axes
    
    predicted_ys = slope*d.column('x') + intercept
    diffs = predicted_ys - d.column('y')
    for i in np.arange(d.num_rows):
        x = d.column('x').item(i)
        y = d.column('y').item(i)
        diff = diffs.item(i)
        
        if diff > 0:
            bottom_left_x = x
            bottom_left_y = y
        else:
            bottom_left_x = x + diff
            bottom_left_y = y + diff
        
        ax.add_patch(patches.Rectangle(make_array(bottom_left_x, bottom_left_y), abs(diff), abs(diff), color='red', alpha=.3, label=('Squared error' if i == 0 else None)))
        plt.plot(make_array(x, x), make_array(y, y + diff), color='red', alpha=.6, label=('Error' if i == 0 else None))
    
    plt.scatter(d.column('x'), d.column('y'), color='blue', label='Points')
    
    plt.xlim(-4, 8)
    plt.ylim(-6, 6)
    plt.gca().set_aspect('equal', adjustable='box')
    
    plt.legend(bbox_to_anchor=(1.8, .8))
    plt.show()

interact(plot_line_and_errors, slope=widgets.FloatSlider(min=-4, max=4, step=.1), intercept=widgets.FloatSlider(min=-4, max=4, step=.1));

Econometric Single Variable Regression

The regression line can have two purposes:

  • Of particular interest to data scientists is the line’s ability to predict values of $y$ for new values of $x$ that we didn’t see before.

  • Of particular interest to economists is the line’s slope, which serves as an estimate of the “true” underlying slope of the relationship in the data.

This is why regression is such a powerful tool and forms the backbone of econometrics. If we believe that our data satisfy certain assumptions (which we won’t explore too much this lecture), then we can use the slope of the regression line to estimate the “true” relation between the variables in question and learn more about the world we live in.

In econometrics, we usually write the “true” underlying linear relationship as follows:

$$y = \alpha + \beta \cdot x + \varepsilon$$

where $y$ and $x$ are values for any arbitrary point, $\alpha$ is the intercept, $\beta$ is the slope, and $\varepsilon$ is some noise. This is entirely analogous to the code from earlier that determined the true linear relationship in the data:

y = 1.5 * x + noise

Here, $\beta = 1.5$, $\alpha = 0$, and $\varepsilon = \text{noise}$.

When we fit a regression line onto the data, we express the line as:

$$\hat{y} = \hat{\alpha} + \hat{\beta} \cdot x$$

Here, we put hats over the slope and intercept terms because they are estimates of the true slope and intercept. Similarly, we put a hat over $y$ because this is the $y$ value that the regression line predicts.

Notice how the noise term $\varepsilon$ does not appear in the expression for the regression line. This is because the noise term is a random variable that has no relation with $x$, and is thus impossible to predict from the data. Furthermore, the noise term has a mean value of 0, so on average we don’t expect the noise term to have any impact on the underlying trends of the data.

For the Data 8 demonstration above, we forced these conditions to be true. However, with real data these are assumptions that we have to make, and this is something that econometricians spend a lot of time thinking about.

Years of Schooling and Earnings

Consider a case where we want to study how years of schooling relate to a person’s earnings. This should be of particular interest to college students. Below we import a dataset that has the hourly wage, years of schooling, and other information on thousands of people sampled in the March 2012 Current Population Survey.

cps = Table.read_table('cps.csv')
cps

We want to consider a person’s wage and years of schooling. But first, we will convert wage to log wage. Wage is a variable that we would expect to increase proportionally (that is, exponentially) with changes in years of schooling, and so we usually take the natural log of wage instead. Below we plot log wage and years of schooling for the CPS data.
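To see why the log is convenient here, note that equal percentage changes in wage become equal additive changes in log wage. A minimal numeric sketch with made-up wage values:

# A 10% raise adds the same amount to log wage regardless of the wage level.
wages = np.array([10, 20, 40, 80])
print(np.log(wages * 1.10) - np.log(wages))   # each difference equals log(1.10) ≈ 0.0953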

educ = cps.column('educ')
logwage = cps.column('logwage')

plt.figure(figsize=(8,6))
plt.scatter(educ, logwage)
plt.xlabel('Years of Education', fontsize = 14)
plt.ylabel('Log Wage', fontsize = 14)
plt.title('Log Wage vs. Years of Education', size = 20);

Now let’s fit a least-squares regression line onto this data. First, we’ll do it manually in the Data 8 style above.

educ_standard = standard_units(educ)
logwage_standard = standard_units(logwage)

r = correlation(logwage_standard, educ_standard)
slope = r * np.std(logwage) / np.std(educ)
intercept = np.mean(logwage) - slope * np.mean(educ)
plt.figure(figsize=(8,6))
plt.scatter(educ, logwage)
plt.plot(np.linspace(0, 20), slope * np.linspace(0, 20) + intercept, color='tab:orange')
plt.xlabel('Years of Education', fontsize = 14)
plt.ylabel('Log Wage', fontsize = 14)
plt.title('Log Wage vs. Years of Education', size = 20);
print('Slope: ', slope)
print('Intercept: ', intercept)
Slope:  0.107811339248
Intercept:  1.47228767301

So from the very simple and straightforward model above, we estimate a slope of roughly 0.1, meaning we might expect that a one-year increase in schooling is associated with roughly a 10% increase in wage, on average.

We can also see that we have a non-zero intercept term. We should be careful how we interpret this term; from a strictly mathematical point of view, the intercept represents the expected value of $y$ (in this case log wage) when $x = 0$. However, in economics sometimes it makes no sense for $x$ to be 0, and so we cannot use the above interpretation. We won’t go into detail this lecture, but regardless of whether the intercept is interpretable, we almost always want to include it.
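Since statsmodels is already imported, we can also sketch the same regression with sm.OLS as a cross-check; it should recover essentially the same intercept and slope as the manual calculation (treat this as a preview of the statsmodels workflow used in the bootstrap below):

# The same regression of log wage on education, fit with statsmodels.
cps_model = sm.OLS(logwage, sm.add_constant(educ)).fit()
print(cps_model.params)   # first entry is the intercept, second is the slope on educ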

Uncertainty in $\hat{\beta}$

We mentioned earlier that the slope we estimate from regression is exactly that: an estimate of the “true” underlying slope. Because of this, the estimate $\hat{\beta}$ is a random variable that depends on the underlying data.

Let’s assume there is the following true linear relation between log wage and years of schooling,

$$\text{logwage} = \alpha + \beta \cdot \text{years of schooling} + \varepsilon$$

and we try to estimate $\alpha$ and $\beta$.

If our data are “well-behaved”, then even though there is uncertainty in our estimate $\hat{\beta}$, on average $\hat{\beta}$ will be $\beta$; that is to say that the expectation of $\hat{\beta}$ is $\beta$. Additionally, if our data are “well-behaved”, then $\hat{\beta}$ has some normal distribution with mean $\beta$. We won’t worry too much about what assumptions need to be satisfied to make the data “well-behaved”.

You can think of each person as an observation of these variables, and using a sample of people we can estimate the relationship between the two variables. However, due to the noise term and the fact that we only have a finite sample of people, the true relationship is always hidden from us, and we can only hope to get better estimates by designing better experiments and sampling more people.

Let’s try to get an idea of how “certain” we can be of our estimate $\hat{\beta}$. We’ll do this in classic Data 8 style: bootstrapping. Using our existing sample, we’ll create new samples by bootstrapping (resampling with replacement) from the existing data. Then, for each sample, we’ll fit a line and keep the slope of that line in a list with all of the other slopes. Finally, we’ll find the standard deviation of that list of slopes.

slopes = make_array()
educ_logwage = cps.select("educ", "logwage")

np.random.seed(42)
for i in np.arange(200):
    # Resample rows with replacement, then fit a regression line to the resample.
    educ_logwage_sample = educ_logwage.sample()
    y = educ_logwage_sample.column("logwage")
    X = educ_logwage_sample.column("educ")
    model = sm.OLS(y, sm.add_constant(X)).fit()
    slopes = np.append(slopes, model.params[1])  # keep the estimated slope
    
Table().with_columns("Slopes", slopes).hist()    
print('Standard dev. of bootstrapped slopes: ', np.std(slopes))
Standard dev. of bootstrapped slopes:  0.00159484193178

Our bootstrapped approximation of the standard error, 0.00159, is pretty close to the true standard error of 0.00144. statsmodels, the package we will be using to perform regressions, uses an exact mathematical formula for the standard error, whereas we approximated this value through simulation, but the idea behind the standard error is the same.
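For reference, the statsmodels results object exposes this analytic standard error directly; a quick sketch using a single full-sample regression:

# Analytic standard errors from one regression on the full sample.
full_model = sm.OLS(cps.column('logwage'), sm.add_constant(cps.column('educ'))).fit()
print(full_model.bse)   # standard errors for the intercept and the educ slope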

Armed with a standard error, we can now form a 95% confidence interval and perform a test of significance to see if $\hat{\beta}$ is significantly different from 0.

# Using our resampled slopes
lower_bound = percentile(2.5, slopes)
upper_bound = percentile(97.5, slopes)
print('95% confidence interval: [{}, {}]'.format(lower_bound, upper_bound))
95% confidence interval: [0.10456031293885952, 0.11093735589830463]

The 95% confidence interval does not contain 0, and so $\beta$ is unlikely to be 0.
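Equivalently, we can sketch a rough test of significance using the bootstrapped standard error: if $\beta$ were actually 0, an estimate this many standard errors away from 0 would be essentially impossible under the normal approximation.

# How many (bootstrapped) standard errors away from 0 is our estimated slope?
print('Approximate test statistic: ', slope / np.std(slopes))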

Regression with a Binary Variable

A binary variable is a variable that takes on the value of 1 if some condition is true, and 0 otherwise. These are also called dummy variables or indicator variables. It might sound strange at first, but you can actually perform regression of a variable like log earnings onto a binary variable.

Let’s import a different dataset that has the following features. Some will be useful to us later.

nlsy = Table.read_table('nlsy_cleaned_small.csv')
nlsy

Now let’s visualize log earnings vs. the binary variable corresponding to whether or not an observation went to college.

coll = nlsy.column('college')
logearn = nlsy.column('log_earn_1999')

plt.figure(figsize=(8,6))
plt.scatter(coll, logearn)
plt.xlabel('College', fontsize = 14)
plt.ylabel('Log Earnings', fontsize = 14)
plt.title('Log Earnings vs. College Completion', size = 20)
plt.xticks([0,1]);
no_college = nlsy.where('college', 0).column("log_earn_1999")
has_college = nlsy.where('college', 1).column("log_earn_1999")

plt.figure(figsize=(8,6))
plt.xlabel('College', fontsize = 14)
plt.ylabel('Log Earnings', fontsize = 14)
plt.title('Log Earnings vs. College Completion', size = 20)
plt.violinplot(no_college, positions = [0], points=20, widths=0.3, showmeans=True, showextrema=True, showmedians=False)
plt.violinplot(has_college, positions = [1], points=20, widths=0.3, showmeans=True, showextrema=True, showmedians=False);

Now let’s fit a regression model:

coll_standard = standard_units(coll)
logearn_standard = standard_units(logearn)

r = correlation(logearn_standard, coll_standard)
slope = r * np.std(logearn) / np.std(coll)
intercept = np.mean(logearn) - slope * np.mean(coll)

print("y = {:.5f} * x + {:.5f}".format(slope, intercept))
y = 0.72389 * x + 10.27883

Wow! This regression implies that, on average, we expect observations who went to college to have log earnings about 0.72 higher than those who did not go to college, i.e. roughly 72% higher earnings under the usual approximation that a log difference is a percent difference.
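Strictly speaking, 0.72 is a difference in log points; as a small sketch, the exact percent difference it implies is $e^{\hat{\beta}} - 1$, which is noticeably larger (the log-point approximation is only accurate for small differences):

# Exact percent difference in earnings implied by the log difference `slope`.
print(np.exp(slope) - 1)   # roughly 1.06, i.e. about 106% higher earnings

Let’s now plot this line on the data: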

plt.figure(figsize=(8,6))
plt.scatter(coll, logearn)
plt.plot(np.linspace(0, 1), slope * np.linspace(0, 1) + intercept, color='tab:orange')
plt.xlabel('College', fontsize = 14)
plt.ylabel('Log Earnings', fontsize = 14)
plt.title('Log Earnings vs. College Completion', size = 20)
plt.xticks([0,1]);
no_college = nlsy.where('college', 0).column("log_earn_1999")
has_college = nlsy.where('college', 1).column("log_earn_1999")

plt.figure(figsize=(8,6))
plt.plot(np.linspace(0, 1), slope * np.linspace(0, 1) + intercept, color='tab:green')
plt.xlabel('College', fontsize = 14)
plt.ylabel('Log Earnings', fontsize = 14)
plt.title('Log Earnings vs. College Completion', size = 20)
plt.violinplot(no_college, positions = [0], points=20, widths=0.3, showmeans=True, showextrema=True, showmedians=False)
plt.violinplot(has_college, positions = [1], points=20, widths=0.3, showmeans=True, showextrema=True, showmedians=False);

When we perform a simple regression onto just a dummy variable, it is an important fact that $\hat{\alpha}$ is the mean value of $y$ for all observations in the sample where $x = 0$, and $\hat{\beta}$ is the difference between the mean value of $y$ for observations in the sample where $x = 1$ and observations where $x = 0$. Proving this claim is beyond our scope this week, but let’s verify it with our data:

avg_logearn_coll = np.mean(logearn[coll == 1])
avg_logearn_nocoll = np.mean(logearn[coll == 0])

print('Avg logearn for coll = 1: ', avg_logearn_coll)
print('Avg logearn for coll = 0: ', avg_logearn_nocoll)
print('Difference between the two: ', avg_logearn_coll - avg_logearn_nocoll)
Avg logearn for coll = 1:  11.0027228592
Avg logearn for coll = 0:  10.2788317347
Difference between the two:  0.723891124528
print('Intercept: ', intercept)
print('Slope: ', slope)
Intercept:  10.2788317347
Slope:  0.723891124528