
3.2. Multiple Linear Regression

In Chapter 3.1, we learned how to use our knowledge of spans and projections to recast the problem of finding optimal model parameters for the simple linear regression model, h(x_i) = w_0 + w_1 x_i. Along the way, we defined new characters, most importantly, the design matrix

\color{#3d81f6} X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}

Here, we’ll look at how and why to extend {\color{#3d81f6} X} to include multiple input features, in what’s called multiple linear regression.


Incorporating Multiple Features

Motivation

The commute times dataset that we were reintroduced to in Chapter 3.1 has several columns in it, but we’ve only used one – departure_hour – as an input variable.

commutes_full.head()

Let’s take a look at a subset of them.

commutes_full[['departure_hour', 'day', 'day_of_month', 'minutes']].head()

For example, the first row above tells us that the first recorded commute was on a Monday, which was the 15th of that month.

Using day seems like it would be a good idea, since traffic patterns likely differ based on the day of the week. But, it’s not clear how to use strings, like 'Tue', in a linear model. We’ll address this in a moment. For now, suppose we’d like to fit a linear model that predicts commute time in minutes using both departure_hour and day_of_month (“dom” for short).

Such a hypothesis function is of the form

\underbrace{h(\text{departure hour}_i, \text{dom}_i)}_{= \text{pred. commute}_i} = w_0 + w_1 \cdot \text{departure hour}_i + w_2 \cdot \text{dom}_i

which is a plane in \mathbb{R}^3. Once we figure out how to find w_0^*, w_1^*, and w_2^*, we’ll visualize the resulting plane, not to worry.

(To be crystal clear, the \cdot’s in the equation above refer to “regular” scalar multiplication, not the dot product. There are no vectors in that equation.)

But, how do we find these three optimal model parameters? If we consult the modeling recipe, and choose squared loss, we need to find the values of w_0^*, w_1^*, and w_2^* that minimize

R_\text{sq}(w_0, w_1, w_2) = \frac{1}{n} \sum_{i = 1}^n \left( y_i - \left( w_0 + w_1 \cdot \text{departure hour}_i + w_2 \cdot \text{dom}_i \right) \right)^2

The solution is to use what we know about spans, projections, and the design matrix. The design matrix for this problem will have 3 columns: a column of 1’s for the intercept, a column of departure_hour values, and a column of dom values.

\begin{align*} {\color{#004d40} \vec p} &= \begin{bmatrix} w_0 + w_1 \cdot {\color{#3d81f6} \text{departure hour}_1} + w_2 \cdot {\color{#3d81f6} \text{dom}_1} \\ w_0 + w_1 \cdot {\color{#3d81f6} \text{departure hour}_2} + w_2 \cdot {\color{#3d81f6} \text{dom}_2} \\ \vdots \\ w_0 + w_1 \cdot {\color{#3d81f6} \text{departure hour}_n} + w_2 \cdot {\color{#3d81f6} \text{dom}_n} \end{bmatrix} \\ &= {\color{#3d81f6} \underbrace{\begin{bmatrix} 1 & \text{departure hour}_1 & \text{dom}_1 \\ 1 & \text{departure hour}_2 & \text{dom}_2 \\ \vdots & \vdots & \vdots \\ 1 & \text{departure hour}_n & \text{dom}_n \end{bmatrix}}_{X}} \underbrace{\begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}}_{\vec w} \end{align*}

To find \vec w^*, all we need to do is define the observation vector, {\color{orange} \vec y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, and find the \vec w^* that minimizes

R_\text{sq}(\vec w) = \frac{1}{n} \left\lVert {\color{orange} \vec y} - {\color{#3d81f6} X} \vec w \right\rVert^2

which is a solved problem at this point: it’s the \vec w^* that satisfies the normal equation, {\color{#3d81f6} X^TX} \vec w^* = {\color{#3d81f6} X^T} {\color{orange} \vec y}.
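To make this concrete, here’s a minimal sketch of solving the normal equation directly with numpy, assuming commutes_full is loaded as in the cells above (the variable names here are just for illustration; soon, we’ll let sklearn do this work for us).

import numpy as np

# Design matrix: a column of 1's, then the two feature columns.
X = np.column_stack([
    np.ones(commutes_full.shape[0]),
    commutes_full['departure_hour'],
    commutes_full['day_of_month'],
])
y = commutes_full['minutes'].to_numpy()

# Solve X^T X w* = X^T y for w*.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
w_star  # should match the intercept and coefficients that sklearn reports below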

Generalized Notation

Let me try and state this problem in slightly more general terms, where we have d features rather than just 1 or 2.

As before, subscripts distinguish between individuals (rows) in our dataset, and we will try and use superscripts (in parentheses) to distinguish between features (columns). Specifically, we’ll use the notation x_i^{(j)} to represent the j-th feature for the i-th individual. In the example above, x_5^{(1)} is the departure hour for the 5th row in the dataset. Think of x^{(1)}, x^{(2)}, ... as new variable names, like new letters.

Consider the following example dataset.


There are n = 3 rows, each of which has d = 2 features (minutes is not a feature, it’s what we’re trying to predict).

We can represent each row (day) with a feature vector, \vec x_i = \begin{bmatrix} x_i^{(1)} \\ x_i^{(2)} \end{bmatrix}. A feature vector contains all of the information we’re using to make predictions for the i-th individual.

\vec x_1 = \begin{bmatrix} 10.816667 \\ 15 \end{bmatrix}, \qquad \vec x_2 = \begin{bmatrix} 7.75 \\ 16 \end{bmatrix}, \qquad \vec x_3 = \begin{bmatrix} 8.45 \\ 22 \end{bmatrix}

In general,

\vec x_i = \begin{bmatrix} x_i^{(1)} \\ x_i^{(2)} \\ \vdots \\ x_i^{(d)} \end{bmatrix}

Note that if our model has d features, then there are d+1 parameters, w_0, w_1, w_2, ..., w_d. There is one parameter for each feature, plus one for the intercept. So, the generalized multiple linear regression model has a hypothesis function of the form

h(\vec x_i) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + \ldots + w_d x_i^{(d)}

It’d be great if we could express our hypothesis function h as a dot product between a parameter vector \vec w and a feature vector \vec x_i, but we can’t: \vec w has d+1 components while \vec x_i has d components.

To address this issue, we’ll define the augmented feature vector, \text{Aug}({\vec x_i}), which is the vector obtained by adding a 1 to the front of feature vector \vec x_i (much like the design matrix {\color{#3d81f6} X} has a column of 1’s).

\text{Aug}({\vec x_i}) = \begin{bmatrix} 1 \\ x_i^{(1)} \\ x_i^{(2)} \\ \vdots \\ x_i^{(d)} \end{bmatrix}

Then, the multiple linear regression model can be written as

\boxed{h(\vec x_i) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + \ldots + w_d x_i^{(d)} = \vec w \cdot \text{Aug}({\vec x_i})}
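For example, for the first row of the small dataset above (where d = 2),

\text{Aug}(\vec x_1) = \begin{bmatrix} 1 \\ 10.816667 \\ 15 \end{bmatrix}, \qquad h(\vec x_1) = \vec w \cdot \text{Aug}(\vec x_1) = w_0 + w_1 \cdot 10.816667 + w_2 \cdot 15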

The General Problem

Now we can finally state the problem of multiple linear regression in its most general form. Suppose we have n data points, \left({\vec x_1}, {y_1}\right), \left({\vec x_2}, {y_2}\right), \ldots, \left({\vec x_n}, {y_n}\right), where each \vec x_i is a feature vector of d features:

{\vec{x_i}} = \begin{bmatrix} {x^{(1)}_i} \\ {x^{(2)}_i} \\ \vdots \\ {x^{(d)}_i} \end{bmatrix}

We’d like to find a good linear model,

h({\vec x_i}) = w_0 + w_1 { x_i^{(1)}} + w_2 { x_i^{(2)}} + \ldots + w_d { x_i^{(d)}} = \vec w \cdot \text{Aug}({ \vec x_i})

By good, we mean that we’d like to find optimal parameters w_0^*, w_1^*, ..., w_d^* that minimize mean squared error:

\underbrace{\begin{align*} R_\text{sq}(\vec w) &= \frac{1}{n} \sum_{i = 1}^n (y_i - h(\vec x_i))^2 \\ &= \frac{1}{n} \sum_{i = 1}^n \left( y_i - (w_0 + w_1 { x_i^{(1)}} + w_2 { x_i^{(2)}} + \ldots + w_d { x_i^{(d)}})\right)^2 \\ &= \frac{1}{n} \sum_{i = 1}^n \left(y_i - \vec w \cdot \text{Aug}(\vec x_i) \right)^2 \\ &= \frac{1}{n} \lVert \vec y - X \vec w \rVert^2 \end{align*}}_\text{all equivalent ways of writing mean squared error!}

The solution is to define the n \times (d + 1) design matrix {\color{#3d81f6} X} and n-dimensional observation vector {\color{orange} \vec y}:

{\color{#3d81f6} X= \begin{bmatrix} {1} & { x^{(1)}_1} & { x^{(2)}_1} & \dots & { x^{(d)}_1} \\\\ { 1} & { x^{(1)}_2} & { x^{(2)}_2} & \dots & { x^{(d)}_2} \\\\ \vdots & \vdots & \vdots & & \vdots \\\\ { 1} & { x^{(1)}_n} & { x^{(2)}_n} & \dots & { x^{(d)}_n} \end{bmatrix} = \begin{bmatrix} \text{Aug}({\vec{x_1}})^T \\\\ \text{Aug}({\vec{x_2}})^T \\\\ \vdots \\\\ \text{Aug}({\vec{x_n}})^T \end{bmatrix}}, \qquad {\color{orange} \vec y = \begin{bmatrix} { y_1} \\ { y_2} \\ \vdots \\ { y_n} \end{bmatrix}}

and to solve the normal equation to find the optimal parameter vector, w\vec{w}^*:

{\color{#3d81f6} X^TX} \vec w^* = {\color{#3d81f6} X^T} {\color{orange} \vec y}

The \vec w^* that satisfies the equation above minimizes mean squared error, R_\text{sq}(\vec w), so use this \vec w^* to make predictions in h(\vec x_i) = \vec w^* \cdot \text{Aug}(\vec x_i). Once you find that \vec w^*,

  • {\color{#004d40} \vec p} = {\color{#3d81f6} X} \vec w^* is a prediction vector of predicted y-values for the entire dataset. It is also the projection of {\color{orange} \vec y} onto the span of the columns of {\color{#3d81f6} X}.

  • h(\vec x_i) = \vec w^* \cdot \text{Aug}(\vec x_i) is the predicted y-value for just the input \vec x_i.

Using sklearn

In Chapter 3.1, we saw how to use sklearn to fit a simple linear regression model. Performing multiple linear regression is just as easy.

from sklearn.linear_model import LinearRegression

model_multiple = LinearRegression()
model_multiple.fit(X=commutes_full[['departure_hour', 'day_of_month']], y=commutes_full['minutes'])

After calling fit, we can access w\vec w^*.

model_multiple.intercept_, model_multiple.coef_
(141.86402699471932, array([-8.2233821 , 0.05615985]))

The outputs above tell us that the “best way” (according to squared loss) to make commute time predictions using a linear model is using:

\text{pred. commute}_i = 141.86 - 8.22 \cdot \text{departure hour}_i + 0.06 \cdot \text{day of month}_i

This is the plane of best fit given historical data; it is not a causal statement.


Fit LinearRegression objects have a predict method, which can be used to predict commute times given a departure_hour and day_of_month.

# What if I leave at 9:15AM on the 26th of the month?
# To suppress the warning below, we should convert X and y to numpy arrays before calling .fit.
model_multiple.predict([[9.25, 26]])
/Users/surajrampure/miniforge3/envs/pds/lib/python3.10/site-packages/sklearn/base.py:493: UserWarning:

X does not have valid feature names, but LinearRegression was fitted with feature names

array([67.25789852])
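# Alternatively, passing a DataFrame with the same column names used in .fit avoids the warning.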
model_multiple.predict(pd.DataFrame({'departure_hour': [9.25], 'day_of_month': [26]}))
array([67.25789852])

Comparing Models

While it’s not difficult to implement ourselves, sklearn has a built-in mean_squared_error function.

from sklearn.metrics import mean_squared_error
mean_squared_error(commutes_full['minutes'], model_multiple.predict(commutes_full[['departure_hour', 'day_of_month']]))
96.78730488437492

So, the mean squared error of our plane of best fit is 96.78 minutes².

Throughout this section, we’ll fit several different models to the commute times dataset, and it’ll be useful to keep track of their mean squared errors in one place. I’ll do so in a Python dictionary. (I’m intentionally showing you more code than I typically would, so you have some sense of how to do this all yourself – I hope you don’t mind!)

mse_dict = {}

# Multiple linear regression model.
mse_dict['departure_hour + day_of_month'] = mean_squared_error(commutes_full['minutes'], model_multiple.predict(commutes_full[['departure_hour', 'day_of_month']]))

# Simple linear regression model.
model_simple = LinearRegression()
model_simple.fit(X=commutes_full[['departure_hour']], y=commutes_full['minutes'])
mse_dict['departure_hour'] = mean_squared_error(commutes_full['minutes'], model_simple.predict(commutes_full[['departure_hour']]))

# Constant model.
model_constant = commutes_full['minutes'].mean()
mse_dict['constant'] = mean_squared_error(commutes_full['minutes'], np.ones(commutes_full.shape[0]) * model_constant)

mse_dict
{'departure_hour + day_of_month': 96.78730488437492, 'departure_hour': 97.04687150819183, 'constant': 167.535147928994}

As you can see, adding day_of_month barely reduced our model’s mean squared error, which matches our intuition: knowing whether it’s the 15th or 19th or 3rd of the month doesn’t seem like it would help much in predicting commute time. The activity below has you think about why the mean squared error still did decrease, and whether this is always the case.


Feature Engineering

So far, we’ve limited ourselves to using departure_hour and day_of_month, two features that already came to us in the dataset. In practice, we’ll often need to engineer features, which means creating new features from existing ones, or “transforming” raw data into good features. We might do this to improve model performance, for example, in case the relationship between the features and output variable isn’t truly linear, or maybe even to enable the use of categorical features.

Example: One Hot Encoding

Suppose we’d like to use the day column from commutes_full as a feature.

commutes_full[['departure_hour', 'day', 'day_of_month']].head()

One naive approach would be to convert each day value to a number, e.g. "Mon" is 1, "Tue" is 2, "Wed" is 3, etc. This is a bad idea because it implies that Monday is somehow “less than” Tuesday and should impact our model’s predictions less than Tuesday does (remember that this column will be multiplied by a parameter in order to make predictions). This is what’s called an ordinal encoding, and ordinal encodings only make sense when the feature has some inherent order that is meaningful to the prediction problem.

Instead, we’ll perform one hot encoding, which turns a categorical feature into several binary features: one per unique value of the categorical feature.

Since day has 5 unique values, we’ll need to create 5 new binary features.

# pandas Series have a value_counts method, which describes the frequency of each unique value in the Series.
commutes_full['day'].value_counts()
day
Tue    25
Mon    20
Thu    15
Wed     3
Fri     2
Name: count, dtype: int64

Suppose we want to build a model that predicts minutes using departure_hour, day_of_month, and one hot encoded day. That model would have 1 + 1 + 5 = 7 features, so its design matrix would have 7 + 1 = 8 columns!


The actual process of one hot encoding can be done efficiently using sklearn.preprocessing.OneHotEncoder. You’ll see how to use this in Homework 8.
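In the meantime, here’s a hedged sketch of how a DataFrame like for_ohe (used in the cells below) might be built manually with pandas; the column names are chosen to match the ones referenced later.

# One binary column per unique value of day, plus a column of 1's for the intercept.
for_ohe = commutes_full[['departure_hour', 'day_of_month']].copy()
for_ohe.insert(0, '1', 1)
for day_name in ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']:
    for_ohe[f'day == {day_name}'] = (commutes_full['day'] == day_name).astype(int)

# All 8 columns, before dropping any of them.
X_for_ohe = for_ohe
X_for_ohe.head()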

Question: What is the rank of the design matrix above?

You might notice that among the 5 binary columns, in a particular row, exactly one of them is 1, and the other 4 are 0. This means that if we sum together the 5 binary columns, we’ll get a column of 1’s – which is already the first column in our design matrix!

This means that, by default, when we perform one hot encoding, the design matrix doesn’t have full rank. This doesn’t stop us from projecting {\color{orange} \vec y} onto \text{colsp}({\color{#3d81f6} X}), but it does mean that there are infinitely many possible optimal parameters \vec w^*. Again, all of these would give us the same predictions and same mean squared error, but it’s annoying to have to deal with this setting.

np.linalg.matrix_rank(X_for_ohe) # Not 8!
7
X_for_ohe['day == Mon'] + X_for_ohe['day == Tue'] + X_for_ohe['day == Wed'] + X_for_ohe['day == Thu'] + X_for_ohe['day == Fri']
0     1
1     1
2     1
3     1
4     1
     ..
60    1
61    1
62    1
63    1
64    1
Length: 65, dtype: int64

So, a common solution when performing one hot encoding is to drop one of the binary columns. This doesn’t change the information conveyed by the other features, and doesn’t change \text{colsp}({\color{#3d81f6} X}), which is what matters for the quality of the model’s predictions. And intuitively, even if we don’t have the value of day == Fri, we know it’s Friday if day == Mon, day == Tue, day == Wed, and day == Thu are all 0.

Now that we’ve converted day to a numerical variable, we can use it as input in a regression model. Here’s the model we’ll try to fit:

\begin{align*}\text{pred. commute time}_i = w_0 &+ w_1 \cdot \text{departure hour}_i \\ &+ w_2 \cdot \text{day of month}_i \\ &+ w_3 \cdot \text{day$_i$ == Mon} \\ &+ w_4 \cdot \text{day$_i$ == Tue} \\ &+ w_5 \cdot \text{day$_i$ == Wed} \\ &+ w_6 \cdot \text{day$_i$ == Thu} \end{align*}

This model has 6 features (1 + 1 + 4) and hence 7 parameters. Its design matrix is n \times 7.

X_for_ohe = for_ohe[['1', 'departure_hour', 'day_of_month', 'day == Mon', 'day == Tue', 'day == Wed', 'day == Thu']]
X_for_ohe.head()

Notice that below, I’ve set fit_intercept=False, because I manually added a column of 1’s to the design matrix, so I don’t need sklearn to add another one.

model_with_ohe = LinearRegression(fit_intercept=False)
model_with_ohe.fit(X=X_for_ohe, y=commutes_full['minutes'])
model_with_ohe.coef_
array([ 1.34043066e+02, -8.41714813e+00, -2.52657944e-02, 5.09024617e+00, 1.63763100e+01, 5.11983892e+00, 1.14972088e+01])

So, our linear model to predict commute time given departure_hour, day_of_month, and day (Mon, Tue, Wed, or Thu) is:

\begin{align*}\text{pred. commute time}_i = 134 &- 8.42 \cdot \text{departure hour}_i \\ &- 0.03 \cdot \text{day of month}_i \\ &+ 5.09 \cdot \text{day$_i$ == Mon} \\ &+ 16.38 \cdot \text{day$_i$ == Tue} \\ &+ 5.12 \cdot \text{day$_i$ == Wed} \\ &+ 11.5 \cdot \text{day$_i$ == Thu} \end{align*}

While this model has 6 features, and thus requires 7 dimensions to graph, it’s really a collection of five parallel planes in 3D, all with slightly different z-intercepts! Remember that day_i == Mon, day_i == Tue, etc. aren’t all free variables: if one of them is 1, the others must be 0. There are 5 cases to consider, and each one corresponds to one of these 5 parallel planes. You’re exploring this idea in Homework 7, Question 3.
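For example, plugging in the two cases day = Tue (only the Tue indicator is 1) and day = Fri (all four indicators are 0) gives two planes with identical slopes but different intercepts:

\begin{align*} \text{Tue:} \quad \text{pred. commute time}_i &= (134 + 16.38) - 8.42 \cdot \text{departure hour}_i - 0.03 \cdot \text{day of month}_i \\ \text{Fri:} \quad \text{pred. commute time}_i &= 134 - 8.42 \cdot \text{departure hour}_i - 0.03 \cdot \text{day of month}_i \end{align*}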

If we do want to visualize the model in \mathbb{R}^2, we need to pick a single feature to place on the x-axis.


Despite being a linear model, this model doesn’t look like a straight line: it’s linear in terms of all 6 features, but when we project it into 2D, it no longer appears linear.

How does this new model compare to our previous models?

mse_dict['departure_hour + day_of_month + day'] = mean_squared_error(
    commutes_full['minutes'],
    model_with_ohe.predict(X_for_ohe)
)

mse_dict
{'departure_hour + day_of_month': 96.78730488437492, 'departure_hour': 97.04687150819183, 'constant': 167.535147928994, 'departure_hour + day_of_month + day': 70.21791287461917}

Note that adding the day of the week decreased our model’s mean squared error significantly: this one hot encoded model is our best yet.

Example: Numerical Transformations

Another common step in feature engineering is to produce new numerical features out of existing ones.

commutes_full[['departure_hour', 'day', 'day_of_month']].head()

As an example, I could fit a degree 3 polynomial model

h(x_i) = w_0 + w_1 x_i + w_2 x_i^2 + w_3 x_i^3

by producing the design matrix

X = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 \\ 1 & x_2 & x_2^2 & x_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 & x_n^3 \\ \end{bmatrix}

Below, we’ll apply this thinking to the departure_hour feature. We’ll create these features manually, though in Homework 8 you’ll see how to do this easily and more efficiently using sklearn.preprocessing.PolynomialFeatures.

X_for_polynomial = commutes_full[['departure_hour']].copy()
X_for_polynomial['departure_hour^2'] = X_for_polynomial['departure_hour'] ** 2
X_for_polynomial['departure_hour^3'] = X_for_polynomial['departure_hour'] ** 3

model_polynomial = LinearRegression()
model_polynomial.fit(X=X_for_polynomial, y=commutes_full['minutes'])
model_polynomial.intercept_, model_polynomial.coef_
(816.0936290109973, array([-227.63718864, 23.33213694, -0.80864398]))

The above tells me that my best-fitting degree 3 polynomial model is

h(x_i) = 816.09 - 227.63 x_i + 23.33 x_i^2 - 0.809 x_i^3

This is a linear model, even though the features themselves are made up of non-linear functions. If we could visualize in \mathbb{R}^4, with one axis for departure hour, one for departure hour^2, one for departure hour^3, and one for commute time, we’d see that the model’s predictions lie on a higher-dimensional plane.

But, when we visualize in \mathbb{R}^2, the model doesn’t look linear; it looks cubic!


Is this model better than the simple linear regression model we fit earlier? Well, it certainly has a lower mean squared error than the simple linear regression model:

mse_dict['departure_hour with cubic features'] = mean_squared_error(
    commutes_full['minutes'],
    model_polynomial.predict(X_for_polynomial)
)

mse_dict
{'departure_hour + day_of_month': 96.78730488437492, 'departure_hour': 97.04687150819183, 'constant': 167.535147928994, 'departure_hour + day_of_month + day': 70.21791287461917, 'departure_hour with cubic features': 75.27204428216439}

75.27, the mean squared error of this cubic model, is lower than the simple linear regression model’s mean squared error of 97.04 (though still higher than what we got when one hot encoding day).

Keep in mind that adding complex features doesn’t always equate to a “better model”, even if it lowers the mean squared error on the training data. To illustrate what I mean, what if we fit a degree 10 polynomial to the data?


Without computing it, we can tell this model’s mean squared error is lower than that of the cubic or simple linear regression models, since it passes closer to the training data. But, this model is overfit to the training data. Remember, our eventual goal is to build models that generalize well to new data, and it seems unlikely that this degree 10 polynomial reflects the true relationship between departure hour and commute time in nature.

The question, then, is how to choose the right model, e.g. how to choose the right polynomial degree, if we’re committed to making polynomial features. The solution is to intentionally split our data into training data and test data, fit each candidate combination of features (e.g. each polynomial degree) using only the training data, and then pick the combination that performs best on the test data, since that combination seems most likely to generalize well to unseen data. You’ll explore this paradigm in Homework 8.
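As a preview, here’s a hedged sketch of that workflow using sklearn’s train_test_split and PolynomialFeatures (both covered properly in Homework 8); the candidate degrees, test_size, and random_state below are arbitrary choices for illustration.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

X_train, X_test, y_train, y_test = train_test_split(
    commutes_full[['departure_hour']], commutes_full['minutes'],
    test_size=0.25, random_state=23
)

for degree in [1, 2, 3, 5, 10]:
    poly = PolynomialFeatures(degree=degree)  # includes a bias column by default
    model = LinearRegression(fit_intercept=False)
    model.fit(poly.fit_transform(X_train), y_train)
    test_mse = mean_squared_error(y_test, model.predict(poly.transform(X_test)))
    print(degree, test_mse)

We’d then keep the degree with the lowest mean squared error on the test set.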

In the above examples, I created polynomial features out of departure_hour only, but there’s nothing stopping me from creating features out of day_of_month too, or out of both of them simultaneously. A perfectly valid model is

\text{pred. commute time}_i = w_0 + w_1 \cdot \text{departure hour}_i + w_2 \cdot (\text{departure hour}_i \cdot \text{dom}_i) + w_3 \cdot \log(\text{dom}_i)

which, using our more generic syntax, would look like

h(\vec x_i) = w_0 + w_1 x_i^{(1)} + w_2 \left( x_i^{(1)} x_i^{(2)} \right) + w_3 \log(x_i^{(2)})
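If we wanted to fit this particular model, a hedged sketch (again building the features by hand; the variable names are just for illustration) might look like:

X_with_interaction = commutes_full[['departure_hour']].copy()
X_with_interaction['departure_hour * day_of_month'] = (
    commutes_full['departure_hour'] * commutes_full['day_of_month']
)
X_with_interaction['log(day_of_month)'] = np.log(commutes_full['day_of_month'])

model_interaction = LinearRegression()
model_interaction.fit(X=X_with_interaction, y=commutes_full['minutes'])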

You may wonder, how do I know which features to create? Through a variety of techniques:

  • Domain knowledge

  • Trial and error

  • By visualizing the data

In the polynomial examples above, the data didn’t particularly look like it was quadratic or cubic, so a linear model sufficed. But if we’re given a dataset that clearly has some non-linear relationship, visualization can help us identify what features might improve model performance.

As the dimensionality of our data increases, though, visualization becomes less possible. (How do we visualize a dataset with 100 features?) Strategies for reducing the dimensionality of our data will be explored in Chapter 5.

Linear in the Parameters

To what extent can I use linear regression to fit a model? As long as a model can be expressed in the form

h(\vec x_i) = \vec w \cdot \text{Aug}(\vec x_i)

for some parameter vector \vec w and some feature vector \vec x_i, then it can be fit using linear regression (i.e. by creating a design matrix X and finding \vec w^* by solving the normal equations). We say such a model is linear in the parameters. The choice of features to include in \vec x_i is up to us.

For example,

  • If x_i is a scalar, then

    h(x_i) = w_0 + w_1 x_i + w_2 \cos(x_i) + w_3 e^{x_i} = \vec w \cdot \begin{bmatrix} 1 \\ x_i \\ \cos(x_i) \\ e^{x_i} \end{bmatrix}

    is linear in the parameters. It would be linear in the parameters even if the w_0 term wasn’t there; then, we wouldn’t need to augment \vec x_i with a 1. (A code sketch of fitting this particular model appears after this list.)

  • If x_i^{(1)} and x_i^{(2)} are scalars (e.g. height and weight), then

    h(\vec x_i) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + w_3 x_i^{(1)} x_i^{(2)} + w_4 \frac{e^{x_i^{(1)}}}{x_i^{(2)}} = \vec w \cdot \begin{bmatrix} 1 \\ x_i^{(1)} \\ x_i^{(2)} \\ x_i^{(1)} x_i^{(2)} \\ \frac{e^{x_i^{(1)}}}{x_i^{(2)}} \end{bmatrix}

    is linear in the parameters.

  • If x_i is a scalar, then

    h(x_i) = w_0 + w_1 \cos(x_i) + \sin(w_2 x_i)

    is not linear in the parameters, because w_2 is within the \sin function. This model can’t be written as \vec w \cdot \text{Aug}(\vec x_i) for any choice of \vec x_i.
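To tie this back to code, here’s a hedged sketch that fits the first model above on made-up data, just by building the corresponding design matrix and solving the normal equation; the synthetic data and “true” coefficients are purely for illustration.

import numpy as np

rng = np.random.default_rng(23)
x = rng.uniform(0, 5, size=100)  # a made-up scalar feature
y = 2 + 3 * x - 4 * np.cos(x) + 0.5 * np.exp(x) + rng.normal(0, 1, size=100)

# Design matrix for h(x_i) = w0 + w1 x_i + w2 cos(x_i) + w3 e^{x_i}.
X = np.column_stack([np.ones_like(x), x, np.cos(x), np.exp(x)])

# Because the model is linear in the parameters, the normal equation applies.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
w_star  # should be close to [2, 3, -4, 0.5]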