In Chapter 2.7, I encouraged you to think of a matrix as a collection of vectors stacked together.
A = \begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix}
Soon, these matrices will be made up of observations and features from some dataset that we’d like to build a regression model with. The content of this section will be a crucial building block for helping us get there.
A is a 5×3 matrix. It can be thought of as either:
3 vectors in R^5 (the columns of A)
5 vectors in R^3 (the rows of A)
Let’s start with the column perspective. We saw in Chapter 2.7 that if x ∈ R^3, then Ax is a new vector in R^5 that is a linear combination of the columns of A. For instance, if we take x = \begin{bmatrix} 2 \\ 0 \\ -1 \end{bmatrix}, then

Ax = 2\begin{bmatrix} 5 \\ 0 \\ 3 \\ 6 \\ 1 \end{bmatrix} + 0\begin{bmatrix} 3 \\ -1 \\ 4 \\ 2 \\ 0 \end{bmatrix} - 1\begin{bmatrix} 2 \\ 1 \\ -1 \\ 4 \\ 1 \end{bmatrix} = \begin{bmatrix} 8 \\ -1 \\ 7 \\ 8 \\ 1 \end{bmatrix}
By definition, the vector Ax is in the span of the columns of A, since it’s just a linear combination of A’s columns.
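To make this concrete, here’s a quick numerical check, a minimal sketch using numpy (the array values are just the running example A and x from above):

import numpy as np

# The running example: A's rows stacked as a 5x3 array
A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

x = np.array([2, 0, -1])

# A @ x computes the matrix-vector product...
print(A @ x)                                    # [ 8 -1  7  8  1]
# ...which matches the linear combination of A's columns done by hand
print(2 * A[:, 0] + 0 * A[:, 1] - 1 * A[:, 2])  # [ 8 -1  7  8  1]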
Given what we’ve learned in Chapter 2.4 and Chapter 2.6, it’s natural to try to describe the span of A’s columns.
Notice that I’ve intentionally chosen not to use subscripts to refer to columns; this is so that when we switch back to focusing on datasets and machine learning, we keep consistent the fact that subscripts refer to different rows/data points, not columns/features.
“Column space” is just a new term for a concept we’re already familiar with: the span of a set of vectors. In the example A we’ve been working with, the column space is

colsp(A) = span\left( \left\{ \begin{bmatrix} 5 \\ 0 \\ 3 \\ 6 \\ 1 \end{bmatrix}, \begin{bmatrix} 3 \\ -1 \\ 4 \\ 2 \\ 0 \end{bmatrix}, \begin{bmatrix} 2 \\ 1 \\ -1 \\ 4 \\ 1 \end{bmatrix} \right\} \right)
This column space is a 2-dimensional subspace of R5. Why is it 2-dimensional? The last column is a linear combination of the first two columns. Specifically,
column 3 = column 1 − column 2
Remember that:
the dimension of a subspace is the number of vectors in a basis for the subspace, and
a basis for a subspace is a linearly independent set of vectors that spans the entire subspace.
The first two columns of A alone span the column space, and are linearly independent, and so dim(colsp(A))=2. This number, 2, is the most important number associated with the matrix A, so much so that we give it a special name.
The rank of a matrix tells us how “large” the space of possible linear combinations of the columns of A is. We care about this because ultimately, our predictions in regression are just linear combinations of the columns of some data matrix.
To get a feel for the idea of rank, let’s work through some examples.
Find three 3×4 matrices: one with rank 1, one with rank 2, and one with rank 3. Is it possible to have a 3×4 matrix with rank 4?
Solution
To create a 3×4 matrix with rank 1, we need there to only be one linearly independent column. For instance,
A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 1 & 2 & 3 & 4 \\ 1 & 2 & 3 & 4 \end{bmatrix}
has rank 1 because all columns are multiples of the first column.
To create a 3×4 matrix with rank 2, we need there to be two linearly independent columns. One way to construct such a matrix is to make the first two columns linearly independent, and make the last two columns linear combinations of the first two. For instance, in
A = \begin{bmatrix} 1 & 1 & 2 & 2 \\ 1 & 2 & 2 & 3 \\ 1 & 3 & 2 & 4 \end{bmatrix}
column 3 is a scalar multiple of column 1, and column 4 is column 1 + column 2.
To create a 3×4 matrix with rank 3, we need there to be three linearly independent columns, and the fourth column can be anything. One solution is
A = \begin{bmatrix} 1 & 0 & 0 & 9 \\ 0 & 1 & 0 & 8 \\ 0 & 0 & 1 & 7 \end{bmatrix}
A 3×4 matrix cannot have rank 4, because it’s impossible to have four linearly independent vectors in R3. Any three linearly independent vectors in R3 span all of R3, so a fourth vector in R3 would have to be a linear combination of the first three vectors.
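If you want to sanity-check constructions like these, numpy’s np.linalg.matrix_rank computes the rank numerically. A quick sketch using the three matrices above:

import numpy as np

A1 = np.array([[1, 2, 3, 4],
               [1, 2, 3, 4],
               [1, 2, 3, 4]])
A2 = np.array([[1, 1, 2, 2],
               [1, 2, 2, 3],
               [1, 3, 2, 4]])
A3 = np.array([[1, 0, 0, 9],
               [0, 1, 0, 8],
               [0, 0, 1, 7]])

for M in (A1, A2, A3):
    print(np.linalg.matrix_rank(M))  # 1, then 2, then 3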
Consider the 2×2 matrix

A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}

Find a condition on a, b, c, d that ensures rank(A) = 2.
Solution
In order for rank(A)=2, the columns of A must be linearly independent. This means that the second column cannot be a scalar multiple of the first column.
Assume for the moment that a ≠ 0. Then b/a is the number we multiply a by to get b, so the second column is a scalar multiple of the first exactly when (b/a) · c equals d; we just need (b/a) · c to be different from d.

\frac{b}{a} \cdot c = d \implies ad - bc = 0

So, if ad − bc = 0, then rank(A) ≤ 1, and otherwise, rank(A) = 2. (The a = 0 case can be checked separately, and the same condition, ad − bc ≠ 0, still characterizes rank 2.)
This expression, ad−bc, is called the determinant of A. We’ll learn more about determinants in Chapter 2.9.
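Here’s a small numerical illustration, a sketch with made-up 2×2 matrices, of how ad − bc tracks the rank:

import numpy as np

# ad - bc = 1*8 - 2*4 = 0, so column 2 is a multiple of column 1
A = np.array([[1, 2],
              [4, 8]])
print(np.linalg.det(A), np.linalg.matrix_rank(A))  # ~0.0, 1

# ad - bc = 1*9 - 2*4 = 1, so the columns are independent
B = np.array([[1, 2],
              [4, 9]])
print(np.linalg.det(B), np.linalg.matrix_rank(B))  # ~1.0, 2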
Suppose d_1, d_2, …, d_n are real numbers. What is the rank of the n×n diagonal matrix

D = \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n \end{bmatrix}
if
d_i = i, for i = 1, 2, …, n?
d_1 = d_2 = ⋯ = d_n = 0?
k of the d_i are equal to 0, and the rest are positive?
Solution
rank(D) = n: If d_i = i for i = 1, 2, …, n, meaning d_1 = 1, d_2 = 2, …, d_n = n, then the rank is n, because all columns are linearly independent. None can be written as a linear combination of any others, since they all have their sole non-zero entry in a different row.
rank(D) = 0: If d_1 = d_2 = ⋯ = d_n = 0, then the rank is 0, because all columns are the zero vector.
rank(D) = n − k: If k of the d_i are equal to 0, and the rest are positive, then the rank is n − k, because we can throw out the zero columns and the remaining columns are linearly independent.
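A quick sketch of the last case, using a hypothetical set of diagonal entries:

import numpy as np

# n = 5 diagonal entries, k = 2 of them zero
d = np.array([3, 0, 7, 0, 1])
D = np.diag(d)
print(np.linalg.matrix_rank(D))  # 3, i.e. n - k = 5 - 2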
For example, let u = \begin{bmatrix} 1 \\ -3 \\ 4 \end{bmatrix} and v = \begin{bmatrix} 2 \\ 5 \\ -1 \end{bmatrix}. As we’ve seen before, the dot product u · v = u^Tv is a scalar, equal to −17 here.
The outer product of u and v is the matrix
uv^T = \begin{bmatrix} 1 \\ -3 \\ 4 \end{bmatrix} \begin{bmatrix} 2 & 5 & -1 \end{bmatrix} = \begin{bmatrix} 2 & 5 & -1 \\ -6 & -15 & 3 \\ 8 & 20 & -4 \end{bmatrix}
In general, for any two non-zero vectors u, v ∈ R^n, what is the rank of uv^T?
Solution
The rank of uv^T is 1, since each column of uv^T is a scalar multiple of u. In the example above, column 2 is 5/2 times column 1, and column 3 is −1/2 times column 1.
In Chapter 5, we’ll see that any rank r matrix can be written as a sum of r rank 1 matrices, each of which is of the form uv^T. So, rank 1 matrices can be thought of as the building blocks of all matrices!
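Here’s the outer product example above, checked numerically (a sketch; np.outer computes uv^T):

import numpy as np

u = np.array([1, -3, 4])
v = np.array([2, 5, -1])

P = np.outer(u, v)               # the 3x3 outer product u v^T
print(P)
print(np.linalg.matrix_rank(P))  # 1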
Note that we’ve already seen plenty of problems of this form in earlier homeworks. It’s just that there, the spanning set of vectors was given to you directly, and here, they’re stored as columns in a matrix. The idea is the same.
Solution
Column 2 is a multiple of column 1.
Column 3 is linearly independent of column 1.
Column 4 is 3 times column 1 minus 5 times column 3.
Column 5 is just column 1.
A only has two linearly independent columns (columns 1 and 3), and so rank(A) = 2, and a basis for the column space is the set containing columns 1 and 3.
Shortly, we’ll learn a new technique for finding the rank of a matrix by hand. But for the most part, we won’t need to do this, and can instead use the power of Python to help us.
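For instance, here’s a minimal sketch that checks the rank of our running example A with numpy:

import numpy as np

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

# matrix_rank counts the numerically non-zero singular values of A,
# which equals the number of linearly independent columns
print(np.linalg.matrix_rank(A))  # 2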
So far, we’ve focused on thinking of a matrix as a collection of “column” vectors written next to each other. This is the more common perspective, since – as I’ve harped on – Ax is a linear combination of the columns of A.
But we can also think of a matrix as a collection of “row” vectors written on top of each other.
A = \begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix}
A contains 5 vectors in its rows, each in R^3. These vectors also have a span, which in this case is a subspace of R^3.
Where did A^Ty come from? Remember, Ax is a linear combination of the columns of A. If we transpose A, then A^Ty is a linear combination of the columns of A^T, which are the rows of A.
A^Ty = \begin{bmatrix} 5 & 0 & 3 & 6 & 1 \\ 3 & -1 & 4 & 2 & 0 \\ 2 & 1 & -1 & 4 & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix} = \underbrace{y_1 \begin{bmatrix} 5 \\ 3 \\ 2 \end{bmatrix} + y_2 \begin{bmatrix} 0 \\ -1 \\ 1 \end{bmatrix} + y_3 \begin{bmatrix} 3 \\ 4 \\ -1 \end{bmatrix} + y_4 \begin{bmatrix} 6 \\ 2 \\ 4 \end{bmatrix} + y_5 \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}}_{\text{linear combination of rows of } A}
Remember from Chapter 2.7 that (A^Ty)^T = y^TA. The product y^TA is also a linear combination of the rows of A; it just returns a row vector with shape 1×d rather than a column vector with shape d×1.
y^TA = \begin{bmatrix} y_1 & y_2 & y_3 & y_4 & y_5 \end{bmatrix} \begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix} = \underbrace{y_1 \begin{bmatrix} 5 & 3 & 2 \end{bmatrix} + y_2 \begin{bmatrix} 0 & -1 & 1 \end{bmatrix} + \cdots + y_5 \begin{bmatrix} 1 & 0 & 1 \end{bmatrix}}_{\text{linear combination of rows of } A}
In y^TA, we left-multiplied A by a vector; in Ax, we right-multiplied A by a vector. These are distinct types of multiplication, as they involve vectors of different shapes.
Since the columns of A^T are the rows of A, the row space of A is the column space of A^T, meaning

rowsp(A) = colsp(A^T)

To avoid carrying around lots of notation, I’ll often just use colsp(A^T) to refer to the row space of A.
What is the dimension of the row space of A? Other ways of phrasing this question are:
How many linearly independent rows does A have?
What is the rank of A^T?
We know the answer can’t be more than 3, since the rows of A are vectors in R3. We could use the algorithm first presented in Chapter 2.4 to find a linearly independent set of rows with the same span as all 5 rows.
An easy way to see that dim(colsp(A^T)) = 2 is to pick out two of the five vectors, row 2 = a_2 = \begin{bmatrix} 0 \\ -1 \\ 1 \end{bmatrix} and row 5 = a_5 = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}, and show that the other three vectors can be written as linear combinations of them. I chose a_2 and a_5 just because they have the simplest numbers, and because they’re linearly independent from one another.
a_1 = \begin{bmatrix} 5 \\ 3 \\ 2 \end{bmatrix} = -3a_2 + 5a_5
a_3 = \begin{bmatrix} 3 \\ 4 \\ -1 \end{bmatrix} = -4a_2 + 3a_5
a_4 = \begin{bmatrix} 6 \\ 2 \\ 4 \end{bmatrix} = -2a_2 + 6a_5
You might notice that the number of linearly independent rows of A and the number of linearly independent columns of A were both 2.
Sometimes, we say the dimension of colsp(A) is the column rank of A, and the dimension of colsp(A^T) is the row rank of A. For the example A we’ve been working with, both the column rank and row rank are 2.
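We can verify both of these claims numerically; a quick sketch on the running example:

import numpy as np

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

# a1 = -3*a2 + 5*a5, as claimed above
print(np.array_equal(A[0], -3 * A[1] + 5 * A[4]))  # True

# The column rank and the row rank agree
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A.T))  # 2 2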
But, there’s no need for separate names.
It’s not immediately obvious why this is true, and honestly, most of the proofs of it I’ve found aren’t all that convincing for a first-time linear algebra student. Still, I’ll try to prove this fact for you in just a little bit.
So, in general, what is rank(A)? Since the rank is equal to both the number of linearly independent columns and the number of linearly independent rows, the largest the rank can be is the smaller of the number of rows and the number of columns.
For example, if A is a 7×9 matrix, then its rank is at most 7, since it cannot have more than 7 linearly independent rows. So in general, if A is an n×d matrix, then
0 ≤ rank(A) ≤ min(n, d)
\underbrace{\begin{bmatrix} \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \end{bmatrix}}_{\text{if } n > d \text{, rows can't be independent}} \qquad \underbrace{\begin{bmatrix} \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \end{bmatrix}}_{\text{if } n < d \text{, columns can't be independent}}
We say a matrix is full rank if it has the largest possible rank for a matrix of its shape, i.e. if rank(A) = min(n, d). We’ll mostly use this term when referring to square matrices, which we’ll focus more on in Chapter 2.9.
As we’ve seen, colsp(A) is a subspace of R^n that contains all possible results of Ax for x ∈ R^d. Think of colsp(A) as the “reach” of the columns of A.
If the columns of A are linearly independent, then the only way to create 0 through a linear combination of the columns of A is if all the coefficients in the linear combination are 0, i.e. if x = 0 in Ax.
But, if A’s columns aren’t all linearly independent, then there will be some non-zero vectors x where Ax = 0. This follows from the definition of linear independence, which says that if there’s a non-zero linear combination of a collection of vectors that produces 0, the vectors aren’t linearly independent.
It turns out that it’s worthwhile to study the set of vectors x that get sent to 0 when multiplied by A.
Sometimes, the null space is also called the kernel of A, though we will mostly avoid that term in our class, since kernel often means something different in the context of machine learning.
Important: nullsp(A) is a subspace of R^d, since it is made up of vectors that get multiplied by A, an n×d matrix.
Let’s return to the example A from earlier.
A = \begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix}
nullsp(A) is the set of all vectors x ∈ R^3 where Ax = 0. Since rank(A) = 2 is less than the number of columns, 3, there will be some non-zero vectors x in the null space. But what are they?
To find them, we need to find the general solution to

Ax = 0

Since column 3 = column 1 − column 2, one solution is x = \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix} (take 1 of column 1, −1 of column 2, and −1 of column 3), and so is any scalar multiple of it.
For example, \begin{bmatrix} -5 \\ 5 \\ 5 \end{bmatrix} ∈ nullsp(A). Remember that vectors in nullsp(A) are in R^3, since A is a 5×3 matrix, while vectors in colsp(A) are in R^5.
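Numerically, we can verify this and find the null space directly. A sketch, assuming SciPy is available (scipy.linalg.null_space returns an orthonormal basis for the null space):

import numpy as np
from scipy.linalg import null_space

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

# Verify that [-5, 5, 5] is sent to the zero vector
print(A @ np.array([-5, 5, 5]))  # [0 0 0 0 0]

# null_space returns an orthonormal basis for nullsp(A) as columns
N = null_space(A)
print(N.shape)  # (3, 1): the null space is 1-dimensional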
In this example, nullsp(A) is a 1-dimensional subspace of R3. We also know from earlier that rank(A)=2. And curiously, 1+2=3, the number of columns in A. This is not a coincidence, and sheds light on an important theorem.
The proof of this theorem is beyond the scope of our course. But, this is such an important theorem that it’s sometimes called the fundamental theorem of linear algebra. It tells us, for one, that the dimension of the null space is equal to the number of columns minus the rank. “Nullity” is just another word for the dimension of the null space.
Let’s see how it can be used in practice. Some of these examples are taken from Gilbert Strang’s book.
Describe the column space, row space, and null space of the matrix
A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \end{bmatrix}
Solution
A only has one linearly independent column; columns 2 and 3 are both just multiples of column 1. So, rank(A)=1.
The column space of A is the span of the first column, i.e.
colsp(A) = span\left( \left\{ \begin{bmatrix} 1 \\ 2 \end{bmatrix} \right\} \right)
This is a 1-dimensional subspace of R2; 1 comes from the rank of A, and 2 comes from the fact that each column of A is a vector in R2.
The row space of A is the span of the first row, i.e.
colsp(A^T) = span\left( \left\{ \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \right\} \right)
This is a 1-dimensional subspace of R3. Remember that the number of linearly independent columns and rows of a matrix are always the same, which is why we know the dimensions of the column space and row space are the same.
The null space of A is the set of all vectors x ∈ R^3 where Ax = 0. This is a 2-dimensional subspace of R^3. The 2 came from the rank-nullity theorem, which says that the dimension of the null space is equal to the number of columns minus the rank, or 3 − 1 = 2 here.
Can we say more about the null space? It is the set of all vectors x ∈ R^3 where Ax = 0. This is equivalent to the set of all vectors x = \begin{bmatrix} x \\ y \\ z \end{bmatrix} where

x + 2y + 3z = 0
2x + 4y + 6z = 0

The two equations above are equivalent. So, the null space is the set of all vectors x = \begin{bmatrix} x \\ y \\ z \end{bmatrix} where x + 2y + 3z = 0. This is a plane in R^3, as we’d expect from a 2-dimensional subspace.
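As a quick numerical check (again a sketch, assuming SciPy is available):

import numpy as np
from scipy.linalg import null_space

A = np.array([[1, 2, 3],
              [2, 4, 6]])

N = null_space(A)
print(N.shape)                # (3, 2): a 2-dimensional null space, i.e. a plane
print(np.allclose(A @ N, 0))  # True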
Suppose A is a 3×4 matrix with rank 3. Describe colsp(A) and nullsp(A^T). The latter is the null space of A^T, and is sometimes called the left null space of A, as its vectors multiply A on the left, like in y^TA (which performs the same calculation as A^Ty; the results are just transposed).
Solution
When faced with a problem like this, I like drawing out a rectangle that roughly shows me the dimensions of the matrix.
A = \begin{bmatrix} \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \end{bmatrix}
The rank tells us that there are 3 linearly independent columns. Since the columns themselves are in R^3, this must mean that

colsp(A) = all of R^3

since any 3 linearly independent vectors in R^3 will span the entire space. (The existence of the 4th linearly dependent column doesn’t change this fact, it just means that the linear combinations of the four columns won’t be unique.)
The null space of A^T is the set of all vectors y where

A^Ty = 0
nullsp(A^T) is a collection of vectors in R^3. What is the dimension of this space? Rank-nullity tells us that

rank(A) + dim(nullsp(A)) = number of columns of A

We can’t use this directly, since the matrix we’re dealing with is A^T, not A. So, replacing A with A^T in the equation above, we get

rank(A^T) + dim(nullsp(A^T)) = number of columns of A^T

But, rank(A^T) = rank(A), and the number of columns of A^T is the same as the number of rows of A. So,

rank(A) + dim(nullsp(A^T)) = number of rows of A
dim(nullsp(A^T)) = number of rows of A − rank(A)
But since A has 3 rows and a rank of 3, we know that
dim(nullsp(A^T)) = 3 − 3 = 0
So, nullsp(A^T) is a 0-dimensional subspace of R^3, meaning it only contains the zero vector.

nullsp(A^T) = {0}

More intuitively, using the results of the previous example, we know that A^T’s columns are linearly independent (since there are 3 of them and rank(A^T) = 3). So, the only vector in nullsp(A^T) is the zero vector.
Suppose r ∈ colsp(A^T) and n ∈ nullsp(A), meaning r is in the row space of A and n is in the null space of A.
Prove that r and n must be orthogonal. Often, this is phrased as “the row space and null space are orthogonal complements”.
Solution
The row space of A, colsp(A^T), is the set of all vectors r where r = A^Ty for some y ∈ R^n. Note that if A is an n×d matrix, then A^T is a d×n matrix, and r is in R^d.
The null space of A, nullsp(A), is the set of all vectors n where An = 0. Note that if A is an n×d matrix, then n is in R^d.
So, r and n are both in R^d, which means they exist in the same universe (they have the same number of components), and so we can ask if they’re orthogonal. (If they had different numbers of components, this question would be a non-starter.)
In order to show that they’re orthogonal, we need to show that their dot product is 0.
r \cdot n = (A^Ty) \cdot n \overset{u \cdot v \, = \, u^Tv}{=} (A^Ty)^Tn = y^T\underbrace{An}_{= \, 0} = y^T0 = 0
So, every vector in the row space of A is orthogonal to every vector in the null space of A!
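Here’s a numerical sanity check of this fact on our running example, a sketch assuming SciPy for the null space basis:

import numpy as np
from scipy.linalg import null_space

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

rng = np.random.default_rng(42)
r = A.T @ rng.standard_normal(5)  # a random vector in the row space of A
n = null_space(A)[:, 0]           # a vector in the null space of A

print(np.isclose(r @ n, 0))       # True: the two are orthogonal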
Suppose A is an n×d matrix, and B is a d×p matrix.
Explain why the column space of AB is a subset of the column space of A, and the row space of AB is a subset of the row space of B. What does this imply about the rank of AB?
Solution
To show that colsp(AB) is a subset of colsp(A), i.e.
colsp(AB)⊆colsp(A)
we need to show that any vector in colsp(AB) is also in colsp(A). AB is a matrix of shape n×p, so to multiply it by a vector x, x must be in R^p.
Suppose y ∈ R^n is in colsp(AB). Then, y can be written as
y = ABx
for some x ∈ R^p.
But,
y = ABx = A\underbrace{(Bx)}_{\text{vector in } \mathbb{R}^d}

meaning that y is also a linear combination of the columns of A (since we wrote it as Az, for some vector z ∈ R^d).
So, we’ve shown that if y is in colsp(AB), then y is also in colsp(A). Therefore, colsp(AB)⊆colsp(A). This tells us that rank(AB)≤rank(A), since the rank of a matrix is the dimension of its column space.
Using similar logic, any vector in rowsp(AB) = colsp((AB)^T) = colsp(B^TA^T) is of the form
B^TA^Tx
But, B^TA^Tx is of the form B^Ty for some y ∈ R^d (namely y = A^Tx), meaning that B^TA^Tx is in the column space of B^T, i.e. the row space of B. So, rowsp(AB) ⊆ rowsp(B), meaning rank(AB) ≤ rank(B).
Putting these two results together, we have that
rank(AB)≤rank(A) and rank(AB)≤rank(B)
But, since both must be true, then
rank(AB)≤min(rank(A),rank(B))
So intuitively, when we multiply two matrices, the rank of the resulting matrix can’t be greater than the rank of either of the two matrices we started with, but it can “drop”.
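To see the rank “drop” in action, here’s a small sketch; the specific matrices are made up for illustration:

import numpy as np

A = np.array([[1, 0],
              [0, 1],
              [1, 1]])      # 3x2 with rank 2

B = np.array([[1, 2, 3],
              [2, 4, 6]])   # 2x3 with rank 1

print(np.linalg.matrix_rank(A))      # 2
print(np.linalg.matrix_rank(B))      # 1
print(np.linalg.matrix_rank(A @ B))  # 1: capped by min(rank(A), rank(B))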
Prove that rank(X^TX) = rank(X) for any n×d matrix X.
The matrix X^TX is hugely important for our regression problem, and you’ll also see in Homework 5 that it helps define the covariance matrix of our data.
Solution
First, let’s think about the shape of X^TX. If X is an n×d matrix, then X^T is a d×n matrix, and X^TX is a d×d matrix. So, X and X^TX have the same number of columns (d), but X^TX is also square, which X doesn’t have to be.
The way we’ll proceed is to show that both matrices have the same null space. If they do, then the dimensions of both null spaces must be the same. Since the rank-nullity theorem tells us that
rank(A) + dim(nullsp(A)) = number of columns of A
if we can show that dim(nullsp(X^TX)) = dim(nullsp(X)), then we’ll have shown that rank(X^TX) = rank(X), since both X and X^TX have the same number of columns.
To show that X and X^TX have the same null space, we need to show that any vector v in the null space of X is also in the null space of X^TX, and vice versa.
Part 1: Show that v ∈ nullsp(X) ⟹ v ∈ nullsp(X^TX)
Suppose v ∈ nullsp(X). Then, Xv = 0. Multiplying both sides on the left by X^T, we get
X^TXv = X^T0 = 0
So, v is also in the null space of X^TX.
(This was the easier part.)
Part 2: Show that v ∈ nullsp(X^TX) ⟹ v ∈ nullsp(X)
Suppose v ∈ nullsp(X^TX). Then, X^TXv = 0. It’s not immediately clear how to use this to show that v is in the null space of X, so let’s try something different.
What if we left-multiply both sides of the equation by v^T? This is equivalent to taking the dot product of both sides with v.
X^TXv = 0 ⟹ v^TX^TXv = v^T0 = 0
Okay, so v^TX^TXv = 0. What does this tell us? If we look closely, the left-hand side is really just (Xv)^T(Xv), which is also just (Xv) · (Xv), which we know is equal to ‖Xv‖².
So, we have
‖Xv‖² = 0
But, the only vector with a norm of 0 is the zero vector. So, Xv = 0, and we’ve now shown that if X^TXv = 0, then it must be the case that Xv = 0 as well.
Now that we’ve shown both directions, we’ve shown that nullsp(X^TX) = nullsp(X): any vector in one of these sets is also in the other, and so the two sets must be equal.
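A quick numerical spot-check of this result, a sketch on a randomly generated matrix with one deliberately dependent column:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
X[:, 3] = X[:, 0] + X[:, 1]     # force a dependent column, so rank(X) = 3

print(np.linalg.matrix_rank(X))        # 3
print(np.linalg.matrix_rank(X.T @ X))  # 3, matching rank(X)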
In the coming chapters, we’ll spend a lot of our time decomposing matrices into smaller pieces, each of which gives us some insight into the data stored in the matrix. This is not a new concept: in earlier math courses, you’ve learned to write a number as a product of prime factors, and to factor quadratics like x² − 5x + 6 into (x − 2)(x − 3).
The “ultimate” decomposition for the purposes of machine learning is the singular value decomposition (SVD), which decomposes a matrix into the product of three other matrices.
X = UΣV^T
where U and V are orthogonal matrices and Σ is a diagonal matrix. This decomposition will allow us to solve the dimensionality reduction problem we first alluded to in Chapter 1.1.
We’re not yet ready for that. For now, we’ll introduce a decomposition that ties together ideas from this section, and allows us to prove the fact that the number of linearly independent columns of A is equal to the number of linearly independent rows of A, i.e. that rank(A) = rank(A^T).
Suppose A is an n×d matrix with rank r. This tells us that A has r linearly independent columns (and rows), and the remaining d−r columns can be written as linear combinations of the r “good” columns.
Let’s continue thinking about
A = \begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix}
Recall, rank(A)=2, and
column 3 = column 1 − column 2
Define the matrix C as containing just the linearly independent columns of A, i.e. C’s columns are a basis for colsp(A). Notice that C is a 5×2 matrix; its number of columns is equal to the rank of A.
C = \begin{bmatrix} 5 & 3 \\ 0 & -1 \\ 3 & 4 \\ 6 & 2 \\ 1 & 0 \end{bmatrix}
I’d like to produce a “mixing” matrix R such that A=CR. Here, R will tell us how to “mix” the columns of C (which are linearly independent) to produce the columns of A. Since A is 5×3 and C is 5×2, R must be 2×3 in order for the dimensions to work out.
Column 1 of A is just 1(column 1 of C).
Column 2 of A is just 1(column 2 of C).
Column 3 of A is 1(column 1 of C) − 1(column 2 of C).
Packaging these coefficients into columns gives us

R = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & -1 \end{bmatrix}
By definition, C’s columns are a basis for the column space of A, colsp(A), since C’s columns are linearly independent and span colsp(A).
But, there’s another fact hiding in plain sight: R’s rows are a basis for the row space of A, colsp(A^T)! The row space of A is the set of all linear combinations of A’s rows, which is a subspace of R^3, and any vector in that subspace can be written as a linear combination of R’s two rows, \begin{bmatrix} 1 & 0 & 1 \end{bmatrix} and \begin{bmatrix} 0 & 1 & -1 \end{bmatrix}. (In fact, the first row of R is exactly row 5 of A, and the second is −1 times row 2 of A, though this sort of coincidence is not always guaranteed.)
To be precise, a CR decomposition of A consists of a matrix C of shape n×r that contains A’s linearly independent columns, and a matrix R of shape r×d that contains the coefficients needed to write each column of A as a linear combination of the columns of C. As we’re seeing above, in addition to C’s columns being linearly independent, R’s rows are also linearly independent. I won’t prove this fact here, but it’s a good thought exercise to think about why it must be true.
Notice that I’ve said a CR decomposition, not the CR decomposition, because this decomposition is not unique. Think about why the following is also a CR decomposition of A.
import numpy as np

# A different valid choice of C: its columns (A's columns 1 and 3) are
# still a basis for colsp(A), just a different one than before.
C = np.array([[5, 2],
              [0, 1],
              [3, -1],
              [6, 4],
              [1, 1]])

# The matching mixing matrix R: column j of R holds the coefficients
# needed to rebuild column j of A from the columns of C.
R = np.array([[1, 1, 0],
              [0, -1, 1]])

C @ R  # reproduces A exactly
You can think of rank(A) as being the minimum possible number of columns in C to make A=CR possible.
Here, since rank(A) = 2, the minimum possible number of columns in C is 2: no C with just a single column would allow A = CR to work, and while a C with 3 columns would work, it wouldn’t be minimal.
The reason I’ve introduced the CR decomposition is that it will allow us to see why rank(A) = rank(A^T) for any matrix A, i.e. that the number of linearly independent columns of A is equal to the number of linearly independent rows of A.
For now, let’s suppose we don’t know that this is true, and just interpret rank(A) as the number of linearly independent columns of A.
Suppose, as we typically do, that A is an n×d matrix with rank r, and that A=CR is a CR decomposition of A into matrices
C (with shape n×r), and
R (with shape r×d)
As mentioned above, C’s columns are linearly independent, and R’s rows are linearly independent.
What happens if we transpose A=CR? We get
A^T = (CR)^T = R^TC^T
This tells us that A^T can be written as the product of R^T (shape d×r) and C^T (shape r×n).
Key insight: this is just a CR decomposition of A^T, using R^T and C^T instead of C and R!
The columns of A^T are just the rows of A, and the columns of R^T are just the rows of R. Since we knew that the rows of R were linearly independent (using my earlier assumption), we know that the columns of R^T are also linearly independent, which makes this a CR decomposition of A^T.
Because A^T has a CR decomposition using a matrix with r linearly independent columns (R^T), it must have a rank of r. But since rank(A) = r, we must have

rank(A) = rank(A^T)
I skipped some minor steps in this proof, but the main idea is that the CR decomposition A=CR gives us two things:
A basis for the column space of A, stored in C’s columns
A basis for the row space of A, stored in R’s rows
Remember that C is carefully constructed to contain A’s linearly independent columns; any arbitrary decomposition of A=XY might not have these properties.