As we've seen, $\text{colsp}(A)$ is a subspace of $\mathbb{R}^n$ that contains all possible results of $Ax$ for $x \in \mathbb{R}^d$. Think of $\text{colsp}(A)$ as the "reach" of the columns of $A$.
If the columns of $A$ are linearly independent, then the only way to create $0$ through a linear combination of the columns of $A$ is if all the coefficients in the linear combination are $0$, i.e. if $x = 0$ in $Ax$.
But, if $A$'s columns aren't all linearly independent, then there will be some non-zero vectors $x$ where $Ax = 0$. This follows from the definition of linear independence, which says that if there's a non-zero linear combination of a collection of vectors that produces $0$, the vectors aren't linearly independent.
It turns out that it's worthwhile to study the set of vectors $x$ that get sent to $0$ when multiplied by $A$. This set is called the null space of $A$:

$$\text{nullsp}(A) = \{ x \in \mathbb{R}^d : Ax = 0 \}$$

Sometimes, the null space is also called the kernel of $A$, though we will mostly avoid that term in our class, since kernel often means something different in the context of machine learning.
Important: $\text{nullsp}(A)$ is a subspace of $\mathbb{R}^d$. Since $A$ is an $n \times d$ matrix, the vectors $x$ it multiplies must be in $\mathbb{R}^d$, so that's where the null space lives.
Let's return to the example $A$ from earlier.

$$A = \begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix}$$
$\text{nullsp}(A)$ is the set of all vectors $x \in \mathbb{R}^3$ where $Ax = 0$. $\text{rank}(A) = 2$, so there will be some non-zero vectors $x$ in the null space. But what are they?
To find them, we need to find the general solution to

$$Ax = 0$$

One solution jumps out: since column 3 of $A$ equals column 1 minus column 2, we have $1 \cdot (\text{column 1}) - 1 \cdot (\text{column 2}) - 1 \cdot (\text{column 3}) = 0$, so $x = \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix}$ works, as does any multiple of it. For example, $\begin{bmatrix} -5 \\ 5 \\ 5 \end{bmatrix} \in \text{nullsp}(A)$. Remember that vectors in $\text{nullsp}(A)$ are in $\mathbb{R}^3$, since $A$ is a $5 \times 3$ matrix, while vectors in $\text{colsp}(A)$ are in $\mathbb{R}^5$.
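As a quick sanity check, here's a short numpy snippet (in the style of the code that appears later in this section) verifying that this vector really is in the null space:

```python
import numpy as np

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

x = np.array([-5, 5, 5])
print(A @ x)  # [0 0 0 0 0], so x is in nullsp(A)
```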
In this example, $\text{nullsp}(A)$ is a 1-dimensional subspace of $\mathbb{R}^3$. We also know from earlier that $\text{rank}(A) = 2$. And curiously, $1 + 2 = 3$, the number of columns in $A$. This is not a coincidence, and sheds light on an important theorem.
This observation generalizes into the rank-nullity theorem, which says that for any $n \times d$ matrix $A$,

$$\text{rank}(A) + \dim(\text{nullsp}(A)) = d$$

The proof of this theorem is beyond the scope of our course. But, this is such an important theorem that it's sometimes called the fundamental theorem of linear algebra. It tells us, for one, that the dimension of the null space is equal to the number of columns minus the rank. "Nullity" is just another word for the dimension of the null space.
Let’s see how it can be used in practice. Some of these examples are taken from Gilbert Strang’s book.
Describe the column space, row space, and null space of the matrix
$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \end{bmatrix}$$
Solution
$A$ only has one linearly independent column; columns 2 and 3 are both just multiples of column 1. So, $\text{rank}(A) = 1$.
The column space of $A$ is the span of the first column, i.e.

$$\text{colsp}(A) = \text{span}\left(\left\{ \begin{bmatrix} 1 \\ 2 \end{bmatrix} \right\}\right)$$

This is a 1-dimensional subspace of $\mathbb{R}^2$; the 1 comes from the rank of $A$, and the 2 comes from the fact that each column of $A$ is a vector in $\mathbb{R}^2$.
The row space of $A$ is the span of the first row, i.e.

$$\text{colsp}(A^T) = \text{span}\left(\left\{ \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \right\}\right)$$

This is a 1-dimensional subspace of $\mathbb{R}^3$. Remember that the number of linearly independent columns and rows of a matrix are always the same, which is why we know the dimensions of the column space and row space are the same.
The null space of $A$ is the set of all vectors $x \in \mathbb{R}^3$ where $Ax = 0$. This is a 2-dimensional subspace of $\mathbb{R}^3$. The 2 came from the rank-nullity theorem, which says that the dimension of the null space is equal to the number of columns minus the rank, or $3 - 1 = 2$ here.
Can we say more about the null space? It is the set of all vectors $x \in \mathbb{R}^3$ where $Ax = 0$. This is equivalent to the set of all vectors $x = \begin{bmatrix} x \\ y \\ z \end{bmatrix}$ where

$$\begin{aligned} x + 2y + 3z &= 0 \\ 2x + 4y + 6z &= 0 \end{aligned}$$

The two equations above are equivalent. So, the null space is the set of all vectors $x = \begin{bmatrix} x \\ y \\ z \end{bmatrix}$ where $x + 2y + 3z = 0$. This is a plane in $\mathbb{R}^3$, as we'd expect from a 2-dimensional subspace.
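If you'd like to verify this computationally, SciPy (which I'm assuming is available; it isn't otherwise used in these notes) can produce a basis for the null space directly:

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1, 2, 3],
              [2, 4, 6]])

N = null_space(A)             # columns form an orthonormal basis for nullsp(A)
print(N.shape)                # (3, 2): a 2-dimensional subspace of R^3
print(np.allclose(A @ N, 0))  # True
```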
Suppose $A$ is a $3 \times 4$ matrix with rank 3. Describe $\text{colsp}(A)$ and $\text{nullsp}(A^T)$. The latter is the null space of $A^T$, and is sometimes called the left null space of $A$, since it contains the vectors $y$ that produce $0$ when they multiply $A$ on the left, as in $y^T A$ (which performs the same calculation as $A^T y$; the results are just transposes of one another).
Solution
When faced with a problem like this, I like drawing out a rectangle that roughly shows me the dimensions of the matrix.
$$A = \begin{bmatrix} \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \end{bmatrix}$$
The rank tells us that there are 3 linearly independent columns. Since the columns themselves are in $\mathbb{R}^3$, this must mean that

$$\text{colsp}(A) = \text{all of } \mathbb{R}^3$$

since any 3 linearly independent vectors in $\mathbb{R}^3$ will span the entire space. (The existence of the 4th linearly dependent column doesn't change this fact, it just means that linear combinations of the four columns won't be unique.)
The null space of $A^T$ is the set of all vectors $y$ where

$$A^T y = 0$$
$\text{nullsp}(A^T)$ is a collection of vectors in $\mathbb{R}^3$. What is the dimension of this space? Rank-nullity tells us that

$$\text{rank}(A) + \dim(\text{nullsp}(A)) = \text{number of columns of } A$$

We can't use this directly, since the matrix we're dealing with is $A^T$, not $A$. So, replacing $A$ with $A^T$ in the equation above, we get

$$\text{rank}(A^T) + \dim(\text{nullsp}(A^T)) = \text{number of columns of } A^T$$

But, $\text{rank}(A^T) = \text{rank}(A)$, and the number of columns of $A^T$ is the same as the number of rows of $A$. So,

$$\begin{aligned} \text{rank}(A) + \dim(\text{nullsp}(A^T)) &= \text{number of rows of } A \\ \dim(\text{nullsp}(A^T)) &= \text{number of rows of } A - \text{rank}(A) \end{aligned}$$

But since $A$ has 3 rows and a rank of 3, we know that

$$\dim(\text{nullsp}(A^T)) = 3 - 3 = 0$$
So, $\text{nullsp}(A^T)$ is a 0-dimensional subspace of $\mathbb{R}^3$, meaning it only contains the zero vector:

$$\text{nullsp}(A^T) = \{0\}$$
More intuitively, using the results of the previous example, we know that $A^T$'s columns are linearly independent (since there are 3 of them and $\text{rank}(A^T) = 3$). So, the only vector in $\text{nullsp}(A^T)$ is the zero vector.
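Here's a numerical sanity check on one concrete $3 \times 4$ matrix with rank 3 (the specific entries are my choice, not from the example):

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1, 0, 0, 1],
              [0, 1, 0, 1],
              [0, 0, 1, 1]])

print(np.linalg.matrix_rank(A))  # 3
print(null_space(A.T).shape)     # (3, 0): nullsp(A^T) contains only the zero vector
```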
Suppose $r \in \text{colsp}(A^T)$ and $n \in \text{nullsp}(A)$, meaning $r$ is in the row space of $A$ and $n$ is in the null space of $A$.
Prove that $r$ and $n$ must be orthogonal. Often, this is phrased as "the row space and null space are orthogonal complements".
Solution
The row space of $A$, $\text{colsp}(A^T)$, is the set of all vectors $r$ where $r = A^T y$ for some $y \in \mathbb{R}^n$. Note that if $A$ is an $n \times d$ matrix, then $A^T$ is a $d \times n$ matrix, and $r$ is in $\mathbb{R}^d$.
The null space of $A$, $\text{nullsp}(A)$, is the set of all vectors $n$ where $An = 0$. Note that if $A$ is an $n \times d$ matrix, then $n$ is in $\mathbb{R}^d$.
So, $r$ and $n$ are both in $\mathbb{R}^d$, which means they exist in the same universe (they have the same number of components), and so we can ask if they're orthogonal. (If they had different numbers of components, this question would be a non-starter.)
In order to show that they’re orthogonal, we need to show that their dot product is 0.
$$r \cdot n = (A^T y) \cdot n = (A^T y)^T n = y^T A n = y^T 0 = 0$$

Above, the second equality uses the fact that $u \cdot v = u^T v$ for any two vectors, and the second-to-last equality uses the fact that $An = 0$.

So, every vector in the row space of $A$ is orthogonal to every vector in the null space of $A$!
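We can see this concretely with the $5 \times 3$ matrix $A$ from earlier: take any $y$ (the choice below is arbitrary) to build a row space vector, and dot it with the null space vector we found before:

```python
import numpy as np

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

n = np.array([-5, 5, 5])        # in nullsp(A), from earlier
y = np.array([1, 2, 3, 4, 5])   # any y in R^5 works
r = A.T @ y                     # a vector in the row space of A

print(np.dot(r, n))  # 0
```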
Suppose $A$ is an $n \times d$ matrix, and $B$ is a $d \times p$ matrix.
Explain why the column space of $AB$ is a subset of the column space of $A$, and the row space of $AB$ is a subset of the row space of $B$. What does this imply about the rank of $AB$?
Solution
To show that $\text{colsp}(AB)$ is a subset of $\text{colsp}(A)$, i.e.

$$\text{colsp}(AB) \subseteq \text{colsp}(A)$$

we need to show that any vector in $\text{colsp}(AB)$ is also in $\text{colsp}(A)$. $AB$ is a matrix of shape $n \times p$, so to multiply it by a vector $x$, $x$ must be in $\mathbb{R}^p$.
Suppose $y \in \mathbb{R}^n$ is in $\text{colsp}(AB)$. Then, $y$ can be written as

$$y = ABx$$

for some $x \in \mathbb{R}^p$.
But,

$$y = ABx = A\underbrace{(Bx)}_{\text{vector in } \mathbb{R}^d}$$

meaning that $y$ is also a linear combination of the columns of $A$ (since we wrote it as $Az$, for the vector $z = Bx \in \mathbb{R}^d$).
So, we've shown that if $y$ is in $\text{colsp}(AB)$, then $y$ is also in $\text{colsp}(A)$. Therefore, $\text{colsp}(AB) \subseteq \text{colsp}(A)$. This tells us that $\text{rank}(AB) \leq \text{rank}(A)$, since the rank of a matrix is the dimension of its column space.
Using similar logic, any vector in $\text{rowsp}(AB) = \text{colsp}((AB)^T) = \text{colsp}(B^T A^T)$ is of the form

$$B^T A^T x$$

But, $B^T A^T x$ is of the form $B^T y$ for some $y \in \mathbb{R}^d$ (namely, $y = A^T x$), meaning that $B^T A^T x$ is in the column space of $B^T$, which is the row space of $B$. So, $\text{rowsp}(AB) \subseteq \text{rowsp}(B)$, meaning $\text{rank}(AB) \leq \text{rank}(B)$.
Putting these two results together, we have that

$$\text{rank}(AB) \leq \text{rank}(A) \quad \text{and} \quad \text{rank}(AB) \leq \text{rank}(B)$$

Since both must be true,

$$\text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B))$$
So intuitively, when we multiply two matrices, the rank of the resulting matrix can't be greater than the rank of either of the matrices we started with; the rank can only stay the same or "drop".
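A quick numpy illustration of both the bound and the possibility of rank "dropping" (the matrices here are my own toy examples):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))      # rank 3 (with probability 1)
B = rng.standard_normal((3, 5))      # rank 3
print(np.linalg.matrix_rank(A @ B))  # 3, which is min(3, 3)

# The rank can drop below both inputs' ranks:
A2 = np.array([[1, 0]])
B2 = np.array([[0, 0, 0],
               [1, 2, 3]])
# A2 @ B2 selects only B2's zero row, so the product is the zero matrix.
print(np.linalg.matrix_rank(A2), np.linalg.matrix_rank(B2))  # 1 1
print(np.linalg.matrix_rank(A2 @ B2))                        # 0
```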
Prove that $\text{rank}(X^T X) = \text{rank}(X)$ for any $n \times d$ matrix $X$.
The matrix $X^T X$ is hugely important for our regression problem, and you'll also see in Homework 5 that it helps define the covariance matrix of our data.
Solution
First, let's think about the shape of $X^T X$. If $X$ is an $n \times d$ matrix, then $X^T$ is a $d \times n$ matrix, and $X^T X$ is a $d \times d$ matrix. So, $X$ and $X^T X$ have the same number of columns ($d$), but $X^T X$ is also square, which $X$ doesn't have to be.
The way we'll proceed is to show that both matrices have the same null space. If they have the same null space, then the dimensions of their null spaces must also be the same. Since the rank-nullity theorem tells us that

$$\text{rank}(A) + \dim(\text{nullsp}(A)) = \text{number of columns of } A$$

if we can show that $\dim(\text{nullsp}(X^T X)) = \dim(\text{nullsp}(X))$, then we'll have shown that $\text{rank}(X^T X) = \text{rank}(X)$, since both $X$ and $X^T X$ have the same number of columns.
To show that both $X$ and $X^T X$ have the same null space, we need to show that any vector $v$ in the null space of $X$ is also in the null space of $X^T X$, and vice versa.
Part 1: Show that $v \in \text{nullsp}(X) \implies v \in \text{nullsp}(X^T X)$
Suppose $v \in \text{nullsp}(X)$. Then, $Xv = 0$. Multiplying both sides on the left by $X^T$, we get

$$X^T X v = X^T 0 = 0$$

So, $v$ is also in the null space of $X^T X$.
(This was the easier part.)
Part 2: Show that $v \in \text{nullsp}(X^T X) \implies v \in \text{nullsp}(X)$
Suppose $v \in \text{nullsp}(X^T X)$. Then, $X^T X v = 0$. It's not immediately clear how to use this to show that $v$ is in the null space of $X$, so let's try something different.
What if we left-multiply both sides of the equation by $v^T$? This is equivalent to taking the dot product of both sides with $v$.

$$\begin{aligned} X^T X v &= 0 \\ v^T X^T X v &= v^T 0 = 0 \end{aligned}$$
Okay, so $v^T X^T X v = 0$. What does this tell us? If we look closely, the left-hand side is really just $(Xv)^T (Xv)$, which is also just $(Xv) \cdot (Xv)$, which we know is equal to $\lVert Xv \rVert^2$.
So, we have

$$\lVert Xv \rVert^2 = 0$$

But, the only vector with a norm of 0 is the zero vector. So, $Xv = 0$, and we've now shown that if $X^T X v = 0$, then it has to be the case that $Xv = 0$ also.
Now that we've shown both directions, we've shown that $\text{nullsp}(X^T X) = \text{nullsp}(X)$: any vector in one of these sets is also in the other, so the two sets must be equal. By rank-nullity, then, $\text{rank}(X^T X) = \text{rank}(X)$.
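As a sanity check, here's a small numpy experiment (with a deliberately rank-deficient $X$ of my own construction):

```python
import numpy as np

rng = np.random.default_rng(42)

# Build a 10x3 matrix whose third column is the sum of the first two,
# so X has only 2 linearly independent columns.
X = rng.standard_normal((10, 2))
X = np.column_stack([X, X[:, 0] + X[:, 1]])

print(np.linalg.matrix_rank(X))        # 2
print(np.linalg.matrix_rank(X.T @ X))  # 2
```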
In the coming chapters, we'll spend a lot of our time decomposing matrices into smaller pieces, each of which gives us some insight into the data stored in the matrix. This is not a new concept: in earlier math courses, you've learned to write a number as a product of prime factors, and to factor quadratics like $x^2 - 5x + 6$ into $(x-2)(x-3)$.
The "ultimate" decomposition for the purposes of machine learning is the singular value decomposition (SVD), which decomposes a matrix into the product of three other matrices,

$$X = U \Sigma V^T$$

where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix. This decomposition will allow us to solve the dimensionality reduction problem we first alluded to in Chapter 1.1.
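We'll develop the SVD properly in later chapters, but as a preview, numpy can already compute it for us (the matrix below is just a random placeholder):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))

# np.linalg.svd returns U, the singular values, and V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(U.shape, s.shape, Vt.shape)           # (6, 4) (4,) (4, 4)
print(np.allclose(X, U @ np.diag(s) @ Vt))  # True
```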
We're not yet ready for that. For now, we'll introduce a decomposition that ties together ideas from this section, and allows us to prove the fact that the number of linearly independent columns of $A$ is equal to the number of linearly independent rows of $A$, i.e. that $\text{rank}(A) = \text{rank}(A^T)$.
Suppose $A$ is an $n \times d$ matrix with rank $r$. This tells us that $A$ has $r$ linearly independent columns (and rows), and that the remaining $d - r$ columns can be written as linear combinations of those $r$ "good" columns.
Let's continue thinking about

$$A = \begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix}$$

Recall, $\text{rank}(A) = 2$, and

$$\text{column 3} = \text{column 1} - \text{column 2}$$
Define the matrix $C$ as containing just the linearly independent columns of $A$, i.e. $C$'s columns are a basis for $\text{colsp}(A)$. Notice that $C$ is a $5 \times 2$ matrix; its number of columns is equal to the rank of $A$.

$$C = \begin{bmatrix} 5 & 3 \\ 0 & -1 \\ 3 & 4 \\ 6 & 2 \\ 1 & 0 \end{bmatrix}$$
I'd like to produce a "mixing" matrix $R$ such that $A = CR$. Here, $R$ will tell us how to "mix" the columns of $C$ (which are linearly independent) to produce the columns of $A$. Since $A$ is $5 \times 3$ and $C$ is $5 \times 2$, $R$ must be $2 \times 3$ in order for the dimensions to work out.
- Column 1 of $A$ is just $1 \cdot (\text{column 1 of } C)$.
- Column 2 of $A$ is just $1 \cdot (\text{column 2 of } C)$.
- Column 3 of $A$ is $1 \cdot (\text{column 1 of } C) - 1 \cdot (\text{column 2 of } C)$.

Reading off these coefficients column-by-column gives

$$R = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & -1 \end{bmatrix}$$
By definition, $C$'s columns are a basis for the column space of $A$, $\text{colsp}(A)$, since $C$'s columns are linearly independent and span $\text{colsp}(A)$.
But, there's another fact hiding in plain sight: $R$'s rows are a basis for the row space of $A$, $\text{colsp}(A^T)$! The row space of $A$ is the set of all linear combinations of $A$'s rows, which is a subspace of $\mathbb{R}^3$, and any vector in that subspace can be written as a linear combination of $R$'s two rows, $\begin{bmatrix} 1 & 0 & 1 \end{bmatrix}$ and $\begin{bmatrix} 0 & 1 & -1 \end{bmatrix}$. (In fact, the first row of $R$ is literally row 5 of $A$, and the second row is $-1$ times row 2 of $A$, though such a tidy correspondence is not always guaranteed.)
To be precise, a CR decomposition of $A$ consists of a matrix $C$ of shape $n \times r$ that contains $A$'s linearly independent columns, and a matrix $R$ of shape $r \times d$ that contains the coefficients needed to write each column of $A$ as a linear combination of the columns of $C$. As we're seeing above, in addition to $C$'s columns being linearly independent, $R$'s rows are also linearly independent. I won't prove this fact here, but it's a good thought exercise to think about why it must be true.
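To make this concrete, here's the decomposition above in numpy, in the same style as the block that follows; the product C @ R rebuilds $A$ exactly:

```python
import numpy as np

C = np.array([[5, 3],
              [0, -1],
              [3, 4],
              [6, 2],
              [1, 0]])

R = np.array([[1, 0, 1],
              [0, 1, -1]])

C @ R  # reproduces the original 5x3 matrix A
```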
Notice that I’ve said a CR decomposition, not the CR decomposition, because this decomposition is not unique. Think about why the following is also a CR decomposition of A.
```python
import numpy as np

C = np.array([[5, 2],
              [0, 1],
              [3, -1],
              [6, 4],
              [1, 1]])

R = np.array([[1, 1, 0],
              [0, -1, 1]])

C @ R  # also reproduces A exactly
```
You can think of $\text{rank}(A)$ as the minimum possible number of columns in $C$ to make $A = CR$ possible.
Here, since $\text{rank}(A) = 2$, the minimum possible number of columns in $C$ is 2: no $C$ with just a single column would allow $A = CR$ to work, and while a $C$ with 3 columns would work, 2 is the minimum.
The reason I've introduced the CR decomposition is that it will allow us to see why $\text{rank}(A) = \text{rank}(A^T)$ for any matrix $A$, i.e. that the number of linearly independent columns of $A$ is equal to the number of linearly independent rows of $A$.
For now, let's suppose we don't know that this is true, and just interpret $\text{rank}(A)$ as the number of linearly independent columns of $A$.
Suppose, as we typically do, that $A$ is an $n \times d$ matrix with rank $r$, and that $A = CR$ is a CR decomposition of $A$ into matrices

- $C$ (with shape $n \times r$), and
- $R$ (with shape $r \times d$)

As mentioned above, $C$'s columns are linearly independent, and $R$'s rows are linearly independent.
What happens if we transpose $A = CR$? We get

$$A^T = (CR)^T = R^T C^T$$

This tells us that $A^T$ can be written as the product of $R^T$ (shape $d \times r$) and $C^T$ (shape $r \times n$).
Key insight: this is just a CR decomposition of $A^T$, using $R^T$ and $C^T$ in place of $C$ and $R$!
The columns of $A^T$ are just the rows of $A$, and the columns of $R^T$ are just the rows of $R$. Since we knew that the rows of $R$ were linearly independent (using my earlier assumption), we know that the columns of $R^T$ are also linearly independent, which makes this a valid CR decomposition of $A^T$.
Because $A^T$ has a CR decomposition using a matrix with $r$ linearly independent columns ($R^T$), it must have a rank of $r$. But since $\text{rank}(A) = r$, we must have

$$\text{rank}(A) = \text{rank}(A^T)$$
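As a quick numerical confirmation with the running example:

```python
import numpy as np

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

print(np.linalg.matrix_rank(A))    # 2
print(np.linalg.matrix_rank(A.T))  # 2
```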
I skipped some minor steps in this proof, but the main idea is that the CR decomposition $A = CR$ gives us two things:

- A basis for the column space of $A$, stored in $C$'s columns.
- A basis for the row space of $A$, stored in $R$'s rows.

Remember that $C$ is carefully constructed to contain $A$'s linearly independent columns; an arbitrary factorization $A = XY$ might not have these properties.