As we saw in Chapter 5.3, colsp(A) is a subspace of Rn that contains all possible results of Ax for x∈Rd. Think of colsp(A) as the “reach” of the columns of A.
If the columns of A are linearly independent, then the only way to create 0 through a linear combination of the columns of A is if all the coefficients in the linear combination are 0, i.e. if x=0 in Ax.
But, if A’s columns aren’t all linearly independent, then there will be some non-zero vectors x where Ax=0. This follows from the definition of linear independence, which says that if there’s a non-zero linear combination of a collection of vectors that produces 0, the vectors aren’t linearly independent.
It turns out that it’s worthwhile to study the set of vectors x that get sent to 0 when multiplied by A. This set is called the null space of A, written nullsp(A) = {x ∈ Rd : Ax = 0}.
Sometimes, the null space is also called the kernel of A, though we will mostly avoid that term in our class, since kernel often means something different in the context of machine learning.
Important: nullsp(A) is a subspace of Rd, since it is made up of vectors that get multiplied by A, an n×d matrix. Vectors in nullsp(A) are in Rd, while vectors in colsp(A) are in Rn.
Let’s return to the example A from earlier.
A = \begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix}
nullsp(A) is the set of all vectors x∈R3 where Ax=0. rank(A)=2, which is less than 3 (the number of columns of A), so there will be some non-zero vectors x in the null space. But what are they?
There’s a shortcut to finding the null space of a matrix, which works if you happen to already know the relationship between the columns of A, or if the relationship is simple enough to eyeball. In Chapter 5.3, we saw that
column 3=column 1−column 2
or, in other words,
column 1−column 2−column 3=0
This, right away, tells us that taking 1 of column 1, subtracting 1 of column 2, and subtracting 1 of column 3 will give us the zero vector. But Ax is just a linear combination of the columns in A using the coefficients in x, so this means that
A \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix} = 0
So, we’ve found a vector in nullsp(A). Any scalar multiple of this vector will also be in nullsp(A).
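As a quick sanity check, a minimal numpy sketch (the matrix is the example A from above):

```python
import numpy as np

# The 5x3 example matrix A from above.
A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

x = np.array([1, -1, -1])

print(A @ x)        # the zero vector: x is in nullsp(A)
print(A @ (7 * x))  # any scalar multiple of x is in nullsp(A) too
```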
In this example, nullsp(A) is a 1-dimensional subspace of R3. We also know from earlier that rank(A)=2. And curiously, 1+2=3, the number of columns in A. This is not a coincidence, and sheds light on an important theorem.
The theorem in question is the rank-nullity theorem: if A is an n×d matrix, then rank(A) + dim(nullsp(A)) = d, the number of columns of A. The proof of this theorem is beyond the scope of our course. But, this is such an important theorem that it’s sometimes called the fundamental theorem of linear algebra. It tells us, for one, that the dimension of the null space is equal to the number of columns minus the rank. “Nullity” is just another word for the dimension of the null space.
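We can spot-check rank-nullity on the example A from this section. This is a sketch using numpy; counting near-zero singular values is one standard numerical way to measure the nullity (we'll properly meet singular values later in the course), and the 1e-10 threshold is an arbitrary choice of mine:

```python
import numpy as np

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

d = A.shape[1]                   # number of columns: 3
rank = np.linalg.matrix_rank(A)  # 2
# One numerical way to get the nullity: count the (near-)zero singular values.
nullity = int(np.sum(np.linalg.svd(A, compute_uv=False) < 1e-10))

print(rank, nullity, d)  # rank + nullity equals d
```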
Let’s see how it can be used in practice. Some of these examples are taken from Gilbert Strang’s book.
Describe the column space, row space, and null space of the matrix
A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \end{bmatrix}
Additionally, find a basis for the null space.
Solution
A only has one linearly independent column; columns 2 and 3 are both just multiples of column 1. So, rank(A)=1.
The column space of A is the span of the first column, i.e.
colsp(A) = span\left(\left\{ \begin{bmatrix} 1 \\ 2 \end{bmatrix} \right\}\right)
This is a 1-dimensional subspace of R2; 1 comes from the rank of A, and 2 comes from the fact that each column of A is a vector in R2.
The row space of A is the span of the first row, i.e.
colsp(A^T) = span\left(\left\{ \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \right\}\right)
This is a 1-dimensional subspace of R3. Remember that the number of linearly independent columns and rows of a matrix are always the same, which is why we know the dimensions of the column space and row space are the same.
The null space of A is the set of all vectors x∈R3 where Ax=0. This is a 2-dimensional subspace of R3. 2 came from the rank-nullity theorem, which says that the dimension of the null space is equal to the number of columns minus the rank, or
3−1=2
here.
Can we say more about the null space? It is the set of all vectors x∈R3 where Ax=0. This is equivalent to the set of all vectors x = \begin{bmatrix} x \\ y \\ z \end{bmatrix} where

x + 2y + 3z = 0
2x + 4y + 6z = 0

The two equations above are equivalent, since the second is just twice the first. So, the null space is the set of all vectors x = \begin{bmatrix} x \\ y \\ z \end{bmatrix} where x + 2y + 3z = 0. This is a plane in R3, as we’d expect from a 2-dimensional subspace.
Can we find a basis for the null space? Yes, we can. My preferred method is to leverage the fact that we know the relationship between the columns of A:
column 2=2(column 1)⟹2(column 1)−column 2=0
column 3=3(column 1)⟹3(column 1)−column 3=0
So, this tells us that
A \begin{bmatrix} 2 \\ -1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \end{bmatrix} \begin{bmatrix} 2 \\ -1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
and
A \begin{bmatrix} 3 \\ 0 \\ -1 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \end{bmatrix} \begin{bmatrix} 3 \\ 0 \\ -1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
So, we’ve found two linearly independent vectors in the null space; since nullsp(A) is 2-dimensional, these must be a basis.
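We can verify both claims with numpy (a quick sketch; A and the basis vectors come from the worked solution above):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 6]])

# The two null-space vectors found above, stacked as the columns of B.
B = np.array([[2, 3],
              [-1, 0],
              [0, -1]])

print(A @ B)                     # a 2x2 block of zeros: both vectors are in nullsp(A)
print(np.linalg.matrix_rank(B))  # 2: the two vectors are linearly independent
```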
Suppose A is a 3×4 matrix with rank 3. Describe colsp(A) and nullsp(AT). The latter is the null space of AT, and is sometimes called the left null space of A, as it’s a null space of A when multiplied by vectors on the left, like in yTA (which performs the same calculation as ATy; the results are just transposed).
Solution
When faced with a problem like this, I like drawing out a rectangle that roughly shows me the dimensions of the matrix.
A = \begin{bmatrix} \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \end{bmatrix}
The rank tells us that there are 3 linearly independent columns. Since the columns themselves are in R3, this must mean that
colsp(A)=all of R3
since any 3 linearly independent vectors in R3 will span the entire space. (The existence of the 4th linearly dependent column doesn’t change this fact, it just means that the linear combinations of the four columns won’t be unique.)
The null space of AT is the set of all vectors y where ATy = 0.
nullsp(AT) is a collection of vectors in R3. What is the dimension of this space? Rank-nullity tells us that
rank(A)+dim(nullsp(A))=number of columns of A
We can’t use this directly, since the matrix we’re dealing with is AT, not A. So, replacing A with AT in the equation above, we get
rank(AT)+dim(nullsp(AT))=number of columns of AT
But, rank(AT)=rank(A), and the number of columns of AT is the same as the number of rows of A. So,
rank(A) + dim(nullsp(AT)) = number of rows of A
dim(nullsp(AT)) = number of rows of A − rank(A)
But since A has 3 rows and a rank of 3, we know that
dim(nullsp(AT))=3−3=0
So, nullsp(AT) is a 0-dimensional subspace of R3, meaning it only contains the zero vector.
nullsp(AT)={0}
More intuitively, using the results of the previous example, we know that AT’s columns are linearly independent (since there are 3 of them and rank(AT)=3). So, the only vector in nullsp(AT) is the zero vector.
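As a numerical illustration, here is a sketch with a hypothetical 3×4 rank-3 matrix I made up for this purpose (the first three columns are independent, the fourth is a combination of them):

```python
import numpy as np

# A hypothetical 3x4 matrix with rank 3.
A = np.array([[1, 0, 0, 1],
              [0, 1, 0, 2],
              [0, 0, 1, 3]])

print(np.linalg.matrix_rank(A))    # 3, so colsp(A) is all of R3
print(np.linalg.matrix_rank(A.T))  # also 3: AT's columns are independent,
                                   # so nullsp(AT) contains only the zero vector
```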
Find a matrix A such that \begin{bmatrix} 3 \\ 1 \end{bmatrix} ∈ colsp(A) and \begin{bmatrix} 1 \\ 3 \end{bmatrix} ∈ nullsp(A).
Solution
A needs to satisfy both of the following conditions.
\begin{bmatrix} 3 \\ 1 \end{bmatrix} ∈ colsp(A)
\begin{bmatrix} 1 \\ 3 \end{bmatrix} ∈ nullsp(A)
Since A can be multiplied by a vector in R2, and elements in colsp(A) are in R2, we know that A must be a 2×2 matrix. So, let
A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}
The second condition gives
A \begin{bmatrix} 1 \\ 3 \end{bmatrix} = \begin{bmatrix} a + 3b \\ c + 3d \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
which implies
a = −3b, \quad c = −3d
Hence
A = \begin{bmatrix} -3b & b \\ -3d & d \end{bmatrix}
A’s columns are scalar multiples of one another, and are not linearly independent. This is what we’d expect, given that A has a non-trivial null space. (Remember, the trivial null space is just {0}; since A’s null space has more than just 0 in it, its columns are not linearly independent.)
Now, we just need to make sure we pick b and d such that \begin{bmatrix} 3 \\ 1 \end{bmatrix} ∈ colsp(A). The easy solution is to choose b=3 and d=1, which makes \begin{bmatrix} 3 \\ 1 \end{bmatrix} literally one of the columns of A, which means it must be in colsp(A). This gives
A = \begin{bmatrix} -9 & 3 \\ -3 & 1 \end{bmatrix}
This is not the only solution; another comes from choosing b=6 and d=2, which would have made

A = \begin{bmatrix} -18 & 6 \\ -6 & 2 \end{bmatrix}
In both (and in the infinitely many other) cases, \begin{bmatrix} 3 \\ 1 \end{bmatrix} ∈ colsp(A) and \begin{bmatrix} 1 \\ 3 \end{bmatrix} ∈ nullsp(A).
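A quick numerical check of the first solution (a sketch; the matrix is the one constructed above):

```python
import numpy as np

A = np.array([[-9, 3],
              [-3, 1]])

print(A @ np.array([1, 3]))  # [0 0]: the vector [1, 3] is in nullsp(A)
# [3, 1] is literally the second column of A, so it's in colsp(A):
print(A[:, 1])               # [3 1]
```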
Suppose A is an n×d matrix, and B is a d×p matrix.
Explain why the column space of AB is a subset of the column space of A, and the row space of AB is a subset of the row space of B. What does this imply about the rank of AB?
Solution
To show that colsp(AB) is a subset of colsp(A), i.e.
colsp(AB)⊆colsp(A)
we need to show that any vector in colsp(AB) is also in colsp(A). AB is a matrix of shape n×p, so to multiply it by a vector x, x must be in Rp.
Suppose y∈Rn is in colsp(AB). Then, y can be written as
y=ABx
for some x∈Rp.
But,
y = ABx = A(Bx), where Bx is a vector in Rd
meaning that y is also a linear combination of the columns of A (since we wrote it as Az, for some vector z∈Rd).
So, we’ve shown that if y is in colsp(AB), then y is also in colsp(A). Therefore, colsp(AB)⊆colsp(A). This tells us that rank(AB)≤rank(A), since the rank of a matrix is the dimension of its column space.
Using similar logic, any vector in rowsp(AB)=colsp((AB)T)=colsp(BTAT) is of the form
BTATx
But, BTATx is of the form BTy for some y∈Rd, meaning that BTATx is in the column space of BT or row space of B. So, rowsp(AB)⊆rowsp(B), meaning rank(AB)≤rank(B).
Putting these two results together, we have that
rank(AB)≤rank(A) and rank(AB)≤rank(B)
But, since both must be true, then
rank(AB)≤min(rank(A),rank(B))
So intuitively, when we multiply two matrices, the rank of the resulting matrix can’t be greater than the rank of either of the two matrices we started with, but it can “drop”.
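We can see the rank "drop" numerically. This is a sketch with hypothetical matrices of my own choosing: A is a generic random 4×3 matrix (rank 3, almost surely), and B is built to have rank 2:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # a generic 4x3 matrix: rank 3
B = np.array([[1, 2, 4, 8],
              [0, 0, 0, 0],
              [1, 1, 1, 1]])     # rank 2: only two independent rows

print(np.linalg.matrix_rank(A))      # 3
print(np.linalg.matrix_rank(B))      # 2
print(np.linalg.matrix_rank(A @ B))  # 2, which is min(3, 2)
```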
Suppose r∈colsp(AT) and n∈nullsp(A), meaning r is in the row space of A and n is in the null space of A.
Prove that r and n must be orthogonal.
Solution
The row space of A, colsp(AT), is the set of all vectors r where r=ATy for some y∈Rn. Note that if A is an n×d matrix, then AT is a d×n matrix, and r is in Rd.
The null space of A, nullsp(A), is the set of all vectors n where An=0. Note that if A is an n×d matrix, then n is in Rd.
So, r and n are both in Rd, which means they exist in the same universe (they have the same number of components), and so we can ask if they’re orthogonal. (If they had different numbers of components, this question would be a non-starter.)
In order to show that they’re orthogonal, we need to show that their dot product is 0.
r ⋅ n = (ATy) ⋅ n = (ATy)Tn = yTAn = yT0 = 0

Here, the second equality uses the fact that u ⋅ v = uTv, the third uses (ATy)T = yTA, and the fourth uses An = 0.
So, every vector in the row space of A is orthogonal to every vector in the null space of A!
Prove that rank(XTX)=rank(X) for any n×d matrix X.
The matrix XTX is hugely important for our regression problem, and you’ll also see in a homework that it helps define the covariance matrix of our data.
Solution
First, let’s think about the shape of XTX. If X is an n×d matrix, then XT is a d×n matrix, and XTX is a d×d matrix. So, X and XTX have the same number of columns (d), but XTX is also square, which X doesn’t have to be.
The way we’ll proceed is to show that both matrices have the same null space. If we can show they both have the same null space, then the dimensions of both null spaces must be the same. Since the rank-nullity theorem tells us that
rank(A)+dim(nullsp(A))=number of columns of A
if we can show that dim(nullsp(XTX))=dim(nullsp(X)), then we’ll have shown that rank(XTX)=rank(X), since both X and XTX have the same number of columns.
To show that both X and XTX have the same null space, we need to show that any vector x in the null space of X is also in the null space of XTX, and vice versa.
Part 1: Show that v∈nullsp(X)⟹v∈nullsp(XTX)
Suppose v∈nullsp(X). Then, Xv=0. Multiplying both sides on the left by XT, we get
XTXv=XT0=0
So, v is also in the null space of XTX.
(This was the easier part.)
Part 2: Show that v∈nullsp(XTX)⟹v∈nullsp(X)
Suppose v∈nullsp(XTX). Then, XTXv=0. It’s not immediately clear how to use this to show that v is in the null space of X, so let’s try something different.
What if we left-multiply both sides of the equation by vT? This is equivalent to taking the dot product of both sides with v.
XTXv = 0 ⟹ vTXTXv = vT0 = 0
Okay, so vTXTXv = 0. What does this tell us? If we look closely, the left-hand side is really just (Xv)T(Xv), which is also just (Xv) ⋅ (Xv), which we also know is equal to ∥Xv∥².
So, we have
∥Xv∥² = 0
But, the only vector with a norm of 0 is the zero vector. So, Xv = 0. We’ve now shown that if XTXv = 0, then it has to be the case that Xv = 0 as well.
Now that we’ve shown both directions, we’ve shown that nullsp(XTX)=nullsp(X), because any vector in one of these sets is also in the other set, and so the two sets must be equal.
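A numerical spot-check of this result (a sketch; X is a random 6×4 matrix of my choosing, with one column forced to be a combination of the others so that rank(X) = 3):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((6, 4))
X[:, 3] = X[:, 0] + X[:, 1]  # force a dependent column, so rank(X) = 3

print(np.linalg.matrix_rank(X))        # 3
print(np.linalg.matrix_rank(X.T @ X))  # also 3, as the proof predicts
```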
In the coming chapters, we’ll spend a lot of our time decomposing matrices into smaller pieces, each of which gives us some insight into the data stored in the matrix. This is not a new concept: in earlier math courses, you’ve learned to write a number as a product of prime factors, and to factor quadratics like x2−5x+6 into (x−2)(x−3).
The “ultimate” decomposition for the purposes of machine learning is the singular value decomposition (SVD), which decomposes a matrix into the product of three other matrices.
X=UΣVT
where U and V are orthogonal matrices and Σ is a diagonal matrix. This decomposition will allow us to solve the dimensionality reduction problem we first alluded to in Chapter 1.1.
We’re not yet ready for that; the SVD will be the final topic we study in this course. For now, we’ll introduce a decomposition that ties together ideas from this section, and allows us to understand why the number of linearly independent columns of A is equal to the number of linearly independent rows of A, i.e. that rank(A)=rank(AT).
Suppose A is an n×d matrix with rank r. This tells us that A has r linearly independent columns, and the remaining d−r columns can be written as linear combinations of the r “good” columns.
Let’s continue thinking about
A = \begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix}
Recall, rank(A)=2, and
column 3=column 1−column 2
Define the matrix C as containing just the linearly independent columns of A when taken from left to right, i.e. the columns that are obtained by running the algorithm in Chapter 4.2.
given v_1, v_2, ..., v_d  # columns of A
initialize linearly independent set S = {v_1}
for i = 2 to d:
    if v_i is not a linear combination of S:
        add v_i to S
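Here's one way the pseudocode above could be sketched in numpy. The helper name and the rank-based independence test are my own choices, not anything prescribed in Chapter 4.2:

```python
import numpy as np

def independent_columns(A, tol=1e-10):
    """Scan columns left to right, keeping each one that is not a
    linear combination of the columns kept so far."""
    kept = []
    for i in range(A.shape[1]):
        candidate = kept + [A[:, i]]
        # The new column is independent of S exactly when stacking it
        # next to S's columns keeps the stacked matrix at full rank.
        if np.linalg.matrix_rank(np.column_stack(candidate), tol=tol) == len(candidate):
            kept.append(A[:, i])
    return np.column_stack(kept)

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

print(independent_columns(A))  # the first two columns of A
```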
This means that C’s columns are a basis for colsp(A). Notice that C is a 5×2 matrix; its number of columns is equal to the rank of A.
C = \begin{bmatrix} 5 & 3 \\ 0 & -1 \\ 3 & 4 \\ 6 & 2 \\ 1 & 0 \end{bmatrix}
I’d like to produce a matrix R such that A=CR. R will tell us how to “mix” the columns of C (which are linearly independent) to produce the columns of A. Since A is 5×3 and C is 5×2, R must be 2×3 in order for the dimensions to work out.
Column 1 of A is just 1(column 1 of C)
Column 2 of A is just 1(column 2 of C)
Column 3 of A is 1(column 1 of C) − 1(column 2 of C)

Placing these coefficients in the columns of R gives

R = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & -1 \end{bmatrix}
We defined C such that its columns are linearly independent and a basis for colsp(A). But, observe: R’s rows are linearly independent too!
This is because the first two columns of R form \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = I_2, the 2×2 identity matrix. It’s impossible for any scalar multiple of the first row of R to produce the second row of R, because the second row has a non-zero value in a position where the first row has a zero value.
Not only are R’s rows linearly independent, they’re also a basis for the row space of A, colsp(AT). Every row of A can be written as a linear combination of R’s rows. The way to see why this is the case is to transpose A=CR to get AT=(CR)T=RTCT.
The rows of R are the columns of RT. The above decomposition of AT=RTCT tells us how to mix the rows of R to make the rows of A! For instance, it tells us that the first row of A is
5 \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} + 3 \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 5 \\ 3 \\ 2 \end{bmatrix} \quad (the first row of A, written as a column)
To me, this is a bit magical: we started by picking off linearly independent columns and placing them in C, but the R we multiplied C by to recreate A ended up having linearly independent rows, which span the row space.
Sitting in front of us is an argument that the number of linearly independent columns of A is equal to the number of linearly independent rows of A! In other words, we’ve just argued that
rank(A)=rank(AT)
The gist of it was that if A=CR is a CR decomposition of A, then AT=(CR)T=RTCT is a CR decomposition of AT, and both CR decompositions use the same rank, r.
I never formally defined the CR decomposition; I started with an example.
In the first example we saw above, we constructed C by taking the first r linearly independent columns of A, and placing the coefficients needed to write each column of A as a linear combination of the columns of C in R. When constructed this way, C’s columns are automatically linearly independent, and R’s rows are linearly independent too (by the earlier argument about the identity matrix and the positions of its 0s).
A matrix has infinitely many CR decompositions, because we can choose different bases for colsp(A). Both of the following are CR decompositions of our A.

import numpy as np

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

# Decomposition 1: C holds columns 1 and 2 of A (the left-to-right choice).
C1 = np.array([[5, 3], [0, -1], [3, 4], [6, 2], [1, 0]])
R1 = np.array([[1, 0, 1], [0, 1, -1]])

# Decomposition 2: C holds columns 1 and 3 of A.
C2 = np.array([[5, 2], [0, 1], [3, -1], [6, 4], [1, 1]])
R2 = np.array([[1, 1, 0], [0, -1, 1]])

C1 @ R1  # equals A
C2 @ R2  # also equals A
You can think of r=rank(A) as being the minimum possible number of columns in C to make A=CR possible, where C is n×r and R is r×d.
Here, since rank(A)=2, the minimum possible number of columns in C is 2. No C with just a single column would allow for A=CR to work, and while a C with 3 columns would work, 2 is the minimum number of possible columns (and if C had more than 2 columns, C’s columns would no longer be linearly independent).
Why did I introduce the CR decomposition? In truth, it’s not used to solve practical problems. Instead, I think it gives us a nice way to understand what the rank of a matrix really means. For one, it gave us one way to see why rank(A)=rank(AT).
Let me try and get you excited about the CR decomposition with a more practical example. Suppose we have a large 1000×50 matrix with rank r=8. Storing the entire matrix requires storing 1000×50=50000 numbers. But, since its rank is 8, there exists a CR decomposition of A into A=CR where C is 1000×8 and R is 8×50. Storing C and R requires storing
1000 × 8 (size of C) + 8 × 50 (size of R) = 8000 + 400 = 8400
numbers, far fewer than the 50,000 numbers required to store the entire matrix.
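To make this concrete, here's a sketch in numpy with a hypothetical rank-8 matrix, built by multiplying random factors (any 1000×50 matrix of rank 8 would do):

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((1000, 8))
R = rng.standard_normal((8, 50))
A = C @ R                        # a 1000x50 matrix with rank 8

print(A.size)                    # 50000 numbers to store A directly
print(C.size + R.size)           # 8400 numbers to store the factors
print(np.linalg.matrix_rank(A))  # 8
```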
So, the CR decomposition is one way to compress a matrix; here, it relied on the fact that the rank of the matrix was much smaller than its number of columns. Hopefully this gives you some appreciation for why the rank of a matrix is such a fundamental property. Compression – of images, say – will be a recurring theme in Chapter 10, once we eventually make our way to the singular value decomposition.