
5.4. Null Space and the Rank-Nullity Theorem

Overview

As we saw in Chapter 5.3, $\text{colsp}(A)$ is a subspace of $\mathbb{R}^n$ that contains all possible results of $A \vec x$ for $\vec x \in \mathbb{R}^d$. Think of $\text{colsp}(A)$ as the “reach” of the columns of $A$.

If the columns of $A$ are linearly independent, then the only way to create $\vec 0$ through a linear combination of the columns of $A$ is if all the coefficients in the linear combination are 0, i.e. if $\vec x = \vec 0$ in $A \vec x$.

But, if $A$’s columns aren’t all linearly independent, then there will be some non-zero vectors $\vec x$ where $A \vec x = \vec 0$. This follows from the definition of linear independence, which says that if there’s a non-zero linear combination of a collection of vectors that produces $\vec 0$, the vectors aren’t linearly independent.

It turns out that it’s worthwhile to study the set of vectors $\vec x$ that get sent to $\vec 0$ when multiplied by $A$.


Null Space

The **null space** of an $n \times d$ matrix $A$, denoted $\text{nullsp}(A)$, is the set of all vectors that $A$ sends to the zero vector:

$$\text{nullsp}(A) = \{ \vec x \in \mathbb{R}^d : A \vec x = \vec 0 \}$$

Sometimes, the null space is also called the **kernel** of $A$, though we will mostly avoid that term in our class, since “kernel” often means something different in the context of machine learning.

Important: $\text{nullsp}(A)$ is a subspace of $\mathbb{R}^d$, since it is made up of vectors that get multiplied by $A$, an $n \times d$ matrix. Vectors in $\text{nullsp}(A)$ are in $\mathbb{R}^d$, while vectors in $\text{colsp}(A)$ are in $\mathbb{R}^n$.

Let’s return to the example $A$ from earlier.

$$A = \begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix}$$

$\text{nullsp}(A)$ is the set of all vectors $\vec x \in \mathbb{R}^3$ where $A \vec x = \vec 0$. $\text{rank}(A) = 2$, which is less than 3 (the number of columns of $A$), so there will be some non-zero vectors $\vec x$ in the null space. But what are they?

The Systems of Equations Approach

To find them, we need to find the general solution to

$$\underbrace{\begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix}}_{A} \underbrace{\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}}_{\vec x} = \underbrace{\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}}_{\vec 0}$$

One way to proceed is to solve the system of equations

$$\begin{align} 5x_1 + 3x_2 + 2x_3 &= 0 \tag{1} \\ - x_2 + x_3 &= 0 \tag{2} \\ 3x_1 + 4x_2 - x_3 &= 0 \tag{3} \\ 6x_1 + 2x_2 + 4x_3 &= 0 \tag{4} \\ x_1 + x_3 &= 0 \tag{5} \end{align}$$

Equation (2) tells us $x_2 = x_3$, and equation (5) tells us $x_1 = -x_3$. So, any vector in $\text{nullsp}(A)$ is of the form

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} -x_3 \\ x_3 \\ x_3 \end{bmatrix} = x_3 \begin{bmatrix} -1 \\ 1 \\ 1 \end{bmatrix}, \quad x_3 \in \mathbb{R}$$

So,

$$\text{nullsp}(A) = \text{span}\left( \left\{ \begin{bmatrix} -1 \\ 1 \\ 1 \end{bmatrix} \right\} \right)$$

For example, $\begin{bmatrix} -5 \\ 5 \\ 5 \end{bmatrix} \in \text{nullsp}(A)$. Remember that vectors in $\text{nullsp}(A)$ are in $\mathbb{R}^3$, since $A$ is a $5 \times 3$ matrix, while vectors in $\text{colsp}(A)$ are in $\mathbb{R}^5$.
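We can sanity-check this basis numerically with NumPy (a quick verification, not part of the derivation itself):

```python
import numpy as np

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

# Any multiple of [-1, 1, 1] should be sent to the zero vector
v = np.array([-1, 1, 1])
print(A @ v)        # [0 0 0 0 0]
print(A @ (5 * v))  # [0 0 0 0 0]
```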

The Shortcut

There’s a shortcut to finding the null space of a matrix, which works if you happen to already know the relationship between the columns of $A$, or if the relationship is simple enough to eyeball. In Chapter 5.3, we saw that

$$\text{column 3} = \text{column 1} - \text{column 2}$$

or, in other words,

$$\text{column 1} - \text{column 2} - \text{column 3} = \vec 0$$

This, right away, tells us that taking 1 of column 1, subtracting 1 of column 2, and subtracting 1 of column 3 will give us the zero vector. But $A \vec x$ is just a linear combination of the columns in $A$ using the coefficients in $\vec x$, so this means that

$$A \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix} = \vec 0$$

So, we’ve found a vector in $\text{nullsp}(A)$. Any scalar multiple of this vector will also be in $\text{nullsp}(A)$.

In this example, $\text{nullsp}(A)$ is a 1-dimensional subspace of $\mathbb{R}^3$. We also know from earlier that $\text{rank}(A) = 2$. And curiously, $1 + 2 = 3$, the number of columns in $A$. This is not a coincidence, and sheds light on an important theorem.


Rank-Nullity Theorem

The **Rank-Nullity Theorem** states that for any $n \times d$ matrix $A$,

$$\text{rank}(A) + \dim(\text{nullsp}(A)) = d$$

The proof of this theorem is beyond the scope of our course. But, this is such an important theorem that it’s sometimes called the fundamental theorem of linear algebra. It tells us, for one, that the dimension of the null space is equal to the number of columns minus the rank. “Nullity” is just another word for the dimension of the null space.
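We can check the theorem on our example $A$ with a short computation. Here `scipy.linalg.null_space`, which returns an orthonormal basis for the null space, measures the nullity independently of the rank:

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

rank = np.linalg.matrix_rank(A)   # 2
nullity = null_space(A).shape[1]  # dimension of nullsp(A): 1
print(rank + nullity)             # 3, the number of columns of A
```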

Let’s see how it can be used in practice. Some of these examples are taken from Gilbert Strang’s book.


Examples

Example: Linearly Independent Columns

Let $A = \begin{bmatrix} 3 & 1 \\ 9 & -3 \\ 0 & 6 \\ 3 & 2 \\ 1 & 1 \end{bmatrix}$.

What is $\text{nullsp}(A)$?

Example: Describing Spaces

Describe the column space, row space, and null space of the matrix

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \end{bmatrix}$$

Additionally, find a basis for the null space.

Example: Thinking Abstractly

Suppose $A$ is a $3 \times 4$ matrix with rank 3. Describe $\text{colsp}(A)$ and $\text{nullsp}(A^T)$. The latter is the null space of $A^T$, and is sometimes called the **left null space** of $A$, as it’s a null space of $A$ when multiplied by vectors on the left, like in $\vec y^T A$ (which performs the same calculation as $A^T \vec y$; the results are just transposed).

Example: Thinking Even More Abstractly

If $A$ is a $7 \times 9$ matrix with rank 5, find the dimensions of each of the following.

1. $\text{colsp}(A)$

2. $\text{colsp}(A^T)$

3. $\text{nullsp}(A)$

4. $\text{nullsp}(A^T)$

Example: Constructing a Matrix

Find a matrix $A$ such that $\begin{bmatrix} 3 \\ 1 \end{bmatrix} \in \text{colsp}(A)$ and $\begin{bmatrix} 1 \\ 3 \end{bmatrix} \in \text{nullsp}(A)$.

Example: Rank of $AB$ vs. Rank of $A$ or $B$

Suppose $A$ is an $n \times d$ matrix, and $B$ is a $d \times p$ matrix. Explain why the column space of $AB$ is a subset of the column space of $A$, and the row space of $AB$ is a subset of the row space of $B$. What does this imply about the rank of $AB$?

Example: Orthogonal Complements 🎥

Suppose $\vec r \in \text{colsp}(A^T)$ and $\vec n \in \text{nullsp}(A)$, meaning $\vec r$ is in the row space of $A$ and $\vec n$ is in the null space of $A$.

Prove that $\vec r$ and $\vec n$ must be orthogonal.
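Before attempting the proof, here’s a quick numeric illustration (not a proof) using our example $A$: every row of $A$ is in the row space, and we found $\begin{bmatrix} -1 & 1 & 1 \end{bmatrix}^T$ in the null space earlier.

```python
import numpy as np

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])
n = np.array([-1, 1, 1])  # a null space vector found earlier

# Entry i of A @ n is the dot product of row i of A with n.
# Since n is in the null space, every row is orthogonal to n.
print(A @ n)  # [0 0 0 0 0]
```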

Example: Rank of $X^TX$ 🎥

Prove that $\text{rank}(X^T X) = \text{rank}(X)$ for any $n \times d$ matrix $X$.

The matrix $X^TX$ is hugely important for our regression problem, and you’ll also see in a homework that it helps define the covariance matrix of our data.
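A numeric spot check of this claim on a random matrix with a deliberately dependent column (an illustration only; the exercise asks for a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X[:, 4] = X[:, 0] + X[:, 1]  # force one dependent column, so rank(X) = 4

print(np.linalg.matrix_rank(X))
print(np.linalg.matrix_rank(X.T @ X))  # same as rank(X)
```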



Matrix Decompositions

In the coming chapters, we’ll spend a lot of our time decomposing matrices into smaller pieces, each of which gives us some insight into the data stored in the matrix. This is not a new concept: in earlier math courses, you’ve learned to write a number as a product of prime factors, and to factor quadratics like $x^2 - 5x + 6$ into $(x-2)(x-3)$.

The “ultimate” decomposition for the purposes of machine learning is the singular value decomposition (SVD), which decomposes a matrix into the product of three other matrices.

$$X = U \Sigma V^T$$

where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix. This decomposition will allow us to solve the dimensionality reduction problem we first alluded to in Chapter 1.1.

We’re not yet ready for that; the SVD will be the final topic we study in this course. For now, we’ll introduce a decomposition that ties together ideas from this section, and allows us to understand why the number of linearly independent columns of $A$ is equal to the number of linearly independent rows of $A$, i.e. that $\text{rank}(A) = \text{rank}(A^T)$.

CR Decomposition

Suppose $A$ is an $n \times d$ matrix with rank $r$. This tells us that $A$ has $r$ linearly independent columns, and the remaining $d - r$ columns can be written as linear combinations of the $r$ “good” columns.

Let’s continue thinking about

$$A = \begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix}$$

Recall, $\text{rank}(A) = 2$, and

$$\text{column 3} = \text{column 1} - \text{column 2}$$

Define the matrix $C$ as containing just the linearly independent columns of $A$ when taken from left to right, i.e. the columns that are obtained by running the algorithm in Chapter 4.2.

given v_1, v_2, ..., v_d # columns of A
initialize linearly independent set S = {v_1}
for i = 2 to d:
    if v_i is not a linear combination of S:
        add v_i to S
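Here’s one way this pseudocode might be translated into NumPy. The independence test here uses matrix rank (my choice, not specified by the original algorithm): a new column is independent of the kept set exactly when adding it increases the rank.

```python
import numpy as np

def independent_columns(A, tol=1e-10):
    """Greedily keep columns of A, left to right, that are not
    linear combinations of the columns kept so far."""
    kept = []
    for j in range(A.shape[1]):
        candidate = kept + [A[:, j]]
        # independent exactly when the stacked columns have full rank
        if np.linalg.matrix_rank(np.column_stack(candidate), tol=tol) == len(candidate):
            kept.append(A[:, j])
    return np.column_stack(kept)

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])
print(independent_columns(A))  # the first two columns of A
```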

This means that $C$’s columns are a basis for $\text{colsp}(A)$. Notice that $C$ is a $5 \times 2$ matrix; its number of columns is equal to the rank of $A$.

$$C = \begin{bmatrix} 5 & 3 \\ 0 & -1 \\ 3 & 4 \\ 6 & 2 \\ 1 & 0 \end{bmatrix}$$

I’d like to produce a matrix $R$ such that $A = CR$. $R$ will tell us how to “mix” the columns of $C$ (which are linearly independent) to produce the columns of $A$. Since $A$ is $5 \times 3$ and $C$ is $5 \times 2$, $R$ must be $2 \times 3$ in order for the dimensions to work out.

- Column 1 of $A$ is just $1 (\text{column 1 of } C)$

- Column 2 of $A$ is just $1 (\text{column 2 of } C)$

- Column 3 of $A$ is ${\color{orange}1} (\text{column 1 of } C) - {\color{orange}1} (\text{column 2 of } C)$

So, $R$ must be

$$R = \begin{bmatrix} 1 & 0 & {\color{orange}1} \\ 0 & 1 & {\color{orange}-1} \end{bmatrix}$$

So, a CR decomposition of $A$ is

$$\underbrace{\begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix}}_{A} = \underbrace{\begin{bmatrix} 5 & 3 \\ 0 & -1 \\ 3 & 4 \\ 6 & 2 \\ 1 & 0 \end{bmatrix}}_{C} \underbrace{\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & -1 \end{bmatrix}}_{R}$$

We defined $C$ such that its columns are linearly independent and a basis for $\text{colsp}(A)$. But, observe: $R$’s rows are linearly independent too!

This is because the first two columns of $R$ are $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, which is $I_2$, the $2 \times 2$ identity matrix. It’s impossible for any scalar multiple of the first row of $R$ to produce the second row of $R$, because the second row has a non-zero value in a position where the first row has a zero value.

Why is $\text{rank}(A) = \text{rank}(A^T)$?

Not only are $R$’s rows linearly independent, they’re also a basis for the row space of $A$, $\text{colsp}(A^T)$. Every row of $A$ can be written as a linear combination of $R$’s rows. The way to see why this is the case is to transpose $A = CR$ to get $A^T = (CR)^T = R^T C^T$.

$$\underbrace{\begin{bmatrix} 5 & 3 & 2 \\ 0 & -1 & 1 \\ 3 & 4 & -1 \\ 6 & 2 & 4 \\ 1 & 0 & 1 \end{bmatrix}}_{A} = \underbrace{\begin{bmatrix} 5 & 3 \\ 0 & -1 \\ 3 & 4 \\ 6 & 2 \\ 1 & 0 \end{bmatrix}}_{C} \underbrace{\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & -1 \end{bmatrix}}_{R}$$

$$\underbrace{\begin{bmatrix} 5 & 0 & 3 & 6 & 1 \\ 3 & -1 & 4 & 2 & 0 \\ 2 & 1 & -1 & 4 & 1 \end{bmatrix}}_{A^T} = \underbrace{\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & -1 \end{bmatrix}}_{R^T} \underbrace{\begin{bmatrix} 5 & 0 & 3 & 6 & 1 \\ 3 & -1 & 4 & 2 & 0 \end{bmatrix}}_{C^T}$$

The rows of $R$ are the columns of $R^T$. The above decomposition $A^T = R^T C^T$ tells us how to mix the rows of $R$ to make the rows of $A$! For instance, it tells us that the first row of $A$ is

$$\underbrace{5 \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} + 3 \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 5 \\ 3 \\ 2 \end{bmatrix}}_{\text{first row of } A}$$
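We can replay this mixing in NumPy (a quick check): the coefficients 5 and 3 are the first row of $C$, and they mix the rows of $R$.

```python
import numpy as np

C = np.array([[5, 3],
              [0, -1],
              [3, 4],
              [6, 2],
              [1, 0]])
R = np.array([[1, 0, 1],
              [0, 1, -1]])

# First row of A = 5 * (row 1 of R) + 3 * (row 2 of R)
print(5 * R[0] + 3 * R[1])  # [5 3 2]
print((C @ R)[0])           # [5 3 2], the same thing
```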

To me, this is a bit magical: we started by picking off linearly independent columns and placing them in $C$, but the $R$ we multiplied $C$ by to recreate $A$ ended up having linearly independent rows, which span the row space.

Sitting in front of us is an argument that the number of linearly independent columns of $A$ is equal to the number of linearly independent rows of $A$! In other words, we’ve just argued that

$$\text{rank}(A) = \text{rank}(A^T)$$

The gist of it was that if $A = CR$ is a CR decomposition of $A$, then $A^T = (CR)^T = R^T C^T$ is a CR decomposition of $A^T$, and both CR decompositions use the same rank, $r$.
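NumPy agrees, of course (a one-line check on our example):

```python
import numpy as np

A = np.array([[5, 3, 2],
              [0, -1, 1],
              [3, 4, -1],
              [6, 2, 4],
              [1, 0, 1]])

print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A.T))  # 2 2
```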

Non-Uniqueness of the CR Decomposition

I never formally defined the CR decomposition; I started with an example.

In the first example we saw above, we constructed $C$ by taking the first $r$ linearly independent columns of $A$, and placing in $R$ the coefficients needed to write each column of $A$ as a linear combination of the columns of $C$. When constructed this way, $C$’s columns are automatically linearly independent, and $R$’s rows are linearly independent too (using the identity matrix and positioning-of-0’s argument from earlier).

A matrix has infinitely many CR decompositions, because we can choose different bases for $\text{colsp}(A)$. Both of the following are CR decompositions of our $A$.

import numpy as np

# A different basis for colsp(A): columns 1 and 3 of A
C = np.array([[5, 2],
              [0, 1],
              [3, -1],
              [6, 4],
              [1, 1]])

R = np.array([[1, 1, 0],
              [0, -1, 1]])

C @ R

array([[ 5,  3,  2],
       [ 0, -1,  1],
       [ 3,  4, -1],
       [ 6,  2,  4],
       [ 1,  0,  1]])

# Original decomposition
C = np.array([[5, 3],
              [0, -1],
              [3, 4],
              [6, 2],
              [1, 0]])

R = np.array([[1, 0, 1],
              [0, 1, -1]])

C @ R

array([[ 5,  3,  2],
       [ 0, -1,  1],
       [ 3,  4, -1],
       [ 6,  2,  4],
       [ 1,  0,  1]])

You can think of $r = \text{rank}(A)$ as being the minimum possible number of columns in $C$ that makes $A = CR$ possible, where $C$ is $n \times r$ and $R$ is $r \times d$.

Here, since $\text{rank}(A) = 2$, the minimum possible number of columns in $C$ is 2. No $C$ with just a single column would allow $A = CR$ to work, and while a $C$ with 3 columns would work, its columns would no longer be linearly independent.

What’s the Point?

Why did I introduce the CR decomposition? In truth, it’s not used to solve practical problems. Instead, I think it gives us a nice way to understand what the rank of a matrix really means. For one, it gave us one way to see why $\text{rank}(A) = \text{rank}(A^T)$.

Let me try and get you excited about the CR decomposition with a more practical example. Suppose we have a large $1000 \times 50$ matrix with rank $r = 8$. Storing the entire matrix requires storing $1000 \times 50 = 50{,}000$ numbers. But, since its rank is 8, there exists a CR decomposition $A = CR$ where $C$ is $1000 \times 8$ and $R$ is $8 \times 50$. Storing $C$ and $R$ requires storing

$$\underbrace{1000 \times 8}_{\text{size of } C} + \underbrace{8 \times 50}_{\text{size of } R} = 8400$$

numbers, far fewer than the 50,000 numbers required to store the entire matrix.
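The arithmetic above, as a tiny computation:

```python
# Storage comparison for a rank-8, 1000 x 50 matrix
n, d, r = 1000, 50, 8
full = n * d              # numbers needed to store A directly
factored = n * r + r * d  # numbers needed to store C and R
print(full, factored)     # 50000 8400
```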

So, the CR decomposition is one way to compress a matrix; here, it relied on the fact that the rank of the matrix was much smaller than its number of columns. Hopefully this gives you some appreciation for why the rank of a matrix is such a fundamental property. Compression – of images, say – will be a recurring theme in Chapter 10, once we eventually make our way to the singular value decomposition.