The Singular Value Decomposition: Structure of Arbitrary Matrices
Every real m x n matrix factors as A = U Sigma V^T, where U and V are orthogonal and Sigma is diagonal — this is the singular value decomposition. We prove existence, show how the SVD yields optimal low-rank approximations (Eckart–Young theorem), and construct the Moore–Penrose pseudoinverse for least-squares solutions.
The spectral theorem gave us a powerful decomposition for symmetric matrices: every real symmetric matrix A can be written as A=QΛQT, where Q is orthogonal and Λ is diagonal. But this result applies only to square matrices, and indeed only to the symmetric ones among those. In applications, we routinely encounter matrices that are not square — a data matrix with more observations than variables, or a linear map between spaces of different dimensions. Eigenvalues are not even defined for such matrices. We need a more general tool.
The key observation is this: even when A itself is rectangular, the products ATA and AAT are always symmetric and square. If A∈Mm×n(R), then ATA is n×n and AAT is m×m, and both are symmetric. The spectral theorem applies to each of them. The singular value decomposition (SVD) exploits this fact to decompose A itself, revealing the fundamental geometric action of any linear map: a rotation, followed by a coordinate-wise scaling, followed by another rotation.
2. Singular Values
We begin by examining the matrix ATA more carefully.
Lemma 1 (Positive semidefiniteness of ATA).
For any A∈Mm×n(R), the matrix ATA is positive semidefinite. In particular, all eigenvalues of ATA are nonnegative.
Proof.
For any x∈Rn, we have
xT(ATA)x=(Ax)T(Ax)=∥Ax∥2≥0.
Since ATA is symmetric, the spectral theorem guarantees that all its eigenvalues are real. If λ is an eigenvalue with eigenvector v≠0, then λ∥v∥2=vT(ATA)v≥0, so λ≥0.□
Definition 1 (Singular values).
Let A∈Mm×n(R). Let λ1≥λ2≥⋯≥λn≥0 be the eigenvalues of ATA, listed in decreasing order and counted with multiplicity. The singular values of A are the numbers
σi=√λi, i=1,…,n.
We write σ1≥σ2≥⋯≥σn≥0. The number of nonzero singular values equals rank(A).
Lemma 3.
Let A∈Mm×n(R) with singular values σ1≥⋯≥σn≥0. Then rank(A)=rank(ATA), which equals the number of nonzero singular values.
Proof.
It suffices to show ker(A)=ker(ATA). If Ax=0, then ATAx=0. Conversely, if ATAx=0, then ∥Ax∥2=xTATAx=0, so Ax=0. Since the two null spaces coincide, the rank-nullity theorem gives rank(A)=rank(ATA). The rank of ATA is the number of its nonzero eigenvalues, which is the number of nonzero singular values.□
Example 2.
Let A = [1 1; 0 1; 1 0] (rows separated by semicolons). Then
ATA = [1 0 1; 1 1 0][1 1; 0 1; 1 0] = [2 1; 1 2].
The eigenvalues are λ=3 and λ=1, so the singular values are σ1=√3 and σ2=1.
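As a numerical sanity check, one can confirm that the square roots of the eigenvalues of ATA match the singular values returned by a library SVD. This is a minimal sketch using NumPy, which the text does not otherwise assume:

```python
import numpy as np

# The 3x2 matrix of Example 2: rows (1,1), (0,1), (1,0).
A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])

# Eigenvalues of A^T A in decreasing order, then their square roots.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
sigma_from_eigs = np.sqrt(eigvals)

# Singular values computed directly (already sorted in decreasing order).
sigma_direct = np.linalg.svd(A, compute_uv=False)

print(sigma_from_eigs)                              # ≈ [1.732, 1.0], i.e. [√3, 1]
print(np.allclose(sigma_from_eigs, sigma_direct))   # True
```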
Remark 3.
One can equally well define singular values via the eigenvalues of AAT. The nonzero eigenvalues of ATA and AAT are the same: if ATAv=λv with λ≠0, then AAT(Av)=A(ATAv)=λ(Av), and Av≠0 (since ATAv=λv≠0), so λ is also an eigenvalue of AAT.
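Remark 3 is easy to test numerically. The sketch below (NumPy, with a random test matrix as an assumption) checks that the nonzero eigenvalues of ATA and AAT coincide:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 3))   # a generic 5x3 matrix (rank 3 almost surely)

# Eigenvalues of the 3x3 and 5x5 Gram matrices, in decreasing order.
e_small = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
e_big = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]

print(np.allclose(e_small, e_big[:3]))           # True: nonzero eigenvalues agree
print(np.allclose(e_big[3:], 0.0, atol=1e-10))   # True: the rest are numerically zero
```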
3. The SVD Theorem
We now state and prove the central result of this chapter.
Theorem 6 (Singular Value Decomposition).
Let A∈Mm×n(R) with rank(A)=r. Then there exist an orthogonal matrix U∈Mm×m(R), an orthogonal matrix V∈Mn×n(R), and a matrix Σ∈Mm×n(R) such that
A=UΣVT,
where Σ has the form
Σ = [D 0; 0 0] in block form, with D = diag(σ1,…,σr),
with σ1≥σ2≥⋯≥σr>0 on the diagonal and all other entries zero.
Proof.
Since ATA is a real symmetric n×n matrix, the spectral theorem provides an orthonormal basis {v1,…,vn} of Rn consisting of eigenvectors:
ATAvi=λivi, where λ1≥⋯≥λr>0 and λr+1=⋯=λn=0.
Set σi=√λi for i=1,…,r, and let V=(v1 ⋯ vn).
Step 1: Constructing the left singular vectors. For i=1,…,r, define
ui = (1/σi)Avi.
We claim that {u1,…,ur} is an orthonormal set in Rm. Indeed, for 1≤i,j≤r,
uiTuj = (1/(σiσj))(Avi)T(Avj) = (1/(σiσj)) viT(ATAvj) = (λj/(σiσj)) viTvj.
Since the vi are orthonormal, viTvj=δij; the expression vanishes when i≠j, and when i=j it equals λi/σi2=1. So uiTuj=δij.
Step 2: Extending to an orthonormal basis. The vectors u1,…,ur form an orthonormal set of r vectors in Rm. We extend this to an orthonormal basis {u1,…,um} of Rm (by Gram–Schmidt on any completion, for instance). Let U=(u1 ⋯ um).
Step 3: Verifying the decomposition. We must show A=UΣVT, or equivalently AV=UΣ. The i-th column of AV is Avi, and the i-th column of UΣ is σiui when i≤r and 0 when i>r. For i≤r:
Avi=σiui,
which holds by the definition of ui. For i>r: since vi∈ker(ATA)=ker(A), we have Avi=0, and the i-th column of UΣ is also 0. Therefore AV=UΣ, and since V is orthogonal, A=UΣVT.□
Remark 4.
The columns v1,…,vn of V are called the right singular vectors of A, and the columns u1,…,um of U are the left singular vectors. The right singular vectors are eigenvectors of ATA, and the left singular vectors are eigenvectors of AAT. One can verify the latter directly: for i≤r, AATui = (1/σi)AATAvi = (1/σi)A(λivi) = σiAvi = σi2ui.
The SVD can also be written in outer product form. Since A=UΣVT, expanding columnwise gives
A = ∑_{i=1}^{r} σiuiviT.
Each term σiuiviT is a rank-one matrix, and A is expressed as a sum of r rank-one matrices weighted by the singular values. This form is the starting point for low-rank approximation.
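The outer product form is easy to verify in code. A minimal sketch (NumPy, with a random test matrix as an assumption) rebuilds A term by term:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Thin SVD: U is 5x3, s holds the singular values, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-12))   # numerical rank

# Rebuild A as a sum of r rank-one matrices sigma_i * u_i v_i^T.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
print(np.allclose(A, A_rebuilt))   # True
```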
4. Geometric Interpretation
The SVD reveals the geometry of any linear map x↦Ax with crystalline clarity. Writing A=UΣVT, the map decomposes into three stages:
VT: rotate (and possibly reflect) the input — this is an orthogonal transformation of Rn.
Σ: scale each coordinate axis independently by the corresponding singular value, mapping Rn into Rm.
U: rotate (and possibly reflect) the output — this is an orthogonal transformation of Rm.
The geometric meaning becomes vivid when we consider the image of the unit sphere Sn−1={x∈Rn:∥x∥=1} under A.
Theorem 8 (Image of the unit sphere).
Let A∈Mm×n(R) have singular values σ1≥⋯≥σr>0 and SVD A=UΣVT. The image A(Sn−1) is an ellipsoid in Rm (possibly degenerate if r<n) whose semi-axes have lengths σ1,…,σr and point in the directions u1,…,ur.
Proof.
Let x∈Sn−1 and write y=VTx. Since VT is orthogonal, ∥y∥=∥x∥=1, so y ranges over all of Sn−1 as x does. The point Σy has coordinates (σ1y1,…,σryr,0,…,0)T in Rm. As y ranges over Sn−1, the constraint y12+⋯+yn2=1 ensures that the image under Σ satisfies
(z1/σ1)2+⋯+(zr/σr)2≤1, where zi=σiyi,
with equality on the surface. This is an ellipsoid in the subspace spanned by the first r coordinate vectors, with semi-axes σ1,…,σr. The final rotation U maps the i-th standard basis vector to ui, so the semi-axes of the image ellipsoid point in the directions u1,…,ur.□
Example 5.
A 2×2 matrix with singular values 3 and 1 maps the unit circle to an ellipse with semi-major axis 3 and semi-minor axis 1. The directions of these axes are determined by the left singular vectors.
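Theorem 8 can be checked numerically by sampling the unit circle. The sketch below (NumPy, with a random 2×2 matrix as an assumption) confirms that the longest and shortest vectors in the image have lengths σ1 and σ2:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 2))
s = np.linalg.svd(A, compute_uv=False)   # s[0] >= s[1]

# Sample points on the unit circle and push them through A.
theta = np.linspace(0.0, 2.0 * np.pi, 4001)
circle = np.vstack([np.cos(theta), np.sin(theta)])
radii = np.linalg.norm(A @ circle, axis=0)

# Semi-axes of the image ellipse match the singular values.
print(np.isclose(radii.max(), s[0], atol=1e-4))   # True
print(np.isclose(radii.min(), s[1], atol=1e-4))   # True
```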
5. Full SVD Computation for a 3×2 Matrix
Let us compute the full SVD of the matrix
A = [1 1; 0 1; 1 0].
Step 1: Compute ATA and its eigenvalues.
ATA = [1 0 1; 1 1 0][1 1; 0 1; 1 0] = [2 1; 1 2].
The characteristic polynomial is det(ATA−λI)=(2−λ)2−1=λ2−4λ+3=(λ−3)(λ−1). The eigenvalues are λ1=3 and λ2=1, giving singular values σ1=√3 and σ2=1.
Step 2: Find the right singular vectors. For λ1=3: solving (ATA−3I)v=0 gives v1=(1/√2)(1,1)T. For λ2=1: solving (ATA−I)v=0 gives v2=(1/√2)(1,−1)T. Thus
V = (1/√2)[1 1; 1 −1].
Step 3: Find the left singular vectors. We compute ui=(1/σi)Avi:
u1 = (1/√3)Av1 = (1/√3)·(1/√2)(2,1,1)T = (1/√6)(2,1,1)T,
u2 = (1/1)Av2 = (1/√2)(0,−1,1)T.
We extend to an orthonormal basis of R3 by finding u3⊥u1,u2. One verifies that u3=(1/√3)(−1,1,1)T is orthogonal to both u1 and u2 and has unit norm. Thus
U = [2/√6 0 −1/√3; 1/√6 −1/√2 1/√3; 1/√6 1/√2 1/√3], Σ = [√3 0; 0 1; 0 0].
Step 4: Verify. One can check directly that UΣVT=A. For instance, the first column of UΣVT is (1/√2)(√3·u1+1·u2) = (1/√2)[(1/√2)(2,1,1)T+(1/√2)(0,−1,1)T] = (1/2)(2,0,2)T = (1,0,1)T, which is indeed the first column of A.
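The hand computation above can also be verified mechanically. A short NumPy sketch, re-entering the factors computed in this section:

```python
import numpy as np

s2, s3, s6 = np.sqrt(2.0), np.sqrt(3.0), np.sqrt(6.0)

# The factors computed by hand in Section 5.
U = np.array([[2/s6,  0.0,  -1/s3],
              [1/s6, -1/s2,  1/s3],
              [1/s6,  1/s2,  1/s3]])
Sigma = np.array([[s3, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])
V = (1/s2) * np.array([[1.0,  1.0],
                       [1.0, -1.0]])

A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])

print(np.allclose(U @ Sigma @ V.T, A))   # True: the factorization reproduces A
print(np.allclose(U.T @ U, np.eye(3)))   # True: U is orthogonal
print(np.allclose(V.T @ V, np.eye(2)))   # True: V is orthogonal
```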
6. Low-Rank Approximation
The outer product form A = ∑_{i=1}^{r} σiuiviT suggests a natural way to approximate A: keep only the terms with the largest singular values. For k≤r, define the truncated SVD
Ak = ∑_{i=1}^{k} σiuiviT.
This is a rank-k matrix (since it is a sum of k rank-one matrices with orthogonal column and row spaces). The question is: how good is this approximation?
Theorem 10 (Eckart–Young–Mirsky).
Let A∈Mm×n(R) have singular values σ1≥⋯≥σr>0, and let Ak be its rank-k truncated SVD. Then for any matrix B of rank at most k,
∥A−Ak∥2 ≤ ∥A−B∥2 and ∥A−Ak∥F ≤ ∥A−B∥F,
where ∥⋅∥2 is the operator norm and ∥⋅∥F is the Frobenius norm. In particular,
∥A−Ak∥2 = σk+1 and ∥A−Ak∥F = √(σk+12+⋯+σr2).
The Eckart–Young–Mirsky theorem tells us that the truncated SVD solves the best rank-k approximation problem in a very strong sense: it is simultaneously optimal in both the operator norm and the Frobenius norm. The geometric intuition is clear: Ak retains the k most important “stretching directions” of the linear map, discarding those along which the map acts most weakly. The approximation error is controlled by the discarded singular values.
Remark 6.
This theorem is the mathematical foundation of principal component analysis (PCA) in statistics and of many data compression algorithms. If the singular values decay rapidly — that is, if σk+1 is much smaller than σ1— then a low-rank approximation captures most of the structure of A.
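Both optimality claims are easy to observe numerically. A minimal sketch (NumPy, with a random matrix and k=2 as arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k truncated SVD

# Errors predicted by Eckart-Young-Mirsky.
print(np.isclose(np.linalg.norm(A - Ak, 2), s[k]))   # True: operator norm = sigma_{k+1}
print(np.isclose(np.linalg.norm(A - Ak, 'fro'),
                 np.sqrt(np.sum(s[k:] ** 2))))       # True: Frobenius norm of the tail
```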
Example 7.
Returning to our matrix A with σ1=√3 and σ2=1, the best rank-one approximation is
A1 = σ1u1v1T = √3 · (1/√6)(2,1,1)T · (1/√2)(1 1) = (1/2)[2 2; 1 1; 1 1].
By the theorem, the approximation error is ∥A−A1∥2 = σ2 = 1.
7. The Pseudoinverse
The SVD provides a clean way to define an inverse-like object for any matrix, even one that is rectangular or singular.
Definition 8 (Pseudoinverse).
Let A∈Mm×n(R) have SVD A=UΣVT with r nonzero singular values. Define Σ+∈Mn×m(R) by
(Σ+)ij = 1/σi if i=j≤r, and 0 otherwise.
The Moore–Penrose pseudoinverse of A is
A+=VΣ+UT.
When A is square and invertible, A+=A−1, since Σ+=Σ−1 and A+=VΣ−1UT=(UΣVT)−1. In general, A+ is the unique matrix satisfying the four Moore–Penrose conditions: (i) AA+A=A, (ii) A+AA+=A+, (iii) (AA+)T=AA+, (iv) (A+A)T=A+A.
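The definition and the four conditions can be exercised in code. The sketch below builds A+ from the SVD as in Definition 8 (NumPy, with a random full-column-rank matrix as an assumption) and checks it against the library pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))   # full column rank almost surely, so r = 3

U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma_plus = np.zeros((3, 5))
Sigma_plus[:3, :3] = np.diag(1.0 / s)   # 1/sigma_i on the diagonal
A_plus = Vt.T @ Sigma_plus @ U.T        # A+ = V Sigma+ U^T

# The four Moore-Penrose conditions.
print(np.allclose(A @ A_plus @ A, A))              # (i)
print(np.allclose(A_plus @ A @ A_plus, A_plus))    # (ii)
print(np.allclose((A @ A_plus).T, A @ A_plus))     # (iii)
print(np.allclose((A_plus @ A).T, A_plus @ A))     # (iv)
print(np.allclose(A_plus, np.linalg.pinv(A)))      # agrees with NumPy's pinv
```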
Theorem 14 (Pseudoinverse and least squares).
Let A∈Mm×n(R) and b∈Rm. The vector x+=A+b is the minimum-norm least-squares solution to Ax=b. That is, among all x minimizing ∥Ax−b∥, the vector x+ has the smallest norm ∥x∥.
Proof.
Write A=UΣVT and let c=UTb, y=VTx. Since U and V are orthogonal,
∥Ax−b∥2 = ∥ΣVTx−UTb∥2 = ∥Σy−c∥2 = ∑_{i=1}^{r}(σiyi−ci)2 + ∑_{i=r+1}^{m}ci2.
The second sum is independent of x, and the first sum is minimized by choosing yi=ci/σi for i=1,…,r. The components yr+1,…,yn do not appear in ∥Ax−b∥2, so any choice of these components gives a least-squares solution. To minimize ∥x∥2=∥y∥2=∑_{i=1}^{n}yi2, we set yi=0 for i>r. The resulting y is exactly Σ+c=Σ+UTb, so x+=Vy=VΣ+UTb=A+b.□
Example 9.
For our running example, A+=VΣ+UT where
Σ+ = [1/√3 0 0; 0 1 0].
If b=(1,1,1)T, then the minimum-norm least-squares solution is x+=A+b. One can verify that A+b=(1/3)(2,2)T, and indeed Ax+=(1/3)(4,2,2)T is the orthogonal projection of b onto the column space of A.
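Example 9 can be reproduced with a few library calls; a quick check (NumPy):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
b = np.array([1.0, 1.0, 1.0])

x_plus = np.linalg.pinv(A) @ b
print(x_plus)                            # ≈ [2/3, 2/3]

# Agrees with the standard least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_plus, x_lstsq))      # True

print(A @ x_plus)                        # ≈ [4/3, 2/3, 2/3], the projection of b
```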
8. Matrix Norms and the Condition Number
The singular values provide natural measures of the “size” of a matrix.
Definition 10 (Operator norm).
For A∈Mm×n(R), the operator norm (or spectral norm, or 2-norm) is
∥A∥2 = max_{x≠0} ∥Ax∥/∥x∥ = max_{∥x∥=1} ∥Ax∥.
Theorem 17.
∥A∥2=σ1, the largest singular value of A.
Proof.
By Theorem 8, the image of the unit sphere under A is an ellipsoid with semi-axes of lengths σ1≥⋯≥σr. The maximum distance from the origin to a point on this ellipsoid is σ1, attained in the direction u1. Therefore max_{∥x∥=1}∥Ax∥=σ1.
More explicitly, write x=∑_{i=1}^{n}civi with ∑ci2=1. Then Ax=∑_{i=1}^{r}ciσiui, so ∥Ax∥2 = ∑_{i=1}^{r}ci2σi2 ≤ σ12∑_{i=1}^{r}ci2 ≤ σ12. Equality holds when c1=1 and all other ci=0, i.e., x=v1.□
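Theorem 17 can be confirmed directly (NumPy, random matrix as an assumption; `np.linalg.norm(A, 2)` computes the operator norm):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(A)
v1 = Vt[0, :]   # top right singular vector

print(np.isclose(np.linalg.norm(A, 2), s[0]))     # True: operator norm = sigma_1
print(np.isclose(np.linalg.norm(A @ v1), s[0]))   # True: the maximum is attained at v1
```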
Definition 11 (Frobenius norm).
The Frobenius norm of A=(aij) is ∥A∥F = √(∑i,j aij2).
Theorem 19.
∥A∥F = √(σ12+⋯+σr2).
Proof.
Since U and V are orthogonal, ∥A∥F=∥UΣVT∥F=∥Σ∥F. The entries of Σ are zero except for σ1,…,σr on the diagonal, so ∥Σ∥F2=σ12+⋯+σr2. (The fact that orthogonal transformations preserve the Frobenius norm follows from ∥UAV∥F2=tr((UAV)TUAV)=tr(VTATUTUAV)=tr(VTATAV)=tr(ATA)=∥A∥F2, using the cyclic property of the trace in the last step.)□
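Theorem 19 likewise checks out numerically (NumPy, random matrix as an assumption):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 3))
s = np.linalg.svd(A, compute_uv=False)

# Frobenius norm equals the root-sum-of-squares of the singular values.
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s ** 2))))   # True
```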
Definition 12 (Condition number).
Let A∈Mm×n(R) with rank(A)=r. The condition number of A is
κ(A) = σ1/σr,
the ratio of the largest to the smallest nonzero singular value.
The condition number measures how close A is to being rank-deficient. If κ(A) is large, then A nearly collapses some direction — it stretches the input space very unevenly. In numerical computation, a large condition number means that solving Ax=b is sensitive to perturbations in b or in A itself. When A is square and invertible, we have the precise estimate
∥δx∥/∥x∥ ≤ κ(A) · ∥δb∥/∥b∥
for the relative error in the solution caused by a perturbation δb in the right-hand side.
Example 13.
For our matrix A with σ1=√3 and σ2=1, the condition number is κ(A)=√3≈1.73. This is a well-conditioned matrix. By contrast, the matrix B = [1 1; 1 1+10⁻¹⁰] has singular values approximately 2 and 5×10⁻¹¹, giving κ(B)≈4×10¹⁰. Solving Bx=b is an ill-conditioned problem.
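The ill-conditioning of B is easy to exhibit. The sketch below (NumPy) computes κ(B) and shows a tiny perturbation of b producing a disproportionately large change in the solution:

```python
import numpy as np

B = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-10]])
s = np.linalg.svd(B, compute_uv=False)
kappa = s[0] / s[-1]
print(kappa)   # ≈ 4e10

b = np.array([2.0, 2.0])
x = np.linalg.solve(B, b)
x_pert = np.linalg.solve(B, b + np.array([0.0, 1e-12]))

# A relative perturbation of b of roughly 3.5e-13 moves x by a relative ~7e-3,
# an amplification on the order of kappa(B), consistent with the bound above.
print(np.linalg.norm(x_pert - x) / np.linalg.norm(x))
```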
Remark 14.
The singular value decomposition unifies many threads from earlier in this text. Eigenvalues reveal the structure of a square matrix acting on its own space; singular values reveal the structure of any linear map between two spaces. The spectral theorem is the special case of the SVD when A is symmetric: the left and right singular vectors coincide with the eigenvectors, and the singular values are the absolute values of the eigenvalues. In this sense, the SVD is the natural culmination of the diagonalization program we have pursued throughout this book.