
SVD Theory


Caveat: this page assumes a familiarity with basic linear algebra like matrix operations, etc.

Singular Value Decomposition

The singular value decomposition (SVD) is the factorisation of a matrix A into the product of three matrices, A = U D V.transpose(). D is a diagonal matrix and the columns of U and V are orthonormal, which means they are all of unit length and mutually orthogonal. The diagonal entries of D are called the singular values, the columns of U are the left singular vectors, and the columns of V are the right singular vectors. This can be done for any matrix, and is a useful technique for finding the 'inverse' of a matrix that is non-invertible - called the pseudoinverse. But I digress. Let's focus on the singular values and vectors. What are these?
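
To make that concrete, here's a quick sketch using NumPy (an assumption purely for illustration - this isn't the code used in this repo):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])                  # a 3 x 2 matrix

# full_matrices=False gives the 'thin' SVD: U is 3x2, D holds 2 singular values, Vt is 2x2
U, D, Vt = np.linalg.svd(A, full_matrices=False)

# A = U D V^T
print(np.allclose(A, U @ np.diag(D) @ Vt))      # True

# The columns of U and V are orthonormal
print(np.allclose(U.T @ U, np.eye(2)))          # True
print(np.allclose(Vt @ Vt.T, np.eye(2)))        # True
```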

The best intuitive explanation I've heard, among many, is that the singular vectors are contributions to the matrix A, with some vectors contributing 'more' than others. The opening chapter of this book, and in particular section 1.1, gives a good explanation of this, and I'll try my best to present that in my own words.

Right Singular Vectors

Let's define the singular vectors of an n x d matrix A. Say we are computing Av for some vector v, and for simplicity say v is of unit length - that is, |v| = 1. If we want it to be longer, we multiply by a scalar, but that doesn't affect the following. If we consider A as a collection of row vectors a_i - which, for the purposes of this matrix multiplication, it kind of is - then Av is the vector whose i-th entry is a_i . v. We know that the length of the projection of a_i onto v is |a_i . v|. We want to find the vector v that maximises the combined projections of all the a_i onto v - in other words, the direction that captures the biggest contribution of A. So the first singular vector (the greatest, or biggest, or whatever) is v1 = argmax |Av| over all |v| = 1, and the first singular value of A is the scalar for this contribution, s1 = |Av1|.
This can also be described as the best-fit line to the row vectors of A: the line through v that minimises the sum of squared distances from the points a_i to that line. (By Pythagoras, |a_i|^2 = (a_i . v)^2 + dist(a_i, line)^2, so maximising the projections is the same as minimising the distances.) You can think of this as "given only one vector, how do we best fit to A?"
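
If you want to convince yourself of the argmax definition numerically, here's a rough check (again assuming NumPy, just for illustration): no random unit vector should beat the first right singular vector.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]                          # first right singular vector (unit length)
s1 = np.linalg.norm(A @ v1)
print(np.isclose(s1, s[0]))         # True: |A v1| is the first singular value

best = 0.0
for _ in range(10000):
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)          # restrict to |v| = 1
    best = max(best, np.linalg.norm(A @ v))
print(best <= s1 + 1e-9)            # True: no sampled unit vector beats v1
```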

The next singular vector and value carry the same definition with the extra constraint that v2 is orthogonal to v1, and so on - the next after that must be orthogonal to both, etc. You can see that as we add more vectors, we can fit A closer and closer. This has an upper bound - we never need more vectors than the rank of A, which is at most the smaller of its two dimensions. Indeed, Theorem 1.1 (found on page 5) essentially says that if A is n x d and has r singular vectors (where A is rank r - that is, it has r linearly independent column vectors and the dimension of its image is r), then for 1 <= k <= r, the k-dimensional subspace spanned by the top k singular vectors is the best-fit k-dimensional subspace for A.
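
Here's the same sanity check for the k-dimensional version (NumPy again, as an illustrative assumption): projecting the rows of A onto the span of the top k right singular vectors leaves a smaller sum of squared distances than projecting onto a random k-dimensional subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5))
k = 2

_, _, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k].T                       # 5 x k, columns are the top k right singular vectors

def residual(A, basis):
    """Sum of squared distances from the rows of A to the subspace
    spanned by the orthonormal columns of `basis`."""
    projection = A @ basis @ basis.T
    return np.sum((A - projection) ** 2)

# A random orthonormal k-dimensional basis, for comparison
Q, _ = np.linalg.qr(rng.normal(size=(5, k)))

print(residual(A, Vk) <= residual(A, Q))    # True: the top k singular vectors fit best
```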

Left Singular Vectors

So we've defined the right singular vectors, and the singular values (which have no left or right), above. What about left singular vectors? Well, page six defines the left singular vectors for us. Let's first note that every vector v can be written as a linear combination of v1, v2, ..., v_r and a vector perpendicular to all the v_i (why only this many? Because A is rank r). We can create a similar set by multiplying by A: Av1, Av2, ..., Av_r. We normalise them to unit length and get u_i = (1/s_i) * Av_i.
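
As a quick check (NumPy assumed, for illustration only), normalising the Av_i really does recover the u_i that a library SVD reports:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

for i in range(len(s)):
    u_i = (A @ Vt[i]) / s[i]            # u_i = (1/s_i) * A v_i
    print(np.allclose(u_i, U[:, i]))    # True for every i
```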

These vectors u_i are the left singular vectors of A. It's worth calling out that all the left singular vectors are orthogonal. Why are they important? Why are we defining them this way? Let's now get to the crux of this all, the Singular Value Decomposition Theorem.

SVD Theorem

The Singular Value Decomposition Theorem, theorem 1.5 on page 8 plus accompanying proof, simply states that if A is an n x d matrix with right singular vectors v1, v2, ..., v_r, left singular vectors u_1, u_2, ..., u_r as defined above, and corresponding singular values s1, s2, ..., s_r, then A = the sum from i = 1 to r of s_i u_i v_i.transpose().

That's not elegantly written there, but note that summing r outer products x_i y_i.transpose() amounts to the matrix multiplication X Y.transpose(), where the columns of X are the x_i and the columns of Y are the y_i. Therefore, we have that A = U S V.transpose(), where the columns of U are the u_i, the columns of V are the v_i, and S is a diagonal matrix with the s_i on its diagonal.
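
Here's that theorem spelled out as code (NumPy assumed, just as a sketch): building A back up one rank-one term at a time gives the same thing as the matrix product U S V.transpose().

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A as a sum of rank-one pieces: sum_i s_i * u_i * v_i^T
A_sum = np.zeros_like(A)
for i in range(len(s)):
    A_sum += s[i] * np.outer(U[:, i], Vt[i])

print(np.allclose(A_sum, A))                    # True
print(np.allclose(U @ np.diag(s) @ Vt, A))      # True: same thing in matrix form
```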

This is what the singular value decomposition of a matrix is. When spelled out like this, it may just seem like a weird mathematical trick, or bit of notation. So what? We got some vectors out. But knowing that these are "the most contributing vectors", or that a matrix can be split into two orthonormal matrices and a diagonal matrix, is immensely helpful.

Uses

One notable use that I called out above is computing the 'inverse', or pseudoinverse, of a matrix. Not all matrices are invertible. But you might need a matrix A' that behaves as much like an inverse of A as possible - one satisfying A A' A = A - when A can't be traditionally inverted. Bring on SVD! Since U and V are orthonormal, they are trivially inverted: their inverses are just their transposes. Easy. And as for S, we invert every nonzero element on the diagonal and leave the zero elements as zero. Put the pieces back together in reverse order, A' = V S' U.transpose(), and you have the pseudoinverse. This may seem like cheating to get around regular inversion, and it sort of is, but it works.
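
A sketch of that recipe (NumPy assumed, for illustration - the 1e-10 cutoff below is an arbitrary choice of mine for deciding what counts as 'zero'):

```python
import numpy as np

# A rank-deficient, non-invertible 3 x 3 matrix: the third row is the sum of the first two
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [5.0, 7.0, 9.0]])

U, s, Vt = np.linalg.svd(A)

# Invert only the (numerically) nonzero singular values, leave the rest at zero
s_inv = np.array([1.0 / x if x > 1e-10 else 0.0 for x in s])
A_pinv = Vt.T @ np.diag(s_inv) @ U.T    # A' = V S' U^T

print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True: matches NumPy's pseudoinverse
print(np.allclose(A @ A_pinv @ A, A))           # True: the defining property A A' A = A
```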

Another example I called out on the home page of the wiki: if you have a large matrix of values or correlations, like customers to purchases, and you only want the 50 most meaningful components, keep the 50 biggest singular values and vectors. That drastically reduces the size of your matrix. Then there's what I implemented here: image compression.
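
That truncation looks something like this (NumPy assumed, sizes made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(1000, 400))    # stand-in for a big matrix of values
k = 50

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation built from only the top k singular values and vectors
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

stored = U[:, :k].size + k + Vt[:k].size    # 1000*50 + 50 + 50*400 values
print(stored, "values stored instead of", A.size)
print(np.linalg.norm(A - A_k))              # the price you pay: approximation error
```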
