diff --git a/README.md b/README.md index 384ac69..e87b212 100644 --- a/README.md +++ b/README.md @@ -51,7 +51,7 @@ The chain rule and associativity: forward vs. reverse differentiation. The cha In part 2 (last few minutes), began setting up some example problems involving matrix functions of matrices, and "[vectorization](https://en.wikipedia.org/wiki/Vectorization_(mathematics))" of matrices to vectors and linear operators to [Kronecker products](https://en.wikipedia.org/wiki/Kronecker_product). To be continued. -**Further reading**: The terms "forward-mode" and "reverse-mode" differentiation are most prevalent in [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation), which will will cover later in this course. You can find many, many articles online about [backpropagation](https://en.wikipedia.org/wiki/Backpropagation) in neural networks. There are many other versions of this, e.g. in differential geometry the derivative linear operator (multiplying Jacobians and perturbations dx right-to-left) is called a [pushforward](https://en.wikipedia.org/wiki/Pushforward_(differential)), whereas multiplying a gradient row vector (covector) by a Jacobian left-to-right is called a [pullback](https://en.wikipedia.org/wiki/Pullback_(differential_geometry)). +**Further reading**: The terms "forward-mode" and "reverse-mode" differentiation are most prevalent in [automatic differentiation (AD)](https://en.wikipedia.org/wiki/Automatic_differentiation), which we will cover later in this course. You can find many, many articles online about [backpropagation](https://en.wikipedia.org/wiki/Backpropagation) in neural networks. There are many other versions of this, e.g. in differential geometry the derivative linear operator (multiplying Jacobians and perturbations dx right-to-left) is called a [pushforward](https://en.wikipedia.org/wiki/Pushforward_(differential)), whereas multiplying a gradient row vector (covector) by a Jacobian left-to-right is called a [pullback](https://en.wikipedia.org/wiki/Pullback_(differential_geometry)). This [video on the principles of AD in Julia](https://www.youtube.com/watch?v=UqymrMG-Qi4) by [Mohamed Tarek](https://github.com/mohamed82008) also starts with a similar left-to-right (reverse) vs right-to-left (forward) viewpoint and goes into how it translates to Julia code, and how you define custom chain-rule steps for Julia AD. ## Lecture 3 (Jan 14) @@ -90,11 +90,33 @@ Automatic differentiation, guest lecture by [Dr. Chris Rackauckas](https://chris # Lecture 6 (Jan 24) -* part 1: derivatives of eigenproblems [(html)](https://rawcdn.githack.com/mitmath/matrixcalc/69cbb521f357a9633ea4e5f77079eb399986240a/symeig.jl.html) [(julia source)](symeig.jl) +* part 1: derivatives of eigenproblems [(html)](https://rawcdn.githack.com/mitmath/matrixcalc/61a7b3e0cbebd0ccdc126fbe831d1398154e272b/symeig.jl.html) [(julia source)](symeig.jl) * part 2: second derivatives, bilinear forms, and Hessian matrices [(notes)](https://www.dropbox.com/s/tde5cow6wuais8y/Hessians.pdf?dl=0) * [video](https://mit.zoom.us/rec/share/Av05JqxIoIO9woSRjxZVh5AA4cXGeq1mkTUFyqqM5ZU2YElVbsPupJ_zrWwwemUU.8Ten6yxTyONNy8ya?startTime=1643039961000) **Further reading (part 1)**: Computing derivatives on curved surfaces ("manifolds") is closely related to [tangent spaces](https://en.wikipedia.org/wiki/Tangent_space) in differential geometry.
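For concreteness, here is a tiny numerical sketch (not from the course notes; the matrix size and variable names are arbitrary) of the first-order perturbation rule for a simple eigenvalue of a real-symmetric matrix, dλ ≈ qᵀ(dA)q, which is the Hellmann–Feynman result mentioned in the further reading below:

```julia
using LinearAlgebra

# sketch: first-order perturbation of a simple eigenvalue of a real-symmetric matrix,
# dλ ≈ qᵀ (dA) q (the Hellmann–Feynman result), checked against a tiny finite perturbation
n = 5
A  = Symmetric(randn(n, n))          # random symmetric matrix (wrapper uses the upper triangle)
dA = Symmetric(randn(n, n) * 1e-8)   # small symmetric perturbation
λ, Q = eigen(A)                      # eigenvalues (sorted) and orthonormal eigenvectors
λ2, _ = eigen(A + dA)
q = Q[:, 1]                          # eigenvector of the smallest eigenvalue
(λ2[1] - λ[1]) / (q' * dA * q)       # ≈ 1, up to errors of second order in ‖dA‖
```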
The effect of constraints can also be expressed in terms of [Lagrange multipliers](https://en.wikipedia.org/wiki/Lagrange_multiplier), which are useful in expressing optimization problems with constraints (see also chapter 5 of [Convex Optimization](https://web.stanford.edu/~boyd/cvxbook/) by Boyd and Vandenberghe). In physics, first and second derivatives of eigenvalues and first derivatives of eigenvectors are often presented as part of ["time-independent" perturbation theory](https://en.wikipedia.org/wiki/Perturbation_theory_(quantum_mechanics)#Time-independent_perturbation_theory) in quantum mechanics, or as the [Hellmann–Feynman theorem](https://en.wikipedia.org/wiki/Hellmann%E2%80%93Feynman_theorem) for the case of dλ. The derivative of an eigenvector involves *all* of the other eigenvectors, but a much simpler "vector–Jacobian product" (involving only a single eigenvector and eigenvalue) can be obtained from left-to-right differentiation of a *scalar function* of an eigenvector, as reviewed in the [18.335 notes on adjoint methods](https://github.com/mitmath/18335/blob/spring21/notes/adjoint/adjoint.pdf). -**Further reading (part 2)**: [Bilinear forms](https://en.wikipedia.org/wiki/Bilinear_form) are an important generalization of quadratic operations to arbitrary vector spaces, and we saw that the second derivative can be viewed as a [symmetric bilinear forms](https://en.wikipedia.org/wiki/Symmetric_bilinear_form). This is closely related to a [quadratic form](https://en.wikipedia.org/wiki/Quadratic_form), which is just what we get by plugging in the same vector twice, e.g. the f''(x)[δx,δx]/2 that appears in quadratic approximations for f(x+δx) is a quadratic form. The most familar multivariate version of f''(x) is the [Hessian matrix](https://en.wikipedia.org/wiki/Hessian_matrix); Khan academy has an elementary [introduction to quadratic approximation](https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/quadratic-approximations/a/quadratic-approximation) using the Hessian. \ No newline at end of file +**Further reading (part 2)**: [Bilinear forms](https://en.wikipedia.org/wiki/Bilinear_form) are an important generalization of quadratic operations to arbitrary vector spaces, and we saw that the second derivative can be viewed as a [symmetric bilinear form](https://en.wikipedia.org/wiki/Symmetric_bilinear_form). This is closely related to a [quadratic form](https://en.wikipedia.org/wiki/Quadratic_form), which is just what we get by plugging in the same vector twice, e.g. the f''(x)[δx,δx]/2 that appears in quadratic approximations for f(x+δx) is a quadratic form.
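As a quick numerical illustration (a sketch, not taken from the notes; the example function is arbitrary), this quadratic approximation can be checked with ForwardDiff, representing the bilinear form f″(x) by the Hessian matrix discussed next:

```julia
using ForwardDiff

# sketch: check f(x+δx) ≈ f(x) + ∇f(x)⋅δx + f″(x)[δx,δx]/2, with the bilinear form
# f″(x) represented by the (symmetric) Hessian matrix from ForwardDiff
f(x) = sin(x[1]) * exp(x[2]) + x[1]^2 * x[2]    # an arbitrary scalar-valued example
x, δx = randn(2), randn(2) * 1e-4
g = ForwardDiff.gradient(f, x)
H = ForwardDiff.hessian(f, x)                   # H ≈ H' (symmetric)
quadratic = f(x) + g' * δx + (δx' * H * δx) / 2
abs(f(x + δx) - quadratic)                      # error is O(‖δx‖³), i.e. tiny here
```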
The most familiar multivariate version of f''(x) is the [Hessian matrix](https://en.wikipedia.org/wiki/Hessian_matrix); Khan academy has an elementary [introduction to quadratic approximation](https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/quadratic-approximations/a/quadratic-approximation) using the Hessian. + +# Lecture 7 (Jan 26) + +* part 1: continued [Hessian notes](https://www.dropbox.com/s/tde5cow6wuais8y/Hessians.pdf?dl=0) from previous lecture: minima/maxima and f″ definiteness, Hessian conditioning and steepest-descent convergence +* part 2: derivatives and backpropagation on graphs and linear operators (to be posted) +* [video](https://mit.zoom.us/rec/share/DblFFU72Nary_yKfaQis0WaDoFEznD-92EPr52LHE1QBKcVWPUlmBPgApjre2uf9.oqtYrgEg73glPWx-?startTime=1643212653000) +* [pset 2 solutions](hw2sol.pdf) and computational [notebook](https://nbviewer.org/github/mitmath/matrixcalc/blob/main/hw2sol.ipynb) + +**Further reading (part 1)**: [Positive-definite](https://en.wikipedia.org/wiki/Definite_matrix) Hessian matrices, or more generally [definite quadratic forms](https://en.wikipedia.org/wiki/Definite_quadratic_form) f″, appear at extrema (f′=0) of scalar-valued functions f(x) that are local minima; there are a lot [more formal treatments](http://www.columbia.edu/~md3405/Unconstrained_Optimization.pdf) of the same idea, and conversely Khan academy has the [simple 2-variable version](https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/optimizing-multivariable-functions/a/second-partial-derivative-test) where you can check the signs of the eigenvalues of the 2×2 Hessian just by looking at the determinant and a single entry (or the trace). There's a nice [stackexchange discussion](https://math.stackexchange.com/questions/2285282/relating-condition-number-of-hessian-to-the-rate-of-convergence) on why an [ill-conditioned](https://nhigham.com/2020/03/19/what-is-a-condition-number/) Hessian tends to make steepest descent converge slowly; some Toronto [course notes on the topic](https://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/slides/lec07.pdf) may also be helpful. + +**Further reading (part 2)**: See this blog post on [calculus on computational graphs](https://colah.github.io/posts/2015-08-Backprop/) for a gentle review, and these Columbia [course notes](http://www.cs.columbia.edu/~mcollins/ff2.pdf) for a more formal approach. + + +# Lecture 8 (Jan 28) + +* part 1: continued [Hessian notes](https://www.dropbox.com/s/tde5cow6wuais8y/Hessians.pdf?dl=0): using Hessians for optimization (Newton & quasi-Newton & Newton–Krylov methods), cost of Hessians +* part 2: adjoint differentiation of ODEs (guest lecture by Dr. Chris Rackauckas): [notes from 18.337](https://rawcdn.githack.com/mitmath/18337/7b0e890e1211bfa253782f7862389aeaa321e8d7/lecture11/adjoints.html) +* [video](https://mit.zoom.us/rec/share/4yJfjUXsAhykKGebU8lw1wqOb_TIwK-WISEHwEbG8WyB2bLlyYLjkZXQHonT2Tte.lLaSTRDjVxYocEmz?startTime=1643385485000) + +**Further reading (part 1)**: See e.g. these Stanford notes on [sequential quadratic optimization](https://web.stanford.edu/class/ee364b/lectures/seq_notes.pdf) using trust regions (sec. 2.2). See 18.335 [notes on BFGS quasi-Newton methods](https://github.com/mitmath/18335/blob/spring21/notes/BFGS.pdf) (also [video](https://mit.zoom.us/rec/share/naqcRgSkZ0VNeDp0ht8QmB566mPowuHJ8k0LcaAmZ7XxaCT1ch4j_O4Khzi-taXm.CXI8xFthag4RvvoC?startTime=1620241284000)).
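For concreteness, here is a bare-bones sketch (not the lecture's code; the `newton_min` name is made up, and a robust implementation would add a line search or trust region) of a pure Newton iteration that obtains the gradient and Hessian from ForwardDiff:

```julia
using ForwardDiff, LinearAlgebra

# minimal pure Newton iteration x ← x - f″(x)⁻¹ ∇f(x), no line search or trust region
function newton_min(f, x; iters=50, tol=1e-10)
    for _ = 1:iters
        g = ForwardDiff.gradient(f, x)
        norm(g) < tol && break             # stop at a (near-)stationary point
        H = ForwardDiff.hessian(f, x)      # forming the full n×n Hessian: expensive for large n
        x = x - H \ g                      # Newton step (assumes H is invertible)
    end
    return x
end

rosenbrock(x) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
newton_min(rosenbrock, [-1.2, 1.0])        # converges to the minimizer [1, 1] from this start
```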
The fact that a quadratic optimization problem in a sphere has [strong duality](https://en.wikipedia.org/wiki/Strong_duality) and hence is efficiently solvable is discussed in section 5.2.4 of the [*Convex Optimization* book](https://web.stanford.edu/~boyd/cvxbook/). There has been a lot of work on [automatic Hessian computation](https://en.wikipedia.org/wiki/Hessian_automatic_differentiation), but for large-scale problems you can ultimately only compute Hessian–vector products efficiently in general, which are equivalent to a directional derivative of the gradient, and can be used e.g. for [Newton–Krylov methods](https://en.wikipedia.org/wiki/Newton%E2%80%93Krylov_method). + +**Further reading (part 2)**: A very general reference on adjoint-method (reverse-mode/backpropagation) differentiation of ODEs (and generalizations thereof), using notation similar to that of Chris R. today, is [Cao et al (2003)](https://epubs.siam.org/doi/10.1137/S1064827501380630) ([pdf](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.455&rep=rep1&type=pdf)). See also the [adjoint sensitivity analysis](https://diffeq.sciml.ai/stable/extras/sensitivity_math/) and [automatic sensitivity analysis](https://diffeq.sciml.ai/stable/analysis/sensitivity/) sections of Chris's amazing [DifferentialEquations.jl software suite](https://diffeq.sciml.ai/stable/) for numerical solution of ODEs in Julia. There is a nice YouTube [lecture on adjoint sensitivity of ODEs](https://www.youtube.com/watch?v=k6s2G5MZv-I), again using a similar notation. diff --git a/hw2.pdf b/hw2.pdf index c72620a..c8fc977 100644 Binary files a/hw2.pdf and b/hw2.pdf differ diff --git a/hw2sol.ipynb b/hw2sol.ipynb new file mode 100644 index 0000000..878e88c --- /dev/null +++ b/hw2sol.ipynb @@ -0,0 +1,799 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "eb604661", + "metadata": {}, + "source": [ + "Similar to homework 1, let's define a relative-error function for quantifying the error from finite differences:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "48cd6341", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "relerr (generic function with 1 method)" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "using LinearAlgebra # for norm, kron, I(n) etc.\n", + "\n", + "relerr(approx, exact) = norm(approx - exact) / norm(exact)" + ] + }, + { + "cell_type": "markdown", + "id": "3ccad3f3", + "metadata": {}, + "source": [ + "Let's also define the $\\otimes$ operator (type it by `\\otimes` followed by TAB) to be the Kronecker product (the `kron` function in the `LinearAlgebra` library) so we can use math notation `A ⊗ B` instead of `kron(A, B)`:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "24b5fd38", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "kron (generic function with 21 methods)" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "const ⊗ = kron" + ] + }, + { + "cell_type": "markdown", + "id": "43df9092", + "metadata": {}, + "source": [ + "We'll also load the Zygote AD library for analytical comparisons:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "0a1fc52b", + "metadata": {}, + "outputs": [], + "source": [ + "using Zygote" + ] + }, + { + "cell_type": "markdown", + "id": "4c8e37b5", + "metadata": {}, + "source": [ + "# Problem 2\n", + "\n", + "Here, $f(A) = \\sqrt{A}$, and the homework says the Jacobian 
(acting on $\\operatorname{vec}(A)$) should be $(I\\otimes \\sqrt{A} + \\sqrt{A}^T \\otimes I)^{-1}$. Let's try it out for a random positive-definite $A$:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "63dc982c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1×1 Matrix{Float64}:\n", + " 1.3731197861011764" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "n = 1 # should be big enough to be interesting\n", + "B = randn(n,n)\n", + "A = B'B # random positive definite" + ] + }, + { + "cell_type": "markdown", + "id": "993f6868", + "metadata": {}, + "source": [ + "The matrix square root is just `sqrt(A)` in Julia:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "cb14ff97", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.0" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# check that sqrt(A)² ≈ A, up to roundoff errors:\n", + "relerr(sqrt(A)^2, A)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "0b3cb733", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1×1 Matrix{Float64}:\n", + " 0.42669326873498054" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Jacobian:\n", + "J = (I(n) ⊗ sqrt(A) + sqrt(A)' ⊗ I(n))^-1" + ] + }, + { + "cell_type": "markdown", + "id": "3b89a4f5", + "metadata": {}, + "source": [ + "Now, let's check this against finite difference for a small random $dA$:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "0d188b50", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3.726301788530775e-8" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dA = randn(n,n) * 1e-8\n", + "\n", + "relerr( vec(sqrt(A+dA) - sqrt(A)), # finite-difference directional derivative\n", + " J * vec(dA) ) # vs. exact expression" + ] + }, + { + "cell_type": "markdown", + "id": "b5574a08", + "metadata": {}, + "source": [ + "Hooray, it worked!" + ] + }, + { + "cell_type": "markdown", + "id": "4cd27f68", + "metadata": {}, + "source": [ + "# Problem 3" + ] + }, + { + "cell_type": "markdown", + "id": "e3d8f59a", + "metadata": {}, + "source": [ + "Let's define our function $f(p)$ as in the problem. 
We'll pass the extra parameters $A_0$ etcetera explicitly (or we could use global variables):" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "c2b586ab", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "f (generic function with 1 method)" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "function f(p, A₀, a, B₀, b, F)\n", + " A = A₀ + diagm(p)\n", + " x = A \\ a # A⁻¹a\n", + " B = B₀ + diagm(x.*x)\n", + " y = B \\ b # B⁻¹b\n", + " return y'*F*y\n", + "end" + ] + }, + { + "cell_type": "markdown", + "id": "1ba4a2c9", + "metadata": {}, + "source": [ + "Now, we'll pick some random parameters and try it out:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "2147c2b2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "49.975441720918774" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "n = 5\n", + "A₀ = randn(n,n)\n", + "a = randn(5)\n", + "B₀ = randn(n,n)\n", + "b = randn(5)\n", + "F = randn(n,n); F = F + F'\n", + "p = rand(5)\n", + "f(p, A₀, a, B₀, b, F)" + ] + }, + { + "cell_type": "markdown", + "id": "2484c042", + "metadata": {}, + "source": [ + "Now, let's implement the *manual* \"adjoint\" gradient from the solutions, using the same notation:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "002b28b3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "∇f (generic function with 1 method)" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "function ∇f(p, A₀, a, B₀, b, F)\n", + " # the forward solution, copied from above\n", + " # (note that in serious computation we would want to re-use this from f)\n", + " A = A₀ + diagm(p)\n", + " x = A \\ a # A⁻¹a\n", + " B = B₀ + diagm(x.*x)\n", + " y = B \\ b # B⁻¹b\n", + " \n", + " # the reverse-mode gradient\n", + " g′ᵀ = 2F*y # step (i)\n", + " u = B' \\ -g′ᵀ # step (ii): adjoint problem 1\n", + " w = 2u .* x .* y # step (iii)\n", + " z = A' \\ -w # step (iv): adjoint problem 2\n", + " return z .* x # step (v): ∇f\n", + "end" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "5fa14097", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "5-element Vector{Float64}:\n", + " 262.55155103380923\n", + " -273.8595112733488\n", + " 96.32881037878508\n", + " 217.43756038923235\n", + " -171.79342086396048" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "∇f(p, A₀, a, B₀, b, F)" + ] + }, + { + "cell_type": "markdown", + "id": "3c1b030e", + "metadata": {}, + "source": [ + "## problem 3b\n", + "\n", + "First, we'll compare it against finite differences for a random small `dp`:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "657849bd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1.428248665134681e-7" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dp = randn(n) * 1e-8\n", + "\n", + "relerr( f(p+dp, A₀, a, B₀, b, F) - f(p, A₀, a, B₀, b, F), # finite difference\n", + " ∇f(p, A₀, a, B₀, b, F)'dp) # exact directional derivative" + ] + }, + { + "cell_type": "markdown", + "id": "b5371853", + "metadata": {}, + "source": [ + "Hooray, it matches!" 
+ ] + }, + { + "cell_type": "markdown", + "id": "cae7ae8e", + "metadata": {}, + "source": [ + "## problem 3c" + ] + }, + { + "cell_type": "markdown", + "id": "f15be256", + "metadata": {}, + "source": [ + "To compute $\\nabla f$ with `Zygote`, we have to give Zygote a function of a *single* parameter vector `p` that we want to differentiate with respect to, along with the point `p` at which we want the derivative. To do that, we will define an [anonymous function](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions) with `p -> ...` that captures the other parameters (also called a [closure](https://en.wikipedia.org/wiki/Closure_(computer_programming)) in computer science):" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "2329ed32", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "([262.5515510338092, -273.8595112733488, 96.32881037878506, 217.43756038923235, -171.7934208639604],)" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "Zygote.gradient(p -> f(p, A₀, a, B₀, b, F), p)" + ] + }, + { + "cell_type": "markdown", + "id": "09a8a25a", + "metadata": {}, + "source": [ + "(Zygote returns a 1-component tuple of outputs, because it can potentially differentiate with respect to multiple arguments, though here we are just asking for 1.)\n", + "\n", + "The above looks pretty good if we \"eyeball\" it compared to `∇f` above. Let's compare it quantitatively:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "c0c3d50f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2.1572017312010847e-16" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "relerr(∇f(p, A₀, a, B₀, b, F), # manual ∇f\n", + " Zygote.gradient(p -> f(p, A₀, a, B₀, b, F), p)[1]) # vs AD" + ] + }, + { + "cell_type": "markdown", + "id": "33c8c7d5", + "metadata": {}, + "source": [ + "Hooray, it matches up to the limits of roundoff error (to essentially [machine precision](https://en.wikipedia.org/wiki/Machine_epsilon))!" + ] + }, + { + "cell_type": "markdown", + "id": "4cb40856", + "metadata": {}, + "source": [ + "# Problem 4c" + ] + }, + { + "cell_type": "markdown", + "id": "8834f86c", + "metadata": {}, + "source": [ + "Demonstrate numerically that $d(e^A) = \\sum_{k=0}^\\infty \\! \\! 
\\frac{1}{k!} (\\sum_{\\ell=0}^{k-1} (A^T)^{k-\\ell-1} \\otimes A^\\ell )dA$" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "36ca240e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3×3 Matrix{Float64}:\n", + " 0.989401 0.372355 0.475427\n", + " 0.94298 0.68495 0.646221\n", + " 0.794569 0.13317 0.353663" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "A = rand(3,3)" + ] + }, + { + "cell_type": "markdown", + "id": "df6620d8", + "metadata": {}, + "source": [ + "### Using ForwardDiff" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "a7dfb9f4", + "metadata": {}, + "outputs": [], + "source": [ + "using ForwardDiff" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "d3176e9d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "e (generic function with 1 method)" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "e(A) = sum(A^k/factorial(k) for k=0:20) # hmm exp doesn't work, i'll go to k=20" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "c1536121", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1.2971024122201743e-15" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "relerr(e(A), exp(A)) # check that our sum matches the built-in exp(A) function" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "8b7817cb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9×9 Matrix{Float64}:\n", + " 3.35707 0.5389 0.688955 1.50121 … 1.04456 0.123629 0.158014\n", + " 1.50121 2.82372 0.907286 0.466706 0.338304 0.92634 0.209389\n", + " 1.04456 0.265598 2.45771 0.338304 0.244335 0.0576795 0.83731\n", + " 0.5389 0.0624931 0.0798733 2.82372 0.265598 0.0290732 0.0371575\n", + " 0.170853 0.479248 0.105876 1.3384 0.079272 0.237991 0.0492994\n", + " 0.123629 0.0290732 0.434115 0.92634 … 0.0576795 0.0134119 0.216817\n", + " 0.688955 0.0798733 0.102087 0.907286 2.45771 0.434115 0.555028\n", + " 0.218367 0.612715 0.135322 0.289541 1.21467 2.02443 0.729772\n", + " 0.158014 0.0371575 0.555028 0.209389 0.83731 0.216817 1.73404" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "J_AD = ForwardDiff.jacobian(e,A)" + ] + }, + { + "cell_type": "markdown", + "id": "84969d50", + "metadata": {}, + "source": [ + "### .. 
or using Finite Differences" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "7b35a6b2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9×9 Matrix{Float64}:\n", + " 3.35707 0.538899 0.688955 1.50121 … 1.04456 0.123629 0.158014\n", + " 1.50121 2.82372 0.907286 0.466706 0.338304 0.92634 0.209389\n", + " 1.04456 0.265598 2.45771 0.338304 0.244335 0.0576795 0.83731\n", + " 0.5389 0.062493 0.0798732 2.82372 0.265598 0.0290732 0.0371576\n", + " 0.170853 0.479248 0.105875 1.3384 0.0792719 0.237991 0.0492994\n", + " 0.123629 0.0290732 0.434115 0.92634 … 0.0576795 0.0134118 0.216817\n", + " 0.688955 0.0798732 0.102087 0.907286 2.45771 0.434115 0.555028\n", + " 0.218367 0.612715 0.135321 0.289541 1.21467 2.02443 0.729772\n", + " 0.158014 0.0371575 0.555028 0.209389 0.83731 0.216817 1.73404" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ϵ = 1e-8\n", + "J = zeros(9,0) # initialize 9x9 Jacobian with 9 rows and no columns\n", + "for j=1:3, i=1:3\n", + " dA = zeros(3,3)\n", + " dA[i,j] = ϵ # perturb the (i,j) entry only\n", + " df = exp(A+dA)-exp(A) # see the perturbed exp\n", + " J = [J vec(df)] # append this to J\n", + "end\n", + "J_FD = J/ϵ" + ] + }, + { + "cell_type": "markdown", + "id": "d6973563", + "metadata": {}, + "source": [ + "### The theoretical answer summing to 20" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "aabb597f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9×9 Matrix{Float64}:\n", + " 3.35707 0.5389 0.688955 1.50121 … 1.04456 0.123629 0.158014\n", + " 1.50121 2.82372 0.907286 0.466706 0.338304 0.92634 0.209389\n", + " 1.04456 0.265598 2.45771 0.338304 0.244335 0.0576795 0.83731\n", + " 0.5389 0.0624931 0.0798733 2.82372 0.265598 0.0290732 0.0371575\n", + " 0.170853 0.479248 0.105876 1.3384 0.079272 0.237991 0.0492994\n", + " 0.123629 0.0290732 0.434115 0.92634 … 0.0576795 0.0134119 0.216817\n", + " 0.688955 0.0798733 0.102087 0.907286 2.45771 0.434115 0.555028\n", + " 0.218367 0.612715 0.135322 0.289541 1.21467 2.02443 0.729772\n", + " 0.158014 0.0371575 0.555028 0.209389 0.83731 0.216817 1.73404" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# written as nested sum calls:\n", + "sum(sum( (A')^(k-ℓ-1) ⊗ A^ℓ for ℓ=0:(k-1))/factorial(k) for k=1:20)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "75eb21e5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9×9 Matrix{Float64}:\n", + " 3.35707 0.5389 0.688955 1.50121 … 1.04456 0.123629 0.158014\n", + " 1.50121 2.82372 0.907286 0.466706 0.338304 0.92634 0.209389\n", + " 1.04456 0.265598 2.45771 0.338304 0.244335 0.0576795 0.83731\n", + " 0.5389 0.0624931 0.0798733 2.82372 0.265598 0.0290732 0.0371575\n", + " 0.170853 0.479248 0.105876 1.3384 0.079272 0.237991 0.0492994\n", + " 0.123629 0.0290732 0.434115 0.92634 … 0.0576795 0.0134119 0.216817\n", + " 0.688955 0.0798733 0.102087 0.907286 2.45771 0.434115 0.555028\n", + " 0.218367 0.612715 0.135322 0.289541 1.21467 2.02443 0.729772\n", + " 0.158014 0.0371575 0.555028 0.209389 0.83731 0.216817 1.73404" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# same thing written another way:\n", + "J_20 = sum( (A')^(k-ℓ-1) ⊗ A^ℓ / factorial(k) for ℓ=0:20, k=1:20 if k>ℓ)" + ] + }, + { + "cell_type": "markdown", + "id": "bd39711b", + "metadata": {}, + "source": [ + "Looks 
good; let's check the match quantitatively:" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "d3c6f007", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1.3738128345184809e-15" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "relerr(J_20, J_AD) # should match to nearly machine precision" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "6c1d30c8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1.293987776257493e-7" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "relerr(J_20, J_FD) # should match to ≈ 7 digits" + ] + }, + { + "cell_type": "markdown", + "id": "49ebb275", + "metadata": {}, + "source": [ + "Hooray, math works!" + ] + } + ], + "metadata": { + "@webio": { + "lastCommId": null, + "lastKernelId": null + }, + "kernelspec": { + "display_name": "Julia 1.7.1", + "language": "julia", + "name": "julia-1.7" + }, + "language_info": { + "file_extension": ".jl", + "mimetype": "application/julia", + "name": "julia", + "version": "1.7.1" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/hw2sol.pdf b/hw2sol.pdf new file mode 100644 index 0000000..1faf98a Binary files /dev/null and b/hw2sol.pdf differ diff --git a/hw2sol.tex b/hw2sol.tex new file mode 100644 index 0000000..cfad4b4 --- /dev/null +++ b/hw2sol.tex @@ -0,0 +1,157 @@ +\documentclass[10pt,oneside]{article} +\usepackage{amsmath} +\usepackage{graphicx} +\usepackage{subcaption} +\usepackage{amsfonts} +\usepackage{amssymb} + +\newcommand{\tr}{\operatorname{trace}} +\newcommand{\vecm}{\operatorname{vec}} +\newcommand{\diagm}{\operatorname{diagm}} +\newcommand{\dotstar}{\mathbin{.*}} + +\usepackage{polyglossia} + + + + +\usepackage[ + backend=biber +]{biblatex} +\addbibresource{biblio.bib} + + +\usepackage[ + letterpaper, + left=1cm, + right=1cm, + top=1.5cm, + bottom=1.5cm +]{geometry} + + +\usepackage[ + final, + unicode, + colorlinks=true, + citecolor=blue, + linkcolor=blue, + plainpages=false, + urlcolor=blue, + pdfpagelabels=true, + pdfsubject={Matrix Calculus}, + pdfauthor={Edelman and Johnson}, + pdftitle={Homework 2}, + pdfkeywords={MIT, IAP, Winter 2022} +]{hyperref} + +\usepackage{booktabs} + + +\usepackage{algpseudocode} +\usepackage{algorithm} +\floatname{algorithm}{Algorithm} + +\usepackage{enumitem} + +\usepackage{lastpage} +\usepackage{fancyhdr} +\fancyhf{} +\pagestyle{fancy} +\fancyhf{} +\fancyhead[L]{MIT IAP Winter 2022} +\fancyhead[C]{Matrix Calculus} +\fancyhead[R]{Profs.
Edelman and Johnson} +\fancyfoot[R]{} +% https://tex.stackexchange.com/questions/227/how-can-i-add-page-of-on-my-document +\fancyfoot[C]{\thepage\ of \pageref*{LastPage}} +%\fancyfoot[C]{ of } +\fancyfoot[L]{} +\renewcommand{\headrulewidth}{2pt} +\renewcommand{\footrulewidth}{2pt} + +\usepackage[newfloat=true]{minted} + +\author{} +\title{Homework 2} + +\date{\today} + + +\DeclareCaptionFormat{mitedFormat}{% + \textbf{#1#2}#3} +\DeclareCaptionStyle{minetdStyle}{skip=0cm,width=.85\textwidth,justification=centering, + font=footnotesize,singlelinecheck=off,format=mitedFormat,labelsep=space} +\newenvironment{mintedCode}{\captionsetup{type=listing,style=minetdStyle}}{} + +\usepackage{dirtytalk} + +\SetupFloatingEnvironment{listing}{} + +\usepackage[parfill]{parskip} + +\usepackage{csquotes} + +\begin{document} +\maketitle +\thispagestyle{fancy} + +{\bf Please submit your HW on Canvas; include a PDF printout of any code and results, clearly labeled, e.g. from a Jupyter notebook. It is due Wednesday January 26th by 11:59pm EST. } + +\subsection*{Problem 1} + +Let $f(x): \mathbb{R}^n \to \mathbb{R}^n$ be an invertible map from inputs $x \in \mathbb{R}^n$ ($n$-component column vectors) to outputs in $\mathbb{R}^n$. If $g(x)$ is the inverse function, so that $g(f(x)) = x = f(g(x))$, use the chain rule to show how the Jacobians $f'(x)$ and $g'(x)$ are related. + +\subsection*{Problem 2} + +Consider $f(A) = \sqrt{A}$ where $A$ is an $n\times n$ matrix. (Recall that the matrix square root is a matrix such that $(\sqrt{A})^2 = A$. In 18.06, we might compute this from a diagonalization of $A$ by taking the square roots of all the eigenvalues.) + +\begin{enumerate}[label=\alph*)] + \item Give a formula for the Jacobian of $f$, acting on $\vecm(A)$, in terms of Kronecker products and matrix powers only (no eigenvalues required!). (Hint: see problem 1.) + + \item Check your formula numerically against a finite-difference approximation. (To ensure that no complex numbers show up in the matrix square root, pick a random positive-definite $A = B^T B$ for some random square $B$.) +\end{enumerate} + +\subsection*{Problem 3} + +Suppose that $A(p) = A_0 + \diagm(p)$ constructs an $n\times n$ matrix $A$ from $p \in \mathbb{R}^n$, where $\diagm(p) = \begin{pmatrix} p_1 & & \\ & p_2 & \\ & & \ddots \end{pmatrix}$ is the diagonal matrix with the entries of $p$ along its diagonal, and $A_0$ is some constant matrix. We now perform the following sequence of steps: + +\begin{enumerate} + \item solve $A(p) x = a$ for $x \in \mathbb{R}^n$, where $a$ is some vector. + \item form $B(x) = B_0 + \diagm(x \dotstar x)$ where $\dotstar$ is the elementwise product and $B_0$ is some $n\times n$ matrix. + \item solve $By = b$ for $y \in \mathbb{R}^n$, where $b$ is some vector. + \item compute $f = y^T F y$ where $F = F^T$ is some symmetric $n \times n$ matrix. +\end{enumerate} + +Now, suppose that we want to optimize $f(p)$ as a function of the parameters $p$ that went into the first step. We need to compute the gradient $\nabla f$ for any kind of large-scale problem. If $n$ is huge, though, we must be careful to do this efficiently, using ``reverse-mode'' or ``adjoint'' calculations in which we apply the chain rule from left to right. + +\begin{enumerate}[label=\alph*)] + +\item Explain the sequence of steps to compute $\nabla f$. Your answer should show that $\nabla f$ can \emph{also} be computed by solving only \emph{two} $n \times n$ linear systems, similar to $f$ itself. 
+ +\item Check your answer numerically against finite differences (for some randomly chosen $A_0, B_0, a, b, F$). + +\item Check your answer against the result of a reverse-mode AD software (e.g. Zygote in Julia). + +\end{enumerate} + +\subsection*{Problem 4} + +\begin{enumerate}[label=\alph*)] + +\item Write down some $4 \times 4$ matrix that is not the +Kronecker product of two $2 \times 2$ matrices. Convince +us this is true. + +\item Prove that if $A$ ($m \times m$) and $B$ ($n \times n$) are orthogonal (i.e., +$A^TA=I_m$ and $B^TB=I_n$) then $A \otimes B$ ($mn \times mn$) is orthogonal. + +\item If $f(A) = e^A= \sum_{k=0}^\infty A^k/k!$, write down +a power series involving Kronecker products for the +Jacobian $f'(A)$. Check your answer with a numerical example, e.g.~against finite differences. + +\end{enumerate} + + + +\end{document}
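A quick numerical companion to problem 4(b) (a sketch, not part of the assignment or the posted solutions): for random orthogonal A and B, the Kronecker product A ⊗ B is orthogonal up to roundoff.

```julia
using LinearAlgebra

# sketch for problem 4(b): if AᵀA = I and BᵀB = I then (A ⊗ B)ᵀ(A ⊗ B) = I
A = Matrix(qr(randn(4, 4)).Q)   # random 4×4 orthogonal matrix (Q factor of a QR factorization)
B = Matrix(qr(randn(3, 3)).Q)   # random 3×3 orthogonal matrix
K = kron(A, B)                  # 12×12 Kronecker product
opnorm(K' * K - I)              # ≈ 1e-15: K is orthogonal up to roundoff
```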