\documentclass[10pt,oneside]{article}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{subcaption}
\usepackage{amsfonts}
\usepackage{amssymb}
\newcommand{\tr}{\operatorname{trace}}
\newcommand{\dotstar}{\mathbin{.*}}
\newcommand{\strong}[1]{\textbf{#1}}
\usepackage{polyglossia}
\usepackage[
backend=biber
]{biblatex}
\addbibresource{biblio.bib}
\usepackage[
letterpaper,
left=1cm,
right=1cm,
top=1.5cm,
bottom=1.5cm
]{geometry}
\usepackage[
final,
unicode,
colorlinks=true,
citecolor=blue,
linkcolor=blue,
plainpages=false,
urlcolor=blue,
pdfpagelabels=true,
pdfsubject={Matrix Calculus},
pdfauthor={Profs. Edelman and Johnson},
pdftitle={Homework 1 Solutions},
pdfkeywords={MIT, IAP, Winter 2022}
]{hyperref}
\usepackage{booktabs}
\usepackage{algpseudocode}
\usepackage{algorithm}
\floatname{algorithm}{Algorithm}
\usepackage{enumitem}
\usepackage{lastpage}
\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhf{}
\fancyhead[L]{MIT IAP Winter 2022}
\fancyhead[C]{Matrix Calculus}
\fancyhead[R]{Profs. Edelman and Johnson}
\fancyfoot[R]{}
% https://tex.stackexchange.com/questions/227/how-can-i-add-page-of-on-my-document
\fancyfoot[C]{\thepage\ of \pageref*{LastPage}}
\fancyfoot[L]{}
\renewcommand{\headrulewidth}{2pt}
\renewcommand{\footrulewidth}{2pt}
\usepackage[newfloat=true]{minted}
\author{}
\title{Homework 1 Solutions}
\date{\today}
\DeclareCaptionFormat{mintedFormat}{%
\textbf{#1#2}#3}
\DeclareCaptionStyle{mintedStyle}{skip=0cm,width=.85\textwidth,justification=centering,
font=footnotesize,singlelinecheck=off,format=mintedFormat,labelsep=space}
\newenvironment{mintedCode}{\captionsetup{type=listing,style=mintedStyle}}{}
\usepackage{dirtytalk}
\SetupFloatingEnvironment{listing}{}
\usepackage[parfill]{parskip}
\usepackage{csquotes}
\begin{document}
\maketitle
\thispagestyle{fancy}
\subsection*{Problem 1 [$6\times{}$(3 points analytical + 1 point numerical)]}
See the \strong{supplementary Julia notebook} for example numerical validations; a minimal finite-difference sketch also appears after the list below.
\begin{enumerate}[label=\alph*)]
\item $f(x)=(x^T x)^4$: The result is a scalar, which makes it easier to apply the chain rule without worrying about commuting things: $df = 4(x^T x)^3 d(x^T x) = 4(x^T x)^3(dx^T x + x^T dx) = 8(x^T x)^3 x^T dx$, and as in class we can also write this as a gradient column vector $\nabla f = 8(x^T x)^3 x$, such that $df = (\nabla f)^T dx$.
\item $f(x) = \cos(x^T Ax)$: again applying the chain rule, and using the derivative of $x^T Ax$ from class, we immediately get $\nabla f = -\sin(x^T Ax)(A+A^T)x$
\item $f(A)=\tr(A^4)$: here, we have to be a little careful with the product rule for $d(A^4) = dA\, A^3 + A\,dA\,A^2+A^2\,dA\,A+A^3\,dA$ because $A$ and $dA$ do not commute. By linearity of $\tr$, $df = \tr(d(A^4)) = \tr(dA\, A^3) + \tr(A\,dA\,A^2)+\tr(A^2\,dA\,A)+\tr(A^3\,dA)$. But by the cyclic property of the trace we have $\tr(A\,dA\,A^2) = \tr(A^3\,dA)$ etc., and hence $df = 4\tr(A^3\,dA)$. Equivalently, from class we can write $\nabla f = 4(A^3)^T$.
\item $f(A)=A^4$: From above, $df = dA\, A^3 + A\,dA\,A^2+A^2dA\,A+A^3\,dA$, which is a linear operator acting on $dA$.
\item $f(A) = \theta^T A$: this is simply $df =\theta^T dA$, which is a linear operator on $dA$. (Easy, right? But this problem reportedly drastically confused 6.036 students who weren't used to thinking of derivatives as linearization.)
\item $f(x) = \sin.(x)$ (elementwise sine): The derivative is simply $df = \cos.(x) \dotstar dx$, where $\dotstar$ denotes the elementwise ``Hadamard'' product as in Julia or Matlab.

More generally, we can easily derive the rule that $d(g.(x)) = g'.(x) \dotstar dx$ for any scalar function $g$ applied elementwise. One way to see it is to write it out component-wise: $$
d(g.(x)) = d\begin{pmatrix} g(x_1) \\ g(x_2) \\ \vdots \\ g(x_n) \end{pmatrix} = \begin{pmatrix} g'(x_1) dx_1 \\ g'(x_2) dx_2 \\ \vdots \\ g'(x_n) dx_n \end{pmatrix} = g'.(x) \dotstar dx \, . $$ Elementwise functions are somewhat alien to ordinary linear algebra (among other things, they are basis-dependent), but they are common in many fields from machine learning (``activation'' functions are applied elementwise to a layer of a neural network) to data science, and once we write out the rules we see that they are quite simple.

To write this as a Jacobian matrix, note that the elementwise product $a \dotstar dx$ is equivalent to multiplying $dx$ on the left by a \emph{diagonal matrix} whose diagonal entries are the components of the vector $a$. This matrix is denoted $\operatorname{diagm}(a)$ in Julia and $\operatorname{diag}(a)$ in Matlab and NumPy, so our Jacobian matrix is $\operatorname{diagm}(g'.(x))$, or
$$
f'(x) = \operatorname{diagm}(\cos.(x)) = \begin{pmatrix}
\cos(x_1) & & & \\
& \cos(x_2) & &\\
& & \ddots & \\
& & & \cos(x_n)
\end{pmatrix}
$$
in this case.
\end{enumerate}
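For instance, here is a minimal finite-difference sketch for parts (a) and (f), in the spirit of (but not copied from) the supplementary notebook; the dimension and the perturbation size are arbitrary choices:
\begin{minted}{julia}
using LinearAlgebra

x  = randn(5)
dx = 1e-8 * randn(5)             # small random perturbation

# part (a): df should be 8(x'x)^3 x'dx to first order
fa(x) = (x' * x)^4
fa(x + dx) - fa(x)               # compare to 8 * (x' * x)^3 * (x' * dx)

# part (f): the Jacobian of sin.(x) is diagm(cos.(x))
sin.(x + dx) - sin.(x)           # compare to diagm(cos.(x)) * dx
\end{minted}
The two quantities in each comparison should agree to roughly eight significant digits, the accuracy one expects from a first-order finite difference with $\|dx\| \sim 10^{-8}$.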
\pagebreak
\subsection*{Problem 2 (15+5 points)}
\subsubsection*{part (a):}
We have $dL = d[ (x_N(p) - y_0)^T (x_N(p) - y_0)] = 2 (x_N(p) - y_0)^T dx_N $ (similar to our derivative $d(x^T x)=2x^T dx$ in class), and differentiating the recurrence for each $k$, we also have $dx_k = \frac{\partial f_k}{\partial p} dp + \frac{\partial f_k}{\partial x_{k-1}} dx_{k-1}$ as in class. Putting these two together, we have:
\begin{align*}
dL &= 2 (x_N(p) - y_0)^T \left( \frac{\partial f_N}{\partial p} dp + \frac{\partial f_N}{\partial x_{N-1}} dx_{N-1}
\right) \\
&= 2 (x_N(p) - y_0)^T \left( \frac{\partial f_N}{\partial p} dp + \frac{\partial f_N}{\partial x_{N-1}} \left( \frac{\partial f_{N-1}}{\partial p} dp + \frac{\partial f_{N-1}}{\partial x_{N-2}} dx_{N-2}
\right) \right) \\
&= 2 (x_N(p) - y_0)^T \left( \frac{\partial f_N}{\partial p} dp + \frac{\partial f_N}{\partial x_{N-1}} \left( \frac{\partial f_{N-1}}{\partial p} dp + \frac{\partial f_{N-1}}{\partial x_{N-2}} \left( \frac{\partial f_{N-2}}{\partial p} dp + \frac{\partial f_{N-2}}{\partial x_{N-3}}
\left( \cdots \right)
\right)
\right) \right)
\end{align*}
As discussed in class, since we have a single scalar output and many inputs $p$, the trick is to \emph{evaluate left-to-right} so that we are always doing vector--matrix multiplications and \emph{never} matrix--matrix multiplications. This is straightforward computationally, but writing it out formally as a recurrence relation involves a little thought.
We have to keep track of the row vector that we accumulate, starting from the left-most row vector $2 (x_N(p) - y_0)^T$. Let's give this a name:
$$
\boxed{z_N^T = 2 (x_N(p) - y_0)^T} \, .
$$
Then, let's define subsequent elements
of $z_k$ by grouping vector--matrix products in $dL$:
\begin{align*}
dL &= z_N^T \left( \frac{\partial f_N}{\partial p} dp + \frac{\partial f_N}{\partial x_{N-1}} \underbrace{(\cdots)}_{dx_{N-1}}
\right) \\
& = z_N^T \frac{\partial f_N}{\partial p} dp + \underbrace{z_N^T \frac{\partial f_N}{\partial x_{N-1}}}_{z_{N-1}} \underbrace{\left( \frac{\partial f_{N-1}}{\partial p} dp + \frac{\partial f_{N-1}}{\partial x_{N-2}} \underbrace{(\cdots)}_{dx_{N-2}} \right)}_{dx_{N-1}}
\\
& = z_N^T \frac{\partial f_N}{\partial p} dp +
z_{N-1}^T \frac{\partial f_{N-1}}{\partial p} dp
+
\underbrace{z_{N-1}^T \frac{\partial f_{N-1}}{\partial x_{N-2}}}_{z_{N-2}} \underbrace{(\cdots)}_{dx_{N-2}}
\\
&= \cdots = \underbrace{\left(\sum_{k=1}^N z_k^T \frac{\partial f_k}{\partial p} \right)}_{(\nabla L)^T} dp \, ,
\end{align*}
where we have defined a (linear!) recurrence:
$$
\boxed{z_{k-1}^T = z_k^T \frac{\partial f_k}{\partial x_{k-1}}} \, .
$$
Each step of this ``backwards'' (\strong{descending} $k$) recurrence (``backpropagation'', ``reverse-mode''), starting with $z_N$ and then computing $z_{N-1}$ and so on, only involves a vector--matrix multiply, so it is fast. We then have the gradient:
$$
\boxed{\nabla L = \sum_{k=1}^N \left(\frac{\partial f_k}{\partial p}\right)^T z_k} \, ,
$$
which again involves $N$ matrix--vector multiplications.
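To make the recurrences concrete, here is a schematic Julia sketch (our own illustration, not code from the course materials). It assumes hypothetical helpers \texttt{fk(k, x, p)}, \texttt{dfdx(k, x, p)}, and \texttt{dfdp(k, x, p)} that return $f_k$, $\partial f_k/\partial x_{k-1}$, and $\partial f_k/\partial p$ evaluated at $(x_{k-1}, p)$; for convenience it propagates the column vector $z_k$ via the transposed recurrence $z_{k-1} = (\partial f_k/\partial x_{k-1})^T z_k$, which is equivalent to the row-vector recurrence above:
\begin{minted}{julia}
# Schematic reverse-mode sketch.  fk, dfdx, dfdp are hypothetical helpers
# (passed in as arguments) returning f_k and its two Jacobian matrices.
function gradL(p, x0, y0, N, fk, dfdx, dfdp)
    xs = Vector{typeof(x0)}(undef, N + 1)
    xs[1] = x0
    for k = 1:N                        # forward pass: store every x_k
        xs[k+1] = fk(k, xs[k], p)
    end
    z = 2 * (xs[N+1] - y0)             # column vector z_N
    g = zero(p)                        # accumulates the gradient
    for k = N:-1:1                     # backward pass: descending k
        g += dfdp(k, xs[k], p)' * z    # add (df_k/dp)^T z_k
        z = dfdx(k, xs[k], p)' * z     # z_{k-1} = (df_k/dx_{k-1})^T z_k
    end
    return g
end
\end{minted}
Each backward step is a single matrix--vector product, and the forward pass must store all of the $x_k$: the characteristic speed and memory tradeoff of reverse mode.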
\subsubsection*{part (b):}
In general, if $\partial f_k/ \partial x_{k-1}$ and/or $\partial f_k / \partial p$ are sparse (mostly zeros), the vector--matrix and matrix--vector multiplications above (respectively) can be simplified because you only need to multiply by the nonzero parts of the matrix.
In particular, the problem asks you what happens if $f_k$ only depends upon the $k$-th block of parameters $p_k$. Suppose $x_k \in \mathbb{R}^{m_k}$ (the size of the NN ``layer''), so that $f_k$ has $m_k$ outputs. Then we obtain a sparse $\partial f_k /\partial p$ with the following $m_k \times Nn$ form:
$$
\begin{pmatrix}
0 & 0 & \cdots & \underbrace{\frac{\partial f_k}{\partial p_k}}_{m_k \times n} & 0 & 0 & \cdots & 0
\end{pmatrix} ,
$$
where each ``$0$'' represents an $m_k \times n$ \emph{block} of zeros, and the only nonzero block is $\partial f_k/ \partial p_k$, which appears in the $k$-th column block.
In consequence, we gain efficiency in the computation of $\nabla L$ above by skipping the multiplications by these zero blocks. Furthermore, in the $\nabla L$ summation, most of the elements of each vector in the sum are zero, and we can skip the addition of zeros. Finally, we obtain the simplification:
$$
\boxed{
\nabla L =
\begin{pmatrix}
\left(\frac{\partial f_1}{\partial p_1}\right)^T z_1 \\
\left(\frac{\partial f_2}{\partial p_2}\right)^T z_2 \\
\vdots
\\
\left(\frac{\partial f_N}{\partial p_N}\right)^T z_N
\end{pmatrix} } \, .
$$
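Concretely, the backward loop in the earlier sketch then only has to compute and store one block per step; here is a variant under the same assumptions, with a hypothetical helper \texttt{dfdpk(k, x, p)} that returns just the nonzero $m_k \times n$ block $\partial f_k/\partial p_k$, and with \texttt{xs} and \texttt{dfdx} as in the part (a) sketch:
\begin{minted}{julia}
# Blockwise gradient under the part (b) sparsity assumption: f_k depends
# only on the k-th length-n block p_k, so each backward step fills in
# exactly one block of the gradient.
function gradL_block(p, xs, y0, N, n, dfdx, dfdpk)
    z = 2 * (xs[N+1] - y0)
    g = zero(p)                                    # length N*n
    for k = N:-1:1
        g[(k-1)*n+1 : k*n] = dfdpk(k, xs[k], p)' * z   # the only nonzero block
        z = dfdx(k, xs[k], p)' * z
    end
    return g
end
\end{minted}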
\subsection*{Problem 3 (5+10+5+5 points)}
\begin{enumerate}[label=\alph*)]
\item $f(v) = \int_0^1 \sin(v(x)) dx$, so $$df = f'(v)[dv] = f(v+dv)-f(v) = \int_0^1 \left( \sin(v+dv) - \sin(v)\right) dx = \boxed{ \int_0^1 \cos(v(x))\,dv(x) \, dx} \, ,$$ which is a linear operator acting on perturbation \emph{functions} $dv(x)$ (or $\delta v$, dropping terms beyond first order).
\item $g(v) = \int_0^1 \sqrt{1 + v'(x)^2} dx$, so similarly expanding the integrand for $v' + dv'$ to first order in $dv'$ (just a scalar Taylor expansion of $\sqrt{1+y^2}$ in $y$), we obtain the linearization $dg = g'(v)[dv] = \int_0^1 \frac{v'(x)}{\sqrt{1 + v'(x)^2}} dv'(x)\, dx$. In order to get the integral in terms of $dv$, not $dv'$, we integrate by parts:
$$
dg = \left. \frac{v'(x)}{\sqrt{1 + v'(x)^2}} dv \right|_0^1 - \int_0^1 \left(\frac{v'(x)}{\sqrt{1 + v'(x)^2}}\right)' dv(x) dx \, ,
$$
where the first term is \emph{zero} because $dv \in V$ vanishes on the endpoints. After a bit of high-school calculus to take the derivative and put the integrand over a common denominator, we obtain:
$$
\boxed{dg = g'(v)[dv] = \int_0^1 \frac{- v''(x)}{(1+v'(x)^2)^{3/2}} dv(x) \, dx}\, .
$$
\item To have $dg = 0$ for \emph{any} $dv$, it is clear that we must have $$ \frac{ v''(x)}{(1+v'(x)^2)^{3/2}} = 0 $$ for $x \in [0,1]$ (almost everywhere) in the previous part. Since the denominator is strictly positive, this only happens if $\boxed{v'' = 0}$, a straight line! Furthermore, since we required $v(0)=v(1)=0$, the only possible straight line is $\boxed{v(x) = 0}$. In this case $g(v) = 1$, which is clearly the \strong{minimum} value of $g$: the integrand $\sqrt{1+v'(x)^2} \ge 1$ pointwise, so $g \ge 1$ by definition.
\item Geometrically, $g(v)$ is the \strong{length} of the curve $v(x)$ (this $g(v)$ integral should be a familiar formula from 18.01!), and so its \strong{minimum} occurs when $v(x)$ is a \strong{straight line between the endpoints}.
\end{enumerate}
This is a lot of effort just to find a solution that is a straight line, but is an example of a powerful technique from ``\strong{calculus of variations}'': if $f(v) = \int F(v,v')dx$ for some integrand $F$, we linearize $df$ and integrate by parts to obtain a linear functional $df = f'(v)[dv]$ as some integral of $dv$. Then by setting the integrand to zero, we obtain a second-order differential equation in $v$ called the ``Euler--Lagrange'' equation, which can then be solved using 18.03 techniques (and beyond) and/or numerical methods to determine the $v$ for which $f(v)$ is maximal/minimal.
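As a rough numerical sanity check of the boxed part (b) formula (our own discretized sketch; the grid size, test function, and perturbation are arbitrary choices), we can discretize $v$ on a grid, pick a perturbation that vanishes at the endpoints, and compare $g(v+\delta v) - g(v)$ against a Riemann sum of the boxed integral:
\begin{minted}{julia}
# Crude discretized check of part (b): compare g(v + dv) - g(v) with
# a Riemann sum of -v'' / (1 + v'^2)^(3/2) times dv.
n  = 1000
xg = range(0, 1; length=n)
h  = step(xg)
v  = sin.(pi .* xg)                 # test function with v(0) = v(1) = 0
dv = 1e-6 .* xg .* (1 .- xg)        # perturbation vanishing at endpoints
g(v) = sum(sqrt.(1 .+ (diff(v) ./ h) .^ 2)) * h      # arclength integral
vp  = (v[3:end] - v[1:end-2]) ./ (2h)                # v'  at interior points
vpp = (v[3:end] - 2v[2:end-1] + v[1:end-2]) ./ h^2   # v'' at interior points
dg  = sum(-vpp ./ (1 .+ vp .^ 2) .^ (3 / 2) .* dv[2:end-1]) * h
g(v + dv) - g(v)                    # should roughly match dg
\end{minted}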
\subsection*{Problem 4 (10 points)}
For $Y(S) = A^T S A$, we have $dY = A^T dS\, A$, a linear operation on the input perturbation $dS$. If we write this out explicitly for $2\times 2$ symmetric matrices, doing all of the multiplications by hand, we obtain:
\begin{align*}
dY &= \begin{pmatrix} dy_{11} & dy_{12} \\ dy_{12} & dy_{22} \end{pmatrix} =\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}^T \begin{pmatrix} ds_{11} & ds_{12} \\ ds_{12} & ds_{22} \end{pmatrix}
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \\
&=\begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix} \begin{pmatrix} ds_{11} a_{11} + ds_{12} a_{21} & ds_{11} a_{12} + ds_{12} a_{22} \\ ds_{12} a_{11} + ds_{22} a_{21} & ds_{12} a_{12} + ds_{22} a_{22} \end{pmatrix} \\
&= \begin{pmatrix}
a_{11}^2 ds_{11} + 2a_{21}a_{11}ds_{12} + a_{21}^2 ds_{22} & a_{11}a_{12}ds_{11} + (a_{11}a_{22}+a_{21}a_{12})ds_{12} + a_{21}a_{22} ds_{22} \\
\text{(symmetric)} &
a_{12}^2 ds_{11} + 2a_{12}a_{22}ds_{12} + a_{22}^2 ds_{22}
\end{pmatrix} \, .
\end{align*}
If we now write the inputs and outputs as 3-component vectors,
we can then read off the entries of the Jacobian matrix (outputs=rows, inputs=columns!) from above:
$$
\begin{pmatrix} dy_{11} \\ dy_{12} \\ dy_{22} \end{pmatrix}
=
\boxed{\begin{pmatrix}
a_{11}^2 & 2a_{21}a_{11} & a_{21}^2 \\
a_{11}a_{12} & a_{11}a_{22}+a_{21}a_{12} & a_{22}a_{21} \\
a_{12}^2 & 2a_{12}a_{22} & a_{22}^2
\end{pmatrix}}
\begin{pmatrix} ds_{11} \\ ds_{12} \\ ds_{22} \end{pmatrix} \, ,
$$
where the boxed term is the desired Jacobian.
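As a quick validation (our own check script, using only the Julia standard library \texttt{LinearAlgebra}), we can fill in the boxed Jacobian for a random $A$ and compare against computing $A^T dS\, A$ directly:
\begin{minted}{julia}
using LinearAlgebra

# Check the boxed 3x3 Jacobian against dY = A' * dS * A directly
# for a random 2x2 matrix A and a random symmetric dS.
A  = randn(2, 2)
dS = Symmetric(randn(2, 2))
J  = [A[1,1]^2        2A[2,1]*A[1,1]               A[2,1]^2
      A[1,1]*A[1,2]   A[1,1]*A[2,2]+A[2,1]*A[1,2]  A[2,2]*A[2,1]
      A[1,2]^2        2A[1,2]*A[2,2]               A[2,2]^2]
dY = A' * dS * A
J * [dS[1,1], dS[1,2], dS[2,2]]   # should equal [dY[1,1], dY[1,2], dY[2,2]]
\end{minted}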
% \printbibliography
\end{document}