\documentclass{farlamp}
% latexmk -pvc -pdfxe -bibtex -interaction=nonstopmode -outdir=build tiny-supfail.tex
\addbibresource{references.bib}
\graphicspath{ {./tiny-supfail-pics/} }
\subject{Report}
\title{Training a tiny SupAmp model on easy tasks}
\subtitle{The influence of failure rate on learning curves}
\author{Richard Möhn}
\date{\today}
\addtitledatatopdf
\begin{document}
\maketitle
\tableofcontents
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
(For a project overview and a glossary, see the \FarlampRepo.)
\textsc{Bluf}: The learning behaviour is roughly as expected. \Xpa\ compensates
for some overseer failures, although not yet enough to be reliable. Mistakes I
made prevent me from drawing further interesting conclusions from the
experimental results so far. \OQsymbol\ and \TODOsymbol\ mark questions and
tasks for future research.
I ran experiments with a tiny SupAmp model on easy tasks. Small data, short
training times and local execution allow for a tight feedback loop with few
moving parts. Thus I could understand and adapt the code, and fix mistakes
quickly.
The drawback is that it doesn't yield many insights. One result
(\hyperlink{interesting}{click to jump there}), however, was encouraging: \Xpa\
does learn to be more reliable than its overseer \AmpHp. The rest of the results
fall into two categories: results confirming my understanding of the code and
the learning behaviour, and results made useless by my mistakes and
misunderstandings. Although the latter sounds bad, it is actually useful to have
exposed these mistakes before running more expensive experiments.
In this report I quickly note and discuss the methods, observations and lessons
learned before I move on. ‘Quickly’ also means that you have to content yourself
with screenshots of TensorBoard instead of pretty plots. Please excuse the
unpolished prose and other rough edges – I had to move on and start preparing
for my \textsc{aisc} interview.
% TODO: Mention the striking result and my having made mistakes.
% - To get a feeling for what is going on.
% - And solve problems with the code with a short feedback loop. → Without having
% to deal with big data, long training times and working in the cloud.
% - Ran experiments with a tiny model on easy tasks.
% - Still found results that are useful, because they open questions for further
% research. See \OQ and \TODO markers.
% - Quickly writing the interesting parts up before I move on.
% - Quickly also entails that I use screenshots of TensorBoard instead of nice
% plots.
\subsection{Terms}
% $X$ & The model that is trained on question-answer pairs. \\
% \Xpa & A predictor derived from $X$ by Polyak averaging and used by
% \Amplify{H'}{\Xpa} to answer sub-questions. \\
Questions are answered by \AmpHp. The resulting (question, answer) pairs are
used to train $X$. From $X$, \Term{\Xpa} is derived by Polyak averaging. This is
the same as in \textcite{CSASupAmp}, except that I introduce the symbol \Xpa\ in
order to separate the learner from the assistant. I distinguish them here
because they are distinguished in the code. The labels in the diagrams that you
will see later are also derived from the code; table \ref{tab:code-symbols}
explains them using the terms and symbols of this report.
\begin{table}
\caption{Explanations of the labels in the diagrams}
\label{tab:code-symbols}
\begin{tabulary}{\textwidth}{>{\ttfamily}lL}
\firsthline
\normalfont label in diagrams & explanation using the terms of this report \\
\hline
accuracy/targets & validation accuracy of \AmpHp \\
accuracy/teacher & validation accuracy of \Xpa \\
accuracy/train & training accuracy of $X$ \\
accuracy/validation & validation accuracy of $X$ \\
accuracy\_on/<d>/targets & validation accuracy of \AmpHp\ on
questions/answers of depth $d$ \\
accuracy\_on/<d>/teacher & validation accuracy of \Xpa\ on
questions/answers of depth $d$ \\
accuracy/asker/q/train & training accuracy of $H'$ on imitating the
sub-questions that $H$ would ask \\
accuracy/asker/q/validation & validation accuracy of $H'$ on imitating
the sub-questions that $H$ would ask \\
accuracy/asker/a/train & training accuracy of $H'$ on imitating the
root answers that $H$ would give \\
accuracy/asker/a/validation & validation accuracy of $H'$ on imitating
the root answers that $H$ would give \\
\lasthline
\end{tabulary}
\end{table}
The \Term{depth} $d$ of a question is equal to its height plus one. The height
is defined in \Overfail; the two have the same definition, just a different
starting point. Henceforth I will use depth instead of height, in order to be
consistent with \textcite{CSASupAmp} and the SupAmp code.
$\Term{N_a} = N + 1$ is the number of answers a learner can give: any of the
numbers of the domain plus a symbol for ‘I don't know’.
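For the experiments below, $N = 4$, so $N_a = 5$ and the accuracy of uniformly
random answers – the ‘chance’ level referred to in the discussion of the
learning curves – is $1/N_a = 0.2$.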
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Method}
I ran SupAmp on the permutation powering task from \textcite{CSASupAmp}. The
code was written originally by Paul Christiano and Buck Shlegeris, then modified
by William Saunders, then by me. It can be found on \AmplificationRepo. Fifteen
trials were run in total, three each with error probability $p_F \in \{0, 0.01,
0.1, 0.3, 1.0\}$ and the following configuration (sketched in code after the
list):
\begin{itemize}
\item One-layer transformer for both $H'$ and $X$ (if I understand it
right).
\item Permutations of four numbers ($N = 4$), up to the seventh power ($2
\leqq k \leqq 7$).
\item Train $X$ with 4000 batches à 50 contexts with 4 root questions and
answers each. After $X$ is done with its 4000 batches, also stop the
training of $H'$, no matter how far it has got.
\item Other hyperparameters as they were when I got the code.
\end{itemize}
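To make the setup concrete, here is a rough sketch of the trial grid in Python.
This is not the actual configuration code; all names are made up for this
illustration.
\begin{verbatim}
# Hedged sketch of the trial grid; the parameter names are illustrative,
# not taken from the amplification code base.
failure_probs = [0.0, 0.01, 0.1, 0.3, 1.0]  # values of p_F
trials_per_setting = 3                      # 15 trials in total
config = {
    "transformer_layers": 1,   # one-layer transformer for both H' and X
    "permutation_size": 4,     # N = 4
    "max_power": 7,            # 2 <= k <= 7
    "train_batches": 4000,     # batches used to train X
    "contexts_per_batch": 50,  # contexts per batch
    "roots_per_context": 4,    # root question/answer pairs per context
}
\end{verbatim}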
% - SupAmp on permutation powering.
% - Tiny model. Means a one-layer transformer for both H' and X, if I understand
% correctly.
% - Permutation of four numbers, power up to seven.
% - Train for 4000 steps, that is 4000 batches à 50 contexts with 4 questions each
% to train X. However many steps the other threads got done in parallel.
% - Then the same with $p_F \in {0.01, 0.1, 0.3, 1.0}$.
% - Each of these three times.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Observations and discussion}
In the following screenshots from TensorBoard the smoothing factor is 0.6 unless
noted otherwise.
\subsection{Without failures ($p_F = 0$)}
\begin{LCRow}
% LC [r"accuracy/asker/[aq]/validation"]
% {"trials": r"iterate/0/.*"}
\LearningCurve{pf0-all-asker-select}
% LC [r"accuracy.*/targets"]
% {"trials": r"iterate/0/.*"}
\LearningCurve{pf0-all-targets-all}
% LC [r"accuracy.*/teacher"]
% {"trials": r"iterate/0/.*"}
\LearningCurve{pf0-all-teacher-all}
\end{LCRow}
\Obs The learning curves for all three trials are similar; the variation
between trials is low.
% Include picture of H' and train learning curves for pf = 0 and 0.3.
% When I include pictures, make the names generic, so that I can replace them as
% needed.
\Disc This allows me to select only the first trial for further analysis.
\begin{LCRow}
% LC [r"accuracy/asker/q/.*"]
% {"smoothing": 0.0}
\LearningCurve{asker-bouncing-q}
% LC [r"accuracy/asker/a/.*"]
% {"smoothing": 0.0}
\LearningCurve{asker-bouncing-a}
\end{LCRow}
\Obs (Smoothing turned off.) The accuracy of $H'$ bounces around in the
beginning.
\Disc \OQ{}I don't know why. One reason is that it is measured only every fifty
batches, much less often than the other accuracies. The bouncing is not a
problem, since $H'$ reaches perfect accuracy quickly.
\begin{LCRow}
% LC [r"accuracy/asker/a/validation", r"accuracy.*/targets"]
\LearningCurve{pf0-asker-a-validation-targets-all}
% LC [r"accuracy/(train|targets|teacher)"]
\LearningCurve{pf0-student-validation-targets-total-teacher-total}
\end{LCRow}
\Obs $H'$ reaches an accuracy of 1.0 first, then \AmpHp\ on depth-2 questions.
Thereafter, \AmpHp\ on depth-3 questions and on all questions reach 1.0 at the
same time. $X$ lags behind \AmpHp, and \Xpa\ lags behind $X$.
\Disc This is roughly what I expected. These accuracies all depend on each
other. $H'$ needs to become good enough for the amplification to generate
sensible training data for $X$. At first the only correct inputs \AmpHp\ gets
are answers to depth-1 (i.e. primitive) questions. With the answers to depth-1
questions, \AmpHp\ can answer depth-2 questions, which is the next accuracy that
goes up. Once depth-2 answers hit the mark often enough, the training for
depth-3 answers becomes effective. Depth-2 and depth-3 answers make up all
answers, so the accuracy on all answers lies in between.
\AmpHp\ delivers the training data for $X$, and \Xpa\ is derived from $X$. This
explains the respective lags.
\begin{LCRow}
% LC [r"accuracy/asker/a/.*"]
\LearningCurve{pf0-asker-a-all}
% LC [r"accuracy/asker/q/.*"]
\LearningCurve{pf0-asker-q-all}
% LC [r"accuracy/(train|validation)"]
\LearningCurve{pf0-student-all}
\end{LCRow}
\Obs Training and validation accuracy curves are remarkably close both
for $X$ and $H'$.
% Include pictures of the mentioned accuracies.
\Disc I don't know why this is. \OQ It might be that the validation data are
too similar to the data already trained on and therefore not a good test. This
might be because the domains are too small. Or perhaps the learners barely
overfit, because the training data are always random and new.
\begin{LCRow}
% LC [r"accuracy.*/targets"]
\LearningCurve{pf0-targets-all}
% LC [r"accuracy.*/teacher"]
\LearningCurve{pf0-teacher-all}
% LC [r"accuracy/(validation|teacher)"]
\LearningCurve{pf0-student-validation-teacher-total}
\end{LCRow}
\Obs The learning curves of \Xpa\ are smoother than those of \AmpHp\ and $X$.
\Disc I suspect that this is because of the Polyak averaging. It averages the
weights of the last 1000 steps. This biases the predictions towards the
predictions made during the last 1000 steps, and therefore reduces the variance
in the accuracy.
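As a minimal sketch of what I mean by this kind of averaging (assuming a plain
arithmetic average over a sliding window of weight snapshots; the actual code
might use an exponential moving average instead):
\begin{verbatim}
# Hedged sketch: average the weight snapshots of the last `window` steps.
# Memory-hungry but simple; a real implementation would rather keep a
# running (e.g. exponential moving) average.
import numpy as np
from collections import deque

class WindowAverager:
    def __init__(self, window=1000):
        self.snapshots = deque(maxlen=window)  # only the last `window` snapshots

    def update(self, weights):
        # `weights`: list of numpy arrays, one per model variable, after a step
        self.snapshots.append([w.copy() for w in weights])

    def averaged_weights(self):
        # element-wise mean over the stored snapshots, per variable
        return [np.mean([snap[i] for snap in self.snapshots], axis=0)
                for i in range(len(self.snapshots[0]))]
\end{verbatim}
\Xpa\ would then be $X$ evaluated with the averaged weights instead of the
current weights.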
\begin{LCRow}
% LC [r"asker/a/validation", r"accuracy_on/[23]/targets"]
% {"zoom": True}
\LearningCurve{dip-asker-23targets}
% LC [r"accuracy/targets", r"accuracy/train", r"accuracy/teacher"]
% {"zoom": True}
\LearningCurve{dip-targets-answerer-teacher}
\end{LCRow}
\Obs At the beginning of the training \AmpHp, $X$ and \Xpa, in
this order, go through a section of less-than-chance accuracy. (The accuracy of
random outputs would be $\frac{1}{N_a}$.)
% Include pictures of the appropriate learning curves.
\Disc I don't know why this happens. \OQ It might have to do with $H'$ not being
fully trained, or with the special response ‘I don't know’, which is given
non-randomly. It is not important to find out now.
\subsection{With failures injected ($p_F > 0$)}
Unfortunately I made two mistakes. First, I let the trials run for too few
steps, stopping them before the training had converged for depth-3 questions.
This means that the accuracy numbers of \Xpa\ for these questions are almost
useless. Second, I was confused about the injection of errors: I put it in the
wrong place, or measured accuracy wrongly, or calculated the theoretical
accuracy wrongly – or all of these together. \TODO{}I haven't decided yet how to
resolve the issue. The result is that the accuracy numbers of \AmpHp\ on depth-3
questions are almost useless as well.
For now I don't have time to clarify my thoughts, fix the code and rerun the
experiments. So I will only note and discuss the few valid observations that are
left.
% See the bottom of the file for what I'd written.
\begin{LCRow}
% LC [r"accuracy/targets"]
% {"trials": r"iterate/0\.(01|1|3)"}
\LearningCurve{pfg0-all-targets-all}
% LC [r"accuracy/teacher"]
% {"trials": r"iterate/0\.(01|1|3)"}
\LearningCurve{pfg0-all-teacher-all}
\end{LCRow}
\Obs Here, too, the results for identical configurations are almost the same.
\Disc I will only include the first trial of each configuration.
\begin{LCRow}
% LC [r"accuracy/teacher"]
% {"trials": r"iterate/1\..*"}
\LearningCurve{pf1-teacher}
\end{LCRow}
\Obs The accuracy for $p_F = 1.0$ starts slightly above chance, then goes down
to chance level.
\Disc Ending at chance level is not surprising. $p_F = 1.0$ means that the
training data are random, and with random training data it is impossible to
achieve better-than-chance accuracy. \OQ{}The start at a higher accuracy
probably has to do with ‘I don't know’ responses. I don't want to take the time
now to understand it completely. And I will leave the training curve for
$p_F = 1.0$ out of the further discussion.
\begin{LCRow}
% LC [r"accuracy/teacher"]
% {"trials": r"iterate/0.*/0/.*"}
\LearningCurve{pf-all-rise}
\end{LCRow}
\Obs The learning curves for all $p_F$, including $p_F = 0$, rise at about the
same time. Only the invisible ceiling that they hit differs.
\Disc This observation appears rather uninteresting. But at least it shows that
in this configuration the injected failures only influence the attainable
accuracy and don't destabilize the training.
\hypertarget{interesting}{}
\begin{LCRow}
% LC [r"accuracy_on/2/(targets|teacher)"]
% {"trials": r"iterate/0\.(01|1|3)/0/.*", "zoom": True}
\LearningCurve{xpa-surpasses}
\end{LCRow}
\Obs The accuracy of \AmpHp\ on depth-2 questions matches the theoretically
calculated accuracy. The accuracy of \Xpa\ surpasses it in all remaining
settings of $p_F$.
\Disc The former observation is not surprising, since for depth-2 questions $H'$
receives correct inputs (sub-answers to primitive questions) and only its
outputs get laced with randomness. There is nothing between the mechanism that
injects randomness and the accuracy measurement. Therefore the measured accuracy
must match what one would theoretically calculate, given the error injection
mechanism.
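For illustration – and under the assumption that a failure replaces an answer
with one drawn uniformly from all $N_a$ possible answers, including, by chance,
the correct one – the expected accuracy directly behind the injection point
would be
\[ \Pr(\text{answer correct}) = (1 - p_F) + \frac{p_F}{N_a}, \]
which for $N_a = 5$ gives $0.992$, $0.92$ and $0.76$ at $p_F = 0.01$, $0.1$ and
$0.3$, respectively. Whether this is exactly what the code computes is part of
what I still need to untangle.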
\hypertarget{bla}{The latter observation}, however, is striking.
(Perhaps the only worthwhile fact in this report.) \Xpa\ achieves a higher
accuracy than \AmpHp, from which its training data come. This supports my
hypothesis that regularization in the distillation step can compensate for some
of the failures introduced in the amplification. (What regularization? SupAmp
uses Dropout, batch normalization and layer normalization \parencite{CSASupAmp}.
Polyak averaging as a quasi-ensemble method \parencite{BrowPolyakAvg} probably
acts as a regularizer as well.)
% I'm not completely sure whether it actually supports the hypothesis, though. –
% When training data contain failures with probability $p_F$.
\begin{LCRow}
% LC [r"accuracy_on/2/teacher"]
% {"trials": r"iterate/(0|0.01)/0/.*"}
\LearningCurve{imperfect-0_01}
\end{LCRow}
\Obs With $p_F = 0.01$, the final accuracy of \Xpa\ on depth-2 questions is
0.9972 (according to TensorBoard, with the smoothing parameter set to 0.6). This
is lower than the 0.9997 reached when $p_F = 0$.
\Disc This speaks against my main hypothesis. – If there is a failure
tolerance, how low must it be? – The reason might be that even the tiny SupAmp
model has too much capacity for this easy task, or, relatedly, that we need more
regularization.
The hypothesis that more regularization leads to a greater failure tolerance can
be tested. \TODO For example, one could increase the dropout probability
(currently 0.1) and see whether the accuracy rises back to $p_F = 0$ levels. Of
course, the higher $p_F$, the harder it will be to find a dropout rate that
regularizes away faulty inputs while still allowing learning progress.
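A rough sketch of how such a sweep could be set up (the function
\texttt{run\_trial} is a placeholder, not something that exists in the code
base):
\begin{verbatim}
# Hedged sketch of the proposed dropout sweep; run_trial stands in for
# "train the tiny SupAmp model with these settings and return the final
# accuracy of Xpa on depth-2 questions (accuracy_on/2/teacher)".
import itertools

def run_trial(p_f, dropout_prob):
    raise NotImplementedError("hook this up to the actual training script")

failure_probs = [0.0, 0.01, 0.1, 0.3]
dropout_probs = [0.1, 0.2, 0.3, 0.5]   # 0.1 is the current setting

results = {(p_f, d): run_trial(p_f, d)
           for p_f, d in itertools.product(failure_probs, dropout_probs)}
# Hypothesis: for fixed p_F > 0, the accuracy rises with the dropout
# probability, ideally back to the p_F = 0 level, until too much dropout
# prevents learning altogether.
\end{verbatim}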
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Points to improve}
\subsection{Length of training}
I noted above that not all learners converged within the 4000 steps that each
trial ran for. In order to ensure convergence and avoid useless results, I could
increase the number of training steps indiscriminately.\TODO\ Better would be
to save the weights at the end of training, or even at regular checkpoints, so
that not-yet-converged trials can be continued. This will be vital if I move the
training to \textsc{aws} Spot instances, which I'm considering.
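A minimal sketch of what such checkpointing could look like, assuming the code
still uses TF1-style sessions (paths and intervals are made up):
\begin{verbatim}
# Hedged sketch of saving/restoring weights so that trials can be resumed.
import tensorflow as tf

# ... build the model graph here, before creating the Saver ...
saver = tf.train.Saver(max_to_keep=5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    latest = tf.train.latest_checkpoint("checkpoints/")
    if latest is not None:
        saver.restore(sess, latest)   # continue a not-yet-converged trial
    for step in range(4000):
        # ... one training step of X (and whatever H' gets done) ...
        if step % 500 == 0:
            saver.save(sess, "checkpoints/tiny-supfail", global_step=step)
    saver.save(sess, "checkpoints/tiny-supfail", global_step=4000)
\end{verbatim}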
\subsection{Performance}
Much computation is wasted: measurement code is run frequently, often at every
batch, and a large part of it runs in Python or NumPy space rather than in more
optimized TensorFlow space. $H'$ reaches perfect accuracy within the time it
takes $X$ to complete 1000 training steps, but keeps getting trained at full
speed. $X$ reaches perfect accuracy on each successive question depth, yet
questions of all depths continue being sampled uniformly. (The latter might be
different when curriculum learning is turned on.) One should be able to reduce
training time by concentrating resources where they're needed; two of these
savings are sketched below.
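The sketch uses placeholder function names standing in for the real training
and measurement code:
\begin{verbatim}
# Hedged sketch: measure less often and stop training H' once it has
# effectively converged. The stubs below stand in for the real code.
def train_step_x():              # one training step of X
    pass

def train_step_asker():          # one training step of H'
    pass

def asker_accuracy():            # current validation accuracy of H'
    return 1.0

def measure_accuracies(step):    # the expensive Python/NumPy logging
    pass

EVAL_EVERY = 50                  # run measurement code only every 50 batches
ASKER_FREEZE_ACC = 0.999         # freeze H' once it is effectively perfect

asker_frozen = False
for step in range(4000):
    train_step_x()
    if not asker_frozen:
        train_step_asker()
        if asker_accuracy() >= ASKER_FREEZE_ACC:
            asker_frozen = True  # H' has converged; stop spending compute on it
    if step % EVAL_EVERY == 0:
        measure_accuracies(step)
\end{verbatim}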
\printbibliography
\end{document}
Material collection
-------------------
- What hyperparameters?
- observations
- low variance between learning curves for same hyperparameters → need to
look at only one trial per configuration
- all accuracies approach 1 quickly and without detours
- behaviour makes sense/almost nothing surprising
- first H' up to about 0.9, then on/2
- first on/2 up, then on/3, /targets in between
- train/validate are rather close
- The teacher learning curves look smoother than the others. This might be
because of Polyak averaging.
- observations with pf = 0
- expected final accuracies at pf = 0.01, 0.1, 0.3, 1.0
- compared to actual final accuracies
- Why is acc_on/3 less than predicted theoretically?
- Perhaps not trained to convergence.
- Even with pf = 0.01 the model doesn't get trained to overlook the errors.
- Small evidence against the main hypothesis. – Would expect that the
threshold is above 0.01, yet we do get the errors.
- Might just be that the model has too great capacity/we're not regularizing
enough.
→ Hypothesis: If I turn up regularization, the accuracies will rise. To a
point where they don't fall exponentially in the question class/height
anymore.
- Of course, at some point the model can't be trained anymore. The error
probability needs to be low enough that it can be countered with
regularization weak enough as to not prevent learning.
- → X learns all the mistakes that H' makes
- The acc_on/2/teacher are higher than the acc_on/2/targets.
- Can't say anything about acc_on/3/*, because training hadn't fully
converged.
- Is it because Polyak averaging acts as regularization?
- Training and validation curves are remarkably close.
- Are the validation data not good enough? Too similar to the already
trained data?
- Maybe barely any overfitting, because there are always random new training
data.
- performance conclusions see case 72
- need to train for longer/add checkpointing
- How do the model errors come about, though? The training data contain randomly
wrong inputs. What does the model learn to do differently when it receives
randomly wrong inputs?
% What I had to throw away, because I made mistakes.
\paragraph{Observation} (When I write about \AmpHp\ and \Xpa, I'm referring to
their final accuracies.) On depth-2 questions \AmpHp\ matches the theoretical
predictions and \Xpa\ surpasses them. On depth-3 questions, neither reaches the
theoretical predictions, except for one value at $p_F = 0.01$, which I consider
an outlier. Even a failure probability as low as $p_F = 0.01$ prevents the
predictor from reaching perfect accuracy. See table \ref{tab:pred-acc}.
\begin{table}
\caption{Final accuracies compared with theoretical predictions, in percent.
TensorBoard smoothing 0.6.}
\label{tab:pred-acc}
% TODO ideally: Highlight the relevant information.
\begin{tabular}{r|rrr|rrr}
$p_F$ & $\Pr(a_1 = a_1^\ast)$ & acc.\ depth-2 (\AmpHp) & acc.\ depth-2 (\Xpa)
& $\Pr(a_2 = a_2^\ast)$ & acc.\ depth-3 (\AmpHp) & acc.\ depth-3 (\Xpa) \\
\hline
0.01 & 99.20 & 99.27 & 99.72 & 98.40 & 98.52 & 97.03 \\
0.10 & 92.00 & 92.31 & 94.50 & 84.80 & 82.02 & 74.28 \\
0.30 & 76.00 & 75.83 & 78.47 & 59.20 & 51.33 & 37.76 \\
\end{tabular}
\end{table}
What can I actually say? I made mistakes both in my thinking about what is
going on and in training for too short a time. It will take some time to fix
them. So for now I can only note that \Xpa\ reaches a higher accuracy than the
training data on depth-2 questions. (Is this right? Make sure that the ground
truth doesn't contain errors.) (What about $X$? What accuracy does it reach?)
This is evidence for my main hypothesis.
% Include pictures of learning curves.
\paragraph{Discussion} Why does \AmpHp\ not match the theoretical predictions on
depth-3 questions? \OQ Did I overlook something in the theoretical calculations?
It might also be because the training hadn't converged yet. I need to train for
longer next time. \TODO
The fact that the learners can't compensate for $p_F = 0.01$ is evidence
against my main hypothesis. – If there is a failure tolerance, how low must it
be? – The reason might be that even the tiny SupAmp model has too much capacity
for this
easy task, or, relatedly, that we need more regularization. This might explain
why \Xpa\ has greater-than-theoretical accuracy for depth-2 questions. Because of
the Polyak averaging, \Xpa\ acts as an ensemble of neural networks
\parencite{BrowPolyakAvg}. Ensembling is a form of regularization and might
thus compensate for the errors of $H'$.
The hypothesis that more regularization leads to a greater failure tolerance
can be tested. \TODO For example, one could increase the dropout probability
(currently 0.1) and see if the accuracy rises back to $p_F = 0$ levels. Of
course, the higher $p_F$, the harder it will be to find a dropout rate that
regularizes away faulty inputs while still allowing learning progress.
Questions that I can't yet answer: What does $X$/\Xpa\ do differently when
$p_F > 0$? Does it learn to return random answers once in a while like $H'$
does? Or does the noise in the training data just limit learning progress? But
why then does progress stop at exactly the… (Here I noticed that I had made some
mistakes with the failure injection and accuracy measurement.)