What is backpropagation and what is it doing? [video] (youtube.com)
645 points by adamnemecek on Nov 4, 2017 | 99 comments



We need more people teaching math through visual intuition. Like a friend of mine said, "if you want to do computation fast, phrase it as a problem for your GPU, er, visual cortex".

Here is a tool you can play with to visualize this: http://playground.tensorflow.org

If you liked this video, try this different visual intuition of what a neural network does that I find even better:

http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Also, remember that backpropagation is a very general algorithm: it works not only on linear transformation weights, but on any directed acyclic computation graph that is differentiable in its weights.
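
As a minimal sketch of that generality (assuming PyTorch is installed; the function below is an arbitrary example, not anything special):

    import torch

    # any differentiable directed acyclic graph of operations works,
    # not just layer weights
    x = torch.tensor(2.0, requires_grad=True)
    y = torch.tensor(3.0, requires_grad=True)
    z = torch.sin(x * y) + x ** 2

    z.backward()      # reverse-mode autodiff, i.e. backpropagation
    print(x.grad)     # dz/dx = y*cos(x*y) + 2x
    print(y.grad)     # dz/dy = x*cos(x*y)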


This can also go wrong; for example, visualising probability distributions in low dimensions leads to very wrong intuitions about the behavior of high-dimensional distributions.


I think this is a very good point. Years ago people were worried about gradient descent getting stuck in local minima, plausibly because this problem is very obvious in a 3-dimensional space. In higher dimensions, however, this problem seems to more or less go away, and a lot of the worrying about the issue seems to be the result of lower-dimensional intuitions wrongly extrapolated to higher dimensions.


Networks got stuck in local minima for years before we learned about techniques like momentum and dropout, and threw a ton of GPU power at the problem.


When I was implementing a neural network for a university assignment (2 years ago so my memory might fail me), we had to run our algorithm multiple times with different starting positions, then take the minimum of those local minima.

I'm not sure what momentum and dropout are, but I agree with Eliezer: without these things (which I didn't use), local minima are a problem.


Dropout is where you randomly remove neurons from your network during training, which prevents the network from depending too much on any specific neuron (making the output more generalizable). The main paper was published in 2014, so it would have been brand new tech back when you were in your class.


I think you're misinterpreting the parent, who is saying that local minima are not a problem in high dimensions because there is almost always some dimension to move in that reduces the loss (unlike in lower dimensions, where you can get stuck at a point that cannot be locally improved in any direction).


I still don't understand what the parent is talking about then. Could you please restate the explanation using math notation/terminology?


[flagged]


Not the person you replied to, but your comment was both rude and incorrect enough that I feel the need to reply. See for example http://www.offconvex.org/2016/03/22/saddlepoints/ for some discussion on this.


Say you have N random walks.

The probability that the second derivative at any point is of the same sign for all walks decreases with N.

Right?
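
Here's a rough numerical version of that intuition (my own sketch, using random symmetric matrices as stand-in Hessians rather than random walks): the chance that every eigenvalue is positive -- i.e. that a critical point is a local minimum rather than a saddle -- collapses as the dimension grows.

    import numpy as np

    rng = np.random.default_rng(0)
    trials = 20000
    for n in (1, 2, 5, 10):
        hits = 0
        for _ in range(trials):
            a = rng.standard_normal((n, n))
            h = (a + a.T) / 2                 # random symmetric "Hessian"
            if np.all(np.linalg.eigvalsh(h) > 0):
                hits += 1
        print(n, hits / trials)               # fraction that are local minima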


Here is a good short document[0] that can help unlock the missing visual intuition for high-dimensional data. It does require a lot more thinking than 3d space, but eventually, alongside the maths, it helps you get a "feel" for it.

[0] https://www.cs.cmu.edu/~venkatg/teaching/CStheory-infoage/ch...


That means you need to keep track of what properties actually hold under projection to lower dimensions.


Keeping track of properties is not really something 'visualization' necessarily helps with, though; that's more a job for symbolic reasoning through proofs.


I agree it isn't 100% perfect for every situation. But I can think of plenty of instances where ditching colors, lowering resolution, etc. have been totally fine (and often essential) for gaining a level of intuition. As some other comments noted, this intuition may be flawed, mostly because it's hard to know what you don't know. As a result, you are still more informed than you might have been before, but you also might not know much more.

However, being able to mix and match properties is really what most good plotting and visualization is all about. Having done a bunch of ray tracing, my intuition around lighting and light is much better. I am not even good 'anecdata', so take that for what it's worth, but I found visualization to be much more intuitive than reading e&m textbooks/lectures. I'm not sure if knowing a phenomenon bottom-up or top-down is really a guarantee (nor am I suggesting that symbolic reasoning is bottom-up or -down), but seeing something is just so efficient for some people. Like anything powerful, it just needs to be used judiciously and with asterisks.


I don't mean that spatial reasoning helps with that, I mean that if you do it, you can still apply your spatial reasoning where it's appropriate.


I'm not sure I know what you are talking about, but let's not throw the baby out with the bathwater.


What he means is that we only really have intuition for 1, 2, and 2.5D visuals, but many areas of mathematics don’t map into low dimensions very well, or do but lose essential properties in the process. Building a low-dimensional projection of the problem might prime intuition, but it will also introduce fundamental biases.

For example, learning geography by flat map projections only. No matter what projection you use there is a trade off, and you end up instilling both the pro and the con of that trade off as intuition.


Flat map projections work fine if you provide enough of them.

A video from LEO or a rotating map projection provides very different intuition than a single static map. https://www.youtube.com/watch?v=EPyl1LgNtoQ

Very high zoom levels also work out nicely.


Yeah, as I ad-hoc commented: it's all about compression.


I would reiterate that simply because there are problems with naive visualization, we shouldn't discredit visual thinking.

There are several key elements to effective visual thinking. The most important is to keep it grounded in proofs and theorems, so you know exactly what your limitations are. Often you can use a geometric argument on top of a few theorems and intuitively get a very strong result, and then use this intuition with a tiny amount of algebra to prove it (a result which might take you forever to arrive at from a purely algebraic perspective). Another key is that there are several ways of visualizing things. You can almost always transform a problem into an equivalent one that is easy to visualize (you just need a little bit of care with the transformation, etc).

---

For example, you can show that functions form a vector space, visualizing some interesting algebraic properties about them, even though it constitutes an infinite-dimensional space.

You can show several operators (such as d/dx) are linear, and you can give the space a norm, an inner product, etc. This trick lets you use visual tools (and linear algebra tools) with arbitrary functions. You can visualize projection of a function onto a subspace, or onto some non-canonical basis -- yielding useful applications, such as Fourier analysis.

Fourier analysis itself is fertile ground for visual thinking. You'll find trivial arguments for seemingly difficult questions such as "Does this linear system have a bounded output for any bounded input?". There isn't one right way of thinking about anything.
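
To make the projection picture concrete, here's a toy sketch of mine (sampled functions as plain numpy vectors): projecting a square wave onto a few sine basis vectors with ordinary dot products recovers the Fourier coefficients 4/(pi*k).

    import numpy as np

    t = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
    f = np.sign(np.sin(t))                     # square wave, as a vector

    approx = np.zeros_like(t)
    for k in (1, 3, 5, 7):
        basis = np.sin(k * t)
        coeff = (f @ basis) / (basis @ basis)  # projection onto this basis vector
        approx += coeff * basis                # coeff ~ 4/(pi*k) for odd k

    print(np.mean((f - approx) ** 2))          # error shrinks as terms are added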

---

On the other hand, the importance of keeping track of formal assumptions, axioms, definitions, and theorems to construct valid, correct proofs can't be stressed enough. That way you minimize the risk of fooling yourself, and can safely use your intuition.

This 3B1B video exemplifies many of those elements:

https://www.youtube.com/watch?v=zwAD6dRSVyI&t=633s


Yes, exactly, thanks.


I didn't understand the distributions part.


Part of it may be that in higher dimensions the bulk of a volume is concentrated near the surface.

I found https://blogs.msdn.microsoft.com/ericlippert/2005/05/13/high... with a quick search.


The article here (and the links in the comments) might clarify the connection to machine learning:

http://www.penzba.co.uk/cgi-bin/PvsNP.py?SpikeySpheres

https://news.ycombinator.com/item?id=3995615


interesting

The volume of the unit n-ball peaks at n=5 dimensions and then falls toward zero (it is already below 0.03 by n=20). cf. https://news.ycombinator.com/item?id=3995930
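
A quick numerical check via the closed form V(n) = pi^(n/2) / Gamma(n/2 + 1) (sketch):

    import math

    # volume of the unit n-ball
    for n in range(1, 21):
        v = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
        print(n, round(v, 4))   # peaks at n = 5 (~5.26); ~0.026 by n = 20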

The comment refers to Lebesgue measure (which I don't really understand), but I'd intuitively and ignorantly assume we count all faces of all (n-1)-balls (recursively), whereas the volumes overlap and so the total (in Lebesgue... space?) is less than the sum of its parts (in Euclidean space) -- how far off am I? (will delete if too far)


It's related to the geometric problems the parent described because probability distributions roughly describe geometric regions (of high probability density) where observations are likely.


> learning geography by flat map projections only

well, the map is flat more or less at closer zoom levels, so the general problem seems to be purely about lossy compression.


...what? I don’t understand. It’s accurate at “close zoom” because the limit as you scale in to the surface of a sphere is a flat surface. I’m not sure what compression of any sort has to do with this.


I don't think knowing exactly what the parent comment is talking about is required to see that they weren't suggesting we should do away with all visualizations just because there are some cases where they might not be the best teaching tool.


Please elaborate -- are you thinking of the curse of dimensionality?


There are many examples. One I came across recently is that the large majority of the probability mass of a high-dimensional Gaussian distribution is in a shell at a distance from the mean; the mass at the center is actually quite low.

Also anything related to topology, which is important when you are looking at decision boundaries, becomes counterintuitive in high dimensions, because so many things can be adjacent at the same time.
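
A quick numerical sketch of the shell effect (my own toy example): the distance of standard Gaussian samples from the mean concentrates around sqrt(N), with roughly constant spread, so almost no mass sits near the center.

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (2, 10, 100, 1000):
        x = rng.standard_normal((10000, n))
        r = np.linalg.norm(x, axis=1)               # distance from the mean
        print(n, round(r.mean(), 2), round(r.std(), 2))  # mean ~ sqrt(n), std ~ 0.71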


Can you please try and explain why that is?

If true, you're very correct that lower-dimensional intuition does not transfer into higher-dimensional spaces: my intuition tells me that a Gaussian distribution drops off as you fall away from the mean, and it's quite easy for me to imagine that in 2 dimensions, 3 dimensions (e.g. by imagining a mound on a plane) and 4 dimensions (e.g. a cloud in 3-space with increased density around the mean).

Is my intuition wrong in any of those cases? If so, why? If not, how many dimensions do we need before it becomes wrong?


Because an outlier in any single dimension will put the point outside the "center" of the distribution, and as the number of dimensions increases there's more of a chance of that happening.

Say you have an N-dimensional Gaussian where each dimension has mean 0 and standard deviation 1. Define the center as the N-dimensional cube whose edges go from -3 to +3 in each dimension. A normally distributed value is within 3 standard deviations of the mean with probability 0.9973, so the probability of an N-dimensional point being in the center is 0.9973^N. With N=4 that's 0.989, which matches your intuition, but at N=1000 it's 0.067 and at N=10000 it's 1.81e-12.
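
You can reproduce those numbers in a couple of lines (sketch):

    p = 0.9973                   # P(|z| <= 3) for one standard normal
    for n in (4, 1000, 10000):
        print(n, p ** n)         # 0.989..., 0.067..., 1.8e-12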


The center of the distribution always has the highest density, but the ratio of 'probability mass close to centroid' / 'total probability mass' drops off as number of dimensions grows.

This is somewhat related to another 'curse of dimensionality' observation, which is that the ratio of the volume of a hyperball to the volume of its bounding hypercube tends towards zero as dimensions grow -- there's just a lot more volume that's in some sense 'far' from the center.


>If true, you're very correct that lower-dimensional intuition does not transfer into higher-dimensional spaces: my intuition tells me that a Gaussian distribution drops off as you fall away from the mean, and it's quite easy for me to imagine that in 2 dimensions, 3 dimensions (e.g. by imagining a mound on a plane) and 4 dimensions (e.g. a cloud in 3-space with increased density around the mean).

Density is different from mass. Namely, mass is the integral of density. So your intuition is roughly correct for density, but you need to make it accord with a good intuition for mass.

Since getting the mass requires an integral, getting the mass over N-dimensional distributions requires integrating an N-dimensional region, which means N integrations for N dimensions. Each integration is, intuitively, a kind of sum. Integrating out many dimensions happens recursively; looped or recursive addition is multiplication. So on some level, to take the probability mass of a region in N-dimensional space, you need to "multiply" a density.

Since the total probability mass is fixed (1.0), adding more dimensions means you need to "multiply" the density by a larger number to get the mass, which means you need to divide the mass by a larger number to get the density, which means that despite the density peaking at the mean, the available density at any given point gets smaller as the dimensionality rises.
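
A concrete version of that point (my sketch, assuming scipy is installed): the squared distance of a standard N-dimensional Gaussian sample from the mean is chi-square distributed with N degrees of freedom, so the mass within a fixed radius of the mean shrinks rapidly with N.

    from scipy.stats import chi2

    # probability mass within radius 1 of the mean
    for n in (1, 2, 5, 10, 50):
        print(n, chi2.cdf(1.0, df=n))   # 0.68, 0.39, ... essentially 0 by n = 50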


> it's quite easy for me to imagine that in 2 dimensions

It starts to fail really badly as the dimension grows.

Two simple examples:

1) Consider the 3-dimensional unit sphere centered at the origin and the unit cube centered at the origin. The cube is clearly completely inside the sphere. Now generalize to n dimensions: the volume of the hypercube with side length 1 moves almost completely outside the n-sphere with radius 1 as n grows.

2) Alternatively, almost all of the volume of an n-sphere is close to the surface.

These are very counterintuitive, yet simple-to-check, toy examples. When you start to integrate more complex multidimensional functions, things get weird really fast.
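
Example 1 is easy to check numerically (a Monte Carlo sketch of mine): the cube's corner sits at distance sqrt(n)/2 from the origin, so it pokes outside the unit sphere once n > 4, and the fraction of the cube's volume inside the sphere collapses shortly after.

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (3, 5, 10, 20):
        x = rng.uniform(-0.5, 0.5, size=(100000, n))   # points in the cube
        inside = np.linalg.norm(x, axis=1) <= 1.0      # inside the unit sphere?
        print(n, inside.mean())                        # 1.0, ~1.0, ~0.76, ~0.02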


>Alternatively almost all volume of n-sphere is close to the surface.

How does this go against intuition?

Intuition from 1/2/3d tells me that the volume of an N-ball is O(r^N), and indeed that is the case in higher dimensions. Therefore it’s easy to see that the ratio between the volume of an N-ball of radius (r + epsilon) and one of radius r grows exponentially with N, so almost all of the volume ends up near the surface.
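
Concretely (a quick sketch): the fraction of an N-ball's volume within epsilon of its surface is 1 - ((r - eps)/r)^N, which tends to 1 as N grows.

    r, eps = 1.0, 0.01
    for n in (3, 100, 1000):
        print(n, 1 - ((r - eps) / r) ** n)   # 0.03, 0.63, 0.99996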


Isn't this just because we're comparing n-dimensional objects by a 2-norm? I.e., the dimension of the space grows but we're keeping the dimension of the norm fixed; if we used the p-norm of the same dimension as the space, then maybe that would return intuitive results?


This is the problem that makes my brain melt when I try to think about genetics and mutational load. The naive idea is that a species, S, has a correct genome, G, but mutations build up, increasing with each generation. Presumably mutations build up until they are common enough that back-mutations are a thing. Then there is an equilibrium. In a fecund species, each individual has many children, but most have a higher mutational load, many have the same mutational load, and a lucky few have a smaller mutational load, closer to G, the correct genome. Differential reproductive success then maintains the equilibrium.

I don't see how the numbers are supposed to work out for large mammals, with each female having under a dozen offspring. To have a decent chance of a back-mutation, the typical member of the species would need one twelfth of their genome to be deleterious mutations.

Meanwhile, people are thinking about using CRISPR to correct the human genome, creating unusually happy, healthy people. The underlying thought is that the correct genome is best. But why do we think that the correct genome works at all?

Most of the population is in a shell at a distance from the correct genome, the number at the center is actually quite low. Given the combinatorics, with two to the millions of possible genomes, but populations in the millions, the number at the center, or even close, is actually zero. Maybe the correct genome codes for a sickly, miserable individual?

My current guess is that the evolution of large mammals with few offspring is constrained by genetic load considerations. It is not sufficient, (or even necessary) for the correct genome to be any good. There needs to be a big blob of mediocrity in genome space. The species exists as a shell of individuals on the edge of the blob of mediocrity. The blob needs to be huge, so that individuals whose genome is one twelfth mutations are still in the blob. Then there can be an equilibrium between back-mutations, taking offspring towards the interior of the blob and other mutations, taking offspring out of the blob and out of the gene pool.

This potentially solves the Fermi paradox. Can creatures such as humans actually exist in this universe? It is not enough for natural selection to discover a good genome. Natural selection has to discover a huge blob of mediocrity. Such blobs might be vastly rarer than we realize.

This potentially shits on the CRISPR master race. There might be nothing special about the interior of the huge blob of mediocrity.


What is the "correct genome"? Seems like you could only define it as a local maximum in fitness space, or some kind of attractor.


I don't know. There is a medical perspective which focuses on deleterious mutations causing disease. This is a black-and-white perspective which sees mutations as either wholly bad or entirely unimportant. A genome without any deleterious mutations is a correct genome.

But what happens if you step back from black-and-white thinking and ask about mutations with ambiguous effects? Which is the mutation and which is the correct genome? It becomes unclear.

An alternative perspective asks: how well separated are the local maxima in fitness space? Perhaps the typical separation is as large as the gaps between species. Then each species has only its own local maximum, which defines its correct genome. Or perhaps fitness space is littered with local maxima, such that a single species has genetically healthy individuals in several different maxima, plus other individuals, perhaps not quite so healthy, nearby.


Theoretically, yes. Can you give a more concrete example? Many hard high dimensional and general topology problems can be visualized through their 2D special cases.


Yeah, but the distance from the center is itself just a one-dimensional distribution (chi-distributed, approximately Gaussian around sqrt(N) for large N)...


Related: Hamming's "The Art of Doing Science and Engineering" chapter 9, N-Dimensional Space

http://worrydream.com/refs/Hamming-TheArtOfDoingScienceAndEn...

(I assume Bret Victor has permission to host the PDF on his website, he is far from an anonymous pirate)


> We need more people teaching math through visual intuition.

I would modify that slightly and say rather just through "intuition". Visualization helps a lot, but you can also have great intuition from situations, stories, feelings, etc (anything that hits the non-reasoning part of the brain i.e. your "gut feeling"). IMHO one of the biggest problems in mathematics and science education is that we spend too much time working on things which humans are bad at (precise calculations) and far too little doing the 'rough estimation' and 'intuition' work which we have been evolutionarily optimized for and which is essential to us for actually remembering and understanding how things work.


I'm learning linear algebra and found that watching a Strang OCW lecture, internalizing it in "his voice" and then doing a few problems from his textbook while "listening" to him (in my head) has helped my intuition more than anything else. Reading the book "in his voice" makes it easier to understand, too.


I think that might be a placebo. Maybe you mean his attitude.


It stands to reason that language has a strong affinity with sound: there have been orders of magnitude more time in history to develop that, compared to text-reading skill.


Visual intuition fails when you have more than 3 (or in some cases 4) dimensions. I can visualise a right angle in 2 and 3 dimensions. In 4 dimensions I can't visualise it, and the whole idea of an angle no longer seems to make sense. And yet I still need to use this notion of a right angle in more than 3 dimensions in order to reason about certain problems.
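
The saving grace is that the dot-product definition cos(theta) = u.v / (|u| |v|) works in any dimension, even where visualization fails. A quick sketch, which also shows the counterintuitive fact that two random high-dimensional directions are nearly orthogonal:

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (2, 10, 1000):
        u = rng.standard_normal(n)
        v = rng.standard_normal(n)
        cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        print(n, np.degrees(np.arccos(cos)))   # concentrates near 90 degrees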


What percentage of teachers use visual intuition, and what percentage should?


There needs to be some sort of organized push for visualization tools. I know, I might be bringing the proverbial owls to the proverbial Athens by saying that here, but I really do feel that if done right this could impact the course of the world like nothing else. This could be as important as, I don't know, the invention of the printing press or something. Make the computer "the visualization machine".

I think that one of the fundamental problems is that to be a visualization machine, you need easy access to the GPU, and OpenGL provides anything but. I think that shadertoy (shadertoy.com) is the thing that comes closest, but the learning curve is kinda steep.

I know that people like Alan Kay, Bret Victor, and Michael Nielsen (his post was on the front page the other day: https://news.ycombinator.com/item?id=15616637) share these sentiments, but this is a task bigger than a single person.

I don't know what I really mean by "organized push". I'm not sure the problem is even well defined.


In deep learning, TensorBoard (https://www.tensorflow.org/get_started/summaries_and_tensorb...) works with TensorFlow and Keras to show what the model is doing. However, it ends up being more complicated/unintuitive than a YouTube video, so it's not as useful.


The problem is that this is an ad hoc solution. What I'm talking about would be some formalization of visualization (I guess kinda like grammar of graphics without the statistical aspect) so you can visualize just about anything.


“Visualizing just about anything” isn’t helpful if you want to learn from the visualization, though. (cf. the /r/dataisbeautiful subreddit nowadays: https://www.reddit.com/r/dataisbeautiful/top/?sort=top&t=mon...)

That’s not to say that a purely artistic data visualization has no value, but it’s not academic. (I admit I am guilty of that at times)


Data visualization is only a part of it. I'm talking about visualizing concepts.


Structure is data


I feel like visualizations rely too much on the existence of a meaningful isomorphism. That is, once a problem is visualized effectively it becomes trivial, and though it is applicable to similar future problems, the isomorphism itself is too domain-specific to be generalized. It feels like trying to find an analogy that will help you find all future analogies.


I do agree with this sentiment very much but at the same time I do feel like no one has really given it a good shake.


Isn't that what category theory is about, on the meta level, and in the result in case of specific isomorphisms, too?

edit: at that I still have John C. Baez, Mike Stay - "Physics, Topology, Logic and Computation: A Rosetta Stone" on my reading list https://arxiv.org/abs/0903.0340


There was a big organized push for visualization, and more precisely augmented visual thinking, at HARC. It’s really too bad HARC didn’t work out, but many of us are still very interested in this problem.


Uh, details/links please on how/why HARC failed.


It was just an unexpected funding problem, it’s not my place to say more than that. Also, some of the groups are still going, but not through HARC.


I agree. Visualization is often key to understanding and identifying non-trivial issues.

Here's a tool a colleague of mine made for inline "visual debugging" for e.g. computer vision, written in c++: https://github.com/lightbits/vdb. I haven't used it myself, but when he presented it I think it made a lot of sense to have these sorts of tools for processing data in real time.


processing(.js, etc)


In my opinion this author produces the best math videos on youtube.

If you can afford it and enjoyed this video, consider supporting him on Patreon. https://www.patreon.com/3blue1brown


Someone on here (I think) recommended his videos on linear algebra a while back and I've since watched them all, several times.

A couple of hours of watching time built an intuition and understanding of linear algebra and the broader maths around it that 4 years of university training didn't give me. That's a little unfair, because I obviously learnt a lot on the courses that make these videos easier to understand, but man, they're so well done.


Totally agree. Had exactly the same experience. Became a patron as a way to offer my thanks!


If you liked him you will love acko.net. Try https://acko.net/blog/how-to-fold-a-julia-fractal/


If you enjoy his videos (and other creators'), please consider signing up on Patreon and supporting them.


This was very timely for me and for anyone else learning, here are the first few videos of the series:

https://www.youtube.com/watch?v=aircAruvnKk

https://www.youtube.com/watch?v=IHZwWFHWa-w

https://www.youtube.com/watch?v=Ilg3gGewQ5U (Original video)

https://www.youtube.com/watch?v=tIeHLnjs5U8


The entire YouTube channel is fantastic. 3Blue1Brown's series on linear algebra is the best I've seen anywhere.


Agreed. Pretty much every video on that channel is just as good as this one.


A bit of a side-note, but I think it is an interesting piece of marketing that Amplify Partners decided to sponsor[1] the previous video in this series. I wonder (and hope) we'll see more VCs sponsoring open educational content relevant to their focus.

[1] https://www.youtube.com/watch?v=IHZwWFHWa-w&t=1205


In 1988 Teuvo Kohonen had an "animation" with rotating disks showing how the perceptron learns: https://youtu.be/Qy3h7kT3P5I?t=42m24s. It did not help comprehension much.


What tools does a person use to make a video like this? I've been wanting to do the same on my topic of expertise for a while now.


He uses custom tools.

However, he actually recommends against using his tools. He suggests a better option is to use traditional animation tools.

I'm actually not sure what one would use for more traditional animations of his style though. I mean, theoretically you can use blender/etc for most 3d things, but how easy would it be to make something math-based there? Hopefully someone with some real animation experience can chime in.


On the Manim Github he has some suggestions: "For 9/10 math animation needs, you'd probably be better off using a more well-maintained tool, like matplotlib, mathematica or even going a non-programatic route with something like After Effects. I also happen to think the program "Grapher" built into osx is really great, and surprisingly versatile for many needs."


> I also happen to think the program "Grapher" built into osx is really great

I didn't know about this, and it's a nice find


There's a really cool story about how the software wasn't supposed to ship with macOS, but the devs got it on anyway: http://www.pacifict.com/Story/


Great story, thank you for sharing! :D


Is there any way to verify that story?


I didn't know that Grapher was still around.



He creates each animation using a set of Python tools and libraries he wrote. You can find them published here: https://github.com/3b1b/manim


Each one of these videos consists of thousands of lines of code. The attention to detail is impressive.


have you looked at visualizations done in mathematica? LoC is not a good measure here. http://community.wolfram.com/content?curTag=graphics%20and%2...


Oh my god, as soon as I saw this video was from 3Blue1Brown I immediately thought "this is gonna be good!". I didn't realize he was posting a Deep Learning series.


I played around with ML a few years back. I took the Andrew Ng course on Coursera and spent some time with some python notebooks - but I never did anything with it beyond just proving that I could follow the examples and implement my own ML solutions for some simple training sets.

Now I have some problems I'd like to solve with ML. So assuming that I understand the basic concepts, what's the HN recommendation for a good library/system to get started with on doing some practical ML with neural nets?

Would TensorFlow be the best way to get into it?


Keras is probably the best library to get started with - TensorFlow is mainly needed if you have to mess with lower-level details, which for many things is not really necessary.
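
A minimal sketch of the Keras workflow (the layer sizes and input shape here are arbitrary placeholders):

    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential([
        Dense(32, activation='relu', input_shape=(784,)),   # hidden layer
        Dense(10, activation='softmax'),                    # output layer
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    # model.fit(x_train, y_train, epochs=5)   # with your own data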


Would you still recommend Keras for non-image work?


Check out MXNet as well.



I dove into this not knowing anything about neural networks, but the feeling I came out of it with was incredible: I love it when something blurry and obscure slowly morphs into a sharper picture in your mind, it's so empowering.


Thank you very much. I have to code a neural network with backpropagation for my AI class. Can anyone recommend a book?


This book is a good practical introduction that walks you through the basic ideas as you develop some basic functionality. http://neuralnetworksanddeeplearning.com/

I'm often pretty skeptical of e-books and self publications, but the above link is pretty good (and the video series linked here references it as well.) The Goodfellow book that another commenter mentioned is a high-quality survey of the field and a nice, high-level overview of different research directions in deep learning, but isn't as pragmatic as an introduction.
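
For a taste of what that book walks you through, here's a tiny from-scratch sketch of backpropagation (my own toy example, not code from the book; it learns XOR):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    Y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.standard_normal((2, 8)); b1 = np.zeros(8)
    W2 = rng.standard_normal((8, 1)); b2 = np.zeros(1)
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    for step in range(5000):
        h = sigmoid(X @ W1 + b1)            # forward pass
        y = sigmoid(h @ W2 + b2)
        dz2 = (y - Y) * y * (1 - y)         # backward pass: chain rule,
        dz1 = (dz2 @ W2.T) * h * (1 - h)    # layer by layer
        W2 -= h.T @ dz2; b2 -= dz2.sum(0)   # gradient descent step (lr = 1)
        W1 -= X.T @ dz1; b1 -= dz1.sum(0)

    print(y.round(2).ravel())               # should approach [0, 1, 1, 0]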


If you're looking to understand the underlying theory behind deep learning, the Deep Learning book by Goodfellow et al. is awesome.

If you're interested in general machine learning, The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman is great; a more applied book is An Introduction to Statistical Learning by (mostly) the same authors. For a more applied view, I'd check out TensorFlow or PyTorch tutorials; there's no good book, as far as I'm aware, because the tech changes so quickly.

I've done a series of videos on how to do deep learning that might be useful; if you're interested, there's a link in my profile.


COMP 551?


This video series is amazing and I wish it existed long ago.


There needs to be some sort of intense advocacy for visualization tools.



