
There are proofs of the universal approximation theorem for networks with just one hidden layer.

The proof goes like this:

  1. Create a "bump" function using 2 neurons (a code sketch of this step follows the list).

  2. Combine (infinitely) many of these bump functions at different angles in order to create a tower-like shape.

  3. Decrease the step width/radius to a very small value in order to approximate a cylinder. This is the step I am not convinced by.

  4. Using these cylinders, one can approximate any shape (at this point it is basically just a packing problem).
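
To make step 1 concrete, here is a minimal sketch of the two-neuron bump (my own parameter choices, not the lecture's exact weights):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two sigmoid neurons with a large shared weight and opposite-shifted biases:
# their difference is ~1 on [-radius, radius] and ~0 outside (the "bump").
def bump(x, radius=0.1, steepness=200.0):
    return sigmoid(steepness * (x + radius)) - sigmoid(steepness * (x - radius))

x = np.linspace(-0.5, 0.5, 1000)
plt.plot(x, bump(x))
plt.title("1D bump from two sigmoid neurons")
plt.show()
```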

In this video, at minute 42, the lecturer says:

"In the limit that's going to be a perfect cylinder. If the cylinder is small enough, it's gonna be a perfect cylinder. Right? I have control over the radius."

Here are the slides.

[figure: slide from the lecture]

Here is a pdf version from another university, so you do not have to watch the video.

Why am I not convinced?

I created a program to plot this, and even if I decrease the radius by orders of magnitude, the tower still has the same shape.

Let's start with a simple tower of radius 0.1:

[figure: tower with radius 0.1]

Now let's decrease the radius to 0.01:

[figure: tower with radius 0.01]

Now, you might think that it gets close to a cylinder, but it only looks like it is approximating a perfect cylinder because of the zoomed-out effect.

Let's zoom in:

[figure: zoomed-in view of the tower with radius 0.01]

Let's decrease the radius to 0.0000001.

[figure: tower with radius 0.0000001]

Still not a perfect cylinder. In fact, the "quality" of the cylinder approximation is the same.

Python code to reproduce (requires NumPy and matplotlib): https://pastebin.com/CMXFXvNj.
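
For readers who do not want to open the pastebin, here is a much-simplified, self-contained sketch of the kind of construction I am plotting. It is not the exact script: it uses only 8 directions, and it ties the sigmoid steepness to 1/radius.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower(X, Y, radius, n_directions=8, sharpness=5.0):
    """Sum a bump of the projection onto n_directions different angles, then
    threshold the sum with one more sigmoid so that only the central region,
    where (almost) all bumps overlap, survives.
    NOTE: the steepness is sharpness / radius, i.e. it scales with 1/radius."""
    k = sharpness / radius
    total = np.zeros_like(X)
    for i in range(n_directions):
        theta = np.pi * i / n_directions
        proj = X * np.cos(theta) + Y * np.sin(theta)
        total += sigmoid(k * (proj + radius)) - sigmoid(k * (proj - radius))
    return sigmoid(10.0 * (total - (n_directions - 0.5)))

fig = plt.figure(figsize=(10, 4))
for i, radius in enumerate((0.1, 0.01), start=1):
    xs = np.linspace(-3 * radius, 3 * radius, 200)
    X, Y = np.meshgrid(xs, xs)
    ax = fig.add_subplot(1, 2, i, projection="3d")
    ax.plot_surface(X, Y, tower(X, Y, radius))
    ax.set_title(f"radius = {radius}")
plt.show()
```

With this scaling, the surface at radius 0.01 is by construction just a scaled-down copy of the surface at radius 0.1.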

So my questions are:

Q1 Is it true that we can get a perfect cylinder solely by decreasing the radius of the tower to 0?

Q2 If this is true, why is there no difference when I plot it with different radii (0.1 vs. 1e-7)?

Both towers have the same shape.

Clarification: what do I mean by "same shape"? Let's say we calculate the volume of an actual cylinder (Vc) with the same radius and height as our tower and divide it by the volume of the tower (Vt).

Vc = Volume Cylinder

Vt = Volume Tower

ratio(r) = Vc/Vt

What these documents/lectures claim is that the ratio of these two volumes depends on the radius, but in my view it is simply constant.

So what they are saying is that lim_{r → 0} ratio(r) = 1. But my experiments show that ratio(r) is constant and does not depend on the radius at all, so the limit is just that constant.
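
To make the comparison concrete, ratio(r) can be estimated numerically. This is a sketch using the same construction as above, so it again assumes the sigmoid steepness scales with 1/r; a different scaling would give different numbers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower(X, Y, r, n=8, sharpness=5.0):
    # same construction as in the sketch above, condensed (steepness scales with 1/r)
    k = sharpness / r
    total = sum(sigmoid(k * (X * np.cos(t) + Y * np.sin(t) + r))
                - sigmoid(k * (X * np.cos(t) + Y * np.sin(t) - r))
                for t in np.pi * np.arange(n) / n)
    return sigmoid(10.0 * (total - (n - 0.5)))

def ratio(r, grid=600):
    xs = np.linspace(-3 * r, 3 * r, grid)
    X, Y = np.meshgrid(xs, xs)
    dx = xs[1] - xs[0]
    Vt = tower(X, Y, r).sum() * dx * dx   # numerical volume under the tower
    Vc = np.pi * r ** 2 * 1.0             # ideal cylinder: same radius, height 1 (the tower's height is just below 1)
    return Vc / Vt

for r in (0.1, 0.01, 1e-4):
    print(f"r = {r:g}   Vc/Vt = {ratio(r):.4f}")
```

With this particular scaling the printed ratio comes out the same for every r, because tying the steepness to 1/r makes the whole surface an exact rescaling of the radius-1 surface.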

Q3 Preface

An objection I got multiple times, once from Dutta and once from D.W., is that just decreasing the radius and plotting it isn't mathematically rigorous.

So let's assume that, in the limit r = 0, it really is a perfect cylinder.

One possible explanation for this would be that the limit is a special case and that one cannot gradually approximate towards it.

But if that is true, it would imply that the construction is useless, since it is impossible to have a radius of exactly zero. It would only be useful if we could get gradually closer to a perfect cylinder by decreasing the radius.

Q3 So why should we even care about this, then?

Further Clarifications

The original universal approximation theorem proof for single-hidden-layer neural networks is due to G. Cybenko. People then tried to give visual explanations for it. I am NOT questioning the paper! I am questioning the visual explanation given in the linked lecture/PDF (made by other people).

KoKlA
  • What's your question? I don't see a question here. A question usually ends with a "?". You can't verify or disprove a statement about what happens "in the limit" by looking at finite instances. – D.W. Jan 19 '21 at 17:45
  • There is a question mark in the title. Well, if I decrease the width to a very small value, the shape of the tower should change and get closer to a cylinder, imo. If the shape does not change at all, there is no use for the universal approximation theorem. It's only useful if the tower gets closer to a cylinder as the step width decreases. If it's only a perfect cylinder when the step width is exactly zero, but a completely different shape when it's 0.0000000000001 above zero, there is no use for it. – KoKlA Jan 19 '21 at 18:39
  • Let me try to put it into mathematical terms as well as I can. Let's define ts as the shape of the tower. ts depends on the number of neurons used and on the step width (w). Let's ignore the neurons for now, so we have ts(w). So you are saying that ts(w=0) is a perfect cylinder? But ts(w=1) = ts(w=0.1) = ts(w=0.01) = ts(w=0.001) are all the same shape. So ts would be a discontinuous function. Well, if this is really true, then what's the point of the universal approximation theorem for a single hidden layer? – KoKlA Jan 19 '21 at 19:01
  • Could you [edit] the question to include a self-contained statement of the theorem, and then tell us what kind of answer you are looking for? I don't really want to watch an external video to understand what you are asking; and we discourage questions that rely on us to click external links to understand what is being asked. Slides and informal statements like "nearly perfect cylinder" are not a substitute for an actual mathematical exposition with a precise statement of the theorem. – D.W. Jan 19 '21 at 19:57
  • It seems like the fact that you have to zoom in farther and farther to see the inaccuracy in the approximation is an indication that there is a sense in which it is becoming a closer approximation to a perfect cylinder. No, it will never be a perfect cylinder for any finite value, but that is not a contradiction. I am not sure what kind of answer you are looking for, though, or exactly what the question is. – D.W. Jan 19 '21 at 20:00
  • When I zoom in I can see that the shape of the tower is exactly the same. So it doesn't get any closer to a cylinder. It's the same shape, just smaller. There is no difference in the quality of the shape. – KoKlA Jan 19 '21 at 20:10
  • So it's not becoming a closer approximation of a cylinder. I'm aware that it will never become a perfect cylinder, BUT I expect the shape to change. BUT it does NOT change; it's exactly the same shape, just smaller. – KoKlA Jan 19 '21 at 20:23
  • I'll update the question later to include some clarifications. – KoKlA Jan 19 '21 at 20:24
  • I suspect you've misunderstood the claim. I'll look forward to the revision. – D.W. Jan 19 '21 at 21:25
  • I saw your clarification. It seems you are unaware of the concept of 'limits'. Don't worry, even big mathematicians are confused about what happens at infinity (we just use standard axioms). But anyway, there are certain ways to tackle such things; without knowing them you simply wouldn't be able to understand the proof (if they have used limits in their proof). –  Jan 21 '21 at 12:56
  • And about why care about this? Integration, differentiation are based on exactly the same concept. Your doubts should have arisen while studying calculus. What you are asking is more of a (topic called) Real Analysis question rather than Neural Nets. –  Jan 21 '21 at 12:58
  • It's really unfortunate that you started a bounty. I didn't yet follow the discussions above, but it looks like you now have many different questions in this post. It's not a good thing to have multiple questions in the same post. You should focus on the simplest problem that you have that you need to solve before moving to the next and ask a question about that one. Everything else should be moved to the next post. Please, read [this](https://meta.stackexchange.com/a/39224/287113). – nbro Jan 21 '21 at 14:35
  • So, what I suggest that you do, although you already started a bounty, is to simplify as much as possible this post and ask the first simplest question that you need an answer to, while providing all the necessary context for us to be able to answer such a question. If you have further questions, you should ask them in separate posts. In any case, I understand that when someone is confused about something, often, we don't really know what the best first question to ask is, and that's why often we end up asking multiple questions (this also happened to me). – nbro Jan 21 '21 at 14:37
  • Hey @KoKlA! Did this question ever get resolved? I realise that I have the exact same question. It seems to me that the method of stacking lots of bumps doesn't yield a "circular tower", no matter how small you make the radius. Have a look at my question, which is the same as yours: https://datascience.stackexchange.com/questions/106905/constructing-circular-towers-to-show-that-single-hidden-layer-feedforward-neural – Just_a_fool Jan 10 '22 at 12:34
  • @Just_a_fool Hey, I did a lot of work related to this topic. I haven't yet had time to fully write everything up, but I am planning to do that in the future. I am going to have a look at your question soon. – KoKlA Jan 10 '22 at 20:05
  • @KoKlA Amazing. Thank you for your willingness to share. I am looking forward to seeing what you have found out whenever it's done! – Just_a_fool Jan 11 '22 at 10:33

2 Answers


The more I think about it, the more convinced I am that the visual explanation from the linked lecture is wrong. The good news is that there are still ways to get close to the cylinder, not before the activation of the last neuron but afterwards. I haven't done it with a sigmoid yet; for now I have tried ReLU instead.

We can cut the tower off at the very top (thanks to ReLU and a bias). The closer to the top we cut it, the more it will look like a cylinder.

We can control the height of the tower with the weights.

First in 2D:

[figure: a single cut-off tower in 2D]
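
In code, the 2D idea looks roughly like this (illustrative weights, not the exact ones behind the plot above): three ReLU neurons form a tent, a bias cuts off everything below a level near the peak, the output weight amplifies the sliver that remains, and one more ReLU with a bias caps the height at 1.

```python
import numpy as np
import matplotlib.pyplot as plt

def relu(z):
    return np.maximum(z, 0.0)

r = 0.1           # half-width of the tent
cut = 0.95 * r    # bias: keep only the part of the tent above this level (near its peak)
gain = 1000.0     # output weight: amplify what remains above the cut

x = np.linspace(-0.15, 0.15, 4000)

# tent-shaped tower from three ReLU neurons: height r at x = 0, zero outside [-r, r]
tent = relu(x + r) - 2.0 * relu(x) + relu(x - r)

# cut near the top, amplify, then cap the output at 1
# (min(a, 1) = a - relu(a - 1), i.e. one more ReLU with a bias after the sum)
top = gain * relu(tent - cut)
block = top - relu(top - 1.0)

plt.plot(x, tent / r, label="tent (normalized)")
plt.plot(x, block, label="cut near the top, amplified, capped")
plt.legend()
plt.show()
```

With these numbers the sloped sides take up only a small fraction of the block's width; increasing the gain (relative to how close to the peak we cut) makes the sides steeper and the profile closer to a rectangle.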

Unfortunately, the closer we put these towers together, the more they start to influence each other.

[figure: two towers close together influencing each other]

But we can counter that with a negative tower between them.

[figure: adding a negative tower between the two towers]

Now in 3D:

[figure: the construction in 3D]

This answer is a work in progress; I will update it when I find out something new.

KoKlA

I think you misunderstood that part of the proof: you first need to take a limit in the number of neurons to get closer and closer to a cylinder. You are keeping it constant at 1000, thus indeed not getting any closer to the cylinder, and you see exponentially vanishing behavior.

Once you have the "epsilon-perfect" circles/cylinders, then you make them smaller and smaller, thus needing more and more copies of the "1-circle" setup.

My understanding is that this proof has those two numbers going to infinity: neurons-to-approximate-cylinder, and number-of-cylinders. You took into account the latter, but not the former.
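
As a rough illustration of the first limit (a sketch using one common sigmoid-strip construction with made-up parameters, not the code from the question): the footprint of the tower is essentially a polygon with one pair of sides per direction, and it only becomes round as the number of directions grows.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower(X, Y, radius, n_directions, sharpness=20.0):
    """n_directions steep sigmoid 'strips' through the origin; the final sigmoid
    keeps only the region where all of them overlap (roughly a 2*n_directions-gon)."""
    k = sharpness / radius
    total = np.zeros_like(X)
    for i in range(n_directions):
        theta = np.pi * i / n_directions
        proj = X * np.cos(theta) + Y * np.sin(theta)
        total += sigmoid(k * (proj + radius)) - sigmoid(k * (proj - radius))
    return sigmoid(10.0 * (total - (n_directions - 0.5)))

radius = 0.1
xs = np.linspace(-2 * radius, 2 * radius, 400)
X, Y = np.meshgrid(xs, xs)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, n in zip(axes, (3, 8, 32)):
    ax.contourf(X, Y, tower(X, Y, radius, n))
    ax.set_title(f"{n} directions")
    ax.set_aspect("equal")
plt.show()
```

With 3 directions the footprint is a hexagon; only as the number of directions grows does it start to look like a disk, independently of how small the radius is.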

etal