The following is my understanding of what happens: if I take a "two dimensional problem", e.g. with $x$ as the input and $y$ as the outcome, and I add a feature $x^2$, this gives the problem an additional dimension. The linear fit of $y$ on $x$ defines a line, the linear fit of $y$ on $x^2$ defines another line, and the two lines define a plane, which is the best fit. Is this correct? How does this translate back to the two-dimensional space? Does it somehow show up in two dimensions as curvy? How?
Solved – What makes linear regression with polynomial features curvy
polynomialregression
Related Solutions
Here's the deal:
Technically you did write true sentences (both models can approximate any 'not too crazy' function given enough parameters), but those sentences do not get you anywhere at all!
Why is that? Well, take a closer look at the universal approximation theorem, or at any other formal proof that a neural network can compute any $f(x)$ if there are ENOUGH neurons.
All such proofs that I have seen use only one hidden layer.
Take a quick look here http://neuralnetworksanddeeplearning.com/chap5.html for some intuition. There are results showing that, in a sense, the number of neurons needed grows exponentially if you use only one layer.
So, while in theory you are right, in practice you do not have an infinite amount of memory, so you don't really want to train a $2^{1000}$-neuron net, do you? And even if you did have an infinite amount of memory, that net would overfit for sure.
To my mind, the most important point of ML is the practical one! Let's expand on that a little. The real issue here is not just that polynomials increase/decrease very quickly outside the training set. Not at all. As a quick example, each pixel of a picture lies within a very specific range ($[0, 255]$ for each RGB channel), so you can rest assured that any new sample will fall within the range of values in your training set. No. The big deal is: this comparison is not useful to begin with!
I suggest you experiment a bit with MNIST, and see what results you can actually get using just one hidden layer.
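A minimal sketch of such an experiment, using scikit-learn's small `digits` dataset as a stand-in for MNIST (the dataset choice and all hyperparameters here are my illustrative assumptions, not part of the original answer):

```python
# Single-hidden-layer network on a small digit dataset
# (stand-in for MNIST; hyperparameters are illustrative).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)  # 1797 samples of 8x8 images
X_train, X_test, y_train, y_test = train_test_split(
    X / 16.0, y, test_size=0.25, random_state=0)

# One hidden layer with 64 units: the setting the approximation proofs cover.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(f"test accuracy: {acc:.3f}")
```

Even one hidden layer does reasonably well on a task this small; the point of the exercise is to see how the required width and the achievable accuracy change as the task gets harder.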
Practical nets use far more than one hidden layer, sometimes dozens (well, ResNet even more...) of layers. For a reason. That reason is not proved, and in general, choosing an architecture for a neural net is a hot area of research. In other words, while we still need to know more, both models you compared (linear regression and a NN with just one hidden layer) are, for many datasets, not useful whatsoever!
By the way, if you get into ML, there is another 'useless' theorem which is actually a current area of research: PAC (probably approximately correct) learning / VC dimension. I will expand on that as a bonus:
If universal approximation basically states that, given an infinite amount of neurons, we can approximate any function (thank you very much?), what PAC says in practical terms is that, given a (practically!) infinite amount of labelled examples, we can get as close as we want to the best hypothesis within our model. It was absolutely hilarious when I calculated the actual number of examples needed for a practical net to be within some practical desired error rate with some OK-ish probability :) It was more than the number of electrons in the universe. P.S. To boot, it also assumes that the samples are IID (which is never, ever true!).
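To make that bonus concrete, here is a back-of-the-envelope version of the calculation, using one classical sufficient sample-complexity bound from PAC/VC theory; the assumed VC dimension of $10^8$ (roughly the weight count of a large net) and the target error/confidence are my illustrative choices, and the exact astronomical figure you get depends heavily on them:

```python
import math

# One classical PAC/VC sufficient sample-complexity bound:
#   m >= (1/eps) * (4*log2(2/delta) + 8*d*log2(13/eps))
# where d is the VC dimension, eps the target error, delta the failure probability.
def vc_sample_bound(d, eps, delta):
    return (4 * math.log2(2 / delta) + 8 * d * math.log2(13 / eps)) / eps

d = 1e8      # assumed VC dimension of a "practical" net (illustrative)
eps = 0.01   # desired excess error
delta = 0.01 # allowed failure probability
m = vc_sample_bound(d, eps, delta)
print(f"labelled examples required by the bound: {m:.2e}")
```

With these numbers the bound lands around $10^{12}$ examples, already vastly more than any labelled dataset in existence; cruder bounds or larger assumed VC dimensions inflate it much further.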
Best Answer
This is a piece of a plane in 3D.
Here is the same plane with coordinates shown and a set of points selected along its $x$ axis.
The third coordinate is used to plot the squares of these $x$ values, producing points along a parabola at the base of the coordinate box.
A vertical "curtain" through the parabola intersects the plane at all the points directly above the parabola. This intersection is a curve.
A polynomial model supposes the response $y$ (graphed in the vertical direction) differs from the height of this plane by random amounts. The values of $y$ corresponding to these $x$ coordinates are shown as red dots.
Consequently, the $(x,y)$ points lie along a curve (the projection of this intersection) rather than a line, even though the model of the response is based on the plane originally shown.
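The construction above can be checked numerically: fit a model that is linear in the columns $(1, x, x^2)$, which is a plane in the three-dimensional $(x, x^2, y)$ space, and observe that its graph over $x$ alone is a parabola. A minimal NumPy sketch (the particular quadratic and noise level are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
# True response: a quadratic in x, plus random noise.
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

# Design matrix: the model is LINEAR in the columns (1, x, x^2) ...
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # fits the plane in (x, x^2, y) space

# ... but its graph over x alone is the curve where the vertical "curtain"
# through the parabola (x, x^2) intersects that plane.
y_hat = X @ beta
print("fitted coefficients:", beta)  # should recover roughly [1, 2, -0.5]
```

Plotting `y_hat` against `x` draws the curvy fit from the question, even though the fitted object is a flat plane in the lifted feature space.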
Moral