How to “see” that calculus works for multidimensional problems

Tags: calculus, gradient-descent, intuition, multivariable-calculus

Let's say I have some function f(x) = x^2 + b. I can see what's going on, and I can work out the slope geometrically even without knowing the rules of derivatives.

When I need to minimize some cost function for a linear problem (linear regression with gradient descent), I just need to picture it in 3D, and I can "see" it and be quite confident about why and how it works.

How can I "see" or get intuition that calculus works for multidimensional problems? Let's say I have a problem with many variables like:

f(area_km, bedrooms, ...) = theta0 + theta1 * area_km + theta2 * bedrooms + ...

If I want to apply gradient descent, I know I need to calculate the partial derivatives, multiply them by a learning rate, and so on. It works. But it's kinda magical that it works.

I am sorry if this is a silly question, I am just beginning.

Best Answer

For the most general case, think about a mixing board.

Courtesy "I G", cc-by 2.0. Flickr: https://www.flickr.com/photos/qubodup/14730512076/

Each input argument to the function is represented by a slider with an associated piece of a real number line along one side, just like in the picture. If you are thinking of a function which can accept arbitrary real number inputs, the slider will have to be infinitely long, which of course is not possible in real life, but is in the imaginary, ideal world of mathematics. This mixing board also has a dial on it, which displays the number corresponding to the function's output.

The partial derivative of the function with respect to one of its input arguments corresponds to how sensitive the readout on the dial is when you wiggle the slider representing that argument just a little bit around wherever it's currently set - that is, the ratio of the change in the readout to the size of your wiggle. If you wiggle a slider by, say, 0.0001, and the value changes by 0.0002, the partial derivative with respect to that variable at the given setting is (likely only approximately) 0.0002 / 0.0001 = 2. If the value changes in the opposite sense, i.e. goes down when you move the slider up, the derivative is negative.
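To make the wiggle concrete, here is a minimal sketch in Python. The function `f`, its coefficients, and the helper name `wiggle` are all made up for illustration; the only idea taken from the text above is "nudge one slider, divide the change on the dial by the size of the nudge".

```python
def f(area_km, bedrooms):
    # Hypothetical mixing board with two sliders; the coefficients are invented.
    return 3.0 + 2.0 * area_km + 0.5 * bedrooms ** 2

def wiggle(func, settings, index, h=1e-4):
    """Estimate the partial derivative with respect to settings[index]
    by nudging only that slider by h and watching the dial."""
    nudged = list(settings)
    nudged[index] += h
    return (func(*nudged) - func(*settings)) / h

settings = [1.0, 3.0]            # current slider positions
d_area = wiggle(f, settings, 0)  # sensitivity to the first slider
d_bed = wiggle(f, settings, 1)   # sensitivity to the second slider
print(d_area, d_bed)
```

For this particular `f` the exact partial derivatives at (1, 3) are 2 and 3, so the printed values should land very close to those numbers, though only approximately, since the wiggle is small but not "ideally small".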

The gradient, then, is the ordered list of signed proportions by which you have to "wiggle" all the sliders at once so as to achieve the strongest possible, but still small, positive wiggle in the value on the dial. This is a vector, because you can think of vectors as ordered lists of quantities that can be subjected to elementwise addition and to multiplication of every element by a single number. Gradient descent moves the sliders in the opposite proportions, which gives the strongest possible small decrease instead.
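One way to convince yourself of the "strongest possible wiggle" claim is to try it numerically. The sketch below uses a hypothetical three-slider function (any smooth function would do) and compares a small step along the estimated gradient with 1000 random steps of the same length. The names `gradient`, `unit`, and `gain` are my own helpers, not anything standard.

```python
import math
import random

def f(x, y, z):
    # Hypothetical three-slider function, chosen only for illustration.
    return x * x + 3.0 * y - math.sin(z)

def gradient(func, point, h=1e-6):
    """Estimate every partial derivative by nudging one slider at a time."""
    base = func(*point)
    grad = []
    for i in range(len(point)):
        nudged = list(point)
        nudged[i] += h
        grad.append((func(*nudged) - base) / h)
    return grad

def unit(v):
    """Scale a vector to length 1 so all wiggles have the same total size."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def gain(func, point, direction, step=1e-3):
    """Change on the dial after moving all sliders by step * direction."""
    moved = [p + step * d for p, d in zip(point, direction)]
    return func(*moved) - func(*point)

point = [1.0, 2.0, 0.5]
grad_dir = unit(gradient(f, point))
best_gain = gain(f, point, grad_dir)

random.seed(0)  # fixed seed so the comparison is repeatable
random_gains = []
for _ in range(1000):
    d = unit([random.gauss(0, 1) for _ in point])
    random_gains.append(gain(f, point, d))

print(best_gain, max(random_gains))
```

The gain along the gradient direction should come out at least as large as the best of the random directions, up to tiny numerical error. Flipping the sign of `grad_dir` gives the steepest small decrease, which is the step gradient descent takes.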

And when I say "small" here I mean "ideally small" - i.e. "just on the cusp of being zero", which, of course, you can make formally rigorous in a number of ways, such as by using limits.
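The limit idea can also be watched in action. In this small sketch (again with an invented function) the wiggle size `h` is shrunk step by step, and the estimated partial derivative settles toward a single number. For this `f`, a bit of algebra shows the estimate is exactly 12 + 2h, so the limit as h goes to zero is 12.

```python
def f(x, y):
    # Hypothetical two-slider function; the exact partial derivative
    # with respect to x is 2 * x * y.
    return x * x * y

def estimate(h, x=3.0, y=2.0):
    """Wiggle only the x slider by h and return change-on-dial / h."""
    return (f(x + h, y) - f(x, y)) / h

exact = 2 * 3.0 * 2.0  # 12.0 at the setting (3, 2)
wiggles = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
errors = [abs(estimate(h) - exact) for h in wiggles]

for h, err in zip(wiggles, errors):
    print(h, estimate(h), err)
```

Each smaller wiggle gives an estimate closer to 12. On a real computer you cannot shrink `h` forever, because floating-point rounding eventually takes over, which is one more reason the rigorous definition is stated with limits rather than with an actual tiny number.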