There are several options available to you:
1. Compute the derivatives by hand and then implement them in code.
2. Use a symbolic computation package (Maple, Mathematica, Wolfram Alpha, etc.) to find the derivatives. Some of these packages will translate the resulting formulas directly into code.
3. Use an automatic differentiation (AD) tool that takes a program for computing the cost function and (using compiler-like techniques) produces a program that computes the derivatives as well as the cost function.
4. Use finite difference formulas to approximate the derivatives.
For anything other than the simplest problems (like ordinary least squares), option 1 is a poor choice. Most experts on optimization will tell you that it is very common for users of optimization software to supply incorrect derivative formulas to optimization routines. This typically leads to slow convergence or no convergence at all.
Option 2 is a good one for most relatively simple cost functions. It doesn't require really exotic tools.
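For instance, base R can do a little symbolic differentiation on its own; here is a minimal sketch (the squared-residual expression is my own toy example, not from any particular problem):

```r
# D() symbolically differentiates an expression with respect to a named variable.
f <- expression((y - a * exp(-b * x))^2)  # a toy squared-residual cost
D(f[[1]], "a")  # derivative with respect to parameter a
D(f[[1]], "b")  # derivative with respect to parameter b
# deriv() additionally builds an R function that evaluates value and gradient:
g <- deriv(f[[1]], c("a", "b"), function.arg = c("a", "b", "x", "y"))
```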
Option 3 really shines when the cost function is computed by a fairly complicated program for which you have the source code. However, AD tools are specialized, and not many users of optimization software are familiar with them.
Option 4 is sometimes a necessary choice. If you have a "black box" function that you can't get source code for (or that is so badly written that AD tools can't handle it), finite difference approximations can save the day. However, using finite difference approximations has a significant cost in run time and in the accuracy of the derivatives and, ultimately, of the solutions obtained.
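As a concrete illustration of option 4, here is a minimal central-difference sketch in R (the helper name numeric_gradient and the step size h are my own choices, not from any particular library):

```r
# Central differences: O(h^2) accurate, but cost 2 * length(x) function calls,
# and the choice of h trades truncation error against rounding error.
numeric_gradient <- function(f, x, h = 1e-6) {
  g <- numeric(length(x))
  for (i in seq_along(x)) {
    e <- numeric(length(x))
    e[i] <- h
    g[i] <- (f(x + e) - f(x - e)) / (2 * h)
  }
  g
}
numeric_gradient(function(x) sum(x^2), c(1, 2, 3))  # true gradient is c(2, 4, 6)
```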
For most machine learning applications, options 1 and 2 are perfectly adequate. The loss functions (least squares, logistic regression, etc.) and penalties (one-norm, two-norm, elastic net, etc.) are simple enough that the derivatives are easy to find.
Options 3 and 4 come into play more often in engineering optimization where the objective functions are more complicated.
The Adam paper says, "...many objective functions are composed of a sum of subfunctions evaluated at different subsamples of data; in this case optimization can be made more efficient by taking gradient steps w.r.t. individual subfunctions..." Here, they just mean that the objective function is a sum of errors over training examples, and training can be done on individual examples or minibatches. This is the same as in stochastic gradient descent (SGD), which is more efficient for large scale problems than batch training because parameter updates are more frequent.
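In code, a sketch of that idea might look like this (the names sgd and grad and the minibatch loop are hypothetical illustrations, not from the paper):

```r
# Minibatch SGD sketch: grad(theta, batch) is assumed to return the gradient
# of the loss summed over the training examples indexed by batch.
sgd <- function(theta, n, grad, lr = 0.01, batch_size = 32, epochs = 10) {
  for (epoch in seq_len(epochs)) {
    idx <- sample(n)                                  # reshuffle each epoch
    for (start in seq(1, n, by = batch_size)) {
      batch <- idx[start:min(start + batch_size - 1, n)]
      theta <- theta - lr * grad(theta, batch)        # update after each minibatch
    }
  }
  theta
}
```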
As for why Adam works, it uses a few tricks.
One of these tricks is momentum, which can give faster convergence. Imagine an objective function that's shaped like a long, narrow canyon that gradually slopes toward a minimum. Say we want to minimize this function using gradient descent. If we start from some point on the canyon wall, the negative gradient will point in the direction of steepest descent, i.e. mostly toward the canyon floor. This is because the canyon walls are much steeper than the gradual slope of the canyon toward the minimum. If the learning rate (i.e. step size) is small, we could descend to the canyon floor, then follow it toward the minimum. But, progress would be slow. We could increase the learning rate, but this wouldn't change the direction of the steps. In this case, we'd overshoot the canyon floor and end up on the opposite wall. We would then repeat this pattern, oscillating from wall to wall while making slow progress toward the minimum. Momentum can help in this situation.
Momentum simply means that some fraction of the previous update is added to the current update, so that repeated updates in a particular direction compound; we build up momentum, moving faster and faster in that direction. In the case of the canyon, we'd build up momentum in the direction of the minimum, since all updates have a component in that direction. In contrast, moving back and forth across the canyon walls involves constantly reversing direction, so momentum would help to damp the oscillations in those directions.
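Here is a minimal sketch of gradient descent with classical momentum on a toy "canyon" quadratic of my own choosing (steep in the x2 direction, gently sloped in x1):

```r
# f(x) = x1^2 / 2 + 10 * x2^2: a narrow canyon along the x1 axis.
grad <- function(x) c(x[1], 20 * x[2])
x <- c(10, 1)                  # start on the canyon wall
v <- c(0, 0)                   # velocity: the accumulated update
lr <- 0.08; beta <- 0.9        # step size and momentum coefficient
for (t in 1:200) {
  v <- beta * v - lr * grad(x) # keep a fraction beta of the previous update
  x <- x + v                   # oscillations across the canyon largely cancel
}
x                              # close to the minimum at c(0, 0)
```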
Another trick that Adam uses is to adaptively select a separate learning rate for each parameter. Parameters that would ordinarily receive smaller or less frequent updates receive larger updates with Adam (and the reverse is also true). This speeds learning in cases where the appropriate learning rates vary across parameters. For example, in deep networks, gradients can become small at early layers, and it makes sense to increase learning rates for the corresponding parameters. Another benefit of this approach is that, because learning rates are adjusted automatically, manual tuning becomes less important. Standard SGD requires careful tuning (and possibly online adjustment) of learning rates, but this is less true with Adam and related methods. It's still necessary to select hyperparameters, but performance is less sensitive to them than to SGD learning rates.
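And here is a minimal sketch of the Adam update itself, following Kingma and Ba (2014) with their default hyperparameters, on the same toy quadratic as in the momentum sketch above:

```r
grad <- function(x) c(x[1], 20 * x[2])        # same toy canyon gradient
x <- c(10, 1)
m <- c(0, 0); v <- c(0, 0)                    # first and second moment estimates
lr <- 0.1; beta1 <- 0.9; beta2 <- 0.999; eps <- 1e-8
for (t in 1:500) {
  g <- grad(x)
  m <- beta1 * m + (1 - beta1) * g            # momentum-like average of gradients
  v <- beta2 * v + (1 - beta2) * g^2          # running average of squared gradients
  m_hat <- m / (1 - beta1^t)                  # bias corrections for the zero
  v_hat <- v / (1 - beta2^t)                  # initialization of m and v
  x <- x - lr * m_hat / (sqrt(v_hat) + eps)   # per-parameter effective step size
}
x                                             # close to the minimum at c(0, 0)
```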
Related methods:
Momentum is often used with standard SGD. An improved version is called Nesterov momentum or Nesterov accelerated gradient. Other methods that use automatically tuned learning rates for each parameter include: Adagrad, RMSprop, and Adadelta. RMSprop and Adadelta solve a problem with Adagrad that could cause learning to stop. Adam is similar to RMSprop with momentum. Nadam modifies Adam to use Nesterov momentum instead of classical momentum.
References:
Kingma and Ba (2014). Adam: A Method for Stochastic Optimization.
Goodfellow et al. (2016). Deep learning, chapter 8.
Hinton. Neural Networks for Machine Learning (Coursera course slides).
Dozat (2016). Incorporating Nesterov Momentum into Adam.
Answer to your question 1
Appending a column of $\bf 1$s to the matrix $\bf X$ adds the "intercept" term. Suppose you have $p$ features in your data. Without the appended $\bf 1$ column, you are actually fitting $$y=\theta_1x_1+\theta_2x_2+\cdots+\theta_px_p$$
With the $\bf 1$ column appended, you are fitting
$$y=\theta_0+\theta_1x_1+\theta_2x_2+\cdots+\theta_px_p.$$ Similarly, you can try the following two models in R (using, say, the built-in mtcars dataset) to see the difference:
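```r
lm(mpg ~ wt, data = mtcars)       # fits mpg = theta_0 + theta_1 * wt
```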
vs.
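```r
lm(mpg ~ wt - 1, data = mtcars)   # drops the intercept: mpg = theta_1 * wt
```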
where the first formula gives a fit of $mpg=\theta_0+\theta_1 \cdot wt$ and the second gives a fit of $mpg=\theta_1 \cdot wt$.
Answer to your question 2
In the Coursera course, Andrew Ng spent a lot of time on iterative methods rather than the "analytical solution" / "normal equations". That is, he is teaching you how to obtain the weights with algorithms that run over many iterations, rather than deriving the answer algebraically and calculating the weights from a closed-form formula.
In the linear regression case, an iterative method may not be necessary (in fact, R does not use one: it uses a QR decomposition and solves directly instead of gradient descent), but for many more complicated models, such as neural networks, iterative methods are extremely useful.
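To make "solves directly" concrete, here is a minimal sketch contrasting the QR route with the normal-equations formula, reusing the mtcars example from above:

```r
X <- cbind(1, mtcars$wt)          # design matrix with an appended 1 column
y <- mtcars$mpg
qr.solve(X, y)                    # least squares via QR, as lm() does internally
solve(t(X) %*% X, t(X) %*% y)     # normal equations: (X'X)^{-1} X'y
```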
P.S. I just found that more information can be found in this related post.