We have
$\frac{d}{d\beta} (y - X \beta)' (y - X\beta) = -2 X' (y - X \beta)$.
This can be shown by writing the expression out explicitly in components: write $(\beta_{1}, \ldots, \beta_{p})'$ instead of $\beta$, take the derivative with respect to each of $\beta_{1}, \beta_{2}, \ldots, \beta_{p}$, and stack the results to get the answer. For a quick and easy illustration, start with $p = 2$.
With experience one develops general rules for such matrix derivatives, some of which are collected in standard references.
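If you want to convince yourself of the identity without grinding through components, here is a quick numerical sanity check (a minimal sketch assuming NumPy; the sizes, seed, and data are arbitrary). It compares a central finite-difference gradient of $S(\beta) = (y - X\beta)'(y - X\beta)$ against $-2X'(y - X\beta)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3                       # arbitrary small sizes for the check
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

def S(b):
    r = y - X @ b                 # residual vector y - X beta
    return r @ r                  # (y - X beta)'(y - X beta)

# Central finite-difference gradient of S at beta;
# S is quadratic, so this is exact up to floating-point error.
eps = 1e-6
numeric = np.array([
    (S(beta + eps * np.eye(p)[j]) - S(beta - eps * np.eye(p)[j])) / (2 * eps)
    for j in range(p)
])

analytic = -2 * X.T @ (y - X @ beta)
print(np.allclose(numeric, analytic, atol=1e-5))   # True
```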
Edit, to address the added part of the question:
With $p = 2$, we have
$(y - X \beta)'(y - X \beta) = (y_1 - x_{11} \beta_1 - x_{12} \beta_2)^2 + (y_2 - x_{21}\beta_1 - x_{22} \beta_2)^2$
The derivative with respect to $\beta_1$ is
$-2x_{11}(y_1 - x_{11} \beta_1 - x_{12} \beta_2)-2x_{21}(y_2 - x_{21}\beta_1 - x_{22} \beta_2)$
Similarly, the derivative with respect to $\beta_2$ is
$-2x_{12}(y_1 - x_{11} \beta_1 - x_{12} \beta_2)-2x_{22}(y_2 - x_{21}\beta_1 - x_{22} \beta_2)$
Hence, the derivative with respect to $\beta = (\beta_1, \beta_2)'$ is
$
\left(
\begin{array}{c}
-2x_{11}(y_1 - x_{11} \beta_1 - x_{12} \beta_2)-2x_{21}(y_2 - x_{21}\beta_1 - x_{22} \beta_2) \\
-2x_{12}(y_1 - x_{11} \beta_1 - x_{12} \beta_2)-2x_{22}(y_2 - x_{21}\beta_1 - x_{22} \beta_2)
\end{array}
\right)
$
Now, observe that you can rewrite the last expression as
$-2\left(
\begin{array}{cc}
x_{11} & x_{21} \\
x_{12} & x_{22}
\end{array}
\right)\left(
\begin{array}{c}
y_{1} - x_{11}\beta_{1} - x_{12}\beta_2 \\
y_{2} - x_{21}\beta_{1} - x_{22}\beta_2
\end{array}
\right) = -2 X' (y - X \beta)$
Of course, everything goes through in the same way for larger $p$.
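For the $p = 2$ case just worked out, you can also let a computer algebra system do the stacking and confirm it matches the matrix formula symbolically (a sketch assuming SymPy; the symbol names simply mirror the notation above):

```python
import sympy as sp

# Symbols for the p = 2 case written out above
y1, y2, b1, b2 = sp.symbols('y1 y2 beta1 beta2')
x11, x12, x21, x22 = sp.symbols('x11 x12 x21 x22')

X = sp.Matrix([[x11, x12], [x21, x22]])
y = sp.Matrix([y1, y2])
b = sp.Matrix([b1, b2])

S = ((y - X * b).T * (y - X * b))[0, 0]             # scalar sum of squares

grad = sp.Matrix([sp.diff(S, b1), sp.diff(S, b2)])  # stacked partials
formula = -2 * X.T * (y - X * b)                    # matrix formula

print(sp.simplify(grad - formula))                  # Matrix([[0], [0]])
```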
The principle underlying least squares regression is that the sum of the squares of the errors is minimized. We can use calculus to find equations for the parameters $\beta_0$ and $\beta_1$ that minimize the sum of the squared errors, $S$.
$$S = \sum_{i=1}^n e_i^2 = \sum \left(y_i - \hat{y}_i \right)^2 = \sum \left(y_i - \beta_0 - \beta_1 x_i\right)^2$$
We want to find $\beta_0$ and $\beta_1$ that minimize the sum, $S$. We start by taking the partial derivative of $S$ with respect to $\beta_0$ and setting it to zero.
$$\frac{\partial S}{\partial \beta_0} = \sum 2\left(y_i - \beta_0 - \beta_1 x_i\right)(-1) = 0$$
$$\sum \left(y_i - \beta_0 - \beta_1x_i\right) = 0 $$
$$\sum \beta_0 = \sum y_i -\beta_1 \sum x_i $$
$$n\beta_0 = \sum y_i -\beta_1 \sum x_i $$
$$\beta_0 = \frac{1}{n}\sum y_i -\beta_1 \frac{1}{n}\sum x_i \tag{1}$$
$$\beta_0 = \bar y - \beta_1 \bar x \tag{*} $$
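A quick consequence worth noting: substituting $x = \bar{x}$ into the fitted line gives $\hat{y} = \beta_0 + \beta_1 \bar{x} = (\bar{y} - \beta_1 \bar{x}) + \beta_1 \bar{x} = \bar{y}$, so the least-squares line always passes through the point of means $(\bar{x}, \bar{y})$.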
Now take the partial derivative of $S$ with respect to $\beta_1$ and set it to zero.
$$\frac{\partial S}{\partial \beta_1} = \sum 2\left(y_i - \beta_0 - \beta_1 x_i\right)(-x_i) = 0$$
$$\sum x_i \left(y_i - \beta_0 - \beta_1x_i\right) = 0$$
$$\sum x_iy_i - \beta_0 \sum x_i - \beta_1 \sum x_i^2 = 0 \tag{2}$$
Substitute $(1)$ into $(2)$:
$$\sum x_iy_i - \left( \frac{1}{n}\sum y_i -\beta_1 \frac{1}{n}\sum x_i\right) \sum x_i - \beta_1 \sum x_i^2 = 0 $$
$$\sum x_iy_i - \frac{1}{n} \sum x_i \sum y_i + \beta_1 \frac{1}{n} \left( \sum x_i \right) ^2 - \beta_1 \sum x_i^2 = 0 $$
$$\sum x_iy_i - \frac{1}{n} \sum x_i \sum y_i = - \beta_1 \frac{1}{n} \left( \sum x_i \right) ^2 + \beta_1 \sum x_i^2 $$
$$\sum x_iy_i - \frac{1}{n} \sum x_i \sum y_i = \beta_1 \left(\sum x_i^2 - \frac{1}{n} \left( \sum x_i \right) ^2 \right) $$
$$\beta_1 = \frac{\sum x_iy_i - \frac{1}{n} \sum x_i \sum y_i}{\sum x_i^2 - \frac{1}{n} \left( \sum x_i \right)^2} = \frac{\operatorname{cov}(x,y)}{\operatorname{var}(x)}\tag{*}$$
where the last equality follows on dividing the numerator and denominator by $n$.
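As a check on the closed-form expressions $(*)$, here is a short sketch (assuming NumPy; the toy data and seed are made up for illustration) that computes $\beta_0$ and $\beta_1$ from the formulas above and compares them with a library fit:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)   # toy data with a known line

# Closed-form estimates from the derivation above
n = len(x)
b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x**2) - np.sum(x)**2 / n)
b0 = y.mean() - b1 * x.mean()

# Compare with a library fit (polyfit returns slope first for deg=1)
b1_ref, b0_ref = np.polyfit(x, y, deg=1)
print(np.allclose([b0, b1], [b0_ref, b1_ref]))   # True
```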
As noted above, the residuals are defined by
$e_i = y_i - \hat{y}_i = y_i - ( \hat{\beta}_0 + \hat{\beta}_1 x_i ) = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$
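The two first-order conditions derived above say exactly that these residuals sum to zero and are orthogonal to the $x_i$. A short numerical check (again a sketch assuming NumPy, with made-up toy data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 1.0 - 0.5 * x + rng.normal(size=30)

b1, b0 = np.polyfit(x, y, deg=1)           # fitted slope and intercept
e = y - b0 - b1 * x                        # residuals e_i = y_i - b0 - b1 x_i

# The first-order conditions above say exactly this:
print(np.isclose(e.sum(), 0.0))            # sum of residuals is zero
print(np.isclose((x * e).sum(), 0.0))      # residuals orthogonal to x
```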