I'm going to show this using partial differentiation.
Consider the assumed linear model
$$y_i = \mathbf{x}_i^{T}\boldsymbol\beta + \epsilon_i$$
where $y_i, \epsilon_i \in \mathbb{R}$ and $\mathbf{x}_i=\begin{bmatrix}
x_{i0} \\
x_{i1} \\
\vdots \\
x_{ip}
\end{bmatrix}, \boldsymbol\beta = \begin{bmatrix}
\beta_0 \\
\beta_1 \\
\vdots \\
\beta_p
\end{bmatrix} \in \mathbb{R}^{p+1}$ for $i = 1, \dots, n$, with $x_{i0} = 1$.
Our aim is to solve for $\hat{\boldsymbol\beta}$ by minimizing the residual sum of squares, or minimizing $$\text{RSS}(\boldsymbol\beta) = \sum_{i=1}^{n}(y_i-\mathbf{x}_i^{T}\boldsymbol\beta)^2\text{.}$$
To compute this sum, consider the vector of residuals
$$\mathbf{e}=\begin{bmatrix}
y_1 - \mathbf{x}_1^{T}\boldsymbol\beta \\
y_2 - \mathbf{x}_2^{T}\boldsymbol\beta \\
\vdots \\
y_n - \mathbf{x}_n^{T}\boldsymbol\beta
\end{bmatrix}$$
Then $\text{RSS}(\boldsymbol\beta) = \mathbf{e}^{T}\mathbf{e}$. Our next step is to find the partial derivatives of $\text{RSS}(\boldsymbol\beta)$ with respect to each component of $\boldsymbol\beta$.
To do this, note that for $k = 0, 1, \dots, p$,
$$\dfrac{\partial \text{RSS}}{\partial \beta_k}=\dfrac{\partial}{\partial\beta_k}\left\{\sum_{i=1}^{n}\left[y_i- \sum_{j=0}^{p}\beta_jx_{ij}\right]^2 \right\}=-2\sum_{i=1}^{n}x_{ik}\left(y_i - \sum_{j=0}^{p}\beta_jx_{ij}\right)\text{.}$$
"Stacking" these, we obtain
$$\begin{align}
\dfrac{\partial \text{RSS}}{\partial \boldsymbol\beta}&=\begin{bmatrix}
\dfrac{\partial \text{RSS}}{\partial \beta_0} \\
\dfrac{\partial \text{RSS}}{\partial \beta_1} \\
\vdots \\
\dfrac{\partial \text{RSS}}{\partial \beta_p}
\end{bmatrix} \\
&= \begin{bmatrix}
-2\sum_{i=1}^{n}x_{i0}\left(y_i - \sum_{j=0}^{p}\beta_jx_{ij}\right) \\
-2\sum_{i=1}^{n}x_{i1}\left(y_i - \sum_{j=0}^{p}\beta_jx_{ij}\right) \\
\vdots \\
-2\sum_{i=1}^{n}x_{ip}\left(y_i - \sum_{j=0}^{p}\beta_jx_{ij}\right)
\end{bmatrix} \\
&= -2\begin{bmatrix}
\sum_{i=1}^{n}x_{i0}(y_i-\mathbf{x}_i^{T}\boldsymbol\beta)\\
\sum_{i=1}^{n}x_{i1}(y_i-\mathbf{x}_i^{T}\boldsymbol\beta) \\
\vdots \\
\sum_{i=1}^{n}x_{ip}(y_i-\mathbf{x}_i^{T}\boldsymbol\beta)
\end{bmatrix} \\
&= -2\left(\begin{bmatrix}
\sum_{i=1}^{n}x_{i0}y_i\\
\sum_{i=1}^{n}x_{i1}y_i \\
\vdots \\
\sum_{i=1}^{n}x_{ip}y_i
\end{bmatrix} - \begin{bmatrix}
\sum_{i=1}^{n}x_{i0}\mathbf{x}_i^{T}\boldsymbol\beta\\
\sum_{i=1}^{n}x_{i1}\mathbf{x}_i^{T}\boldsymbol\beta \\
\vdots \\
\sum_{i=1}^{n}x_{ip}\mathbf{x}_i^{T}\boldsymbol\beta
\end{bmatrix}\right)\\
&= -2(\mathbf{X}^{T}\mathbf{y}-\mathbf{X}^{T}\mathbf{X}\boldsymbol\beta)\text{.}
\end{align}$$
where $$\mathbf{X} = \begin{bmatrix}
\mathbf{x}_1^{T} \\
\mathbf{x}_2^{T} \\
\vdots \\
\mathbf{x}_n^{T}
\end{bmatrix}\text{.}$$
Setting $\dfrac{\partial \text{RSS}}{\partial \boldsymbol\beta} = \mathbf{0}$, we obtain $$\mathbf{X}^{T}\mathbf{X}\boldsymbol\beta = \mathbf{X}^{T}\mathbf{y}$$
and assuming $\mathbf{X}^{T}\mathbf{X}$ is invertible,
$$\hat{\boldsymbol\beta} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\text{.}$$
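As a quick numerical sanity check, the closed-form solution matches a generic least-squares solver. This is a minimal NumPy sketch with simulated data; the variable names and the simulated design are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Design matrix with an intercept column (x_{i0} = 1), as in the model above.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed-form solution: beta_hat = (X'X)^{-1} X'y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Compare against NumPy's generic least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```

In practice one would solve the normal equations with `np.linalg.solve` (or use `lstsq` directly) rather than forming the inverse explicitly, but the explicit inverse mirrors the formula derived above.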
I believe you’re asking for the intuition behind those three properties of the hat matrix, so I’ll try to rely on intuition alone and use as little math and as few higher-level linear algebra concepts as possible.
Preliminaries
Start with the fact that the projection matrix $P$ allows you to obtain the orthogonal projection of an arbitrary vector onto the column space of X. Let’s use $v_p$ for the orthogonal projection of $v$:
$$
P v = v_p
$$
You can use $P$ to decompose any vector $v$ into two components that are orthogonal to each other. Think of $v_n$ as what is "left over" after $v$ is projected onto the column space of X, so $v_n$ is orthogonal to the column space of X (and to any vector in the column space of X).
$$
v = v_p + v_n
$$
$$
v_p \perp v_n
$$
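These preliminaries can be verified numerically. A minimal NumPy sketch, where the particular matrix X and vector $v$ are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))            # any full-column-rank design matrix
P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto the column space of X

v = rng.normal(size=6)
v_p = P @ v    # component in the column space of X
v_n = v - v_p  # "left over" component

print(np.allclose(v, v_p + v_n))   # True: v = v_p + v_n exactly
print(np.isclose(v_p @ v_n, 0.0))  # True: v_p is orthogonal to v_n
```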
1. Why does P * P = P?
Intuitively, projecting a vector onto a subspace twice in a row has the same effect as projecting it onto that subspace once. The second projection has no effect because the vector is already in the subspace from the first projection.
Less intuitive
If that isn’t intuitive, it may be easier to consider the equivalent question: why does $P * P v= P v$ for any arbitrary vector v?
Start by simplifying the left hand side:
$$ P * (P v) = P v_p $$
since $P v = v_p$.
Next consider $ P v_p $, which (by definition of P) projects $v_p$ onto the column space of X. This has no effect since $v_p$ is already entirely in the column space of X. Therefore
$$
P v_p = v_p
$$
Since $v_p = P v$, we conclude:
$$
P v_p = P v
$$
Chaining all these equations together gives:
$$
P * P v= P v
$$
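The idempotence argument above can be checked numerically; this small NumPy sketch (with an arbitrary X and $v$ of my choosing) confirms that projecting twice has the same effect as projecting once.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))
P = X @ np.linalg.inv(X.T @ X) @ X.T  # projection onto the column space of X

v = rng.normal(size=8)
# Projecting twice equals projecting once, both as matrices and on a vector.
print(np.allclose(P @ P, P))            # True
print(np.allclose(P @ (P @ v), P @ v))  # True
```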
2. Why is P symmetric?
Intuitively, consider two arbitrary vectors $v$ and $w$. Take the dot product of one vector with the projection of the other vector.
$$
(P v) \cdot w
$$
$$
v \cdot (P w)
$$
In both dot products, one term ($P v$ or $P w$) lies entirely in the ‘projected space’ (column space of X), so both dot products ignore everything that is not in the column space of X. This means both dot products are equal. Some simple dot product identities then imply that $P = P^T$, so $P$ is symmetric.
Less intuitive
If that isn't intuitive, we first prove that both dot products are equal. Decompose $v$ and $w$ as shown in the preliminaries above.
$$
v = v_p + v_n
$$
$$
w = w_p + w_n
$$
The projection of a vector lies in a subspace. The dot product of anything in this subspace with anything orthogonal to this subspace is zero. We use this fact on the dot product of one vector with the projection of the other vector:
$$
(P v) \cdot w \hspace{1cm} v \cdot (P w)
$$
$$
v_p \cdot w \hspace{1cm} v \cdot w_p
$$
$$
v_p \cdot (w_p + w_n) \hspace{1cm} (v_p + v_n) \cdot w_p
$$
$$
v_p \cdot w_p + v_p \cdot w_n \hspace{1cm} v_p \cdot w_p + v_n \cdot w_p
$$
$$
v_p \cdot w_p \hspace{1cm} v_p \cdot w_p
$$
Therefore
$$
(Pv) \cdot w = v \cdot (Pw)
$$
Next, we can show that a consequence of this equality is that the projection matrix P must be symmetric. Here we begin by expressing the dot product in terms of transposes and matrix multiplication (using the identity $x \cdot y = x^T y$ ):
$$
(P v) \cdot w = v \cdot (P w)
$$
$$
(P v)^T w = v^T (P w)
$$
$$
v^T P^T w = v^T P w
$$
Since $v$ and $w$ can be any vectors, the above equality implies:
$$
P^T = P
$$
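Both the dot-product equality and the symmetry it implies are easy to confirm numerically. A NumPy sketch with arbitrary choices of X, $v$, and $w$:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(7, 2))
P = X @ np.linalg.inv(X.T @ X) @ X.T  # projection onto the column space of X

v, w = rng.normal(size=7), rng.normal(size=7)
# (P v) . w = v . (P w) for arbitrary v and w, and P equals its transpose.
print(np.isclose((P @ v) @ w, v @ (P @ w)))  # True
print(np.allclose(P, P.T))                   # True
```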
3. Why is P positive semidefinite?
By definition a matrix $P$ is positive semidefinite if and only if for every non-zero column vector $v$:
$$
v^T P v \geq 0
$$
or equivalently:
$$
v \cdot (P v) \geq 0
$$
Intuitively, a dot product projects one vector onto the other and then scales the (signed) length of that projection by the length of the other vector. We want to show that this dot product is non-negative.
In the equation immediately above, $v \cdot (P v)$ means "project $v$ onto $P v$ and scale by the length of $P v$". The first part, projecting $v$ onto $P v$, is equivalent to projecting $v$ onto $v_p$, since $P v = v_p$.
Projecting $v$ onto $v_p$ projects $v$ onto something that lies entirely in the column space of X, so this projection is just $v_p$. Scaling the length of $v_p$ by the length of $v_p$ squares that length, and a squared length must be non-negative.
Less intuitive
If that isn't intuitive, the dot product can be simplified by decomposing $v$ into orthogonal components
$$
v \cdot (P v)
$$
$$
(v_p + v_n) \cdot (P v)
$$
$$
(v_p + v_n) \cdot v_p
$$
$$
v_p \cdot v_p + v_n \cdot v_p
$$
Since $v_p$ and $v_n$ are orthogonal, the second term is zero and we have only
$$
v_p \cdot v_p
$$
The quantity immediately above is the length of the vector $v_p$ squared (i.e., $\|v_p\|_2^2$ ). This must be a non-negative value.
$$
v_p \cdot v_p = \|v_p\|_2^2 \geq 0
$$
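Positive semidefiniteness can likewise be spot-checked numerically. A NumPy sketch (X and the test vectors are arbitrary choices of mine): $v^T P v$ stays non-negative for many random $v$, and every eigenvalue of $P$ is 0 or 1, hence non-negative.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(9, 3))
P = X @ np.linalg.inv(X.T @ X) @ X.T  # projection onto the column space of X

# v^T P v = ||v_p||^2 >= 0 for many random v.
vs = rng.normal(size=(1000, 9))
quad = np.einsum('ij,jk,ik->i', vs, P, vs)  # v^T P v for each row v
print(np.all(quad >= -1e-10))  # True: non-negative up to round-off

# Eigenvalues of a projection matrix are 0 or 1; here rank(X) = 3 of them are 1.
eigvals = np.linalg.eigvalsh(P)  # ascending order
print(np.allclose(eigvals[-3:], 1.0))  # True
```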
Best Answer
In general $\textbf Y$ and $ \textbf X$ are known because you have a sample. This sample is a dataset of $n$ points: $(x_{11},x_{12},\ldots,x_{1m},y_1), (x_{21},x_{22},\ldots,x_{2m},y_2), (x_{31},x_{32},\ldots,x_{3m},y_3), \ldots, (x_{n1},x_{n2},\ldots,x_{nm},y_n)$. The values of $x_{ij}$ are collected in $\textbf X$ and the values of $y_i$ are collected in $\textbf Y$. Each observation is a pair of $m$ x-values and one y-value. And it is true that it has to be $n > m > k$.
You have to minimize $V(\beta)=||\textbf Y-\textbf X\beta||_2^2=(\textbf Y-\textbf X \beta)'(\textbf Y-\textbf X\beta)=(\textbf Y'- \beta' \textbf X' )(\textbf Y-\textbf X\beta)$
Multiplying out
$V(\beta)=\textbf Y'\textbf Y -\textbf Y'\textbf X\beta-\beta' \textbf X' \textbf Y +\beta' \textbf X' \textbf X\beta$
Since $\textbf Y'\textbf X\beta$ is a scalar, $\textbf Y'\textbf X\beta=\beta' \textbf X' \textbf Y$. Therefore
$V(\beta)=\textbf Y'\textbf Y -2\beta' \textbf X' \textbf Y +\beta' \textbf X' \textbf X\beta$
Differentiating w.r.t. $\beta$ and setting the derivative to zero
$\frac{\partial V}{\partial \beta}=-2 \textbf X' \textbf Y +2 \textbf X' \textbf X\beta=0$
$2\textbf X' \textbf X\beta=2\textbf{X}'\textbf{Y}$
Dividing both sides by 2
$\textbf X' \textbf X\beta=\textbf{X}'\textbf{Y}$
Multiplying both sides by $(\textbf X' \textbf X)^{-1}$ (assuming $\textbf X' \textbf X$ is invertible),
$\beta=(\textbf X' \textbf X)^{-1}\textbf{X}'\textbf{Y}$
$\beta$ contains the coefficient values that minimize the squared difference between the fitted values $\textbf X\beta$ and the observed y-values $\textbf Y$.
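This derivation can also be checked numerically. A minimal NumPy sketch with simulated data (the design and coefficients are my own choices): at $\hat\beta$ the gradient $-2\textbf X'\textbf Y + 2\textbf X'\textbf X\beta$ vanishes, and perturbing $\hat\beta$ only increases $V$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 40, 3
X = rng.normal(size=(n, m))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Solve the normal equations X'X beta = X'Y (more stable than forming the inverse).
beta = np.linalg.solve(X.T @ X, X.T @ Y)

grad = -2 * X.T @ Y + 2 * X.T @ X @ beta
print(np.allclose(grad, 0.0))  # True: the first-order condition holds

V = lambda b: np.sum((Y - X @ b) ** 2)
perturbed = beta + rng.normal(scale=0.1, size=m)
print(V(beta) <= V(perturbed))  # True: beta minimizes V
```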