Some hints, but not quite the full answer:
There is a difference between a parameter $\mu$ and an estimator of that parameter. So if we call the estimator $\hat{\mu}$ then you want to minimise $$\sum_i (y_i - \hat{\mu})^2$$ which is $$\sum_i y_i^2 - \sum_i 2 y_i \hat{\mu} +\sum_i \hat{\mu} ^2$$ and (as you suggest) this will be when its derivative with respect to $\hat{\mu}$ is zero. Strictly speaking you should check this is a minimum, but since the derivative is monotone increasing that is obvious.
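As a sanity check on this hint, one can minimise the sum of squares numerically over a grid and compare the minimiser with the sample mean (a minimal sketch; the data values are made up for illustration):

```python
import numpy as np

# Made-up data for illustration
y = np.array([2.1, 1.9, 3.2, 2.8, 2.5])

# Minimise sum_i (y_i - mu_hat)^2 over a grid of candidate values
candidates = np.linspace(y.min(), y.max(), 10001)
sse = ((y[:, None] - candidates[None, :]) ** 2).sum(axis=0)
mu_hat = candidates[np.argmin(sse)]

# The minimiser agrees with the sample mean (up to grid resolution)
print(mu_hat, y.mean())
```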
Since $y_i = \mu + \epsilon_i$, you know $E[y_i] = \mu + E[\epsilon_i]$ (the parameter $\mu$ is a constant, so $E[\mu] = \mu$), and from there it is easy to find $E[\hat{\mu}]$.
As for $Var(\hat{\mu})$, you again have to multiply out a square, looking at $$E\left[\left(\hat{\mu}-E[\hat{\mu}]\right)^2\right].$$ You might want to use the fact that $y_i^2 = \mu^2 + 2 \mu \epsilon_i +\epsilon_i^2$ implies $E[y_i^2] = \mu^2 + \sigma^2$.
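Working through these hints with $\hat{\mu} = \bar{y}$ leads to $E[\hat{\mu}] = \mu$ and $Var(\hat{\mu}) = \sigma^2/n$, which a simulation can corroborate (a sketch only; the values of $\mu$, $\sigma$, and $n$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 5.0, 2.0, 50           # arbitrary illustration values
reps = 100_000

# Each row is one sample y_1..y_n with y_i = mu + eps_i
eps = rng.normal(0.0, sigma, size=(reps, n))
mu_hats = (mu + eps).mean(axis=1)     # hat{mu} = sample mean, per replication

print(mu_hats.mean())   # should be close to mu = 5
print(mu_hats.var())    # should be close to sigma^2 / n = 0.08
```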
Here's the general idea; someone with a better background in statistics than mine could probably give a better explanation. So you have this linear regression model:
$$Y = \alpha + \beta X + \epsilon $$
where $\epsilon$ follows a normal distribution with mean $0$.
What exactly does random mean? My background in statistics is very basic, but I understand that a random variable is defined as a mapping from a sample space to the real numbers. That definition makes sense; it's the assumption of a zero mean that trips me up. How can we assume this?
Personally, I've always taken the idea that $\epsilon$ follows a normal distribution with mean $0$ as an axiom of sorts for the linear regression model. My understanding is that it's just something nice we would like the linear regression model to have, and it lends the model some convenient properties. Remember:
Essentially, all models are wrong, but some are useful.
which is attributed to George E.P. Box.
Why would we want such an axiom? Well... on average, it would be nice to have zero error. And the assumption costs nothing: if the errors had some nonzero mean $c$, the model could simply absorb it into the intercept, replacing $\alpha$ with $\alpha + c$ and leaving a zero-mean error behind.
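One way to see why a zero error mean is a harmless convention: if $\epsilon$ had mean $c \neq 0$, ordinary least squares would fold $c$ into the estimated intercept. A simulation sketch (all parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, c = 1.0, 2.0, 3.0         # true intercept, slope, and error mean
x = rng.uniform(0, 10, size=5000)
eps = rng.normal(c, 1.0, size=x.size)  # errors with NONZERO mean c
y = alpha + beta * x + eps

# Ordinary least squares fit of y on x (degree-1 polynomial)
slope, intercept = np.polyfit(x, y, 1)

print(intercept)  # close to alpha + c = 4, not alpha = 1
print(slope)      # close to beta = 2
```

The fitted line is just as good; only the labelling of the intercept changes, which is why assuming $E[\epsilon] = 0$ loses no generality.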
In my honest opinion (this is based on the little measure-theoretic probability I have studied), it would be best to approach this idea of "randomness" intuitively, as you would in an undergraduate probability course.
The idea behind anything random is that you will never know its value in advance. So, in an undergraduate probability class, you assign probabilities to the values your quantity of interest can take by creating a probabilistic model. Your model, 99% of the time, won't be perfect, but that doesn't stop anyone from trying.
The normal distribution with mean $0$ is just one probabilistic model that statisticians find suitable for the error term. It isn't perfect, but it works for most purposes. I worked with a professor whose research focuses on skew-normal error terms, which complicates things but is usually more realistic, since, in reality, not everything looks like a bell curve.
My two cents. Hopefully I've helped somewhat.
Best Answer
It's the linearity of expectation: $E[X - Y] = E[X] - E[Y]$.
In your case,
$$\mathbb E\big[y_i - \mathbb E[y_i \mid x_i] \,\big|\, x_i\big] = \mathbb E[y_i \mid x_i] - \mathbb E\big[\mathbb E[y_i \mid x_i] \,\big|\, x_i\big] = \mathbb E[y_i \mid x_i] - \mathbb E[y_i \mid x_i] = 0,$$
where the second equality holds because $\mathbb E[y_i \mid x_i]$ is a function of $x_i$ alone, so conditioning on $x_i$ again leaves it unchanged.
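The chain of equalities says the residual $y_i - \mathbb E[y_i \mid x_i]$ has conditional mean zero given $x_i$. A numerical sketch with made-up discrete data (the distribution of $x$ and the coefficient $2.0$ are arbitrary; with discrete $x$, the empirical $E[y \mid x]$ is just a group mean):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.integers(0, 3, size=30000)               # discrete x: levels 0, 1, 2
y = 2.0 * x + rng.normal(0.0, 1.0, size=x.size)

resid_means = []
for level in np.unique(x):
    group = y[x == level]
    cond_mean = group.mean()                      # empirical E[y | x = level]
    resid_means.append((group - cond_mean).mean())  # empirical E[y - E[y|x] | x = level]

print(resid_means)  # each entry is 0 up to floating-point error
```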