The estimator is biased, regardless.
Note first that $\alpha$ is not identifiable because you cannot distinguish between $\alpha$ and $1-\alpha$. Let's accommodate this problem by allowing that we don't care which coin is which and stipulating (arbitrarily, but with no loss of generality) that $0 \le \alpha \le 1/2$.
It's reasonable, and conventional, to fix the estimator $g$ as follows:
$$\eqalign{
g(k,n) &= \frac{1 - \sqrt{\delta}}{2} \cr
\delta &= \max(0, 1 - 4 k / n)
}$$
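As a quick sanity check, here is a minimal sketch of $g$ in Python (the function name and the specific test values are my own, not from the original):

```python
import math

def g(k, n):
    """Estimate alpha in [0, 1/2] by inverting alpha * (1 - alpha) = k / n.

    delta = max(0, 1 - 4k/n) clips at zero so the square root stays real
    when the observed proportion k/n exceeds 1/4, the maximum possible
    value of alpha * (1 - alpha); in that case the estimate is 1/2.
    """
    delta = max(0.0, 1.0 - 4.0 * k / n)
    return (1.0 - math.sqrt(delta)) / 2.0
```

Note that $g$ is a nondecreasing function of $k$ that saturates at $1/2$ once $k/n \ge 1/4$; the clipping in $\delta$ is exactly what makes the estimator nonlinear in $k$.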
No matter what you do, though, this will be a nonlinear function of the outcome $k$ and therefore is certain to be biased for almost all $\alpha$.
A better approach is to search among some functional class of estimators $h(k,n)$ (such as ones that are linear in $k$) for one that minimizes the expectation of some loss function. In many situations an estimator that works well for quadratic loss also works well for many other reasonable losses, so let's look at that. What we're after, then, is, for each $n$, to minimize the expectation $\mathbb{E}[(h(k,n) - \alpha)^2]$ over all estimators $h$ in the class.
Let's look graphically at what's going on. The bias of any estimator $h$ of the parameter $\alpha$ is the difference between its expectation and the parameter, $\mathbb{E}[h(k,n) - \alpha]$. We can study any proposed estimator, then, by graphing its bias (if we really care about that) and its loss. For any value of $n$ they are functions of $\alpha$, which (of course) is unknown. That's why we have to look at the entire graph.
Here are the bias (blue, dashed) and square root of the expected quadratic loss (red) for $g$ when $n=16$:
(I use the root of the loss because this is directly comparable to the bias.)
For example, $g$ is unbiased for $\alpha \approx 1/3$ but otherwise is biased, with the size of the bias largest for $\alpha = 1/2$. The root expected loss is roughly between 0.15 and 0.2 provided $\alpha$ exceeds $1/6$, approximately.
As an alternative, consider linear estimators $h_\lambda(k,n)$ of the form $h_\lambda(k,n) = \lambda(n) k/n$. Here is a plot of $h_2$ also for $n=16$ (but please note the change in scale on the vertical axis):
For most $\alpha$ its bias exceeds that of $g$, but for some $\alpha$ (near 0.4) it actually has less bias. For a wide range of $\alpha$, though, its root expected loss is less than that of $g$. Provided $\alpha \gt 1/5$ or so, this simple estimator clearly outperforms the "obvious" one!
This is not necessarily "the best" linear estimator, however. To illustrate, here is a plot of $h_{4/3}$:
It outperforms both $g$ and $h_2$ for $1/8 \lt \alpha \lt 3/8$, approximately. Note, though, that $g$ outperforms the $h_{\lambda}$ for sufficiently small $\alpha$.
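The comparisons in the plots can be reproduced by simulation. The sketch below assumes, as the inversion $\alpha(1-\alpha) = k/n$ built into $g$ suggests (the original setup is not restated here, so treat this model as an assumption), that $k \sim \text{Binomial}(n, \alpha(1-\alpha))$; all names are illustrative:

```python
import math
import random

def g(k, n):
    # the "obvious" estimator: invert alpha * (1 - alpha) = k / n, clipped at 1/2
    return (1.0 - math.sqrt(max(0.0, 1.0 - 4.0 * k / n))) / 2.0

def h(lam):
    # linear estimator h_lambda(k, n) = lambda * k / n
    return lambda k, n: lam * k / n

def bias_and_rmse(estimator, alpha, n=16, reps=5000, seed=1):
    # Monte Carlo bias and root expected quadratic loss,
    # under the assumed model k ~ Binomial(n, alpha * (1 - alpha)).
    rng = random.Random(seed)
    p = alpha * (1 - alpha)
    ests = [estimator(sum(rng.random() < p for _ in range(n)), n)
            for _ in range(reps)]
    mean = sum(ests) / reps
    rmse = math.sqrt(sum((e - alpha) ** 2 for e in ests) / reps)
    return mean - alpha, rmse

for alpha in (0.1, 0.25, 0.4):
    for name, est in (("g", g), ("h_2", h(2.0)), ("h_4/3", h(4.0 / 3.0))):
        b, r = bias_and_rmse(est, alpha)
        print(f"alpha={alpha:0.2f}  {name:6s}  bias={b:+.3f}  rmse={r:.3f}")
```

Varying $\alpha$ over a grid and plotting the two returned quantities recovers the bias and root-loss curves discussed above.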
These considerations suggest there is value in knowing something about what $\alpha$ might be: that will tell you which portions of the loss graphs to focus on in selecting among alternative estimators. If, in addition, you have a prior distribution for $\alpha$ you can compute the expected loss (this is now a single number) and use that to compare estimators: your task becomes one of finding an estimator with lowest possible expected loss. This, of course, is a Bayesian estimator.
Regardless, using plots of expected loss is a standard and effective way to compare estimators and to choose ones that are appropriate for any particular problem.
You can quantify the quality of the estimator by calculating the total surprisal of all of the coin flips.
Suppose that your expert makes predictions $q_i$ for each coin. Then, given indicator variables $x_i$ for the coins coming up heads, the total surprisal is:
\begin{align}
\sum_i\left[ -x_i\log q_i - (1-x_i)\log (1-q_i)\right].
\end{align}
The expected value of the surprisal given the true values $\{p_i\}$ is the cross-entropy:
\begin{align}
\sum_i \left[-p_i\log q_i -(1-p_i)\log (1-q_i)\right].
\end{align}
It is nonnegative, and achieves its minimum value (the entropy of $\{p_i\}$) if and only if $p_i = q_i$ for all $i$.
If you subtract the entropy from the cross-entropy, you get the relative entropy (whose minimum value is zero). Taking $e^{-x}$ of the relative entropy then gives a number in $[0, 1]$, as you wanted, with a reasonable probabilistic interpretation.
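These quantities are straightforward to compute directly; the sketch below (function names are my own, and it assumes all probabilities are strictly between 0 and 1 so the logarithms are finite):

```python
import math

def total_surprisal(x, q):
    """Total surprisal of 0/1 outcomes x under predicted head probabilities q."""
    return sum(-xi * math.log(qi) - (1 - xi) * math.log(1 - qi)
               for xi, qi in zip(x, q))

def cross_entropy(p, q):
    """Expected surprisal of predictions q when the true probabilities are p."""
    return sum(-pi * math.log(qi) - (1 - pi) * math.log(1 - qi)
               for pi, qi in zip(p, q))

def entropy(p):
    # cross-entropy of p against itself is the entropy of p
    return cross_entropy(p, p)

def score(p, q):
    # e^{-relative entropy}: equals 1 iff q matches p exactly, and
    # decays toward 0 as the predictions get worse
    return math.exp(-(cross_entropy(p, q) - entropy(p)))
```

For example, `score(p, p)` is exactly 1, and any mismatch between the predictions and the true probabilities pushes the score strictly below 1.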
Both right and wrong, depending on the assumption and the point in time. Since a binary-valued random variable is in any case Bernoulli distributed and you have no choice in how to model it, let us move away from that example and consider the following instead: we wonder whether the following data are normally distributed and, if so, what mean they have:
-0.33,1.4,0.64,-0.11,0.51,0.4,1.66,0.28,0.51,0.35,-0.38,0.1,1.64,-0.88,0.12,1.36,-0.23,-1.05,-0.87,-0.39
Right now you have the choice of modelling this with a t-distribution, a normal distribution, and so on. Furthermore, it is data from the real world, so indeed: we can never be absolutely, 100% sure that this data was even produced by a random variable. Maybe the concepts of probability do not apply here at all because there is simply no rule behind how this data was generated, and maybe the next number this process would have produced is 100001344.99 (a number that would not fit the pattern at all). But the real question is: does it matter? The answer is no: we simply try to model this data with different distributions and 'do the best we can'. In the end (in the real world) we want to optimize something: reduce costs, reduce waste, and so on. So, if we can do that with a (perhaps somewhat inadequate) model, do we care, as long as we can make "good money" out of it? I highly doubt it :-)
On the other hand, once you have selected a model (and therefore explicitly assumed that the data was in fact generated by a random variable and that this random variable in fact has a normal distribution with some unknown parameters) then you can compute everything you want (like $P[X > 0]$) explicitly!
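To make that concrete, here is a minimal sketch: fit a normal distribution to the data above by plugging in the sample mean and standard deviation, then compute $P[X > 0]$ under the fitted model (the variable names are my own):

```python
import math
import statistics

data = [-0.33, 1.4, 0.64, -0.11, 0.51, 0.4, 1.66, 0.28, 0.51, 0.35,
        -0.38, 0.1, 1.64, -0.88, 0.12, 1.36, -0.23, -1.05, -0.87, -0.39]

mu = statistics.fmean(data)       # sample mean
sigma = statistics.stdev(data)    # sample standard deviation

def normal_cdf(x, mu, sigma):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Under the assumed model X ~ N(mu, sigma^2):
p_positive = 1.0 - normal_cdf(0.0, mu, sigma)
print(f"mu = {mu:.4f}, sigma = {sigma:.4f}, P[X > 0] = {p_positive:.4f}")
```

The sample mean here is positive, so the fitted model assigns $P[X > 0] > 1/2$; every such probability statement is conditional on having adopted the normal model in the first place.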
On questions 2) and 3):
We always get insights about the data by using two sources:
1) context / business knowledge / experience from the past
2) the data
We use 1) to select the model and then we use 2) to adjust the parameters of the model. Examples for 1):
Clearly, the choice of model influences what you get out of the experiment, and if everybody uses the normal distribution because, hey, everybody before me did it so I will do it as well, then we may keep ourselves from gaining valuable new insights (by trying new distributions/models, say).