Answering your questions on the minimisation:
We can argue from the structure of the problem that the alphabet of $Z$ must be $\{a, a+1, a+2\}$ for some integer $a$ (ask me about this if it's unclear). Set $a = 0$ for convenience, since the decoder knows $a$ and can always subtract it.
Now, given this noise, we need to find the distribution $p_X$ that maximises $I(X;Y)$. I'll try to keep things fully analytic, no jumps of logic, no inspired guesswork.
$I(X;Y) = H(Y) - H(Y|X) = H(Y) - H(X+Z|X)$
$ = H(Y) - H(Z) = H(Y) - \log_2 3$
So we're looking for the distribution of $X$, $(p_0, p_1, p_2, p_3)$, that maximises $H(Y)$. There's a dirty old analytic way to do this, using Lagrange multipliers. (Note: I'll be doing this in natural logs, because they don't leave a dirty constant behind. This doesn't make a difference to the maximiser.)
Now, the distribution of $Y$ in terms of that of $X$, call it $(q_0, q_1, \ldots, q_5)$, is as follows (convince yourself of this):
$\left(p_0/3, (p_0 + p_1)/3, (p_0+p_1+p_2)/3, (p_1+p_2+p_3)/3, (p_2+p_3)/3, p_3/3 \right)$
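If you want to sanity-check that expression numerically, here's a throwaway sketch (the helper name `output_distribution` is just my own label) that builds the distribution of $Y$ by convolving $p_X$ with the uniform noise:

```python
import numpy as np

def output_distribution(p_x, noise_support=3):
    """Distribution of Y = X + Z, where Z is uniform over {0, ..., noise_support - 1}."""
    p_z = np.full(noise_support, 1.0 / noise_support)
    return np.convolve(p_x, p_z)   # length |X| + noise_support - 1 = 6 entries here

# Example with a uniform input (p_0, p_1, p_2, p_3):
print(output_distribution(np.array([0.25, 0.25, 0.25, 0.25])))
# -> roughly [0.083, 0.167, 0.25, 0.25, 0.167, 0.083], i.e. (p_0/3, (p_0+p_1)/3, ..., p_3/3)
```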
We want to run the optimisation:
$$ \max_{(p_0, p_1, p_2, p_3)} \sum_{y=0}^5 -q_y\ln q_y$$
$$\text{Subject to } p_0 + p_1 + p_2 + p_3 = 1$$
Define $J = \sum_{y=0}^5 -q_y\ln q_y + \lambda (p_0 + p_1 + p_2 + p_3 - 1)$
At the optimum (the factors of $1/3$ in the $q_y$ only contribute a constant $3\ln 3$ to the left-hand side, which I absorb into $\lambda$):
$$\partial_{p_0} (J) = 0 \implies \ln \left( (p_0)(p_1+p_0)(p_2+p_1+p_0) \right) = 3\lambda - 3 \tag{1}$$
$$\partial_{p_1} (J) = 0 \implies \ln \left((p_1+p_0)(p_2+p_1+p_0)(p_3+p_2+p_1)\right) = 3\lambda - 3 \tag{2}$$
$$\partial_{p_2} (J) = 0 \implies \ln \left((p_2+p_1+p_0)(p_3+p_2+p_1)(p_3+p_2)\right) = 3\lambda - 3 \tag{3}$$
$$\partial_{p_3} (J) = 0 \implies \ln \left((p_3)(p_3+p_2)(p_3+p_2+p_1)\right) = 3\lambda - 3 \tag{4}$$
$$\partial_{\lambda}(J) = 0 \implies p_0 + p_1 + p_2 + p_3 = 1 \tag{5}$$
This is not so bad to calculate. Look at all terms with $p_0$ in them. When a product like $(p_0+p_1)\ln(p_0+p_1)$ is differentiated with respect to $p_0$, differentiating the log factor leaves you with a $1$ (since $(p_0+p_1)\cdot\frac{1}{p_0+p_1} = 1$), and differentiating the outer $p_0 + p_1$ leaves you with just the log, giving $\ln(p_0+p_1) + 1$.
Now, $(1) - (2) = 0$. This implies:
$$\ln \frac{p_0}{p_1 + p_2 + p_3} = 0 \iff p_0 = p_1 + p_2 + p_3 \tag{$\alpha$}$$
$(4) - (3) = 0$. This implies:
$$\ln \frac{p_3}{p_0 + p_1 + p_2} = 0 \iff p_3 = p_0 + p_1 + p_2 \tag{$\beta$}$$
Feed the two equations above into equation $(5)$: from $(\alpha)$ and $(5)$, $2p_0 = 1$, and from $(\beta)$ and $(5)$, $2p_3 = 1$, so $p_0 = p_3 = 0.5$ and hence $p_1 = p_2 = 0$.
With this input distribution, the distribution of $Y$ is uniform over its 6 characters.
Therefore, $C = \max\limits_{p_X} I(X;Y) = \log_2 6 - \log _2 3 = 1$. Phew.
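If you don't trust the Lagrange-multiplier algebra, a quick numerical check (a standalone sketch; the helper name `mutual_information_bits` is mine) confirms that $(0.5, 0, 0, 0.5)$ hits 1 bit and beats, say, the uniform input:

```python
import numpy as np

def mutual_information_bits(p_x, noise_support=3):
    """I(X;Y) = H(Y) - H(Z) in bits, for Y = X + Z with Z uniform on {0, ..., noise_support - 1}."""
    p_z = np.full(noise_support, 1.0 / noise_support)
    p_y = np.convolve(p_x, p_z)
    h_y = -np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0]))
    return h_y - np.log2(noise_support)

print(mutual_information_bits(np.array([0.5, 0.0, 0.0, 0.5])))      # 1.0 bit: the maximiser
print(mutual_information_bits(np.array([0.25, 0.25, 0.25, 0.25])))  # ~0.874 bits: uniform input is worse
```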
Try to see what this is structurally doing. Suppose $X$ were uniform; then under this noise I would be able to tell when the input was $0$ or $3$ with much greater accuracy than when it was $1$ or $2$. Why? Because whenever $Y = 0$ I know that the input is $0$, and similarly $Y = 5$ tells me $X = 3$. So a 'good' distribution for communication under this noise should be weighted more towards $0$ and $3$.
To answer the other questions you raised:
Capacity can be zero, sure. For example, take $Z$ to be uniform over $\{ 1, 2, \dots, N\}$. Then the capacity heads to $0$ as $N \to \infty$: here $C = \max_{p_X} H(Y) - \log_2 N \le \log_2(N+3) - \log_2 N = \log_2(1 + 3/N) \to 0$.
What you're trying to do here is this: you know that your noise is uniform over 3 letters, but you don't know which letters. Suppose the noise were the worst possible; at what rate would I still be able to communicate? That is the minimum capacity: the smallest, over the possible noise distributions, of the maximum rates at which I could communicate.
(From your comments) You've calculated $H(Y|X)$ incorrectly.
$H(Y|X) = H(X+Z|X) = H(Z|X) = H(Z)$
Where the second equality comes from the fact that if I'm given $X$, I can always subtract it from any other random variable with 100% accuracy, and the last equality because $Z$ and $X$ are independent.
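If it helps, here is a small numerical confirmation (my own sketch, with an arbitrarily chosen input distribution) that $H(Y\mid X) = H(Z)$ no matter what $p_X$ is:

```python
import numpy as np

p_x = np.array([0.1, 0.2, 0.3, 0.4])   # any input distribution on {0, 1, 2, 3}

# Joint p(x, y) for Y = X + Z, with Z uniform on {0, 1, 2} and independent of X.
p_xy = np.zeros((4, 6))
for x in range(4):
    for z in range(3):
        p_xy[x, x + z] += p_x[x] / 3.0

# H(Y|X) = -sum_{x,y} p(x, y) log2 p(y|x)
h_y_given_x = -sum(
    p_xy[x, y] * np.log2(p_xy[x, y] / p_x[x])
    for x in range(4) for y in range(6) if p_xy[x, y] > 0
)

print(h_y_given_x, np.log2(3))   # both ~1.585 bits: H(Y|X) = H(Z) regardless of p_X
```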
For a channel with input alphabet $\mathcal X$ and output alphabet $\mathcal Y$, the channel capacity is upper bounded by $\log \min (|\mathcal X|, |\mathcal Y|)$.
For example, take a channel with the same input and output alphabet which just outputs the input exactly. Then its capacity is $\log |\mathcal X|$, which is achieved by a uniform distribution on the inputs (in fact, for any (weakly) symmetric channel, the uniform distribution achieves capacity, but that's another matter). This can be larger than 1 bit.
A rate $R$ is achievable if there exists a sequence of $(\lceil 2^{nR} \rceil, n)$ codes whose maximal error probability goes to zero as $n \to \infty$. An $(M,n)$ code for a channel (i.e. a specification of $p(y|x)$) is a mapping from $\{1,\ldots,M\}$ to $\mathcal X^n$ (which forms the encoder) together with a deterministic mapping from $\mathcal Y^n$ to $\{1,\ldots,M\}$ (which forms the decoder). The capacity is the supremum of achievable rates. So, in the example I gave above, rates below $\log |\mathcal X|$ are achievable (e.g. by repeating each of $n/2$ symbols twice, you get a rate of $\log |\mathcal X| /2$), but this is not the largest rate possible -- just sending the $n$ symbols as is gives you the full rate of $\log |\mathcal X|$.
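To make the $(M, n)$ definition concrete, here is a toy sketch (entirely my own illustration; messages are indexed from $0$ rather than $1$ for convenience) of a zero-error $(|\mathcal X|^n, n)$ code for the noiseless channel above, achieving the full rate $\log |\mathcal X|$:

```python
Q = 4   # |X|: hypothetical alphabet size, identified with {0, 1, 2, 3}
N = 5   # block length, so M = Q**N messages and rate log2(M)/N = log2(Q) = 2 bits/use

def encode(m, n=N, q=Q):
    """Encoder: message m in {0, ..., q**n - 1} -> codeword in X^n (its base-q digits)."""
    digits = []
    for _ in range(n):
        digits.append(m % q)
        m //= q
    return digits[::-1]

def decode(codeword, q=Q):
    """Decoder: received word in Y^n = X^n -> message (exact, since the channel is noiseless)."""
    m = 0
    for symbol in codeword:
        m = m * q + symbol
    return m

assert all(decode(encode(m)) == m for m in range(Q**N))   # zero error probability
```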
Not quite. What you get is $I(X;Y)=H(Y)-H(Y|X)=H(Y)-1$ bits, which is maximised when the output distribution $p_y$ is uniform, giving $H(Y)=\log_2 26$ bits. Then comes the realization that a uniform $p_y=1/26$ can be obtained by a uniform input $p_x=1/26$, but this is not the only solution: you could alternate the probabilities, $P_X(A)=1/13$, $P_X(B)=0$, $P_X(C)=1/13$, etc.
Hence the noisy typewriter has several input distributions that attain the capacity (not a very frequent case). All of them are optimal (in the Shannon sense), but it happens that one of those alternatives leads to a simple (almost trivial) optimal encoding (again, a rather rare occurrence).
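A quick numerical check of this (my own sketch, assuming the standard noisy-typewriter model in which each letter is received as itself or its successor with probability $1/2$) shows that the uniform and the alternating inputs both make $p_y$ uniform and both hit the capacity $\log_2 13$:

```python
import numpy as np

LETTERS = 26

def output_distribution(p_x):
    """Noisy typewriter: input i is received as i or i+1 (mod 26), each with probability 1/2."""
    p_y = np.zeros(LETTERS)
    for i, p in enumerate(p_x):
        p_y[i] += p / 2
        p_y[(i + 1) % LETTERS] += p / 2
    return p_y

def mutual_information_bits(p_x):
    p_y = output_distribution(p_x)
    h_y = -np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0]))
    return h_y - 1.0   # H(Y|X) = 1 bit for every input letter

uniform     = np.full(LETTERS, 1 / 26)    # p_x = 1/26 on every letter
alternating = np.array([1 / 13, 0] * 13)  # 1/13 on A, C, E, ..., zero on the rest

print(mutual_information_bits(uniform))      # ~3.7004 bits
print(mutual_information_bits(alternating))  # ~3.7004 bits
print(np.log2(13))                           # capacity log2(26) - 1 = log2(13) ~ 3.7004
```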