It is possible to get a general formula for the autocovariance function of a stationary ARMA(p,q) process. Suppose $X_t$ is a (zero-mean) stationary solution of an ARMA(p,q) equation:
$$\phi(B)X_t=\theta(B)Z_t$$
Multiply this equation by $X_{t-h}$ for $h>q$, take expectations, and you get
$$r(h)-\phi_1r(h-1)-...-\phi_pr(h-p)=0$$
This is a linear recurrence relation, which has a general solution. If all the roots $\lambda_i$ of the polynomial $\phi(z)=1-\phi_1z-...-\phi_pz^p$ are distinct,
$$r(h)=\sum_{i=1}^pC_i\lambda_i^{-h}$$
where the $C_i$ are constants that can be derived from the initial conditions. Since $|\lambda_i|>1$ is required for (causal) stationarity, it is clear why the autocorrelation function (which is the autocovariance function scaled by the constant $r(0)$) decays rapidly, provided the $\lambda_i$ are not close to one.
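As a quick numerical check, here is a minimal sketch (plain NumPy/SciPy; the AR(2) coefficients $\phi_1=0.5$, $\phi_2=0.3$ are arbitrary illustrative values, chosen so both roots lie outside the unit circle). It fits the constants $C_i$ from $r(0)$ and $r(1)$, after which $\sum_i C_i\lambda_i^{-h}$ should reproduce the sample autocovariances at larger lags:

```python
import numpy as np
from scipy.signal import lfilter

# Simulate a long AR(2): x_t = 0.5 x_{t-1} + 0.3 x_{t-2} + z_t
rng = np.random.default_rng(0)
phi1, phi2 = 0.5, 0.3
z = rng.standard_normal(300_000)
x = lfilter([1.0], [1.0, -phi1, -phi2], z)[1_000:]  # drop burn-in

# Sample autocovariances r(h) = E[X_t X_{t-h}]
def acov(x, h):
    xc = x - x.mean()
    return xc[h:] @ xc[:len(xc) - h] / len(xc)

r = np.array([acov(x, h) for h in range(11)])

# Roots lambda_i of phi(z) = 1 - phi1 z - phi2 z^2
# (np.roots expects coefficients from the highest degree down).
lam = np.roots([-phi2, -phi1, 1.0]).astype(complex)

# Solve r(0) = C1 + C2 and r(1) = C1/lam1 + C2/lam2 for the constants C_i.
A = np.array([[1.0, 1.0], [1.0 / lam[0], 1.0 / lam[1]]])
C = np.linalg.solve(A, r[:2].astype(complex))

# r(h) should now match sum_i C_i * lam_i^(-h) for h = 2, ..., 10.
for h in range(2, 11):
    print(h, round(r[h], 3), round((C * lam ** (-h)).sum().real, 3))
```

The printed pairs should agree up to simulation noise; with repeated roots the matching solution picks up polynomial factors $h^k\lambda^{-h}$ instead.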
I've covered the case of distinct real roots of the polynomial $\phi(z)$; the remaining cases (complex or repeated roots) are covered by the general theory, but the formulas are a bit messier. Nevertheless, terms of the form $\lambda_i^{-h}$ remain.
Answers to questions 2 and 3 more or less follow from this formula. For an $AR(1)$ process, $r(h)=c\phi_1^h$, and when $\phi_1$ is close to one, i.e. close to non-stationarity, you get the behaviour you describe. The same goes for the general formula: if the process is nearly unit-root, one of the roots $\lambda_i$ is close to 1, and its term dominates the others, producing the slow decay.
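To put rough numbers on this (an illustrative comparison, coefficients chosen arbitrarily):

```python
# ACF of an AR(1) is r(h)/r(0) = phi1**h; compare a moderate coefficient
# with a near-unit-root one.
for phi1 in (0.5, 0.99):
    print(phi1, [round(phi1 ** h, 3) for h in (1, 10, 50, 100)])
# 0.5  -> [0.5, 0.001, 0.0, 0.0]      : essentially gone by lag 10
# 0.99 -> [0.99, 0.904, 0.605, 0.366] : still large after 100 lags
```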
The sample autocorrelation at a given lag is computed by taking every pair of values separated by that lag, multiplying each pair, and summing over all pairs. Because your signal has finite length, large lags have fewer and fewer pairs to sum, and the resulting values are therefore smaller. You can compensate for this by using an "unbiased" autocorrelation estimate. Statsmodels' acf has an option to return an unbiased estimate: http://statsmodels.sourceforge.net/stable/generated/statsmodels.tsa.stattools.acf.html
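Here is a minimal NumPy sketch of the two normalizations, so the correction is explicit (in older statsmodels the switch is the `unbiased` argument of `acf`; in recent releases it has been renamed to `adjusted`):

```python
import numpy as np

def sample_acf(x, nlags, unbiased=False):
    """Sample autocorrelation up to `nlags`.

    The biased estimator divides every lag-h sum of products by n; the
    unbiased one divides by (n - h), compensating for the fact that only
    n - h pairs are available at lag h.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = n - np.arange(nlags + 1) if unbiased else n
    acov = np.array([x[h:] @ x[:n - h] for h in range(nlags + 1)]) / denom
    return acov / acov[0]   # scale by the lag-0 autocovariance
```

The trade-off is that the unbiased estimate has a larger variance at high lags, since the correction blows up exactly where few pairs are left.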
Best Answer
The procedure that you followed is the recommended one for specifying an ARMA model. Your results show that the differenced series behaves like a white noise process. This means that no ARMA model can predict it better than its long-run mean.
For example, results like yours were historically used to conclude that stock returns are unpredictable. However, this conclusion is exaggerated: such results rule out linear auto-predictability, but not predictability altogether.
First, it is possible to get better results using other predictors, not only lagged values of the series itself; think about Granger causality (predictability). A minimal sketch follows after this list.
Second, you can try to predict your variable with ARMA tools after some transformation of the original data.
Otherwise, you can use models that involve nonlinear relations between the predicted variable and the predictors; an example is artificial neural networks.
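On the first point, here is a minimal sketch of a Granger-causality check using statsmodels (the data are simulated purely for illustration; `grangercausalitytests` expects a two-column array and tests whether the second column helps predict the first):

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Toy data: x depends on the first lag of y, so y should Granger-cause x.
rng = np.random.default_rng(1)
y = rng.standard_normal(500)
x = np.concatenate([[0.0], 0.8 * y[:-1]]) + 0.3 * rng.standard_normal(500)

# Tests H0: "the second column does not help predict the first" at each lag.
grangercausalitytests(np.column_stack([x, y]), maxlag=2)
```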
If your objective is prediction, your work is not ending; it is just starting.