The Box-Jenkins (ARIMA) model identification procedure consists of the following three stages.
Identification consists of using the data and any other knowledge that will tentatively indicate whether the time series can he described with a moving average (MA) model, an autoregressive (AR) model, or a mixed autoregressive – moving average (ARMA) model.
Estimation consists of using the data to make inferences about the parameters that will be needed for the tentatively identified model and to estimate values of them.
Diagnostic checking involves the EXAMINATION of residuals from fitted/tentative models, which can result in either no indication of model inadequacy or model inadequacy, together with information on how the series may be better described.
It is an ITERATIVE process yielding possible latent structure such as pulses, level/step shifts, seasonal pulses and local time trends while validating BOTH
1) constant parameters through time
and
2) constant error variance through time.
https://autobox.com/pdfs/ARIMA%20FLOW%20CHART.pdf details the iterative sequence.
When you post your data , I will attempt to highlight specific decision points.
EDITED AFTER RECEIPT OF DATA (2289 monthly values):
The DF test that you referred to reflects only tests for the need for differencing and ignores seasonal dummies/pulses as possible remedies for non-stationarity.
I used AUTOBOX my tool of choice ( which I have helped to develop ) to iteratively AND logically step through the ARIMA model building process.
The first step is to assess dominance of ARMA structure versus latent deterministic structure by comparing possible error variances from both. The conclusion is that monthly effects (NOT MONTHLY MEMORY) dominates. This is no surprise as it is common knowledge that the month-of-the-year effects are the most important factor when planning a trip to the Cayman Islands not just what occurred last year.
Note that monthly averages ( read : "seasonal pulses" ) are used as an aid to predict/forecast temperature ![enter image description here](https://i.stack.imgur.com/BXh8R.png)
A partial model list is here suggesting a level shift at or about 1919/6 while incorporating 11 seasonal dummies ![enter image description here](https://i.stack.imgur.com/b7AaK.png)
The first step yields a set of residuals suggesting the need for possible model augmentation i.e. an ar(1) component effectively adding memory to the model ..
and here ![enter image description here](https://i.stack.imgur.com/2hG6E.png)
The augmented model (1,0,0)(0,0,0)12 with 11 seasonal dummies and one level/step shift is shown here ![enter image description here](https://i.stack.imgur.com/xAoZs.png)
The Tsay Test for constant error variance suggests a significant error variance reduction at or around period 469
. This test is chronicled here http://docplayer.net/12080848-Outliers-level-shifts-and-variance-changes-in-time-series.html .
Here is the acf of the current model residuals
We proceed to evaluate possible anomalies that might need special attention . Here is the list of one-time pulses that need to be adjusted for in order to obtain robust parameters enabling meaningful tests of significance ![enter image description here](https://i.stack.imgur.com/ndyiR.png)
Finally we have a useful model
with residual plot here
with a forecast plot here for the next 36 months
and residual histogram here ![enter image description here](https://i.stack.imgur.com/yhU19.png)
In summary ... evaluate possible alternative strategies and then much like peeling an onion .. iterate until the error process is free of information suggesting model sufficiency.
Finally the data is non-stationary because there are identifiable fixed/deterministic (read monthly) effects and a level/step shift and a deterministic break-point in error variance.
Here is the Actual/Fit and Forecast graph ![enter image description here](https://i.stack.imgur.com/7e171.png)
Best Answer
There is nothing conflicting here. Differencing a unit root process removes the unit root. If the original process is AR(1) with a unit root, $y_t=y_{t-1}+\varepsilon_t$, the differenced process is just white noise, $\Delta y_t=\varepsilon_t$. This is what you see in the ACF and PACF for the differenced data.