To fit the best ARIMA model, you need to follow the standard approach, the Box-Jenkins method. Notably, you should not choose $p$ and $q$ simply by looking at residual plots. The auto.arima function you have used is designed to do much of the work for you, but there is certainly no guarantee that it will give you the best model; it merely chooses the model with the lowest AIC, AICc or BIC.
So what is the Box-Jenkins method? Without going into too much depth, you need to:
First check for stationarity. Plot the time series, as in plot.ts(data). Is the average value of the time series constant? If there is a linear trend in your series, you will need first-order differencing; if the trend is quadratic, then second-order, and so on. It is important not to over-difference, however. In an ARIMA($p,d,q$) model, first-order differencing corresponds to $d=1$. As well as a constant mean, you require an approximately constant variance. You can consider a transformation if the variance is clearly not constant, but I don't think that will be a problem here.
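The answer works in R; as a rough illustration in Python (with a hypothetical simulated series, not your data), first-order differencing turns a linear trend into a constant:

```python
import numpy as np

# Hypothetical monthly series: a linear trend with slope 0.5 plus noise.
rng = np.random.default_rng(0)
t = np.arange(120)
series = 0.5 * t + rng.normal(scale=1.0, size=t.size)

# First-order differencing (d = 1): the linear trend becomes a constant,
# so the differenced series fluctuates around the slope, 0.5.
d1 = np.diff(series)
print(d1.mean())
```

If the mean of the differenced series still drifts, the trend was not linear and a second difference (or a transformation) may be worth considering.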
Next check for seasonality. The easiest way to do this is to look at the ACF plot of the data. If the ACF decays and then spikes at regular intervals, there is a seasonal pattern. In this problem, there's a good chance of a yearly pattern, so you would see spikes at lags 12, 24, and so on, since monthly data give 12 observations per year. Note that this agrees with what auto.arima found.
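A quick numerical sketch of the same idea (Python, on hypothetical simulated monthly data): a period-12 pattern makes the sample ACF at the seasonal lag 12 dominate the non-seasonal lags.

```python
import numpy as np

# Hypothetical monthly series with a yearly (period-12) cycle plus noise.
rng = np.random.default_rng(6)
t = np.arange(240)
series = np.sin(2 * np.pi * t / 12) + 0.3 * rng.normal(size=t.size)

def acf(x, k):
    # Sample autocorrelation at lag k.
    x = x - x.mean()
    return np.dot(x[:-k], x[k:]) / np.dot(x, x)

# The seasonal lag shows a large positive spike; a non-seasonal lag does not.
print(acf(series, 12), acf(series, 3))
```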
Let's say that, following the steps above, you agree with what auto.arima gave you: there is a linear trend ($d=1$) and a yearly seasonal pattern. You now want to find $p$ and $q$, but you need to look at the data after differencing. You can do this with the diff function in R: diff(data) gives your data after first-order differencing, and diff(data, 12) gives your data after dealing with the yearly pattern. So diff(diff(data), 12) should give you something stationary; plot it and see!
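The same double differencing can be sketched in Python (hypothetical simulated data; np.diff mirrors R's diff(data), and a lag-12 difference mirrors diff(., 12)):

```python
import numpy as np

# Hypothetical monthly data: linear trend + yearly (period-12) pattern + noise.
rng = np.random.default_rng(1)
t = np.arange(144)
series = 0.2 * t + 3.0 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=t.size)

d1 = np.diff(series)        # like diff(data) in R: removes the linear trend
d12 = d1[12:] - d1[:-12]    # like diff(diff(data), 12): removes the yearly pattern

# Both the trend and the seasonal cycle are gone, so the variance collapses.
print(series.var() > d12.var())  # prints True
```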
Now you can choose the values of $p$ and $q$. Examine the ACF and PACF of the differenced series and you should be able to choose appropriate values. Look up the Box-Jenkins method on Wikipedia or similar for details. If you think $p=2$ and $q=0$, fit that model and perhaps try a few nearby candidates as well.
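As a toy illustration of fitting the AR part once the series is stationary (Python, simulated AR(2) data; conditional least squares here is only a simple stand-in for R's arima(x, order = c(2, 0, 0))):

```python
import numpy as np

# Hypothetical stationary AR(2) series with phi1 = 0.5, phi2 = -0.3.
rng = np.random.default_rng(2)
n = 500
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

# Conditional least squares: regress x_t on its first two lags.
X = np.column_stack([x[1:-1], x[:-2]])
y = x[2:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # estimates should be near (0.5, -0.3)
```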
Now, finally, come the residuals and the graphs you provided. If you've done everything right so far, the residuals should look like white noise. If not, some assumption made along the way is wrong, and it generally isn't easy to say which. If the residuals resemble some process, say an AR(1), rather than white noise, that can tell you what model to try instead, but again I won't go into details, as hopefully they will look like white noise.
Now do you actually have white noise, and how do you tell? If you have white noise, then about 95% of your sample autocorrelations should lie between the two blue lines, so one spike slightly over a line doesn't automatically imply you don't have white noise (remember that the sample autocorrelations are themselves random). An ARIMA model is never going to fit the data perfectly, so you can't expect perfect residuals that are exactly white noise. Although it is certainly somewhat subjective, I'd say that the residuals you have are pretty close to white noise.
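The blue lines in R's acf plot sit at roughly $\pm 1.96/\sqrt{n}$; for genuine white noise about 95% of sample autocorrelations fall inside them, as this Python sketch on simulated noise shows:

```python
import numpy as np

# Simulated Gaussian white noise; +/- 1.96/sqrt(n) gives the
# "blue lines" drawn by R's acf plot.
rng = np.random.default_rng(3)
n = 1000
e = rng.normal(size=n)

def acf(x, k):
    # Sample autocorrelation at lag k.
    x = x - x.mean()
    return np.dot(x[:-k], x[k:]) / np.dot(x, x)

bound = 1.96 / np.sqrt(n)
r = np.array([acf(e, k) for k in range(1, 41)])
inside = np.mean(np.abs(r) < bound)
print(inside)  # typically close to 0.95; a stray spike or two is expected
```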
You don't really have a basis to assert that they are white noise. Failure to reject the null is not the same thing as $H_0$ actually being true.
However, the evidence that they're something else is quite weak; the values are reasonably consistent with white noise.
I think your interpretation of the p-value for the Q-statistic at the 4th lag is wrong. Look at what the Q-statistic is actually doing: it's not testing the 4th lag in isolation but all lags up to 4.
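A minimal Python sketch of the Ljung-Box Q statistic makes that pooling explicit (hypothetical residuals; the statistic at lag $L$ sums contributions from lags 1 through $L$):

```python
import numpy as np

# Ljung-Box Q: at "lag 4" it pools the first FOUR sample
# autocorrelations, not just the fourth one.
def ljung_box_q(x, L):
    n = len(x)
    x = x - x.mean()
    denom = np.dot(x, x)
    r = np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, L + 1)])
    return n * (n + 2) * np.sum(r**2 / (n - np.arange(1, L + 1)))

rng = np.random.default_rng(4)
resid = rng.normal(size=300)  # stand-in residuals
q4 = ljung_box_q(resid, 4)
# Under the white-noise null, Q at lag 4 is compared to a chi-square
# with 4 degrees of freedom (95% cut-off about 9.49).
print(q4)
```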
Actually, the null hypothesis of a portmanteau test on residuals is "$\rho_1=\rho_2=\dots=\rho_L=0$", where $\rho_l$ is the correlation between $\epsilon_t$ and $\epsilon_{t-l}$. Thus, if $L=6$, a small p-value (column "Pr>Chisq") indicates it is difficult to accept the hypothesis that the noise is independently distributed up to 6 lags. As a result, the model (here, an AR(1) process) does not seem to capture all the dynamics of the time series.
In the displayed example, the noise seems to have significant autocorrelations at lags 1 through 3-4 (column "Autocorrelations"). Thus, an AR(4) could be more appropriate. However, another test has to be performed to be sure! An alternative is to use an information criterion such as AIC or BIC to compare autoregressive models.
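For the AIC route, here is a rough Python sketch comparing AR($p$) fits by conditional least squares on simulated data (the Gaussian AIC is computed only up to an additive constant; this is a simple stand-in for R's ar(x, aic = TRUE) or comparing arima fits by AIC):

```python
import numpy as np

# Hypothetical AR(2) data; compare AR(p) fits for p = 1..5 by AIC.
rng = np.random.default_rng(5)
n = 1000
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

p_max = 5
y = x[p_max:]  # condition on the same sample for every p so AICs are comparable

def ar_aic(p):
    # Regress x_t on its first p lags; Gaussian AIC up to a constant.
    X = np.column_stack([x[p_max - k: n - k] for k in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    m = len(y)
    return m * np.log(rss / m) + 2 * (p + 1)

aics = {p: ar_aic(p) for p in range(1, p_max + 1)}
print(min(aics, key=aics.get))  # order with the lowest AIC
```

AIC occasionally over-selects the order, which is why confirming with a residual test, as suggested above, is still worthwhile.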