K-means minimizes the total sum of squared deviations. Don't think of it in terms of distances. Yes, you assign objects by their distance. But that distance (squared Euclidean) is just the sum of 1d squared deviations. So what you are really optimizing is the sum of squared 1d deviations.
The key part, however, is the mean, and here it is more obvious that this method optimizes single dimensions. The arithmetic mean is the least-squares estimator, again.
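This least-squares property of the mean is easy to check numerically. A minimal sketch in Python (the array `x` and the grid of candidate centers are made up for illustration):

```python
import numpy as np

x = np.array([0.0, 2.0, 3.0, 7.0])

def ssq(c):
    # sum of squared 1d deviations from a candidate center c
    return np.sum((x - c) ** 2)

# scan a grid of candidate centers; the minimizer is the arithmetic mean
candidates = np.linspace(-2, 10, 1201)
best = candidates[np.argmin([ssq(c) for c in candidates])]
print(best, x.mean())  # both 3.0
```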
So you really should plot the sum of squared deviations, not the mean of Euclidean distances.
Yes: Euclidean distances and squared Euclidean distances are monotone in each other - for a single object.
Example in one dimension:
Objects at 0 and 2, candidate center at 0.5:
average distance: (0.5 + 1.5) / 2 = 1
squared deviations: 0.25 + 2.25 = 2.5
Objects at 0 and 2, center at 1 (the mean):
average distance: (1 + 1) / 2 = 1
squared deviations: 1 + 1 = 2
So the second center is better only w.r.t. the squared deviations, not w.r.t. the linear deviations. In particular, an optimum for the linear deviations isn't necessarily optimal for the squared deviations.
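The arithmetic of the example above can be verified in a few lines of Python:

```python
# objects and the two candidate centers from the example above
x = [0.0, 2.0]

def avg_dist(c):
    # mean absolute (linear) deviation from center c
    return sum(abs(v - c) for v in x) / len(x)

def ssq(c):
    # sum of squared deviations from center c
    return sum((v - c) ** 2 for v in x)

print(avg_dist(0.5), ssq(0.5))  # 1.0 2.5
print(avg_dist(1.0), ssq(1.0))  # 1.0 2.0
```

Both centers tie on the linear deviations, but the mean wins on the squared ones.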
The above example highlights that one shouldn't just plug arbitrary distances into k-means. Technically, it is not distance based. It is based on the squared deviations in each single dimension, which happen to sum up to the squared Euclidean distance; that in turn is monotone with the regular Euclidean distance. So it appears as if we are assigning every object to the nearest cluster center, although what we are actually doing is minimizing the sum of squared 1d deviations.
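A quick numeric check of both facts - the per-dimension decomposition and the monotonicity of the assignment step - on random data (points and centers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # 50 points in 3d
C = rng.normal(size=(4, 3))    # 4 cluster centers

# squared Euclidean distance = sum of per-dimension squared deviations
d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
d = np.sqrt(d2)

# monotone, so "nearest center" is identical under both
print((d2.argmin(axis=1) == d.argmin(axis=1)).all())  # True
```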
Do not use k-means for time series.
DTW is not minimized by the mean; k-means may not converge, and even if it converges, it will not yield a very good result. The mean is a least-squares estimator on the coordinates: it minimizes variance. K-means is designed for minimizing variance, not arbitrary distances.
Assume you have two time series: two sine waves of the same frequency, sampled over a rather long period, but offset by $\pi$. Since DTW does time warping, it can align them so they match perfectly, except for the beginning and end. DTW will therefore assign a rather small distance to these two series.
However, if you compute the mean of the two series, it will be a flat 0 - they cancel out. The mean does not do dynamic time warping, and loses all the value that DTW gained. On such data, k-means may fail to converge, and the results will be meaningless. K-means really should only be used with variance (= squared Euclidean), or some cases that are equivalent (like cosine on L2-normalized data, where the squared Euclidean distance equals $2 - 2\cdot$ cosine similarity).
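This failure mode can be sketched with NumPy alone; the `dtw` helper below is a toy textbook dynamic-programming implementation written for this example, not a library call:

```python
import numpy as np

def dtw(a, b):
    # textbook O(n*m) dynamic-programming DTW with |x - y| as local cost
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

t = np.linspace(0, 4 * np.pi, 200)  # two full periods
a = np.sin(t)
b = np.sin(t + np.pi)               # same wave, offset by pi

mean = (a + b) / 2                  # the series cancel out
print(np.abs(mean).max())           # ~0: the mean is a flat line

l1 = np.abs(a - b).sum()            # cost of the unwarped (diagonal) alignment
d = dtw(a, b)                       # warping aligns the waves except at the ends
print(d, l1)                        # d is far smaller than l1
```

So DTW considers the two series very similar, while their mean retains nothing of either.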
Instead, compute a distance matrix using DTW, then run hierarchical clustering such as single-link. In contrast to k-means, the series may even have different lengths.
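As a sketch of that pipeline, assuming SciPy is available (the `dtw` helper is again a toy implementation, and the four example series - two sine-shaped, two ramp-shaped, all of different lengths - are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def dtw(a, b):
    # toy O(n*m) dynamic-programming DTW with |x - y| as local cost
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# four series of *different lengths*: two sine-shaped, two ramp-shaped
series = [
    np.sin(np.linspace(0, 2 * np.pi, 60)),
    np.sin(np.linspace(0, 2 * np.pi, 90)),
    np.linspace(0, 1, 70),
    np.linspace(0, 1, 100),
]

# condensed pairwise DTW distance matrix (upper triangle, row by row)
k = len(series)
condensed = [dtw(series[i], series[j]) for i in range(k) for j in range(i + 1, k)]

# single-link hierarchical clustering on the precomputed distances
Z = linkage(condensed, method="single")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # the two sines share one label, the two ramps the other
```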
Best Answer
So you say: "I'm trying to cluster time series of different length and I came up with the idea to use DTW"...
Either:
- You are in a situation in which you can simply make the time series the same length; see section 3 of [a]. In that case, problem solved.
- You are in a situation in which you cannot do this. For example, maybe your time series are a mix of one heartbeat, two and a half heartbeats, etc., which you clearly should not force to the same length. In this case you have two sub-choices:
  - Cluster using only subsections of the time series, using u-shapelets [b]
  - Cluster using derived (non-shape) features [c]

If I had to guess from your description only, go with u-shapelets [b]; there is free code.
By the way, the problem of computing centroids under DTW was recently solved.