Solved – Usefulness of standard deviation/alternatives for highly variable measurements

EDIT: SOME ADDITIONS TO CLARIFY ORIGINAL TEXT

If I remember correctly, I have heard it said that the standard deviation of mean precipitation sums is pretty useless because precipitation quantities are so highly variable.

Let's say that climatologists have calculated standard deviations for the means of monthly precipitation sums, for every month of the year, over 30 years of measurements. A monthly sum equals the total amount that has fallen during that month, so a monthly sum counts as one measurement in this case. If you take the average for the month of July over 30 years, you therefore have 30 measurements. If the standard deviations are bigger than the corresponding mean values, it tells us there is a relatively high spread in the dataset. This is another way of saying that the coefficient of variation is big.
But what would be considered big in this specific case? Are coefficients of variation of this size normal for this type of data? Let's assume here that all the coefficients of variation are above 100 %. This is probably an irrelevant question for this forum.

Now, when the difference between the average values of two 30-year periods is calculated, each period introduces its own standard deviation, and the resulting standard deviation for the difference is even bigger than the larger of the two periods' standard deviations. I believe this is called error propagation (please correct me if this is the wrong English terminology). If the resulting standard deviation is bigger than the difference of the mean values, it means that the calculated difference may be very far from the true value. In other words, the difference of means would be pretty inaccurate in this case, and for quite a few observations it would yield fictitious differences. Is that right?
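As a rough illustration of the propagation described above, here is a minimal sketch. It assumes the two 30-year periods are independent samples; the function name and the July figures are hypothetical, not real data. It keeps apart two quantities that are easy to mix up: the spread of a difference between two single monthly sums, and the standard error of a difference between two 30-year means.

```python
import math

def propagate_difference(sd_a, n_a, sd_b, n_b):
    """Return (sd_single, se_means) for two independent periods.

    sd_single: standard deviation of the difference of two single monthly sums
    se_means:  standard error of the difference of the two period means
    """
    sd_single = math.sqrt(sd_a**2 + sd_b**2)
    se_means = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return sd_single, se_means

# Hypothetical July statistics (mm) for two 30-year periods.
mean_a, sd_a, n_a = 50.0, 60.0, 30
mean_b, sd_b, n_b = 65.0, 70.0, 30

difference = mean_b - mean_a
sd_single, se_means = propagate_difference(sd_a, n_a, sd_b, n_b)
print(f"difference of means: {difference:.1f} mm")
print(f"SD of a difference of single months: {sd_single:.1f} mm")
print(f"standard error of the difference of means: {se_means:.1f} mm")
```

With these made-up numbers the standard error of the difference (about 17 mm) is of the same size as the difference itself (15 mm), while the spread of single-month differences (about 92 mm) is far larger than either.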

Precipitation can vary greatly in some regions of the world, for example due to large-scale weather fluctuations like ENSO or other natural variation. So perhaps 30 years is too short a period for averaging precipitation data, given the high variability in some locations.
The World Meteorological Organization recommends averaging over periods of thirty years, and this is common practice. Of course there are weaknesses in doing so, and deviations from this practice exist. For instance, some claim that thirty years is too short for certain climatic parameters because of their variable nature. This partly answers my own question.
But if precipitation data are only available for 30 years, are there any alternatives to the standard deviation that would be recommended or considered more useful?

I think I have heard that precipitation data from different locations may have different distributions. Is the standard deviation only useful, or only meaningful, for normal distributions?

As a side question: would the mean value be more accurate, with a lower coefficient of variation, if one had a million or a billion years of measurements, even when the individual data points are highly variable?

EDIT 2: SHORT VERSION

If the data are not normally distributed, what does a coefficient of variation above 100 % tell us? What are the alternatives for quantifying variation, and are they better or equally good (this is especially interesting if the coefficient of variation is useless in my case)? I am looking for answers that are preferably relevant to the example above. Links to relevant studies are highly appreciated, as are answers or research that provide intuitive examples and explanations. Of course, answers to the other questions are also appreciated.

Best Answer

Question: Usefulness of standard deviation/alternatives for highly variable measurements?

The standard deviation tells you whether or not the measurements are highly variable. You don't use the standard deviation to predict the weather; you use it to tell whether the other value (the one for which the standard deviation is provided) can be relied on as a predictor.

Even that alone is no guarantee. Example: it has rained on this date every year for the past 100 years; will it rain today? Answer: there's a good chance, but if there are no clouds in the sky the chance is 0%. The standard deviation of a single value is not the certainty of a result.

A simple example is provided on J. Smith of SNU's webpage on standard deviation:

"Everybody knows that when it comes to climate and weather, there really is no difference between Oklahoma and Hawaii. What?!?!?! You mean you don't believe me? Well, let's look at the statistics (after all, this is a stat course). The average (mean) daily temperature in Hawaii is 78 degrees Fahrenheit. The average daily temperature in Oklahoma is 77 degrees Fahrenheit. You see...no difference.

You still don't buy it huh? Well you are indeed smarter than you look. But how about those numbers? Are they wrong? Nope, the numbers are fine. But what we learn here is that our measures of central tendency (mean, median and mode) are not always enough to give us a complete picture of a distribution. We need more information to distinguish the difference.

Well before we go any further, let me ask a question: Which average temperature more accurately describes that state? Is 78 degrees more accurate of Hawaii than 77 degrees is of Oklahoma? Well if you live in Oklahoma I suspect you decided that 77 degrees is a fairly meaningless number when it comes to describing the climate here.

...

Okay...so the mean temperatures were 78 for Hawaii and 77 for Oklahoma...right? But notice the difference in standard deviation. Hawaii is a mere 2.52 while Oklahoma came in at 10.57. What does this mean you ask? Well the standard deviation tells us the standard amount that the distribution deviates from the average. The higher the standard deviation, the more varied that distribution is. And the more varied a distribution, the less meaningful the mean. You see in Oklahoma, the standard deviation for temperature is higher. This means that our temperatures are much more varied. And because the temperature varies so much, the average of 77 doesn't really mean much. But look at Hawaii. There the standard deviation is very low. This of course means the temperature there does not vary much. And as a result the average of 78 degrees is much more descriptive of the Hawaiian climate. I wonder if that has anything to do with why people want to vacation in Hawaii rather than Oklahoma?"
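To see the same effect numerically, here is a small sketch using Python's standard `statistics` module. The two temperature lists are invented for illustration (they are not the data behind the quoted 2.52 and 10.57), chosen only so that the means come out at 78 and 77; `summarize` is a hypothetical helper name.

```python
import statistics

def summarize(values):
    """Return (mean, sample standard deviation, coefficient of variation in %)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    cv_percent = 100.0 * sd / mean
    return mean, sd, cv_percent

# Invented daily temperatures (degrees Fahrenheit), not real observations.
hawaii = [75, 77, 78, 78, 79, 81]
oklahoma = [60, 68, 75, 80, 88, 91]

for name, temps in (("Hawaii", hawaii), ("Oklahoma", oklahoma)):
    mean, sd, cv = summarize(temps)
    print(f"{name}: mean={mean:.1f}, sd={sd:.2f}, cv={cv:.1f}%")
```

The means are nearly identical, but the standard deviation (and therefore the coefficient of variation) is several times larger for the more variable series, which is exactly the information the mean alone hides.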

From: "Probabilistic Forecasting - A Primer" by Chuck Doswell and Harold Brooks of the National Severe Storms Laboratory Norman, Oklahoma:

"Probabilistic forecasts can take on a variety of structures. As shown in Fig. 0, it might be possible to forecast Q as a probability distribution. [Subject to the constraint that the area under the distribution always sums to unity (or 100 percent), which has not been done for the schematic figure.] The distribution can be narrow when one is relatively confident in a particular Q-value, or wide when one's certainty is relatively low. It can be skewed such that values on one side of the central peak are more likely than those on the other side, or it can even be bimodal [as with a strong quasistationary front in the vicinity when forecasting temperature]. It might be possible to make probabilistic forecasts of going past certain important threshold values of Q. Probabilistic forecasts don't all have to look like PoPs! When forecasting for an area, it is quite likely that forecast probabilities might vary from place to place, even within a single metropolitan area.".

Question: However is standard deviation only useful/make sense for normal distributions?

All that standard deviation will tell you about "highly variable measurements" is that they are highly variable, but you knew that already; if the standard deviation is very low you can rely more, but not absolutely, on historical measurements.

Question: As a side question: would the mean value be more accurate, with lower coefficient of variation if one has one million or billion years of measurements of data, even when each data point (spread) is highly variable?

Q: Mean more accurate with more data points?: Yes.

Q: Lower variation (standard deviation)?: No, not if the "data point (spread) is highly variable".

The standard deviation doesn't affect the accuracy of your calculation of the mean; whatever the standard deviation is, you apply the same mathematical skills and calculate both the mean and the standard deviation equally well. The point is that when the (accurately calculated) standard deviation is large, the mean (or any other single value) has less meaning. It's a less useful predictor.
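A quick simulation can illustrate both answers. This is only a sketch under stated assumptions: the "monthly sums" are synthetic draws from a gamma distribution with shape 0.5 and scale 100 (chosen purely so that the coefficient of variation is above 100 %, not fitted to any real station), and `sample_stats` is a hypothetical helper. It reports the sample standard deviation alongside the standard error of the mean, which is the quantity that describes how well the mean itself is pinned down.

```python
import random
import statistics

def sample_stats(n, rng):
    """Draw n skewed synthetic 'monthly sums'; return (mean, sd, standard error of the mean)."""
    # Gamma(shape=0.5, scale=100) has mean 50 and SD of about 70.7, so CV is about 141 %.
    values = [rng.gammavariate(0.5, 100.0) for _ in range(n)]
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    sem = sd / n ** 0.5
    return mean, sd, sem

rng = random.Random(0)  # fixed seed so the run is repeatable
results = {n: sample_stats(n, rng) for n in (30, 1_000, 100_000)}
for n, (mean, sd, sem) in results.items():
    print(f"n={n:>7}: mean={mean:7.2f}  sd={sd:7.2f}  sem={sem:6.3f}")
```

As the number of measurements grows, the standard deviation settles near the spread of the underlying distribution rather than shrinking, while the standard error of the mean keeps falling roughly as one over the square root of n.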

Even with a very low standard deviation, any prediction based on a single value (for example, the mean) isn't 100% reliable.

Question: Looking for answers which preferably are relevant to above example. Links to relevant studies are highly appreciated. Answers/research that provide intuitive examples/explanations are also highly appreciated. Of course answers to the other questions also are appreciated.

- Understanding the difference between climatological probability and climate probability

Using the Probability Forecast Distribution Tool
Why do ENSO forecasts use probabilities?
Probabilistic Forecasting - A Primer (repeat of link given above)

- Bayesian probability

"Bayesian probability is an interpretation of the concept of probability, in which, instead of frequency or propensity of some phenomenon, probability is interpreted as reasonable expectation representing a state of knowledge or as quantification of a personal belief.

The Bayesian interpretation of probability can be seen as an extension of propositional logic that enables reasoning with hypotheses, i.e., the propositions whose truth or falsity is uncertain. In the Bayesian view, a probability is assigned to a hypothesis, whereas under frequentist inference, a hypothesis is typically tested without being assigned a probability.

Bayesian probability belongs to the category of evidential probabilities; to evaluate the probability of a hypothesis, the Bayesian probabilist specifies some prior probability, which is then updated to a posterior probability in the light of new, relevant data (evidence). The Bayesian interpretation provides a standard set of procedures and formulae to perform this calculation.".

- Modern Forecasting Papers

A method for preferential selection of dates in the Schaake shuffle approach to constructing spatiotemporal forecast fields of temperature and precipitation (April 2017) by Scheuerer, Hamill, Whitin, He, and Henkel.
Probabilistic temperature forecasting based on an ensemble AR modification (6 Aug 2015), by Möller and Groß.
Spatial postprocessing of ensemble forecasts for temperature using nonhomogeneous Gaussian regression (30 June 2014), by Feldmann, Scheuerer, and Thorarinsdottir.

That should get you started; each of those papers has citation links that lead to newer papers.
