Solved – How to look for valleys in a graph

data-visualization, distributions, r, statistical-significance

I'm examining some genomic coverage data, which is basically a long list (a few million values) of integers, each saying how well (or how "deeply") the corresponding position in the genome is covered.

I would like to look for "valleys" in this data, that is, regions which are significantly "lower" than their surrounding environment.

Note that the size of the valleys I'm looking for may range from 50 bases to a few thousand.

What kind of paradigms would you recommend using to find those valleys?

UPDATE

Some graphical examples for the data:
[two example plots of the coverage data]

UPDATE 2

Defining what a valley is, is of course one of the questions I'm struggling with. These are obvious ones to me:
[two example plots of obvious valleys]

but there are some more complex situations. In general, there are three criteria I consider:
1. The (average? maximal?) coverage in the window with respect to the global average.
2. The (…) coverage in the window with respect to its immediate surrounding.
3. How large the window is: if I see very low coverage over a short span, that is interesting; very low coverage over a long span is also interesting; mildly low coverage over a short span is not really interesting; but mildly low coverage over a long span is. So it's a combination of the length of the span and its coverage: the longer the span, the higher I let the coverage be while still considering it a valley (see the sketch after this list).
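One way to make criterion 3 concrete, purely as a sketch (the functional form and all the constants here are made up for illustration, not part of any standard method), is to let the coverage threshold grow with the span length:

# Toy rule: the coverage allowed for a "valley" call rises with span
# length. The constants (0.2, 0.3, 1000) are arbitrary placeholders.
is.valley <- function(mean.cov, span.len, global.mean) {
  # Short spans must be below 20% of the global mean; the allowance
  # rises towards 50% as the span approaches ~1000 bases.
  allowed <- global.mean * (0.2 + 0.3 * pmin(span.len / 1000, 1))
  mean.cov < allowed
}

# Example: with a global mean of 100, a 50-base span at coverage 10
# qualifies, a 50-base span at 40 does not, but a 2000-base span at 40 does.
is.valley(c(10, 40, 40), c(50, 50, 2000), 100)
# [1]  TRUE FALSE  TRUE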

Thanks,

Dave

Best Answer

You could use some sort of Monte Carlo approach, using for instance the moving average of your data.

Take a moving average of the data, using a window of a reasonable size (I guess it's up to you to decide how wide).
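In case the ma() function used in the example below (it comes from the ares package) isn't available to you, a centred moving average can also be computed with base R's stats::filter; this is just a sketch assuming a 10-point window:

# Centred moving average over a 10-point window, base R only.
# stats::filter returns NA at the edges where the window doesn't fit.
mov.avg <- stats::filter(values, rep(1 / 10, 10), sides = 2)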

Troughs in your data will (of course) be characterized by a lower average, so now you need to find some "threshold" to define "low".

To do that you randomly swap the values of your data (e.g. using sample()) and recalculate the moving average for your swapped data.

Repeat this last step a reasonably large number of times (>5000) and store all the averages of these trials. Essentially you will have a matrix with 5000 rows, one per trial, each containing the moving average for that trial.

At this point, for each column you pick the 5% (or 1%, or whatever you want) quantile, that is, the value below which only 5% of the means of the randomized data lie.

You now have a "confidence limit" (I'm not sure if that is the correct statistical term) to compare your original data with. If you find a part of your data that is lower than this limit then you can call that a trough.

Of course, bear in mind that neither this nor any other mathematical method can give you any indication of biological significance, although I'm sure you're well aware of that.

EDIT - an example

require(ares) # for the ma (moving average) function

# Some data with peaks and troughs 
values <- cos(0.12 * 1:100) + 0.3 * rnorm(100) 
plot(values, t="l")

# Calculate the moving average with a window of 10 points 
mov.avg <- ma(values, 1, 10, FALSE)

numSwaps <- 1000    
mov.avg.swp <- matrix(0, nrow=numSwaps, ncol=length(mov.avg))

# The swapping may take a while, so we display a progress bar 
prog <- txtProgressBar(0, numSwaps, style=3)

for (i in 1:numSwaps)
{
# Swap the data
val.swp <- sample(values)
# Calculate the moving average
mov.avg.swp[i,] <- ma(val.swp, 1, 10, FALSE)
setTxtProgressBar(prog, i)
}

# Now find the 1% and 5% quantiles for each column
limits.1 <- apply(mov.avg.swp, 2, quantile, 0.01, na.rm=TRUE)
limits.5 <- apply(mov.avg.swp, 2, quantile, 0.05, na.rm=TRUE)

# Plot the limits
points(limits.5, t="l", col="orange", lwd=2)
points(limits.1, t="l", col="red", lwd=2)

This will just allow you to find the regions graphically, but you can easily find them programmatically with something along the lines of which(mov.avg < limits.5).
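If you also want contiguous regions rather than individual positions, one possibility (again just a sketch, not part of the answer above) is to run-length encode the below-limit mask:

# Collapse below-limit positions into contiguous regions with rle().
# Edge positions where the moving average is NA count as "not a valley".
below  <- !is.na(mov.avg) & mov.avg < limits.5
runs   <- rle(as.vector(below))
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
valleys <- data.frame(start = starts[runs$values],
                      end   = ends[runs$values])
valleys  # one row per valley region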