Solved – how to interpret a scatter plot below

descriptive statisticsscatterplot

Below is the scatter plot of Brightness temperature on Y axis and corresponding Rainfall rate from TRMM satellite image on the x-axis. I want to know if there is a relationship between the two. If yes, how to get a regression equation from it and if not, how to know the reason for it?

Best Answer

Just from eyeballing the scatter plot, it doesn't look like there is much of a relationship. You could improve the plot by choosing a different symbol for the dots - one that takes up less room or is translucent would be good. You might also want to transform the rainfall rate, perhaps using log (if the rate is always positive) or square root.

Then you could try fitting a loess line or other smooth line to the data.

Then you can try regression; how to do regression depends on the package you are using (and questions about code are off topic here) but this plot looks like Excel - I, personally, would avoid using Excel and use a statistics package instead, because they have more options and so on.

If you decide not to transform the data, then you are likely to have some influential points. You might try a robust regression or a quantile regression.

Related Solutions

Machine Learning – How to Detect Discernible Structure in Scatter Plots

Maximal Information coefficient is one method that has been used for this. "In statistics, the maximal information coefficient (MIC) is a measure of the strength of the linear or non-linear association between two variables X and Y."

Detecting Novel Associations in Large Data Sets. D. Reshef et. al

http://en.wikipedia.org/wiki/Maximal_information_coefficient

Solved – Detecting outliers along the distribution in a scatter plot

I think a funnel plot is a great idea. The challenge then is how to calculate the confidence band.

You need a distribution of allele frequencies for one SNP. This is the challenging step. I don't know enough about the subject to guess this, so I would just use the empirical probabilities.
If you have more than one SNP, possible mean values result from the combination of the possible values for each SNP.

Thus, you could do this:

ps <- prop.table(table((DF$mean_score)[DF$total_number_snps == 1]))
#        0.1         0.2         0.3         0.4         0.5         0.6         0.7 
#0.582089552 0.194029851 0.124378109 0.059701493 0.029850746 0.004975124 0.004975124

We assume that the probabilities for values > 0.7 are zero. The error we make with this assumption is negligible.

Now we can simulate data:

n <- 1e4
set.seed(42)
sims <- sapply(1:80, 
               function(k) 
                 rowSums(
                   replicate(k, sample((1:7)/10, n, TRUE, ps))) / k)
layout(t(1:2))
plot((mean_score) ~ total_number_snps, data = DF)
matplot(1:80, t(sims), pch = 1, col = 1)
layout(1)

You can see the same patterns in the simulated data as in your data.

Finally we can calculate quantiles:

quants <- apply(sims, 2, quantile, probs = c(0.025, 0.975))

plot((mean_score) ~ total_number_snps, data = DF)
matlines(1:80, t(quants), col = "red", lty = 2)

It looks like the assumption that the probability distribution for a single SNP's allele frequency is independent of the number of SNPs in a gene doesn't really hold for high numbers of SNPs (or the sample size is just too small, but you have more data).

Best Answer

Related Solutions

Machine Learning – How to Detect Discernible Structure in Scatter Plots

Solved – Detecting outliers along the distribution in a scatter plot

Related Question