I think a funnel plot is a great idea. The challenge then is how to calculate the confidence band.
You need a distribution of allele frequencies for one SNP. This is the challenging step. I don't know enough about the subject to guess this, so I would just use the empirical probabilities.
If you have more than one SNP, possible mean values result from the combination of the possible values for each SNP.
Thus, you could do this:
ps <- prop.table(table((DF$mean_score)[DF$total_number_snps == 1]))
# 0.1 0.2 0.3 0.4 0.5 0.6 0.7
#0.582089552 0.194029851 0.124378109 0.059701493 0.029850746 0.004975124 0.004975124
We assume that the probabilities for values > 0.7 are zero. The error we make with this assumption is negligible.
Now we can simulate data:
n <- 1e4
set.seed(42)
sims <- sapply(1:80,
function(k)
rowSums(
replicate(k, sample((1:7)/10, n, TRUE, ps))) / k)
layout(t(1:2))
plot((mean_score) ~ total_number_snps, data = DF)
matplot(1:80, t(sims), pch = 1, col = 1)
layout(1)
You can see the same patterns in the simulated data as in your data.
Finally we can calculate quantiles:
quants <- apply(sims, 2, quantile, probs = c(0.025, 0.975))
plot((mean_score) ~ total_number_snps, data = DF)
matlines(1:80, t(quants), col = "red", lty = 2)
It looks like the assumption that the probability distribution for a single SNP's allele frequency is independent of the number of SNPs in a gene doesn't really hold for high numbers of SNPs (or the sample size is just too small, but you have more data).
Best Answer
Just from eyeballing the scatter plot, it doesn't look like there is much of a relationship. You could improve the plot by choosing a different symbol for the dots - one that takes up less room or is translucent would be good. You might also want to transform the rainfall rate, perhaps using log (if the rate is always positive) or square root.
Then you could try fitting a loess line or other smooth line to the data.
Then you can try regression; how to do regression depends on the package you are using (and questions about code are off topic here) but this plot looks like Excel - I, personally, would avoid using Excel and use a statistics package instead, because they have more options and so on.
If you decide not to transform the data, then you are likely to have some influential points. You might try a robust regression or a quantile regression.