Solved – Correlation between two data sets with same x-axis values (year) and different y values

correlation, python, r

I have two data sets, both covering 1996-2016. However, the y-axis values are on completely different scales. The first is mean NDVI, centered on its mean (0.1865) so that 0 corresponds to the mean; the yearly deviations from the mean range from -0.03 to 0.03. The second is the Palmer Drought Severity Index over the same date range, with values ranging from -7 to 6.

I want to know the best way to find the correlation between these two datasets, and would prefer to do it in Python. Here is an image of the two plots; both have the same x-axis, which is the year range 1996-2016.

Best Answer

The different ranges of your data are no problem, since scaling (both or just one of them) to, e.g., mean zero and unit variance does not change the correlation between them. However, if you want to correlate the two data vectors you have, they need to have the same length. That is, if your PMDI vector has more data points than your other vector, you need to find a way (e.g. taking the mean over some period) to summarise your PMDI vector in fewer data points. For calculating correlation in Python, see e.g. https://stackoverflow.com/questions/19428029/how-to-get-correlation-of-two-vectors-in-python
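As a minimal sketch of the two points above, the following uses NumPy on made-up data (the arrays `pdsi_monthly` and `ndvi_anomaly` and the helpers `pearson_r` and `standardize` are illustrative assumptions, not the asker's data or code). It first averages a longer monthly series down to one value per year so both vectors have 21 points, then shows that standardizing does not change the Pearson correlation.

```python
import numpy as np

# Hypothetical data for 1996-2016 (21 years); NOT the asker's real values.
rng = np.random.default_rng(0)
years = np.arange(1996, 2017)
pdsi_monthly = rng.uniform(-7, 6, size=(years.size, 12))  # 12 values per year
pdsi_annual = pdsi_monthly.mean(axis=1)                   # summarise to one value per year
ndvi_anomaly = 0.004 * pdsi_annual + rng.normal(0, 0.005, size=years.size)


def pearson_r(x, y):
    """Pearson correlation of two equal-length 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same length")
    return float(np.corrcoef(x, y)[0, 1])


def standardize(v):
    """Rescale to mean zero and unit variance."""
    return (v - v.mean()) / v.std()


r_raw = pearson_r(ndvi_anomaly, pdsi_annual)
r_scaled = pearson_r(standardize(ndvi_anomaly), standardize(pdsi_annual))
print(r_raw, r_scaled)  # the two values agree up to floating-point error
```

If a p-value is also wanted, `scipy.stats.pearsonr` returns both the coefficient and a p-value, at the cost of an extra dependency.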

Related Solutions

Solved – Visualising a linear model with 6 predictors in R

Here is some code that is hopefully self-explanatory:

set.seed(20987)     # for reproducibility

N = 200

  # variables
days_since   = rpois(N, lambda=60)
site         = factor(sample(c("site1", "site2"), N, replace=T), c("site1", "site2"))
age          = factor(sample(c("juv", "adult"),   N, replace=T), c("juv", "adult"))
year         = factor(sample(c("2012", "2013"),   N, replace=T), c("2012", "2013"))
PC1          = rnorm(N, mean=100, sd=25)
arrival_date = sample.int(365, N, replace=T)

  # betas
B0  =  13
Bds =  74
Bs  = 114
Ba  = 160
By  = 191
Bpc =  59
Bad =  11

  # response variable
weight = B0 + Bds*days_since + Bs*(site=="site2") + Ba*(age=="adult") + 
         By*(year=="2013") + Bpc*PC1 + Bad*arrival_date + rnorm(N, mean=0, sd=10)

model = lm(weight~days_since+site+age+year+PC1+arrival_date)

  # predicted values for plot
ds    = seq(min(days_since), max(days_since))
ds1j2 = predict(model, data.frame(days_since=ds, site="site1", age="juv",   
                       year="2012", PC1=mean(PC1), arrival_date=mean(arrival_date)))
ds1j3 = predict(model, data.frame(days_since=ds, site="site1", age="juv",   
                       year="2013", PC1=mean(PC1), arrival_date=mean(arrival_date)))
ds1a2 = predict(model, data.frame(days_since=ds, site="site1", age="adult", 
                       year="2012", PC1=mean(PC1), arrival_date=mean(arrival_date)))
ds1a3 = predict(model, data.frame(days_since=ds, site="site1", age="adult", 
                       year="2013", PC1=mean(PC1), arrival_date=mean(arrival_date)))
ds2j2 = predict(model, data.frame(days_since=ds, site="site2", age="juv",   
                       year="2012", PC1=mean(PC1), arrival_date=mean(arrival_date)))
ds2j3 = predict(model, data.frame(days_since=ds, site="site2", age="juv",   
                       year="2013", PC1=mean(PC1), arrival_date=mean(arrival_date)))
ds2a2 = predict(model, data.frame(days_since=ds, site="site2", age="adult", 
                       year="2012", PC1=mean(PC1), arrival_date=mean(arrival_date)))
ds2a3 = predict(model, data.frame(days_since=ds, site="site2", age="adult", 
                       year="2013", PC1=mean(PC1), arrival_date=mean(arrival_date)))

  # plot
dev.new()   # opens a new graphics device on any platform (windows() is Windows-only)
  plot(x=ds, y=ds1j2, ylim=c(11000, 14500), type="l", lty=1,
       ylab="predicted weight", xlab="days since 1st Sept")
                                points(range(ds), range(ds1j2), pch=5)
  lines(x=ds, y=ds1j3, lty=1);  points(range(ds), range(ds1j3), pch=18)
  lines(x=ds, y=ds1a2, lty=2);  points(range(ds), range(ds1a2), pch=5)
  lines(x=ds, y=ds1a3, lty=2);  points(range(ds), range(ds1a3), pch=18)
  lines(x=ds, y=ds2j2, lty=1);  points(range(ds), range(ds2j2), pch=1)
  lines(x=ds, y=ds2j3, lty=1);  points(range(ds), range(ds2j3), pch=16)
  lines(x=ds, y=ds2a2, lty=2);  points(range(ds), range(ds2a2), pch=1)
  lines(x=ds, y=ds2a3, lty=2);  points(range(ds), range(ds2a3), pch=16)

  legend("bottomright", lty=rep(c(1,1,2,2), 2), pch=c(5,18,5,18,1,16,1,16), 
         legend=c("site 1, juveniles, 2012", "site 1, juveniles, 2013", 
                  "site 1, adults,    2012", "site 1, adults,    2013", 
                  "site 2, juveniles, 2012", "site 2, juveniles, 2013", 
                  "site 2, adults,    2012", "site 2, adults,    2013"))

You can write code that's much shorter by writing functions that will read in a list and do all of this for you rather than copying and pasting the same thing eight times in a row, but this should be easier to follow. Here is the plot:

[Figure: predicted weight against days since 1st Sept, one line per combination of site, age and year]

This kind of plot is more interesting / useful when there are interactions (the lines aren't parallel). In this case, we just have a set of eight lines that are shifted vertically relative to each other.

Solved – Correlation coefficient between two data sets having points of different dimensions

I think you are looking for multiple correlation. In short, just as in simple regression the correlation ($r$) is, up to sign, the square root of the coefficient of determination ($R^2$), in multiple regression you can use the square root of $R^2$ as the multiple correlation coefficient.

Most statistical packages compute $R^2$ when performing multiple regression. So you just need to regress your real-valued variable on the components of your vector variable as predictors, extract $R^2$, and take its square root.

I have never tried it, but in Python it can be done with NumPy.
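A minimal sketch of that recipe with NumPy, assuming an ordinary least-squares fit with an intercept. The function name `multiple_correlation` and the simulated data are illustrative assumptions, not part of the original answer.

```python
import numpy as np


def multiple_correlation(y, X):
    """Return R = sqrt(R^2) from an OLS fit of y on the columns of X, with intercept."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]                                # treat a 1-D input as one predictor
    A = np.column_stack([np.ones(len(y)), X])         # design matrix with intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)      # least-squares coefficients
    resid = y - A @ coef
    ss_res = float(resid @ resid)                     # residual sum of squares
    ss_tot = float(((y - y.mean()) ** 2).sum())       # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return float(np.sqrt(max(r2, 0.0)))               # guard against tiny negative round-off


# Hypothetical example: a scalar response and a 3-component vector variable.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=50)
R = multiple_correlation(y, X)
print(R)
```

With a single predictor this reduces to the absolute value of the ordinary Pearson correlation, which is a convenient sanity check.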
