Time Series Analysis – Is Modeling with Deep Learning/Classic Statistics Useless If p-Value Fails to Reject Null Hypothesis?

correlation, multivariate analysis, neural networks, p-value, time series

Working with general data (not time series), is it worth creating, e.g., a deep learning model if the p-value for the correlation between x and y does not let us reject the null hypothesis (zero correlation)?

Or, for time series and my task specifically:

Working with RNN/CNN/VAR models, or time series in general, is it even worth doing if several variables fail a Granger causality test?

I am working on a time series prediction task where I have two features (budgets) and one target (sales), and this question arose while exploring the data. The variance of the budget variable that fails the test is really low, while the one that passes has high variance. The target (sales) has high autocorrelation since it depends strongly on month, date, day of week, etc.; this can be clearly seen with a decomposition.
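For concreteness, the decomposition I mean looks roughly like this (a minimal sketch with placeholder data; the real series and seasonal period differ):

```python
# Minimal sketch of the decomposition mentioned above; `sales` stands in for
# my real series, and period=7 assumes a weekly cycle.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=120, freq="D")
weekly = 10 * (idx.dayofweek < 5)  # toy weekly pattern
sales = pd.Series(200 + weekly + rng.normal(0, 5, 120), index=idx)

result = seasonal_decompose(sales, model="additive", period=7)
result.plot()  # trend, seasonal and residual components
```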

Why I am doing this: I want to be able to adjust the features to maximize the output y, so the relationship between the features (budgets) and y (sales) is key.
Hence, just modelling the autocorrelation of the target y is useless in my case.
So maybe ARIMAX would be the way to go... but then I'm stuck at square one: what if the variables don't pass any tests? How can I be sure that I don't just pick up the autocorrelation of y instead of the relationship between the features and y?
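For reference, the ARIMAX direction I have in mind is roughly the sketch below (synthetic placeholder data; the order is an arbitrary starting point, not something I've tuned):

```python
# Sketch of an ARIMAX-style model via statsmodels' SARIMAX: the budgets enter
# as exogenous regressors alongside the autoregressive terms on sales.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=90, freq="D")
budgets = pd.DataFrame(
    {"budget1": rng.normal(100, 10, 90), "budget2": rng.normal(50, 5, 90)},
    index=idx,
)
sales = pd.Series(
    0.5 * budgets["budget1"].to_numpy() + rng.normal(0, 5, 90), index=idx
)

model = SARIMAX(sales, exog=budgets, order=(1, 1, 0))  # placeholder order
fit = model.fit(disp=False)
print(fit.summary())  # budget coefficients vs. the AR term on sales itself
```

I assume comparing this against the same model without exog (on held-out data) would show whether the budgets add anything beyond y's own autocorrelation.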

I want the general idea behind the model to work on different datasets, but since the relationships between variables can differ a lot from dataset to dataset, I would probably train one model per dataset.
The datasets are quite small (60-120 timesteps, depending on the dataset).

I stationarized the time series before the test. However, I have not checked that the data "can be adequately described by a linear model", so I will look at that next; probably it will not be.
Is there any best practice for checking causality for nonlinear relationships?
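What I did is roughly the following (a simplified sketch with synthetic stand-ins for my budget/sales series; the lag choice is a guess):

```python
# Stationarize by first-differencing, then run the Granger causality test.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
budget1 = pd.Series(rng.normal(100, 10, 100)).cumsum()
sales = pd.Series(0.3 * budget1.shift(1).bfill() + rng.normal(0, 5, 100))

data = pd.concat([sales.diff(), budget1.diff()], axis=1).dropna()
data.columns = ["d_sales", "d_budget1"]

# tests H0: "d_budget1 does NOT Granger-cause d_sales" at lags 1..4
grangercausalitytests(data[["d_sales", "d_budget1"]], maxlag=4)
```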

Let's say that instead of treating the data as a series I inject the time information as variables, e.g., instead of (budget1, budget2) plus a series I have features (budget1, budget2, day, month) per timestep. Am I running into any risks? Based on my previous understanding (high correlation between date and target), I assume I will see a strong correlation between the date variables and y, and maybe less so between the budgets and y.
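Something like the following is what I mean (toy data; in my real data the calendar pattern in sales is much stronger relative to the budget effect):

```python
# Turning the time index into explicit features next to the budgets.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=90, freq="D")
df = pd.DataFrame(
    {"budget1": rng.normal(100, 10, 90), "budget2": rng.normal(50, 5, 90)},
    index=idx,
)
df["day_of_week"] = df.index.dayofweek
df["month"] = df.index.month

# toy target: a strong weekly pattern plus a weaker budget effect
df["sales"] = (
    30 * (df["day_of_week"] < 5) + 0.5 * df["budget1"] + rng.normal(0, 5, 90)
)

print(df.corr()["sales"].sort_values())  # calendar columns likely dominate
```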

Another idea would be to sum the budgets for a month, predict the total sales for that month, and then roll forward one timestep per sample.
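Concretely, that would look something like this (synthetic placeholder data), though with only 60-120 daily timesteps it collapses to very few monthly samples:

```python
# Collapse daily budgets/sales to monthly totals, one row per month.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=120, freq="D")
daily = pd.DataFrame(
    {"budget1": rng.normal(100, 10, 120),
     "budget2": rng.normal(50, 5, 120),
     "sales": rng.normal(200, 20, 120)},
    index=idx,
)

monthly = daily.resample("MS").sum()  # "MS" = month-start frequency
print(monthly)  # 120 daily rows shrink to just 4 monthly samples
```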

So I'm thinking about this: I could adjust the features for a month, observe the output, and maybe from then on the data could pass the test. I assume quite drastic shifts in the values would work best. Could this be a smart strategy?

Best Answer

Welcome to Cross Validated!

Like any statistical test, the Granger causality test has certain assumptions that need to be met for its results to be valid/useful.

For the Granger causality test, the assumptions are: (i) that each time series is covariance stationary (i.e., its mean and variance do not change over time), and (ii) that the relationship can be adequately described by a linear model. source
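As a quick, partial check of assumption (i), you can run an Augmented Dickey-Fuller test on each series. A toy sketch (note that ADF tests for a unit root, which is only one of the ways a series can fail to be stationary):

```python
# ADF test: H0 is "the series has a unit root" (i.e., is non-stationary).
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = rng.normal(0, 1, 100).cumsum()  # a random walk, so non-stationary

stat, pvalue, *_ = adfuller(series)
print(f"ADF statistic = {stat:.2f}, p-value = {pvalue:.3f}")
# a large p-value suggests differencing the series before the Granger test
```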

(You will find that these two assumptions are common to many time-series statistical tests.) That's for time series.

You also ask about non-time-series data, and the problem is the same. "Correlation" is also a linear measure (meaning it does not capture, or captures poorly, any non-linear pattern: quadratic, polynomial, logarithmic, stepwise, etc.). Additionally, linear regression has various other assumptions (like normality and independence of samples). Note that autocorrelation violates the "independence" assumption of linear regression, so you can't directly apply it to an autocorrelated time series. There are dedicated time series techniques (ARMA and ARIMA models) to address this.
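Here is a toy illustration of that limitation: y below is a deterministic function of x, yet the Pearson correlation is near zero, while a mutual-information estimate (one possible non-linear dependence measure) clearly picks it up.

```python
# Perfect but non-linear, non-monotonic dependence that correlation misses.
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)
y = x ** 2  # fully determined by x, but quadratic

print(pearsonr(x, y))                               # r is near zero
print(mutual_info_regression(x.reshape(-1, 1), y))  # clearly positive
```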

With all of this in mind, the answer to your question is: No, it is not useless to try to model the relationship between two or more variables using complex (or any) methods just because a Granger causality test fails to reject its null hypothesis, or because you fail to find a high correlation.

Granted, if the correlation is low, that generally means it's less likely you'll find an easy pattern there. But it really depends on the problem and the kinds of relationships that exist in your data. Maybe there's some kind of complex conditional relationship between the dependent variables, etc.

To be honest, I don't fully understand your last question there after "So I'm thinking about this:", but I hope that my answer coincidentally answers that question too.

Essentially, it's often worth trying out some different analysis methods even if the Granger causality test fails, but do so carefully, because failure of a test like that does indicate (most of the time) that you have some complex relationship going on (if any at all).

Side note: You don't always need to jump to things as complex and resource-hungry as neural networks. Try out time series techniques like the ones I mentioned, maybe find a way to normalize your time series and remove autocorrelation (like differencing the series), and try regular linear regression on the differences. Or try a less deep/complicated machine learning method like a random forest. There are many options besides neural nets, and depending on the task, they might work equally well for less hassle.
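A minimal version of that simpler pipeline might look like the sketch below (synthetic data; in practice you would score on a held-out tail of the series rather than in-sample):

```python
# Difference both series to remove the trend/autocorrelation, then fit a plain
# linear regression and a random forest on the same differenced feature.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 120
budget1 = pd.Series(rng.normal(100, 10, n))
trend = np.arange(n, dtype=float)
sales = pd.Series(0.4 * budget1 + trend + rng.normal(0, 2, n))

d = pd.DataFrame({"d_sales": sales.diff(), "d_budget1": budget1.diff()}).dropna()
X, y = d[["d_budget1"]], d["d_sales"]

lin = LinearRegression().fit(X, y)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("linear R^2:", lin.score(X, y))  # in-sample fit, for illustration only
print("forest R^2:", rf.score(X, y))
```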

EDIT: I'm adding this based on OP adding some information to the question. Please see my comment under the question for critical info, mainly that 60-120 rows of data is barely enough for a good linear regression, much less any kind of machine learning or deep learning (which typically requires at least an order of magnitude more data).

You mentioned adding month/day features. With so little data, I wouldn't bother with that. Unless your data is weekly or something, you don't have enough of it to discern different patterns dependent on day of the year, day of the month, etc. Think about it: if you added month but your data (for example) only spans two months, it's quite unlikely you'll find a usable monthly pattern within two months.