Longitudinal predictive models

data-mining · predictive-models

I have a predictive model to construct and am looking for some ideas on the approach. I have a large training dataset of customers' balances (think of a savings account) at times t = 1 to 36, representing monthly averages over three years. I want to construct a model that predicts the value of a customer's account at t = 4..36 using only information available at the end of t = 3. Hence this is a forecasting-type problem, but not in the time-series sense, where one observes a sequence of data significantly longer than the horizon to be predicted.

I am thinking of approaching this like a discrete hazard model, where a row in the dataset is created for every customer/month combination (so here each customer has 33 rows in the data, and a variable called t takes on the values 4 to 36). The "initial state" variables known for the customer at the end of month 3 are repeated on every one of that customer's rows. This is the setup of so-called person-period data.
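A minimal sketch of that construction, assuming (hypothetically) a wide pandas DataFrame `balances` with a `customer_id` column and monthly-average columns `m1`..`m36`; all names here are illustrative, not from the question:

```python
import pandas as pd

def make_person_period(balances: pd.DataFrame) -> pd.DataFrame:
    # Initial-state features known at the end of month 3 (here simply the
    # first three balances; any other summaries of t <= 3 could be added).
    initial = balances[["customer_id", "m1", "m2", "m3"]].rename(
        columns={"m1": "bal_1", "m2": "bal_2", "m3": "bal_3"}
    )
    # One row per customer/month for t = 4..36 (33 rows per customer).
    long = balances.melt(
        id_vars="customer_id",
        value_vars=[f"m{t}" for t in range(4, 37)],
        var_name="t",
        value_name="balance",
    )
    long["t"] = long["t"].str.lstrip("m").astype(int)
    # Repeat the initial-state variables on every row, as described above.
    return long.merge(initial, on="customer_id")
```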

I would then learn a regression model on this data using the initial-state variables and t.
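Continuing the same hypothetical sketch, that step could look like this (the estimator is an arbitrary stand-in; a linear model or a neural net would slot in the same way):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Pooled regression of balance on the initial-state features and t.
pp = make_person_period(balances)
features = ["bal_1", "bal_2", "bal_3", "t"]
model = GradientBoostingRegressor()
model.fit(pp[features], pp["balance"])
```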

There would be no accounting for the fact that customers are repeated in the data. This works for a discrete-time hazard model, but my question is whether it is OK for a linear regression or a machine-learning algorithm (e.g., a neural net). Is there a better way?

ADD: Specifically, I am wondering whether such a model will fail if it does not explicitly account for the repeated measures, i.e., the correlation between rows (the same customer at various values of t). When building a predictive model, how could this be accounted for?
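One concrete safeguard (my own suggestion, not from the thread): for pure point prediction, pooled estimators often remain usable under within-customer correlation, but naive K-fold cross-validation leaks rows of the same customer across folds and flatters the model. Grouping folds by customer avoids that; a sketch with the hypothetical names above:

```python
from sklearn.model_selection import GroupKFold, cross_val_score

# Group folds by customer so that all 33 rows of a given customer
# land in the same fold, keeping the evaluation honest.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    model, pp[features], pp["balance"],
    groups=pp["customer_id"], cv=cv,
    scoring="neg_mean_absolute_error",
)
print(scores.mean())
```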

Best Answer

This is phase 1 of my answer. First, I want to make sure that I understand the model. Take just one customer, say customer $i$, and denote by $s_{it}$ their monthly savings balance. From what the OP writes, the model appears to be (for one customer, and assuming linearity for the moment)

$$ s_{it} = g(a_1t, a_2t^2, a_3t^3, \dots) + X_{i3}\beta + u_{it},\qquad t=4,\dots,36$$

...where $g(a_1t, a_2t^2, a_3t^3, \dots)$ represents various time variables (which enter additively), and $X_{i3}$ is the vector containing the "initial state data" (of time period $3$). Is this an accurate depiction of the general idea? If it is, I have one question and one remark:
Question: what happens to the data from periods $1, 2, 3$? Aren't they used somehow (apart from in $X_{i3}$)?

Remark: if all regressors in $X_{i3}$ are time-invariant, then their effects on $s_{it}$ cannot be separated: they form a composite "intercept".
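To spell the remark out (my elaboration, not the answerer's): for a fixed customer $i$, $X_{i3}\beta$ does not vary with $t$, so over $t=4,\dots,36$ the model is observationally equivalent to

$$ s_{it} = g(a_1t, a_2t^2, a_3t^3, \dots) + c_i + u_{it},\qquad c_i \equiv X_{i3}\beta, $$

and no fit on periods $4,\dots,36$ alone can split the composite $c_i$ back into the individual coefficients in $\beta$.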

Waiting on OP's feedback.

PHASE 2
Given the OP's clarifications (although I must note that it is not "for simplicity" that a time-invariant regressor "does not change with t"), let's explore the possible model

$$ s_{it} = \sum_{l=1}^{k}a_lt^l + X_i\beta + u_{it},\qquad t=4,\dots,36$$

My previous remark stands: whatever is in $X_i$ (regressors specific to customer $i$, products of such regressors, or even some collective variables from periods $1,2,3$), to the degree that it is time-invariant over the period $4,\dots,36$, its effects on the dependent variable cannot be separated. Moreover, $X_i$ is known; if its elements are measured in the same units as the dependent variable, their sum should be subtracted from the dependent variable rather than kept on the RHS of the equation.

Moreover, judging by a comment of the OP under another answer, it appears that an assumption that customers exhibit similar behavior is being made: namely, that we could use some accounts to estimate the regression and then use the estimated coefficients to predict the behavior of *other* accounts, presumably of other customers.

Taking all of the above into account, we are led to a model that can be written $$ s_{it} - \sum_{j=1}^{m}x_{ij} = \sum_{l=0}^{k}a_lt^l + u_{it},\qquad t=4,\dots,36,\qquad i=1,\dots,n$$

...where $n$ is the number of customers comprising the training sample. This model could be estimated by one of the many panel-data estimators available, given also the constraints imposed by the structure of the specification. But although the panel data would permit better estimation of the coefficients involved, the main issue remains: we are depending only on the powers of $t$ to capture the variability of the dependent variable. Personally, I don't "trust" that kind of model: it is a classic case where you can obtain a "perfect fit" by adding more and more powers of $t$, only to find that the "perfectly fitted" model is the worst when it comes to out-of-sample prediction.
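For concreteness, a minimal sketch of the pooled estimation, reusing the hypothetical person-period frame `pp` from the question's sketch and treating the three initial balances as the known, same-unit $x_{ij}$:

```python
import numpy as np
import statsmodels.api as sm

# Pooled OLS of the adjusted balance on powers of t; k is the polynomial
# order whose inflation the text warns against.
k = 3
y = pp["balance"] - pp[["bal_1", "bal_2", "bal_3"]].sum(axis=1)
T = np.column_stack([pp["t"] ** l for l in range(k + 1)])  # t^0 ... t^k
fit = sm.OLS(y, T).fit()
print(fit.params)  # estimates of a_0 ... a_k
```

Raising `k` mechanically improves the in-sample fit, which is exactly the overfitting trap described above.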

What then? I believe that VARMA modelling (also discussed in another answer) is a better way to go, namely a vector autoregressive moving-average model:

$$s_{it} - \sum_{j=1}^{m}x_{ij} = \phi(L)s_{it} + \psi(L)u_{it},\qquad t=4,\dots,36,\qquad i=1,\dots,n$$

where $\phi(L)$ and $\psi(L)$ are polynomials in the lag operator $L$, which acts on the time index $t$.
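As a rough illustration only (an ARMA(1,1) per customer, chosen arbitrarily; `some_id` is a placeholder, and with just 33 training points per series, estimating the lag polynomials jointly across customers, as the panel framing suggests, would be preferable):

```python
from statsmodels.tsa.arima.model import ARIMA

# Fit an ARMA(1,1) to one customer's adjusted series and forecast ahead.
cust = pp[pp["customer_id"] == some_id].sort_values("t")
y_i = cust["balance"] - cust[["bal_1", "bal_2", "bal_3"]].sum(axis=1)
arma = ARIMA(y_i.to_numpy(), order=(1, 0, 1)).fit()  # ARMA = ARIMA with d=0
print(arma.forecast(steps=6))  # six months ahead, on the adjusted scale
```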
