Why not just use a binary indicator for the event as the target variable and the length of the time period as an explanatory variable (plus other covariates)? If the event happens, the target is 1 and the time period is calculated as the time until the event happens minus the start time. For observations where the target is 0, this time period is 36, if measured in months; for observations where the target is 1, it can be much less.
Can there be panel attrition, where some observations are removed from the data set before the whole surveillance period is over? That must be accounted for somehow.
To get individual survival probabilities for various time intervals, score a new data set with the just-developed model object $i$ times, where $i$ is the number of different time periods, setting the time-period variable to each particular period's value in turn. Then concatenate the $i$ vectors containing the time-period-specific probabilities.
The idea is that the measured time period, together with the other covariates, accounts for the survival probability conditional on the length of time in the observational study.
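A minimal sketch of this scoring scheme, assuming a discrete-time setup where each row carries the covariates, the elapsed period, and a 0/1 event target. All data here is synthetic and the variable names are illustrative, not from the original question:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))              # two covariates
period = rng.integers(1, 37, size=n)     # observed follow-up length in months
# synthetic event indicator: risk grows with covariate 1 and elapsed time
p_event = 1 / (1 + np.exp(-(0.8 * X[:, 0] + 0.05 * period - 2)))
event = rng.binomial(1, p_event)

# time period enters the model as just another explanatory variable
features = np.column_stack([X, period])
model = LogisticRegression().fit(features, event)

# score one new observation once per candidate period and concatenate
# the period-specific event probabilities, as described above
x_new = np.array([0.5, -1.0])
periods = np.arange(1, 37)
grid = np.column_stack([np.tile(x_new, (len(periods), 1)), periods])
probs = model.predict_proba(grid)[:, 1]  # one probability per period
```

The resulting `probs` vector is the per-period event-probability profile for that one new observation; stacking such vectors over many observations gives the concatenation described above.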
EDIT:
I looked into the R package neuralnet. You can have individual time-period-specific survival events in the target matrix in the following way: C1 is covariate 1, T1 is the vector of survival events in time period 1, and so on. Your data frame / matrix could look like this:
ID T1 T2 T3 T4 T5 T6 C1  C2  C3  ... CN
1  1  1  1  1  1  1  X11 X12 X13 ... X1N
2  1  0  0  0  0  0  X21 X22 X23 ... X2N
...
Use the following code:
survexample = neuralnet(T1 + T2 + T3 + T4 + T5 + T6 ~ C1 + C2 + ... + CN, data = example, hidden = n, err.fct = "ce", linear.output = FALSE)
This example code performs classification and forces the output vector values into the range [0, 1].
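If you prefer Python, the same multi-period target-matrix idea can be sketched with sklearn's `MLPClassifier`, which accepts a binary indicator matrix as the target and then returns one probability per period. The data below is synthetic and the layer size is an arbitrary illustration, not a recommendation:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 4))  # covariates C1..C4

# T1..T6 as in the table above: once survival drops to 0 in one period,
# it stays 0 in all later periods (enforced here via a cumulative product)
keep = rng.random((n, 6)) < np.array([0.9, 0.8, 0.8, 0.8, 0.8, 0.8])
T = np.cumprod(keep.astype(int), axis=1)

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X, T)                     # multilabel indicator matrix as target
probs = net.predict_proba(X[:2])  # per-period survival probabilities
```

With a multilabel target, `predict_proba` returns one column per period, so each row of `probs` is the full survival profile for one observation.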
You can use the elasticnet penalty in sklearn's Logistic Regression classifier:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5)
LogisticRegressionCV will take these parameters as well (note that it expects l1_ratios, a list of mixing ratios to search over, rather than a single l1_ratio).
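A self-contained version of the snippet above, fitted on a synthetic data set; the sample sizes and the grid of `l1_ratios` are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# elastic net mixes L1 and L2 penalties; it requires the 'saga' solver
lr = LogisticRegression(penalty='elasticnet', solver='saga',
                        l1_ratio=0.5, max_iter=5000)
lr.fit(X, y)

# the CV variant cross-validates over a list of mixing ratios
lr_cv = LogisticRegressionCV(penalty='elasticnet', solver='saga',
                             l1_ratios=[0.1, 0.5, 0.9],
                             max_iter=5000, cv=3)
lr_cv.fit(X, y)
```

After fitting, `lr_cv.l1_ratio_` reports the mixing ratio selected by cross-validation.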
Best Answer
I think the crucial part to consider in answering your question is the statement you make about your goal, because it implies something about why you want to use the model. Model choice and evaluation should be based on what you want to achieve with your fitted values.
First, let's recap what $R^2$ does: it computes a scaled measure based on the quadratic loss function, which I am sure you are already aware of. To see this, define the residual $e_i = y_i - \hat{y}_i$ for the $i$-th observation $y_i$ and the corresponding fitted value $\hat{y}_i$. Using the convenient notation $SSR := \sum_{i=1}^N e_i^2$ and $SST := \sum_{i=1}^N (y_i - \bar{y})^2$, $R^2$ is simply defined as $R^2 = 1 - SSR/SST$.
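To make the definition concrete, here is a quick numeric check (the values are made up for illustration):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])        # observed values
y_hat = np.array([2.5, 5.5, 6.5, 9.5])    # fitted values

e = y - y_hat                             # residuals e_i
SSR = np.sum(e ** 2)                      # sum of squared residuals = 1.0
SST = np.sum((y - y.mean()) ** 2)         # total sum of squares = 20.0
R2 = 1 - SSR / SST                        # = 0.95
```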
Second, let us see what using $R^2$ for model choice/evaluation means. Suppose we choose from sets of predictions $\hat{Y}_M$ that were generated using a model $M \in \mathcal{M}$, where $\mathcal{M}$ is the collection of models under consideration (in your example, this collection would contain neural networks, random forests, elastic nets, ...). Since $SST$ remains constant across all the models, by maximizing $R^2$ you will choose exactly the model that minimizes $SSR$. In other words, you will choose the $M \in \mathcal{M}$ that produces the minimal squared error loss!
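A small sketch of this equivalence, using made-up predictions from three hypothetical models on the same target:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
preds = {  # hypothetical fitted values from three candidate models
    'model_a': np.array([1.1, 2.2, 2.9, 4.3, 4.8]),
    'model_b': np.array([0.5, 2.5, 3.5, 3.5, 5.5]),
    'model_c': np.array([1.0, 2.0, 3.1, 4.0, 5.2]),
}

SST = np.sum((y - y.mean()) ** 2)         # fixed across all models
ssr = {m: np.sum((y - p) ** 2) for m, p in preds.items()}
r2 = {m: 1 - s / SST for m, s in ssr.items()}

best_by_r2 = max(r2, key=r2.get)          # maximize R^2
best_by_ssr = min(ssr, key=ssr.get)       # minimize SSR
# both criteria select the same model, since SST is constant
```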
Third, let us consider why $R^2$, or equivalently $SSR$, might be interesting for model choice. Traditionally, the squared loss ($L^2$ norm) is used for three reasons: (1) it is easier to compute than Least Absolute Deviations (LAD, the $L^1$ norm) because no absolute value appears in the computation; (2) it punishes fitted values that are far off from the actual value much more than LAD (in a squared rather than an absolute sense) and thereby ensures fewer extreme outliers; (3) it is symmetric: over- or underestimating the price of a car is considered equally bad.
Fourth (and last), let us see if this is what you need for your predictions. The point that might be of most interest here is (3) from the last paragraph. Suppose you want to take a neutral stance, and you are neither buyer nor seller of a car. Then $R^2$ can make sense: you are impartial, and you wish to punish over- and underpricing identically. The same applies if you just want to model the relation between the quantities without wishing to predict unobserved values. Now suppose you are working for a consumer/buyer on a tight budget: in this situation, you might want to punish overestimation of the price in a quadratic sense, but underestimation in an $L^p$ sense, where $1 \leqslant p < 2$. For $p=1$, you would punish in an absolute deviation sense. This reflects the goals and intentions of the buyer, and biasing the estimation downward might be in his/her interest. Conversely, you could flip the thinking if you were to model the price predictions for the seller. Needless to say, any norm $L^p$ could be chosen to reflect the preferences of the modeller/the agent you model for. You can also punish outside of the $L^p$ norm entirely, and use constant, exponential, or log loss on one side and a different loss on the other.
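An asymmetric loss of the kind described above might be sketched like this; the specific functional form (quadratic on one side, absolute on the other) is one illustrative choice, not a standard:

```python
import numpy as np

def buyer_loss(y_true, y_pred):
    """Penalize overestimates quadratically, underestimates absolutely (p = 1)."""
    e = y_pred - y_true
    return np.where(e > 0, e ** 2, np.abs(e)).mean()

true_price = np.array([10000.0, 12000.0])
over = np.array([11000.0, 13000.0])    # each overestimates by 1000
under = np.array([9000.0, 11000.0])    # each underestimates by 1000

# the errors are symmetric in absolute terms, yet the buyer's loss
# punishes the overestimates far more heavily than the underestimates
```

Fitting a model by minimizing such a loss (instead of $SSR$) deliberately biases the predictions downward, in line with the buyer's interests described above.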
In summary, model choice/evaluation cannot be considered independently of the model's aim.