Solved – Logistic regression loss with continuous labels

logistic, loss-functions

I want to fit a logistic regression model, but my outcome variable takes continuous values in $[0, 1]$, not just the binary labels $\{0, 1\}$.

My model therefore still needs a logistic link (to bound my predictions in $[0, 1]$), but the log loss is not appropriate in my case: I cannot formulate my problem as a series of Bernoulli trials, as in standard binary logistic regression.

Any suggestions for an appropriate loss in my case?

Best Answer

If $Y$ cannot be exactly 0 or 1, you can just take the logit transformation of $Y$ and use ordinary least squares (OLS). Such a model treats the errors as being on the logit scale (the scale of the linear predictor in the OLS model), which is reasonable here. You cannot use nonlinear least squares because that would assume errors on the original $Y$ scale, resulting in predictions outside $[0,1]$. I'm not sure what your reference to log loss means. The transformed OLS model would be assuming Gaussian errors.
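
As a rough illustration of the logit-then-OLS idea, here is a minimal sketch in base R. The simulated data, the coefficient values and the variable names are all assumptions made up for the example; only `qlogis` (logit), `plogis` (inverse logit) and `lm` are standard R functions.

```r
# Hypothetical simulated data: y lies strictly inside (0, 1)
set.seed(42)
n <- 200
x <- rnorm(n)
eta <- -0.5 + 1.2 * x + rnorm(n, sd = 0.7)  # Gaussian errors on the logit scale
y <- plogis(eta)                             # inverse logit maps to (0, 1)

# OLS on the logit-transformed outcome
fit <- lm(qlogis(y) ~ x)

# Back-transform predictions; plogis() always returns values in (0, 1)
pred <- plogis(predict(fit, newdata = data.frame(x = c(-2, 0, 2))))
```

Because the inverse logit is nonlinear, the back-transformed prediction should be read as an estimate of the conditional median of $Y$ rather than its mean, under the Gaussian-error assumption.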

Related Solutions

Solved – Logistic regression-like model for non-discrete outcomes

If you have "continuous" (seemingly, as they could still be discrete) values in between 0 and 1 there are at least two cases:

1. They came from a number of independent binary trials, and the "continuous" value is the number of successes divided by the number of trials. Then a binomial GLM might be appropriate. In this case you need to fit it in R as glm(cbind(numberSuccesses, numberFailures) ~ x, family = binomial).
2. If that is not the case, then you might have something for which a beta model might be more appropriate. The link I provided shows how to do that in R.
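
For the second case, a beta model with a logit-linear mean can be sketched in base R by maximizing the likelihood directly. This is only an illustration under assumptions: the simulated data, the mean/precision parameterization (`mu`, `phi`) and the helper name `negloglik` are invented for the example, and dedicated packages such as betareg offer a fuller implementation.

```r
# Hypothetical simulated data: beta-distributed outcome, logit-linear mean
set.seed(1)
n <- 500
x <- rnorm(n)
mu_true <- plogis(0.3 + 0.8 * x)
phi_true <- 20
y <- rbeta(n, mu_true * phi_true, (1 - mu_true) * phi_true)

# Negative log-likelihood; par = (intercept, slope, log precision)
negloglik <- function(par) {
  mu <- plogis(par[1] + par[2] * x)   # mean constrained to (0, 1)
  phi <- exp(par[3])                  # precision constrained to be positive
  -sum(dbeta(y, mu * phi, (1 - mu) * phi, log = TRUE))
}

# Start from the logit-OLS coefficients and log precision 0
start <- c(unname(coef(lm(qlogis(y) ~ x))), 0)
fit <- optim(start, negloglik, method = "BFGS")
est <- c(intercept = fit$par[1], slope = fit$par[2], phi = exp(fit$par[3]))
```

The likelihood is only defined for outcomes strictly between 0 and 1, so exact zeros or ones would need separate handling.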

Note that in R, glm(y ~ x, family = binomial) with a "continuous" $y$ will issue a warning, and in general the result will differ from the fit based on the numbers of successes and failures:

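set.seed(1)
# 100 observations, each a number of successes out of n = 12 trials
successes <- sample(1:10, 100, replace = TRUE)
x <- 1:100
n <- 12
failures <- n - successes

# Binomial GLM on the (successes, failures) counts
summary(glm(cbind(successes, failures) ~ x, family = binomial))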
Call:
glm(formula = cbind(successes, failures) ~ x, family = binomial)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8197  -0.9434   0.0454   0.9358   2.4921  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -0.24622    0.11349   -2.17     0.03 *
x            0.00080    0.00195    0.41     0.68  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 134.99  on 99  degrees of freedom
Residual deviance: 134.82  on 98  degrees of freedom
AIC: 422.2

Number of Fisher Scoring iterations: 3

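whereas fitting the proportions directly gives:

# Same data expressed as proportions, without the number of trials
props <- successes / n
summary(glm(props ~ x, family = binomial))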

Call:
glm(formula = props ~ x, family = binomial)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-0.852  -0.282  -0.105   0.394   0.760  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.134339   0.403836   -0.33     0.74
x            0.000281   0.006941    0.04     0.97

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20.888  on 99  degrees of freedom
Residual deviance: 20.887  on 98  degrees of freedom
AIC: 141.3

Number of Fisher Scoring iterations: 3

Warning message:
In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!
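
If the number of trials is known, the proportion form can still be used by passing the trial counts through the `weights` argument of `glm`, which is how R's binomial family interprets a proportion response. The sketch below regenerates the same kind of data; the object names `fit_counts` and `fit_props` are just labels chosen for this example.

```r
set.seed(1)
n <- 12
successes <- sample(1:10, 100, replace = TRUE)
failures <- n - successes
x <- 1:100
props <- successes / n

# Counts form: two-column response of successes and failures
fit_counts <- glm(cbind(successes, failures) ~ x, family = binomial)

# Proportion form: the number of trials is supplied as prior weights
fit_props <- glm(props ~ x, family = binomial,
                 weights = rep(n, length(props)))

# Compare the two sets of coefficients
same_fit <- all.equal(coef(fit_counts), coef(fit_props))
```

With the weights supplied, the two fits should agree, and the non-integer warning should not appear because `weights * props` recovers the integer counts.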

Classification with Noisy Labels – Techniques and Methods

The right thing to do here is to change the model, not the loss. Your goal is still to correctly classify as many data points as possible (which determines the loss), but your assumptions about the data have changed (which are encoded in a statistical model, the neural network in this case).

Let $\mathbf{p}_t$ be a vector of class probabilities produced by the neural network and $\ell(y_t, \mathbf{p}_t)$ be the cross-entropy loss for label $y_t$. To explicitly take into account the assumption that 30% of the labels are noise (assumed to be uniformly random), we could change our model to produce

$$\mathbf{\tilde p}_t = 0.3/N + 0.7 \mathbf{p}_t$$

instead and optimize

$$\sum_t \ell(y_t, 0.3/N + 0.7 \mathbf{p}_t),$$

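where $N$ is the number of classes. Because every component of $\mathbf{\tilde p}_t$ is at least $0.3/N$, the loss for any single example is at most $-\log(0.3/N)$, so this behaves somewhat according to your intuition by keeping the loss finite.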
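
To make the effect of the mixture concrete, here is a small sketch in R for a single example. The function name `noisy_cross_entropy` and the toy probability vector are hypothetical; the 0.3 noise fraction is the one assumed in the text.

```r
# Cross-entropy for one example under the noise-mixture model.
# p:     class probabilities produced by the network (sums to 1)
# y:     index of the observed, possibly noisy, label
# noise: assumed fraction of uniformly random labels
noisy_cross_entropy <- function(p, y, noise = 0.3) {
  N <- length(p)
  p_tilde <- noise / N + (1 - noise) * p   # mix with the uniform distribution
  -log(p_tilde[y])
}

p <- c(1, 0, 0)                          # network is fully confident in class 1
plain <- -log(p[3])                      # ordinary cross-entropy for label 3
robust <- noisy_cross_entropy(p, y = 3)  # mixture-model loss for label 3
bound <- -log(0.3 / length(p))           # largest loss the mixture allows
```

Here the ordinary cross-entropy is infinite because the network assigns probability zero to the observed label, while the mixture loss equals its upper bound $-\log(0.3/N)$.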
