I think it's perfectly well justified. (In fact, I've used this convention in papers I've published; or you can call them "nats" if you prefer to stick with base-$e$ logarithms.)
The justification runs as follows: the log-likelihood of the fitted model can be viewed as a Monte Carlo estimate of the (negative) KL divergence between the "true" (unknown) data distribution and the distribution implied by the fitted model, up to a constant. Let $P(x)$ denote the "true" distribution of the data, and let $P_\theta(x)$ denote the distribution (i.e., the likelihood $P(x \mid \theta)$) provided by a model.
Maximum likelihood fitting involves maximizing
$L(\theta) = \frac{1}{N}\sum_i \log P_\theta(x_i) \approx \int P(x) \log P_\theta(x) dx$
The left-hand side (the log-likelihood, scaled by the number of datapoints $N$) is a Monte Carlo estimate of the right-hand side, since the datapoints $x_i$ were drawn from $P(x)$. So we can rewrite
$L(\theta) \approx \int P(x) \log P_\theta(x) dx = \int P(x) \log \frac{P_\theta(x)}{P(x)} dx + \int P(x) \log P(x)dx$
$ = -D_{KL}(P,P_\theta) - H(x)$
So the log-likelihood normalized by the number of points is an estimate of the (negative) KL divergence between $P$ and $P_\theta$ minus the (true) entropy of $x$. The KL divergence has units of "bits" (if we use base-2 logarithms), and can be understood as the number of "extra bits" you would need to encode data from $P(x)$ using a codebook based on $P_\theta(x)$. (If $P = P_\theta$, you don't need any extra bits, so the KL divergence is zero.)
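To make this concrete, here's a quick numerical sanity check (my own sketch; the particular distributions are just illustrative): with a standard-normal "true" $P$ and a deliberately mis-specified Gaussian model $P_\theta$, the average base-2 log-likelihood should come out close to $-(D_{KL}(P,P_\theta) + H(x))$ in bits, using the closed-form Gaussian expressions for both terms.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)          # samples from the "true" P = N(0, 1)

mu, sigma = 0.3, 1.2                      # a (deliberately mis-specified) model P_theta
avg_ll_bits = np.mean(norm.logpdf(x, mu, sigma)) / np.log(2)

# Closed forms, in bits: the entropy of a standard normal, and
# D_KL(N(0,1) || N(mu, sigma^2)) from the Gaussian KL formula.
H_bits = 0.5 * np.log2(2 * np.pi * np.e)
kl_bits = (np.log(sigma) + (1 + mu**2) / (2 * sigma**2) - 0.5) / np.log(2)

print(avg_ll_bits, -(kl_bits + H_bits))   # the two numbers should nearly match
```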
Now: when you take the log-likelihood ratio of two different models (again normalized by $N$), the same argument gives:
$\frac{1}{N}\sum_i \log \frac{P_{\theta_1}(x_i)}{P_{\theta_2}(x_i)} \approx D_{KL}(P,P_{\theta_2}) - D_{KL}(P,P_{\theta_1})$
The entropy $H(x)$ terms cancel. So the log-likelihood ratio (normalized by $N$) is an estimate of the difference between the KL divergence from the true distribution to model 2's distribution and the KL divergence from the true distribution to model 1's distribution. It's therefore an estimate of the number of "extra bits" you would need to code your data with model 2 compared to coding it with model 1. So I think the "bits" units are perfectly well justified.
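The same kind of numerical check works here (again a sketch with illustrative Gaussian models): the normalized LLR in bits should track the difference of the two closed-form KL divergences.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)          # samples from the "true" P = N(0, 1)

mu1, s1 = 0.1, 1.1                        # model 1 (closer to the truth)
mu2, s2 = 0.5, 1.5                        # model 2 (worse)

# Normalized log-likelihood ratio, converted from nats to bits.
llr_bits = np.mean(norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2)) / np.log(2)

def kl_bits(m, s):
    """D_KL(N(0,1) || N(m, s^2)) in bits, from the closed-form Gaussian KL."""
    return (np.log(s) + (1 + m**2) / (2 * s**2) - 0.5) / np.log(2)

print(llr_bits, kl_bits(mu2, s2) - kl_bits(mu1, s1))  # should nearly agree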
One important caveat: when using this statistic for model comparison, you should really use an LLR computed on cross-validated data. The log-likelihood of training data is generally artificially high (favoring the model with more parameters) due to overfitting. That is, the model assigns this data higher probability than it would if it were fit to an infinite set of training data and then evaluated at the points $x_1, \dots, x_N$ in your dataset. So the procedure many people follow is to:
1. train models 1 and 2 using training data;
2. evaluate the log-likelihood ratio on a held-out test dataset, and report the resulting number, in units of bits, as a measure of the improved "code" provided by model 1 compared to model 2.
The LLR evaluated on training data would generally give an unfair advantage to the model with more parameters / degrees of freedom.
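Here is a sketch of that train/test procedure (the Gaussian-vs-Student-$t$ pairing is just a stand-in for whatever your models 1 and 2 happen to be):

```python
import numpy as np
from scipy.stats import norm, t as student_t

rng = np.random.default_rng(1)
data = rng.standard_normal(2_000)
train, test = data[:1_000], data[1_000:]

# Model 1: Gaussian, parameters fit by maximum likelihood on the training set
# (np.std with its default ddof=0 is the ML estimate).
mu_hat, sigma_hat = train.mean(), train.std()

# Model 2: Student-t, all three parameters fit by ML on the same training set.
df_hat, loc_hat, scale_hat = student_t.fit(train)

# Test-set LLR, normalized by N and expressed in bits.
llr_bits = np.mean(
    norm.logpdf(test, mu_hat, sigma_hat)
    - student_t.logpdf(test, df_hat, loc_hat, scale_hat)
) / np.log(2)
print(f"model 1 beats model 2 by {llr_bits:.4f} bits per datapoint")
```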
The logrank test is the score test from a Cox proportional hazards model, so it makes the same assumptions as the Cox model. Among the three commonly used tests (the other two being the Wald and score tests), the LR test is the gold standard, and it is typically more accurate at all sample sizes. One way to see this: even under complete separation, which occurs more often with logistic models than with Cox models, the LR test remains fully accurate, whereas the standard errors used in Wald tests blow up, rendering Wald tests useless when complete separation (i.e., infinite regression coefficient estimates) is in play.
The LR test also provides more accurate confidence intervals, via profile likelihood. If you don't want to have to make approximations at all, Bayesian inference is exact.
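For concreteness, here is a minimal sketch contrasting the LR and Wald tests for a single coefficient. It uses a logistic model via statsmodels (the Cox/logrank case is analogous); the simulated data here are well behaved, but under complete separation the Wald standard error would explode while the LR statistic would remain usable.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(2)
n = 200
x = rng.standard_normal(n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.5 * x))))

X_full = sm.add_constant(x)               # intercept + slope
X_null = np.ones((n, 1))                  # intercept only

fit_full = sm.Logit(y, X_full).fit(disp=0)
fit_null = sm.Logit(y, X_null).fit(disp=0)

# Likelihood-ratio test: 2 * (ll_full - ll_null) ~ chi-squared(1) under H0.
lr_stat = 2 * (fit_full.llf - fit_null.llf)
p_lr = chi2.sf(lr_stat, df=1)

# Wald test for the slope, for comparison; its standard error (and hence the
# test) is what breaks down under complete separation.
z_wald = fit_full.params[1] / fit_full.bse[1]
print(f"LR: stat={lr_stat:.2f}, p={p_lr:.4f};  Wald z={z_wald:.2f}")
```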
Log-likelihood values depend on the $y$ values in your data. They are only meaningful when comparing different fits to the same set of values for the outcome.
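A small illustration of why (my own sketch): merely rescaling the outcome shifts the maximized log-likelihood by a Jacobian term, even though the fit is exactly as good, so log-likelihoods computed on different versions of the outcome are not comparable.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
y = rng.normal(10.0, 2.0, size=500)       # outcome in, say, meters
y_km = y / 1000.0                          # the same outcome in kilometers

def gaussian_ll(v):
    """Maximized Gaussian log-likelihood of v (ML estimates of mean and sd)."""
    return norm.logpdf(v, v.mean(), v.std()).sum()

# Identical fit quality, wildly different log-likelihoods: the density picks
# up a Jacobian factor when the outcome is rescaled.
print(gaussian_ll(y), gaussian_ll(y_km))
```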