First of all, try to be consistent with logs: choose either log or log2, but not both, in your m and a calculations.
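For example, with two intensity vectors x and y, keep both quantities on the same base:

m <- log2(x / y)        # log ratio, base 2
a <- 0.5 * log2(x * y)  # average log intensity, also base 2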
The steps that you have outlined do look correct. predict() will take a model (generated by loess(m ~ a)) and give you the 'corrected' m (mc) for every value of a. You can view the loess line by doing:
set.seed(12345)  # for reproducibility
x <- rnorm(100) + 5
y <- x + 0.6 + rnorm(100) * 0.8
m <- log(x / y)        # log ratio
a <- 0.5 * log(x * y)  # average log intensity
l <- loess(m ~ a)
mc <- predict(l, a)    # the 'corrected' m at every a
plot(a, m, ylim = c(-0.5, 0.5))
# order by a so that lines() draws a connected curve
lines(a[order(a)], mc[order(a)])  # the loess line through the points (a, m)
dev.new()
# if you want your m centered around the loess line
plot(a, m - mc, ylim = c(-0.5, 0.5))
dev.new()
# rescaled values
x2 <- exp(log(x) - mc / 2)
y2 <- exp(log(y) + mc / 2)
m2 <- log(x2 / y2)
a2 <- 0.5 * log(x2 * y2)
plot(a2, m2, ylim = c(-0.5, 0.5))  # same as the second plot
In this case, l$residuals and m - mc are the same values because you are predicting at exactly the points you used for fitting. However, you could do mc2 = predict(l, a2), where a2 might be a superset of a or some other values you want to apply a similar transform to; then m2 - mc2 will be different from l$residuals. This is useful if you only want to use a subset of your data to fit the model but want to adjust all of your data, as in the sketch below.
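A minimal sketch of fitting on a subset and adjusting everything (the index keep is hypothetical, just for illustration):

d <- data.frame(a = a, m = m)
keep <- sample(nrow(d), 50)              # hypothetical subset used for fitting
l_sub <- loess(m ~ a, data = d[keep, ])
mc_all <- predict(l_sub, newdata = d)    # predict at every a, not just the subset
# note: predict.loess() returns NA for a values outside the fitted range
m_adj <- d$m - mc_all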
Also see this loess guide here.
As for the back calculation, I think you just adjust x and y by mc:
x2 <- exp(log(x) - mc / 2)
y2 <- exp(log(y) + mc / 2)
If you are working on gene expression, the affy package has all of this implemented in normalize.loess().
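A rough usage sketch, assuming affy is installed from Bioconductor (intensities must be positive, since it log-transforms the data by default):

library(affy)  # install with BiocManager::install("affy")
mat <- cbind(x, y)                # columns are samples/channels, rows are features
mat_norm <- normalize.loess(mat)  # loess-normalizes the columns against each other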
Generally you don't make the counts comparable by transforming them; instead, you account for the different exposures when computing the expected values in the chi-squared test.
Under a null hypothesis of equal event rates (events per hour), the two periods can simply be combined to estimate the rate ... that is $275+129$ events in $120+48$ hours, so we estimate the rate as $\frac{275+129}{120+48}$ events per hour, and the expected count in period 1 is then $(275+129)\frac{120}{120+48}\approx 288.57$ and in period 2 is $(275+129)\frac{48}{120+48}\approx 115.43$.
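To make that arithmetic concrete, here is the same expected-value calculation in R, using only the numbers above:

obs   <- c(275, 129)                       # observed counts
hours <- c(120, 48)                        # exposure times
expected <- sum(obs) * hours / sum(hours)  # 288.5714 115.4286
sum((obs - expected)^2 / expected)         # chi-squared statistic: 2.2339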
With those expected values, the chi-squared goodness-of-fit statistic, $\sum_i \frac{(O_i-E_i)^2}{E_i}$, is straightforward to calculate by hand; it has $k-1=1$ degree of freedom in this example. However, it's also a pretty standard calculation; for example, here it is in R:
eventcounts = c(275,129)
exposuretime = c(120,48)
chisq.test(eventcounts, p = exposuretime, rescale.p = TRUE)
Chi-squared test for given probabilities
data: eventcounts
X-squared = 2.2339, df = 1, p-value = 0.135
which is the same result as doing it by hand.
I am not aware of an "official" definition, and even if there is one, you shouldn't trust it, as you will see these terms used inconsistently in practice.
This being said, scaling in statistics usually means a linear transformation of the form $f(x) = ax+b$.
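For example, min-max scaling to $[0,1]$ is one such linear map; a quick R sketch:

x <- c(2, 5, 9)
(x - min(x)) / (max(x) - min(x))  # a = 1/(max - min), b = -min/(max - min)
# 0.0000000 0.4285714 1.0000000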
Normalizing can either mean applying a transformation so that your transformed data is roughly normally distributed, or it can simply mean putting different variables on a common scale. Standardizing, which means subtracting the mean and dividing by the standard deviation, is an example of the latter usage. As you may see, it's also an example of scaling. An example of the former would be taking the log of lognormally distributed data.
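In R, both senses are one-liners; a quick sketch:

set.seed(1)
v <- rlnorm(1000)  # lognormally distributed data
z <- scale(v)      # standardize: subtract the mean, divide by the sd
w <- log(v)        # the log of lognormal data is roughly normal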
But what you should take away is that when you read these terms, you should look for a more precise description of what the author actually did. Sometimes you can get it from the context.