glmnet optimizes the following loss function:
$\sum_{i=1}^n (\hat{Y}_i-Y_i)^2 + \lambda\left(\frac{(1-\alpha)}{2}||\beta||_2^2 + \alpha ||\beta||_1 \right)$
The residual sum of squares is on the left, as is typical in regression, and the penalty on the coefficients is on the right. (glmnet itself scales the squared-error term by $\frac{1}{2n}$, which only changes the scale of $\lambda$.) $\alpha$ defaults to 1, which gives the LASSO penalty.
Now, if you don't fit an intercept, the term on the left will be very large (if $E(Y)$ is large). The model will try to compensate, but that requires larger coefficient values to stand in for the missing intercept. It may be the case (and I'm guessing here) that $E(Y)$ is large and one of your variables is fairly constant. In that case, that variable will get a large coefficient (as it helps to reduce the residual sum of squares), while the other variables increase the penalty too much, and hence their coefficients are zero.
Maybe you could supply your own lambda sequence to the function, something like
lambda=10^seq(1,-4,-.5)
If $\lambda$ is small enough, you should get more non-zero coefficients in the model without an intercept as well.
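Why a small enough $\lambda$ helps can be seen from the optimality condition for the objective exactly as written above, taking $\hat{Y}_i = \sum_j x_{ij}\beta_j$ with no intercept. The symbols $x_{ij}$ (value of predictor $j$ for observation $i$) and $r_i$ (residual at the solution) are notation introduced here for illustration. A coefficient can sit at exactly zero only if

$$ \hat\beta_j = 0 \;\Longrightarrow\; \left| 2\sum_{i=1}^n x_{ij}\, r_i \right| \le \lambda\alpha, \qquad r_i = Y_i - \hat{Y}_i $$

As $\lambda$ shrinks, this bound gets harder to satisfy, so fewer coefficients can stay at zero. Software that rescales the squared-error term rescales the threshold in the same way.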
Note: I don't think this problem has anything to do with the fact you're using cv.glmnet. You should see the same thing if you just use glmnet.
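A rough way to check this is to run plain glmnet with and without an intercept on made-up data. Everything in the sketch below is an assumption for illustration: the simulated predictors, the nearly constant column `near_const`, and the coefficients are not the asker's data. The glmnet calls only run if the package is installed.

```r
# Simulated data (an assumption for illustration; not the asker's data).
set.seed(1)
n <- 200
x <- cbind(near_const = rnorm(n, mean = 5, sd = 0.05),  # almost constant column
           matrix(rnorm(n * 4), n, 4))                  # four ordinary predictors
y <- 100 + drop(x[, 2:5] %*% c(1, -1, 0.5, 0)) + rnorm(n)  # E(Y) is large

# The user-supplied sequence suggested above, largest value first.
lambda_seq <- 10^seq(1, -4, -0.5)

if (requireNamespace("glmnet", quietly = TRUE)) {
  fit_no_int <- glmnet::glmnet(x, y, lambda = lambda_seq, intercept = FALSE)
  fit_int    <- glmnet::glmnet(x, y, lambda = lambda_seq, intercept = TRUE)

  # df is the number of non-zero coefficients at each lambda that was fitted.
  print(cbind(lambda = fit_no_int$lambda, nonzero = fit_no_int$df))
  print(cbind(lambda = fit_int$lambda,    nonzero = fit_int$df))
}
```

Comparing the two printed tables shows how many coefficients are non-zero at each $\lambda$ in either fit; `coef(fit_no_int)` shows which variable absorbs the role of the intercept.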
There seems to be one single intercept 49.80842, whereas it would make sense to have two different intercepts
No, it usually wouldn't make sense to have two intercepts; that only makes sense when you have a factor with two levels (and even then only if you regard the relationship holding factor levels constant).
The population intercept, strictly speaking, is $E(Y)$ for the population model when all the predictors are 0, and the estimate of it is whatever our fitted value is when all the predictors are zero.
In that sense - whether we have factor variables or numerical variables - there's only one intercept for the whole equation.
Unless, that is, you're considering different parts of the model as separate equations.
Imagine that we had one factor with three levels, and one continuous variable - for now without the interaction:
For the equation as a whole, there's one intercept, but if you think of it as a different relationship within each subgroup (level of the factor), there's three, one for each level of the factor ($a$) -- by considering a specific value of $a$, we get a specific straight line that is shifted by the effect of $a$, giving a different intercept for each group.
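A minimal sketch of that setup in R, using simulated data. The names `a`, `x`, `y`, the group shifts in `mu` and the slope 0.8 are all made up for illustration.

```r
set.seed(42)
a  <- factor(rep(c("1", "2", "3"), each = 30))  # hypothetical three-level factor
x  <- runif(90, 0, 10)                          # hypothetical continuous predictor
mu <- c(2, 3.5, 7)                              # unequally spaced group shifts (made up)
y  <- mu[as.integer(a)] + 0.8 * x + rnorm(90, sd = 0.5)

fit <- lm(y ~ a + x)   # main effects only, no interaction
b   <- coef(fit)

# One intercept for the equation as a whole: baseline level of a, at x = 0.
b["(Intercept)"]

# Three intercepts if each level of a is read as its own straight line in x.
group_intercepts <- b["(Intercept)"] + c(0, b["a2"], b["a3"])
names(group_intercepts) <- levels(a)
group_intercepts
```

The fitted object reports a single `(Intercept)`; the three group-specific values only appear once the factor effects are added to it.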
But now let's consider the relationship with $a$. Now for each level of $a$, if $x$ had no impact, there'd be a very simple relationship $E(Y|a=j)=\mu_j$. There's one intercept, the baseline mean (or if you conceive it that way, three, one for each subgroup -- where the intercept would be the average value in that subgroup).
(nb It may be hard to see here but the means are not equally spaced; don't be tempted by this plot to think of $y$ as linear in $a$ considered as a numeric variable.)
But now if we consider $x$ does have an impact and look at the relationship at a specific value of $x$ ($x=x_0$), as a function of $a$, $E(Y|a=j)=\mu_j(x_0)$ -- each group has a different mean, but those means are shifted by the effect of $x$ at $x_0$.
So that would be one intercept (the black dot if it's the baseline group) ... at each value of $x$.
For each of the infinite number of different values that $x$ might take, there's a new intercept.
So depending on how we look at it, there's one intercept, or three, or an infinite number... but not two.
Now if we introduce an $x:a$ interaction, nothing changes but the slopes! We still can conceive of this as having one intercept, or perhaps three, or perhaps an infinite number.
So how does this all relate to two numeric variables?
Even though we didn't have it in this case, imagine that the levels of $a$ were numeric and that the fitted model was linear in $a$ (perhaps $a$ is discrete, like the number of phones owned collectively by a household). [i.e. we're now doing what I said earlier not to do, taking $a$ to be numeric and (conditionally) linearly related to $y$]
Then we'd still have one intercept in the strict sense, the value taken by the model when $x=a=0$ (even though neither variable is 0 in our sample), or one for each possible value taken by $a$ (in our sample, three different values occurred, but maybe 0, 4, 5 ... are also possible), or one for each value taken by $x$ (an infinity of possible values since $x$ is continuous). It doesn't matter if our model has an interaction; that doesn't change how we count intercepts.
So how do we interpret the interaction term when both variables are numeric?
You can consider it as providing for a different slope in the relationship between $y$ and $x$, at each $a$ (three different slopes in all, one for the baseline and two more via interaction), or you can consider it as providing for a different slope between $y$ and (the now-numeric) $a$ at each value of $x$.
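Writing the model out makes the two readings explicit. The coefficient symbols $\beta_0,\dots,\beta_3$ are notation introduced here for illustration, with $a$ treated as numeric:

$$
\begin{aligned}
E(Y\mid x,a) &= \beta_0+\beta_1 x+\beta_2 a+\beta_3\,x a \\
&= (\beta_0+\beta_2 a)+(\beta_1+\beta_3 a)\,x \\
&= (\beta_0+\beta_1 x)+(\beta_2+\beta_3 x)\,a
\end{aligned}
$$

Read the first way, the intercept and slope of the $y$-on-$x$ line both depend on $a$; read the second way, the intercept and slope of the $y$-on-$a$ line both depend on $x$.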
Now if we replace this now numeric but discrete $a$ with a continuous variate, you'd have an infinite number of slopes for both one-on-one relationships, one at each value of the third variable.
You effectively say as much in your question of course.
are we constrained to expressing this with absurd scenarios, such as if we had cars with 1hp we would have a modified slope for the weight equal to (−8.21662+0.02785)∗1∗weight? Or is there a more sensible way to look at this term?
Sure there is: consider values closer to the mean. So for a typical relationship between mpg and wt, hold horsepower at some value near the mean. To see how much the slope changes, consider two values of horsepower, one below the mean and one above it.
Where the variable-values aren't especially meaningful in themselves (like score on some Likert-scale-based instrument say) you might go up or down by a standard deviation on the third variable, or pick the lower and upper quartile.
Where they are meaningful (like hp) you can pick two more or less typical values (100 and 200 seem like sensible choices for hp for the mtcars data, and if you also want to look at something near the mean, 150 will serve quite well, but you might choose a typical value for a particular kind of car for each choice instead).
So you could draw a fitted mpg-vs-wt line for a 100 hp car, a 150 hp car and a 200 hp car. You could also draw an mpg-vs-hp line for a car that weighs 2.0 (that's 2.0 thousand pounds) and one that weighs 4.0 (or 2.5 and 3.5 if you want something nearer to the quartiles).
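The lines described above can be sketched in R as follows. This assumes the model in the question was fitted as `lm(mpg ~ wt * hp, data = mtcars)`; the exact formula is a guess from the quoted coefficients, and the object names are made up.

```r
fit <- lm(mpg ~ wt * hp, data = mtcars)   # mtcars ships with base R
b   <- coef(fit)

hp_values <- c(100, 150, 200)             # the "typical" horsepower values

# mpg-vs-wt line at each fixed hp: intercept and slope both depend on hp.
lines_by_hp <- data.frame(
  hp        = hp_values,
  intercept = b["(Intercept)"] + b["hp"] * hp_values,
  wt_slope  = b["wt"] + b["wt:hp"] * hp_values
)
lines_by_hp

# mpg-vs-hp line at fixed weights (wt is recorded in units of 1000 lb).
wt_values <- c(2, 4)
lines_by_wt <- data.frame(
  wt        = wt_values,
  intercept = b["(Intercept)"] + b["wt"] * wt_values,
  hp_slope  = b["hp"] + b["wt:hp"] * wt_values
)
lines_by_wt

# Draw the three mpg-vs-wt lines over the data.
plot(mpg ~ wt, data = mtcars)
for (i in seq_along(hp_values)) {
  abline(lines_by_hp$intercept[i], lines_by_hp$wt_slope[i], lty = i)
}
```

The change in the wt slope between the 100 hp and 200 hp lines is 100 times the interaction coefficient, which is a more readable summary of the interaction than the slope for a 1 hp car.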
After some research I found out that the difference between R computed residuals given by
wnF$residuals
and the residuals I had computed externally in an Excel file came from the lack of precision in the data passed to Excel. At first, I had passed the data to Excel with only 7 decimal places. After repeating the procedure with 15 decimal places, the difference almost disappeared.
Also, as Richard Hardy commented, the model I was fitting was not correct. The correct model is:
$$ x_t-\mu=\psi_1(x_{t-1}-\mu)+\omega_t $$
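Both points can be checked with a small R sketch. The object `wnF` is not reproduced here; as a stand-in, this assumes an AR(1) fitted with base R `arima()` to simulated data, and the helper `ar1_resid` is a hypothetical name. Note that `arima()` labels the estimate of $\mu$ as `intercept`.

```r
set.seed(123)
x   <- arima.sim(model = list(ar = 0.6), n = 300) + 10  # simulated AR(1) around mean 10
fit <- arima(x, order = c(1, 0, 0))                     # stand-in for the fitted model

psi1 <- unname(coef(fit)["ar1"])
mu   <- unname(coef(fit)["intercept"])   # this is the mean, not a regression intercept

# Residuals from the mean-centred form: (x_t - mu) - psi1 * (x_{t-1} - mu), t >= 2.
ar1_resid <- function(v, mu, psi1) {
  v <- as.numeric(v)
  n <- length(v)
  (v[-1] - mu) - psi1 * (v[-n] - mu)
}

r_fit <- as.numeric(residuals(fit))[-1]  # drop t = 1, which has no lagged value

# Largest discrepancy using full-precision data versus data rounded to 7 decimals.
d_full <- max(abs(ar1_resid(x, mu, psi1) - r_fit))
d_7    <- max(abs(ar1_resid(round(x, 7), mu, psi1) - r_fit))
c(full_precision = d_full, seven_decimals = d_7)
```

With full-precision data the hand-computed residuals should agree with the fitted ones to about machine precision, while rounding the series to 7 decimal places leaves a small but visible gap.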