The conceptual uses of "square" and "squared" are subtly different, although (almost) interchangeable:
"Squared" refers to the past action of taking or computing the second power. E.g., $x^2$ is usually read as "x-squared," not "x-square." (The latter is sometimes encountered but I suspect it results from speakers who are accustomed to clipping their phrases or who just haven't heard the terminal dental in "x-squared.")
"Square" refers to the result of taking the second power. E.g., $x^2$ can be referred to as the "square of x." (The illocution "squared of x" is never used.)
These suggest that a person using a phrase like "mean squared error" is thinking in terms of a computation: take the errors, square them, average those. The phrase "mean square error" has a more conceptual feel to it: average the square errors. The user of this phrase may be thinking in terms of square errors rather than the errors themselves. I suspect this shows up especially in theoretical literature, where the second form, "square," appears more often (though I haven't systematically checked).
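The computational reading spells itself out directly in code. A trivial illustration in R (the data here is made up for the example):

```r
# MSE as a computation: take the errors, square them, average those.
y    <- c(1.0, 2.0, 3.0)   # observed values (illustrative)
yhat <- c(1.1, 1.9, 3.2)   # predicted values (illustrative)

errors <- y - yhat
mse <- mean(errors^2)      # the mean of the square errors
```

Either name, of course, denotes the same quantity.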
Obviously both are equivalent in function and safely interchangeable in practice. It is interesting, though, that some careful Google queries give substantially different hit counts. Presently,
"mean squared" -square -root -Einstein -Relativity
returns about 367,000 results (notice the necessity of ruling out the phrase "$E = mc^2$" popularly quoted in certain contexts, which demands the use of "squared" instead of "square" when written out), while
"mean square" -squared -root -Einstein -Relativity
(maintaining analogous exclusions for comparability) returns an order of magnitude more, at 3.47 million results. This (weakly) suggests people favor "mean square" over "mean squared," but don't take this too much to heart: "mean squared" is used in official SAS documentation, for instance.
Say your predictor matrix is $X$ and your response vector is $y$. PCA is concerned only with the (co)variance within the predictor matrix $X$ itself, while a regression model is (also) concerned with the covariance between $X$ and the response $y$. If there is no relationship between these concepts, dimension reduction by PCA can be harmful to your regression by screening out those predictors in $X$ that are correlated with the response.
Here is a simple example, in R, as I don't have access to matlab. Suppose I create some random Gaussian data
x_1 <- rnorm(10000, mean = 0, sd = 1)
x_2 <- rnorm(10000, mean = 0, sd = .1)
X <- cbind(x_1, x_2)
and set up a situation where the response is correlated with only the smaller-variance component
y <- x_2 + 1
The principal components of $X$ are just $x_1$ and $x_2$, given the way I set it up
> cor(X)
x_1 x_2
x_1 1.0000000000 0.0004543833
x_2 0.0004543833 1.0000000000
If PCA is used to select one component, we get $x_1$, as this has the highest variance. Regressing $y$ on $x_1$ is useless
> lm(y ~ x_1)
Call:
lm(formula = y ~ x_1)
Coefficients:
(Intercept) x_1
1.001e+00 4.544e-05
The coefficient of $x_1$ here is essentially zero; the model has no more predictive power than an intercept-only model. On the other hand, if I select the lower variance component
> lm(y ~ x_2)
Call:
lm(formula = y ~ x_2)
Coefficients:
(Intercept) x_2
1 1
I get back a much more predictive model.
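The selection step can be made explicit with base R's prcomp. Since $x_1$ and $x_2$ are nearly uncorrelated, the loadings come out as (approximately) the coordinate axes, and the scores on the first component predict nothing. A self-contained sketch (the data is regenerated here with a seed, so the exact numbers differ slightly from the run above):

```r
set.seed(1)  # regenerate data like the above so the block runs standalone
x_1 <- rnorm(10000, mean = 0, sd = 1)
x_2 <- rnorm(10000, mean = 0, sd = .1)
X <- cbind(x_1, x_2)
y <- x_2 + 1

pc <- prcomp(X)  # centers the columns by default

# The first component is (up to sign) x_1, the high-variance direction.
# Regressing y on its scores is useless; the second component is what matters.
r2_pc1 <- summary(lm(y ~ pc$x[, 1]))$r.squared  # essentially 0
r2_pc2 <- summary(lm(y ~ pc$x[, 2]))$r.squared  # essentially 1
```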
Because PCA ignores the relationship of $y$ to $X$, there is no reason to believe that its dimension reduction is reasonable in the context of your regression problem.
On the other hand, using the relationship between $y$ and $X$ to do pre-regression dimension reduction and variable selection is a very good way to overfit your model to your training data. In some ways, PCA's ignorance of the relationship between $X$ and $y$ is a blessing, because, while it can be harmful in the way outlined above, it cannot overfit the relationship in your training data in the same way that peeking at $y$ can.
As for your more practical question, the matlab documentation says
coeff = pca(X)
returns the principal component coefficients, also known as loadings, for the n-by-p data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is p-by-p. Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance. By default, pca centers the data and uses the singular value decomposition (SVD) algorithm.
This says to me that to do PCA dimension reduction in matlab, you need to:
- Center the columns of your $X$ matrix.
- Select the first $N$ columns of the coeff matrix, where $N$ is the number of non-intercept regressors you want in your model.
- Create a new data matrix as center(X) * coeff(:, 1:N).
- Use the columns of the new matrix as regressors in your dimension-reduced regression.
Again, I am far from fluent in matlab, so the syntactic details of how to perform these steps are unknown to me.
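For what it's worth, here are the same steps in R, where I can check the syntax. prcomp's rotation matrix plays the role of matlab's coeff; the data and variable names are made up for illustration:

```r
set.seed(2)
X <- cbind(rnorm(100), rnorm(100, sd = 2), rnorm(100, sd = 0.5))
y <- rnorm(100)
N <- 2                                           # non-intercept regressors to keep

Xc  <- scale(X, center = TRUE, scale = FALSE)    # 1. center the columns of X
rot <- prcomp(Xc)$rotation                       # 2. loadings (matlab's coeff)
Z   <- Xc %*% rot[, 1:N]                         # 3. project onto first N components
fit <- lm(y ~ Z)                                 # 4. dimension-reduced regression
```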
Best Answer
I might be telling you something you already know, but keep in mind that really
$\hat{f}(x)=\hat{f}(x,\{X_k\})$,
where $\{X_k\}$ is the set of sample points over which you build your estimate. For most non-parametric estimators, the $X_k$ are assumed independent, and the method is additive, so you can just look at the $MSE$ of $\hat{f}(x,X_k)$ and then take an average.
Then your formula is interpreted as
$MSE(\hat f(x)) = E[(\hat f(x)-f(x))^{2}]=\int_\Omega(\hat{f}(x,z)-f(x))^2f(z)dz$,
which, yes, is the MSE at the point $x$.
As for the $MSEP$, I'm not entirely sure what your question is, but there are surely various ways to predict this. If you want to know the error expected at a fixed $x^*$, then I guess it does line up with the $MSE$. However, you might instead want the prediction error for a random $X^*$; if you assume it is drawn from the density $f$, then you would want the $MISE$.
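The pointwise interpretation is easy to check by Monte Carlo: redraw the sample $\{X_k\}$ many times, recompute $\hat f$ at the fixed point $x$, and average the squared errors. A sketch using a hand-rolled Gaussian kernel density estimate of a standard normal (the sample size, bandwidth rule, and replication count are all just illustrative choices):

```r
set.seed(3)
f_true <- dnorm                       # true density: standard normal
x0 <- 0                               # fixed evaluation point x
n <- 200                              # sample size per draw
reps <- 500                           # Monte Carlo replications

est <- replicate(reps, {
  z <- rnorm(n)                       # a fresh sample {X_k}
  h <- bw.nrd0(z)                     # rule-of-thumb bandwidth
  mean(dnorm((x0 - z) / h)) / h       # kernel density estimate at x0
})

mse_x0 <- mean((est - f_true(x0))^2)  # Monte Carlo approximation of MSE at x0
```

Averaging this quantity over $x \sim f$ instead of fixing $x_0$ is what gives the $MISE$.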
Hope that helps clarify something.