Solved – Data matrix, predictor matrix, observation matrix, model matrix, and design matrix. What do they mean?

matrix, modeling, regression, terminology

Is there a clear distinction between these terms? To the best of my knowledge:

Suppose we have $N$ observations and $p$ predictors.

Predictor matrix $\in \mathbb{R}^{N\times p}$ is synonymous with observation matrix and data matrix: each contains the raw, untreated data. Design matrix refers to the same concept in the context of a designed experiment.
Model matrix is the result of applying some basis expansion$^*$ to the predictor matrix.

However, according to Wikipedia, design matrix and model matrix are synonymous:

In statistics, a design matrix, also known as regressor matrix or model matrix or data matrix, is…

Furthermore, MathWorks offers a function to

Convert predictor matrix to design matrix

$^*$ See Elements of Statistical Learning, chapter 5, and this question.

Best Answer

I wouldn't get caught up in the terms. Just know they are referring to your data. Every discipline (engineering, CS, statistics) has different terms for the same thing.

To go into the detail, though: if your data are all numerical (no categorical variables), then the model matrix equals the design matrix, because there are no categorical values to expand (no contrasts). A design matrix will most likely contain categorical variables such as gender, race, or some other binary/categorical status. Columns holding such categorical values need to be dummy coded (one-hot coded, for example) to be numerically meaningful. Then, depending on your contrasts settings, you may see k-1 indicator columns for a variable with k categories.

Examples of these settings are given in R's documentation for contrasts.

Depending on your settings, you may see the following:

> warpbreaks <- warpbreaks[order(runif(nrow(warpbreaks))), ]  # random shuffle of the rows
> head(model.matrix(breaks ~ wool, data = warpbreaks))  # with intercept: k-1 indicator columns
     (Intercept) woolB
30           1     1
39           1     1
32           1     1
16           1     0
6            1     0
7            1     0
> head(model.matrix(breaks ~ wool - 1, data = warpbreaks))
     woolA woolB
30     0     1
39     0     1
32     0     1
16     1     0
6      1     0
7      1     0

Python's patsy also has similar settings.
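To make the two codings above concrete without relying on any particular library, here is a minimal pure-Python sketch. It is not patsy and not R's implementation; the function `dummy_code` and its `prefix` and `intercept` arguments are hypothetical names introduced only for illustration. It assumes a single categorical predictor and takes the first level in sorted order as the reference level.

```python
def dummy_code(values, prefix, intercept=True):
    """Return (column_names, rows) for one categorical predictor.

    intercept=True : intercept column plus k-1 indicator columns, with the
                     first sorted level as reference (treatment-style coding).
    intercept=False: all k indicator columns (one-hot coding).
    """
    levels = sorted(set(values))
    kept = levels[1:] if intercept else levels
    names = (["(Intercept)"] if intercept else []) + [prefix + lv for lv in kept]
    rows = []
    for v in values:
        indicators = [1 if v == lv else 0 for lv in kept]
        rows.append(([1] if intercept else []) + indicators)
    return names, rows


# Wool types of the six rows printed in the R output above.
wool = ["B", "B", "B", "A", "A", "A"]

names_tc, rows_tc = dummy_code(wool, "wool", intercept=True)
names_oh, rows_oh = dummy_code(wool, "wool", intercept=False)

print(names_tc, rows_tc)
print(names_oh, rows_oh)
```

The first call reproduces the `(Intercept)`/`woolB` layout and the second the `woolA`/`woolB` layout for these six rows.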

State spaces

A common statistical use of "the distribution," as in "the Normal distribution with PDF proportional to $\exp\left(-\frac{1}{2}\left((x-\mu)/\sigma\right)^2\right)dx$," is actually a (serious) abuse of English, because obviously this is not one distribution: it's a whole family of distributions parameterized by the symbols $\mu$ and $\sigma$. A standard notation for this is the "state space" $\Omega$, a set of distributions. (I am simplifying a bit here for the sake of exposition and will continue to simplify as we go along, while remaining as rigorous as possible.) Its role is to delineate the possible targets of our statistical procedures: when we estimate something, we are picking out one (or sometimes more) elements of $\Omega$.

Sometimes state spaces are explicitly parameterized, as in $\Omega = \{\mathcal{N}(\mu, \sigma^2)|\mu \in \mathbb{R}, \sigma \gt 0\}$. In this description there is a one-to-one correspondence between the set of tuples $\{(\mu,\sigma)\}$ in the upper half plane and the set of distributions we will be using to model our data. One value of such a parameterization is that we may now refer concretely to distributions in $\Omega$ by means of an ordered pair of real numbers.
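As a small illustration of "labeling distributions by ordered pairs," here is a sketch using the Python standard library's `statistics.NormalDist`. The helper name `normal_family` is hypothetical; it just maps a pair $(\mu, \sigma)$ with $\sigma > 0$ to the element of $\Omega$ that the pair labels.

```python
from statistics import NormalDist


def normal_family(mu, sigma):
    """Map a parameter pair (mu, sigma), sigma > 0, to the Normal distribution it labels."""
    if sigma <= 0:
        raise ValueError("sigma must be positive (upper half plane)")
    return NormalDist(mu, sigma)


f = normal_family(0.0, 1.0)
g = normal_family(0.0, 2.0)

# Different labels pick out different distributions ...
print(f == g)
# ... and the same label always picks out the same distribution.
print(f == normal_family(0.0, 1.0))
# Once labeled, the distribution can be used concretely, e.g. via its CDF.
print(f.cdf(1.0), g.cdf(1.0))
```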

In other cases state spaces are not explicitly parameterized. An example would be the set of all unimodal continuous distributions. Below, we will address the question of whether an adequate parameterization can be found in such cases anyway.

Parameterizations

Generally, a parameterization of $\Omega$ is a correspondence (mathematical function) from a subset of $\mathbb{R}^d$ (with $d$ finite) to $\Omega$. That is, it uses ordered $d$-tuples of real numbers to label the distributions. But it's not just any correspondence: it has to be "well behaved." To understand this, consider the set of all continuous distributions whose PDFs have finite expectations. This would widely be regarded as "non-parametric" in the sense that any "natural" attempt to parameterize this set would involve a countable sequence of real numbers (using an expansion in any orthogonal basis). Nevertheless, because this set has cardinality $2^{\aleph_0}$, which is the cardinality of the reals, there must exist some one-to-one correspondence between these distributions and $\mathbb{R}$. Paradoxically, that would seem to make this a parameterized state space with a single real parameter!

The paradox is resolved by noting that a single real number cannot enjoy a "nice" relationship with the distributions: when we change the value of that number, the distribution it corresponds to must in some cases change in radical ways. We rule out such "pathological" parameterizations by requiring that distributions corresponding to close values of their parameters must themselves be "close" to one another. Discussing suitable definitions of "close" would take us too far afield, but I hope this description is enough to demonstrate that there is much more to being a parameter than just naming a particular distribution.

Properties of distributions

Through repeated application, we become accustomed to thinking of a "property" of a distribution as some intelligible quantity that frequently appears in our work, such as its expectation, variance, and so on. The problem with this as a possible definition of "property" is that it's too vague and not sufficiently general. (This is where mathematics was in the mid-18th century, when "functions" were thought of as finite processes applied to objects.) Instead, about the only sensible definition of "property" that will always work is to think of a property as being a number that is uniquely assigned to every distribution in $\Omega$. This includes the mean, the variance, any moment, any algebraic combination of moments, any quantile, and plenty more, including things that cannot even be computed. However, it does not include things that would make no sense for some of the elements of $\Omega$. For instance, if $\Omega$ consists of all Student $t$ distributions, then the mean is not a valid property for $\Omega$ (because the Student $t$ distribution with one degree of freedom has no mean). This impresses on us once again how much our ideas depend on what $\Omega$ really consists of.

Properties are not always parameters

A property can be such a complicated function that it would not serve as a parameter. Consider the case of the "Normal distribution." We might want to know whether the true distribution's mean, when rounded to the nearest integer, is even. That's a property. But it will not serve as a parameter.
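The paragraph above does not spell out why this property fails as a parameter. One way to see it (my reading, offered as a sketch) is that the property takes only two values, so it cannot distinguish most distributions from one another, and it jumps abruptly between nearby means. The function name `rounded_mean_is_even` is hypothetical, and rounding half up is an assumption of the sketch.

```python
import math


def rounded_mean_is_even(mu):
    """Property of a Normal distribution with mean mu: is the rounded mean even?

    Rounds half up via floor(mu + 0.5); the rounding rule is an assumption.
    """
    return math.floor(mu + 0.5) % 2 == 0


# Very different means share the same value, so the value cannot label a distribution.
far_apart = (rounded_mean_is_even(0.0), rounded_mean_is_even(2.0))
# Nearly identical means get different values, so the value is not "well behaved."
close_by = (rounded_mean_is_even(0.49), rounded_mean_is_even(0.51))

print(far_apart)
print(close_by)
```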

Parameters are not necessarily properties

When parameters and distributions are in one-to-one correspondence then obviously any parameter, and any function of the parameters for that matter, is a property according to our definition. But there need not be a one-to-one correspondence between parameters and distributions: sometimes a few distributions must be described by two or more distinctly different values of the parameters. For instance, a location parameter for points on the sphere would naturally use latitude and longitude. That's fine--except at the two poles, which correspond to a given latitude and any valid longitude. The location (point on the sphere) indeed is a property but its longitude is not necessarily a property. Although there are various dodges (just declare the longitude of a pole to be zero, for instance), this issue highlights the important conceptual difference between a property (which is uniquely associated with a distribution) and a parameter (which is a way of labeling the distribution and might not be unique).
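The latitude/longitude example can be checked numerically. The sketch below converts a (latitude, longitude) label to a point on the unit sphere; the function names `sphere_point` and `same_point` and the tolerance are hypothetical choices made for illustration.

```python
import math


def sphere_point(lat_deg, lon_deg):
    """Convert a (latitude, longitude) label in degrees to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def same_point(p, q, tol=1e-12):
    """Compare two points coordinate by coordinate, up to floating-point tolerance."""
    return all(abs(a - b) < tol for a, b in zip(p, q))


north_a = sphere_point(90.0, 0.0)
north_b = sphere_point(90.0, 135.0)
equator_a = sphere_point(0.0, 0.0)
equator_b = sphere_point(0.0, 135.0)

print(same_point(north_a, north_b))      # at a pole, different longitudes label one point
print(same_point(equator_a, equator_b))  # elsewhere, different longitudes label different points
```

At the pole two distinct parameter values describe the same location, which is exactly why the longitude fails to be a property there.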

Statistical procedures

The target of an estimate is called an estimand. It is merely a property. The statistician is not free to select the estimand: that is the province of her client. When someone comes to you with a sample of a population and asks you to estimate the population's 99th percentile, you would likely be remiss in supplying an estimator of the mean instead! Your job, as statistician, is to identify a good procedure for estimating the estimand you have been given. (Sometimes your job is to persuade your client that they have selected the wrong estimand for their scientific objectives, but that's a different issue...)

By definition, a procedure is a way to get a number out of the data. Procedures are usually given as formulas to be applied to the data, like "add them all up and divide by their count." Literally any procedure may be pronounced an "estimator" of a given estimand. For instance, I could declare that the sample mean (a formula applied to the data) estimates the population variance (a property of the population, assuming our client has restricted the set of possible populations $\Omega$ to include only those that actually have variances).

Estimators

An estimator needn't have any obvious connection to the estimand. For instance, do you see any connection between the sample mean and a population variance? Neither do I. But nevertheless, the sample mean actually is a decent estimator of the population variance for certain $\Omega$ (such as the set of all Poisson distributions). Herein lies one key to understanding estimators: their qualities depend on the set of possible states $\Omega$. But that's only part of it.
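The Poisson claim is easy to check by simulation, because a Poisson distribution's mean and variance are equal. The sketch below is illustrative only: the rate `lam = 4.0`, the sample size, and the seed are arbitrary choices, and `poisson_draw` is a hypothetical helper implementing Knuth's multiplication method with the standard library.

```python
import math
import random
import statistics


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


rng = random.Random(12345)  # fixed seed so the sketch is reproducible
lam = 4.0                   # for a Poisson distribution, mean = variance = lam
sample = [poisson_draw(lam, rng) for _ in range(20000)]

sample_mean = statistics.mean(sample)
sample_var = statistics.variance(sample)

print("population variance:", lam)
print("sample mean:        ", round(sample_mean, 3))
print("sample variance:    ", round(sample_var, 3))
```

Both statistics land near the population variance here, but only because $\Omega$ was restricted to Poisson distributions; for a Normal family the sample mean would say nothing about the variance.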

A competent statistician will want to know how well the procedure she is recommending will actually perform. Let's call the procedure "$t$" and let the estimand be $\theta$. Not knowing which distribution actually is the true one, she will contemplate the procedure's performance for every possible distribution $F \in \Omega$. Given such an $F$, and given any possible outcome $s$ (that is, a set of data), she will compare $t(s)$ (what her procedure estimates) to $\theta(F)$ (the value of the estimand for $F$). It is her client's responsibility to tell her how to measure how close or far apart those two are. (This is often done with a "loss" function.) She can then contemplate the expectation of the distance between $t(s)$ and $\theta(F)$. This is the risk of her procedure. Because it depends on $F$, the risk is a function defined on $\Omega$.
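In symbols, the risk just described can be sketched as follows. The names $L$ for the loss function and $R_t$ for the risk of $t$ are my own notation, introduced only for this illustration:

$$R_t(F) = \operatorname{E}_{s \sim F}\left[L\big(t(s), \theta(F)\big)\right], \qquad F \in \Omega.$$

With squared-error loss, $L(a, b) = (a - b)^2$, this is the mean squared error of $t$ when $F$ is the true distribution.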

(Good) statisticians recommend procedures based on comparing risk. For instance, suppose that for every $F \in \Omega$, the risk of procedure $t_1$ is less than or equal to the risk of $t$, and that it is strictly less for at least one $F$. Then there is no reason ever to use $t$: it is "inadmissible." If no such $t_1$ exists, $t$ is "admissible."
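A Monte Carlo sketch of such a comparison follows. Everything specific in it is an assumption made for illustration: $\Omega$ is taken to be a Normal family, the estimand is the mean, the loss is squared error, $t_1$ is the sample mean, and $t$ uses only the first observation. The helper `mc_risk` is hypothetical, and checking three states approximately is a demonstration, not a proof of dominance.

```python
import random
import statistics


def t1(sample):
    """Procedure t_1: the sample mean."""
    return statistics.mean(sample)


def t(sample):
    """Procedure t: use only the first observation."""
    return sample[0]


def mc_risk(estimator, mu, sigma, n=10, reps=5000, seed=0):
    """Approximate the squared-error risk of `estimator` for the mean of Normal(mu, sigma^2)."""
    rng = random.Random(seed)
    losses = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        losses.append((estimator(sample) - mu) ** 2)
    return statistics.mean(losses)


# A few states F = (mu, sigma) in Omega; the risk is a function of F.
states = [(-2.0, 1.0), (0.0, 1.0), (3.0, 2.0)]
risks = {F: (mc_risk(t1, *F), mc_risk(t, *F)) for F in states}

for F, (risk_t1, risk_t) in risks.items():
    print(F, "risk of t_1:", round(risk_t1, 3), "risk of t:", round(risk_t, 3))
```

In each state the approximate risk of `t1` is well below that of `t`, consistent with the exact values $\sigma^2/n$ and $\sigma^2$ for these two procedures.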

(A "Bayesian" statistician will always compare risks by averaging over a "prior" distribution of possible states (usually supplied by the client). A "Frequentist" statistician might do this, if such a prior justifiably exists, but is also willing to compare risks in other ways Bayesians eschew.)

Conclusions

We have a right to say that any $t$ that is admissible for $\theta$ is an estimator of $\theta$. We must, for practical purposes (because admissible procedures can be hard to find), bend this to saying that any $t$ that has acceptably small risk (when being compared to $\theta$) among practicable procedures is an estimator of $\theta$. "Acceptably" and "practicable" are determined by the client, of course: "acceptably" refers to their risk and "practicable" reflects the cost (ultimately paid by them) of implementing the procedure.

Underlying this concise definition are all the ideas just discussed: to understand it we must have in mind a specific $\Omega$ (which is a model of the problem, process, or population under study), a definite estimand (supplied by the client), a specific loss function (which quantitatively connects $t$ to the estimand and is also given by the client), the idea of risk (computed by the statistician), some procedure for comparing risk functions (the responsibility of the statistician in consultation with the client), and a sense of what procedures actually can be carried out (the "practicability" issue), even though none of these are explicitly mentioned in the definition.

Solved – Fixed Regressor Conspiracy and Connection to Exchangeability

A regression model gives predictions of the response conditional on predictor values; so there's no problem in applying a model fitted to one set of predictor values fixed by design to another set of predictor values, even if the latter are randomly sampled from a population. With an experimental design matrix $X$, the expectation & variance of the predicted response $\hat y$ for a (new) predictor vector $x$ are given by $$\operatorname{E}(\hat y \,|\, x) = x^\mathrm{T}\beta$$ $$\operatorname{Var}(\hat y\,|\,x)=\sigma^2\left(1+x^\mathrm{T}(X^\mathrm{T}X)^{-1}x\right)$$ where $\beta$ is the coefficient vector & $\sigma^2$ is the error variance. So the particular predictor values used for the fit don't affect the expectation of predictions, but do affect the variation in their precision throughout predictor space. Note that any aggregate fit metrics, say root mean square error of predictions, don't carry over from the experiment to the new sample.
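To see how the design affects precision, the variance expression above can be evaluated for a toy design. This is a minimal sketch under stated assumptions: a straight-line model with an intercept (rows of $X$ are $(1, x_i)$), a made-up five-point design, and $\sigma^2 = 1$. The helper names `xtx_inverse` and `prediction_variance` are hypothetical.

```python
def xtx_inverse(xs):
    """(X^T X)^{-1} for a design matrix whose rows are (1, x_i)."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx
    return [[sxx / det, -sx / det],
            [-sx / det, n / det]]


def prediction_variance(x_new, xs, sigma2):
    """Evaluate sigma^2 * (1 + x^T (X^T X)^{-1} x) with x = (1, x_new)."""
    inv = xtx_inverse(xs)
    v = (1.0, x_new)
    quad = sum(v[i] * inv[i][j] * v[j] for i in range(2) for j in range(2))
    return sigma2 * (1.0 + quad)


design = [0.0, 1.0, 2.0, 3.0, 4.0]  # predictor values fixed by the experimenter
sigma2 = 1.0                        # assumed error variance

var_center = prediction_variance(2.0, design, sigma2)   # at the center of the design
var_far = prediction_variance(10.0, design, sigma2)     # far outside the design

print(var_center, var_far)
```

The variance is smallest at the center of the design and grows quickly under extrapolation, which is the sense in which the design shapes precision across predictor space.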
The above discussion assumes the model is right: in practice there will be extra-statistical considerations when applying it. You need to think about e.g. variation of effects that weren't investigated in the original experiment, the reliability of extrapolation into new regions of predictor space, selection bias in the population, & whether experimental manipulation is comparable to a natural cause. An engineer might model resistivity as a linear function of temperature from experimental data & be confident in applying the model to a particular collection of resistors used in a circuit board. The medical researcher in your example might assert that the medicine reduces blood cholesterol level, & confidently predict the results of further experiments; but would be unlikely to claim that, in a random sample from, say, all hospital admissions, those patients taking the medicine would have lower cholesterol levels than those who weren't.