I take it that the question is about LDA and linear (not logistic) regression.
There is a considerable and meaningful relation between linear regression and linear discriminant analysis. When the dependent variable (DV) consists of just 2 groups, the two analyses are actually identical: although the computations are different and the results - regression and discriminant coefficients - are not the same, they are exactly proportional to each other.
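To see this proportionality numerically, here is a minimal sketch (my own illustration, using synthetic two-group data and scikit-learn; I compare the two coefficient vectors via their cosine, since the claim is proportionality, not equality):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                                  # 5 IVs
y = (X @ [1.0, -1.0, 0.5, 0.0, 0.0] + rng.normal(size=200) > 0).astype(int)   # 2 groups

b_reg = LinearRegression().fit(X, y).coef_                     # regression on the 0/1 DV
b_lda = LinearDiscriminantAnalysis().fit(X, y).coef_.ravel()   # discriminant coefficients

# Cosine between the two coefficient vectors: ~1 (up to sign), i.e. proportional
cos = b_reg @ b_lda / (np.linalg.norm(b_reg) * np.linalg.norm(b_lda))
print(abs(cos))
```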
Now for the more-than-two-groups situation. First, let us state that LDA (its extraction stage, not its classification stage) is equivalent (linearly related results) to canonical correlation analysis if you turn the grouping DV into a set of dummy variables (with one redundant dummy dropped) and do canonical analysis with the sets "IVs" and "dummies". The canonical variates that you obtain on the side of the "IVs" set are what LDA calls "discriminant functions" or "discriminants".
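A quick numerical check of that equivalence (my own sketch with scikit-learn and the iris data; LDA extraction on one side, canonical analysis of the IVs against the dummy-coded groups on the other):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cross_decomposition import CCA

X, y = load_iris(return_X_y=True)                      # 3 groups -> 2 discriminants

# LDA extraction stage: the discriminant scores
lda_scores = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# Canonical analysis of the "IVs" set vs the dummy-coded groups (one redundant dummy dropped)
dummies = np.eye(3)[y][:, :-1]
cca_scores = CCA(n_components=2).fit(X, dummies).transform(X)

# Each discriminant correlates (up to sign) ~1 with the corresponding canonical variate
for i in range(2):
    r = np.corrcoef(lda_scores[:, i], cca_scores[:, i])[0, 1]
    print(f"dimension {i + 1}: |r| = {abs(r):.3f}")
```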
So, how then is canonical analysis related to linear regression? Canonical analysis is in essence a MANOVA (in the sense of "multivariate multiple linear regression" or "multivariate general linear model") deepened into the latent structure of relationships between the DVs and the IVs. The two sets of variables are decomposed, in their inter-relations, into latent "canonical variates". Let us take the simplest example, Y vs X1 X2 X3. Maximization of correlation between the two sides is linear regression (if you predict Y by the Xs) or - which is the same thing - is MANOVA (if you predict the Xs by Y). The correlation is unidimensional (with magnitude R^2 = Pillai's trace) because the lesser set, Y, consists of just one variable.

Now let's take these two sets: Y1 Y2 vs X1 X2 X3. The correlation being maximized here is 2-dimensional because the lesser set contains 2 variables. The first and stronger latent dimension of the correlation is called the 1st canonical correlation, and the remaining part, orthogonal to it, the 2nd canonical correlation. So, MANOVA (or linear regression) just asks what the partial roles (the coefficients) of the variables are in the whole 2-dimensional correlation of the sets; while canonical analysis goes below that to ask what the partial roles of the variables are in the 1st correlational dimension and in the 2nd.
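For the one-DV case above, here is a small sketch of the claim that maximizing the correlation between the two sides is just linear regression (my own illustration with scikit-learn; the single canonical correlation should coincide with the multiple correlation R from regressing Y on the Xs):

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))                               # X1 X2 X3
y = X @ [0.5, -0.3, 0.2] + rng.normal(size=300)             # single Y

R = np.sqrt(LinearRegression().fit(X, y).score(X, y))       # multiple correlation from regression

u, v = CCA(n_components=1).fit_transform(X, y.reshape(-1, 1))
r_canonical = abs(np.corrcoef(u.ravel(), v.ravel())[0, 1])  # the (only) canonical correlation

print(R, r_canonical)                                       # should agree up to numerics
```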
Thus, canonical correlation analysis is multivariate linear regression deepened into the latent structure of the relationship between the DVs and the IVs. Discriminant analysis is a particular case of canonical correlation analysis (see exactly how). So, this was the answer about the relation of LDA to linear regression in the general case of more than two groups.
Note that my answer does not at all treat LDA as a classification technique. I was discussing LDA only as an extraction-of-latents technique. Classification is the second, stand-alone stage of LDA (I described it here). @Michael Chernick was focusing on it in his answers.
Inclusion/exclusion of variates (step 3):
I understand that you are asking which of the original measurement channels to include in the modeling.
- Is such a decision sensible for your data?
E.g. I work mainly with spectroscopic data, for which PLS is frequently and successfully used. Well-measured spectra have a high correlation between neighbouring variates, and the relevant information in spectroscopic data sets tends to be spread out over many variates. PLS is well suited for such data, but deciding on a variate-by-variate basis which variates to use for the model is IMHO usually not appropriate (decisions about inclusion/exclusion of spectral ranges based on spectroscopic knowledge about the application are IMHO a far better approach).
- If for your data and application variable selection is a natural choice, is PLS the regularization technique you want?
You may want to read the sections about regularization (3.4 - 3.6) in The Elements of Statistical Learning, where PLS as a regularization technique is compared to other regularization approaches. My point here is that, in contrast to e.g. the Lasso, PLS is a regularization technique that does not tend to completely exclude variables from the model (see the first sketch after this list). I'd thus say that PLS is probably more suitable for data where this behaviour is sensible, but in that case variable selection is not a natural choice (e.g. spectroscopic data).
- Does your data contain enough information for such a data-driven model optimization? Doing a t-test for each input variable is a massive multiple testing situation (see the second sketch after this list).
IMHO the main point of PLS (or other regularization techniques) is to avoid the need for such a variable selection.
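To illustrate the Lasso vs. PLS point from the second bullet, a rough sketch (my own example with synthetic correlated data and scikit-learn; the exact counts depend on the data and the penalty):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 50
latent = rng.normal(size=(n, 3))
X = latent @ rng.normal(size=(3, p)) + 0.1 * rng.normal(size=(n, p))   # correlated predictors
y = latent[:, 0] + 0.1 * rng.normal(size=n)

b_pls = PLSRegression(n_components=3).fit(X, y).coef_.ravel()
b_lasso = Lasso(alpha=0.1).fit(X, y).coef_

print("coefficients that are exactly 0 - PLS:  ", np.sum(b_pls == 0))    # typically none
print("coefficients that are exactly 0 - Lasso:", np.sum(b_lasso == 0))  # typically many
```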
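And for the multiple-testing point in the last bullet, a toy simulation (assumption: pure-noise data, so every "significant" variate is a false positive):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_per_group, n_variates = 20, 1000
a = rng.normal(size=(n_per_group, n_variates))   # group A: no real differences anywhere
b = rng.normal(size=(n_per_group, n_variates))   # group B: no real differences anywhere

_, p = stats.ttest_ind(a, b, axis=0)
print("variates with p < 0.05:", np.sum(p < 0.05))   # around 50 purely by chance
```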
Remark on step 2:
If you build a linear regression model in PCA score space, that is usually called principal component regression (PCR) in chemometrics. It is not the same as a PLS model.
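For clarity, a minimal sketch of the difference (my own example with scikit-learn, using the same number of components for both models):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 20))
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=80)

# PCR: scores are chosen without looking at y, then regressed on
pcr = make_pipeline(PCA(n_components=3), LinearRegression()).fit(X, y)

# PLS: scores are chosen to have high covariance with y
pls = PLSRegression(n_components=3).fit(X, y)

print("PCR:", pcr.predict(X[:1]), " PLS:", pls.predict(X[:1]).ravel())
```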
How to find out which variates are used by the PCA/PLS model?
There are several ways to approach this question. Obviously, variates whose PCA loadings or PLS weights are 0 do not enter the model. Whether it is sufficient to look at the loadings or whether you need to go a step further depends on your data: if the data set is not standardized, you may want to calculate how much each variate "contributes" to the respective PCA/PLS score.
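One simple way to compute such a "contribution" (my own heuristic sketch, not a recipe from the paper below: scale each loading by the standard deviation of its variate):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 10)) * np.linspace(1, 5, 10)   # variates on very different scales

pca = PCA(n_components=2).fit(X)
loadings = pca.components_                               # shape (n_components, n_variates)

contrib = np.abs(loadings) * X.std(axis=0)               # scale-aware contribution per variate
contrib /= contrib.sum(axis=1, keepdims=True)            # as fraction of each component
print(np.round(contrib, 3))
```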
Literature where we did that with LDA (it works just the same way): C. Beleites, K. Geiger, M. Kirsch, S. B. Sobottka, G. Schackert and R. Salzer: Raman spectroscopic grading of astrocytoma tissues: using soft reference information, Anal. Bioanal. Chem., 400 (2011), 2801 - 2816. The linked page has links to both the official web page and my manuscript.
You can also derive e.g. bootstrap distributions of the loadings (or of the contributions) and have a look at them. For PCR and PLS coefficients this is straightforward, as the Y variable automatically "aligns" the coefficients. PCA and PLS scores need more care, as e.g. flipping of the directions needs to be taken into account, and you may decide to treat models as equivalent whose scores (which are then used for further modeling) are just rotated or scaled versions of each other. Thus, you may want to align the scores first, e.g. by Procrustes analysis. The paper linked above also discusses this (for LDA, but again, the ideas apply to the other bilinear models as well).
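As a rough sketch of the straightforward case (bootstrap distribution of PLS regression coefficients, where Y keeps the signs aligned; my own example with scikit-learn):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 15))
y = X[:, 0] - X[:, 1] + 0.2 * rng.normal(size=80)

coefs = []
for _ in range(200):
    idx = rng.integers(0, len(y), len(y))                # bootstrap resample
    coefs.append(PLSRegression(n_components=2).fit(X[idx], y[idx]).coef_.ravel())
coefs = np.asarray(coefs)

# Per-variate 2.5 % / 97.5 % bootstrap interval of the coefficients
lo, hi = np.percentile(coefs, [2.5, 97.5], axis=0)
print(np.round(np.column_stack([lo, hi]), 2))
```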
Last but not least, you need to be careful not to overinterpret the models: if there is correlation between variates, you can have situations where important variates have coefficients that frequently touch the zero mark in the bootstrap experiments. What you can or cannot conclude will, however, depend on your type of data.
Best Answer
PLS-DA is in fact a hack on PLSR. PLS tries to find uncorrelated projections (scores) of X that maximize the covariance between X and Y. If you use all components (called latent variables in PLS), it becomes OLS. By using a single component, or any number of components less than the maximum possible, you are actually penalizing the remaining directions, which are assumed to be less useful for the regression (such as information in X that is irrelevant to Y).
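In other words, PLS-DA is essentially "dummy-code the classes, run PLSR, assign each case to the class with the largest predicted value". A minimal sketch of that reading (my own illustration with scikit-learn and the iris data; actual PLS-DA implementations may differ in details such as the decision rule):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cross_decomposition import PLSRegression

X, y = load_iris(return_X_y=True)
Y_dummy = np.eye(3)[y]                                  # one indicator column per class

plsda = PLSRegression(n_components=2).fit(X, Y_dummy)   # PLSR on the dummy matrix
y_hat = plsda.predict(X).argmax(axis=1)                 # class = largest predicted indicator
print("training accuracy:", (y_hat == y).mean())
```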
PLS with the SIMPLS algorithm produces:
$\mathbf T_{(m \times a)} = \mathbf X_{(m \times n)} \mathbf R_{(n \times a)}$, where $\mathbf T$ holds your X scores for the $a$ components and $\mathbf R$ the corresponding weights.
$\hat{\mathbf Y}_{(m \times l)} = \mathbf T_{(m \times a)} \mathbf Q_{(l \times a)}'$. This is the prediction step.
Therefore you can define the regression coefficients:
$\mathbf B_{(n \times l)} = \mathbf R_{(n \times a)} \mathbf Q_{(l \times a)}'$
So the prediction can be done with a single regression matrix $\mathbf B$:
$\hat{\mathbf Y}_{(m \times l)} = \mathbf X_{(m \times n)} \mathbf B_{(n \times l)}$
Now, if you retain all components (by setting $a = \min(m-1, n-1)$), then $\mathbf B$ will be equivalent to the OLS solution of $\mathbf Y = \mathbf X \mathbf B$; but if you choose a smaller number of components, it is PLSR, and in that case the only thing that remains common between OLS and PLSR is the size of the $\mathbf B$ matrix.
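A small numerical check of the two statements above (my own sketch; note that scikit-learn's PLSRegression implements NIPALS rather than SIMPLS, but the single-matrix prediction $\hat{\mathbf Y} = \mathbf X \mathbf B$ with $\mathbf B = \mathbf R \mathbf Q'$ and the all-components/OLS equivalence work out the same way):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
m, n = 50, 6
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)

# a = 2 components: B = R Q' reproduces the PLS prediction
pls = PLSRegression(n_components=2, scale=False).fit(X, y)
R = pls.x_rotations_                       # R (n x a): T = X_centered R
Q = pls.y_loadings_                        # Q (l x a)
B = R @ Q.T                                # B (n x l) = R Q'
pred = (X - X.mean(axis=0)) @ B + y.mean()
print(np.allclose(pred, pls.predict(X)))   # True

# a = n components: B coincides with the OLS coefficients
pls_full = PLSRegression(n_components=n, scale=False).fit(X, y)
B_full = pls_full.x_rotations_ @ pls_full.y_loadings_.T
print(np.allclose(B_full.ravel(), LinearRegression().fit(X, y).coef_))  # True up to numerics
```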
For more details, here is the original article: De Jong, Sijmen. "SIMPLS: an alternative approach to partial least squares regression." Chemometrics and intelligent laboratory systems 18, no. 3 (1993): 251-263.