Some general answers:
Using the Poisson distribution to estimate the mean parameter requires relatively weak assumptions: essentially only that our model for the mean (expected value) of the response variable y given the explanatory variables x is correctly specified. With the default log link this is
$$E(y \mid x) = \exp(x b)$$
We can use this to predict the mean for a new observation given a new set of explanatory variables. We can also get standard errors and confidence intervals for the prediction of the mean.
However, when we want the entire predictive distribution, we need the assumption that the distribution of the response is correctly specified. statsmodels does not have much built-in support for goodness-of-fit tests for distributions outside the normal case, especially not for discrete distributions like the Bernoulli, Binomial or Poisson.
However, it is easy to get the predictive distribution using the scipy.stats distributions.
For example a sequence to get the results for Poisson could be
from scipy import stats
import statsmodels.api as sm
results = sm.Poisson(y, x).fit()
mean_predicted = results.predict() # or use a new x
prob_2more = stats.poisson.sf(2 - 1, mean_predicted).mean() # average over x in sample
freq_2more = (y > 2 - 1).mean() # sample average
or, similarly, the probability and frequency of observing exactly y = 2 given the x or predicted mean for each observation:
prob_2_obs = stats.poisson.pmf(2, mean_predicted)
Note: 2 - 1 is used in sf because of the strict inequality in sf(k) = Prob(y > k); for a discrete distribution, Prob(y >= 2) = sf(1).
The pmf could be used to create an analog of a classification table comparing predicted versus empirical counts.
related code:
The first is essentially doing what I explained for sm.Poisson: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/discrete/discrete_model.py#L2691. There is also a Vuong test for comparing models (not yet in statsmodels): https://gist.github.com/jseabold/6617976#file-zip_model-py-L5
Caveat: The above predictions ignore parameter uncertainty, i.e. they don't take into account that the parameters for the mean of the Poisson distribution are estimated. We can use simulation to get the distribution of the predicted Poisson probabilities, but I didn't manage to come up with an API for it: it has to be simulated for each x in the prediction and produces a large collection of numbers, i.e. a distribution over distributions, one for each explanatory set x.
technical aside:
statsmodels has 3 versions for Poisson that all produce the same results but have different extras: sm.Poisson (from discrete_model), GLM with family Poisson, and GEE with family Poisson and independence or singleton clusters as in your case. I used Poisson above because it is easier to type, i.e. no family or extras to include.
You can use the following method to get the feature importance.
First of all, build and fit your classifier:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X, y)
now
clf.feature_importances_
will give you the desired results.
The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
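For instance, when only the first feature is informative, the normalized importances sum to 1 and concentrate on that feature (a sketch with illustrative data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# illustrative data: the class depends only on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# normalized Gini importances, one per feature, summing to 1
print(clf.feature_importances_)
```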
You can reconstruct the part of a spectrum explained by the PLS model.
That is often something useful to do with spectra used for prediction by the model. In particular, it can be useful to check what part of the spectrum is not explained by the model (out-of-model error/residuals), and whether that is unusually large for some sample you want to predict. Depending on the application/scenario/data, it can also be insightful to check whether the reconstructed spectrum has higher intensity than the actually measured spectrum.
And yes, it can also be useful for model interpretation to have a look how the pure analyte spectrum is thought to look by the model.
You won't get a pure component spectrum, though.
Consider a particular analyte signal/band that is overlaid by some strong interferent signal in your application, and that in consequence has low (or even no) correlation with the analyte concentration. This band should stay in the unexplained X variance* - that's the point of PLS regularization.
Reconstructing explained spectra from concentrations should thus also not explain this band (i.e. give just average/center intensity).
*Assuming you do not use too many latent variables; but after all, PLS is used because it allows prediction with few latent variables.
If you go for the full PLS model, eventually also this band will be modeled.