I've always seen this statement in academic papers, blog posts, documentation, etc., but I've never understood it. What does it mean to penalize a function, and what is a concrete example of it? Just to give an example, in a recent paper I read, after taking the SVD of a matrix, the authors then used two functions to penalize the U and V matrices.
Solved – What does penalizing a function mean, and how is it implemented
machine-learning, probability
Related Solutions
In the context of support vector regression, the fact that your data is a time series is mainly relevant from a methodological standpoint -- for example, you can't do a k-fold cross validation, and you need to take precautions when running backtests/simulations.
Basically, support vector regression is a discriminative regression technique much like any other discriminative regression technique. You give it a set of input vectors and associated responses, and it fits a model to try and predict the response given a new input vector. Kernel SVR, on the other hand, applies one of many transformations to your data set prior to the learning step. This allows it to pick up nonlinear trends in the data set, unlike e.g. linear regression. A good kernel to start with would probably be the Gaussian RBF -- it will have a hyperparameter you can tune, so try out a couple values. And then when you get a feeling for what's going on you can try out other kernels.
With a time series, an important step is determining what your "feature vector" ${\bf x}$ will be; each $x_i$ is called a "feature" and can be calculated from present or past data, and each $y_i$, the response, will be the future change over some time period of whatever you're trying to predict. Take a stock for example. You have prices over time. Maybe your features are (a) the 200-day/30-day moving-average spread and (b) the 20-day volatility, so you calculate each ${\bf x_t}$ at each point in time, along with $y_t$, the (say) following week's return on that stock. Thus, your SVR learns how to predict the following week's return based on the present MA spread and 20-day vol. (This strategy won't work, so don't get too excited ;)).
If the papers you read were too difficult, you probably don't want to try to implement an SVM yourself, as it can be complicated. IIRC there is a "kernlab" package for R that has a Kernel SVM implementation with a number of kernels included, so that would provide a quick way to get up and running.
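If you'd rather stay in Python, here is a minimal sketch of the feature/response construction described above using scikit-learn's SVR (my substitution for kernlab; the synthetic price path, window lengths and hyperparameter values are all illustrative, not a recommendation):

```python
# Sketch: build (MA spread, 20-day vol) features and a forward-return response,
# then fit an RBF-kernel SVR. All numbers here are made up for illustration.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
prices = 100 * np.exp(np.cumsum(0.001 * rng.standard_normal(1500)))  # synthetic price path

horizon = 5      # predict the return over the next 5 trading days ("following week")
lookback = 200   # longest window any feature needs

X, y = [], []
for t in range(lookback, len(prices) - horizon):
    ma200 = prices[t - 200:t].mean()                       # 200-day moving average
    ma30 = prices[t - 30:t].mean()                         # 30-day moving average
    vol20 = np.std(np.diff(np.log(prices[t - 21:t])))      # 20-day volatility of log returns
    X.append([ma200 - ma30, vol20])                        # feature vector x_t
    y.append(prices[t + horizon] / prices[t] - 1.0)        # response y_t: forward return

X, y = np.array(X), np.array(y)

# Respect time order: fit on the past, evaluate on the future (no shuffled k-fold).
split = int(0.8 * len(X))
model = SVR(kernel="rbf", C=1.0, gamma="scale", epsilon=1e-3)
model.fit(X[:split], y[:split])
print("out-of-sample R^2:", model.score(X[split:], y[split:]))
```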
Nested cross validation explained without nesting
Here's how I see (nested) cross validation and model building. Note that I'm a chemist and, like you, look at the model building process from the application side (see below). My main point here is that, from my point of view, I don't need a dedicated nested variety of cross validation. I need a validation method (e.g. cross validation) and a model training function:

model = f(training data)

"My" model training function f does not need any hyperparameters because it internally does all hyperparameter tuning (e.g. your alpha, lambda and threshold).

In other words, my training function may contain any number of inner cross validations (or out-of-bag or whatever performance estimate I may deem useful). However, note that the distinction between parameters and hyperparameters is typically that the hyperparameters need to be tuned to the data set/application at hand, whereas the parameters can then be fitted regardless of what data it is. Thus, from the point of view of the developer of a new classification algorithm, it does make sense to provide only the "naked" fitting function g(training data, hyperparameters) that fits the parameters given data and hyperparameters.
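As an illustration, here is a minimal Python/scikit-learn sketch of the g / f distinction. The elastic-net classifier, the small hyperparameter grid and the names g and f are my own illustrative assumptions, not the setup of the question:

```python
# g: the "naked" fitting function; f: the self-contained training function that
# tunes its own hyperparameters with an inner cross validation.
import numpy as np
from itertools import product
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

def g(X, y, alpha, l1_ratio):
    """Fit the parameters for *given* hyperparameters."""
    return SGDClassifier(loss="log_loss", penalty="elasticnet",
                         alpha=alpha, l1_ratio=l1_ratio).fit(X, y)

def f(X, y):
    """Tune the hyperparameters with an inner cross validation,
    then refit on all of the data this function was handed."""
    grid = list(product([1e-4, 1e-3, 1e-2], [0.1, 0.5, 0.9]))  # candidate (alpha, l1_ratio)
    inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = np.zeros(len(grid))
    for train, test in inner_cv.split(X, y):                   # inner CV: 5 splits ...
        for i, (alpha, l1_ratio) in enumerate(grid):           # ... times 9 candidates = 45 calls to g
            scores[i] += g(X[train], y[train], alpha, l1_ratio).score(X[test], y[test])
    best_alpha, best_l1 = grid[int(np.argmax(scores))]
    return g(X, y, best_alpha, best_l1)                        # ready-to-use model, no hyperparameters exposed
```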
The point of having the "outer" training function f is that, after your cross validation run, it gives you a straightforward way to train "on the whole data set": just call f(whole data set) instead of the calls f(cv split training data) used for the cross validation surrogate models.

Thus, in your example, you'll have 5 + 1 calls to f, and each call to f will in turn make e.g. 100 * 5 calls to g.
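Continuing the sketch above (and reusing its f), the outer cross validation only ever calls f, and the final model is one extra call of f on the whole data set, i.e. 5 + 1 calls in total; the data set below is synthetic, and the small grid inside f stands in for the question's 100 candidates:

```python
# Outer loop: 5 calls to f on CV surrogate training sets, plus 1 call on all data.
# Assumes f from the previous sketch is in scope.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_scores = []
for train, test in outer_cv.split(X, y):      # 5 calls to f (the CV surrogate models)
    surrogate = f(X[train], y[train])
    outer_scores.append(surrogate.score(X[test], y[test]))

final_model = f(X, y)                         # the +1 call: train on the whole data set
print("outer-CV estimate of generalization accuracy:", np.mean(outer_scores))
```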
Probability threshold
While you could do this with yet another cross validation, this is not necessary: it is just one more hyperparameter that your ready-to-use model has, and it can be estimated inside f.
What you need in order to fix the threshold is a heuristic that allows you to calculate it. There is a wide variety of heuristics (from ROC analysis combined with a statement of how important it is to avoid false positives compared to false negatives, over a minimum acceptable sensitivity, specificity, PPV or NPV, to allowing two thresholds and thus an "uncertain" (NA) level, and so on) that are suitable in different situations; good heuristics are usually very application specific.
But for the question here, you can do this inside f, e.g. by using the predictions obtained during the inner cross validation to calculate a ROC curve and then choosing your working point/threshold accordingly.
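For instance, here is a rough sketch of estimating such a threshold inside the training function from out-of-fold predictions. The working-point rule (maximizing Youden's J, i.e. TPR - FPR), the logistic-regression classifier and the name f_with_threshold are all my own assumptions; substitute whatever heuristic matches your application's costs:

```python
# Sketch: pick a probability threshold from inner-CV (out-of-fold) predictions,
# then return a ready-to-use predictor that already contains that threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict

def f_with_threshold(X, y):
    base = LogisticRegression(max_iter=1000)
    # Out-of-fold predicted probabilities from an inner 5-fold cross validation:
    proba = cross_val_predict(base, X, y, cv=5, method="predict_proba")[:, 1]
    fpr, tpr, thresholds = roc_curve(y, proba)
    threshold = thresholds[np.argmax(tpr - fpr)]   # one possible working point (Youden's J)
    model = base.fit(X, y)                         # refit the classifier on all data
    return lambda X_new: (model.predict_proba(X_new)[:, 1] >= threshold).astype(int)
```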
Specific Comments to parts of the question
I understand that I shouldn't report the performance from the CV used to pick the optimal hyperparameters as an estimate of the expected performance of my final model (which would be overly-optimistic) but should instead include an outer CV loop to get this estimate.
Yes. (Though the inner estimate does carry information in relation to the outer estimate: if it is much more optimistic than the outer estimate, you are typically overfitting.)
I understand that the inner CV loop is used for model selection (in this case, the optimal hyperparameters) and that the outer loop is used for model evaluation, i.e., the inner and outer CV serve two different purposes that often are erroneously conflated.
Yes. Any kind of data-driven model tuning, really; that includes tuning your cutoff threshold.
That is, the hyperparameter tuning is part of "the method for building the model".
I prefer to see it this way as well: I'm a chemist and, like you, look at it from the application side. For me, a trained/fitted model is not complete without the hyperparameters; or, more precisely, a model is something I can use directly to obtain predictions. Though, as you note, other people have a different view (in which "the model" does not include the hyperparameter tuning). In my experience, this is often the case with people developing new models: hyperparameter tuning is then a "solved problem" and not considered. (Side note: their view of what cross validation can do in terms of validation is also slightly different from what cross validation can do from the application side.)
Best Answer
Let's say you want to achieve a certain objective. For instance, you want to find $z_i$ that minimize the sum of squares: $$\sum_i(y_i-z_i)^2$$ where the $y_i$ are given. In this case, obviously $z_i=y_i$ would minimize the sum; in fact, it would make it zero.
Now, what if we wanted to impose some other condition on the $z_i$? For example, I'd like to penalize the length of the piece-wise linear curve that goes through the $z_i$. The longer the curve, the higher the penalty.
Here's how I could do it: change the objective to the following: $$\sum_i(y_i-z_i)^2+\sum_i \sqrt{1+(z_i-z_{i-1})^2}$$ The second sum is exactly the length of the piece-wise linear curve through the points $(i, z_i)$: each segment has horizontal run $1$ and vertical rise $z_i-z_{i-1}$.
Now, if you allow $z_i\ne y_i$, you may be able to reduce the second term of the objective by more than you increase the first term, so the net effect is a lower objective than with $z_i=y_i$. The minimizer therefore trades off fit against curve length.
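Here is a small numerical illustration of this penalized objective; the data and the use of scipy.optimize.minimize are my own, just to show the effect of adding the penalty:

```python
# Minimize sum_i (y_i - z_i)^2 + sum_i sqrt(1 + (z_i - z_{i-1})^2) over z (made-up y).
import numpy as np
from scipy.optimize import minimize

y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])   # the given y_i

def objective(z):
    fit = np.sum((y - z) ** 2)                        # sum_i (y_i - z_i)^2
    length = np.sum(np.sqrt(1 + np.diff(z) ** 2))     # length of the piece-wise linear curve
    return fit + length

z_star = minimize(objective, x0=y.copy()).x
print("unpenalized solution:", y)        # z_i = y_i makes the first term zero
print("penalized solution:  ", z_star)   # pulled toward a shorter (smoother) curve
```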
This is the general idea, and you can apply it to many situations, such as SVD, where you're also minimizing some kind of objective: you add a penalty term to it and get a different solution.