Solved – How does data augmentation reduce overfitting

classificationdata augmentationmachine learningoverfittingsampling

I'm trying to understant the benefit apported by the step of data augmentation in a classification algorithm.
I have a vector of hexadecimal strings and a column vector containing the label associated with the string in the same position. As an optional step in the classification algorithm, a data augmentation process is performed by subsetting the strings in pieces and replating the associated label for the number of split performed.

What are the benefit of this process?

Best Answer

Overfitting occurs when you have too few records relative to other parameters (e.g., predictors or features). I'm not familiar with your data, but it sounds like the subsetting is creating additional records.

Related Solutions

Solved – How to find classifier success performance

If you're just playing around with this, may I make two recommendations?

Start out with one of the classic datasets. This will let you compare your results with other published data. There are a lot floating around the internet (e.g., from the UCI Repository). Some are even built into matlab--you can load the Fisher Iris data set by running load fisheriris.
If possible, consider a two-class problem first. Both the training and evaluation get more complicated with multiple classes.

That said, onto your specific questions.

1 and 2) I'm still not quite sure I follow your question. if you've got strings of As, Bs, and Cs and you want to classify them based on the number of As, Bs, and Cs, why not just count up the As, Bs, and Cs and use that? I guess you could feed those counts into a classifier if the counts are noisy or your data breaks ties in a systematic but unspecified way (e.g AAABBBCC --> A, AABBBCCC --> C), but otherwise I'm not sure what it buys you.

Since I think I'm still missing something, maybe we should make sure we're on the same page about what a feature is? The Fisher Iris data set contains data from Iris flowers. There are four features: the length and width of the flower's petal and sepal (the leaf that covers the flower) and three classes (three species of flowers). The length and width features are continuous (like your proportions), but you could also have discrete or binary features too.

Coming up with clever feature representations is one of the harder parts of building a classifier system. It probably depends a lot on the domain and make take some trial-and-error.

2) kNN is actually pretty flexible. If your features are proportions (or other continuous values), then Euclidean distance is a perfectly reasonable choice. Other distance metrics might be reasonable too, depending your specific application. One thing to keep in mind is that you want your feature dimensions to be about the same size. If feature1 ranges from 0-1 and feature2 ranges from 0-100000, then feature1 may be ignored.

Actually, if you're comparing strings, have you thought about using an edit-distance function like Levenshtein Distance? That might be particularly appropriate if you have several "prototype" strings and you want to figure out which one best matches the (potentially noisy) input. You'd compute the edit distance between your string and the prototypes and pick the one with the lowest distance.

3) No, $k$ is the number of neighbors to consider; It doesn't have to related to the number of classes. When classifying a new data point with 1-NN, the new point gets the class of the closest data point in your training set. For 5-NN, you're going to assign the most common class of the 5 nearest data points, etc. To avoid ties, $k$ is usually odd for two-class classifiers; I would avoid $k=3$ if you have 3 classes, for the same reason.

4) I don't know much about ROC analysis for multiclass problems. I would be tempted to do either pairwise (A vs B, A vs C, B vs C) or one-versus all (A vs {BC}, B vs {AC}, C vs {AB}). There are few threads about ROC Surfaces; that might let you compare them all at once.

5) Again, for two-class problems, people typically compare the area under the curve: More area under the curve -> better classification. It looks like there is an analogous "volume under the ROC surface" (e.g., He and Frey, 2007) for multiclass problems.

Edit: Some answers to your new questions A) Sure. That seems like the fairest way to compare them. B) The matlab function's prototype looks like this (at least for the most recent version) Class = knnclassify(Sample, Training, Group, k, distance,rule)

Sample: Your test set (the data to be classified). Each row is a single example; the columns are the feature values. Sample(1,:) = your first example. Sample(2,:) = the 2nd, etc.

Training: Training data. The columns have to be the same as Sample, though obviously you can have a different number of rows.

Group: The class label for the training set. Training(1,:) is from class Group(1), etc. Can either be strings or numbers (numbers is probably easier).

k: Number of neighbors to consider. Say we've got a data point D that we want to classify. If k=1, then we find the closest point (see next parameter) to D and assign D the same class as that point. For larger $k$, we find the $k$ nearest points, and then use them to assign the label. Suppose $k=3$ and the three nearest points to D are from class A, class, B, and class A. Since the majority of the points are from class A, we assign D to class A too. Sometimes, there are ties (e.g., if the three closest points were from class A, B, and C). The rule parameter determines how those are broken.

distance: Determines which distance metric to use. See the docs for options, but you probably want Euclidean distance, at least to start.

rule: How ties are broken. This obviously only matters if $k>1$. Suppose you set $k=3$. If it's 'consensus', then the classifier doesn't classify examples where all three points are not from the same class (you get a nan or empty string, depending on how you arranged groups). If it's one of the other settings, you get the most common class of the three nearest points. If two or more classes are equally common, then 'random' picks one at random and 'nearest' favors the class of the closest point to break ties.

The classify() function is similar, but has a lot more options (see the docs). However, if you want a stock Naive Bayes classifier, I think something like classify(sample, train, groups, 'diaglinear', 'empirical'); will work.

C) For ROC analysis, you generally don't provide the features! Instead, you need the true class label, the classifier's output, and a "score" value that gives an estimate of how confident the classifier is in its output. See the matlab function perfcurve for the two-class version; you'll have to roll your own, I think, for ROC surfaces.

Machine Learning – Should Data Be Separated or Combined When Using Extra Categorical Feature?

I think you can merge the datasets and randomly shuffle the points. The model trained on the merged data with the categorical variable will be more robust if you are using a supervised learning approach. But if you want an unsupervised model, I recommend to not use the categorical variable and use separate pipelines.

If you use separate pipelines for your supervised approach, you can also train on one dataset and test on another dataset to check how the model performs. The data with similar distribution will have similar results I guess. This will be good to understand how the model performs in alien setting and the learned feature map space.

There should not be any issue with predicting labels while merging the dataset. The model will map them according to your features in a supervised approach.

Best Answer

Related Solutions

Solved – How to find classifier success performance

Machine Learning – Should Data Be Separated or Combined When Using Extra Categorical Feature?

Related Question