Solved – How to do a bias-variance analysis on a machine learning modelling process

bias-variance tradeoff, machine learning, modeling, variance-stabilizing

I searched for questions on the bias-variance trade-off and found many with answers of varying depth. The information is too scattered and unsystematic to resolve the question I am stuck on.
So I am opening a new question here to walk through how I generally run a machine learning modelling process and to ask for help on the bias-variance assessment issues where I was confused or had no idea. If anyone knows of posts relevant to my questions, please point out the links; many thanks.

I execute the process basically following the workflow diagram below:

[workflow diagram]

Question 1: When I start at step 1, one issue comes to mind: all my subsequent analysis will be based on this particular train-test split. Could another split cause an obvious change in the analysis results (final models, performance metrics, important features, etc.)? If so, the analysis would be split-dependent.

I have some thoughts here: assuming there exists an unknown underlying distribution of the domain set X (the whole dataset except the target class y), if the sample distribution of the training set S is approximately similar to that of the test set T, then the learning process should achieve good performance. Otherwise, if the two sample distributions differ in some non-trivial way, the learner/classifier trained on S will probably not do a good job on T.

As a trial, I ran 10-fold cross-validation on the whole dataset to see how variable the ROC AUC score was. On my particular dataset, I got a set of 10 scores with large variation (from a maximum of 0.85 down to a minimum of 0.59). This made me worry that the split I adopted for the analysis might be a poor one.
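For reference, here is a minimal sketch of that check, assuming scikit-learn; the synthetic imbalanced data and the logistic regression are only placeholders for my real dataset and model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data as a stand-in for my real X, y
X, y = make_classification(n_samples=500, n_features=20, weights=[0.85, 0.15],
                           random_state=0)

clf = LogisticRegression(max_iter=1000)               # placeholder for my actual model
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One ROC AUC per fold; the min-to-max spread is what worried me
scores = cross_val_score(clf, X, y, scoring="roc_auc", cv=cv)
print("per-fold AUC:", np.round(scores, 3))
print("mean / std  :", scores.mean().round(3), scores.std().round(3))
```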

To expand on Question 1, I would like to ask further: what is this issue called in machine learning theory? Is it model stability or model variance?

I tune the hyperparameter(s) in step 2 via cross-validation. I try to use a validation curve (refer to here), as below, to choose the parameter value and avoid under-/over-fitting problems.

[validation curve plot]
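Here is a rough sketch of how I produce such a curve, assuming scikit-learn's validation_curve; the estimator and the tuned parameter (C of a logistic regression) are placeholders for my actual model and hyperparameter:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

# Placeholder imbalanced data
X, y = make_classification(n_samples=500, n_features=20, weights=[0.85, 0.15],
                           random_state=0)

param_range = np.logspace(-3, 3, 7)
train_scores, val_scores = validation_curve(
    LogisticRegression(max_iter=1000), X, y,
    param_name="C", param_range=param_range,
    scoring="roc_auc", cv=10)

# Low train and validation scores -> underfitting (high bias);
# high train score but much lower validation score -> overfitting (high variance)
for C, tr, va in zip(param_range, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"C={C:>8.3f}  train AUC={tr:.3f}  validation AUC={va:.3f}")
```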

Question 2: If this curve works well for detecting under-/over-fitting problems, which metric should be used: accuracy score/error rate, precision, recall, F1, or ROC AUC? (My current project is a classification problem with an imbalanced binary target class.)

According to the definition of bias, the average error rate could serve as the measure of bias, so the accuracy score/error rate seems the more natural choice for a bias/variance assessment.
In practice, however, I care more about the ROC AUC score than about accuracy itself.

Question 3: how can we measure the variance? I saw a plot here in which the gap was said to indicate potential variance (poorer performance on the validation set than on the training set). Is there any other (better) way/metric for this?

[learning curve plot showing the training/validation gap]
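As an illustration of that gap check, here is a hedged sketch using scikit-learn's learning_curve on placeholder data and a placeholder model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Placeholder imbalanced data
X, y = make_classification(n_samples=500, n_features=20, weights=[0.85, 0.15],
                           random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), scoring="roc_auc", cv=10)

# A persistent gap between the two curves is the usual visual proxy for variance;
# both curves plateauing at a low score points to bias instead
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={int(n):>4d}  train AUC={tr:.3f}  validation AUC={va:.3f}  gap={tr - va:.3f}")
```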

In my analysis, the bias/variance check is run at the parameter-tuning stage, which seems to return a candidate model with a balanced bias/variance combination.

Question 4: does my bias/variance analysis miss any steps, and how should I adjust it?

Many ensemble learning methods have their own ways of reducing variance or bias, and I hope to objectively evaluate the bias/variance of each candidate model.
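For example, one option I am considering, assuming the third-party mlxtend package is available, is its bias_variance_decomp helper, which estimates average loss, bias, and variance for a candidate model via bootstrap resampling:

```python
# Hedged sketch: putting numbers on bias and variance per candidate model,
# assuming mlxtend is installed (pip install mlxtend); data is synthetic
from mlxtend.evaluate import bias_variance_decomp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, weights=[0.85, 0.15],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)

for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0)):
    loss, bias, var = bias_variance_decomp(
        model, X_tr, y_tr, X_te, y_te, loss="0-1_loss",
        num_rounds=50, random_seed=0)
    print(type(model).__name__, f"loss={loss:.3f} bias={bias:.3f} var={var:.3f}")
```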

Thanks for reading my long question; working through it really helps me deliver professional practice in my daily work. Many thanks again.

Best Answer

Note 1: It is better to split your data into three sets instead of two: a training set, a validation set, and a test set. The validation set is used in step 2 of your diagram: you train a model on the training set and evaluate each hyperparameter setting on the validation set. It is not valid to evaluate hyperparameters on the training set, since that tends to select the hyperparameters that best fit the training data and hence causes overfitting.
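As a minimal sketch of such a three-way split, assuming scikit-learn (the 60/20/20 proportions and the synthetic data are only illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data as a stand-in for the real dataset
X, y = make_classification(n_samples=500, n_features=20, weights=[0.85, 0.15],
                           random_state=0)

# First carve off the test set, then split the remainder into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

# Tune hyperparameters by fitting on (X_train, y_train) and scoring on (X_val, y_val);
# touch (X_test, y_test) only once, for the final chosen model
```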

Note 2: The evaluations made in steps 2 and 4 depend on the particular split of your data. Because of this, it is suggested to (i) shuffle your data before splitting and (ii) repeat the splitting many times and report the average performance over these trials. Another solution is to use k-fold cross-validation. The most accurate performance evaluation is obtained with the leave-one-out method, but leave-one-out is very slow, and in real-world applications 10-fold cross-validation is usually sufficient.
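For example, a repeated stratified k-fold sketch, again assuming scikit-learn and synthetic stand-in data; reporting the mean and spread over repeats makes the estimate less dependent on any single split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder imbalanced data and model
X, y = make_classification(n_samples=500, n_features=20, weights=[0.85, 0.15],
                           random_state=0)

# 10-fold CV repeated 5 times with different shuffles
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="roc_auc", cv=cv)
print(f"mean AUC over {len(scores)} folds: {scores.mean():.3f} +/- {scores.std():.3f}")
```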

Note 3: You are right that if the training distribution differs from the test distribution, the learned model and the evaluated performance will not carry over to the test data. The topic covering this issue in machine learning is domain adaptation.

Note 4: Since your data is an imbalanced binary problem, I recommend using the F-measure. Precision, recall, and accuracy on their own are not enough to evaluate the performance of a model learned from imbalanced data.
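A small sketch of computing the F-measure (plus a fuller per-class report) on validation predictions, assuming scikit-learn; the data and model are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

# Placeholder imbalanced data and a placeholder model
X, y = make_classification(n_samples=500, n_features=20, weights=[0.85, 0.15],
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y,
                                            random_state=0)

y_pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_val)
print("F1 (positive class):", round(f1_score(y_val, y_pred), 3))
print(classification_report(y_val, y_pred))   # precision/recall/F1 per class
```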

Finally, Andrew Ng, a well-known researcher in machine learning, is working on a book called "Machine Learning Yearning". This book focuses on the practical aspects of machine learning, and each chapter is around two pages long. As far as I know, 12 chapters have been released so far. They are relevant to your questions and I recommend reading them.
