Motivation
I am currently reading *Understanding Machine Learning* by Shalev-Shwartz and Ben-David. The book uses statistics terminology in its machine learning theory, and it is not clear to me how to reconcile the statistical terms with the terms used in coding practice. (I do not restrict my question to the definitions from the book; they only serve as a reference point for what I mean by "stat terms".)
I would like to be precise when I use the following terms that I encounter in statistics or in practice:
Question
Define:
error, loss, risk, empirical risk
error rate, accuracy
Using context:
I divide the sample dataset into a train set and a test set, then train on the train set and do the empirical risk (?) assessment on the test set. In practice, we often call that the error rate. However, the error in statistics is simply the distance between a single prediction and the correct value, while risk has something to do with an expected value.
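In the book's notation (a sketch; $\ell$ stands for a generic loss function, which is my assumption rather than a fixed choice), the distinction I am gesturing at seems to be:

$$\ell\big(h(x), y\big) \quad \text{(loss of a single prediction)}$$
$$L_{\mathcal D}(h) = \mathbb E_{(x,y)\sim\mathcal D}\big[\ell(h(x), y)\big] \quad \text{(risk: expected loss over the true distribution } \mathcal D\text{)}$$
$$L_S(h) = \frac{1}{m}\sum_{i=1}^{m} \ell\big(h(x_i), y_i\big) \quad \text{(empirical risk over a sample } S \text{ of size } m\text{)}$$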
To be more specific about my motivation (this is not part of my question) I would like to paraphrase agnostic PAC learnability such that the terms align both with usual statistic definitions and e.g. common pytorch coding practices. For this reason, I need to be precise with the terms above.
Can you help me on this quest?
Best Answer
I think ChatGPT 3.5 gave a comprehensive, concise, and correct answer. (I prompted it with the question on 26 Feb 2024.) Here it is:
Certainly! Let's clarify the terms:
Error: The deviation of a single prediction from the true value, e.g. $\hat y - y$ in regression, or the indicator $\mathbb 1[\hat y \neq y]$ in classification. It refers to one example, not to a dataset.

Loss: A function $\ell(\hat y, y)$ that quantifies the cost of predicting $\hat y$ when the truth is $y$ (e.g. squared loss, 0–1 loss, cross-entropy). The "error" above is the value of a particular loss on a particular example.

Risk: The expected loss of a hypothesis $h$ over the true data distribution $\mathcal D$: $L_{\mathcal D}(h) = \mathbb E_{(x,y)\sim\mathcal D}[\ell(h(x), y)]$. It is not computable in practice because $\mathcal D$ is unknown.

Empirical Risk: The average loss over a finite sample $S$: $L_S(h) = \frac{1}{m}\sum_{i=1}^m \ell(h(x_i), y_i)$. Minimizing it over the train set is empirical risk minimization (ERM); evaluating it on held-out data estimates the risk.

Error Rate: The empirical risk under the 0–1 loss, i.e. the fraction of misclassified examples in a dataset.

Accuracy: One minus the error rate, i.e. the fraction of correctly classified examples.
In your context, when you divide the dataset into a train set and a test set, training on the train set and assessing on the test set involves computing the empirical risk (training loss) during training and evaluating the model's performance in terms of error rate or accuracy on the test set. The key is to understand the statistical concepts and how they align with machine learning terminology in practice.
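To make the mapping concrete, here is a minimal sketch in pure Python (the hypothesis `h`, the threshold 0.5, and the toy datasets are all hypothetical, chosen only for illustration). It shows that "empirical risk" is just an average loss over a sample, and that under the 0–1 loss the empirical risk on the test set is exactly what practitioners call the error rate:

```python
def zero_one_loss(y_pred, y_true):
    """Loss l(y_hat, y): cost of a single prediction (0-1 loss here)."""
    return 0.0 if y_pred == y_true else 1.0

def empirical_risk(h, sample):
    """L_S(h): average loss of hypothesis h over a finite sample S."""
    return sum(zero_one_loss(h(x), y) for x, y in sample) / len(sample)

# Hypothetical hypothesis: classify x as 1 iff x >= 0.5.
h = lambda x: 1 if x >= 0.5 else 0

# Hypothetical (x, y) pairs standing in for train/test splits.
train = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
test = [(0.2, 0), (0.45, 1), (0.7, 1), (0.95, 1)]

train_risk = empirical_risk(h, train)  # empirical risk on the train set
error_rate = empirical_risk(h, test)   # under 0-1 loss this IS the test error rate
accuracy = 1.0 - error_rate            # fraction of correct test predictions

print(train_risk, error_rate, accuracy)  # -> 0.0 0.25 0.75
```

With a differentiable loss (e.g. cross-entropy in PyTorch), the quantity minimized during training is still the empirical risk on the train set; the 0–1 loss is only swapped in at evaluation time to report error rate or accuracy.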