Solved – Ideal learning sample in machine learning

Tags: bias, data mining, machine learning, prediction, sampling

I am constructing a model for the prediction of a binary (Yes/No) outcome. I have a learning sample with 1500 examples of the "Yes" group and 500 examples of the "No" group. Should I use all the data I have to train the model? Would the result be biased towards "Yes"?

I thought of using 500 "Yes" and 500 "No" examples instead, but I am not sure whether this would affect my future predictions positively or negatively.

Thanks.

Best Answer

Most learning algorithms have a way to deal with skewed data sets. In general, use as much data as you can for training, since more data tends to improve generalization performance.
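As one illustration of how an algorithm can account for the 1500/500 imbalance without discarding data, here is a minimal sketch in R. The simulated predictor `x`, the data frame `d`, and the choice of logistic regression are assumptions made for the example; none of them come from the question.

```r
set.seed(1)

# Hypothetical data: 1500 "Yes" (y = 1) and 500 "No" (y = 0) examples
n_yes <- 1500
n_no  <- 500
d <- data.frame(
  y = c(rep(1, n_yes), rep(0, n_no)),
  x = c(rnorm(n_yes, mean = 1), rnorm(n_no, mean = 0))
)

# Option A: use all 2000 rows, unweighted
fit_all <- glm(y ~ x, family = binomial, data = d)

# Option B: use all 2000 rows, but weight each "No" row 3x so that
# both classes carry the same total weight
w <- ifelse(d$y == 1, 1, n_yes / n_no)
fit_weighted <- glm(y ~ x, family = binomial, data = d, weights = w)

# Option C: downsample "Yes" to 500 rows, discarding 1000 examples
keep <- c(sample(which(d$y == 1), n_no), which(d$y == 0))
d_down <- d[keep, ]
fit_down <- glm(y ~ x, family = binomial, data = d_down)

# Compare the fitted coefficients of the three approaches
rbind(all = coef(fit_all), weighted = coef(fit_weighted), down = coef(fit_down))
```

In this sketch the unweighted fit has the larger intercept, which reflects the higher share of "Yes" examples in the sample. Weighting removes that shift while still using every row, whereas downsampling removes it at the cost of throwing data away.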

Related Solutions

Solved – Preparing for machine learning exam

The answer is that all of the methods can be used for the above problem.

Well, two things should be noted in these kinds of simple problems.

1. Is it a classification or a regression problem? You might have already guessed that it is a classification problem.
2. Are there any categorical values in the input features? If yes, does the chosen algorithm work with categorical variables?

The examiner may expect the answer that neural networks, SVMs, etc. don't work with categorical variables. But in fact you can encode a categorical variable as a series of binary variables. For example, if the variable age group takes values {child, young, old}, then you may replace this single variable with three binary variables: is_child, is_young and is_old. This way you can use an SVM or a neural network.
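The encoding described above can be sketched in base R with `model.matrix`. The data frame `d` and its four rows are made up for illustration; only the level names and the `is_*` column names come from the text.

```r
# Hypothetical data: one categorical feature with three levels
d <- data.frame(
  age_group = factor(c("child", "young", "old", "young"),
                     levels = c("child", "young", "old"))
)

# "- 1" drops the intercept so that every level gets its own 0/1 column
X <- model.matrix(~ age_group - 1, data = d)

# Rename the columns to match the names used in the text
colnames(X) <- paste0("is_", levels(d$age_group))
X
```

Each row of `X` has exactly one entry equal to 1, so the matrix can be passed to any method that expects numeric inputs.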

Again, linear regression looks like an unlikely candidate for a classification problem, but it can be used for classification as well. Don't expect notable performance, though.
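One common way to do this, shown here only as a rough sketch, is to regress a 0/1 label on the features and threshold the fitted value at 0.5. The simulated data, the variable names, and the 0.5 cutoff are assumptions for the example.

```r
set.seed(1)

# Hypothetical two-class data with a single numeric feature
d <- data.frame(
  x = c(rnorm(100, mean = -1), rnorm(100, mean = 1)),
  y = rep(c(0, 1), each = 100)
)

# Ordinary least squares on the 0/1 label
fit <- lm(y ~ x, data = d)

# Classify as 1 when the fitted value exceeds 0.5
pred <- as.integer(predict(fit, newdata = d) > 0.5)

# Training accuracy of the thresholded predictions
accuracy <- mean(pred == d$y)
accuracy
```

The fitted values are not constrained to lie between 0 and 1, which is one reason logistic regression is usually preferred for this kind of problem.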

Solved – Bias parameter in machine learning linear regression

That seems like really confusing terminology, but what it means is, irrespective of the input $x$, the data will tend to be centered around $b$. If $x=0$ for all observations, the output of the regression would be $b$ in each case.

Bias here refers to a global offset not explained by the predictor variable. Consider the equation of a line:

$$ y = mx + c $$

Here $m$ is the slope and $c$ is the intercept. If we omit the constant intercept $c$, then $m$, as well as explaining the relationship between $x$ and $y$, must also account for the overall difference in scale irrespective of the value of $x$.

To demonstrate, if we have a really simple linear model in R with a constant difference between the variables (a difference in scale), then ignoring the intercept causes us to incorrectly estimate the relationship between $x$ and $y$ (the slope).

x <- rnorm(100)
y <- (3*x) + 100
lm(y ~ x)
#> 
#> Call:
#> lm(formula = y ~ x)
#> 
#> Coefficients:
#> (Intercept)            x  
#>         100            3
lm(y ~ 0 + x)
#> 
#> Call:
#> lm(formula = y ~ 0 + x)
#> 
#> Coefficients:
#>      x  
#> -5.505
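
The reason the slope is so far off can be worked out from the standard least-squares formula for a model without an intercept, using the same $m$, $x$ and $y$ as above. Writing $\hat{m}$ for the estimated slope (notation introduced here) and substituting $y_i = 3x_i + 100$:

$$ \hat{m} = \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{\sum_i x_i (3x_i + 100)}{\sum_i x_i^2} = 3 + 100 \, \frac{\sum_i x_i}{\sum_i x_i^2} $$

So the estimate equals the true slope of 3 only if the sampled $x$ values sum to exactly zero. Because `x` is drawn at random, the second term, and hence the printed coefficient, will differ from run to run.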
