edit – more information about what the code given should represent
The following pseudocode outlines the problem as I have it:
for each random seed s in S
    randomise the data
    for k in 1 to 5
        create test / training data
        fit the model to the training data
        generate a score
Therefore I will have $S \times 5$ individual accuracy scores. My final score is the
average of these, and I would like to know its standard deviation.
original post
The following code represents my problem:
# S is the total number of random seeds to use
S = 3
# the size of each category, so the original data will have 2n rows
n = 100
# number of "folds" to use
K = 5

# sample data
set.seed(2019)
original_data = data.frame(
  x = c(rnorm(n, 0.457, 0.01), rnorm(n, 0.508, 0.11)),
  y = c(rep(0, n), rep(1, n))
)

# data frame to store the results
results = NULL
iteration = 1

for(s in 1:S){
  set.seed(s)
  rnd = sample(1:(2*n))
  # get randomised data
  td = original_data[rnd,]
  for(k in 1:K){
    # get test and training data
    # (note: the same 140/60 split is reused for every k)
    trainset = td[1:140,]
    testset = td[-(1:140),]
    # fit model
    m = glm(y ~ x, data = trainset, family = "binomial")
    # get probabilities and predicted values
    model_probabilities = predict(m, newdata = testset, type = "response")
    model_predictions = 1 * (model_probabilities >= 0.5)
    # store results
    results = rbind(results, data.frame(
      seed = s, k = k, iteration = iteration,
      probability = model_probabilities,
      prediction = model_predictions,
      observed = testset$y
    ))
    iteration = iteration + 1
  }
}

# table of predicted and observed values
t = table(results$prediction, results$observed)
# convert into percentages
t = 100 * round(prop.table(t), 3)
# accuracy = sum of the diagonal (both classes predicted correctly)
accuracy = t[1,1] + t[2,2]
accuracy
With the output:
> accuracy
[1] 51.1
> dim(results)
[1] 900 6
I want to know how to calculate the standard deviation for this accuracy measure.
edit – choice of $n$
Still interested in an answer to this question; I'm not sure whether additional information is required.
Initially I thought that I should just use
$$
\sqrt{
\frac{p(1-p)}{n}
}
$$
where $n$ is the number of rows in the test set.
This doesn't seem to take into account that the accuracy score is averaged across many iterations, and I can't find literature covering this case.
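For reference, the naive formula above is easy to evaluate directly; here is a small sketch, where `p` is taken from the reported accuracy (51.1%, as a proportion) and `n_test` is the 60-row test set used in the code above:

```r
# naive binomial standard error for a single test set;
# this treats all predictions as one sample and ignores
# the averaging over seeds and folds
p = 0.511       # reported accuracy, as a proportion
n_test = 60     # rows in one test set
sqrt(p * (1 - p) / n_test)
```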
Best Answer
Your procedure is overly complicated; just use the bootstrap. With the bootstrap you would randomly sample, with replacement, $n$ observations out of your dataset of size $n$. At each iteration you would repeat the whole procedure, including fitting your model, making predictions, and calculating the accuracy. You would repeat this many times (hundreds or more) and then simply calculate the standard deviation of the estimated accuracies.
If you used samples smaller than $n$, the estimate would not reflect the actual variability of the data; it would overestimate the standard deviation (smaller samples vary more). If you use a small number of iterations of the algorithm, your estimate of the standard deviation will itself not be precise.
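A minimal sketch of this procedure, assuming the `original_data` frame from the question (columns `x` and `y`, 200 rows); the replicate count `B` and the choice to evaluate each fit on its out-of-bag rows (those not drawn into the bootstrap sample) are illustrative choices, not part of the original answer:

```r
set.seed(2019)
B = 500                          # number of bootstrap replicates
n_total = nrow(original_data)

boot_accuracy = replicate(B, {
  # resample rows with replacement, same size as the original data
  idx = sample(n_total, n_total, replace = TRUE)
  boot_data = original_data[idx, ]
  # refit the model on the bootstrap sample
  m = glm(y ~ x, data = boot_data, family = "binomial")
  # evaluate on the out-of-bag rows
  oob = original_data[-unique(idx), ]
  pred = 1 * (predict(m, newdata = oob, type = "response") >= 0.5)
  mean(pred == oob$y)
})

mean(boot_accuracy)   # point estimate of the accuracy
sd(boot_accuracy)     # its standard deviation
```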