Solved – How to simulate data to be statistically significant

machine learning, python, simulation, statistical significance, t-test

I am in 10th grade and I am looking to simulate data for a machine learning science fair project. The final model will be used on patient data and will predict the relationship between certain times of the week and medication adherence within the data of a single patient. Adherence values will be binary (0 means they did not take the medicine, 1 means they did). I am looking to create a machine learning model that can learn from the relationship between the time of week and adherence. I have separated the week into 21 time slots, three for each day (1 is Monday morning, 2 is Monday afternoon, etc.). I am looking to simulate 1,000 patients' worth of data. Each patient will have 30 weeks' worth of data. I want to insert certain trends associated with a time of week and adherence. For example, in one data set I may say that time slot 7 of the week has a statistically significant relationship with adherence. To determine whether the relationship is statistically significant, I would need to perform a two-sample t-test comparing one time slot to each of the others and make sure the significance value is less than 0.05.

However, rather than simulating my own data and checking whether the trends I inserted are significant or not, I would rather work backwards and perhaps use a program that I could ask to assign a certain time slot a significant trend with adherence, and it would return binary data that contains within it the trend I asked for, and also binary data for the other time slots which contains some noise but does not produce a statistically significant trend.

Is there any program that can help me achieve something like this? Or maybe a Python module?

Any help whatsoever (even general comments on my project) will be extremely appreciated!!

Best Answer

General Comments

  • "I am in 10th grade and I am looking to simulate data for a machine learning science fair project." Awesome. I did not care at all about math in 10th grade; I think I took something like Algebra 2 that year...? I can't wait until you put me out of a job in a few years! I give some advice below, but: What are you trying to learn from this simulation? What are you already familiar with in statistics and machine learning? Knowing this would help me (and others) put together some more specific help.

  • Python is a very useful language, but I'm of the opinion that R is better for simulating data. Most of the books/blogs/studies/classes I've come across on simulating data (also what people call "Monte Carlo methods" to be fancy) are in R. The R language is known as being "by statisticians, for statisticians," and most academics who rely on simulation studies to show that their methods work use R. A lot of cool functions are in the base R language (that is, no additional packages necessary), such as rnorm for the normal distribution, runif for the uniform distribution, rbeta for the beta distribution, and so on. In R, typing in ?Distributions will show you a help page on them. There are also many other cool packages, like mvtnorm or simstudy, that are useful. I would recommend DataCamp.com for learning R if you only know Python; I think they are good for getting gently introduced to things.

  • It seems like you have a lot going on here: You want data that are over time (longitudinal), within-subject (maybe using a multilevel model), and have a seasonal component to them (perhaps a time series model), all predicting a dichotomous outcome (something like a logistic regression). I think a lot of people starting out with simulation studies (including myself) want to throw a bunch of stuff in at once, but this can be really daunting and complicated. So what I would recommend doing is starting with something simple (maybe making a function or two for generating data) and then building up from there.
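To make the base R functions mentioned above a bit more concrete, here is a minimal sketch that draws a few values from each distribution, plus rbinom, which is the one most relevant to a 0/1 outcome. The seed and all parameter values are arbitrary choices for illustration, not anything taken from the question.

```r
set.seed(1) # arbitrary seed so the draws can be reproduced

n_draws <- 5 # number of values to draw from each distribution

normal_draws  <- rnorm(n_draws, mean = 0, sd = 1)       # normal distribution
uniform_draws <- runif(n_draws, min = 0, max = 1)       # uniform distribution
beta_draws    <- rbeta(n_draws, shape1 = 2, shape2 = 5) # beta distribution
binary_draws  <- rbinom(n_draws, size = 1, prob = 0.6)  # 0/1 outcomes

print(normal_draws)
print(uniform_draws)
print(beta_draws)
print(binary_draws)
```

Each function follows the same pattern: the first argument is how many values to draw, and the rest describe the distribution.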

Specific Comments

It looks like your basic hypothesis is: "The time of the day predicts whether or not someone adheres to taking their medication." And you would like to create two simulated data sets: one where there is a relationship and one where there is not.

You also mention simulating data to represent multiple observations from the same person. This means that each person would have their own probability of adherence as well as, perhaps, their own slope for the relationship between time of day and probability of adhering. I would suggest looking into "multilevel" or "hierarchical" regression models for this type of relationship, but I think you could start simpler than this.
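One way to picture "each person has their own probability" is to give every simulated patient a personal baseline on the log-odds scale and then add an effect for one chosen time slot. This is only a rough sketch: the function name simulate_adherence, its arguments, and all of the numeric values (a 0.6 baseline, a patient spread of 0.5, slot 7, an effect of 1 on the log-odds scale) are assumptions made up for illustration.

```r
# Hypothetical helper: simulate binary adherence for patients x weeks x slots.
simulate_adherence <- function(n_patients = 1000, n_weeks = 30, n_slots = 21,
                               base_prob = 0.6, patient_sd = 0.5,
                               special_slot = 7, slot_effect = 1) {
  # one row per combination of patient, week, and time slot
  d <- expand.grid(slot = seq_len(n_slots),
                   week = seq_len(n_weeks),
                   patient = seq_len(n_patients))

  # each patient gets their own shift from the baseline, in log-odds
  patient_shift <- rnorm(n_patients, mean = 0, sd = patient_sd)

  # log-odds = overall baseline + patient shift + effect of the special slot
  log_odds <- qlogis(base_prob) +
    patient_shift[d$patient] +
    slot_effect * (d$slot == special_slot)

  # turn log-odds into probabilities, then draw the 0/1 outcomes
  d$adhere <- rbinom(nrow(d), size = 1, prob = plogis(log_odds))
  d
}

set.seed(1839)
sim <- simulate_adherence(n_patients = 100) # smaller run to keep it quick

# observed adherence rate per slot; the special slot should stand out
round(tapply(sim$adhere, sim$slot, mean), 2)
```

Setting slot_effect = 0 in this sketch would give a data set where no time slot has a built-in trend, only noise.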

Also, you mention a continuous relationship between time and probability of adhering to the medication regimen, which also makes me think time series modeling—specifically looking at seasonal trends—would be helpful for you. This is also simulate-able, but again, I think we can start simpler.

Let's say we have 1000 people, and we measure whether or not they took their medicine only once. We also know if they were assigned to take it in the morning, afternoon, or evening. Let's say that taking the medicine is 1, not taking it is 0. We can simulate dichotomous data using rbinom for draws from a binomial distribution. We can set each person to have 1 observation with a given probability. Let's say people are 80% likely to take it in the morning, 50% in the afternoon, and 65% in the evening. I paste the code below, with some comments after #:

set.seed(1839) # this makes sure the results are replicable when you do it
n <- 1000 # sample size is 1000
times <- c("morning", "afternoon", "evening") # create a vector of times
time <- sample(times, n, TRUE) # create our time variable

# make adherence probabilities based on time
adhere_prob <- ifelse(
  time == "morning", .80, 
  ifelse(
    time == "afternoon", .50, .65
  )
)

# simulate observations from binomial distribution with those probabilities
adhere <- rbinom(n, 1, adhere_prob)

# run a logistic regression, predicting adherence from time
model <- glm(adhere ~ time, family = binomial)
summary(model)

This summary shows, in part:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.02882    0.10738   0.268  0.78839    
timeevening  0.45350    0.15779   2.874  0.00405 ** 
timemorning  1.39891    0.17494   7.996 1.28e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The Intercept represents the afternoon (R picks the alphabetically first level, "afternoon", as the reference group), and the other two rows compare the evening and the morning against it. Both the evening and the morning show a significantly higher probability of adhering than the afternoon. There are a lot of details about logistic regression that I can't explain in this post, but the short reason I use it here instead of a t-test is that t-tests assume that you have a conditionally normally-distributed dependent variable. Logistic regression models are more appropriate when you have dichotomous (0 vs. 1) outcomes like these. Most introductory statistics books will talk about the t-test, and a lot of introductory machine learning books will talk about logistic regression. I think An Introduction to Statistical Learning: With Applications in R is great, and the authors posted the whole thing online: https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
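The estimates in the coefficient table are on the log-odds scale, which is hard to read directly. Here is a quick sketch of how they can be turned back into probabilities with plogis, using the values copied from the output above (the variable names are just for illustration):

```r
# coefficients copied from the summary output above
intercept <- 0.02882 # afternoon (the reference group)
b_evening <- 0.45350 # evening compared with afternoon
b_morning <- 1.39891 # morning compared with afternoon

# plogis() is the inverse of the log-odds (logit) transformation
estimated_prob <- c(
  afternoon = plogis(intercept),
  evening   = plogis(intercept + b_evening),
  morning   = plogis(intercept + b_morning)
)

round(estimated_prob, 2)
```

The results come out at roughly 0.51, 0.62, and 0.81, which is close to the 50%, 65%, and 80% that were used to simulate the data.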

I'm not as sure about good books for simulation studies; I learned just from messing around, reading what other people did, and from a graduate course I took on statistical computing (professor's materials are here: http://pj.freefaculty.org/guides/).

Lastly, you can also simulate having no effect by setting all of the times to have the same probability:

set.seed(1839)
n <- 1000
times <- c("morning", "afternoon", "evening")
time <- sample(times, n, TRUE)
adhere <- rbinom(n, 1, .6) # same for all times
summary(glm(adhere ~ time, binomial))

Which returns:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.40306    0.10955   3.679 0.000234 ***
timeevening -0.06551    0.15806  -0.414 0.678535    
timemorning  0.18472    0.15800   1.169 0.242360    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This shows no significant differences between the times, as we would expect from the probability being the same across times.
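A single run can still come out "significant" by luck, so it can help to repeat the no-effect simulation many times and count how often that happens. The sketch below does this with an overall likelihood-ratio test of the time variable, which is a swapped-in choice for getting one p-value per data set rather than something from the code above. The helper name one_null_pvalue and the number of repetitions are assumptions for illustration. With a 0.05 cutoff, the share of "significant" runs should land somewhere near 5%.

```r
# Hypothetical helper: simulate one no-effect data set and return the
# p-value for the overall effect of time on adherence.
one_null_pvalue <- function(n = 1000, prob = 0.6) {
  times <- c("morning", "afternoon", "evening")
  time <- sample(times, n, replace = TRUE)
  adhere <- rbinom(n, 1, prob) # same probability for every time

  fit <- glm(adhere ~ time, family = binomial)

  # likelihood-ratio test: compare the model against one with no predictors
  lr_stat <- fit$null.deviance - fit$deviance
  lr_df <- fit$df.null - fit$df.residual
  pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
}

set.seed(1839)
p_values <- replicate(200, one_null_pvalue())

# proportion of simulated data sets that look "significant" by chance
mean(p_values < 0.05)
```

The same loop, run on data simulated with a real effect, would instead estimate how often that effect is detected.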
