Solved – What does it mean for the training data to be generated by a probability distribution over datasets

dataset, distributions, neural networks

I was reading the Deep Learning book and came across the following paragraph (page 109, second paragraph):

The training and test data are generated by a probability distribution over datasets called the data-generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions. These assumptions are that the examples in each dataset are independent from each other and that the training set and test set are identically distributed, drawn from the same probability distribution as each other. This assumption enables us to describe the data-generating process with a probability distribution over a single example. The same distribution is then used to generate every train example and every test example. We call that shared underlying distribution the data-generating distribution, denoted $p_{\text{data}}$. This probabilistic framework and the i.i.d. assumptions enable us to mathematically study the relationship between training error and test error.

Can somebody please explain to me the meaning of this paragraph?

On page 122 the last paragraph, it also gives an example

a set of samples $\{x^{(1)}, \dots, x^{(m)}\}$ that are independently and identically distributed according to a Bernoulli distribution with mean $\theta$.

What does this mean?

Here are a few more specific questions.

  1. The probability distribution over datasets: What are the datasets? How is the probability distribution generated?

  2. The examples are independent of each other. Can you give me an example of where the examples are dependent?

  3. Drawn from the same probability distribution as each other. Suppose the probability distribution is Gaussian. Does the term "Same probability distribution" mean that all the examples are drawn from a Gaussian distribution with the same mean and variance?

  4. "This assumption enables us". What does this mean?

  5. Finally, for the last paragraph of page 122, it is given that the samples follow Bernoulli distribution. What does this mean intuitively?

Best Answer

  1. The probability distribution over datasets: What are the datasets? How is the probability distribution generated?

Once we can estimate the underlying distribution of the input data, we essentially know how the examples are picked and can make good predictions (this is the idea behind a generative model). Normally, we assume an underlying distribution according to what we believe about the data (an inductive bias). For example, if we believe there is a high probability that values are close to zero, we can assume a Gaussian distribution with mean $0$ and tune parameters such as the variance during training. A dataset is, for example, the set of outcomes of a series of coin tosses, and the distribution assumed would be binomial. When we maximize the log-likelihood over the actual data points, we obtain the parameters that make the assumed distribution best fit the dataset.
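Here is a minimal NumPy sketch of that last step: we pretend the data came from a Gaussian whose parameters we do not know, then recover them by maximum likelihood. The particular mean, standard deviation, and sample size below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating distribution: a Gaussian.
# Pretend we do not know its parameters and must estimate them.
true_mean, true_std = 0.0, 2.0
data = rng.normal(true_mean, true_std, size=1000)  # one dataset of 1000 i.i.d. examples

# For a Gaussian, the maximum-likelihood estimates are the sample mean
# and the (biased, ddof=0) sample standard deviation.
mle_mean = data.mean()
mle_std = data.std()
print(f"MLE mean: {mle_mean:.3f}, MLE std: {mle_std:.3f}")
```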

  2. The examples are independent of each other. Can you give me an example of where the examples are dependent?

For example, suppose we toss a coin, and if we get a head we toss again; otherwise we stop. Here whether a toss happens at all depends on the outcomes of the earlier tosses, so there is a dependence between subsequent tosses.
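A small simulation of that dependent process might look like the following; the stopping rule and the cap on the number of tosses are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def dependent_tosses(max_tosses=10, p_head=0.5):
    """Toss a coin; keep tossing only while we see heads.

    Which examples end up in the dataset depends on earlier
    outcomes, so the collected examples are not independent draws.
    """
    tosses = []
    for _ in range(max_tosses):
        head = rng.random() < p_head
        tosses.append(int(head))
        if not head:  # a tail stops the process
            break
    return tosses

print(dependent_tosses())  # e.g. [1, 1, 0]
```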

  3. Drawn from the same probability distribution as each other. Suppose the probability distribution is Gaussian. Does the term "same probability distribution" mean that all the examples are drawn from a Gaussian distribution with the same mean and variance?

  4. "This assumption enables us". What does this mean?

Yes. Under the i.i.d. assumption, every example is drawn from a Gaussian with the same mean and the same variance. That is also what (4) is saying: once you have the probability distribution of a single example, you do not need the other examples to describe the data-generating process; every example, train or test, is just another independent draw from it.
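To make that concrete, here is a small sketch in which a train set and a test set are both drawn from one assumed shared Gaussian $p_{\text{data}}$ (the parameter values are invented for illustration); their empirical statistics agree up to sampling noise:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shared data-generating distribution p_data: Gaussian(mu=3, sigma=1.5).
mu, sigma = 3.0, 1.5
train = rng.normal(mu, sigma, size=5000)
test = rng.normal(mu, sigma, size=5000)

# Because both sets are i.i.d. draws from the same p_data,
# their empirical statistics are close.
print(f"train mean/std: {train.mean():.2f} / {train.std():.2f}")
print(f"test  mean/std: {test.mean():.2f} / {test.std():.2f}")
```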

  5. Finally, for the last paragraph of page 122, it is given that the samples follow a Bernoulli distribution. What does this mean intuitively?

It means that each example can be thought of as a coin toss. If the experiment were multiple tosses of a fair coin, each toss would be independent with probability $\frac{1}{2}$ of heads. Similarly, for any other experiment, the result of each example can be thought of as a (possibly biased) coin toss or, in the multi-valued case, a roll of an $n$-sided die.

Generating examples means finding the distribution closest to what we see in the training dataset. That is obtained by assuming a parametric family of distributions, maximizing the likelihood of the given dataset, and outputting the optimal parameters.
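For the Bernoulli case on page 122, that recipe reduces to a one-liner: the maximum-likelihood estimate of $\theta$ is the sample mean. A sketch, with an invented true $\theta$:

```python
import numpy as np

rng = np.random.default_rng(2)

theta_true = 0.7                                # unknown "coin bias" we want to recover
m = 10_000
samples = rng.binomial(1, theta_true, size=m)   # x^(1), ..., x^(m) ~ Bernoulli(theta)

# The Bernoulli log-likelihood is
#   sum_i [ x_i * log(theta) + (1 - x_i) * log(1 - theta) ];
# setting its derivative to zero gives the sample mean as the MLE.
theta_mle = samples.mean()
print(f"theta MLE: {theta_mle:.3f}  (true theta = {theta_true})")
```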
