In probability theory, statistical independence (which is not the same as causal independence) is defined as your property (3), but (1) follows as a consequence$\dagger$. The events $\mathcal{A}$ and $\mathcal{B}$ are said to be statistically independent if and only if:
$$\mathbb{P}(\mathcal{A} \cap \mathcal{B}) = \mathbb{P}(\mathcal{A}) \cdot \mathbb{P}(\mathcal{B}) .$$
If $\mathbb{P}(\mathcal{B}) > 0$ then it follows that:
$$\mathbb{P}(\mathcal{A} |\mathcal{B}) = \frac{\mathbb{P}(\mathcal{A} \cap \mathcal{B})}{\mathbb{P}(\mathcal{B})} = \frac{\mathbb{P}(\mathcal{A}) \cdot \mathbb{P}(\mathcal{B})}{\mathbb{P}(\mathcal{B})} = \mathbb{P}(\mathcal{A}) .$$
This means that statistical independence implies that the occurrence of one event does not affect the probability of the other. Another way of saying this is that the occurrence of one event should not change your beliefs about the other. The concept of statistical independence is generally extended from events to random variables in a way that allows analogous statements to be made for random variables, including continuous random variables (which assign zero probability to any particular outcome). The treatment of independence for random variables essentially applies the same definition to their distribution functions.
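As a concrete check of the definition, consider a single roll of a fair six-sided die (a made-up example, not from the original question). The sketch below verifies the factorisation $\mathbb{P}(\mathcal{A} \cap \mathcal{B}) = \mathbb{P}(\mathcal{A}) \cdot \mathbb{P}(\mathcal{B})$ and the resulting conditional probability identity with exact arithmetic:

```python
from fractions import Fraction

# Single roll of a fair six-sided die.
outcomes = set(range(1, 7))
A = {1, 2}       # event "roll is 1 or 2"
B = {2, 4, 6}    # event "roll is even"

def prob(event):
    """Probability of an event under the uniform distribution on outcomes."""
    return Fraction(len(event & outcomes), len(outcomes))

p_A, p_B = prob(A), prob(B)
p_AB = prob(A & B)  # intersection is {2}

print(p_A, p_B, p_AB)      # 1/3 1/2 1/6
print(p_AB == p_A * p_B)   # True -> A and B are statistically independent
print(p_AB / p_B == p_A)   # True -> P(A|B) = P(A), as derived above
```

Note that the events here share the outcome 2, so independence is not the same as being disjoint; it is purely a statement about the probabilities multiplying.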
It is crucial to understand that independence is a very strong property: if events are statistically independent then (by definition) we cannot learn about one from observing the other. For this reason, statistical models generally involve assumptions of conditional independence, given some underlying distribution or parameters. The exact conceptual framework depends on whether one is using Bayesian or classical methods. The former involves explicit dependence between observable values, while the latter involves a (complicated and subtle) implicit form of dependence. Understanding this issue properly requires some familiarity with classical versus Bayesian statistics.
Statistical models often assume that sequences of random variables are "independent and identically distributed (IID)". For example, you might have an observable sequence $X_1, X_2, X_3, ... \sim \text{IID N} (\mu, \sigma^2)$, which means that each observable random variable $X_i$ is normally distributed with mean $\mu$ and standard deviation $\sigma$. Each of the random variables in the sequence is "independent" of the others in the sense that its outcome does not change the stated distribution of the other values. In this kind of model we use the observed values of the sequence to estimate the parameters in the model, and we can then in turn predict unobserved values of the sequence. This necessarily involves using some observed values to learn about others.
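A minimal sketch of this kind of model, with made-up "true" parameters and using only the Python standard library, shows the estimation step in practice (in a real analysis $\mu$ and $\sigma$ would of course be unknown):

```python
import random
import statistics

random.seed(0)
mu, sigma = 10.0, 2.0  # true parameters (unknown in a real analysis)

# An observed IID N(mu, sigma^2) sample.
xs = [random.gauss(mu, sigma) for _ in range(10_000)]

# Standard estimates of the parameters from the observed values.
mu_hat = statistics.mean(xs)
sigma_hat = statistics.stdev(xs)  # sample standard deviation

# With a large IID sample the estimates should be close to the truth,
# and they are what we would use to predict further values of the sequence.
print(round(mu_hat, 2), round(sigma_hat, 2))
```

The point of the last comment is the one made above: predicting $X_{10001}$ via $\hat\mu$ and $\hat\sigma$ uses the observed values to learn about an unobserved one.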
Bayesian statistics: Everything is conceptually simple. Assume that $X_1, X_2, X_3, ... $ are conditionally IID given the parameters $\mu$ and $\sigma$, and treat those unknown parameters as random variables. Given any non-degenerate prior distribution for these parameters, the values in the observable sequence are (unconditionally) dependent, generally with positive correlation. Hence, it makes perfect sense that we use observed outcomes to predict later unobserved outcomes - they are conditionally independent, but unconditionally dependent.
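This unconditional dependence is easy to see by simulation. The sketch below assumes a hypothetical prior $\mu \sim \text{N}(0, 3^2)$ with $\sigma = 1$ known; the theoretical unconditional correlation between two observations is then $\text{Var}(\mu)/(\text{Var}(\mu) + \sigma^2) = 9/10$:

```python
import random
import statistics

random.seed(1)

# For each replication: draw mu from the prior, then draw X1, X2
# conditionally IID N(mu, 1) given that mu.
x1s, x2s = [], []
for _ in range(50_000):
    mu = random.gauss(0, 3)
    x1s.append(random.gauss(mu, 1))
    x2s.append(random.gauss(mu, 1))

def corr(xs, ys):
    """Sample correlation coefficient (computed manually for portability)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

# Theoretical unconditional correlation = 9 / (9 + 1) = 0.9
print(round(corr(x1s, x2s), 2))
```

Conditionally on $\mu$ the two draws are independent, yet marginally they are strongly positively correlated, which is exactly why an observed $X_1$ is informative about $X_2$.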
Classical statistics: This is quite complicated and subtle. Assume that $X_1, X_2, X_3, ... $ are IID given the parameters $\mu$ and $\sigma$, but treat those parameters as "unknown constants". Since the parameters are treated as constants, there is no clear difference between conditional and unconditional independence in this case. Nevertheless, we still use the observed values to estimate the parameters and make predictions of the unobserved values. Hence, we use the observed outcomes to predict later unobserved outcomes even though they are notionally "independent" of each other. This apparent incongruity is discussed in detail in O'Neill, B. (2009) Exchangeability, Correlation and Bayes' Effect. International Statistical Review 77(2), pp. 241 - 250.
Applying this to your student grades data, you would probably model something like this by assuming that `grade` is conditionally independent given `teacher_id`. You would use the data to make inferences about the grading distribution for each teacher (which would not be assumed to be the same) and this would allow you to make predictions about the unknown `grade` of another student. Because the `grade` variable is used in the inference, it will affect your predictions of any unknown `grade` variable for another student. Replacing `teacher_id` with `gender` does not change this; in either case you have a variable that you might use as a predictor of `grade`.
If you use Bayesian methods you will have an explicit assumption of conditional independence and a prior distribution for the teachers' grade distributions, and this leads to unconditional (predictive) dependence of grades, allowing you to rationally use one grade in your prediction of another. If you use classical statistics you will have an assumption of independence (based on parameters that are "unknown constants") and you will use classical statistical prediction methods that allow you to use one grade to predict another.
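As a rough sketch of the classical route (with entirely made-up grades and teacher identifiers), the point prediction for a new student of a given teacher could simply be that teacher's sample mean grade, which makes the dependence on the observed grades explicit:

```python
import statistics
from collections import defaultdict

# Hypothetical (teacher_id, grade) records -- values are invented.
records = [
    ("t1", 72), ("t1", 68), ("t1", 75),
    ("t2", 88), ("t2", 91), ("t2", 85),
]

grades_by_teacher = defaultdict(list)
for teacher, grade in records:
    grades_by_teacher[teacher].append(grade)

# Classical-style point prediction for a new student of each teacher:
# the sample mean of that teacher's observed grades.
predictions = {t: statistics.mean(gs) for t, gs in grades_by_teacher.items()}
print(predictions)
```

Each teacher gets a separate estimated grading distribution, so the observed grades of teacher `t1`'s students change the prediction for a new `t1` student, even though the grades are notionally "independent" given the unknown parameters.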
$\dagger$ There are some foundational presentations of probability theory that define independence via the conditional probability statement and then give the joint probability statement as a consequence. This is less common.
Unless there was some other clue as to the intended meaning, I'd interpret that as "is approximately distributed as".
It's fairly standard. Note that some of the other usual ways of indicating "approximation" by modifying a symbol don't really work with $\sim$.
Note that $\sim$ can be read as "is distributed as" and that adding the dot over a symbol at least sometimes indicates approximation -- compare $=$ with $\mathrel{\dot =}$.
So "$x \mathrel{\dot\sim} \mathcal N(0,1)$" could be read something like "$x$ is approximately distributed as standard normal". Personally, I don't mind the closer spacing of `\dot\sim` ($\dot\sim$) for that use.
Best Answer
Example: Say you have a group of men and women and know their handedness (left/right). It is like depicted in the table below $$\begin{array}{r|c|c | c} &\text{men}&\text{women} &\text{total}\\ \hline \text{left handed}&9&4&13\\\hline \text{right handed}&43&44&87\\\hline \text{total}&52&48&100 \end{array}$$
Say you pick a person at random from this group; then there is a $13\%$ probability that they are left handed. But if you know that the person is a woman, then the probability is $4/48 \approx 8.3 \%$.
To express this latter case (the probability of an event, given another event or condition), one uses the vertical bar symbol $\vert$.
$$P(X\vert Y) = \text{probability of event $X$ given/conditional on event $Y$}$$
So the expression involves both events $X$ and $Y$. But it is different from $P(X,Y)$, the probability that both $X$ and $Y$ happen.
The probability of left handedness given that a person is a woman is not equal to $4 \%$, the probability that someone is a woman and left handed.
The expression $X\vert Y$ occurs within the probability operator $P()$, but you should not read its contents as a single event: the bar denotes conditioning, not a new combined event.
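The distinction can be checked directly from the counts in the table above, using exact fractions:

```python
from fractions import Fraction

# Counts from the handedness table.
n_total = 100
n_woman = 48
n_left = 13
n_left_and_woman = 4

p_left = Fraction(n_left, n_total)                        # P(left) = 13/100
p_left_and_woman = Fraction(n_left_and_woman, n_total)    # P(left, woman)
p_left_given_woman = Fraction(n_left_and_woman, n_woman)  # P(left | woman)

print(p_left_and_woman)    # 1/25  (= 4%)  -- the joint probability
print(p_left_given_woman)  # 1/12  (~ 8.3%) -- the conditional probability

# The defining relationship P(left | woman) = P(left, woman) / P(woman):
print(p_left_given_woman == p_left_and_woman / Fraction(n_woman, n_total))  # True
```

The joint probability ($4/100$) and the conditional probability ($4/48$) differ precisely because the conditional probability renormalises by the probability of the conditioning event.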