As a preliminary matter, it is worth noting that "independence" is a very vague condition unless it comes with a clear specification of what is independent of what. Independence is a much broader concept, whereas stationarity of a time-series is a particular condition on the series that can be framed as a particular kind of independence. To answer your question I will show you how you can frame the condition of strict stationarity as an independence condition, and then I will discuss the intuition behind this condition.
Stationarity of a time-series can be framed as a type of independence: You have stated the definition of (strong) stationarity in your question, but I will reiterate it in my own notation:
Let $\boldsymbol{X} = \{ X_t | t \in \mathbb{Z} \}$ be a stochastic process. This process is said to be strongly stationary if for all time indices $t_1,...,t_k \in \mathbb{Z} \text{ }$ and all series values $x_{t_1}, ..., x_{t_k}$ we have:
$$\mathbb{P}(X_{t_1} \leqslant x_{t_1}, ..., X_{t_k} \leqslant x_{t_k}) = \mathbb{P}(X_{t_1+s} \leqslant x_{t_1}, ..., X_{t_k+s} \leqslant x_{t_k}) \quad \quad \text{for all }s \in \mathbb{Z}.$$
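(As a quick sanity check of my own, not part of the original question: an IID sequence with common marginal distribution function $F$ satisfies this condition trivially, since

$$\mathbb{P}(X_{t_1} \leqslant x_{t_1}, ..., X_{t_k} \leqslant x_{t_k}) = \prod_{i=1}^{k} F(x_{t_i}) = \mathbb{P}(X_{t_1+s} \leqslant x_{t_1}, ..., X_{t_k+s} \leqslant x_{t_k}),$$

because the product does not depend on the time indices at all.)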
For any integer random variable $S$ we can define the shifted process $\boldsymbol{X}^S = \{ X_{t-S} | t \in \mathbb{Z} \}$, which shifts the stochastic process $\boldsymbol{X}$ forwards by $S$ time units. Using this randomly shifted time-series, we can now frame the requirement of strong stationarity in terms of an independence condition.
Theorem: The process $\boldsymbol{X}$ is strongly stationary if and only if, for all integer random variables $S \text{ } \bot \text{ } \boldsymbol{X}$, we have $S \text{ } \bot \text{ } \boldsymbol{X}^S$.
Proof: To show equivalence of the conditions we will first show that strong stationarity implies the independence condition ($\implies$) and then we will show that strong stationarity is implied by the independence condition ($\impliedby$).
($\implies$) Assume that strong stationarity holds and let $S \text{ } \bot \text{ } \boldsymbol{X}$ be an arbitrary integer random variable that is independent of the original process. Then for all time indices $t_1,...,t_k \in \mathbb{Z} \text{ }$ and all series values $x_{t_1},...,x_{t_k}$ we have:
$$\begin{equation} \begin{aligned}
\mathbb{P}(X^S_{t_1} \leqslant x_{t_1}, ..., X^S_{t_k} \leqslant x_{t_k} | S=s)
&= \mathbb{P}(X_{t_1-s} \leqslant x_{t_1}, ..., X_{t_k-s} \leqslant x_{t_k} | S=s) \\[6pt]
&= \mathbb{P}(X_{t_1-s} \leqslant x_{t_1}, ..., X_{t_k-s} \leqslant x_{t_k}) \\[6pt]
&= \mathbb{P}(X_{t_1} \leqslant x_{t_1}, ..., X_{t_k} \leqslant x_{t_k}). \\[6pt]
\end{aligned} \end{equation}$$
Since the right-hand side of this equation does not depend on $s$, we have $S \text{ } \bot \text{ } \boldsymbol{X}^S$.
($\impliedby$) Let $S \text{ } \bot \text{ } \boldsymbol{X}$ be an integer random variable that is independent of the original process and has support on the whole set of integers (i.e., $\mathbb{P}(S=s)>0$ for all $s \in \mathbb{Z}$). Assume that the independence condition $S \text{ } \bot \text{ } \boldsymbol{X}^S$ holds. Then for all time indices $t_1,...,t_k \in \mathbb{Z} \text{ }$ and all series values $x_{t_1},...,x_{t_k}$ we have:
$$\begin{equation} \begin{aligned}
\mathbb{P}(X_{t_1} \leqslant x_{t_1}, ..., X_{t_k} \leqslant x_{t_k})
&= \mathbb{P}(X^S_{t_1+S} \leqslant x_{t_1}, ..., X^S_{t_k+S} \leqslant x_{t_k}) \\[6pt]
&= \mathbb{P}(X^S_{t_1+S} \leqslant x_{t_1}, ..., X^S_{t_k+S} \leqslant x_{t_k} | S=s) \\[6pt]
&= \mathbb{P}(X^S_{t_1+s} \leqslant x_{t_1}, ..., X^S_{t_k+s} \leqslant x_{t_k} | S=s) \\[6pt]
&= \mathbb{P}(X^S_{t_1+s} \leqslant x_{t_1}, ..., X^S_{t_k+s} \leqslant x_{t_k} | S=0) \\[6pt]
&= \mathbb{P}(X_{t_1+s} \leqslant x_{t_1}, ..., X_{t_k+s} \leqslant x_{t_k}). \\[6pt]
\end{aligned} \end{equation}$$
(The first equality holds by construction, since $X^S_{t+S} = X_t$ for every $t$; the introduction and final removal of the conditioning on $S$ use the independence $S \text{ } \bot \text{ } \boldsymbol{X}$; and the change of the conditioning value from $S=s$ to $S=0$ is the step that uses the independence assumption $S \text{ } \bot \text{ } \boldsymbol{X}^S$.) Since this equation holds for all $s \in \mathbb{Z}$ we have established the strong stationarity of the original process. $\blacksquare$
From this theorem (which I just made up, but I imagine there is probably something like it in books somewhere) we can see that the condition of strict stationarity can be framed as an independence condition. If you have a stochastic process $\boldsymbol{X}$ and a random time-shift $S$ that is independent of the process, then stationarity occurs when the shifted process is independent of the time-shift variable itself. In other words, the joint distribution of the values in the stochastic process is not affected by knowledge of how much the process was shifted.
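If you would like to see this condition in action, here is a minimal simulation sketch (my own illustration, not part of the theorem above). It uses an IID Gaussian process, which is strongly stationary, together with a random shift $S$ drawn independently of the process, and checks that the distribution of a fixed coordinate of the shifted process does not depend on the realised value of $S$:

```python
import numpy as np

rng = np.random.default_rng(0)

n_sims = 100_000   # number of independent realisations of (process, shift)
max_shift = 5      # S will be drawn uniformly from {0, 1, ..., max_shift}

# Simulate an IID Gaussian process on a finite window of time indices 0, ..., 2*max_shift,
# together with an integer shift S drawn independently of the process.
X = rng.normal(size=(n_sims, 2 * max_shift + 1))
S = rng.integers(0, max_shift + 1, size=n_sims)

# Shifted process: X^S_t = X_{t-S}; we examine the coordinate t = max_shift,
# so that t - S always stays inside the simulated window.
t = max_shift
X_shifted_t = X[np.arange(n_sims), t - S]

# Under the theorem, the conditional distribution of X^S_t given S = s
# should be the same for every value of s (here: standard normal).
for s in range(max_shift + 1):
    vals = X_shifted_t[S == s]
    print(f"S = {s}: n = {vals.size}, mean = {vals.mean():+.3f}, sd = {vals.std():.3f}")
```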
Your specific questions:
Should I test for stationarity or independence?
Independence of what? Stationarity is a type of independence, but if you have some other type of independence in mind, you will need to specify what that is. Whether you should test for particular types of independence depends on your interests in the problem, but it is generally useful to test for stationarity, since stationary time-series models have a number of well-known model forms that might be useful to you.
What is the difference between these two concepts in this context?
As you can see from the above, stationarity is a type of independence. The general concept of independence is broader than this, and encompasses any possible assertion of independence of two or more random variables. In the context of time-series analysis, it is common to deal with models that are stationary, or models that have time-based trends (either trend or drift terms) with an underlying stationary error series. In this case it is common to test whether trend or drift terms are present in the model.
What tests should I use in this situation?
All you have told us about your data is that it has a time variable, and you want to know if this variable is important. This isn't much to go on, but it sounds like you might want to formulate a time-series model for your data, and test to see if there is any trend or drift term that would lead to a general systematic change in values over time. This is effectively a test of stationarity against an alternative with a trend or drift in the model. To give more information we would need to know more about your data. (Perhaps in a new question?)
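Just as a rough starting point (and only a sketch, since we know nothing about the structure of your data), the statsmodels package in Python provides the augmented Dickey-Fuller and KPSS tests; the former takes a unit root (non-stationarity) as its null hypothesis and the latter takes (trend-)stationarity as its null hypothesis. The series `y` below is just a placeholder standing in for your own data:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

# Placeholder series: replace `y` with your own data, ordered by time.
rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=200))   # a random walk, i.e. a non-stationary example

# Augmented Dickey-Fuller test: null hypothesis = unit root (non-stationary).
adf_stat, adf_p, *_ = adfuller(y)
print(f"ADF statistic = {adf_stat:.3f}, p-value = {adf_p:.3f}")

# KPSS test: null hypothesis = stationarity around a trend, so the logic is reversed.
kpss_stat, kpss_p, *_ = kpss(y, regression="ct", nlags="auto")
print(f"KPSS statistic = {kpss_stat:.3f}, p-value = {kpss_p:.3f}")
```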
Statistics is concerned with phenomena that can be considered random. Even if you are studying a deterministic process, the measurement noise can make the observations random. We can simplify many problems by using simple models that treat all the unobserved factors as “random noise”. For example, the linear regression model
$$
\mathsf{height}_i = \alpha + \beta \,\mathsf{age}_i + \varepsilon_i
$$
does say that we model height as a function of age and treat whatever else could influence it as “random noise”. It does not say that we consider height to be completely “random”, meaning “chaotic”, “unpredictable”, etc. As another example, if you toss a coin, the outcome is deterministic and depends only on the laws of physics, but it is influenced by so many factors contributing to its chaotic nature that we may just as well treat it as a random process.
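To make the regression model above concrete, here is a minimal sketch of fitting it by ordinary least squares with statsmodels; the age and height values are entirely made up for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data purely for illustration: height (cm) as a noisy linear function of age (years).
rng = np.random.default_rng(42)
age = rng.uniform(2, 15, size=100)
height = 80 + 6 * age + rng.normal(scale=5, size=100)   # the epsilon_i "random noise" term

# Fit height_i = alpha + beta * age_i + eps_i by ordinary least squares.
X = sm.add_constant(age)          # column of ones for the intercept alpha
fit = sm.OLS(height, X).fit()
print(fit.params)                 # estimates of alpha (const) and beta (age slope)
print(fit.conf_int())             # 95% confidence intervals for both coefficients
```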
If you have a deterministic process and noiseless measurements of all the relevant data, you wouldn't need statistics for it. You would need other mathematics, for example, calculus, but not statistics. If you need to consider the noise and need to assume randomness, you do so. Nothing “arises” from probability distributions; they are only mathematical tools we use to model real-world phenomena.
In probability theory, statistical independence (which is not the same as causal independence) is defined as your property (3), but (1) follows as a consequence$\dagger$. The events $\mathcal{A}$ and $\mathcal{B}$ are said to be statistically independent if and only if:
$$\mathbb{P}(\mathcal{A} \cap \mathcal{B}) = \mathbb{P}(\mathcal{A}) \cdot \mathbb{P}(\mathcal{B}) .$$
If $\mathbb{P}(\mathcal{B}) > 0$ then it follows that:
$$\mathbb{P}(\mathcal{A} |\mathcal{B}) = \frac{\mathbb{P}(\mathcal{A} \cap \mathcal{B})}{\mathbb{P}(\mathcal{B})} = \frac{\mathbb{P}(\mathcal{A}) \cdot \mathbb{P}(\mathcal{B})}{\mathbb{P}(\mathcal{B})} = \mathbb{P}(\mathcal{A}) .$$
This means that statistical independence implies that the occurrence of one event does not affect the probability of the other. Another way of saying this is that the occurrence of one event should not change your beliefs about the other. The concept of statistical independence is generally extended from events to random variables in a way that allows analogous statements to be made for random variables, including continuous random variables (which have zero probability of any particular outcome). Treatment of independence for random variables basically involves the same definitions applied to distribution functions.
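As a small worked example of my own (not from the original question): take one roll of a fair die, let $\mathcal{A}$ be the event that the roll is even and $\mathcal{B}$ the event that the roll is at most four. Then

$$\mathbb{P}(\mathcal{A}) = \tfrac{1}{2}, \quad \mathbb{P}(\mathcal{B}) = \tfrac{2}{3}, \quad \mathbb{P}(\mathcal{A} \cap \mathcal{B}) = \mathbb{P}(\{2,4\}) = \tfrac{1}{3} = \mathbb{P}(\mathcal{A}) \cdot \mathbb{P}(\mathcal{B}),$$

so the two events are statistically independent, and indeed $\mathbb{P}(\mathcal{A}|\mathcal{B}) = \frac{1/3}{2/3} = \frac{1}{2} = \mathbb{P}(\mathcal{A})$: learning that the roll was at most four tells you nothing about whether it was even.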
It is crucial to understand that independence is a very strong property - if events are statistically independent then (by definition) we cannot learn about one from observing the other. For this reason, statistical models generally involve assumptions of conditional independence, given some underlying distribution or parameters. The exact conceptual framework depends on whether one is using Bayesian methods or classical methods. The former involves explicit dependence between observable values, while the latter involves a (complicated and subtle) implicit form of dependence. Understanding this issue properly requires a bit of understanding of classical versus Bayesian statistics.
Statistical models will often say they use an assumption that sequences of random variables are "independent and identically distributed (IID)". For example, you might have an observable sequence $X_1, X_2, X_3, ... \sim \text{IID N} (\mu, \sigma^2)$, which means that each observable random variable $X_i$ is normally distributed with mean $\mu$ and standard deviation $\sigma$. Each of the random variables in the sequence is "independent" of the others in the sense that its outcome does not change the stated distribution of the other values. In this kind of model we use the observed values of the sequence to estimate the parameters in the model, and we can then in turn predict unobserved values of the sequence. This necessarily involves using some observed values to learn about others.
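A minimal sketch of that workflow (with simulated data standing in for real observations, and deliberately ignoring the estimation error in the parameters) might look like this:

```python
import numpy as np
from scipy import stats

# Simulated "observed" IID normal sample (true mu = 10, sigma = 2 here, unknown in practice).
rng = np.random.default_rng(7)
x_obs = rng.normal(loc=10, scale=2, size=50)

# Estimate the parameters from the observed values.
mu_hat = x_obs.mean()
sigma_hat = x_obs.std(ddof=1)

# Crude 95% prediction interval for the next (unobserved) value,
# plugging in the estimates and ignoring their sampling error.
lo, hi = stats.norm.interval(0.95, loc=mu_hat, scale=sigma_hat)
print(f"mu_hat = {mu_hat:.2f}, sigma_hat = {sigma_hat:.2f}")
print(f"Approximate 95% prediction interval for the next value: ({lo:.2f}, {hi:.2f})")
```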
Bayesian statistics: Everything is conceptually simple. Assume that $X_1, X_2, X_3, ... $ are conditionally IID given the parameters $\mu$ and $\sigma$, and treat those unknown parameters as random variables. Given any non-degenerate prior distribution for these parameters, the values in the observable sequence are (unconditionally) dependent, generally with positive correlation. Hence, it makes perfect sense that we use observed outcomes to predict later unobserved outcomes - they are conditionally independent, but unconditionally dependent.
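To see where that positive correlation comes from, here is a standard little calculation of my own (treating $\sigma$ as known for simplicity): write $X_i = \mu + \varepsilon_i$ with $\varepsilon_i \sim \text{IID N}(0, \sigma^2)$ independent of $\mu$, and give $\mu$ a prior with variance $\tau^2 > 0$. Then for $i \neq j$ we have

$$\text{Cov}(X_i, X_j) = \text{Cov}(\mu + \varepsilon_i, \mu + \varepsilon_j) = \text{Var}(\mu) = \tau^2 > 0,$$

so the observations are conditionally independent given $\mu$ but unconditionally positively correlated, which is exactly what allows an observed value to inform predictions about an unobserved one.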
Classical statistics: This is quite complicated and subtle. Assume that $X_1, X_2, X_3, ... $ are IID given the parameters $\mu$ and $\sigma$, but treat those parameters as "unknown constants". Since the parameters are treated as constants, there is no clear difference between conditional and unconditional independence in this case. Nevertheless, we still use the observed values to estimate the parameters and make predictions of the unobserved values. Hence, we use the observed outcomes to predict later unobserved outcomes even though they are notionally "independent" of each other. This apparent incongruity is discussed in detail in O'Neill, B. (2009) Exchangeability, Correlation and Bayes' Effect. International Statistical Review 77(2), pp. 241 - 250.
Applying this to your student grades data, you would probably model something like this by assuming that `grade` is conditionally independent given `teacher_id`. You would use the data to make inferences about the grading distribution for each teacher (which would not be assumed to be the same) and this would allow you to make predictions about the unknown `grade` of another student. Because the `grade` variable is used in the inference, it will affect your predictions of any unknown `grade` variable for another student. Replacing `teacher_id` with `gender` does not change this; in either case you have a variable that you might use as a predictor of `grade`.

If you use Bayesian methods you will have an explicit assumption of conditional independence and a prior distribution for the teachers' grade distributions, and this leads to unconditional (predictive) dependence of grades, allowing you to rationally use one grade in your prediction of another. If you are using classical statistics you will have an assumption of independence (based on parameters that are "unknown constants") and you will use classical statistical prediction methods that allow you to use one grade to predict another.
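As a very rough sketch of the classical version of this (with hypothetical column names `grade` and `teacher_id` and made-up numbers), you might estimate each teacher's grading distribution from the observed grades and then use those estimates to predict a new student's grade:

```python
import pandas as pd

# Hypothetical data; in practice this would be your observed student records.
df = pd.DataFrame({
    "teacher_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "grade":      [72, 80, 75, 60, 65, 58, 90, 85],
})

# Estimate each teacher's grading distribution (here just mean and sd) from observed grades.
by_teacher = df.groupby("teacher_id")["grade"].agg(["mean", "std"])
print(by_teacher)

# Point prediction for a new, unobserved student of teacher 2:
# the observed grades of that teacher's other students drive this prediction.
print("Predicted grade for a new student of teacher 2:", by_teacher.loc[2, "mean"])
```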
$\dagger$ There are some foundational presentations of probability theory that define independence via the conditional probability statement and then give the joint probability statement as a consequence. This is less common.