It is not a convention, but quite often $\theta$ stands for the set of parameters of a distribution.
So much for plain English; let's look at some examples instead.
Example 1. You want to study the throw of an old fashioned thumbtack (the ones with a big circular bottom). You assume that the probability that it falls point down is an unknown value that you call $\theta$. You could define a random variable $X$ and say that $X=1$ when the thumbtack falls point down and $X=0$ when it falls point up. You would write the model
$$P(X = 1) = \theta \\
P(X = 0) = 1-\theta,$$
and you would be interested in estimating $\theta$ (here, the probability that the thumbtack falls point down).
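To make this concrete, here is a minimal simulation sketch. The true value of $\theta$ and the number of throws are assumptions made up for the illustration; in practice $\theta$ is unknown, and the maximum-likelihood estimate is simply the sample proportion of point-down throws.

```python
import random

random.seed(42)

theta_true = 0.35   # unknown in practice; assumed here just to simulate data
n = 10_000

# Simulate n throws: X = 1 (point down) with probability theta_true.
throws = [1 if random.random() < theta_true else 0 for _ in range(n)]

# The maximum-likelihood estimate of theta is the sample proportion of 1s.
theta_hat = sum(throws) / n
print(theta_hat)    # close to 0.35 for large n
```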
Example 2. You want to study the disintegration of a radioactive atom. Based on the literature, you know that the amount of radioactivity decreases exponentially, so you decide to model the time to disintegration with an exponential distribution. If $t$ is the time to disintegration, the model is
$$f(t) = \theta e^{-\theta t}.$$
Here $f(t)$ is a probability density, which means that the probability that the atom disintegrates in the time interval $(t, t+dt)$ is $f(t)dt$. Again, you will be interested in estimating $\theta$ (here, the disintegration rate).
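Again as a sketch (the true rate and sample size below are illustrative assumptions): for the exponential model $f(t) = \theta e^{-\theta t}$, the maximum-likelihood estimate of $\theta$ from observed times $t_1, \dots, t_n$ is $\hat\theta = n / \sum_i t_i$, the reciprocal of the sample mean.

```python
import random

random.seed(0)

theta_true = 2.0   # disintegration rate; assumed here just to simulate data
n = 50_000

# Simulate n disintegration times from f(t) = theta * exp(-theta * t).
times = [random.expovariate(theta_true) for _ in range(n)]

# Maximum-likelihood estimate: the reciprocal of the sample mean.
theta_hat = n / sum(times)
print(theta_hat)   # close to 2.0 for large n
```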
Example 3. You want to study the precision of a weighing instrument. Based on the literature, you know that the measurements are Gaussian, so you decide to model the weighing of a standard 1 kg object as
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp \left\{ -\frac{1}{2}\left( \frac{x-\mu}{\sigma} \right)^2\right\}.$$
Here $x$ is the measurement given by the scale, $f(x)$ is the probability density, and the parameters are $\mu$ and $\sigma$, so $\theta = (\mu, \sigma)$. The parameter $\mu$ is the target weight (the scale is biased if $\mu \neq 1$), and $\sigma$ is the standard deviation of the measurement each time you weigh the object. Again, you will be interested in estimating $\theta$ (here, the bias and the imprecision of the scale).
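A sketch of this last example, with a hypothetical bias and imprecision chosen just to generate data: the maximum-likelihood estimates of $\mu$ and $\sigma$ are the sample mean and the (biased) sample standard deviation.

```python
import math
import random

random.seed(1)

mu_true, sigma_true = 1.002, 0.015   # hypothetical bias and imprecision (kg)
n = 10_000

# Simulate n weighings of the 1 kg standard.
weights = [random.gauss(mu_true, sigma_true) for _ in range(n)]

# Maximum-likelihood estimates: sample mean and (biased) sample std dev.
mu_hat = sum(weights) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in weights) / n)
print(mu_hat, sigma_hat)   # close to (1.002, 0.015)
```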
This question goes to the heart of what statistics is and how to conduct a good statistical analysis. It raises many issues, some of terminology and others of theory. To clarify them, let's begin by noting the implicit context of the question and go on from there to define the key terms "parameter," "property," and "estimator." The several parts of the question are answered as they come up in the discussion. The final concluding section summarizes the key ideas.
State spaces
A common statistical use of "the distribution," as in "the Normal distribution with PDF proportional to $\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)dx$" is actually a (serious) abuse of English, because obviously this is not one distribution: it's a whole family of distributions parameterized by the symbols $\mu$ and $\sigma$. A standard notation for this is the "state space" $\Omega$, a set of distributions. (I am simplifying a bit here for the sake of exposition and will continue to simplify as we go along, while remaining as rigorous as possible.) Its role is to delineate the possible targets of our statistical procedures: when we estimate something, we are picking out one (or sometimes more) elements of $\Omega$.
Sometimes state spaces are explicitly parameterized, as in $\Omega = \{\mathcal{N}(\mu, \sigma^2)|\mu \in \mathbb{R}, \sigma \gt 0\}$. In this description there is a one-to-one correspondence between the set of tuples $\{(\mu,\sigma)\}$ in the upper half plane and the set of distributions we will be using to model our data. One value of such a parameterization is that we may now refer concretely to distributions in $\Omega$ by means of an ordered pair of real numbers.
In other cases state spaces are not explicitly parameterized. An example would be the set of all unimodal continuous distributions. Below, we will address the question of whether an adequate parameterization can be found in such cases anyway.
Parameterizations
Generally, a parameterization of $\Omega$ is a correspondence (mathematical function) from a subset of $\mathbb{R}^d$ (with $d$ finite) to $\Omega$. That is, it uses ordered sets of $d$-tuples to label the distributions. But it's not just any correspondence: it has to be "well behaved." To understand this, consider the set of all continuous distributions whose PDFs have finite expectations. This would widely be regarded as "non-parametric" in the sense that any "natural" attempt to parameterize this set would involve a countable sequence of real numbers (using an expansion in any orthogonal basis). Nevertheless, because this set has the cardinality of the continuum, $2^{\aleph_0}$ (the cardinality of the reals), there must exist some one-to-one correspondence between these distributions and $\mathbb{R}$. Paradoxically, that would seem to make this a parameterized state space with a single real parameter!
The paradox is resolved by noting that a single real number cannot enjoy a "nice" relationship with the distributions: when we change the value of that number, the distribution it corresponds to must in some cases change in radical ways. We rule out such "pathological" parameterizations by requiring that distributions corresponding to close values of their parameters must themselves be "close" to one another. Discussing suitable definitions of "close" would take us too far afield, but I hope this description is enough to demonstrate that there is much more to being a parameter than just naming a particular distribution.
Properties of distributions
Through repeated application, we become accustomed to thinking of a "property" of a distribution as some intelligible quantity that frequently appears in our work, such as its expectation, variance, and so on. The problem with this as a possible definition of "property" is that it's too vague and not sufficiently general. (This is where mathematics was in the mid-18th century, when "functions" were thought of as finite processes applied to objects.) Instead, about the only sensible definition of "property" that will always work is to think of a property as being a number that is uniquely assigned to every distribution in $\Omega$. This includes the mean, the variance, any moment, any algebraic combination of moments, any quantile, and plenty more, including things that cannot even be computed. However, it does not include things that would make no sense for some of the elements of $\Omega$. For instance, if $\Omega$ consists of all Student t distributions, then the mean is not a valid property for $\Omega$ (because $t_1$ has no mean). This impresses on us once again how much our ideas depend on what $\Omega$ really consists of.
Properties are not always parameters
A property can be such a complicated function that it would not serve as a parameter. Consider the case of the "Normal distribution." We might want to know whether the true distribution's mean, when rounded to the nearest integer, is even. That's a property. But it will not serve as a parameter.
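The parity example can be made concrete. This sketch (the function name is mine, not anything standard) assigns each Normal distribution a number, 0 or 1, depending on whether its mean rounds to an even integer. It is a legitimate property by the definition above, but it jumps discontinuously as $\mu$ varies, so it could never be a well-behaved parameter.

```python
# A "property" in the sense above: a number uniquely assigned to each
# distribution in Omega. For the Normal family indexed by its mean mu,
# take the parity of mu rounded to the nearest integer. It is a valid
# property, but it changes discontinuously in mu, so it cannot serve
# as a parameter.

def parity_property(mu: float) -> int:
    """Return 1 if round(mu) is even, else 0."""
    return 1 if round(mu) % 2 == 0 else 0

print(parity_property(3.9))   # round -> 4, even -> 1
print(parity_property(4.6))   # round -> 5, odd  -> 0
```

Note how an arbitrarily small change in $\mu$ (say from 3.49 to 3.51) flips the value: exactly the "pathological" behavior that disqualifies it as a parameter.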
Parameters are not necessarily properties
When parameters and distributions are in one-to-one correspondence then obviously any parameter, and any function of the parameters for that matter, is a property according to our definition. But there need not be a one-to-one correspondence between parameters and distributions: sometimes a few distributions must be described by two or more distinctly different values of the parameters. For instance, a location parameter for points on the sphere would naturally use latitude and longitude. That's fine--except at the two poles, which correspond to a given latitude and any valid longitude. The location (point on the sphere) indeed is a property but its longitude is not necessarily a property. Although there are various dodges (just declare the longitude of a pole to be zero, for instance), this issue highlights the important conceptual difference between a property (which is uniquely associated with a distribution) and a parameter (which is a way of labeling the distribution and might not be unique).
Statistical procedures
The target of an estimate is called an estimand. It is merely a property. The statistician is not free to select the estimand: that is the province of her client. When someone comes to you with a sample of a population and asks you to estimate the population's 99th percentile, you would likely be remiss in supplying an estimator of the mean instead! Your job, as statistician, is to identify a good procedure for estimating the estimand you have been given. (Sometimes your job is to persuade your client that he has selected the wrong estimand for his scientific objectives, but that's a different issue...)
By definition, a procedure is a way to get a number out of the data. Procedures are usually given as formulas to be applied to the data, like "add them all up and divide by their count." Literally any procedure may be pronounced an "estimator" of a given estimand. For instance, I could declare that the sample mean (a formula applied to the data) estimates the population variance (a property of the population, assuming our client has restricted the set of possible populations $\Omega$ to include only those that actually have variances).
Estimators
An estimator needn't have any obvious connection to the estimand. For instance, do you see any connection between the sample mean and a population variance? Neither do I. But nevertheless, the sample mean actually is a decent estimator of the population variance for certain $\Omega$ (such as the set of all Poisson distributions). Herein lies one key to understanding estimators: their qualities depend on the set of possible states $\Omega$. But that's only part of it.
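The Poisson case can be checked directly. Since every Poisson distribution has mean equal to variance ($\lambda$), the sample mean really does estimate the population variance on this $\Omega$. A minimal sketch, using a stdlib-only Poisson sampler (Knuth's algorithm) and an illustrative $\lambda$:

```python
import math
import random

random.seed(7)

lam = 4.0      # Poisson rate: mean and variance both equal lam
n = 20_000

def poisson(rate: float) -> int:
    """One Poisson draw via Knuth's multiply-uniforms algorithm."""
    L = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

sample = [poisson(lam) for _ in range(n)]

# On Omega = {Poisson distributions}, the sample mean estimates the
# population variance, because mean and variance coincide for every
# member of Omega.
mean_hat = sum(sample) / n
print(mean_hat)   # close to 4.0, the true variance
```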
A competent statistician will want to know how well the procedure they are recommending will actually perform. Let's call the procedure "$t$" and let the estimand be $\theta$. Not knowing which distribution actually is the true one, she will contemplate the procedure's performance for every possible distribution $F \in \Omega$. Given such an $F$, and given any possible outcome $s$ (that is, a set of data), she will compare $t(s)$ (what her procedure estimates) to $\theta(F)$ (the value of the estimand for $F$). It is her client's responsibility to tell her how close or far apart those two are. (This is often done with a "loss" function.) She can then contemplate the expectation of the distance between $t(s)$ and $\theta(F)$. This is the risk of her procedure. Because it depends on $F$, the risk is a function defined on $\Omega$.
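The risk calculation can be sketched by Monte Carlo. Everything specific below is an illustrative assumption: $\Omega$ is the Poisson family, the estimand is $\theta(F) = \operatorname{Var}(F)$, the loss is squared error, and the two competing procedures are the sample mean and the usual sample variance. The code approximates the risk $R(F, t) = E_F\big[(t(s) - \theta(F))^2\big]$ at a few points of $\Omega$.

```python
import math
import random

random.seed(3)

def poisson(rate):
    """One Poisson draw via Knuth's multiply-uniforms algorithm."""
    L = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def risk(procedure, lam, n=30, reps=2000):
    """Monte Carlo estimate of E_F[(t(s) - theta(F))^2] for F = Poisson(lam),
    where the estimand theta(F) = Var(F) = lam."""
    total = 0.0
    for _ in range(reps):
        s = [poisson(lam) for _ in range(n)]
        total += (procedure(s) - lam) ** 2
    return total / reps

def t1(s):   # sample mean
    return sum(s) / len(s)

def t2(s):   # sample variance (Bessel-corrected)
    m = sum(s) / len(s)
    return sum((x - m) ** 2 for x in s) / (len(s) - 1)

for lam in (1.0, 4.0):
    print(lam, risk(t1, lam), risk(t2, lam))
```

On this $\Omega$ the sample mean has uniformly smaller risk than the sample variance for the variance estimand, illustrating how an estimator's quality depends on the state space, not on any superficial resemblance to the estimand.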
(Good) statisticians recommend procedures based on comparing risk. For instance, suppose that for every $F \in \Omega$, the risk of procedure $t_1$ is less than or equal to the risk of $t$, with strict inequality for at least one $F$. Then there is no reason ever to use $t$: it is "inadmissible." Otherwise it is "admissible".
(A "Bayesian" statistician will always compare risks by averaging over a "prior" distribution of possible states (usually supplied by the client). A "Frequentist" statistician might do this, if such a prior justifiably exists, but is also willing to compare risks in other ways Bayesians eschew.)
Conclusions
We have a right to say that any $t$ that is admissible for $\theta$ is an estimator of $\theta$. We must, for practical purposes (because admissible procedures can be hard to find), bend this to saying that any $t$ that has acceptably small risk (when being compared to $\theta$) among practicable procedures is an estimator of $\theta$. "Acceptably" and "practicable" are determined by the client, of course: "acceptably" refers to their risk and "practicable" reflects the cost (ultimately paid by them) of implementing the procedure.
Underlying this concise definition are all the ideas just discussed: to understand it we must have in mind a specific $\Omega$ (which is a model of the problem, process, or population under study), a definite estimand (supplied by the client), a specific loss function (which quantitatively connects $t$ to the estimand and is also given by the client), the idea of risk (computed by the statistician), some procedure for comparing risk functions (the responsibility of the statistician in consultation with the client), and a sense of what procedures actually can be carried out (the "practicability" issue), even though none of these are explicitly mentioned in the definition.
Best Answer
A “statistic” has the rather trivial definition of being a function of the data, and counting how many points there are is certainly a function of the data. So yes, sample size is a statistic.
A “parameter” is a knob you turn to get some distribution to behave a certain way. If you want a normal distribution centered at 7, turn $\mu$ up to 7. If you want it spread out a lot, turn $\sigma^2$ up to 81.
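The "knob" metaphor is easy to check by simulation: set $\mu = 7$ and $\sigma^2 = 81$ (so $\sigma = 9$), draw a large sample, and the sample mean and variance land near the knob settings.

```python
import random
import statistics

random.seed(0)

# "Turning the knobs": mu = 7 centers the distribution, and
# sigma^2 = 81 (sigma = 9) spreads it out.
sample = [random.gauss(7, 9) for _ in range(100_000)]

print(statistics.mean(sample))       # near 7
print(statistics.variance(sample))   # near 81
```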
“Population size” is a strange idea, and you can find differing opinions about whether it even exists. You may think that if you observed every person, then you’ve observed the population. Say you found that people now are taller than people 200 years ago, having measured everyone in 1819 and in 2019. If you then do a hypothesis test on their heights, you’re saying that what you’re interested in is the process that generates human heights, and the humans you observed are simply the ones who happened to be born.
I say that a parameter is some characteristic of a distribution (in the mathematical sense of being a CDF). A CDF describes a data-generating process, not a finite collection of individuals, so the population it implies is effectively infinite. Therefore, population size is $\infty$ and not a parameter.