You have to go back to see where your equation $(1)$ came from.
For one dimension the probability of particles having a velocity between $\vec v_{\rm x}$ and $\vec v_{\rm x}+ d\vec v_{\rm x}$ is given by
$$f(\vec v_{\rm x})\;d\vec v_{\rm x} =\sqrt{\frac{m}{ 2\pi kT}}\exp \left[\frac{-mv_{\rm x}^2}{2kT}\right]\,d\vec v_{\rm x}$$
The speed distribution is given by
$$f(v_{\rm x})\;dv_{\rm x} =2\,\sqrt{\frac{m}{ 2\pi kT}}\exp \left[\frac{-mv_{\rm x}^2}{2kT}\right]\,dv_{\rm x}=\sqrt{\frac{2m}{ \pi kT}}\exp \left[\frac{-mv_{\rm x}^2}{2kT}\right]\,dv_{\rm x}$$
the factor $2$ being there because the speed (magnitude) of $\vec v_{\rm x}$ is the same as that of $-\vec v_{\rm x}$.
This is in agreement with your equation $(3)$.
Equation $(1)$ came from the idea that in three dimensions there is no preferred direction so
$f(\vec v_{\rm x},\vec v_{\rm y},\vec v_{\rm z})\,d\vec v_{\rm x}d\vec v_{\rm y}d\vec v_{\rm z} = f(\vec v_{\rm x})d\vec v_{\rm x}\,f(\vec v_{\rm y})d\vec v_{\rm y}\,f(\vec v_{\rm z})d\vec v_{\rm z}$
You now have to count all the velocities whose speed is the same, i.e. whose magnitude $v$ satisfies $v^2 = v^2_{\rm x}+v^2_{\rm y}+v^2_{\rm z}$.
In this three-dimensional case the volume of a shell of radius $v$ and thickness $dv$, which is $d\vec v_{\rm x}\,d\vec v_{\rm y}\,d\vec v_{\rm z}= 4 \pi v^2\, dv$, is being considered, and this results in your equation $(1)$.
$$f(v) \,dv = \sqrt{\left(\frac{m}{2 \pi kT}\right)^3}\, 4\pi v^2 \exp \left[\frac{-mv^2}{2kT}\right] \,dv$$
The equivalent distribution for two dimensions, using the area of a ring of radius $v$ and thickness $dv$, which is $d\vec v_{\rm x}\,d\vec v_{\rm y}= 2 \pi v\, dv$ with $v^2 = v^2_{\rm x}+v^2_{\rm y}$, is
$$f(v) \,dv = \left(\frac{m}{2 \pi kT}\right)\, 2\pi v\, \exp \left[\frac{-mv^2}{2kT}\right] \,dv$$
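To make the three cases concrete, here is a short numerical sketch (the parameter values $m = kT = 1$ are an arbitrary choice): each Cartesian velocity component is drawn from the Gaussian above, and the empirical mean speed is compared with the mean of the corresponding 1D, 2D or 3D speed distribution.

```python
import numpy as np

# Sketch with arbitrary units m = kT = 1; each velocity component is Gaussian
# with variance kT/m, and the speed is the magnitude of the velocity vector.
m, kT = 1.0, 1.0
sigma = np.sqrt(kT / m)
rng = np.random.default_rng(0)
n_samples = 200_000

def speed_pdf(v, dim):
    """Maxwell speed distribution in dim = 1, 2 or 3 dimensions."""
    a = m / (2 * np.pi * kT)
    gauss = np.exp(-m * v**2 / (2 * kT))
    if dim == 1:
        return 2 * np.sqrt(a) * gauss
    if dim == 2:
        return a * 2 * np.pi * v * gauss
    return a**1.5 * 4 * np.pi * v**2 * gauss

for dim in (1, 2, 3):
    speeds = np.linalg.norm(rng.normal(0.0, sigma, size=(n_samples, dim)), axis=1)
    v = np.linspace(0.0, 6.0 * sigma, 4001)
    analytic_mean = np.sum(v * speed_pdf(v, dim)) * (v[1] - v[0])
    print(dim, speeds.mean(), analytic_mean)
```

The sampled means and the means computed from the three formulas agree, which is a direct check that the $2$, $2\pi v$ and $4\pi v^2$ factors are the right shell measures in each dimension.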
"As I recall from my studies of statistical mechanics, the Maxwell-Boltzmann distribution is a probability distribution for the speed of all particles in a system [...]"
The Maxwell-Boltzmann (MB) distribution $p(v)$ gives you the probability that a given particle has speed $v$. To be more precise, the integral $\int_{u}^{u'} p(v)\,dv$ gives you the probability that a given particle has a speed between $u$ and $u'$. In other words, it tells you, if you measure the speed of a random particle in the whole system, how likely it is that you find a value around $v$.
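As a quick numerical illustration (with arbitrary units $m = kT = 1$ and an arbitrarily chosen interval), the probability of finding a speed in $[u, u']$ is just the integral of $p(v)$ over that interval:

```python
import numpy as np

# Arbitrary units m = kT = 1; p(v) is the 3D Maxwell-Boltzmann speed density.
m, kT = 1.0, 1.0

def p(v):
    return (m / (2 * np.pi * kT))**1.5 * 4 * np.pi * v**2 * np.exp(-m * v**2 / (2 * kT))

u, u_prime = 0.5, 1.5                    # arbitrary speed interval
v = np.linspace(u, u_prime, 10_001)
prob = np.sum(p(v)) * (v[1] - v[0])      # simple Riemann sum over [u, u']
print(f"P({u} < v < {u_prime}) = {prob:.4f}")
```

Integrating over all speeds instead of $[u, u']$ gives $1$, as it must for a probability density.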
So, no contradiction here, and Wikipedia is right. You are not looking at some probability density $p(\mathbf v_1 \dots \mathbf v_N)$ for the whole system, but really at something that concerns single particles.
The results are well fitted by an equilibrium distribution probably because you are averaging over a time interval much larger than the time needed for the system to reach equilibrium. Take smaller and smaller time intervals (all of them starting at $t=0$), and you will start to see deviations from the MB. However, if you want to do this I suggest that you average over all the particles, because reducing the time window means that you are reducing your statistical sample.
It could actually be very instructive to plot $p(v,t)$ for some values of $t$ starting at $t=0$, because you will be able to see the system evolving from some non-equilibrium speed distribution to a MB, and you will also know exactly how long it takes to do so.
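Lacking the original simulation, here is a minimal sketch of that idea using a toy collision model (all parameters are arbitrary): particles start far from equilibrium with identical speeds, and random pair "collisions" that conserve momentum and kinetic energy drive the speed histogram towards the MB form.

```python
import numpy as np

# Toy relaxation model, not the asker's MD code: all particles start with
# speed 1 in random directions, then random pairs collide by re-orienting
# their relative velocity uniformly (momentum and energy are conserved).
rng = np.random.default_rng(1)
n = 2000
vel = rng.normal(size=(n, 3))
vel /= np.linalg.norm(vel, axis=1, keepdims=True)      # all speeds equal 1

def collide(vel, n_collisions):
    for _ in range(n_collisions):
        i, j = rng.choice(len(vel), size=2, replace=False)
        v_cm = 0.5 * (vel[i] + vel[j])                 # centre-of-mass velocity
        g = np.linalg.norm(vel[i] - vel[j])            # relative speed (conserved)
        d = rng.normal(size=3)
        d /= np.linalg.norm(d)                         # random scattering direction
        vel[i] = v_cm + 0.5 * g * d
        vel[j] = v_cm - 0.5 * g * d

# Snapshots of the speed distribution p(v, t), with "time" counted in
# collisions; each histogram could be plotted against the equilibrium MB curve.
snapshots, done = {}, 0
for t in (0, 500, 5000):
    collide(vel, t - done)
    done = t
    speeds = np.linalg.norm(vel, axis=1)
    snapshots[t] = np.histogram(speeds, bins=40, range=(0.0, 3.0), density=True)[0]
```

At $t=0$ the histogram is a single spike at $v=1$; as collisions accumulate it spreads out and approaches the Maxwell-Boltzmann curve, which is exactly the evolution the plot of $p(v,t)$ would show.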
Simple approach (does not account for temperature)
Let's focus on one particular site, say site $m$, where $0 \leq m < N$, and look at a single iteration of the quanta exchange process. The probability that $m$ loses a quantum in that iteration is $1/N$, because any of the $N$ sites could have been chosen. Likewise, the probability of $m$ gaining a quantum is $1/N$, because whichever site is chosen to lose a quantum, one of the $N$ sites has to receive it.
For $m$ to accumulate $n$ quanta, it has to be chosen to gain a quantum $n$ times in a row without ever losing one. Since the probabilities of gain and loss are equal, the probability of this whole chain of events is $2^{-n}$. Multiplying the total number of sites by the probability of finding a site with $n$ quanta gives $N(n) = 2^{-n} N$ sites with $n$ quanta.
Remark 1: Strictly speaking, the probability of $m$ losing a quantum deviates from $1/N$, because some sites will already have 0 quanta and cannot lose any more, and $m$ may or may not be among them. For sufficiently large $N$ this difference is negligible, which is probably also the reason for writing $N(n) \approx 2^{-n} N$ instead of "$=$".
Remark 2: On the way to $m$ acquiring its $n$ quanta there could also be "loops", in which $m$ first loses quanta and then gains them back, or simply does not participate in any exchange at all. These loops may be arbitrarily long, but since at each step it does not matter whether $m$ gains, loses, or nothing happens, their combined probability is just 1.
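The $2^{-n}$ fall-off can also be checked by simulating the exchange process directly (a sketch; the site count and iteration count are arbitrary, and the system starts with $Q = 1$ quantum per site). The geometric factor shows up most cleanly in the ratio $N(n+1)/N(n) \approx 1/2$, which is independent of the overall normalisation.

```python
import random
from collections import Counter

# Direct simulation of the exchange process: each iteration picks a random
# donor site; if it is empty nothing happens, otherwise one quantum moves
# to a random recipient site.
random.seed(0)
N = 5_000
quanta = [1] * N                       # Q = 1: one quantum per site initially

for _ in range(500_000):
    donor = random.randrange(N)
    if quanta[donor] == 0:
        continue                       # empty sites cannot lose a quantum
    quanta[donor] -= 1
    quanta[random.randrange(N)] += 1   # any site may receive the quantum

# Each extra quantum should roughly halve the number of sites carrying it:
# N(n + 1) / N(n) should be close to 1/2.
counts = Counter(quanta)
for n in range(4):
    print(n, counts[n], counts[n + 1] / counts[n])
```

The printed ratios hover around $1/2$, which is the content of the geometric law $N(n) \propto 2^{-n}$.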
More sophisticated approach
If you have $N$ sites and want to distribute $NQ$ quanta among them, the number of ways to do this is $$ \Omega(NQ) = \begin{pmatrix} NQ + N - 1 \\ N - 1 \end{pmatrix} \approx \begin{pmatrix} NQ + N \\ N \end{pmatrix} = \frac{(NQ + N)!}{(NQ)!\, N!}~, $$ where the parts in parentheses are binomial coefficients, not vectors. To see why this is true, encode the system's state as a sequence of dots and bars (e.g. "....|..|.|..."), where dots represent quanta and bars separate sites (in the example, site 1 has 4 quanta, site 2 has 2, site 3 has 1, and so on). Out of the $NQ + N - 1$ characters of the code, you have to choose $N-1$ to be bars, hence the binomial coefficient.
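For a small system the stars-and-bars count can be confirmed by brute-force enumeration (a sketch; $N = 3$, $Q = 2$ are arbitrary small values):

```python
from itertools import product
from math import comb

# Distribute N*Q quanta over N sites: compare the stars-and-bars formula
# with a brute-force count of all occupation-number tuples.
N, Q = 3, 2
total = N * Q                                   # 6 quanta

brute_force = sum(1 for occ in product(range(total + 1), repeat=N)
                  if sum(occ) == total)
formula = comb(total + N - 1, N - 1)            # C(8, 2) = 28
print(brute_force, formula)
```

Both counts come out to 28, confirming the $\binom{NQ+N-1}{N-1}$ formula for this small case.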
Now let's assume that site $m$ has $n$ quanta. The number of ways to distribute the remaining $NQ-n$ quanta over the remaining $N-1$ sites is
$$ \Omega_s(n) = \begin{pmatrix} NQ - n + N - 2 \\ N-2 \end{pmatrix} \approx \begin{pmatrix} NQ - n + N \\ N \end{pmatrix} = \frac{(NQ - n + N)!}{(NQ - n)!\, N!}~. $$
Next we calculate
$$ \begin{aligned} \log \Omega_s(n) &= \log(NQ - n + N)! - \log(NQ-n)! - \log N! \\ &\overset{\text{Stirling's approximation}}{\approx} (NQ - n + N) \log(NQ - n + N) - (NQ - n + N) - (NQ - n) \log(NQ - n) + (NQ - n) - N \log N + N \\ &= (NQ - n + N) \log(NQ - n + N) - (NQ - n) \log(NQ - n) - N \log N \\ &= NQ \Big( 1 - \underbrace{\tfrac{n}{NQ}}_{=:\,x} + \tfrac 1 Q \Big) \log \Big( NQ \big( 1 - x + \tfrac 1Q \big) \Big) - NQ \left( 1 - x \right) \log \big( NQ \left( 1 - x \right) \big) - N \log N \\ &= NQ ( 1 - x + 1/Q) \big(\log(NQ) + \log(1-x+1/Q)\big) - NQ(1-x) \big(\log(NQ) + \log(1-x)\big) - N \log N \\ &\overset{\text{Taylor expansion in $x$ around 0}}{=} \log \Omega(NQ) + \big(NQ \log NQ - NQ \log(NQ + N)\big)\,x + \mathcal O(x^2) \\ &= \log \Omega(NQ) + \log \left( \frac{NQ}{NQ + N} \right) n + \mathcal O(x^2)~. \end{aligned} $$
Thus
$$ \log \frac{\Omega_s(n)}{\Omega(NQ)} = \log \Omega_s(n) - \log \Omega(NQ) = \log \left(\frac{NQ}{NQ + N} \right) n~, $$
and exponentiation yields
$$ \frac{\Omega_s(n)}{\Omega(NQ)} = \exp \left( n \log \frac{NQ}{NQ + N} \right) = \left( \frac{NQ}{NQ + N} \right)^n~. $$
Because all of the possibilities counted in $\Omega_s(n)$ and $\Omega(NQ)$ are equally probable, this ratio is the probability of finding site $m$ with $n$ quanta. Setting $Q = 1$ gives $(N / 2N)^n = 2^{-n}$, as before.
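As a numerical sanity check (with arbitrarily chosen $N$ and $Q$), the exact ratio of the two approximated binomial coefficients can be compared with the closed form $(NQ/(NQ+N))^n$:

```python
from math import comb

# Compare the exact ratio of the approximated binomial coefficients with the
# closed form (NQ / (NQ + N))^n; Python's exact integer arithmetic handles
# the enormous binomial coefficients without overflow.
N, Q = 1000, 1

for n in range(6):
    exact = comb(N * Q - n + N, N) / comb(N * Q + N, N)
    approx = (N * Q / (N * Q + N)) ** n          # equals 2^-n for Q = 1
    print(n, exact, approx)
```

For $n \ll NQ$ the agreement is at the sub-percent level, which is exactly what the Taylor expansion to first order in $x = n/(NQ)$ promises.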