Regarding the second part of the question: say ${\bf X} = \{X_1, X_2, \dots, X_d\}$ is a $d$-dimensional Bernoulli random variable. The joint probability function takes, in general, $2^d$ values, though the condition $\sum p({\bf x})=1$ leaves us with $2^d-1$ degrees of freedom. To represent it in a compact and handy way, the following representation is possible (I devised it some years ago for a problem I had at hand; I guess it's well known).
I assume that our Bernoulli takes values in $\{-1,1\}$ instead of the more usual
$\{0, 1\}$. Then, the joint probability can be written as
$$P_{\bf X}(x_1,x_2,\cdots x_d)=\frac{1}{2^d} \sum_{S_\lambda} a_{S_\lambda} \prod_{k \in S_\lambda} x_k$$
where $S_\lambda$ are all the subsets of the set $\{1, 2, \cdots ,d\}$ and $a_{S_\lambda} = E[\prod_{k \in S_\lambda} X_k]$ ($a_{\emptyset}=1$).
For example, for $d=3$:
$$P_{\bf X}(x_1,x_2,x_3)=\frac{1}{2^3}\left( a_{123} x_1 x_2 x_3 +
a_{12} x_1 x_2 + a_{23} x_2 x_3 + a_{13} x_1 x_3 +
a_{1} x_1 + a_{2} x_2 + a_{3} x_3 + 1
\right)$$
where $a_{123}=E[X_1 X_2 X_3]$, $a_{12}=E[X_1 X_2]$, $a_{2}=E[X_2]$, etc.
This representation has two advantages. First, the coefficients do not change if we add or suppress (marginalize) components: if we have the above formula and want to compute $P(x_1,x_2)$ or $P(x_1,x_2,x_3,x_4)$, we only need to add or suppress terms (and change the normalization factor). Second, independent components (recall that independent is the same as uncorrelated here) are easily spotted: for example, if $X_2$ is independent of $X_1$, then $a_{12}=a_{1} a_2$. Further, $\text{cov}(X_1,X_2)=a_{12} - a_1 a_2$.
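Both properties are easy to verify numerically. The following sketch (with an arbitrary, made-up pmf on $\{-1,1\}^3$ as input) computes the coefficients $a_S$, reconstructs the pmf from the formula, and checks that the marginal of $(X_1,X_2)$ reuses the same coefficients with the normalization changed to $2^{-2}$:

```python
import itertools
import numpy as np

d = 3
points = list(itertools.product([-1, 1], repeat=d))

# An arbitrary (made-up) joint pmf on {-1,1}^3: random weights, normalized
rng = np.random.default_rng(0)
p = rng.random(2**d)
p /= p.sum()

subsets = [S for r in range(d + 1) for S in itertools.combinations(range(d), r)]

# a_S = E[prod_{k in S} X_k]  (empty product = 1, so a_emptyset = 1)
a = {S: sum(pi * np.prod([x[k] for k in S]) for pi, x in zip(p, points))
     for S in subsets}

# Reconstruction: P(x) = 2^{-d} * sum_S a_S * prod_{k in S} x_k
recon = [sum(a[S] * np.prod([x[k] for k in S]) for S in subsets) / 2**d
         for x in points]
assert np.allclose(recon, p)

# Marginalizing to (X1, X2): the coefficients are unchanged, only the
# subsets of {0, 1} survive and the normalization becomes 2^{-2}
marg = {xy: sum(pi for pi, x in zip(p, points) if x[:2] == xy)
        for xy in itertools.product([-1, 1], repeat=2)}
sub2 = [S for S in subsets if set(S) <= {0, 1}]
for xy, m in marg.items():
    val = sum(a[S] * np.prod([xy[k] for k in S]) for S in sub2) / 2**2
    assert np.isclose(val, m)
```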
A disadvantage of this representation is that the condition $|a_{i...k}|\le1$ is necessary but not sufficient to guarantee that it corresponds to a valid probability function. To check that, we must recover the values of the probability function. But the relation between ${\bf p} =(p_{\emptyset},p_1,p_2 \dots p_{12} \dots p_{123})$ and the coefficients ${\bf a} =(a_{\emptyset},a_1,a_2 \dots a_{12} \dots a_{123})$ is straightforward: ${\bf a} = {\bf M p}$ with $m_{i,j}=\pm 1$ (the sign depends on the parity of the set difference between the sets indexing that row and column), and ${\bf M}^{-1}= {\bf M}^t / 2^d$ (I leave some details out; I doubt they will be missed).
So, if we are given the first and second moments of ${\bf x}$, we automatically have the first and second order coefficients. The rest of the coefficients are arbitrary, except for the restriction that they must lead to a valid probability function ($0 \le p_{i..k} \le 1$).
Regarding the first question, the idea is interesting, but I doubt there is some "maximum" distance. A mere suggestion: given that we are restricting to a variable with fixed first and second moments, to compute its deviation from the corresponding multivariate Gaussian I'd try the entropy (minus the entropy of the corresponding Gaussian, which should be bigger).
Added: To get explicitly ${\bf M}$:
Note that in my notation for the joint probability function ${\bf p} =(p_{\emptyset},p_1,p_2 ... p_{12} ... p_{123})$ the subindexes denote the components that take positive value; hence, for example $p_{1}=P(X_1=1,X_2=-1,X_3=-1)$, $p_{23}=P(X_1=-1,X_2=1,X_3=1)$, etc.
Say we want to compute $a_{12}$.
$$a_{12}=E(X_1 X_2) = p_{123} (1)(1) + p_{12} (1)(1) + p_{23} (-1)(1) + \cdots + p_1 (1)(-1) + p_{\emptyset}(-1)(-1)$$
Or
$$a_i = \sum_j (-1)^{\#(S_i \setminus S_j)} p_j$$
where the indexes $i,j$ run over the $2^d$ subsets, and $\#(S_i \setminus S_j)$ is the cardinality of the set difference. That term, then, gives the elements of the matrix ${\bf M}$.
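The sign formula makes ${\bf M}$ easy to build and check. A small sketch (the subset ordering here, by size then lexicographically, is an assumption; a different ordering only permutes rows and columns) verifies that the rows are mutually orthogonal, hence ${\bf M}^{-1}={\bf M}^t/2^d$:

```python
import itertools
import numpy as np

d = 3
subsets = [frozenset(S) for r in range(d + 1)
           for S in itertools.combinations(range(d), r)]

# m_{i,j} = (-1)^{#(S_i \ S_j)}
M = np.array([[(-1) ** len(Si - Sj) for Sj in subsets] for Si in subsets])

# Entries are +-1 and rows are mutually orthogonal (a Hadamard-type matrix),
# so M M^t = 2^d I, i.e. M^{-1} = M^t / 2^d
assert np.array_equal(M @ M.T, 2**d * np.eye(2**d, dtype=int))
```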
Update: The matrix ${\bf M}$ is a type of Hadamard matrix, specifically (if I'm not mistaken) a Walsh matrix.
Best Answer
An important property of the multivariate normal distribution is that if $X$ has an $n$-dimensional normal distribution, then $BX+c$ has an $m$-dimensional normal distribution for any $m\times n$ matrix $B$ and $m$-dimensional column vector $c$.
It can be shown that $$\mathbb{E}[BX + c] = B\mathbb{E}[X]+c \quad \text{ and } \quad \text{Var}(BX+c)=B\text{Var}(X)B^T$$
Using this we can characterize the multivariate normal distribution as an affine transformation of independent $N(0,1)$ variables.
The construction goes as follows: suppose we want to construct a normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. Take $n$ independent $N(0,1)$ variables; then $X=(X_1,...,X_n)$ has an $n$-dimensional normal distribution with mean $0$ and covariance matrix $I$ (the identity matrix). Consider the transformation $$Y=\Sigma^{1/2}X + \mu,$$ where $\Sigma^{1/2}$ is the symmetric square root of $\Sigma$ (see https://en.wikipedia.org/wiki/Square_root_of_a_matrix#By_diagonalization).
$Y$ has an $n$-dimensional normal distribution with mean $\mu$ and covariance matrix $$\text{Var}(Y) = \Sigma^{1/2} I (\Sigma^{1/2})^T=\Sigma.$$
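The construction is easy to try numerically. A sketch (the particular $\mu$ and $\Sigma$ below are made-up example values, with $\Sigma$ chosen positive definite) builds $\Sigma^{1/2}$ by diagonalization and checks the sample moments of $Y$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example target mean and positive-definite covariance (illustrative values)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# Symmetric square root via eigendecomposition: Sigma = Q diag(w) Q^T,
# so Sigma^{1/2} = Q diag(sqrt(w)) Q^T
w, Q = np.linalg.eigh(Sigma)
Sigma_half = Q @ np.diag(np.sqrt(w)) @ Q.T
assert np.allclose(Sigma_half @ Sigma_half, Sigma)

# Y = Sigma^{1/2} X + mu, with X a matrix of independent N(0,1) samples
n_samples = 200_000
X = rng.standard_normal((3, n_samples))
Y = Sigma_half @ X + mu[:, None]

# Sample moments should approximate mu and Sigma
assert np.allclose(Y.mean(axis=1), mu, atol=0.05)
assert np.allclose(np.cov(Y), Sigma, atol=0.1)
```

A Cholesky factor of $\Sigma$ would work just as well in place of the symmetric square root, since only $B B^T = \Sigma$ is needed.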
An interesting consequence is that if $\Sigma$ has rank $k$, then $Y$ is concentrated on a $k$-dimensional affine subspace of $\mathbb{R}^n$; if $k<n$, then $Y$ is concentrated on a set of Lebesgue measure $0$, which means that a density cannot exist.