I think the Wikipedia article contains the proof, but in any case it is easy. The proof of correctness of the Durstenfeld algorithm is the same as for the original procedure by Fisher and Yates: induction on the number $n$ of elements to permute (the case $n=0$ is trivial). One starts with the items in any order; since they are going to be randomly permuted, this order is of no importance. To get a random permutation of $n$ distinct items, the first item must be chosen uniformly at random. Once this element is chosen, the remaining elements, which we can arrange in any order (the only difference between Fisher-Yates and Durstenfeld is which order the remainder starts out in), must be permuted uniformly at random, which is done by applying the algorithm recursively to the list of those remaining items.
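The inductive step above unrolls into a simple loop; a minimal Python sketch (the function name is mine, and the standard library generator is used for illustration):

```python
import random

def durstenfeld_shuffle(items):
    """In-place Fisher-Yates/Durstenfeld shuffle (forward variant).

    Step i chooses the element for position i uniformly among the
    n - i items not yet placed, exactly as in the inductive argument:
    position i is uniform, then the remaining suffix is shuffled.
    """
    n = len(items)
    for i in range(n - 1):
        j = random.randrange(i, n)          # uniform over positions i..n-1
        items[i], items[j] = items[j], items[i]
    return items
```

The backward variant (looping $i$ from $n-1$ down to $1$ with $j$ drawn uniformly from $0,\dots,i$) is the form usually printed as Durstenfeld's algorithm; both are correct by the same induction.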
Alternatively, one can show (also by induction) that when the random number generator is called $n$ times to produce an element of the Cartesian product $[n]\times[n-1]\times\cdots\times[1]$, there is a bijection between the sequences the generator can produce and the resulting permutations; so if the Cartesian product is sampled uniformly at random, so is the permutation produced.
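This bijection can be checked exhaustively for small $n$; a sketch in Python (helper name mine), encoding each factor $[n-i]$ as a 0-based offset:

```python
from itertools import product

def permutation_from_choices(choices):
    """Map a tuple (j_0, ..., j_{n-1}) with 0 <= j_i < n - i to a
    permutation, by running the shuffle with those fixed choices."""
    n = len(choices)
    items = list(range(n))
    for i, j in enumerate(choices):
        items[i], items[i + j] = items[i + j], items[i]
    return tuple(items)

n = 4
# The Cartesian product [n] x [n-1] x ... x [1], with 0-based offsets.
all_choices = product(*(range(n - i) for i in range(n)))
perms = [permutation_from_choices(c) for c in all_choices]
# Bijection: n! distinct choice sequences give n! distinct permutations.
assert len(perms) == len(set(perms)) == 24
```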
The difficulty is not the probabilistic argument, but the requirements imposed on the random generator. When producing a random permutation of a large number of items (say $n=10000$), it is very likely that the Cartesian product to be sampled is simply larger than the state space of the random generator, which means that a given generator cannot possibly generate all $n!$ permutations, let alone generate them uniformly. This has nothing to do with the algorithm used to produce the permutation.

Whether this is a problem depends on what the permutation is going to be used for. If it is just to ensure some mixing in a statistical procedure, I think it does not matter that certain permutations will never show up; the important point is that the subset of permutations that can actually be produced passes any easily executed statistical test (such as counting the number of fixed points) in the same way as the set of all permutations would. If however the procedure is used to produce a drawing for a lottery where large amounts of money are involved, it would be quite embarrassing if a player could prove (maybe by exhaustive search of the state space) that the generator used for the drawing cannot possibly produce the combination he happens to hold, as for him it is then no longer a game of chance, but simply a fraud. The question arises what one really means by a "random permutation".
So to answer the added question, it is clearly the case that:

Any permutation generation algorithm based on a pseudo-random number generator will fail to sample uniformly from the space of all permutations of a sufficiently large fixed size; for instance, any size $n$ such that $n!$ exceeds the number of states of the pseudo-random number generator.
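To make "larger than the state space" concrete, here is a quick computation assuming the widely used Mersenne Twister (Python's default generator), whose internal state has 19937 bits:

```python
from math import lgamma, log

STATE_BITS = 19937          # bits of state in the Mersenne Twister (MT19937)

def log2_factorial(n):
    """log2(n!) via the log-gamma function, avoiding huge integers."""
    return lgamma(n + 1) / log(2)

# Smallest n for which n! exceeds the 2**19937 possible generator states:
n = 1
while log2_factorial(n) <= STATE_BITS:
    n += 1
print(n)   # -> 2081: the generator cannot reach all permutations of 2081 items
```

So already for a couple of thousand items, most permutations are unreachable, well below the $n=10000$ of the example above.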
I commend to you the Hungarian method. It's an algorithm that I'm not up to writing out here, but it's available in many combinatorics, discrete math, and graph theory textbooks, also many places on the web: Wikipedia in particular will get you started.
Best Answer
Using the excellent indications given by @Rodrigo de Azevedo, in the following form where $\|\cdot\|_F$ denotes the Frobenius norm (see the remark at the bottom), your problem is equivalent to:
$$\text{minimize} \ \|AP-B\|_F^2=\|AP\|_F^2-2\langle AP,B\rangle+\|B\|_F^2=\|A\|_F^2-2 \ \text{trace}(P^TA^TB)+\|B\|_F^2,$$

where $\|AP\|_F=\|A\|_F$ because a permutation matrix $P$ is orthogonal, and $\langle AP,B\rangle=\text{trace}((AP)^TB)=\text{trace}(P^TA^TB)$ is the Frobenius inner product.
Have a look at https://en.wikipedia.org/wiki/Frobenius_inner_product
As $\|A\|_F^2+\|B\|_F^2$ is constant, it remains to:
$$\text{maximize} \ \text{trace}(PC) \ \text{with} \ C:=A^TB$$
over the group of permutation matrices $P$ (writing $P$ instead of $P^T$ is harmless, since $P \mapsto P^T$ is a bijection of that group).
Edit:
The operation $\text{trace}(PC)$ can be understood in the particular $3 \times 3$ case:
$$\text{trace} \begin{pmatrix}1&0&0\\0&0&1\\0&1&0\end{pmatrix}\begin{pmatrix}c_{11}&c_{12}&c_{13}\\c_{21}&c_{22}&c_{23}\\c_{31}&c_{32}&c_{33}\end{pmatrix}=c_{11}+c_{23}+c_{32}.$$
A straightforward generalization to the $n \times n$ case shows that the objective is to find, among all sums
$$s_{\sigma}=c_{1\sigma(1)}+c_{2\sigma(2)}+\cdots+c_{n\sigma(n)} \ \text{,} \ \sigma \in \frak{S}_n$$
the one which is maximal.
Fortunately, this can be done with the efficient Hungarian algorithm applied to the matrix $C$, or, more exactly, since its standard form deals with minimization, with a version adapted to maximization (for instance by running it on $-C$).
Why do we say efficient? Because this algorithm has complexity $O(n^4)$ ($O(n^3)$ in improved variants) instead of the $O(n!)$ complexity of the brute-force approach.
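In practice this assignment step need not be coded by hand: SciPy ships an implementation. A sketch on synthetic data, maximizing $\text{trace}(P^TC)$, which is the form appearing in the derivation $\text{trace}(P^TA^TB)$ above (the two formulations differ only by transposing $P$):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((4, n))
P_true = np.eye(n)[rng.permutation(n)]   # a random ground-truth permutation
B = A @ P_true                           # so min ||AP - B||_F is 0, at P = P_true

C = A.T @ B                              # C := A^T B
# Maximize trace(P^T C) = sum_i C[i, sigma(i)]: an assignment problem on C.
row, col = linear_sum_assignment(C, maximize=True)
P = np.zeros((n, n))
P[row, col] = 1.0                        # P[i, sigma(i)] = 1

assert np.allclose(A @ P, B)             # the hidden permutation is recovered
```

`linear_sum_assignment` handles the maximization variant directly via its `maximize=True` flag, so no sign flip of $C$ is needed.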
Remarks :
In fact, the operation $A \to AP$, where $P$ is a permutation matrix, performs a permutation of columns, which is clearly equivalent to the permutation of rows given by $A \to PA$.
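This remark can be verified directly in NumPy (the matrix size and the permutation chosen are arbitrary):

```python
import numpy as np

A = np.arange(9).reshape(3, 3)
perm = [2, 0, 1]
P = np.eye(3, dtype=int)[perm]      # P[k, perm[k]] = 1

# Right multiplication permutes columns: column k of A lands in column perm[k].
assert (A @ P)[:, 2].tolist() == A[:, 0].tolist()
# Left multiplication permutes rows: row i of P @ A is row perm[i] of A.
assert (P @ A)[0].tolist() == A[2].tolist()
```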
Related: https://math.stackexchange.com/q/175893.