What you've stumbled upon is called the "Gibbs paradox". The resolution is to divide the phase-space volume used for entropy calculations in statistical mechanics by the identical-particle factor $N!$, which reduces the number of configurations.
Since the temperature is unchanged in the process, the momentum distribution of the atoms is unimportant: it is the same before and after, and the entropy change is entirely spatial, as you realized. The volume of configuration space for the left part is:
${V_1^N \over N!}$
and for the right part is:
${V_2^N\over N!}$
And the total volume of the 2N particle configuration space is:
$(V_1V_2)^N\over (N!)^2$
When you lift the barrier, you get the spatial volume of configuration space
$(V_1 + V_2)^{2N} \over (2N)!$
When $V_1$ and $V_2$ are equal, you naively would expect zero entropy gain. But you do gain a tiny little bit of entropy by removing the wall. Before you removed the wall, the number of particles on the left and on the right were exactly equal, now they can fluctuate a little bit. But this is a negligible amount of extra entropy in the thermodynamic limit, as you can see:
${(2V)^{2N}\over (2N)!} = {2^{2N}(N!)^2\over (2N)!}{V^{2N}\over (N!)^2}$
So that the extra entropy from lifting the barrier is equal to:
$-\log\left({(2N)!\over 2^{2N}(N!)^2}\right)$
You might recognize the thing inside the log: it is the probability that a symmetric +/-1 random walk returns to the origin after 2N steps, i.e. the biggest term of Pascal's triangle at stage 2N, normalized by the sum of all the terms at that stage. From the Brownian motion identity (or, equivalently, directly from Stirling's formula) you can estimate its size as ${1\over \sqrt{\pi N}}$, so that the logarithm goes as log(N): it is sub-extensive, and negligible compared to the terms proportional to N for large numbers.
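As a quick numerical sanity check of this estimate, here is a minimal Python sketch. The function names are made up for illustration; the only ingredients are the central binomial probability $(2N)!/(2^{2N}(N!)^2)$ from above and its leading-order Stirling estimate $1/\sqrt{\pi N}$.

```python
import math

def log_return_probability(n: int) -> float:
    """log of (2n)! / (2**(2n) * (n!)**2), via log-gamma to avoid overflow."""
    return math.lgamma(2 * n + 1) - 2 * math.lgamma(n + 1) - 2 * n * math.log(2)

def stirling_estimate(n: int) -> float:
    """Leading-order Stirling estimate of the same log: -0.5 * log(pi * n)."""
    return -0.5 * math.log(math.pi * n)

if __name__ == "__main__":
    for n in (10, 1_000, 1_000_000):
        exact = log_return_probability(n)
        # Last column: size of the correction per particle, which shrinks with n.
        print(n, exact, stirling_estimate(n), -exact / n)
```

The last printed column is the correction per particle; it goes to zero as N grows, which is what "sub-extensive" means here.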
The entropy change in the general case is then exactly given by the logarithm of the ratio of the two configuration space volumes before and after:
$e^{\Delta S} = { V_1^N V_2^N \over (N!)^2 } { (2N)! \over (V_1 + V_2)^{2N}} = { V_1^N V_2^N \over ({V_1 + V_2 \over 2})^{2N}} {(2N)!\over 2^{2N}(N!)^2}$
Ignoring the thermodynamically negligible last factor, the macroscopic change in entropy, the part proportional to N, is:
$\Delta S = N\log({4 V_1 V_2 \over (V_1 + V_2)^2})$
Up to a sign, this is what you calculated.
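To see how small the neglected factor is for unequal volumes, the exact and extensive expressions can be compared numerically. This is only a sketch with hypothetical function names. It computes the entropy gained on lifting the barrier, log(after/before), which has the opposite sign to the $\Delta S$ written above.

```python
import math

def log_config_volume(n: int, volume: float) -> float:
    """log of V**n / n!, the spatial configuration volume for n identical particles."""
    return n * math.log(volume) - math.lgamma(n + 1)

def entropy_gain_exact(n: int, v1: float, v2: float) -> float:
    """Exact entropy gained on lifting the barrier: log(after / before)."""
    before = log_config_volume(n, v1) + log_config_volume(n, v2)
    after = log_config_volume(2 * n, v1 + v2)
    return after - before

def entropy_gain_extensive(n: int, v1: float, v2: float) -> float:
    """Extensive part only: N * log((V1 + V2)**2 / (4 * V1 * V2))."""
    return n * math.log((v1 + v2) ** 2 / (4 * v1 * v2))

if __name__ == "__main__":
    n, v1, v2 = 1000, 1.0, 3.0
    print(entropy_gain_exact(n, v1, v2), entropy_gain_extensive(n, v1, v2))
```

The difference between the two numbers does not depend on $V_1$ and $V_2$ at all: it is the random-walk factor, of order $\tfrac12\log(\pi N)$.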
Additional comments
You might think that it is weird to gain a little bit of entropy just from the fact that before you lift the wall you knew that the particle numbers were exactly N, even if that entropy is subextensive. Wouldn't that mean that when you lower the wall, you reduce the entropy a tiny subextensive amount, by preventing mixing of the right and left half? Even if the entropy decrease is tiny, it still violates the second law.
There is no entropy decrease, because when you lower the barrier, you don't know how many molecules are on the left and how many are on the right. If you add the entropy of ignorance to the entropy of the lowered wall system, it exactly removes the subextensive entropy loss. If you try to find out how many molecules are on the right vs how many are on the left, you produce more entropy in the process of learning the answer than you gain from the knowledge.
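One way to read the "entropy of ignorance" statement is as the following bookkeeping identity, sketched here in Python for two equal halves of volume V. The function name and the choice V = 1 are assumptions for illustration. With $W_k$ the configuration volume when exactly $k$ of the $2N$ particles are on the left, and $p_k = W_k/W$ the probability of that split, the open-box entropy $\log W$ equals the average walled entropy plus the Shannon entropy of the unknown split.

```python
import math

def partition_check(n: int, volume: float = 1.0):
    """Return (log W_open, mean walled entropy, entropy of ignorance) for 2n particles."""
    total = 2 * n
    # log W_k: configuration volume with exactly k particles on the left.
    log_w = [
        total * math.log(volume) - math.lgamma(k + 1) - math.lgamma(total - k + 1)
        for k in range(total + 1)
    ]
    # log of the open-box volume (2V)**(2N) / (2N)!.
    log_w_open = total * math.log(2 * volume) - math.lgamma(total + 1)
    # Probability of each split once the wall is put back.
    probs = [math.exp(lw - log_w_open) for lw in log_w]
    ignorance = -sum(p * math.log(p) for p in probs if p > 0.0)
    mean_walled = sum(p * lw for p, lw in zip(probs, log_w))
    return log_w_open, mean_walled, ignorance

if __name__ == "__main__":
    open_box, walled, ignorance = partition_check(50)
    print(open_box, walled + ignorance)
```

The two printed numbers agree to rounding error, so nothing is lost on reinserting the wall as long as the unknown split is counted.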
Part of my PhD thesis was on this stuff, so I hope I can give a satisfactory answer.
Maximum entropy production and minimum entropy production are different types of principle with different domains of application. Before discussing the answer I should make clear that the maximum entropy production principle (which I'll call MaxEP) is really a collection of different hypotheses by different authors, some of which are more plausible than others, and none of which has an accepted theoretical justification. However, there is some empirical evidence in the work of Paltridge from the 70s, e.g. this paper. A very simple one-parameter version of Paltridge's model can be found in this paper by Lorenz et al., and in the discussion below I will keep as close as possible to the version of MaxEP that Lorenz et al. use.
As you say, Prigogine's principle of minimum entropy production (henceforth MinEP) only applies in near-equilibrium situations. It was once hypothesised to be much more widely applicable. This hypothesis has now been disproven, and one must be careful to bear this in mind when reading old material on the subject. (For the moment I've lost track of the paper that disproves this idea, but it's a pretty solid mathematical result. If I find it again I'll update this answer.)
With these caveats out of the way, the basic difference is this:
For linear, near-equilibrium systems that only admit a single steady state, MinEP says that all of the system's transient states have a higher entropy production than the steady state. A transient state is a temporary state that is not a steady state. MinEP compares steady states with non-steady states.
For some yet-to-be-determined class of non-linear, far-from-equilibrium systems that admit a continuum of possible steady states, MaxEP says that the system is most likely to be found in the steady state with the greatest entropy production. MaxEP compares steady states to other steady states, but says nothing about transient states.
So aside from the fact that the two principles apply to quite different types of system (linear versus highly non-linear), they also make quite different types of claim. One can imagine a system that admits many possible steady states, but whose transient states all have a higher entropy production than any of its steady states. For such a system, MinEP and MaxEP could apply simultaneously. If so then starting from a non-steady initial state, its entropy production would reduce over time until it reached a steady state and would remain constant thereafter; but nevertheless the steady state that it reaches is most likely to be the one with the highest entropy production.
Unfortunately there is a depressing amount of literature in which these points are not well appreciated. It seems that people often think MaxEP implies that entropy production should increase over time as the system approaches a steady state. But this isn't true for a lot of systems, and I think this mistake in reasoning might be one of the reasons why MaxEP doesn't have a great reputation as a hypothesis.
As for literature that addresses this distinction, I seem to remember there being some fairly readable discussion in this book chapter by Dewar. Another place to look is Edwin Jaynes' criticism of the minimum entropy production principle. It doesn't really mention MaxEP (because Jaynes seems not to have been aware of Paltridge's papers) but it gives some strong hints towards it, and I found it extremely helpful in understanding the nature of MinEP and why a different type of principle is needed. Finally, I suppose I could also humbly point you to my paper on MaxEP, which doesn't discuss MinEP but tries to clarify some points about how MaxEP is applied, and to resolve some serious theoretical problems with the principle. These papers deal with some of the issues I've skipped over above, such as what it means for a system to have "possible" steady states that are different from the actual one.
Edit to reply to comment
The OP has commented that maybe the above implies that systems always choose the most entropy-producing state they "could" be in, regardless of whether this is a transient or a steady state, but for the transient states the maximum possible entropy production can reduce over time as the system converges to a steady state.
There are several ways I can address this. The first possibility is to say that above I was talking only about the version applied by Paltridge and by Lorenz et al., because this is the only version with even the tiniest little sliver of empirical evidence. It's very, very important to note that this version of MaxEP doesn't say anything at all about transient states. As Paltridge has said (as the OP points out), his version of MaxEP is just an empirical observation and not a theoretical claim, and it's an observation of the atmosphere's steady state, not its transient ones.
It's also important to note that there are few if any systems other than atmospheres that have been observed to obey a principle similar to Paltridge's. (There are claims for other systems, mostly in the Earth sciences, but I don't find these very convincing. There are no laboratory-based observations of Paltridge's principle as far as I know, although this is partly because the experimental crowd have their own completely different "principle of maximum entropy production" that they like to play with, in which systems choose between a finite number of steady states instead of a continuum.) So we already know that MaxEP as an empirical principle is not broadly applicable to all non-linear systems, and it shouldn't be surprising that we get contradictions if we try to imagine it applying too broadly. It might well be that MaxEP, if it is a valid principle at all, will turn out to apply only to thermally-driven turbulent fluids in steady state with very large Reynolds numbers, and not to any other type of system.
However, in addition to considering the empirical evidence due to Paltridge, we can consider the theoretical claims that have been made about MaxEP. In my opinion the most advanced such arguments are due to Dewar (2003, 2005). Dewar does make the claim that MaxEP is broadly applicable. In fact, he says it's applicable to all systems in a steady state, but that all steady-state systems maximise their entropy production subject to constraints, and most systems are more heavily constrained than atmospheres, so that it's difficult to use MaxEP to make predictions about them. (This sounds like circular reasoning but it isn't. It's very similar to the way equilibrium systems maximise their entropy subject to constraints such as conservation laws.) But again, Dewar's theory does not make any claims at all about transient states. Dewar's proof cannot be interpreted in the way the OP suggests, because it only compares steady states to other steady states, not to transient ones.
(As a side note, I should say that although I think Dewar's work is the closest thing we have to a theoretical explanation of Paltridge's observations, I don't think it's quite correct. My paper, linked above, attempts to resolve what I see as a serious logical contradiction in his approach. This is a different contradiction from the one we've been discussing so far, and has to do with the fact that Dewar's version of MaxEP makes different predictions depending on where you draw the system's boundary.)
I could just leave it there. However, in my paper I do make the claim that Dewar's version of MaxEP (or something like it) can be extended to transient states, in something quite similar to the way you suggest. Like Dewar, I try to extend Jaynes' MaxEnt thermodynamics to deal with non-equilibrium states. Briefly, the idea is that if we maximise the information entropy of the system's microscopic state at time $t_1$, subject to the knowledge we have about the system from measurements made at time $t_0$ then, trivially, we've maximised the rate of increase of information entropy between times $t_0$ and $t_1$. Identifying this information entropy with the thermodynamic entropy is trickier than it might seem at first, but if we can do that then we've reached a version of MaxEP that does indeed apply to all states, transient or otherwise.
However, I don't think it leads to a contradiction if you look at it in this way. The reason is that, given the knowledge constraints formed by the measurements at $t_0$, there is exactly one macrostate at every time $t>t_0$ that maximises the (information) entropy subject to those constraints; it cannot be any other way. This means, I think, that within this framework it is not possible for the situation you suggest to arise, and transient states with high entropy productions must always lead to steady states with high entropy productions. (But, having thought about it a bit more just now, this is all subject to an additional constraint of reproducibility that I don't think I spelt out very clearly in the paper. This needs more thought on my part.)
Important Note
For the sake of it not getting lost, there is an in-depth and (currently) on-going discussion of this answer and related issues in this chat room.
Let me start with the first sentence in your question:
which is very close to the statement in the introductory part of the Wikipedia page you cited. However, this is not a consistent way to express the minimum energy principle in thermodynamics. The reason for the inconsistency should become clear by looking at the formulas. When a thermodynamic state is fixed by the values of entropy, volume, and number of particles, the fundamental function from which the whole thermodynamic behavior can be obtained is the internal energy $U(S,V,N)$. Now, it is clear that once the independent variables are fixed, only one value of $U$ is possible. There is one thermodynamic state, and it is not clear which should be the states "among which energy should be minimum".
Actually, the correct statement of the minimum principle for energy is the following: in an equilibrium system at fixed entropy, volume and number of particles, and subject to internal constraints controlled by a set of parameters $X_{\alpha}$, the internal energy is a function $U(S,V,N;\{X_{\alpha}\})$ and the final equilibrium state, obtained after removal of the constraints, corresponds to the minimum of the energy among all the possible values of the constraint variables $X_{\alpha}$ (see Callen's textbook on Thermodynamics for a reference).
Starting from the correct statement of the minimum principle, a first observation is that it is more general than just the convexity property of the function $U(S,V,N)$. Indeed, from the minimum principle, one can derive convexity of $U(S,V,N)$. But there are cases where the minimum principle provides results which are not derivable from convexity. For example, if one can determine different functions of energy at fixed $S,N$, as a function of $V$, minimum energy allows one to determine the equilibrium state for each $V$.
What about intuition? Frankly, I think that the minimum energy principle is far from intuitive. The main reason is that the underlying condition of constant entropy is difficult to manage, both experimentally and conceptually. However, since from the minimum of energy $U(S,V,N;\{X_{\alpha}\})$ one can easily obtain similar minimum principles for the Legendre transforms of energy (Helmholtz free energy, Gibbs free energy), the difficult condition of fixed volume and entropy can be transformed into the conceptually and experimentally easier conditions of minimum at fixed temperature and volume or temperature and pressure.
Notwithstanding the previous words of caution about the non-intuitive condition of constant entropy, an example with a fluid system could help to get a better understanding. Let me start by recasting the situation correctly, as it should be analyzed in terms of the minimum energy principle.
There is a composite system made of two compartments such that initially the first compartment contains a fluid (the same in both compartments for simplicity) described by the thermodynamic variables $S_1,V_1,N_1$, and the second by $S_2,V_2,N_2$. $V_1,N_1$ and $V_2,N_2$ always remain fixed.
The energy of this composite system is the sum of the energies of the two subsystems and, since both are filled with the same fluid (for example, both with Neon gas), the same function $U$ of entropy, volume and number of particles describes both. By introducing the subscript $tot$ for the extensive quantities describing the composite system we have $S_{tot}=S_1+S_2$, $V_{tot}=V_1+V_2$ and $N_{tot}=N_1+N_2$. For a given partition of the total entropy into a value $S_1$ and $S_2=S_{tot}-S_1$ (this is the constraint on our composite system) we have $$ U_{tot}(S_{tot},V_{tot},N_{tot};S_1)=U(S_1,V_1,N_1)+U(S_{tot}-S_1,V_2,N_2). $$ The minimum energy principle applied to the present case says that if we eliminate the constraint that system $1$ should have entropy $S_1$, while always keeping $S_{tot}$ fixed, the final equilibrium state of the composite system will correspond to the value of $S_1$ which minimizes $U_{tot}$.
That there should be a minimum can be seen by noting that $U(S,V,N)$, at fixed $V$ and $N$, must be an increasing function of $S$ (recall that $\left.\frac{\partial{U}}{\partial{S}}\right|_{V,N}=T\gt 0$). So $U_{tot}$ is the sum of an increasing and a decreasing (convex) function of $S_1$ in the interval $0<S_1<S_{tot}$, and therefore it should have a minimum.
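As a short worked step, the location of the minimum follows from differentiating $U_{tot}$ with respect to $S_1$ at fixed $S_{tot}$. Here $T_1$, $T_2$ are the temperatures of the two compartments and $C_{V,1}$, $C_{V,2}$ their heat capacities at constant volume, notation introduced only for this sketch, using $\left.\frac{\partial T}{\partial S}\right|_{V,N}=\frac{T}{C_V}$: $$ \frac{\partial U_{tot}}{\partial S_1} = T_1 - T_2 = 0, \qquad \frac{\partial^2 U_{tot}}{\partial S_1^2} = \frac{T_1}{C_{V,1}} + \frac{T_2}{C_{V,2}} \gt 0. $$ So the stationary point is the state of equal temperatures, and it is a minimum provided the heat capacities are positive.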
It is possible to check everything explicitly in the case of a perfect gas in two equal volume containers with the same density. The total energy is $$ U_{tot} \propto \left( e^{\frac{2S_1}{3N_1k_B}} + e^{\frac{2(S_{tot}-S_1)}{3N_1k_B}} \right), $$ which has a minimum at $S_1=S_{tot}/2$.
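The location of this minimum can also be checked numerically. The following Python sketch uses hypothetical names and works in units where $k_B=1$. It evaluates the bracket above on a grid of $S_1$ values and picks the smallest.

```python
import math

K_B = 1.0  # Boltzmann constant; units with k_B = 1 are an assumption for illustration

def total_energy(s1: float, s_tot: float, n1: float) -> float:
    """U_tot up to a constant prefactor, for two equal perfect-gas compartments."""
    a = 2.0 / (3.0 * n1 * K_B)
    return math.exp(a * s1) + math.exp(a * (s_tot - s1))

def argmin_on_grid(s_tot: float, n1: float, points: int = 2001) -> float:
    """Grid search for the S_1 in [0, S_tot] that minimizes total_energy."""
    grid = [s_tot * i / (points - 1) for i in range(points)]
    return min(grid, key=lambda s1: total_energy(s1, s_tot, n1))

if __name__ == "__main__":
    print(argmin_on_grid(s_tot=10.0, n1=2.0))
```

A grid search is crude, but it is enough to confirm that the minimizer sits at the midpoint $S_{tot}/2$ rather than at either end of the interval.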
In a less formal way, one could say that the reason for the minimum is directly connected to the constraint of keeping the total entropy fixed. Since entropy is proportional to the logarithm of the number of states, a fixed total entropy in our composite system is equivalent to keeping fixed the product of the number of states of system $1$ and system $2$. The way the number of states varies with energy provides the mechanism on which the minimum principle is based.
A final remark on microstates. Discussion of the minimum energy principle can be based, as in the previous paragraphs, on a completely macroscopic thermodynamic description. Of course, thermodynamic variational principles can be translated into the language of statistical mechanics. However, statistical mechanics is more naturally expressed in the framework of entropy and its Legendre transforms. So, in the case of a microscopic description it is easier (more intuitive) to work with maximum principles.