You make a good point which requires us to be more careful about what Fermat's Principle says and how the proof proceeds. The upshot of what I'm going to say is this:
The statement of the Law of Reflection must include an appropriate constraint.
Here's what I mean in detail. First, let's give a precise statement of Fermat's Principle:
Fermat's Principle. Let $\mathscr C_3$ denote the set of all continuous curve segments in three dimensions, and let points $A$ and $B$ in three dimensions be given. Suppose that a light ray begins at point $A$ and ends at point $B$, and suppose that the path of the light ray is constrained to not lie in some subset $\chi\subseteq \mathscr C_3$. Then the path that light takes between $A$ and $B$ is a critical point of the travel time functional for any variation of paths contained in the set $\mathscr C_3\setminus\chi$.
We can use this principle to prove either of the following two statements, both of which one might be inclined to call the Law of Reflection.
Law of Reflection 1. If light is emitted in a given direction towards a mirror, then (i) the light will travel in a straight line towards the mirror along the initial direction, (ii) it will hit the mirror, (iii) it will reflect in a straight line, and (iv) the angle of incidence will equal the angle of reflection.
Law of Reflection 2. If light is emitted from a point above a mirror, and if the light makes contact with the mirror, then (i) the light will travel in a straight line from its initial point to the point of contact, (ii) it will reflect in a straight line, and (iii) the angle of incidence will equal the angle of reflection.
Notice that in both of these cases, there is a constraint that one needs to take into consideration when determining the path of least time. In the first statement above, the constraint set $\chi$ is the set of all continuous paths whose initial directions do not coincide with that of the specified initial direction. In the second statement of the Law, the constraint set $\chi$ is the set of all continuous paths that do not make contact with the mirror.
Note that if you don't include a constraint, and if you simply pick any two points above the mirror, then of course Fermat's Principle tells you that the path followed by light is the straight line segment joining those two points. But that's fine, because the Law of Reflection doesn't answer the question "given any two points $A$ and $B$ above a mirror, and given that a light ray goes from $A$ to $B$, what is the path that the light ray must take?" In fact, this question doesn't have a unique answer. The answer depends on the constraints.
They are equivalent.
The formal study of this kind of problem is called "The Calculus of Variations", and it requires that you have some level of understanding of integration and of partial derivatives.
You may imagine parameterizing the path taken in any way you want, say
$$\vec{f}(t;\, \alpha,\beta,\delta,\dots)$$
where the function describes the position of the light ray at time $t$, and $\alpha$, $\beta$, $\delta$, etc. are a set of numbers from which you build the path that you are proposing to take (perhaps they represent the angles the light takes through each material along the way). Then you find the arrival time $T$ such that $\vec{f}(T;\dots) = \text{destination}$ and plot $T$ as a function of the parameters $\alpha$, $\beta$, $\delta$, etc.
The arrival time $T$ will have its smallest value for the set of parameters that describe the path that is actually taken.
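As a rough illustration of the "parameterize, compute $T$, look for the smallest value" recipe, here is a minimal sketch. Everything specific in it is an assumption made for the example and not part of the answer above: the three-layer geometry, the speeds, the choice of the interface crossing points as the path parameters, and the crude coordinate search used as the minimiser.

```python
import math

def travel_time(crossings, start, end, interfaces_y, speeds):
    """Arrival time T of a piecewise-straight path.

    crossings are the x-coordinates at which the path crosses each
    horizontal interface; they play the role of the path parameters.
    """
    xs = [start[0], *crossings, end[0]]
    ys = [start[1], *interfaces_y, end[1]]
    total = 0.0
    for i, v in enumerate(speeds):
        total += math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i]) / v
    return total

def minimize_by_coordinate_search(f, params, step=0.5, tol=1e-8):
    """Crude derivative-free minimiser: nudge one parameter at a time,
    halving the step whenever no nudge lowers f."""
    params = list(params)
    best = f(params)
    while step > tol:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = params.copy()
                trial[i] += delta
                t = f(trial)
                if t < best:
                    params, best, improved = trial, t, True
        if not improved:
            step /= 2
    return params, best

# Hypothetical setup: three horizontal layers with different light speeds.
start, end = (0.0, 3.0), (3.0, 0.0)
interfaces_y = [2.0, 1.0]
speeds = [1.0, 0.75, 0.5]

def T(crossings):
    return travel_time(crossings, start, end, interfaces_y, speeds)

def snell_invariants(crossings):
    """sin(theta_i) / v_i for each segment, angles measured from the normal."""
    xs = [start[0], *crossings, end[0]]
    ys = [start[1], *interfaces_y, end[1]]
    return [
        (xs[i + 1] - xs[i]) / math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i]) / v
        for i, v in enumerate(speeds)
    ]

# Start the search from the straight line between the two end points.
best_crossings, best_T = minimize_by_coordinate_search(T, [1.0, 2.0])
print("crossings:", best_crossings)
print("travel time:", best_T)
print("sin(theta)/v per layer:", snell_invariants(best_crossings))
```

At the parameters that minimise $T$, the quantity $\sin\theta_i / v_i$ should come out the same in every layer, which is Snell's law recovered from the scan over paths.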
But this kind of math has certain limitations, and one of them is that it doesn't actually know the difference between a maximum and a minimum (nor indeed can it tell either of those apart from "inflection points", which I'm not going to explain but which you should have heard of if you have studied some calculus).[1] Formally it is said to yield a "stationary action".
[1] There are several questions around the site about manipulations of the "Lagrangian" to cause the physical path to occur at a maximum instead of a minimum, which is equivalent.
Best Answer
You seem to have a lot of questions, and other responses don't really answer the core so well.
Why is Fermat's principle true? How did Fermat know it?
Assume that you have any medium satisfying the wave equation, $v^2 \nabla^2 f = \ddot f$. This holds for taut strings, for light in the Maxwell equations, for vibrations on a drum, etc.
Then it turns out that this equation is satisfied in one dimension for any function of one argument $f(x \pm v t)$, so long as that is the structure of those arguments. In 3 dimensions we have to use the 3D Pythagorean theorem, but it is still $f(x - v_x t, y - v_y t, z - v_z t)$ as long as $v_x^2 + v_y^2 + v_z^2 = v^2.$
In other words: any "lump" moving along a straight trajectory at speed $v$ in any direction solves the wave equations. And straight lines are the minimum-distance trajectories! So this is already promising!
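For the one-dimensional case the check is a two-line chain rule. Writing $u = x \pm v t$ (the symbol $u$ is introduced here only for this check):

$$\frac{\partial^2}{\partial x^2} f(u) = f''(u), \qquad \frac{\partial^2}{\partial t^2} f(u) = (\pm v)^2 f''(u) = v^2 f''(u),$$

so $v^2\, \partial_x^2 f = \partial_t^2 f$ holds for any twice-differentiable $f$.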
Fermat also knew, from reading Greek sources, that any light reflections follow the minimum-distance path. This is not too hard: we know that the angle of incidence of reflected waves is the same as the angle of reflection; this means that we just need to prove that for any other path it's a longer path. So, suppose we start at $(-1, 0)$, follow a path to some point $(x, 1)$ for some $x$, and then end up at $(+1, 0)$, both of the latter through straight lines: what's special about the $x = 0$ in the middle? We see from the Pythagorean theorem that this total distance is $$d = \sqrt{(x + 1)^2 + 1^2} + \sqrt{(x - 1)^2 + 1^2},$$and even the Greeks could understand (without algebra or calculus) that this expression is at a minimum for $x = 0$. To do it without calculus: if you square both sides you'll find that much of the complexity drops out, leaving just $$d^2 = 2 x^2 + 4 + 2 \sqrt{x^4 + 4}.$$Since $x^4$ and $x^2$ both have minimums at $x=0$ and $\sqrt{\bullet}$ is monotonic (always-increasing, hence preserves minimums/maximums), you can see that the minimum of this expression is likewise $x = 0.$
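In case the squaring step is not obvious, here it is spelled out, using only the expression for $d$ given above:

$$\begin{aligned} d^2 &= \big[(x+1)^2 + 1\big] + \big[(x-1)^2 + 1\big] + 2\sqrt{\big[(x+1)^2+1\big]\big[(x-1)^2+1\big]} \\ &= 2x^2 + 4 + 2\sqrt{(x^2 + 2 + 2x)(x^2 + 2 - 2x)} \\ &= 2x^2 + 4 + 2\sqrt{(x^2+2)^2 - 4x^2} \\ &= 2x^2 + 4 + 2\sqrt{x^4 + 4}. \end{aligned}$$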
Fermat knew about straight lines and knew about reflections, but he was talking to the follower of a mathematician known as René Descartes, who had plagiarized Snell's law (Snell had not published it), giving a crazy derivation which assumed that light moved "slower" in more-dense material even though he thought light traveled infinitely fast everywhere. Both Descartes and Snell had arrived at the same law, that there was some parameter $k$ such that in refraction, $\sin \theta_1 = k~\sin\theta_2$. This was experimentally correct.
Fermat thus had these two ideas: the Greek idea that reflections and straight lines are least-distance paths, and the Cartesian idea that maybe light travels slower in the denser medium. He basically just threw out the idea that light travels infinitely fast, then calculated the time to travel. From a point $(-1, -1)$ through a point $(x, 0)$ into the point $(1, 1)$, we know that Snell's law says that $${1 + x\over\sqrt{1^2 + (1 + x)^2}} = k ~ {1 - x \over \sqrt{1^2 + (1 - x)^2}}$$ for some $k$.
The trick here is, Fermat knew a little calculus. Not too much calculus, but presumably enough to see that the above expressions are hiding a chain rule:$$\frac{d}{dx} \sqrt{1^2 + (1 + x)^2} = -k ~\frac{d}{dx} \sqrt{1^2 + (1 - x)^2}$$or,$$\frac{d}{dx} \left(\sqrt{1^2 + (1 + x)^2} + k \sqrt{1^2 + (1 - x)^2}\right) = 0$$When a derivative equals 0, that means we're at a stationary point, typically a minimum or maximum. Writing $L_1$ and $L_2$ for the two segment lengths and saying that $k = v_1 / v_2$, dividing through by $v_1$ gives directly $\frac{d}{dx} \left(\frac{L_1}{v_1} + \frac{L_2}{v_2}\right) = \frac{d}{dx} \left(T_1 + T_2\right) = 0$. So Fermat was able to work out that Descartes' new law could indeed be derived from the "least total time" principle. And, of course, in a homogeneous medium the least-distance paths of the Greek school were least-time paths too, so all of these paths are least-time paths: hence Fermat's principle.
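The same conclusion can be checked numerically. The sketch below uses the geometry from above, $(-1,-1)$ to $(x, 0)$ to $(1, 1)$, but the particular speeds and the ternary search are assumptions chosen for the illustration. It minimises the total travel time over $x$ and compares the ratio of sines at the minimum with $k = v_1 / v_2$.

```python
import math

def travel_time(x, v1, v2):
    """Time from (-1, -1) to (x, 0) at speed v1, then on to (1, 1) at speed v2."""
    L1 = math.hypot(1 + x, 1)
    L2 = math.hypot(1 - x, 1)
    return L1 / v1 + L2 / v2

def argmin_ternary(f, lo, hi, iterations=200):
    """Ternary search for the minimiser of a function with a single minimum on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# Hypothetical speeds: light is slower in the second (upper) medium.
v1, v2 = 1.0, 0.75
x_star = argmin_ternary(lambda x: travel_time(x, v1, v2), -1.0, 1.0)

# Sines of the angles measured from the normal to the interface y = 0.
sin_1 = (1 + x_star) / math.hypot(1 + x_star, 1)
sin_2 = (1 - x_star) / math.hypot(1 - x_star, 1)

print("crossing point x:", x_star)
print("sin(theta_1) / sin(theta_2):", sin_1 / sin_2)
print("k = v1 / v2:", v1 / v2)
```

The two printed ratios should agree to within the accuracy of the search, which is the statement that the least-time crossing point obeys Snell's law.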
At the time science didn't quite have the "Experiment shows it, therefore it's true" character: instead it was very common for every result to be justified with some sort of mathematical beauty, as a perfect God would surely provide a perfect universe and mathematics was humanity's most pure, perfect, enduring art. So Fermat tried to convince some Cartesians that everything flowed more naturally from his least-time principle, but they thought it was some crazy heuristic, and dubious at best.
Why can't light follow other paths?
In classical electromagnetism, a huge milestone in physics was James Clerk Maxwell proving that light is an electromagnetic wave. In addition to satisfying the wave equation and following straight paths, light in electrically polarizable media turns out to travel at a slightly slower speed than $c$, its speed in vacuum.
Light in electromagnetism turns out to always carry a momentum proportional to $1/\lambda$, where $\lambda$ is its wavelength. So the straight-line paths law amounts to saying that momentum and energy are conserved; the reflection law says that energy is conserved and momentum is only changed by a force perpendicular to the surface of reflection; and it turns out that Snell's law is also all about momentum conservation: wave crests can't leave the interface between media faster than they arrive, so both waves have the same frequency $f$ and their wavelengths go like $\lambda_i = v_i / f$.
So, in classical electromagnetism, we can just say that these come about because of conservation of energy and momentum.
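To spell out the Snell's law case as a short worked equation: take the momentum in medium $i$ to be $p_i \propto 1/\lambda_i$ with $\lambda_i = v_i / f$, and assume (as in the reflection case) that the interface can only change the component of momentum perpendicular to it. Then the component along the interface is unchanged,

$$p_1 \sin\theta_1 = p_2 \sin\theta_2 \quad\Longrightarrow\quad \frac{f}{v_1}\sin\theta_1 = \frac{f}{v_2}\sin\theta_2 \quad\Longrightarrow\quad \sin\theta_1 = \frac{v_1}{v_2}\,\sin\theta_2,$$

which is the refraction law with $k = v_1 / v_2$, the same value that the least-time argument gives.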
Least action principles
A guy named Lagrange came up with a new way to do Newtonian mechanics, requiring a huge extension of calculus called "the calculus of variations." He found out that Newtonian mechanics could often be converted into an "action principle" that assigned to every trajectory of a system through its possible paths a number, called the action of that path. It turned out that Newtonian mechanics just said, "of all the paths that the system could take between these two points, the only ones it does take are paths of least action relative to other paths 'nearby'." The connection is that if you have a potential energy $U_P(t)$ and a kinetic energy $K_P(t)$ both defined on the path $P$ then the action for a path is the time integral of their difference, $S[P] = \int_P~dt \big[K_P(t) - U_P(t)\big].$
The least-action principle works perfectly as a least-time principle if the "action" for light does not depend on anything special, $K_P - U_P = \text{constant}.$ If this is just the frequency of the light, then you trivially get all of these laws.
Quantum Least-Action
Schwinger, Tomonaga, and Feynman shared a Nobel Prize for a theoretical extension of quantum mechanics which is based on action principles. This is probably going to be the simplest you will get for a theoretical basis for why everything follows a least-action principle.
The idea is, suppose that you get really hard-nosed about saying "I am only going to calculate probabilities for a particle, like a photon, to be emitted from a source and absorbed by a detector. Each probability will be based on an amplitude, which consists of a scale factor $s$ times a 2D rotation matrix $R(\theta)$." [This is because rotations are the simplest waves; also a scaled 2D rotation matrix is a complex number.] "The probability associated with the amplitude $s R(\theta)$ will be $s^2,$ and we add these amplitudes by matrix-sums and multiply them by matrix-products. If we have an event that can happen in a bunch of different ways, we use the sum of their amplitudes; if we have an event that depends on a bunch of other events happening in sequence, we use the product of their amplitudes. Otherwise, if the action of a path is $S$ then typically the amplitude is $R(S / \hbar),$ where $\hbar = h / 2\pi$ and $h$ is Planck's constant."
The resulting theory has all of the wavy interference patterns of any wave theory you'd like, but fundamentally works upon particles. In addition, because $h$ is so tiny, sums of amplitudes tend to rotate into oblivion, not generating any useful material for the probability to build upon, unless $S$ is near a stationary value (typically its minimum): so the classical limit of the theory $h \rightarrow 0$ is obviously the least-action principle. For light, the action principle makes this into the rotation matrix $R(2\pi~f~t)$, the simplest wave.
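The "rotating into oblivion" claim can be seen in a toy calculation. The sketch below is an illustration under stated assumptions, not the actual quantum-electrodynamics computation: it reuses the mirror geometry from earlier, $(-1, 0)$ to $(x, 1)$ to $(+1, 0)$, treats each amplitude as the complex number $e^{2\pi i\, d(x)/\lambda}$ (a unit rotation whose angle grows with path length), and picks an arbitrary small wavelength. It then adds up the amplitudes over a window of mirror points around the least-distance point $x = 0$ and over an equally wide window far from it.

```python
import cmath
import math

def path_length(x):
    """Length of the path (-1, 0) -> (x, 1) -> (+1, 0)."""
    return math.hypot(x + 1, 1) + math.hypot(x - 1, 1)

def summed_amplitude(x_lo, x_hi, wavelength, n=4000):
    """Midpoint-rule sum of unit phasors exp(2*pi*i*d/wavelength) over [x_lo, x_hi]."""
    dx = (x_hi - x_lo) / n
    total = 0j
    for i in range(n):
        x = x_lo + (i + 0.5) * dx
        total += cmath.exp(2j * math.pi * path_length(x) / wavelength) * dx
    return total

wavelength = 0.01  # hypothetical value, small compared with the geometry

near = summed_amplitude(-0.2, 0.2, wavelength)  # window around the stationary point
far = summed_amplitude(1.0, 1.4, wavelength)    # same width, away from it

print("|sum| near x = 0:", abs(near))
print("|sum| far from x = 0:", abs(far))
```

The window around $x = 0$ should contribute a much larger magnitude than the window away from it, because away from the stationary point the phase changes quickly from one path to the next and the rotations cancel.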
So that's the most fundamental reason that we know of that light might take the minimum time path: Maybe everything takes all paths, but interfering based on this general "action" quantity, and light's action happens to be just its frequency times the time it has traveled.