The solver uses a collocation method, that is, at each interval of the mesh it discretizes the differential equation by approximating the solution by a polynomial of a fixed degree and adding the necessary number of equations to equal the number of free parameters by demanding exact equality of the ODE at some interior points of the interval. For the given polynomial ODE this results in algebraic equations for the coefficients of the polynomial. The solver then proceeds to find a numerical solution of the BVP by solving this large non-linear system. After that, the error between the mesh points is estimated and the mesh refined towards a more homogeneous error estimate.
The point now is that the discretized differential equation will have more than one solution. Typically, the other solutions are not smooth, have large jumps between them, so that they are easily detectable by the solver. Or more often, the initial approximation is much closer to the smooth solution than to any of the extraneous solutions.
However, for stiff systems that might no longer be the case. "stiff" means that you have fast and slow components. The fast components require a very small step size to be modeled correctly in the discretization, while the many steps that that requires might lead to an undue accumulation of floating point errors in the other components. So it requires extra effort to find a workable balance.
In the collocation model, the large derivatives of the fast components lead to the situation where even the sampled exact solution is visually not smooth. That is, even some of the normal steps of the exact solution look like big jumps. So the true numerical solution is no longer easily distinguishable from the extraneous jumpy solutions of the algebraic system.
In the given system, the exact solution for large $K$ jumps at $x=1$ from essentially $c_2(x)\approx 0$ to $c_2(1)$ in the last part of the interval of length about $\Delta t=1/\sqrt{K}$. The square root is arbitrary for $1\gg\Delta t\gg1/K$, but captures the essence of the situation.
Close to $x=1$ (there can be no boundary layers at other points of the interval) the slow component $c_1$ can be treated as a constant, so that $c_2$ behaves qualitatively like an exponential function
$$
c_2(x)=c_2(1)\,e^{K\,c_1(1)\,(x-1)}=c_2(1)\,e^{-K\,c_1(1)\,(1-x)}
$$
This situation is captured by $e^{-Kt}$ close to $t=0$, i.e., using the substitution $t=c_1(1)(1-x)$. For $K$ large the function $e^{-Kt}$ goes from the appreciable value $1$ at $t=0$ to the very small value $e^{-\sqrt{K}}$ only a very small distance away at $t=1/\sqrt{K}$. Even more, if one wants the steps in the function values of $e^{-K\cdot n\,dt}$ to be small, one would need $K\,dt$ to be small, $1\gg K\,dt\gg 1/K$, for instance by choosing $dt=K^{-3/2}$.
The mesh refinement algorithm would have to detect this situation. However, without telling the algorithm about this situation (for instance by structuring the initial mesh this way), it would have to create a sizable number, up to $O(K)$, of mesh points with a distance of $K^{-1-\epsilon}$ close to $x=1$ to zero-in to the singularity. For $K=10^5$ this would requite mesh points in the order of some thousands or ten thousands, which will likely exceed any moderate size of Nmax.
The characteristic ODEs are
$$\frac{dx_{1}}{\ln x_{2}} = \frac{dx_{2}}{x_{2} u} = \frac{du}{u}$$
The last equality implies
ln u
$$u - \ln x_{2} = C_{1}$$
The first and third ratios imply
\begin{align}
d x_{1} &= \frac{(u - C_{1}) du}{u} \\
\implies u - C_{1} \ln u - x_{1} &= C_{2} \\
\implies u - C_{1} + C_{1} - C_{1} \ln u - x_{1} &= C_{2} \\
\implies \ln x_{2} + (u - \ln x_{2}) (1 - \ln u) - x_{1} &= C_{2} \\
\end{align}
and hence
$$\ln x_{2} + (u - \ln x_{2})(1 - \ln u) - x_{1} = f(u - \ln x_{2})$$
for some arbitrary differentiable function $f$. You can check this satisfies the PDE by differentiation
\begin{align}
u_{x_{1}}(1 - \ln u) - (u - \ln x_{2}) u_{x_{1}} u^{-1} - 1 &= u_{x_{1}} f' \\
\implies u_{x_{1}} \left(1 - \ln u - (u - \ln x_{2}) u^{-1} - f' \right) &= 1 \\
\implies u_{x_{1}} &= \frac{1}{\left(1 - \ln u - (u - \ln x_{2}) u^{-1} - f' \right)} \\
x_{2}^{-1} + (u_{x_{2}} - x_{2}^{-1})(1 - \ln u) - (u - \ln x_{2}) u_{x_{2}} u^{-1} &= (u_{x_{2}} - x_{2}^{-1}) f' \\
\implies u_{x_{2}} \left(1 - \ln u - (u - \ln x_{2}) u^{-1} - f' \right) &= - x_{2}^{-1} (\ln u + f') \\
\implies u_{x_{2}} &= \frac{- x_{2}^{-1} (\ln u + f')}{\left(1 - \ln u - (u - \ln x_{2}) u^{-1} - f' \right)}
\end{align}
so that
\begin{align}
\implies \ln x_{2} u_{x_{1}} + x_{2} u u_{x_{2}} &= \frac{\ln x_{2} - u (\ln u + f')}{\left(1 - \ln u - (u - \ln x_{2}) u^{-1} - f' \right)} \\
&= \frac{u (\ln x_{2} - u (\ln u + f'))}{\left(u - u \ln u - (u - \ln x_{2}) - u f' \right)} \\
&= \frac{u (- u \ln u + \ln x_{2} - u f')}{\left(- u \ln u + \ln x_{2} - u f' \right)} \\
&= u
\end{align}
I leave the domain of convergence and initial condition to you (hint for initial condition; the solution to the PDE can also be written as $$u - \ln x_{2} = g(\ln x_{2} + (u - \ln x_{2})(1 - \ln u) - x_{1})$$ for some arbitrary differentiable function $g$, then note that evaluating both sides at $(x_{1}, x_{2}, u) = (t + 1, e, 1)$ makes the LHS vanish identically).
Best Answer
It's certainly possible to turn these three second-order differential equations into six first-order ones. Your differentiation function will therefore have six inputs --- three coordinates and their three first derivatives --- and six outputs. (I think Matlab uses a vector of length six.) The first three are trivial: the velocities your function spits out should be the velocities you put in:
\begin{align} \frac d{dt} r &= r' \tag1\\ \frac d{dt} \theta &= \theta' \tag2\\ \frac d{dt} y &= y' \tag3 \end{align}
The other three derivatives you have to isolate from your equations of motion. The simplest route seems to want two auxiliary definitions. From your first equation,
$$ \Delta = r'' - \frac Mm y'' = r\theta'^2 + g\left(\sin\theta - \frac Mm \right). $$
From (the derivative of) your second,
$$ \Sigma = r'' + y'' = -\frac d{dt}(r\theta') = - r'\theta' - r\theta''. $$
(If I'm misreading your second equation and you mean a constant $R$ rather than the dynamical $r$, that's an easy substitution here, with $R' = 0$; note also $\theta''$ below.) Then the derivatives of the velocities that you input are
\begin{align} \frac d{dt} r' &= \frac{M\Sigma + m\Delta}{M+m} \tag4 \\ \frac d{dt} \theta' &= \frac{-2 r'\theta'- g\cos\theta}r \tag5 \\ \frac d{dt} y' &= \frac{M\Sigma - m\Delta}{2M} \tag6 \\ \end{align}
player100 points out in a comment that this system of equations actually isn't sensitive to a constant offset in $y$: the system is first-order in $y'$ rather than second-order in $y$. However, if $y$ is what interests you, then it's probably better to include it in your integrator than to leave it out. Integrating algorithms which make decisions about dynamically changing step sizes do so based on the error terms in the data that they have to look at.
You ask in a comment about when, in general, we can determine whether a system of equations is under/overdetermined; I think that's an interesting question, but too broad for this answer.
Note that a previous version of this answer put the "trivial" derivatives in the wrong place. Sorry about that --- it's been a long time since I
ode45
'd anything, and I'm rusty.