The definition in Milnor and Stasheff is a bit of a hybrid between a purely "coordinate chart" definition you alluded to (requiring transition functions to be smooth), and a purely Euclidean space definition, which runs as follows:
An $n$-dimensional manifold is a subset of $\mathbb{R}^A$ (here $A$ may be much bigger than $n$) such that each point has a neighborhood which is the graph of a differentiable function over a suitable coordinate subspace $\mathbb{R}^n\subset\mathbb{R}^A$.
Note that in this definition we only need the standard coordinate planes (it is not necessary to take arbitrary subspaces), by the implicit function theorem.
For example, the circle in the plane is, near every point, the graph of a function of the form $\pm\sqrt{1-t^2}$, either over the $x$-axis or over the $y$-axis.
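To make the circle example concrete, here is one way to write out the four graph descriptions of $S^1=\{x^2+y^2=1\}$ (the choice of these four arcs is only for illustration):
$$
\begin{aligned}
S^1\cap\{y>0\} &= \{(t,\sqrt{1-t^2}) : |t|<1\}, &\quad S^1\cap\{y<0\} &= \{(t,-\sqrt{1-t^2}) : |t|<1\},\\
S^1\cap\{x>0\} &= \{(\sqrt{1-t^2},t) : |t|<1\}, &\quad S^1\cap\{x<0\} &= \{(-\sqrt{1-t^2},t) : |t|<1\}.
\end{aligned}
$$
The first two arcs are graphs over the $x$-axis and the last two are graphs over the $y$-axis. The points $(\pm 1,0)$ lie only in the last two arcs, which is why a single coordinate axis does not suffice: $\sqrt{1-t^2}$ is not differentiable at $t=\pm 1$.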
The "coordinate chart" definition has the advantage that no a priori structure is assumed on $M$ (other than being a set). In particular, the topology results from the smooth structure imposed by the transition functions.
From this point of view, the Milnor-Stasheff definition has the disadvantage that we must already know about topological spaces and the notion of a homeomorphism.
I think your questions will be answered most clearly if we try to identify the assumptions under which the definitions you wrote for differentiability make sense. The first thing to note is, as you wrote, that the first definition tells you when a map is differentiable and what its differential is, while the second definition only tells you when a map is differentiable. A differentiable map from a surface also has a differential, but it is usually discussed separately from the definition of differentiability.
The first definition you have given makes sense for functions $f \colon D \rightarrow \mathbb{R}$ defined on an open subset $D \subseteq V$ of some normed vector space $(V, ||\,||)$. We write $x - a$ and we require the map $L \colon V \rightarrow \mathbb{R}$ to be linear, so $V$ had better have the structure of a vector space, and we use the norm to make sense of the limit and to estimate the size of $x - a$. The interpretation of $L$ we have in mind is that, given $v \in V$ with $||v|| = 1$, $Lv = \frac{d}{dt} f(a + tv)|_{t =0}$ is the directional derivative of $f$ in the direction of $v$.
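For reference, the condition under discussion is presumably the standard one (this is my reading of your first definition, written in the notation above):
$$
\lim_{x \to a} \frac{|f(x) - f(a) - L(x-a)|}{||x - a||} = 0 .
$$
Substituting $x = a + tv$ with $||v|| = 1$ and $t \neq 0$ gives
$$
\frac{|f(a+tv) - f(a) - t\,Lv|}{|t|} = \left| \frac{f(a+tv)-f(a)}{t} - Lv \right| \longrightarrow 0 \quad (t \to 0),
$$
which is exactly the statement that $Lv$ is the directional derivative of $f$ at $a$ in the direction of $v$.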
This definition makes sense even if $D$ is not open, provided we take the limit in $D$, treating $D$ for example as a metric space with the metric induced by the norm $||\,||$. However, this raises many problems. For example, if $D = \{ (x,y,z) \, | \, z = 0 \} \subset \mathbb{R}^3$, then the derivative $L$ won't be unique. The limit $x \to a$ is taken in $D$, so $x - a$ always lies in $D$, and thus the term $L(x-a)$ that appears in the limit depends only on how $L$ acts on the subspace $D$ and not on the whole of $\mathbb{R}^3$. The problem is that in $D$ one can approach a point $a$ only along directions that lie in the $xy$-plane, so it doesn't make sense to require a priori that $L$ be defined on the whole vector space $\mathbb{R}^3$; it should be defined only on the directions that are relevant to the limit. Of course, we can also take $D$ to be something like $\{ (x,|x|,0) \, | \, x \in \mathbb{R} \}$, and then there are only two directions from which we can approach $a = (0,0,0)$ inside $D$, so it doesn't make sense to encode the directional derivatives in an operator that is defined on a vector space.
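The non-uniqueness in the plane example $D = \{z = 0\}$ can be checked directly. Suppose a linear map $L \colon \mathbb{R}^3 \rightarrow \mathbb{R}$ satisfies the limit condition at $a \in D$, and for a constant $c \in \mathbb{R}$ (introduced here only for the illustration) set
$$
L_c(v_1,v_2,v_3) = L(v_1,v_2,v_3) + c\,v_3 .
$$
For every $x \in D$ the difference $x - a$ has vanishing third coordinate, so $L_c(x-a) = L(x-a)$. Hence every $L_c$ satisfies the same limit condition, and the "derivative" is determined only up to the value of $c$.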
That is why $D$ is taken to be an open set: each point $a \in D$ can then be approached inside $D$ from all possible directions, and we say that $f$ is differentiable at $x = a$ if all possible directional derivatives can be "encoded uniformly in a linear operator" $L \colon \mathbb{R}^3 \rightarrow \mathbb{R}$.
Now, if $S \subseteq \mathbb{R}^3$ is a regular surface, it is never open in $\mathbb{R}^3$, so we need a different approach. We think of a regular surface as a two-dimensional object, so to check the differentiability of $f$ at $p$ we intuitively know that we need to check it only with respect to the directions from which the point $p$ can be approached inside $S$. The easiest definition is then to compose $f$ with a coordinate chart around $p$ (which turns $f$ into a function of two variables) and to say that $f$ is differentiable if the composition is differentiable.
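As a small illustration (the surface, chart and function here are my own choice, not taken from your question), let $S$ be the upper unit hemisphere with the chart
$$
\mathbf{x}(u,v) = \left(u,\, v,\, \sqrt{1-u^2-v^2}\right), \qquad u^2+v^2<1,
$$
and let $f \colon S \rightarrow \mathbb{R}$ be the height function $f(x,y,z) = z$. Then
$$
(f \circ \mathbf{x})(u,v) = \sqrt{1-u^2-v^2},
$$
which is an ordinary function of two variables, differentiable on the open unit disk. So $f$ is differentiable on $S$ in the sense of the second definition.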
Connecting to what I wrote before, if you read further, you'll see that the fact that $S$ is a regular surface and not an arbitrary subset of $\mathbb{R}^3$ guarantees that at each $p \in S$ there is a two-dimensional vector subspace $T_pS$ of $\mathbb{R}^3$, called the tangent plane, that consists of the velocities ("directions") of all the curves that pass through $p$ and live on $S$. This tangent plane depends on the point $p \in S$ and changes when we move $p$ around. The differential of $f$ at $p \in S$ will be defined as a linear map $df_p \colon T_pS \rightarrow \mathbb{R}$, and if $f$ is the restriction of a differentiable map $\tilde{f} \colon \mathbb{R}^3 \rightarrow \mathbb{R}$, then under appropriate identifications $df_p$ will be the restriction of $d\tilde{f}_p$ to the two-dimensional subspace of "relevant" directions tangent to $S$.
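One common way to write this down, as a sketch: for a differentiable curve $\alpha \colon (-\varepsilon,\varepsilon) \rightarrow S$ with $\alpha(0) = p$ (the curve $\alpha$ is notation introduced here), set
$$
df_p(\alpha'(0)) = \frac{d}{dt}(f \circ \alpha)(t)\Big|_{t=0} .
$$
If $f = \tilde{f}|_S$, then $f \circ \alpha = \tilde{f} \circ \alpha$, and the chain rule in $\mathbb{R}^3$ gives
$$
df_p(\alpha'(0)) = d\tilde{f}_p(\alpha'(0)) .
$$
For instance, with $\tilde{f}(x,y,z) = z$ on the unit sphere we have $d\tilde{f}_p(v) = v_3$. At the north pole $p = (0,0,1)$ the tangent plane is $\{v_3 = 0\}$, so $df_p = 0$ even though $d\tilde{f}_p \neq 0$ on $\mathbb{R}^3$.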
Best Answer
First of all, you are right to assume that do Carmo's definition can easily be generalized to submanifolds of any $\mathbb{R}^n$, or of any manifold $M$.
The definitions are stated slightly differently in general, so let's see why they agree.
1) Jeffrey Lee's $\Rightarrow$ do Carmo (generalized).
Suppose $p\in S$. Then there is a chart $(U,x)$ in $M$ around $p$ such that $x(U\cap S)=x(U)\cap (\mathbb{R}^k \times \{0\})$; we can perform a translation in $\mathbb{R}^{n}$ ($n=\dim M$) to get $c=0$. Now, just identifying $\mathbb{R}^k$ and $\mathbb{R}^k \times \{0\}$, we have a parametrization in the sense of do Carmo: $y=x^{-1}|_V: V=x(U\cap S)\rightarrow M$, which is clearly $C^{\infty}$ and a homeomorphism onto its image (because it's the restriction of the inverse of a chart, which is everything good you could desire), and its domain is open in $\mathbb{R}^k$ because $x(U)$ is open in $\mathbb{R}^n$. The injectivity of its differential comes from the fact that $x \circ y=\mathrm{Id}_V$; now use the chain rule.
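Spelled out, the chain rule step looks as follows, where $\iota \colon \mathbb{R}^k \rightarrow \mathbb{R}^k \times \{0\} \subset \mathbb{R}^n$ denotes the inclusion used in the identification (the symbol $\iota$ is introduced here). Since $x \circ y = \iota|_V$, for every $q \in V$ we get
$$
dx_{y(q)} \circ dy_q = d\iota_q = \iota .
$$
So if $dy_q(v) = 0$ for some $v \in \mathbb{R}^k$, then $\iota(v) = dx_{y(q)}(dy_q(v)) = 0$, hence $v = 0$, and $dy_q$ is injective.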
2) do Carmo's (generalized) $\Rightarrow$ Jeffrey Lee's.
Suppose you have $p\in S$, $V\subset M$ open and a map $y:U \rightarrow V\cap S$ satisfying conditions 1), 2), 3) of do Carmo's definition, with $U$ an open subset of $\mathbb{R}^k$. You need to guarantee that the map $y^{-1}:V\cap S \rightarrow U\subset \mathbb{R}^k$ is just the restriction of a chart $(W,x)$ around $p$ in $M$. You can prove that as follows.
Since $y$ is $C^\infty$, a homeomorphism onto its image, and its differential is one-to-one, it is a diffeomorphism onto its image (one of the consequences of the Inverse Function Theorem). Now since this work is all local, and $M$ is locally $\mathbb{R}^n$, we can solve the problem in $\mathbb{R}^n$ and then transfer it to $M$ without any difficulty. In this case, once the $y^{-1}$'s are diffeomorphisms from open sets of $S$ to open sets of $\mathbb{R}^k$, you can construct extensions of them by using that $\mathbb{R}^n$ splits orthogonally as $\mathbb{R}^k\times \mathbb{R}^{n-k}$, and use this locally to define a chart on an open subset of $\mathbb{R}^n$ by sliding up and down along segments orthogonal to your open set in $S$, something like $x((p,0)+t(0,v))=y^{-1}(p,0)+t(0,v)$ (your open set may not be contained in $\mathbb{R}^k\times \{0\}$, but I put it like that for the sake of simplicity; I'm sure you can adjust it to work in the "general" case). This works because any diffeomorphism between open subsets of $\mathbb{R}^n$ is a chart of its standard smooth structure.
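One standard way to make the "sliding" idea precise in $\mathbb{R}^n$ is the following sketch. Write $q = y^{-1}(p)$ and $dy_q = \begin{pmatrix} A \\ B \end{pmatrix}$ with $A$ a $k \times k$ block; since $dy_q$ has rank $k$, we may assume after permuting the coordinates of $\mathbb{R}^n$ that $A$ is invertible. Define (the name $F$ is introduced here)
$$
F \colon U \times \mathbb{R}^{n-k} \rightarrow \mathbb{R}^n, \qquad F(u,t) = y(u) + (0,t), \qquad dF_{(q,0)} = \begin{pmatrix} A & 0 \\ B & I_{n-k} \end{pmatrix} .
$$
Then $\det dF_{(q,0)} = \det A \neq 0$, so by the Inverse Function Theorem $F$ is a diffeomorphism from a neighborhood of $(q,0)$ onto a neighborhood $W$ of $p$. Its local inverse $x = F^{-1}$ satisfies $x(y(u)) = (u,0)$, so it extends $y^{-1}$. Shrinking $W$ so that $x(W \cap S)$ is exactly the slice $x(W) \cap (\mathbb{R}^k \times \{0\})$ is where the assumption that $y$ is a homeomorphism onto its image gets used.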
To conclude, both definitions are equivalent even in the general case, where the ambient space is any manifold and your surfaces are actually submanifolds.
Remark: My original answer was wrong in one implication. Full credit to @Thomas, who made me aware of my mistake.