$\lambda f.\lambda x. x$ is a function, not a number. But the point is that we can use it as a model of a number.
In modern mathematics, we understand things not as what they are, but as what they do. We don't spend a lot of time talking about the intrinsic essence of the number 1, or asking "but what is the number 1?" That's a question for philosophers, maybe.
Instead, we ask if something behaves the way we want the number 1 to behave.
What do we need to say that some collection of objects behaves like numbers? Well, at the very least, we would like there to be an infinite collection of the objects, all distinguishable, and we would like to be able to do addition and multiplication and the like. If we take the object that we identify as being the number 1, and add it to itself, we should get the object that we identify as being the number 2. If it doesn't do that, then it isn't very much like numbers.
You may want to consider the way computers handle numbers. The number 1 inside the computer is represented as some pattern of electron flow through some chunk of silicon. What does that have to do with the number 1? Is it really a number 1? Hard to say! But it does act like a number 1 in the following sense: we can ask the computer to add 1+1 and it will do some magic thing, and the result will be the pattern of electrons that has been designated as representing the number 2. But what do the electrons have to do with numbers? Are they really numbers? Is there 1 electron somewhere, and we add to it another 1 electron to get 2 electrons? (No.) Is the magic thing really addition? It's hard to see how, but for most purposes, we don't care.
The Church numerals are similar. Whether or not they are numbers isn't the question. The question is whether they act like numbers, or rather whether they can be made to act like numbers. For example, if you define
$$\begin{align}
\operatorname{plus} & = \lambda m n f x . (m f (n f x))\\
1 & = \lambda f x . f x\\
2 & = \lambda f x . f (f x)
\end{align}$$
you get a plus function: $\operatorname{plus} 1\, 1$ is in fact the same function as $2$. And that is because this thing we are calling “$2$” is essentially the operation that applies some function twice to an argument, the “$1$” is the operation that applies a function once to an argument, and $plus$ takes two numerals $m$ and $n$, and then a function and an argument, and applies the function first $n$ times to the argument and then applies the function $m$ times to the result, for a total of $m+n$ applications. So $\operatorname{plus} 1\, 1$ does the same thing as $2$ when applied to any arguments $f$ and $x$ (namely, it computes $f(f(x))$) and is therefore the same function; in the $\lambda$-calculus, the two objects are indistinguishable.
There's nothing special about this definition; there are many different ways one could define numbers in $\lambda$-calculus. For example:$\def\pair{\operatorname{pair}}$
$$
\begin{align}
0 & = \pair true \;true \\
1 & = \pair false \;0 \\
2 & = \pair false \;1 \\
3 & = \pair false \;2 \\
& \cdots
\end{align}
$$
This represents each number $n$ as a linked list of length $n$ falses followed by a true. We can test if a number is zero by examining the head of the list; it's $true$ if the number is zero and false otherwise. We can easily define a successor (“$+1$”) operation: $$\operatorname{successor} = \lambda n . \pair false\; n$$ and with enough ingenuity we can define addition and multiplication functions. Is this somehow less legitimate than the Church numerals? No; it's just less convenient. Do we prefer the Church numerals because they better capture the true essence of numbers? No, we only prefer the Church numerals because definition of $plus$ is simpler.
And again: What does it mean for something to be true or false? In this context we are not concerned with abstract truth or falsity. We just want something that behaves in a certain way: We want true and false to be different, and distinguishable, and we want to have some sort of if-then-else construction so that $\mathbf{if}\; b\; \mathbf{then}\; p \;\mathbf{else}\; q$ evaluates to $p$ when $b$ is true, and to $q$ when $b$ is false. There is nothing more to it than that. The definitions
$$
\begin{align}
\mathbf{true} & = \lambda x y . x \\
\mathbf{false} & = \lambda x y . y\\
\mathbf{if}\; b\; \mathbf{then}\; p \;\mathbf{else}\; q & = \lambda b p q . b p q
\end{align}
$$
do exactly that, and that's all we require of them. We could just as easily have defined them this way:
$$
\begin{align}
\mathbf{true} & = \lambda x y . y \\
\mathbf{false} & = \lambda x y . x\\
\mathbf{if}\; b\; \mathbf{then}\; p \;\mathbf{else}\; q & = \lambda b p q . b q p
\end{align}
$$
Notice that I switched the definitions of $\mathbf{true}$ and $\mathbf{false}$ here—$\mathbf{true}$ now has the definition that usually belongs to $\mathbf{false}$, and vice versa—but I also changed the definition of $\mathbf{if}$ to match, so that $\mathbf{if}\; b\; \mathbf{then}\; p \;\mathbf{else}\; q$ will still be $p$ when $b$ is $\mathbf{true}$. Is this somehow saying that truth and falsity are interchangeable? No, it says nothing of the sort. It says that in our model they are interchangeable. But nobody ever said that the model was supposed to capture all the properties of truth, like that it is true that snow is white. It's only intended to capture the one property of truth that we intended to model, which is that $\mathbf{if}\; b\; \mathbf{then}\; p \;\mathbf{else}\; q$ is $p$ whenever $b$ is true, and $q$ whenever $b$ is false.
There's no difference. Lambda Calculus can't tell if you write $\lambda x . M$, if you write $f(x) = M$ or if you write $\hat{x} . M$ (which apparently was church's original notation? I haven't fact checked that, though).
The reason people care about lambda calculus is because it has exactly 3 combinatorial rules from which all computation follows. The dream is that, by studying these (very simple!) combinatorial rules (and we as mathematicians, and arguably as humans, are very good at reasoning about combinatorics) we can understand something about computation itself.
If you compare even the length of the definition of lambda calculus to the length of the definition of turing machines you'll see why people like this framework.
As a last reason, it's very flexible. We have the 3 basic combinatorial rules, but if we want to express other properties, or make certain operations primitive, it's very easy to extend the lambda calculus to make reasoning about more complicated computation easy.
There's a deep connection between lambda calculus and category theory that's also interesting and worth studying in its own right, but that's a longer discussion.
I hope this helps ^_^
Best Answer
Welcome to Math.SE!
You have correctly intuited that we somehow need to end up with two functions of equal "arity"* to solve this problem. However, you don't need (nor can you have) a term that checks for the number of arguments that its input takes.
If you play around with the standard combinators (Example 2.10) for a bit, you'll notice the following:
so if you choose the term $f$ to have "arity" 3, the lambda expression $(\lambda e.Sefe)$ will always reduce to something of "arity" 2 when applied to $I$ or $K$. We don't need to check the arity: we can just arrange for it to be a fixed, known quantity. From here on, a clever choice of $f$ will lead us to the solution.
Set $f$ to $(\lambda a. \lambda b. \lambda c. c)$. We have
Consequently, the term $\lambda e.Se(\lambda a. \lambda b. \lambda c. c)eyx$ has the desired property (you can simplify it further if you wish).
* The concept of "arity" is of course not precise. Dwelling too much on it would potentially mislead, and at best distract from the relevant considerations. E.g. does it really make sense to say that $I$ has arity 1? After all, one could equally well say that $I$ has arity 3 since it takes 3 arguments in an expression such as $I(\lambda a. \lambda b.a)pq$.