What is the point of having values in the codomain that cannot be output by the function? How does that aid in describing the function?
Here are a few reasons why we allow some functions to not be surjective.
As Lubin mentioned, the range of a function can be difficult to determine. For example, determining the range of a polynomial of high even degree (such as $P(x) = x^6 - 3x^2 + 6x$) amounts to finding the zeroes of a high-degree polynomial (such as $P'(x) = 6x^5 - 6x + 6$, whose roots are not expressible as radicals), a difficult task in general. We could get around this by defining the codomain of every function $f$ to be $\operatorname{im} f$ (that is, $\{y\,:\,f(x)=y\text{ for some }x\in X \}$), but that doesn't really add any information.
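To make the difficulty concrete, here is a small numerical sketch (my own illustration, not part of the answer): since $P$ has even degree and positive leading coefficient, its range is $[P(x_{\min}),\infty)$, where $x_{\min}$ is the unique real root of $P'$ — a number we can only approximate, here by bisection.

```python
# Approximate the minimum of P(x) = x^6 - 3x^2 + 6x by finding the single
# real root of P'(x) = 6x^5 - 6x + 6, which is not expressible in radicals.

def P(x):
    return x**6 - 3 * x**2 + 6 * x

def dP(x):
    return 6 * x**5 - 6 * x + 6

# P' changes sign exactly once, on [-2, -1]: dP(-2) = -174 < 0 < 6 = dP(-1).
lo, hi = -2.0, -1.0
for _ in range(80):            # bisection to high precision
    mid = (lo + hi) / 2
    if dP(mid) < 0:
        lo = mid               # root lies to the right of mid
    else:
        hi = mid               # root lies to the left of mid

x_min = (lo + hi) / 2
print(x_min, P(x_min))         # x_min ≈ -1.167, P(x_min) ≈ -8.56
```

So the range is approximately $[-8.56,\infty)$ — a fact we can only state numerically, which is exactly why redefining every codomain to be the image adds no usable information.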
It's nice to separate surjective functions from other functions because surjective functions are dual to injective functions. When I say "dual" I'm referring to, for example, the following fact: a function $f:A\to B$ is injective if and only if there is a function $g:B\to A$ such that $g\circ f=1_A$ (by $1_A$ I mean the identity function on $A$); a function $f:A\to B$ is surjective if and only if there is a function $g:B\to A$ such that $f\circ g=1_B$. When you study the branch of mathematics known as category theory, you'll see that it's very natural to have dual properties like this.
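To make the duality concrete, here is a small sketch on finite sets (my own illustration, with dictionaries standing in for functions): an injective $f$ admits a left inverse, and a surjective function admits a right inverse.

```python
# Functions between small finite sets, encoded as dicts (illustrative example).

# f : {0,1} -> {0,1,2} is injective; g is a left inverse, so g∘f = identity on {0,1}.
f = {0: 0, 1: 2}
g = {0: 0, 1: 0, 2: 1}          # g's value at 1 (not in the image of f) is arbitrary
assert all(g[f[a]] == a for a in [0, 1])

# h : {0,1,2} -> {0,1} is surjective; s is a right inverse, so h∘s = identity on {0,1}.
h = {0: 0, 1: 1, 2: 0}
s = {0: 0, 1: 1}                # s picks one preimage for each element of {0,1}
assert all(h[s[b]] == b for b in [0, 1])
```

Note the asymmetry: $g$ had a free choice off the image of $f$, while $s$ had a free choice among the preimages under $h$ — two faces of the same duality.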
Does this also mean that the domain can include numbers that are not inputs to the function?
As others have remarked, the domain of a function can include objects other than numbers. For example, you could define a function which takes a person as input and returns that person's age. In any case, a function must be defined on all possible input values, so the answer to your second question is no.
And is it also then true that if a function is "onto", the codomain is the same as the image? So surely any function can be made "onto" if you just change what the codomain is?
That's exactly right. You can make any function onto by changing the codomain. But as I remarked earlier, in general we don't know what the image of a function is and so it doesn't add any information to restrict the codomain.
What I'm really trying to ask I guess is the range/image of a function is defined by the function, what defines the codomain?
The codomain usually arises naturally in the definition of the function. For example, whenever you have a function which returns a number, the natural choice of codomain is $\mathbb R$. Of course, if by "number" you mean "complex number" then the codomain could be $\mathbb C$; if by "number" you mean "quaternion" then the codomain could be $\mathbb H$.
On the other hand, owing to the set-theoretic fact that "there is no set containing everything," it's not possible to pick a single universal codomain for functions.
When I wrote up this answer I realized that I used to ask the same questions as you, but I stopped once I had learned enough mathematics. I can't give you a single profound reason why we don't make all functions surjective besides a pragmatic one: surjectivity is a useful notion, and getting rid of it would be unprofitable.
Yes, this has plenty to do with the derivative. In particular, what you describe is the backwards difference operator, which is just defined as
$$\nabla f(n)=f(n)-f(n-1).$$
This is an operator of interest on its own, but the connection to calculus is that we can consider this as telling us the "average" slope between $n-1$ and $n$.
What you are doing is iterating the operator. In particular, one often writes
$$\nabla^{k+1} f(n)=\nabla^k f(n)-\nabla^k f(n-1)$$
to mean that $\nabla^k f(n)$ is the result of applying this operator $k$ times. For instance, one has that $\nabla^3 n^3 = 6$, as you note. More generally $\nabla^k n^k = k!$, and this lets us recover a polynomial function from its table, which is what you were up to in sixth grade.
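As a quick sanity check, here is a short Python sketch (my own, not part of the answer) that iterates $\nabla$ on $n^3$ and $n^4$ and watches the constants $3!$ and $4!$ appear:

```python
def backward_diff(f):
    # Return the function ∇f, where (∇f)(n) = f(n) - f(n-1).
    return lambda n: f(n) - f(n - 1)

def nabla_k(f, k):
    # Apply the backward difference operator k times.
    for _ in range(k):
        f = backward_diff(f)
    return f

d3_cube = nabla_k(lambda n: n**3, 3)
print([d3_cube(n) for n in range(3, 8)])     # [6, 6, 6, 6, 6]  =  3! repeated

d4_quartic = nabla_k(lambda n: n**4, 4)
print([d4_quartic(n) for n in range(4, 9)])  # [24, 24, 24, 24, 24]  =  4! repeated
```

This is exactly the "table of differences" trick: keep differencing a degree-$k$ polynomial's values and after $k$ steps the row becomes the constant $k!$ times the leading coefficient.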
However, we can take things further by trying to interpret these numbers - and there is a natural interpretation. For instance, $\nabla^2 f(n)$ represents how quickly $f$ is "accelerating" over the interval $[n-2,n]$, since it tells us how the average slope changes between the interval $[n-2,n-1]$ and the interval $[n-1,n]$. If we keep going, we get that $\nabla^3 f(n)$ tells us how the acceleration changes between the intervals $[n-3,n-1]$ and $[n-2,n]$. We can keep going like this for physical interpretations.
However, this operator has a problem: We'd like to interpret the values as accelerations or as slopes, but $\nabla^k f(n)$ depends on the values of $f$ across the interval $[n-k,n]$. That is, it keeps taking up information from further and further away from the point of interest. The way one fixes this is to try to measure the slope over a smaller distance $h$ rather than measure it over a length of $1$:
$$\nabla_h f(n)=\frac{f(n)-f(n-h)}{h}$$
which is now the average slope of $f$ between $n-h$ and $n$. So, if we make $h$ smaller, we start to need to know $f$ across a smaller range. This gives better meanings to higher order differences like $\nabla_h^k f(n)$, since now they only depend on a small portion of $f$.
The derivative is just what happens to $\nabla_h$ when you send $h$ to $0$. It captures only local information about the function - so, it captures instantaneous slope or instantaneous acceleration and so on. In particular, one can work out that $\nabla f(n)$ is just the average of the derivative over the interval $[n-1,n]$. One can also work out that $\nabla^2 f(n)$ is a weighted average* of the second derivative over the interval $[n-2,n]$ and $\nabla^3 f(n)$ is another weighted average of the third derivative over $[n-3,n]$.
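Here is a small sketch of that limit (my own illustration; the helper `nabla_h` is just the formula above). For $f(x)=x^2$ at $x=1$, the algebra gives $\nabla_h f(1) = \frac{1-(1-h)^2}{h} = 2-h$, which tends to the derivative $f'(1)=2$ as $h\to 0$:

```python
def nabla_h(f, x, h):
    # Average slope of f over [x - h, x].
    return (f(x) - f(x - h)) / h

f = lambda x: x * x
for h in [1.0, 0.5, 0.1, 0.01, 0.001]:
    print(h, nabla_h(f, 1.0, h))   # equals 2 - h (up to rounding), tending to f'(1) = 2
```

Shrinking $h$ both improves the approximation and shrinks the window of values of $f$ the difference depends on, which is exactly the point made above.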
In particular, if the $k^{th}$ derivative is constant, then it coincides with $\nabla^k f(n)$. One can also show that if the $k^{th}$ derivative is linear, then $\nabla^k f(n)$ differs from it by at most a constant. In short, $\nabla$ is good at capturing "global" effects (like the highest-order term of a polynomial and its coefficient) but bad at capturing "local" effects (like instantaneous changes in the slope). So, in some sense, $\nabla$ is just a rough approximation of the derivative: it has similar interpretations, but doesn't work nearly as cleanly.
(*Unfortunately, "weighted average" here is hard to explain rigorously without calculus. For the benefit of readers with more background, I really mean "convolution" assuming that $f$ is actually differentiable enough times for any of this to make sense)
Best Answer
For something to be a function, you need one output for each input. You do not need vice versa. Your equation $x^2+y^2=1$ does not define $x$ or $y$ as a function of the other because (for example) if I give you an $x$ there are $0$ or $2$ values for $y$ that satisfy the equation. If you write $y=\sqrt{1-x^2}$ for $-1 \le x \le 1$ you have a valid function because we have restricted the $y$ values to $[0,1]$ and for any $x$ in the domain there is only one $y$ in the range that corresponds. It is still not vice versa because for $y=\frac 35$ you can have either $x=\frac 45$ or $x=-\frac 45$, but that is not a problem.
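A quick Python sketch of the same point (my own illustration): at $x=\frac 35$ the relation $x^2+y^2=1$ is satisfied by two $y$-values, so it is not a function of $x$; restricting to $y=\sqrt{1-x^2}$ picks one output per input, but the resulting function is still not injective.

```python
import math

x = 3 / 5
ys = [math.sqrt(1 - x**2), -math.sqrt(1 - x**2)]
print(ys)                            # two y-values (±4/5) satisfy the relation: not a function

f = lambda x: math.sqrt(1 - x**2)    # restrict to the upper semicircle: now a function
print(f(4 / 5), f(-4 / 5))           # both ≈ 3/5: one output per input, but not injective
```

So "one output per input" is the defining requirement; "one input per output" (injectivity) is a separate, optional property.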