A function $f:\>x\mapsto f(x)$ given by some expression has a "natural" domain of definition $D(f)$: the set of all $x$ in the realm of discourse (${\mathbb R}$ or ${\mathbb C}$, say) for which $f(x)$ can be evaluated without asking questions. In most cases $f$ is continuous throughout $D(f)$, which means that for all $x_0\in D(f)$, when $x$ is sufficiently near $x_0$ then $f(x)$ is very near to $f(x_0)$.
Now some $f$'s may have "exceptional points" where they are not continuous, e.g., the sign-function, which is defined on all of ${\mathbb R}$, but is discontinuous at $0$. Above all, the set $D(f)$ may have "real" or "virtual" boundary points, where $f$ is a priori undefined. But nevertheless we have the feeling that $f$ has a "reasonable" behavior in the neighborhood of such a point. Examples are $x\mapsto{\sin x\over x}$ at $x=0$ (a "real" boundary point of $D(f)$), or $x\mapsto e^{-x}$ when $x\to\infty$ (here $\infty$ is a "virtual" boundary point of $D(f)$).
All in all, the concept of "limit" is a tool to handle such "exceptional", or "limiting", cases. An all-important example is of course the following: When $f$ is defined in a neighborhood of $x_0$ we are interested in the function
$$m:\quad x\mapsto{f(x)-f(x_0)\over x-x_0}$$
which has an "exceptional" point at $x_0$. It is impossible to plug $x:=x_0$ into the definition of $m$.
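For a concrete instance, take $f(x)=x^2$: the quotient simplifies for every $x\ne x_0$, and the "exceptional" point can be approached even though it cannot be plugged in:
$$m(x)={x^2-x_0^2\over x-x_0}=x+x_0\qquad(x\ne x_0),\qquad \lim_{x\to x_0}m(x)=2x_0.$$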
This brings me to your point 4, which gets to the heart of the matter. I'd rewrite the central sentence as follows: In the definition of the limit of $f(x)$ for $x\to c$ it says that I can make $f(x)$ as close to the value $L$ as I wish, as long as I'm willing to make $x$ sufficiently close to $c$. The idea is: While it is in most cases impossible to put $x:=c$ in the definition of $f$, we want to describe how $f$ behaves when $x$ is very close to $c$.
You then go on to say that "this definition is supposed to be mathematically rigorous, but using these as close and sufficiently close don't look rigorous to me".
The whole $\epsilon$-$\delta$ business serves exactly the purpose of making rigorous the colloquial handling of "as close" and "sufficiently close" that you are lamenting.
Life would be simpler if we could define $\lim_{x\to c}f(x)=L$ by the condition $|f(x)-L|\leq |x-c|$, or maybe $|f(x)-L|\leq 100|x-c|$. But four centuries of dealing with limits have taught us that the $\epsilon$-$\delta$ definition of limit, arrived at only around 1870 or so, captures our intuition about them in an optimal way. It also takes care of the unforeseeable cases when the error $|f(x)-L|$ can be made as small as we want, but we need an extra effort in the nearness of $x$ to $c$, e.g., $|x-c|<\epsilon^2$ instead of $|x-c|<{\epsilon\over100}$.
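To see the machinery at work, here is the standard verification (added purely as an illustration) that $\lim_{x\to c}x^2=c^2$: given $\epsilon>0$, choose $\delta=\min\bigl(1,\,{\epsilon\over 2|c|+1}\bigr)$. Then $|x-c|<\delta$ gives
$$|x^2-c^2|=|x-c|\,|x+c|\le|x-c|\bigl(|x-c|+2|c|\bigr)<\delta\,(1+2|c|)\le\epsilon.$$
Note that $\delta$ depends on $c$ as well as on $\epsilon$: the same "extra effort" phenomenon mentioned above.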
You have two conflicting goals here. If $y$ is arbitrary, then $x^y$ only
makes sense for $x>0$. Imagine, for example, that $y = \frac{1}{2}$.
Then $x^y = \sqrt{x}$; what does that mean for negative $x$?
Note that switching to complex numbers doesn't help much: negative numbers do have
square roots then, but those are non-unique, and what's worse, the number
of solutions depends heavily on the exponent! E.g., $y^n = x$ has $n$ solutions in
$\mathbb{C}$. Which one is $x^{\frac{1}{n}}$ supposed to be?
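Concretely: writing $x=|x|e^{i\theta}$ with $0\le\theta<2\pi$ and $x\ne0$, the $n$ solutions of $y^n=x$ are
$$y_k=|x|^{1/n}\,e^{i(\theta+2\pi k)/n},\qquad k=0,1,\dots,n-1,$$
equally spaced around a circle; the equation itself does not single out any one of them.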
So you'll have to distinguish between two cases. One is $f(x)^{g(x)}$ for
positive $f$, and the other is $f(x)^k$ for constants $k \in \mathbb{Z}$
(i.e., no fractional exponents). You could generalize the second case to
$f(x)^{g(x)}$ for functions $g$ which take only integral values, but since
such functions are either locally constant or discontinuous, that case isn't really
interesting for purposes of differentiation, I think.
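For concreteness, here are the standard derivative formulas behind the two cases (with $f,g$ as above). For positive $f$, writing $f^g=e^{g\ln f}$ gives
$$\bigl(f(x)^{g(x)}\bigr)'=f(x)^{g(x)}\Bigl(g'(x)\ln f(x)+g(x)\,{f'(x)\over f(x)}\Bigr),$$
while for a constant $k\in\mathbb{Z}$ the power rule $\bigl(f(x)^k\bigr)'=k\,f(x)^{k-1}f'(x)$ needs no positivity assumption (for negative $k$ it holds wherever $f\ne0$).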
BTW, a far more interesting (and maybe solvable!) question is how to deal with
non-negative $f$, which nevertheless may take the value zero. $f(x)^{g(x)}$ is perfectly well-defined for those, but you'll still run into problems with the logarithm. Now, in some cases these problems are due to the fact that the derivative does not, in fact, exist at these points. But not in all cases! For example, the function $x^2$ (take $f(x)=x^2$ and $g(x)=1$) has derivative $0$ at $x=0$. The reason is, basically, that since $g$ is constant in this case, the term $g'\ln f$ doesn't matter, because $g'=0$, and similarly the term $g\frac{f'}{f}$ is tamed by the factor $f^g$ in front of it. But you can't just cancel things that way in all cases; that will sometimes produce wrong answers, because it actually depends on how fast things go to zero or to infinity, respectively.
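Spelled out for that example, with $f(x)=x^2$ and $g(x)=1$ the formula above reads
$$f^g\Bigl(g'\ln f+g\,{f'\over f}\Bigr)=x^2\Bigl(0\cdot\ln x^2+{2x\over x^2}\Bigr)=2x,$$
which is undefined at $x=0$ as written, yet after cancelling extends continuously to the correct value $0$ there.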
You might ask, then, why the non-uniqueness mentioned above doesn't prevent us from sensibly defining $\sqrt[n]{x}$; after all, $y^n = x$ has two solutions for positive $x$ even in $\mathbb{R}$. The reason is twofold:
1. The number of solutions doesn't explode as badly: we have one solution of $y^n = x$ for odd $n$, and two for even $n$.
2. There's an order on $\mathbb{R}$, which makes the definition of $\sqrt[n]{x}$ as the (unique!) positive solution of $y^n = x$ quite natural.
The effect of (1) and (2) is, for example, that while it's not true for even $n$ that $\sqrt[n]{x^n} = x$, we do at least get $\sqrt[n]{x^n} = |x|$. Trying to do the same over the complex numbers fails horribly. We could attempt to define $\sqrt[n]{x}$ as the solution of $y^n = x$ with the smallest angle (assuming we agree to measure angles counter-clockwise from the positive real axis). But then an $n$-th root always has an angle smaller than $\frac{2\pi}{n}$, so $\sqrt[n]{x^n}$ and $x$ would have very little in common except that their $n$-th power is $x^n$.
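For a concrete failure, take $x=i$ and $n=4$: then $x^4=1$, the fourth roots of $1$ are $1,i,-1,-i$, and the smallest-angle rule picks
$$\sqrt[4]{x^4}=\sqrt[4]{1}=1\ne i=x,$$
and no modulus-taking rescues the relation, since all four roots have modulus $1$.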
First let's try to understand why the derivative of the function $f$ given by $f(x) = x^2$ is equal to $2x$ and not to $x$. (The product rule and the power rule are both generalizations of this.)
Imagine that you have a square whose sides have length $x$. Now imagine what happens to its area if we increase the length of each side by a small amount $\Delta x$. We can do this by adding three regions to the picture: two thin rectangles measuring $x$ by $\Delta x$ (say one on the right of the square and another on the top) and one small square measuring $\Delta x$ by $\Delta x$ (say added in the top right corner). So the change in the area $x^2$ is equal to $2x \cdot \Delta x + (\Delta x)^2$. If we divide this by $\Delta x$ and take the limit as $\Delta x$ approaches zero, we get $2x$.
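In symbols, the picture says
$$(x+\Delta x)^2-x^2=2x\,\Delta x+(\Delta x)^2,\qquad {(x+\Delta x)^2-x^2\over\Delta x}=2x+\Delta x\;\longrightarrow\;2x\quad(\Delta x\to0).$$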
So geometrically what is happening is that the small square in the corner is too small to matter, but you have to count both rectangles. If you only count one of them, you will get the answer $x$; however, this only tells you what happens when you lengthen, say, the horizontal sides and not the vertical sides of your square to get a rectangle. This is a different problem than the one under consideration, which asks (after we put it in geometrical terms) how the area varies as we lengthen all the sides.
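The product rule mentioned at the start comes from the same picture with an $f(x)$-by-$g(x)$ rectangle in place of the square: writing $\Delta f=f(x+\Delta x)-f(x)$ and $\Delta g=g(x+\Delta x)-g(x)$, we get exactly
$$\Delta(fg)=f\,\Delta g+g\,\Delta f+\Delta f\,\Delta g\quad\Longrightarrow\quad (fg)'=f\,g'+g\,f',$$
where once more the corner term $\Delta f\,\Delta g$ is too small to matter after dividing by $\Delta x$ and letting $\Delta x\to0$.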