It's a mixture of three distributions, and can be found pretty easily by brute force, if one allows oneself to handwave over some important details (e.g., "is $\lambda < \mu$").
Let $n$ be the number of customers in the system; $n=0$ means no-one is being served or waiting, $n=1$ means one customer is being served but no-one is waiting, etc. Let $p_n$ be the steady-state (assumed from here on) probability that $n$ customers are in the system.
Clearly, if $n=0$, the time $t$ to the next service start is distributed $f(t|n=0) = \text{Exp}\{\lambda\}$, and if $n \geq 2$, the time to the next service start is distributed $f(t|n \geq 2) = \text{Exp}\{\mu\}$, since the next service will start immediately upon completion of the current one. If $n=1$, the time to the next service start is the maximum of the time to the next arrival and the time to the next service completion. This latter distribution has the following easily-derived form:
$f(t|n=1) = \lambda e^{-\lambda t} + \mu e^{-\mu t} - (\lambda + \mu)e^{-(\lambda+\mu)t}$
The mixture probabilities correspond to $p_0$, $1-p_0-p_1$, and $p_1$ respectively. The probabilities $p_0$, $p_1$, and $1-p_0-p_1$ can be written as:
$$p_0 = \left[\sum_{k=0}^{\infty}\left({\lambda \over{\mu}}\right)^k\right]^{-1} = 1 - {\lambda\over{\mu}}$$
$$p_1 = \left({\lambda \over{\mu}}\right)p_0 = {\lambda\over{\mu}} - \left({\lambda\over{\mu}}\right)^2$$
$$1-p_0-p_1 = \left({\lambda\over{\mu}}\right)^2$$
Writing the whole thing out, with some rearranging of terms, gives:
$$f(t) = {\lambda(\mu-\lambda)\over{\mu}}\left(e^{-\lambda t} + e^{-\mu t}\right) + \left({\lambda\over{\mu}}\right)^2\left(\lambda e^{-\lambda t} + \mu e^{-\mu t} - (\lambda+\mu)e^{-(\lambda+\mu)t}\right)$$
My source for probability formulae was Kleinrock, Queuing Systems.
Edit: The derivation of $f(t|n=1)$ is below, written as the derivation of the maximum of two independent exponential variates $x \sim \text{Exp}\{\lambda\}$ and $y \sim \text{Exp}\{\mu\}$. The corresponding CDFs are $F_X(x) = 1-\exp{\{-\lambda x\}}$ and $F_Y(y) = 1-\exp{\{-\mu y\}}$.
We'll approach this using the "cumulative distribution function technique". Note first that the statement "$\max(x,y) \leq t$" is equivalent to "$x \leq t$ and $y \leq t$". The probability that $\max(x,y) \leq t$ is just the product of the probabilities that $x \leq t$ and $y \leq t$ (as $x$ and $y$ are independent.) Writing this out gives:
$$F_{\max(x,y)}(t) = \left(1-e^{-\lambda t}\right)\left(1-e^{-\mu t}\right) = 1 - e^{-\lambda t} - e^{-\mu t} + e^{-(\lambda+\mu) t}$$
and taking the derivative with respect to $t$ gets you to the density function.
Memorylessness is a property of the following form:
$$\Pr(X>m+n \mid X > m)=\Pr(X>n)\ .$$
This property holds for $X_1=\ \text{time to the next event in a Poisson process}\ $, but it doesn't hold for $X_k=\ \text{time to the}\, k^\text{th}\, \text{event in a Poisson process}\ $ when $k>1$.
As for how to show it, you could try to do it from first principles.
If you can show that the essentially equivalent form $P(X>s+t)\neq P(X>s)P(X>t)$, (for $s, t>0\ $), that would be sufficient; you already know the distribution for $X_k$.
Best Answer
I think the main advantage to modeling the arrival distribution as Poisson is the memory-less property. It greatly simplifies the subsequent calculations to be able to assume that the number of arrivals in a particular time interval depends only on the length of the interval, rather than when the interval occurs, how many people came before, etc. Of course, this assumption is not always appropriate. For example, modeling the number of patients arriving to a doctor's office for routine checkups could be considered memory-less, since it would tend to average out to a constant rate over a long period of time. On the other hand, this assumption would likely be inappropriate for modeling the arrival rate of patients to an emergency room, since this would be more likely to follow a boom-and-bust pattern.