In an $M/M/1/K$ process, where $P_n$ is the proportion of time $n$ people are queuing or being served and $0 \le n \le K$, you can reach balance with
$$P_n=\left(\dfrac{\lambda}{\mu}\right)^n P_0 $$
and, if you write $\rho = \frac{\lambda}{\mu}$ as the ratio of the arrival and service rates, then this gives $P_n=\rho^n P_0$. Since $\sum_{n=0}^K P_n=1$, this results in $$P_n=\rho^n \dfrac{1-\rho}{1-\rho^{K+1}}$$
which with $K \to +\infty$ and $0 \le \rho \lt 1$ would give the $M/M/1$ result of $P_n \to \rho^n (1-\rho)$.
With finite $K$, you no longer have $1-P_0=\rho= \frac{\lambda}{\mu}= {\lambda}T_s$ as the expected proportion of time the server is working, which seems to be the point of your question. Instead you have $$1-P_0=\rho\dfrac{1-\rho^{K}}{1-\rho^{K+1}}$$ (with $1-P_0=\frac{K}{K+1}$ when $\rho=1$), which is less than $\rho$, meaning that the server expects to work less when the queue has limited capacity, much as you might intuitively expect.
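As a quick sanity check, here is a minimal Python sketch of these formulas (the rates and $K$ below are example values, not from the question):

```python
def mm1k_probs(lam, mu, K):
    """Steady-state probabilities [P_0, ..., P_K] of an M/M/1/K queue:
    P_n = rho^n * (1 - rho) / (1 - rho^(K+1)), with rho = lam / mu."""
    rho = lam / mu
    if rho == 1.0:
        # All K+1 states are equally likely when arrival and service rates match.
        return [1.0 / (K + 1)] * (K + 1)
    p0 = (1 - rho) / (1 - rho ** (K + 1))
    return [rho ** n * p0 for n in range(K + 1)]

lam, mu, K = 0.8, 1.0, 5          # illustrative values
probs = mm1k_probs(lam, mu, K)
rho = lam / mu
busy = 1 - probs[0]               # expected fraction of time the server works

print(f"sum of P_n = {sum(probs):.6f}")   # probabilities sum to 1
print(f"1 - P_0 = {busy:.4f} < rho = {rho}")
```

Running this shows $1-P_0 \approx 0.729 < \rho = 0.8$, consistent with the finite-capacity formula above.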
Have you tried thinking of it this way? Fundamentally, X(N) is usually measured by a benchmark running at steady state, as close to 100% utilization as the SUT and load drivers allow. This is known as the "internal throughput rate" or ITR. What we are really interested in is the External Throughput Rate or ETR, which is ITR times some function of utilization. Now if we think of the scalability law in hardware terms, there are two things to consider:
- If we measure ITR in terms of available cores, the USL curve follows the change in throughput from 1 to N cores. If we make the big assumption that cores are fully used before more cores are added, then:
For each ITR measurement on m out of N cores, we are essentially also measuring ITR at m/N utilization of N cores. In other words, the scaling curve is a proxy for the saturation curve. Using this we can back our way into ETR as a function of utilization.
- The second thing to consider is that in the short run, utilization must be in one of N+1 states, from 0/N to N/N. All measured utilization values come from averaging these states over time. In other words, utilization is quantized and actually hops around states of 100% usage of m out of N cores, where m is a random value. This means that our assumption is not all that wild.
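The proxy idea above can be sketched in a few lines. This assumes a Universal Scalability Law throughput curve; the `sigma` and `kappa` coefficients here are illustrative, not values from this post:

```python
def usl_throughput(m, sigma=0.05, kappa=0.002):
    """Universal Scalability Law: relative throughput on m cores
    (sigma = contention coefficient, kappa = coherency coefficient)."""
    return m / (1 + sigma * (m - 1) + kappa * m * (m - 1))

N = 16  # total cores on the hypothetical box

# Treat throughput measured on m of N cores as a proxy for throughput
# at utilization u = m/N on the full N-core system: ETR(u) ~ X(m).
etr_by_util = {m / N: usl_throughput(m) for m in range(1, N + 1)}

for u in sorted(etr_by_util):
    print(f"u = {u:.3f}  ETR proxy = {etr_by_util[u]:.2f}")
```

This gives a discrete ETR-versus-utilization table, quantized at the N+1 core states exactly as described above.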
Once we have ETR as a function of utilization, we can then proceed to find the response time. Response time will be between 1/TP(1) and 1/TP(m).
There is a metric called the TPI (TeamQuest Performance Indicator), which is the ratio of the service time to the response time. This "Key Performance Indicator" eliminates the need to understand service time, but still allows us to understand the queuing effects and relative response time of various solutions.
Using a queueing model we can come up with a Usage-based Performance Indicator (UPI) which tells us how much queueing is affecting the solutions being considered. We can plot this indicator versus utilization and get a characteristic curve which yields insight into the system. Both UPI and utilization are bounded by 0 and 1.
The plot has four quadrants:

- Quadrant 1: utilization < 0.5, UPI > 0.5. This is where UPI curves for viable systems start.
- Quadrant 2: utilization > 0.5, UPI > 0.5. This is where a well-running system should be. In this quadrant response time is still near service time and ETR approaches ITR.
- Quadrant 3: utilization > 0.5, UPI < 0.5. This is where UPI curves terminate. Response time >> service time as ETR approaches ITR.
- Quadrant 4: utilization < 0.5, UPI < 0.5. This is the quadrant that systems need to avoid: ETR << ITR and response time >> service time.
For the M/G/1 queueing model, $\mathrm{UPI} = \dfrac{1}{1 + c^2 \frac{u}{1-u}}$, where $u$ is the utilization and $c$ is the "Index of Variability", the standard deviation over the mean of the utilization. Using UPI may eliminate the need to understand Tserv(u).
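A small Python sketch of that indicator, with the quadrant classification from above (the `u` and `c` values are illustrative):

```python
def upi(u, c):
    """Usage-based Performance Indicator for an M/G/1 queue (0 < u < 1):
    UPI = 1 / (1 + c^2 * u / (1 - u))."""
    return 1.0 / (1.0 + c ** 2 * u / (1.0 - u))

def quadrant(u, value):
    """Quadrant of the UPI-vs-utilization plot, per the definitions above."""
    if u < 0.5:
        return 1 if value > 0.5 else 4
    return 2 if value > 0.5 else 3

for u in (0.2, 0.6, 0.9):
    v = upi(u, c=1.0)   # c = 1 corresponds to exponential variability
    print(f"u = {u:.1f}  UPI = {v:.3f}  quadrant {quadrant(u, v)}")
```

Note that with $c = 1$, UPI drops below 0.5 exactly at $u = 0.5$; lower variability ($c < 1$) keeps a system in quadrant 2 at higher utilizations.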
Hope this helps.
Best Answer
You're right that the formula for utilization is $\rho = \frac{\lambda}{c\mu}$. This describes the proportion of total service capacity being used in the system, so it is the whole-system utilization. The combined $c$ servers can serve at a maximum rate $c\mu$ and jobs arrive on average at rate $\lambda$, so in a stable system this value is less than 1.
The same formula also describes the proportion of each server's time being used (because the M/M/c model does not keep track of which server is serving which job, only of the number of jobs in the system).
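For concreteness, a tiny numeric example (the rates and server count are made up, not from the question):

```python
# Whole-system utilization of an M/M/c queue: rho = lambda / (c * mu).
lam = 3.0   # average arrival rate (jobs per unit time)
mu = 2.0    # service rate of each individual server
c = 2       # number of servers

rho = lam / (c * mu)
print(rho)  # 0.75 -> the system, and hence each server, is busy 75% of the time
```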