\begin{align}
f\big(\boldsymbol x^{k+1} \big) = f\big(\boldsymbol x^k + \alpha_k \boldsymbol s^{k}\big) &= \frac12 \big(\boldsymbol x^k + \alpha_k \boldsymbol s^{k}\big)^T Q \big(\boldsymbol x^k + \alpha_k \boldsymbol s^{k}\big) + \boldsymbol q^T \big(\boldsymbol x^k + \alpha_k \boldsymbol s^{k}\big) -\beta\\
&= \frac12 \big(\boldsymbol x^k\big)^T Q \boldsymbol x^k + \frac12 \alpha_k^2 \big(\boldsymbol s^{k}\big)^T Q \boldsymbol s^{k} + \alpha_k \big( \boldsymbol s^k \big)^T Q \boldsymbol x^{k} + \boldsymbol q^T \boldsymbol x^k + \alpha_k\boldsymbol q^T \boldsymbol s^{k} -\beta \\
&= f\big(\boldsymbol x^k \big) + \frac12 \alpha_k^2 \big(\boldsymbol s^{k}\big)^T Q \boldsymbol s^{k}+ \alpha_k \big( \boldsymbol s^k \big)^T Q \boldsymbol x^{k} + \alpha_k \big( \boldsymbol s^{k}\big)^T \boldsymbol q\\
&= f\big(\boldsymbol x^k \big) + \frac12 \alpha_k^2 \big(\boldsymbol s^{k}\big)^T Q \boldsymbol s^{k}+ \alpha_k \big( \boldsymbol s^k \big)^T \Big[Q \boldsymbol x^{k} + \boldsymbol q\Big] =: g(\alpha_k)
\end{align}
where the two cross terms have been combined using the symmetry of $Q$. For fixed $\boldsymbol x^{k}$, $g(\alpha_k)$ is a univariate, scalar parabola in $\alpha_k$; since $Q$ is (assumed) positive definite, $\big(\boldsymbol s^{k}\big)^T Q \boldsymbol s^{k} > 0$, so the parabola opens upward and has a unique minimum (recall that this is what we want) at $g'(\alpha_k) = 0$. Taking the derivative, one obtains
\begin{align}
0 &\overset{!}{=} \alpha_k \big( \boldsymbol s^k \big)^T Q \boldsymbol s^{k}+ \big( \boldsymbol s^k \big)^T \Big[Q \boldsymbol x^{k} + \boldsymbol q\Big] \\
\Rightarrow \alpha_k &= -\frac{ \big( \boldsymbol s^k \big)^T \Big[Q \boldsymbol x^{k} + \boldsymbol q\Big]}{\big( \boldsymbol s^k \big)^T Q \boldsymbol s^{k}}\end{align}
By the definition of $f$, $$\nabla f\big(\boldsymbol x^k\big) = Q \boldsymbol x^k + \boldsymbol q$$
and thus
\begin{align}
\alpha_k &= -\frac{ \big( \boldsymbol s^k \big)^T \nabla f\big(\boldsymbol x^k\big)}{\big( \boldsymbol s^k \big)^T Q \boldsymbol s^{k}} = -\frac{ \nabla f\big(\boldsymbol x^k\big)^T \boldsymbol s^k }{\big( \boldsymbol s^k \big)^T Q \boldsymbol s^{k}} \end{align}
as desired.
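For concreteness, here is a minimal numerical sketch in Python of steepest descent with this exact line search, taking $\boldsymbol s^k = -\nabla f(\boldsymbol x^k)$; the matrix $Q$, vector $\boldsymbol q$, starting point, and iteration count are arbitrary illustrative choices:
```python
import numpy as np

# Illustrative problem data: Q symmetric positive definite, arbitrary q.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
q = np.array([-1.0, 2.0])

def grad_f(x):
    # gradient of f(x) = 0.5 x^T Q x + q^T x - beta
    return Q @ x + q

x = np.array([5.0, -3.0])           # arbitrary starting point
for k in range(50):
    s = -grad_f(x)                  # steepest-descent direction
    if np.linalg.norm(s) < 1e-10:
        break
    # exact line-search step from the derivation above:
    # alpha_k = - (grad f(x^k)^T s^k) / ((s^k)^T Q s^k)
    alpha = -(grad_f(x) @ s) / (s @ Q @ s)
    x = x + alpha * s

print(x)                            # converges to the solution of Q x = -q
print(np.linalg.solve(Q, -q))       # analytic minimizer for comparison
```
Since $\alpha_k$ minimizes $g$ exactly along $\boldsymbol s^k$, no backtracking or step-size tuning is needed here.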
Unfortunately, it is not possible. What you are looking for is a function $g(\alpha,\mu, L)$ that depends only on the Lipschitz constant $L$, the strong convexity constant $\mu$, and the step size $\alpha$, such that $\lim_{\alpha \to 0}g(\alpha, \mu, L)=0$ and
\begin{align}
||x_{k+1}-x_{k}|| \leq g(\alpha,\mu,L) ||x_{k}-x^{*}||
\end{align}
for any convex set $X$.
The issue is that the function $g$ carries no ``information'' about the set $X$ you are projecting onto: it is independent of $X$. Therefore, by choosing the set $X(\alpha)$ as a function of $\alpha$, we can prevent $g$ from having the property $\lim_{\alpha \to 0}g(\alpha, \mu, L)=0$.
Consider the following simple example: $f(x) = \frac{1}{2}x^{2}$, so $\nabla f(x) = x$.
Then $x_{k+1} = P_{X(\alpha)}\big((1 - \alpha) x_{k}\big)$, where we choose the set $X(\alpha) = \{x\ |\ x\geq 1-\alpha,\ x\in\mathbb{R}\}$ and $P_{X(\alpha)}$ denotes the projection onto it. Notice that the minimizer of $f$ over the set $X(\alpha)$ is $x^{*} = 1-\alpha$.
Now set $x_0 = 1$. Then the condition $||x_{k+1}-x_{k}|| \leq g(\alpha,\mu,L) ||x_{k}-x^{*}||$ at the first iteration reads
\begin{align}
||x_{1}-x_{0}|| &\leq g(\alpha,\mu,L) ||x_{0}-x^{*}|| \\
||P_{X(\alpha)}((1 - \alpha)\cdot 1) - 1|| &\leq g(\alpha,\mu,L) ||1-(1-\alpha)|| \\
||(1 - \alpha) - 1|| &\leq g(\alpha,\mu,L) ||1-(1-\alpha)|| \\
\alpha &\leq g(\alpha,\mu,L) \alpha \\
1 &\leq g(\alpha,\mu,L)
\end{align}
This example shows that for any $\alpha \in (0,1)$ one can construct a convex set $X$ and an initial point $x_0$ such that the contraction condition forces $g(\alpha,\mu,L) \geq 1$. Therefore, there is no function $g$ such that $\lim_{\alpha \to 0}g(\alpha, \mu, L)=0$ and
\begin{align}
||x_{k+1}-x_{k}|| \leq g(\alpha,\mu,L) ||x_{k}-x^{*}||
\end{align}
for any convex set $X$.
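To double-check the arithmetic, here is a minimal numerical sketch in Python; the helper `project` and the particular step sizes tried are illustrative choices, not part of the original argument:
```python
import numpy as np

def project(x, alpha):
    # projection onto X(alpha) = { x : x >= 1 - alpha }
    return max(x, 1.0 - alpha)

for alpha in [0.5, 0.1, 0.01]:
    x0 = 1.0
    x_star = 1.0 - alpha                       # minimizer of 0.5 x^2 over X(alpha)
    x1 = project((1.0 - alpha) * x0, alpha)    # projected gradient step; grad f(x) = x
    ratio = abs(x1 - x0) / abs(x0 - x_star)    # lower bound on g(alpha, mu, L)
    print(alpha, ratio)                        # ratio is 1.0 for every alpha
```
The ratio never drops below $1$, no matter how small $\alpha$ becomes, matching the conclusion above.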
Beware that there are several formulations of Nesterov's method (which are not all completely equivalent); the one you are referring to likely comes from the machine learning community. Please allow me to propose the following (classical) formulation, in which the choice of $d_k$ appears more clearly: $$ \begin{cases} y_k &= x_k + \delta_k(x_k-x_{k-1})\\ x_{k+1} &=y_k - \alpha_k\nabla f(y_k) \end{cases}, $$ where $\delta_k\in[0,1)$ is a hyper-parameter (Nesterov's historical choice is $\delta_k = k/(k+3)$). Now, Armijo's rule consists of choosing the step size $\alpha_k$ so as to ensure a sufficient decrease of $f$ between two iterates $x_k$ and $x_{k+1}$. Using your notation, in Armijo's strategy we look for a step size $\alpha_k$ of the form $s\gamma^m$, where $m\in\mathbb{N}$ is what we have to choose. The update thus rewrites as $x_{k+1} = x_k-s\gamma^m\nabla f(y_k)$, and we require $$f(x_k)-f(x_{k+1})=f(x_k)-f\big(x_k-s\gamma^m\nabla f(y_k)\big)\geq - \sigma s\gamma^m\nabla f(x_k)^T\big(-\nabla f(y_k)\big).$$
So in short, $d_k=-\nabla f(y_k)$ is the direction in which we update $x_k$.
The reason why Armijo's rule is not so popular for Nesterov's method might be that it requires one to evaluate $\nabla f(x_k)$ (in addition to $\nabla f(y_k)$).
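To make the procedure concrete, here is a minimal sketch in Python of this formulation with Armijo backtracking; the quadratic test function, the parameters $s$, $\gamma$, $\sigma$, and the cap on $m$ are arbitrary illustrative choices:
```python
import numpy as np

# Illustrative strongly convex quadratic: f(x) = 0.5 x^T A x, minimized at 0.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

s, gamma, sigma = 1.0, 0.5, 1e-4    # Armijo parameters (arbitrary choices)

x_prev = x = np.array([3.0, 1.0])   # arbitrary starting point
for k in range(100):
    delta = k / (k + 3)             # Nesterov's historical choice for delta_k
    y = x + delta * (x - x_prev)    # extrapolation step
    d = -grad_f(y)                  # direction d_k = -grad f(y_k)
    # Backtracking: find the smallest m (capped for safety) with
    # f(x_k) - f(x_k + s*gamma^m d_k) >= -sigma * s*gamma^m * grad f(x_k)^T d_k.
    m = 0
    while m < 50 and f(x) - f(x + s * gamma**m * d) < -sigma * s * gamma**m * (grad_f(x) @ d):
        m += 1
    x_prev, x = x, x + s * gamma**m * d

print(x, f(x))                      # the iterates approach the minimizer at 0
```
Note that, as remarked above, each backtracking test evaluates $\nabla f(x_k)$ in addition to $\nabla f(y_k)$.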