OK, I think I have a full answer at this point, so let me post it.
Step 1. (algebra).
If $P$ is an $n\times n$ stochastic matrix and $\lambda$ is an eigenvalue of $P$ with $|\lambda|=1$, then $\lambda^k=1$ for some $k\le n$ and $1,\lambda,\lambda^2,\dots,\lambda^{k-1}$ are eigenvalues of $P$.
Indeed, let $x$ be an eigenvector. WLOG, $\lambda\ne 1$. Let $S=\{j:|x_j|=\max_k|x_k|\}$. If $i\in S$ and $p_{ij}>0$, then $j\in S$ and, moreover, $x_j=\lambda x_i$ (since $\lambda x_i=\sum_j p_{ij}x_j$ is a convex combination of numbers of modulus at most $|x_i|$, equality of moduli forces every term with $p_{ij}>0$ to equal $\lambda x_i$). Thus, if some number occurs as an entry of $x$, so does $\lambda^k$ times that number for every $k$; but the number of distinct entries is at most $n$, so $\lambda^k=1$ for some $k\le n$. Moreover, we can assume that one of the entries is $1$ and split the indices in $S$ into groups $S_m$ by the rule $j\in S_m$ iff $x_j$ is in the half-open counterclockwise arc from $\lambda^m$ to $\lambda^{m+1}$, so that $i\in S_m$, $p_{ij}>0$ imply $j\in S_{m+1}$. From here we immediately see that $P-\lambda^qI$ is not invertible for any $q$ (the $S$-block annihilates the vector $y_j=\lambda^{qm}$ for $j\in S_m$, and the full determinant has the determinant of the $S$-block as a factor). Thus, all powers of $\lambda$ are eigenvalues.
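A quick numerical sanity check of Step 1 (my own illustration, not part of the argument), assuming NumPy: the matrix below is a period-$3$ stochastic matrix I made up, whose peripheral eigenvalue $\lambda=e^{2\pi i/3}$ indeed drags all of its powers into the spectrum.

```python
import numpy as np

# a period-3 stochastic matrix: classes {0,1} -> {2} -> {3} -> {0,1}
P = np.array([
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.3, 0.7, 0.0, 0.0],
])
eigs = np.linalg.eigvals(P)
lam = np.exp(2j * np.pi / 3)    # peripheral eigenvalue of modulus 1
for q in range(3):
    # distance from lambda^q to the spectrum: ~1e-16 for q = 0, 1, 2
    print(q, np.abs(eigs - lam**q).min())
```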
Step 2. (compactness argument).
Consider a convergent sequence $P_k$ of $n\times n$ stochastic matrices with limit $P$. Assume that each $P_k$ has an eigenvalue $\lambda_k$ and that the $\lambda_k$ are not contained in any fixed Stolz angle at $1$. Then we may assume that $\lambda_k\to\lambda$ with $|\lambda|=1$. Clearly, $\lambda$ is an eigenvalue of $P$. If $\lambda\ne 1$, then $P$ has several eigenvalues summing to $0$ (the powers of $\lambda$, by Step 1), so $\operatorname{Tr} P\le n-2$, which makes it not a limit point of your set. But if $\lambda=1$, it is even worse: if the $\lambda_k$ approach $1$ tangentially, then $\lambda_k^{m_k}$ can tend to any point on the unit circle, but these powers are also eigenvalues of $n\times n$ stochastic matrices (namely of $P_k^{m_k}$), and so are their limits. Thus, there is some fixed (but depending on $n$) Stolz angle containing all the eigenvalues of your matrices.
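The tangential phenomenon is easy to see numerically (the specific numbers below are arbitrary, NumPy assumed): an eigenvalue much closer to the unit circle than to $1$ can be raised to a power that rotates it almost anywhere while barely shrinking the modulus.

```python
import numpy as np

lam = 0.9999 * np.exp(0.01j)   # very close to the circle, approaching 1 tangentially
m = round(np.pi / 0.01)        # rotate the argument by ~pi
print(lam**m, abs(lam**m))     # ~(-0.969 + 0.002j), modulus still ~0.969
```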
Step 3. (harmonic analysis).
Let $f(m)=\sum_{k=1}^n c_k\lambda_k^m$ for $m\ge 0$ and $f(m)=0$ for $m<0$, where the $\lambda_k$ lie in some fixed Stolz angle $A$. Then
$$
Vf=\sum_{m\in\mathbb Z}|f(m+1)-f(m)|\le C(A,n)\max_m |f(m)|
$$
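Before the proof, a quick numerical illustration of the claim (my own sketch, assuming NumPy; the $90^\circ$ Stolz angle and the horizon are arbitrary choices): the ratio $Vf/\max_m|f(m)|$ stays modest over re-runs.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
# lambda_k = exp(-mu_k) with |Im mu_k| <= Re mu_k: a 90-degree Stolz angle at 1
re = rng.uniform(0.01, 0.5, n)
im = re * rng.uniform(-1.0, 1.0, n)
lam = np.exp(-(re + 1j * im))
c = rng.standard_normal(n) + 1j * rng.standard_normal(n)

m = np.arange(5000)                            # horizon long enough for decay
f = (c * lam ** m[:, None]).sum(axis=1)
Vf = np.abs(f[0]) + np.abs(np.diff(f)).sum()   # includes the jump at m = 0
print(Vf / np.abs(f).max())                    # stays bounded over re-runs
```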
Proof:
We begin with a complex analysis lemma: let $F(z)=\sum_{k=1}^n c_k e^{\mu_k z}$, where $\mu_k\in\mathbb C$, $|\mu_k|\le 1$. Then $F$ has at most $C(n)$ zeroes in the unit disk.
Proof: Let $m$ be the maximum of $|F|$ over the unit disk. Then the first $n$ derivatives of $F$ at the origin are bounded by $C(n)m$. But $\Phi(t)=F(zt)$ (for fixed $|z|=1$) satisfies an $n$-th order differential equation $\Phi^{(n)}=\sum_{k=0}^{n-1}b_k\Phi^{(k)}$, whose coefficients $b_k$ come from expanding the polynomial $\prod_{k=1}^n (x-z\mu_k)$ and are therefore bounded by $2^n$, say. Standard ODE theory then implies that $\Phi$ is bounded by $C'(n)m$ on $[-2,2]$, so the ratio of the maximum of $|F|$ over the disk of radius $2$ to its maximum over the unit disk is bounded, which is enough to control the number of zeroes in the unit disk (each zero contributes a Blaschke factor that pushes this ratio up by a fixed factor). Rescaling and covering, we conclude that if $|\mu_k|\le\mu$, then $F$ has at most $C(n,K)$ zeroes in the disk of radius $K/\mu$.
Now we proceed by induction on $n$.
If $n=1$, the claim is obvious: the maximum is just $|c_1|$ and $|\lambda_1^m-\lambda_1^{m+1}|=|\lambda_1|^m|1-\lambda_1|\le C(A)(|\lambda_1|^m-|\lambda_1|^{m+1})$, and the right-hand side telescopes when summed over $m$ (the jump at $m=0$ contributes one more $|c_1|$).
Let $n>1$. Write $\lambda_k=e^{-\mu_k}$ with $|\mu_k|\le C(A)\operatorname{Re}\mu_k$. Let $\mu=\max_k|\mu_k|=|\mu_n|$ (reordering if necessary). Note that $f$ is the restriction of $F(z)=\sum_{k=1}^n c_ke^{-\mu_k z}$ to the non-negative integers. By the lemma, the derivative of the real or imaginary part of $F(t)$ can have only $C(2n,K)$ zeroes on $[0,K/\mu]$, so the real and imaginary parts have a bounded number of intervals of monotonicity there, whence the variation of $f$ on $[0,K/\mu]$ is dominated by its maximum. Now, choose $K=K(A)$ so that $\gamma=\lambda_n^N$ satisfies $|\gamma|<1/2$, where $N\approx K/\mu$. The function $g(m)=f(m+N)-\gamma f(m)$ for $m\ge 0$ (and $0$ for $m<0$) is bounded by $2\max|f|$ and has one term fewer (the $k=n$ term cancels because $\lambda_n^{m+N}=\gamma\lambda_n^m$). Thus, by the induction assumption, $Vg\le C(A,n)\max|f|$.
To recover $f$ from $g$, note that $f(m)-\gamma f(m-N)=G(m)$, where $G(m)=g(m-N)$ for $m\ge N$ and $G(m)=f(m)$ for $0\le m<N$ (and $G(m)=0$ for $m<0$). Note that $VG$ is still under control because we have bounded both $Vg$ and the part of $Vf$ coming from the interval $[0,N]$. Now it remains to iterate this recurrence to get $f(m)=G(m)+\gamma G(m-N)+\gamma^2 G(m-2N)+\dots$ and to use the shift invariance of, and the triangle inequality for, the total variation: $Vf\le\sum_{q\ge 0}|\gamma|^q\,VG\le 2\,VG$.
Step 4. (the end)
Each entry of the matrix $P^k$, viewed as a function of $k$, is of this form (assuming that the eigenvalues of $P$ are distinct, which holds on a dense set of matrices). Thus, the total variation of each entry is bounded by some $C(n)$ depending on $n$ only. This is equivalent to $\sum_k\|P^k-P^{k+1}\|\le C(n)$; but the sequence of norms (in the operator norm induced by $\ell^\infty$) is non-increasing, since $\|P^{k+1}-P^{k+2}\|=\|P(P^k-P^{k+1})\|\le\|P^k-P^{k+1}\|$, and a non-negative non-increasing sequence with a bounded sum is $O_n(k^{-1})$ (because $k\,a_k\le\sum_{j\le k}a_j$).
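As a closing sanity check (mine, assuming NumPy), one can watch $k\,\|P^k-P^{k+1}\|$ in the induced $\ell^\infty$ norm for a random stochastic matrix; it stays bounded, as the argument predicts.

```python
import numpy as np
from numpy.linalg import matrix_power

rng = np.random.default_rng(0)
n = 5
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)        # row-normalize: P is stochastic

for k in [1, 10, 100, 1000]:
    D = matrix_power(P, k) - matrix_power(P, k + 1)
    norm = np.abs(D).sum(axis=1).max()   # operator norm induced by l-infinity
    # k * norm stays bounded (here it actually decays geometrically,
    # since a random positive P is far from periodic)
    print(k, k * norm)
```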
Feel free to ask questions :). I suspect all of this is written in some obscure textbooks, but doing the literature search now is beyond my abilities.
I found a proof of this problem for the case of $\varphi(x)=\sum_i{(x_i\log x_i-x_i)}$. If you find any mistake in the proof, please let me know. Thank you!
The case of $\varphi(x)=\sum_i{(x_i\log x_i-x_i)}$ can be proved with a method similar to that of Lindblad, "Completely positive maps and entropy inequalities" (1975) and Lindblad, "Expectations and entropy inequalities for finite quantum systems" (1974). The inequality can be strengthened to
$$
\sum_i{\mathrm{KL}\left(\mathbf{x}_i^\top A\mathbf{x}_i,\mathbf{x}_i^\top B\mathbf{x}_i\right)}\leq
D_{vN}\left(A,B\right)
$$
Actually, a very similar result has appeared in papers on quantum information theory, such as the two papers cited above. The result is that for any trace-preserving map $\Phi$, given by $\Phi(A)=\sum_{i=1}^n{V_iAV_i^\top}$ with $\sum_{i=1}^n{V_i^\top V_i}=\mathbf{I}$, we have $D_{vN}\left(\Phi(A),\Phi(B)\right)\leq D_{vN}(A,B)$, where $A,B$ are density operators (Hermitian positive definite matrices satisfying $\mathrm{tr}A=\mathrm{tr}B=1$) and $\varphi(x)=x\log x$.
We found that if the trace constraints $\mathrm{tr}A=\mathrm{tr}B=1$ are dropped and $\varphi(x)=x\log x$ is replaced with $\varphi(x)=x\log x-x$, the inequality still holds.
The proof is outlined as follows.
The von Neumann divergence has the following additivity property under Kronecker products:
$$D_{vN}(A\otimes P,B\otimes P)=D_{vN}(A,B)\cdot\mathrm{tr}P$$
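This identity is easy to test numerically. A minimal sketch, assuming NumPy/SciPy and reading the divergence for $\varphi(x)=x\log x-x$ as $D_{vN}(A,B)=\mathrm{tr}(A\log A-A\log B-A+B)$:

```python
import numpy as np
from scipy.linalg import logm

def d_vn(A, B):
    # von Neumann divergence for phi(x) = x log x - x
    return np.trace(A @ logm(A) - A @ logm(B) - A + B).real

rng = np.random.default_rng(0)
def rand_pd(k):
    G = rng.standard_normal((k, k))
    return G @ G.T + k * np.eye(k)       # symmetric positive definite

A, B, P = rand_pd(3), rand_pd(3), rand_pd(2)
lhs = d_vn(np.kron(A, P), np.kron(B, P))
rhs = d_vn(A, B) * np.trace(P)
print(lhs, rhs)                          # agree up to numerical error
```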
Using the joint convexity and the additivity, we can prove that the von Neumann divergence is monotone under partial trace (the second equality below uses the standard representation of $X\mapsto\mathrm{tr}_2(X)\otimes\frac{\mathbf{I}_2}{m}$ as a mixture $\sum_{j=1}^N{p_jW_jXW_j^+}$ of unitary conjugations):
\begin{equation*}
\begin{split}
D_{vN}(\mathrm{tr}_2(A),\mathrm{tr}_2(B))
=&D_{vN}\left(\mathrm{tr}_2(A)\otimes\frac{\mathbf{I}_2}{m},
\mathrm{tr}_2(B)\otimes\frac{\mathbf{I}_2}{m}\right)
\Big/\mathrm{tr}\left(\frac{\mathbf{I}_2}{m}\right)\\
=&D_{vN}\left(\sum_{j=1}^N{p_jW_jAW_j^+},\sum_{j=1}^N{p_jW_jBW_j^+}\right)\\
\leq&\sum_{j=1}^{N}{p_jD_{vN}\left(W_jAW_j^+,W_jBW_j^+\right)}\\
=&D_{vN}(A,B)
\end{split}
\end{equation*}
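Here is a numerical spot check of this monotonicity (my own sketch, assuming NumPy/SciPy; `ptrace2` contracts the second tensor factor):

```python
import numpy as np
from scipy.linalg import logm

def d_vn(A, B):
    return np.trace(A @ logm(A) - A @ logm(B) - A + B).real

def ptrace2(M, n, m):
    # tr_2: trace out the second factor of an (n*m) x (n*m) matrix
    return M.reshape(n, m, n, m).trace(axis1=1, axis2=3)

rng = np.random.default_rng(1)
def rand_pd(k):
    G = rng.standard_normal((k, k))
    return G @ G.T + k * np.eye(k)

n, m = 3, 2
A, B = rand_pd(n * m), rand_pd(n * m)
# the first number should not exceed the second
print(d_vn(ptrace2(A, n, m), ptrace2(B, n, m)), d_vn(A, B))
```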
Any trace-preserving map $\Phi$, given by $\Phi(A)=\sum_{i=1}^n{V_iAV_i^\top}$ with $\sum_{i=1}^n{V_i^\top V_i}=\mathbf{I}$, can be represented as a unitary conjugation followed by a partial trace (a Stinespring dilation), i.e., $\Phi(A)=\mathrm{tr}_2\left(W(A\otimes\mathbf{s}\mathbf{s}^\top)W^\top\right)$ for some unitary $W$ and unit vector $\mathbf{s}$. Therefore, we have that
\begin{equation*}
\begin{split}
D_{vN}\left(\Phi(A),\Phi(B)\right)
=&D_{vN}\left(\mathrm{tr}_2(W(A\otimes\mathbf{s}\mathbf{s}^\top)W^\top),
\mathrm{tr}_2(W(B\otimes\mathbf{s}\mathbf{s}^\top)W^\top)\right)\\
\leq&D_{vN}\left(W(A\otimes\mathbf{s}\mathbf{s}^\top)W^\top,
W(B\otimes\mathbf{s}\mathbf{s}^\top)W^\top\right)\\
=&D_{vN}\left(A,B\right)
\end{split}
\end{equation*}
Then the sum of the relative entropies of the quadratic forms can be represented as a matrix divergence and bounded:
\begin{equation*}
\begin{split}
\sum_i{\mathrm{KL}\left(\mathbf{x}_i^\top A\mathbf{x}_i,\mathbf{x}_i^\top B\mathbf{x}_i\right)}
=&\sum_{i,j}{(\mathbf{x}_i^\top\mathbf{x}_j)^2
\mathrm{KL}(\mathbf{x}_i^\top A\mathbf{x}_i,\mathbf{x}_j^\top B\mathbf{x}_j)}\\
=&D_{vN}\left(\sum_i{X_iAX_i^\top},\sum_i{X_iBX_i^\top}\right)\\
\leq&D_{vN}\left(A,B\right)
\end{split}
\end{equation*}
where $X_i=\mathbf{x}_i\mathbf{x}_i^\top$ and the $\mathbf{x}_i$ form an orthonormal basis, so that $\sum_i{X_i^\top X_i}=\mathbf{I}$ and the map $A\mapsto\sum_i{X_iAX_i^\top}$ is trace preserving.
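A numerical spot check of the final inequality (my own sketch, assuming NumPy/SciPy), taking the $\mathbf{x}_i$ to be the columns of a random orthogonal matrix and $\mathrm{KL}(a,b)=a\log(a/b)-a+b$ for scalars:

```python
import numpy as np
from scipy.linalg import logm, qr

def d_vn(A, B):
    return np.trace(A @ logm(A) - A @ logm(B) - A + B).real

rng = np.random.default_rng(2)
n = 4
def rand_pd(k):
    G = rng.standard_normal((k, k))
    return G @ G.T + k * np.eye(k)

A, B = rand_pd(n), rand_pd(n)
X, _ = qr(rng.standard_normal((n, n)))   # orthonormal x_i = columns of X

kl_sum = 0.0
for i in range(n):
    a = X[:, i] @ A @ X[:, i]            # quadratic forms, positive for PD A, B
    b = X[:, i] @ B @ X[:, i]
    kl_sum += a * np.log(a / b) - a + b
print(kl_sum, d_vn(A, B))                # kl_sum should not exceed d_vn(A, B)
```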
If there is any mistake in the proof, please let me know. Any other suggestions are also welcome. Thank you very much!
Best Answer
Define,
$$A(\alpha) = \sum_i \alpha_i A_i.$$
In your case (ignoring constraints for the moment), you have
$$\min_{\alpha}\quad\mbox{trace}(A(\alpha)^{-1}) = \sum_i e_i^TA(\alpha)^{-1}e_i,$$
where $e_i$ is the $i$-th standard basis vector, which makes the objective somewhat tricky to optimize.
So, as per your request, here is a somewhat simpler setup that you may find useful. Consider the following two problems:
$$\min_{\alpha}\quad \|A(\alpha)\|_2,$$
and
$$\min_{\alpha}\quad c^TA(\alpha)^{-1}c.$$
Both of these can be cast into SDPs. For the first, we replace it by $$\begin{eqnarray} \min_{\alpha,t}\quad &t\\ &\begin{bmatrix} tI & A(\alpha)\\ A(\alpha) & tI \end{bmatrix} \succeq 0 \end{eqnarray}$$
For the second, use a similar Schur-complement based idea to obtain $$\begin{eqnarray} \min_{\alpha,t}\quad &t\\ &\begin{bmatrix} A(\alpha) & c\\ c^T & t \end{bmatrix} \succeq 0 \end{eqnarray}$$
You can add more constraints on $\alpha$ if you want. However, because these problems are SDPs, they won't be easy to solve for really large matrices.
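For what it's worth, here is a minimal CVXPY sketch of the second problem; the data $A_i$, $c$, and the simplex constraint on $\alpha$ are placeholders of mine. `matrix_frac(c, A)` is CVXPY's built-in atom for $c^TA(\alpha)^{-1}c$, compiled via the same Schur-complement LMI written above (for the first problem, `sigma_max` plays the analogous role).

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 3
A_list = []
for _ in range(p):
    G = rng.standard_normal((n, n))
    A_list.append(G @ G.T + n * np.eye(n))   # symmetric positive definite A_i
c = rng.standard_normal(n)

alpha = cp.Variable(p, nonneg=True)
A = sum(alpha[i] * A_list[i] for i in range(p))   # A(alpha)
prob = cp.Problem(cp.Minimize(cp.matrix_frac(c, A)),
                  [cp.sum(alpha) == 1])           # placeholder constraint
prob.solve(solver=cp.SCS)                         # SCS ships with CVXPY and handles SDPs
print(prob.value, alpha.value)
```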