Suppose $\xi_k \in \partial f_k(x)$, then
$f_k(z) \ge f_k(x) + \langle \xi_k, z-x \rangle$ for all $z$.
Hence
$\sum_k f_k(z) \ge \sum_k f_k(x) + \langle \sum_k \xi_k, z-x \rangle$ for all $z$, and so we have
$\sum_k \xi_k \in \partial \sum_k f_k(x)$.
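As a quick numerical sanity check, here is a minimal Python sketch; the functions $f_1(z)=|z|$, $f_2(z)=z^2$ and the grid are my own illustrative choices. It verifies that a sum of subgradients at a point satisfies the subgradient inequality for the sum.

```python
import numpy as np

def is_subgradient(xi, x, f, zs, tol=1e-12):
    """Check the subgradient inequality f(z) >= f(x) + xi*(z - x) on a grid."""
    return bool(np.all(f(zs) >= f(x) + xi * (zs - x) - tol))

f1 = lambda z: np.abs(z)      # subdifferential of f1 at 0 is [-1, 1]
f2 = lambda z: z ** 2         # gradient of f2 at 0 is 0
f = lambda z: f1(z) + f2(z)

zs = np.linspace(-2.0, 2.0, 401)
# xi1 = 0.5 lies in the subdifferential of f1 at 0, and xi2 = 0 is the
# gradient of f2 at 0; their sum is a subgradient of f1 + f2 at 0,
# exactly as the derivation above shows.
print(is_subgradient(0.5 + 0.0, 0.0, f, zs))  # prints True
```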
Note that for convex functions, in practical settings equality typically holds;
failure of equality is pathological in some sense.
For example, a sufficient condition (Rockafellar) is
$\cap_k \operatorname{ri} ( \operatorname{dom} f_k ) \neq \emptyset$.
As an aside, the notion of subdifferential can be extended to locally Lipschitz functions (the Clarke subdifferential), where the containment goes in the
opposite direction (ignoring pathologies); that is, the subdifferentials satisfy
$\partial \sum_k f_k(x) \subset \sum_k \partial f_k(x)$. An easy
example is $f_1(x)= \max(0,x)$, $f_2(x) = -f_1(-x)$, in which case we have $f_1(x)+f_2(x) = x$ and so
$\partial \sum_k f_k(0)= \{1\}$, $\sum_k \partial f_k(0) = [0,2]$.
Of course, $f_2$ is not convex, so there is no contradiction here.
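For what it's worth, the identity $f_1 + f_2 = x$ and the one-sided slopes at $0$ are easy to confirm numerically; a small Python sketch, where the grid and step size are arbitrary illustrative choices:

```python
import numpy as np

f1 = lambda z: np.maximum(0.0, z)        # max(0, x)
f2 = lambda z: -np.maximum(0.0, -z)      # -f1(-x), i.e. min(0, x)

zs = np.linspace(-3.0, 3.0, 601)
assert np.allclose(f1(zs) + f2(zs), zs)  # the sum is the identity map

# One-sided difference quotients of f1 at 0 recover the endpoints 0 and 1
# of its subdifferential [0, 1]; f2 has the same endpoints, giving [0, 2].
h = 1e-8
print((f1(0.0) - f1(-h)) / h, (f1(h) - f1(0.0)) / h)  # prints 0.0 1.0
```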
You need to start by understanding that traditional asymptotic analysis of sublinear/linear/superlinear/quadratic convergence of optimization algorithms isn't directly comparable to the non-asymptotic analysis of first-order methods, which bounds either the number of iterations needed to obtain an $\epsilon$-optimal solution (e.g. $O(1/\epsilon)$) or the suboptimality that remains after $k$ iterations (e.g. $O(1/k)$).
Asymptotic analysis is about what happens in the limit as $x^{(k)}$ approaches $x^{*}$. The other kind of analysis is about what happens after a relatively small number of iterations, well before the algorithm has shown its limiting behavior.
Quasi-Newton and Newton methods have asymptotic convergence rates that are at least superlinear (quadratic for Newton's method). In practice, we often run these algorithms until they have converged to the full available accuracy of the floating-point (typically double-precision) arithmetic in use, so we typically do see the limiting asymptotic behavior. For example, with Newton's method it is common to see the number of accurate digits in the solution double in each of the last few iterations until the limits of the floating-point arithmetic are reached. Any superlinearly or quadratically convergent algorithm that reaches this limiting regime will produce a fully accurate solution within a few further iterations.
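The digit-doubling behavior is easy to see on a toy problem. A minimal sketch, where the function $x^2 - 2$ and the starting point are my own choices, assuming nothing beyond the standard Newton iteration:

```python
import math

# Newton's method on f(x) = x^2 - 2, whose positive root is sqrt(2).
# The error is roughly squared at each step, so the number of accurate
# digits roughly doubles per iteration until machine precision is hit.
x, errs = 1.5, []
for _ in range(5):
    x -= (x * x - 2.0) / (2.0 * x)       # Newton step: x - f(x)/f'(x)
    errs.append(abs(x - math.sqrt(2.0)))

print(errs)  # each error is roughly the square of the previous one
```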
In comparison, in the brave new world of first-order methods for large-scale convex but non-smooth problems, we can seldom achieve more than 3-4 digits of accuracy in the solution. The algorithms are typically stopped long before full convergence. We're not particularly interested in the asymptotic convergence rate (which is typically sublinear) because we'll never get anywhere near the limiting behavior. Rather, we're interested in how many iterations are required to find a solution that is adequate for our purposes.
In particular, we define an $\epsilon$-approximate solution to be a solution $\hat{x}$ with $f(\hat{x})-f(x^{*}) \le \epsilon$. For a fixed $\epsilon$, we can then analyze how many iterations are required to obtain an $\epsilon$-approximate solution and show that (for example) our algorithm finds one in $O(1/\epsilon)$ iterations. Alternatively, we might show that after $k$ iterations the gap $f(x^{(k)})-f(x^{*})$ is (for example) $O(1/k)$. Here the constant hidden by the $O()$ would involve the distance from $x^{(0)}$ to $x^{*}$.
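As a concrete toy illustration: the subgradient method on $f(x) = |x|$ with steps $1/k$, counting how many iterations are needed until the best value seen is $\epsilon$-approximate. This is a sketch under assumed choices (starting point, step rule), not a rate proof.

```python
def iterations_to_eps(eps, x0=5.0):
    """Subgradient method on f(x) = |x| with steps 1/k; return the first k
    at which the best value seen so far satisfies f_best - f(x*) <= eps."""
    x, best, k = x0, abs(x0), 0
    while best > eps:
        k += 1
        g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # a subgradient of |.|
        x -= g / k                                      # diminishing step 1/k
        best = min(best, abs(x))                        # here f(x*) = 0
    return k

# Smaller eps costs more iterations:
print(iterations_to_eps(1e-1), iterations_to_eps(1e-2))
```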
It's possible for two first order methods to both have sublinear asymptotic convergence in the limit as $x^{(k)}$ approaches $x^{*}$ but have $O(1/k)$ and $O(1/k^{2})$ convergence respectively as measured by $f(x^{(k)})-f(x^{*})$. Clearly, the $O(1/k^{2})$ algorithm is to be preferred, even though the asymptotic convergence is sublinear in both cases.
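A hypothetical side-by-side on an ill-conditioned convex quadratic makes the point empirically. The matrix, step size $1/L$, and iteration count below are my own assumptions, and FISTA-style momentum stands in for an $O(1/k^{2})$ method:

```python
import numpy as np

# Gradient descent vs. Nesterov-accelerated gradient descent on the
# ill-conditioned convex quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 1e-4])          # L = 1 (largest eigenvalue of A)
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x0 = np.array([1.0, 1.0])
n_iter = 500

# Plain gradient descent with step 1/L.
x = x0.copy()
for _ in range(n_iter):
    x = x - grad(x)
f_gd = f(x)

# Nesterov acceleration (FISTA-style momentum sequence t_k).
x, y, t = x0.copy(), x0.copy(), 1.0
for _ in range(n_iter):
    x_new = y - grad(y)
    t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
    y = x_new + ((t - 1.0) / t_new) * (x_new - x)
    x, t = x_new, t_new
f_acc = f(x)

print(f_gd, f_acc)
```

After the same number of gradient evaluations, the accelerated method reaches a markedly smaller objective value, consistent with the $O(1/k^{2})$ versus $O(1/k)$ guarantees.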
One final comment is that none of this relates to the polynomial time complexity of interior point methods for LP, SOCP, and SDP. That's yet another way of analyzing the convergence of some optimization algorithms.
From the point of view of computational practice in solving very large-scale optimization problems, it's important to note that polynomial time sometimes isn't good enough. For truly large-scale problems you might not even be able to complete one iteration in the available time. Similarly, a superlinearly convergent algorithm isn't of much use if you can't run it for enough iterations to see its asymptotic superlinear convergence.
Some further reading:
Beck, Amir. First-Order Methods in Optimization. Philadelphia, PA: SIAM, 2017.
Bertsekas, Dimitri P. Convex Optimization Algorithms. Belmont, MA: Athena Scientific, 2015.
Nesterov, Yurii. Lectures on Convex Optimization. 2nd ed. New York, NY: Springer, 2018.
Let $x\in \operatorname{int} \operatorname{dom} f$ and $g\in \partial f(x)$. Then for all $y \in \operatorname{dom} f$,
$$ f(y) \le f(x) +\langle g, y-x\rangle + \frac L2 \|y-x\|^2 $$
and
$$ f(x) +\langle g, y-x\rangle \le f(y). $$
Hence
$$ \langle g, y-x\rangle \le f(y)-f(x) \le \langle g, y-x\rangle + \frac L2 \|y-x\|^2. $$
Set $y=x+td$ with $\|d\|=1$ and $t>0$ small enough. Dividing the above inequality by $t$ gives
$$ \langle g, d\rangle \le \frac{ f(x+td)-f(x) }t \le \langle g, d\rangle + \frac L2 t. $$
Passing to the limit $t\searrow0$ proves $f'(x;d) = \langle g,d\rangle$ for all $d$. Hence $f'(x)=g$ and $f$ is differentiable at $x$.
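A quick numerical sketch of the sandwich inequality, taking $f(z) = z^{2}/4$ with $L = 1$ as an illustrative choice (so the quadratic upper bound holds with room to spare):

```python
# Sandwich inequality for f(z) = 0.25 * z**2, L = 1, at x = 0.7, d = 1:
#   <g, d> <= (f(x + t d) - f(x)) / t <= <g, d> + (L/2) * t
f = lambda z: 0.25 * z * z
L, x, d = 1.0, 0.7, 1.0
g = 0.5 * x                         # gradient of f at x, here the subgradient

for t in [1.0, 0.1, 0.01, 0.001]:
    q = (f(x + t * d) - f(x)) / t   # difference quotient
    assert g * d <= q <= g * d + 0.5 * L * t
    # q -> <g, d> as t decreases, recovering f'(x; d) = <g, d>
```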