Just in case someone is following, I want to post a somewhat negative answer to my second question. I found an example that satisfies the assumptions, and achieves an efficiency arbitrarily close to 1.
The example is inspired by the Laplace distribution with an unknown location parameter $\theta$ and p.d.f. $f(x|\theta) = \frac 1 2 e^{-|x-\theta|}$. In this case, when $p=\frac 1 2$, both estimators coincide and the efficiency is 1. This is due to the fact that (i) the MLE of the location parameter of a Laplace distribution is the median, and (ii) the distribution is symmetric.
The problem with the Laplace distribution is that it does not satisfy our assumptions: its log-likelihood is not differentiable because of the absolute value in the exponent. The trick is to replace the absolute value by an analytic approximation, such as $\frac 1 k \ln(\cosh(k x))$, which converges pointwise to $|x|$ as $k\rightarrow\infty$. Indeed, the sequence of distributions given by
$$
f_k(x\mid\theta) = \frac {a_k} 2 \cosh\bigl(k (x-\theta)\bigr)^{-1/k},
$$
where $a_k$ is a normalization constant, achieves an efficiency that tends to 1 as $k$ goes to infinity.
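As a quick numerical sanity check (not needed for the argument), here is a minimal Python sketch; the helper name `smooth_abs` is mine. It verifies that $\frac 1 k \ln(\cosh(kx))$ approaches $|x|$, with error at most $\frac{\ln 2}{k}$ uniformly in $x$.

```python
import numpy as np

def smooth_abs(x, k):
    # (1/k) * log(cosh(k*x)), computed via logaddexp for numerical stability:
    # log(cosh(t)) = log(e^t + e^(-t)) - log(2) = logaddexp(t, -t) - log(2)
    return (np.logaddexp(k * x, -k * x) - np.log(2.0)) / k

x = np.array([-2.0, -0.5, 0.0, 0.3, 1.7])
for k in (1, 10, 100, 1000):
    err = np.max(np.abs(smooth_abs(x, k) - np.abs(x)))
    print(f"k = {k:5d}   max error on grid = {err:.3e}")  # decreases like ln(2)/k
```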
Bummer.
$\newcommand\th\theta\newcommand\R{\mathbb R}$In your logistic regression model, there is no function $f$ such that the condition $p\ge f(n)$ guarantees zero training error.
Indeed, let us say that a point $x_i$ in your data is red if $y_i=1$ and blue if $y_i=0$. Let us say that $\th\in\R^p$ separates the red and blue points (that is, has zero training error) if $\th^Tx_i>0$ whenever $x_i$ is red and $\th^Tx_i<0$ whenever $x_i$ is blue.
Then, for any natural number $p$, zero training error cannot be attained by any $\th$ if e.g. (i) one of the $x_i$'s is $0$, or (ii) there are two red points of the form $u$ and $au$ for some real $a\le0$ and some $u\in\R^p$, or (iii) there are two red points $u$ and $v$ and a blue point of the form $au+bv$ for some real $a,b\ge0$. In case (iii), for instance, $\th^Tu>0$ and $\th^Tv>0$ would force $\th^T(au+bv)=a\,\th^Tu+b\,\th^Tv\ge0$, so the blue point cannot be on the correct side.
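To illustrate, here is a minimal Python sketch; the function name `separable` and the LP feasibility check are my own, not part of your setup. Since any separating $\th$ can be rescaled, strict separation is equivalent to the feasibility of the linear constraints $\th^Tx_i\ge1$ (red) and $\th^Tx_i\le-1$ (blue), which a linear program can decide. On a toy instance of case (iii), no separating $\th$ exists.

```python
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # Feasibility of: theta^T x_i >= 1 for red (y=1), theta^T x_i <= -1 for blue (y=0),
    # written as A_ub @ theta <= b_ub with b_ub = -1 for every point.
    A = np.where(y[:, None] == 1, -X, X)
    b = -np.ones(len(y))
    res = linprog(c=np.zeros(X.shape[1]), A_ub=A, b_ub=b, bounds=(None, None))
    return res.success

# Obstruction (iii): a blue point of the form a*u + b*v with a, b >= 0 and u, v red.
u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
X = np.vstack([u, v, 0.5 * u + 2.0 * v])
y = np.array([1, 1, 0])
print(separable(X, y))   # False: no theta attains zero training error
```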
On the other hand, if the training data $\{(x_1,y_1),\dots,(x_n,y_n)\}$ admits some $\th_*\in\R^p$ that separates the red and blue points (that is, has zero training error), then your formula
$$\th^*:= \text{arg max}_{\th\in\R^p}\sum_{i=1}^n\bigl(y_i\ln h_{\th}(x_i)+(1-y_i)\ln(1-h_{\th}(x_i))\bigr)$$
makes no sense, because then the supremum of
$$H(\th):=\sum_{i=1}^n\bigl(y_i\ln h_{\th}(x_i)+(1-y_i)\ln(1-h_{\th}(x_i))\bigr)$$
over all $\th\in\R^p$ is not attained. Rather, this supremum (equal to $0$) is "attained" only in the limit, when $\th=t\th_*$ and $t\to\infty$, where, as above, $\th_*\in\R^p$ separates the red and blue points (that is, has zero training error).
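This is easy to see numerically. Below is a minimal Python sketch on a toy separable dataset of my own making, with $h_\th(x)=1/(1+e^{-\th^Tx})$ as in your model: along the ray $\th=t\th_*$, the log-likelihood $H(t\th_*)$ increases toward $0$ but never reaches it for finite $t$.

```python
import numpy as np

def H(theta, X, y):
    # Log-likelihood of the logistic model h_theta(x) = 1 / (1 + exp(-theta^T x))
    z = X @ theta
    log_h = -np.logaddexp(0.0, -z)    # log h_theta(x), numerically stable
    log_1mh = -np.logaddexp(0.0, z)   # log(1 - h_theta(x))
    return np.sum(y * log_h + (1 - y) * log_1mh)

# Tiny separable dataset in R^1; theta_star = (1,) separates it (red: x>0, blue: x<0).
X = np.array([[1.0], [2.0], [-1.0], [-3.0]])
y = np.array([1, 1, 0, 0])
theta_star = np.array([1.0])

for t in (1, 10, 100, 1000):
    print(f"t = {t:5d}   H(t * theta_star) = {H(t * theta_star, X, y):.6f}")
# H approaches its supremum 0 from below but does not attain it at any finite t.
```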
Best Answer
The proof you are looking for should be in
Check the relationship of your curvature condition with Assumption 2 of that paper. The proof there should give you the main ideas, in particular the standard argument to prove that $\hat \theta-\theta^*\in C_3(S)$.
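For orientation, here is a sketch of that standard cone argument in the simplest setting I can state it in, namely the lasso $\hat\theta:=\text{arg min}_{\theta}\bigl(\frac 1{2n}\|y-X\theta\|_2^2+\lambda\|\theta\|_1\bigr)$ for the linear model $y=X\theta^*+\varepsilon$, with $S$ the support of $\theta^*$ and a penalty level $\lambda\ge2\|X^T\varepsilon\|_\infty/n$; your setup may differ in the details. By optimality of $\hat\theta$,
$$\frac 1{2n}\|y-X\hat\theta\|_2^2+\lambda\|\hat\theta\|_1\le\frac 1{2n}\|y-X\theta^*\|_2^2+\lambda\|\theta^*\|_1.$$
Writing $v:=\hat\theta-\theta^*$, expanding the squares, and using Hölder's inequality together with the choice of $\lambda$ and the bound $\|\theta^*\|_1-\|\hat\theta\|_1\le\|v_S\|_1-\|v_{S^c}\|_1$ (which holds because $\theta^*_{S^c}=0$), one gets
$$0\le\frac 1{2n}\|Xv\|_2^2\le\frac\lambda2\|v\|_1+\lambda\bigl(\|v_S\|_1-\|v_{S^c}\|_1\bigr)=\frac{3\lambda}2\|v_S\|_1-\frac\lambda2\|v_{S^c}\|_1,$$
so that $\|v_{S^c}\|_1\le3\|v_S\|_1$, that is, $\hat\theta-\theta^*\in C_3(S)$.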
This proof fails for random design matrices when $|S|\gg\sqrt n$, because the curvature condition in the $\ell_\infty$ norm cannot hold. It is possible to overcome this difficulty for Gaussian random design matrices; see