The reason is the following. We use the notation:
$$\theta x^i:=\theta_0+\theta_1 x^i_1+\dots+\theta_p x^i_p.$$
Then
$$\log h_\theta(x^i)=\log\frac{1}{1+e^{-\theta x^i} }=-\log ( 1+e^{-\theta x^i} ),$$ $$\log(1- h_\theta(x^i))=\log\left(1-\frac{1}{1+e^{-\theta x^i} }\right)=\log (e^{-\theta x^i} )-\log ( 1+e^{-\theta x^i} )=-\theta x^i-\log ( 1+e^{-\theta x^i} ),$$ [Here we wrote $1 = \frac{1+e^{-\theta x^i}}{1+e^{-\theta x^i}}$ so that the 1's in the numerator cancel, and then used $\log(x/y) = \log(x) - \log(y)$.]
Since our original cost function has the form
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))\right]$$
Plugging in the two simplified expressions above, we obtain
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^m \left[-y^i(\log ( 1+e^{-\theta x^i})) + (1-y^i)(-\theta x^i-\log ( 1+e^{-\theta x^i} ))\right]$$, which can be simplified to:
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^m \left[y^i\theta x^i-\theta x^i-\log(1+e^{-\theta x^i})\right]=-\frac{1}{m}\sum_{i=1}^m \left[y^i\theta x^i-\log(1+e^{\theta x^i})\right],~~(*)$$
where the second equality follows from
$$-\theta x^i-\log(1+e^{-\theta x^i})=
-\left[ \log e^{\theta x^i}+
\log(1+e^{-\theta x^i} )
\right]=-\log(1+e^{\theta x^i}). $$ [We used $\log(x) + \log(y) = \log(xy)$.]
All you need now is to compute the partial derivatives of $(*)$ w.r.t. $\theta_j$. As
$$\frac{\partial}{\partial \theta_j}y^i\theta x^i=y^ix^i_j, $$
$$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i})=\frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}}=x^i_jh_\theta(x^i),$$
the result follows.
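To sanity-check this algebra, here is a minimal numpy sketch (the toy data, variable names, and tolerances are my own) confirming that the simplified form $(*)$ agrees with the original cost and that the resulting gradient $\frac{1}{m}\sum_i (h_\theta(x^i)-y^i)x^i_j$ matches a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 50, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, p))])  # row i is (1, x^i_1, ..., x^i_p)
y = rng.integers(0, 2, size=m).astype(float)
theta = rng.normal(size=p + 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J_original(theta):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def J_star(theta):
    z = X @ theta                                   # theta x^i for every i
    return -np.mean(y * z - np.log(1 + np.exp(z)))

print(np.isclose(J_original(theta), J_star(theta)))  # True: (*) matches the original cost

# analytic gradient: dJ/dtheta_j = (1/m) sum_i (h_theta(x^i) - y^i) x^i_j
grad = X.T @ (sigmoid(X @ theta) - y) / m

# central finite differences of J_star
eps = 1e-6
num_grad = np.array([(J_star(theta + eps * e) - J_star(theta - eps * e)) / (2 * eps)
                     for e in np.eye(p + 1)])
print(np.allclose(grad, num_grad, atol=1e-6))         # True
```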
Let's derive from first principles why the logarithm appears in the cost function of logistic regression.
We have a dataset $\mathbf{X}$ consisting of $m$ data points and $n$ features, and a class variable $\mathbf{y}$, a vector of length $m$ whose entries take the value 1 or 0.
Logistic regression models the probability that the class variable takes the value $y_i = 1$, for $i = 1, 2, \dots, m$, as
$$
P( y_i =1 | \mathbf{x}_i ; \theta) = h_{\theta}(\mathbf{x}_i) = \dfrac{1}{1+e^{(- \theta^T \mathbf{x}_i)}}
$$
so $y_i = 1$ with probability $h_{\theta}(\mathbf{x}_i)$ and $y_i=0$ with probability $1-h_{\theta}(\mathbf{x}_i)$.
These two cases can be combined into a single expression (in other words, $y_i$ follows a Bernoulli distribution):
$$ P(y_i \mid \mathbf{x}_i ; \theta) = h_{\theta}(\mathbf{x}_i)^{y_i} (1 - h_{\theta}(\mathbf{x}_i))^{1-y_i}$$
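As a quick check: setting $y_i = 1$ gives $h_{\theta}(\mathbf{x}_i)^{1}(1 - h_{\theta}(\mathbf{x}_i))^{0} = h_{\theta}(\mathbf{x}_i)$, and setting $y_i = 0$ gives $1 - h_{\theta}(\mathbf{x}_i)$, so the single expression reproduces both cases above.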
$P(y_i \mid \mathbf{x}_i ; \theta)$, viewed as a function of $\theta$, is the likelihood contributed by the single data point $(\mathbf{x}_i, y_i)$: the probability of observing the label $y_i$ given $\mathbf{x}_i$ under the model.
Assuming the data points are independent, the likelihood over the entire dataset is the product of the individual likelihoods. Thus
$$ P(\mathbf{y} \mid \mathbf{X}; \theta) = \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i ; \theta) = \prod_{i=1}^{m} h_{\theta}(\mathbf{x}_i)^{y_i} (1 - h_{\theta}(\mathbf{x}_i))^{1-y_i}$$
The principle of maximum likelihood says that we choose the parameters $\theta$ that maximise this likelihood $P(\mathbf{y} \mid \mathbf{X}; \theta)$.
As mentioned in the comment, logarithms are used because they convert products into sums and do not change the location of the maximum, since they are monotonically increasing functions. Here too the likelihood has a product form, so we take the natural logarithm: maximising the likelihood is the same as maximising the log-likelihood. The log-likelihood $L(\theta)$ is
$$ L(\theta) = \log P(\mathbf{y} \mid \mathbf{X}; \theta) = \sum_{i=1}^{m} y_i \log(h_{\theta}(\mathbf{x}_i)) + (1-y_i) \log(1 - h_{\theta}(\mathbf{x}_i)). $$
In linear regression we found the $\theta$ that minimizes the cost function, so here too, for the sake of consistency, we would like a minimization problem, and we want the average cost over all data points. Currently we have a maximization of $L(\theta)$; maximizing $L(\theta)$ is equivalent to minimizing $-L(\theta)$. Averaging over the data points, the cost function for logistic regression comes out to be
$$ J(\theta) = - \dfrac{1}{m} L(\theta)$$
$$ = - \dfrac{1}{m} \left( \sum_{i=1}^{m} y_i \log (h_{\theta}(\mathbf{x}_i)) + (1-y_i) \log (1 - h_{\theta}(\mathbf{x}_i)) \right )$$
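As a quick illustration of this equivalence, here is a small numpy sketch (one feature, no intercept; the toy data and names are mine) showing that the $\theta$ maximising the log-likelihood $L(\theta)$ is exactly the $\theta$ minimising the cost $J(\theta)$ over a grid:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
x = rng.normal(size=m)
# labels generated from a logistic model with true theta = 2
y = (rng.uniform(size=m) < 1 / (1 + np.exp(-2.0 * x))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta):
    h = sigmoid(theta * x)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

thetas = np.linspace(-1.0, 5.0, 601)
L = np.array([log_likelihood(t) for t in thetas])
J = -L / m                          # the cost is the negated, averaged log-likelihood

print(thetas[np.argmax(L)])         # theta that maximises the log-likelihood ...
print(thetas[np.argmin(J)])         # ... is exactly the theta that minimises the cost
```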
Now we can also see why the cost for a single data point takes the following form:
the cost for a single data point is $-\log P(y_i \mid \mathbf{x}_i ; \theta)$, which can be written as $- \left( y_i \log (h_{\theta}(\mathbf{x}_i)) + (1 - y_i) \log (1 - h_{\theta}(\mathbf{x}_i)) \right)$.
We can now split the above into two depending upon the value of $y_i$. Thus we get
$J(h_{\theta}(\mathbf{x}_i), y_i) = - \log (h_{\theta}(\mathbf{x}_i)) , \text{ if } y_i=1$, and
$J(h_{\theta}(\mathbf{x}_i), y_i) = - \log (1 - h_{\theta}(\mathbf{x}_i)) , \text{ if } y_i=0 $.
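A tiny sketch (helper names are mine) confirming that this case-split cost agrees with the combined expression for a single data point:

```python
import numpy as np

def point_cost_cases(h, y):
    # cost written separately for the two classes
    return -np.log(h) if y == 1 else -np.log(1 - h)

def point_cost_combined(h, y):
    # the single expression -[y log(h) + (1 - y) log(1 - h)]
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

for h in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert np.isclose(point_cost_cases(h, y), point_cost_combined(h, y))
```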
Best Answer
For convenience, define some variables and their differentials $$\eqalign{ z &= X^T\theta, &\,\,\,\, dz = X^Td\theta \cr p &= \exp(z), &\,\,\,\, dp = p\odot dz \cr h &= \frac{p}{1+p}, &\,\,\,\, dh = (h-h\odot h)\odot dz \,= (H-H^2)\,dz \cr }$$ where
$\,\,\,\,H={\rm Diag}(h)$
$\,\,\,\,\odot$ represents the Hadamard elementwise product
$\,\,\,\,\exp$ is applied elementwise
$\,\,\,\frac{p}{1+p}$ represents elementwise division
The cost function can be written in terms of these variables and the Frobenius inner product (represented by a colon) $$\eqalign{ J &= -\frac{1}{m}\Big[y:\log(h) + (1-y):\log(1-h)\Big] \cr }$$
The differential of the cost is $$\eqalign{ dJ &= -\frac{1}{m}\Big[y:d\log(h) + (1-y):d\log(1-h)\Big] \cr &= -\frac{1}{m}\Big[y:H^{-1}dh - (1-y):(I-H)^{-1}dh\Big] \cr &= -\frac{1}{m}\Big[H^{-1}y - (I-H)^{-1}(1-y)\Big]:dh \cr &= -\frac{1}{m}\Big[H^{-1}y - (I-H)^{-1}(1-y)\Big]:H(I-H)dz \cr &= -\frac{1}{m}\Big[(I-H)y - H(1-y)\Big]:dz \cr &= -\frac{1}{m}(y-h):X^Td\theta \cr &= \frac{1}{m}X(h-y):d\theta \cr }$$ The gradient $$\eqalign{ G =\frac{\partial J}{\partial\theta} &= \frac{1}{m}X(h-y) \cr }$$ (NB: Your gradient is missing the $\frac{1}{m}$ factor)
The differential of the gradient $$\eqalign{ dG &= \frac{1}{m}X\,dh \cr &= \frac{1}{m}X(H-H^2)\,dz \cr &= \frac{1}{m}X(H-H^2)X^T\,d\theta \cr\cr }$$ And finally, the gradient of the gradient (aka the Hessian) $$\eqalign{ \frac{\partial^2J}{\partial\theta\,\partial\theta^T} &= \frac{\partial G}{\partial\theta} = \frac{1}{m}X(H-H^2)X^T \cr }$$
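Here is a short numpy check of these formulas (toy data and names are mine), using the same convention as above that the columns of $X$ are the data points, so $z = X^T\theta$. The gradient and Hessian are compared against finite differences of the cost, and the Hessian is then used for one Newton step as an example:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 60
X = rng.normal(size=(n, m))                 # columns are data points
y = rng.integers(0, 2, size=m).astype(float)
theta = rng.normal(size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta):
    h = sigmoid(X.T @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

h = sigmoid(X.T @ theta)
G = X @ (h - y) / m                         # gradient  (1/m) X (h - y)
H = X @ np.diag(h * (1 - h)) @ X.T / m      # Hessian   (1/m) X (H - H^2) X^T

# central finite differences of the cost
eps = 1e-5
G_num = np.array([(cost(theta + eps * e) - cost(theta - eps * e)) / (2 * eps)
                  for e in np.eye(n)])
H_num = np.array([
    [(cost(theta + eps * (ei + ej)) - cost(theta + eps * (ei - ej))
      - cost(theta - eps * (ei - ej)) + cost(theta - eps * (ei + ej))) / (4 * eps**2)
     for ej in np.eye(n)]
    for ei in np.eye(n)])

print(np.allclose(G, G_num, atol=1e-6))     # True
print(np.allclose(H, H_num, atol=1e-4))     # True

# the Hessian makes second-order methods available, e.g. one Newton step:
theta_new = theta - np.linalg.solve(H, G)
```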