A bit of context on logistic regression. When dealing with logistic regression, you want to model the probability $p$ of a binary variable ("YES" / "NO", "heart attack" / "no heart attack", etc.) by stating that this probability depends on a certain number of variables, say
$$x_1,\dots,x_p$$
through
$$\log\left(\frac{p}{1-p}\right):=\beta_0+\beta_1 x_1+\dots+\beta_p x_p.$$
This is a modelling choice (we could have used other models, like the probit) motivated by the fact that $p=p(\beta,x)$ is a probability (i.e. a real number between 0 and 1) and the logit transformation
$$\operatorname{logit}(p):=\log\left(\frac{p}{1-p}\right)$$
is invertible, with inverse
$$p(\beta,x)=\frac{1}{1+\exp(-\beta_0-\beta_1 x_1-\dots-\beta_p x_p)}.$$
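As a quick sanity check, here is a minimal sketch (Python with NumPy; the function names are illustrative) of the logit transformation and its inverse:

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1 - p))

def inv_logit(z):
    """Inverse of the logit: maps any real number back to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
print(inv_logit(logit(p)))  # recovers [0.1, 0.5, 0.9] (up to floating point)
```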
Given a set of $n$ measurements (with $n>p$) $(y_{1},x_{1,1},\dots,x_{1,p})$, $\dots$,
$(y_{n},x_{n,1},\dots,x_{n,p})$, with each $y_{i}$ either equal to $0$ (no event) or $1$ (event), we want to estimate the parameters $(\beta_0,\beta_1,\dots,\beta_p)$, i.e. the missing piece in our regression scheme.
This task is usually performed via maximum likelihood, which leads to a numerical algorithm known as Newton–Raphson for the parameter estimation.
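A minimal sketch of that iteration (Python/NumPy; the function name, the fixed number of iterations, and the absence of a convergence test are simplifying assumptions for illustration):

```python
import numpy as np

def logistic_newton_raphson(X, y, n_iter=25):
    """Fit logistic regression coefficients by Newton-Raphson.

    X : (n, p+1) design matrix including a column of ones for the intercept
    y : (n,) vector of 0/1 responses
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        p = 1 / (1 + np.exp(-eta))           # fitted probabilities
        W = p * (1 - p)                      # weights on the diagonal
        grad = X.T @ (y - p)                 # gradient of the log-likelihood
        hess = X.T @ (X * W[:, None])        # X^T W X (negative Hessian)
        beta += np.linalg.solve(hess, grad)  # Newton step; may fail if X^T W X is singular
    return beta
```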
The iterative algorithm provides an answer in most cases; divergences can occur for different reasons, however. One particularly interesting reason is called "complete separation" of one or more variables. You will surely meet this topic in applications.
Try to perform a logistic regression on the following simple data vectors (in this order; please do not shuffle the components):
$$y=(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1)$$
$$x=(0,0,0,0,0.1,0.2,0.3,0.4,0.5,0.6,0,0,0,0.9,1)$$
...what do you get, as an answer?
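If you want to try it yourself, here is a minimal sketch using Python and statsmodels (assuming it is available; the exact messages and output depend on the library version):

```python
import numpy as np
import statsmodels.api as sm

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
x = np.array([0, 0, 0, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0, 0, 0, 0.9, 1])

X = sm.add_constant(x)         # add the intercept column
result = sm.Logit(y, X).fit()  # watch the convergence messages here
print(result.params)
print(result.bse)              # standard errors can blow up when separation occurs
```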
The logistic regression can be theoretically motivated by the principle of maximum entropy: in fact, if we are supposed to use it on the binary variable "YES" / "NO", or "heart attack" / "no heart attack", in the presence of certain constraints, it is possible to show that the probability distribution for such a variable that maximizes the (Shannon) entropy is the logistic distribution. In this sense, logistic regression, which is so common in applications, plays a special role.
Let's try to derive from first principles why the logarithm appears in the cost function of logistic regression.
We have a dataset $\mathbf{X}$ consisting of $m$ data points and $n$ features, and a class variable $\mathbf{y}$, a vector of length $m$ whose entries take the two values $1$ or $0$.
Now logistic regression says that the probability that the class variable takes the value $y_i = 1$, for $i=1,2,\dots,m$, can be modelled as follows:
$$
P( y_i =1 | \mathbf{x}_i ; \theta) = h_{\theta}(\mathbf{x}_i) = \dfrac{1}{1+e^{(- \theta^T \mathbf{x}_i)}}
$$
so $y_i = 1$ with probability $h_{\theta}(\mathbf{x}_i)$ and $y_i=0$ with probability $1-h_{\theta}(\mathbf{x}_i)$.
These two cases can be combined into a single equation as follows (in fact, $y_i$ follows a Bernoulli distribution):
$$ P(y_i ) = h_{\theta}(\mathbf{x}_i)^{y_i} (1 - h_{\theta}(\mathbf{x}_i))^{1-y_i}$$
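A quick check (Python, with a hypothetical value for $h_\theta(\mathbf{x}_i)$) that the combined expression reduces to the correct case for each value of $y_i$:

```python
h = 0.8  # hypothetical value of h_theta(x_i)
for y_i in (0, 1):
    combined = h**y_i * (1 - h)**(1 - y_i)
    print(y_i, combined)  # one factor is always 1, leaving 1 - h for y_i = 0 and h for y_i = 1
```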
$P(y_i)$ is known as the likelihood contribution of the single data point $(\mathbf{x}_i, y_i)$, i.e. given $\mathbf{x}_i$ (and the parameters $\theta$), what is the probability of observing the value $y_i$; it is the conditional probability $P(y_i \mid \mathbf{x}_i; \theta)$.
The likelihood of the entire dataset is the product of the individual data point likelihoods (the observations are assumed independent). Thus
$$ P(\mathbf{y}\mid\mathbf{X};\theta) = \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i;\theta) = \prod_{i=1}^{m} h_{\theta}(\mathbf{x}_i)^{y_i} (1 - h_{\theta}(\mathbf{x}_i))^{1-y_i}$$
Now the principle of maximum likelihood says that we should find the parameters $\theta$ that maximise the likelihood $P(\mathbf{y}\mid\mathbf{X};\theta)$.
As mentioned in the comment, logarithms are used because they convert products into sums and do not alter the location of the maximum, being monotone increasing functions. Here too we have a product form in the likelihood, so we take the natural logarithm: maximising the likelihood is the same as maximising the log-likelihood. The log-likelihood $L(\theta)$ is then:
$$ L(\theta) = \log P(\mathbf{y}\mid\mathbf{X};\theta) = \sum_{i=1}^{m} y_i \log(h_{\theta}(\mathbf{x}_i)) + (1-y_i) \log(1 - h_{\theta}(\mathbf{x}_i)). $$
Since in linear regression we found the $\theta$ that minimizes our cost function, here too, for the sake of consistency, we would like to have a minimization problem, and we want the average cost over all the data points. Currently we have a maximization of $L(\theta)$; maximizing $L(\theta)$ is equivalent to minimizing $-L(\theta)$. Averaging over all data points, the cost function for logistic regression comes out to be
$$ J(\theta) = - \dfrac{1}{m} L(\theta)$$
$$ = - \dfrac{1}{m} \left( \sum_{i=1}^{m} y_i \log (h_{\theta}(\mathbf{x}_i)) + (1-y_i) \log (1 - h_{\theta}(\mathbf{x}_i)) \right )$$
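Here is a short, self-contained sketch of this cost (Python/NumPy; `h` denotes the vector of predicted probabilities, an illustrative name):

```python
import numpy as np

def log_likelihood(h, y):
    """L(theta): log-likelihood given predicted probabilities h and 0/1 labels y."""
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def cost(h, y):
    """J(theta) = -L(theta)/m: average negative log-likelihood."""
    return -log_likelihood(h, y) / len(y)

# toy usage with made-up predictions
h = np.array([0.9, 0.2, 0.8])
y = np.array([1, 0, 1])
print(cost(h, y))
```

In practice one usually clips `h` slightly away from 0 and 1 before taking logarithms, to avoid evaluating $\log(0)$.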
Now we can also understand why the cost for a single data point takes the form it does:
the cost for a single data point is $-\log P(y_i \mid \mathbf{x}_i;\theta)$, which can be written as $-\left( y_i \log (h_{\theta}(\mathbf{x}_i)) + (1 - y_i) \log (1 - h_{\theta}(\mathbf{x}_i)) \right)$.
We can now split the above into two cases depending on the value of $y_i$. Thus we get
$J(h_{\theta}(\mathbf{x}_i), y_i) = - \log (h_{\theta}(\mathbf{x}_i)) , \text{ if } y_i=1$, and
$J(h_{\theta}(\mathbf{x}_i), y_i) = - \log (1 - h_{\theta}(\mathbf{x}_i)) , \text{ if } y_i=0 $.
Best Answer
We are given a data vector $\textbf{x}$ and a class vector $\textbf{y}$. The class vector tells us which of two classes $\{0,1\}$ the data instances belong to.
We want to come up with a function $h_\theta(x_i)$ that helps us estimate the classes $y_i$ as best as we can.
You can think of $h_\theta(x_i)$ as the probability that $y_i=1$, given $x_i$ and $\theta$.
$$P(y_i=1|x_i,\theta)=h_\theta(x_i)$$
Likewise, $1-h_\theta(x_i)$ is the probability that $y_i=0$, given $x_i$ and $\theta$.
$$P(y_i=0|x_i,\theta)=1-h_\theta(x_i)$$
We can combine the two formulas in a clever way using exponents:
$$P(y_i|x_i,\theta)=h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}$$
(Note that one of the terms is always reduced to 1 because one of the exponents is always zero.)
Since all instances are independent, the total probability over all instances $i$ is just the product of all the individual probabilities:
$$P(\textbf{y}|\textbf{x},\theta)=\prod_i h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}$$
We are hoping to maximize the probability of the output vector $\textbf{y}$, or equivalently, maximize its $\textbf{log}$.
$$\log\big( P(\textbf{y}|\textbf{x},\theta)\big)=\log\big(\prod_i h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}\big)$$
$$=\sum_i \log\big(h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}\big)$$
$$=\sum_i \big[y_i\log(h_\theta(x_i)) + (1-y_i)\log(1-h_\theta(x_i))\big]$$
This is a function with a maximum, but since we want to use gradient descent, we can just put a negative sign in front to turn the maximum into a minimum, and scale it by the number of instances $m$ for convenience. (This makes the error more or less invariant to the number of instances.) The negative sign turns the log-likelihood, which we want to maximize, into a cost that we want to minimize.
$$J(\theta) = -\frac{1}{m}\sum_i^m \big[y_i\log(h_\theta(x_i)) + (1-y_i)\log(1-h_\theta(x_i))\big]$$
Why did we bother taking the $\log$? Because it's easier to take the derivative of a sum rather than the derivative of a product (imagine all that product rule!). You'll find this trick is used a lot in machine learning to make differentiation easier.
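Concretely, differentiating the sum term by term gives the familiar gradient $\nabla J(\theta) = \frac{1}{m}\sum_i (h_\theta(x_i) - y_i)\,x_i$, which makes a gradient descent step a one-liner. A minimal sketch (Python/NumPy; the learning rate and the fixed iteration count are arbitrary illustrative choices):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iter=1000):
    """Minimise J(theta) by batch gradient descent.

    X : (m, n+1) design matrix with a leading column of ones (intercept)
    y : (m,) vector of 0/1 labels
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = 1 / (1 + np.exp(-(X @ theta)))  # sigmoid predictions
        grad = X.T @ (h - y) / len(y)       # gradient of J(theta)
        theta -= lr * grad
    return theta
```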
You also asked why we chose $h_\theta(x_i)$ to be the sigmoid function. A couple of remarks on that:
To me it seems like the use of the sigmoid is more of an engineered solution than something we arrived at from a mathematical proof. It has nice properties and seems to work.
In neural networks, some people prefer using alternatives to the sigmoid like $\arctan$ and $\tanh$, but I don't think it makes that much of a difference in most cases.
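One way to see why the choice often matters little: $\tanh$ is just a rescaled and shifted sigmoid, $\tanh(z) = 2\sigma(2z) - 1$. A quick numerical check (Python/NumPy):

```python
import numpy as np

sigmoid = lambda t: 1 / (1 + np.exp(-t))
z = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))  # True: same shape up to rescaling
```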