The $\frac{1}{m}$ is there to "average" the squared error over the number of data points, so that the size of the dataset doesn't affect the scale of the function (see John's answer).
So now the question is why there is an extra $\frac{1}{2}$. In short, it doesn't matter. The solution that minimizes $J$ as you have written it will also minimize $2J=\frac{1}{m} \sum_i (h(x_i)-y_i)^2$. The latter function, $2J$, may seem more "natural," but the factor of $2$ does not matter when optimizing.
The only reason some authors like to include it is because when you take the derivative with respect to the parameters $\theta$, the $2$ goes away.
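Concretely, differentiating with respect to a single parameter $\theta_j$ shows the cancellation:
$$ \frac{\partial}{\partial \theta_j}\left[ \frac{1}{2m} \sum_i (h(x_i) - y_i)^2 \right] = \frac{1}{2m} \sum_i 2\,(h(x_i) - y_i)\,\frac{\partial h(x_i)}{\partial \theta_j} = \frac{1}{m} \sum_i (h(x_i) - y_i)\,\frac{\partial h(x_i)}{\partial \theta_j}. $$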
Let's try to derive from first principles why the logarithm appears in the cost function of logistic regression.
So we have a dataset $\mathbf{X}$ consisting of $m$ data points, each with $n$ features, and a class variable $\mathbf{y}$, a vector of length $m$ whose entries take the values 1 or 0.
Now logistic regression says that the probability that the class variable takes the value $y_i = 1$, for $i = 1, 2, \dots, m$, can be modelled as follows:
$$
P( y_i =1 | \mathbf{x}_i ; \theta) = h_{\theta}(\mathbf{x}_i) = \dfrac{1}{1+e^{(- \theta^T \mathbf{x}_i)}}
$$
so $y_i = 1$ with probability $h_{\theta}(\mathbf{x}_i)$ and $y_i=0$ with probability $1-h_{\theta}(\mathbf{x}_i)$.
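In code, this hypothesis is just the logistic (sigmoid) function applied to the linear score $\theta^T \mathbf{x}_i$. A minimal sketch in Python (function names and numbers are my own choices, not from the original):

```python
import numpy as np

def sigmoid(z):
    """The logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x): the modelled probability that y = 1 given features x."""
    return sigmoid(theta @ x)

theta = np.array([0.5, -1.0, 2.0])   # example parameters (intercept first)
x_i = np.array([1.0, 0.3, 0.8])      # one data point, with a leading 1 for the intercept
print(hypothesis(theta, x_i))        # sigmoid(1.8), approximately 0.858
```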
These two cases can be combined into a single equation as follows (in fact, $y_i$ follows a Bernoulli distribution):
$$ P(y_i ) = h_{\theta}(\mathbf{x}_i)^{y_i} (1 - h_{\theta}(\mathbf{x}_i))^{1-y_i}$$
$P(y_i)$, written out fully as $P(y_i \mid \mathbf{x}_i; \theta)$, is known as the likelihood of the single data point $(\mathbf{x}_i, y_i)$: given the features $\mathbf{x}_i$ and the parameters $\theta$, it is the probability of observing the label $y_i$.
Assuming the data points are independent, the likelihood of the entire dataset is the product of the individual data point likelihoods. Thus
$$ P(\mathbf{y} \mid \mathbf{X}; \theta) = \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i; \theta) = \prod_{i=1}^{m} h_{\theta}(\mathbf{x}_i)^{y_i} (1 - h_{\theta}(\mathbf{x}_i))^{1-y_i}$$
Now the principle of maximum likelihood says that we should find the parameters $\theta$ that maximise the likelihood $P(\mathbf{y} \mid \mathbf{X}; \theta)$.
As mentioned in the comment, logarithms are used because they convert products into sums and do not alter the location of the maximum, being monotonically increasing functions. Here too the likelihood has a product form, so we take the natural logarithm; maximising the likelihood is the same as maximising the log-likelihood. The log-likelihood $L(\theta)$ is then:
$$ L(\theta) = \log P(\mathbf{y} \mid \mathbf{X}; \theta) = \sum_{i=1}^{m} \left[ y_i \log(h_{\theta}(\mathbf{x}_i)) + (1-y_i) \log(1 - h_{\theta}(\mathbf{x}_i)) \right]. $$
Since in linear regression we found the $\theta$ that minimizes the cost function, here too, for the sake of consistency, we would like to have a minimization problem, and we want the average cost over all data points. Currently we have a maximization of $L(\theta)$. Maximizing $L(\theta)$ is equivalent to minimizing $-L(\theta)$, and using the average cost over all data points, the cost function for logistic regression comes out to be:
$$ J(\theta) = - \dfrac{1}{m} L(\theta)$$
$$ = - \dfrac{1}{m} \left( \sum_{i=1}^{m} y_i \log (h_{\theta}(\mathbf{x}_i)) + (1-y_i) \log (1 - h_{\theta}(\mathbf{x}_i)) \right )$$
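A direct translation of this cost into code might look as follows (a sketch with names of my own; the clipping step just keeps `log()` finite and is a standard numerical precaution, not part of the derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum_i [y_i*log(h_i) + (1-y_i)*log(1-h_i)]."""
    h = sigmoid(X @ theta)           # h_theta(x_i) for every row x_i of X
    h = np.clip(h, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```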
Now we can also understand why the cost for a single data point comes out as it does:
the cost for a single data point is $-\log P(y_i \mid \mathbf{x}_i; \theta)$, which can be written as $-\left( y_i \log (h_{\theta}(\mathbf{x}_i)) + (1 - y_i) \log (1 - h_{\theta}(\mathbf{x}_i)) \right)$.
We can now split this into two cases depending on the value of $y_i$. Thus we get
$J(h_{\theta}(\mathbf{x}_i), y_i) = - \log (h_{\theta}(\mathbf{x}_i)) , \text{ if } y_i=1$, and
$J(h_{\theta}(\mathbf{x}_i), y_i) = - \log (1 - h_{\theta}(\mathbf{x}_i)) , \text{ if } y_i=0 $.
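To get a feel for these two branches: if $y_i = 1$ and the model is confidently right, say $h_{\theta}(\mathbf{x}_i) = 0.9$, the cost is $-\log(0.9) \approx 0.105$; if it is confidently wrong, say $h_{\theta}(\mathbf{x}_i) = 0.1$, the cost is $-\log(0.1) \approx 2.303$, and it grows without bound as $h_{\theta}(\mathbf{x}_i) \to 0$. Confident wrong predictions are punished heavily.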
Best Answer
Let me see if the following helps. First, remember the setup: we have a dataset $D$ of input-output pairs $(x, y)$, and we want a hypothesis function $H_\theta$, parameterized by $\theta$, such that $H_\theta(x)$ is a good prediction of $y$.
So our goal is to produce such a function $H_\theta$ using $D$. One way to do this is to frame our goal as an optimization problem: we define a cost function, denoted $J(\theta)$, that measures how bad our current $H_\theta$ is, which we will then try to minimize.
In essence, if $J(\theta)$ is very small (i.e. the cost is low), then we can say that $H_\theta$ is doing a good job on $D$. This suggests we should sum over $D$: for every point $(x,y)$ in the dataset, we measure the error that $H_\theta$ makes on that point, i.e. how far $H_\theta(x)$ is from $y$. The total error, or cost, is then the sum of the errors on these individual points.
Let's try something simple: $$ J_{\text{bad}}(\theta) = \sum_i H_\theta(x^i) - y^i $$ so if $H_\theta(x^i) = y^i$ all the time, we get $J_{\text{bad}}=0$. That's good, but there's a problem: the individual terms can be negative, so errors in opposite directions cancel out, and if we optimize this, we will just want to make $H_\theta$ very negative and drive the "cost" toward $-\infty$! That's just dumb, so we need to fix this. One simple way is to take the square: $$ J_{\text{better}}(\theta) = \sum_i (H_\theta(x^i) - y^i)^2 $$ This is much better. The smallest that $J_{\text{better}}$ can ever be is zero, and this can only happen if $H_\theta$ scores perfectly on every data point. Since it is always positive, it reminds one of an energy, which is nice.
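A tiny numerical sketch (toy numbers of my own) of the difference: under $J_{\text{bad}}$ positive and negative errors cancel, while $J_{\text{better}}$ only reaches zero at a perfect fit.

```python
import numpy as np

y      = np.array([1.0, 2.0, 3.0])
pred_a = np.array([1.0, 2.0, 3.0])   # perfect predictions
pred_b = np.array([0.0, 2.0, 4.0])   # wrong, but the errors cancel

j_bad    = lambda pred: np.sum(pred - y)
j_better = lambda pred: np.sum((pred - y) ** 2)

print(j_bad(pred_a), j_bad(pred_b))        # 0.0 0.0 -> J_bad can't tell them apart
print(j_better(pred_a), j_better(pred_b))  # 0.0 2.0 -> only the perfect fit scores 0
```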
Now to finish off the cost function. First, we divide by $m$, so that instead of being the total error (or cost) of the function, it is the average error instead. Then, we also divide by $2$, because there is a square in the cost function. So, when we take the derivative (which we will, in order to optimize it), the square will generate a $2$ and cancel out. It's just aesthetics really.
So anyway, the final function is: $$ J(\theta) = \frac{1}{2m}\sum_i (H_\theta(x^i) - y^i)^2 $$
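Putting it together in code (a minimal sketch assuming numpy arrays and a linear hypothesis $H_\theta(x) = \theta^T x$, as in linear regression; the names are my own). Note how the $2$ produced by differentiating the square cancels the $\frac{1}{2}$, leaving a clean $\frac{1}{m}$ in the gradient:

```python
def linear_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (x^i . theta - y^i)^2."""
    m = len(y)
    r = X @ theta - y              # residuals H_theta(x^i) - y^i
    return (r @ r) / (2 * m)

def linear_cost_grad(theta, X, y):
    """Gradient of J: (1/m) * X^T (X theta - y); the 2 has already cancelled."""
    m = len(y)
    return X.T @ (X @ theta - y) / m
```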
Recap: (1) We want to make every $H_\theta(x^i) \approx y^i$, so we define a distance between them (their squared difference, because we want to minimize to zero and want the "energy" or "cost" to always be positive) called the error, and sum over the whole dataset to get the total error. (2) We divide by $2m$ because it is more aesthetically pleasing to consider the average error and we want to cancel the inevitable $2$ from differentiation.
Sorry if this was too basic; hopefully it helped!
Edit: note that when we are considering the error, we can use other notions of distance. For instance, we could instead use: $$ J_p(\theta) = \frac{1}{m}\sum_i |H_\theta(x^i) - y^i|^p $$ Note that there will be a difference: the higher $p$ is, the more impact outliers will have on the function. The case $p=2$ is a natural one in ML because it is (the square of) the classical Euclidean distance. The case $p=1$ is certainly reasonable as well (though it comes with the drawback of being non-differentiable at $0$).
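A quick sketch (toy residuals of my own) of how the exponent $p$ changes the influence of a single outlier:

```python
import numpy as np

residuals = np.array([0.1, -0.2, 0.1, 5.0])   # the last point is an outlier

for p in (1, 2, 4):
    costs = np.abs(residuals) ** p
    # outlier's share of total cost: ~93% (p=1), ~99.8% (p=2), ~100% (p=4)
    print(f"p={p}: J_p={costs.mean():.3f}, "
          f"outlier share={costs[-1] / costs.sum():.1%}")
```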
However, for linear regression specifically, there is extra reason to use the squared distance, also called the ordinary least squares (OLS) problem (note: this is mathematically slightly more advanced). If we assume that our dataset $D$ has been generated from a linear model with Gaussian noise $Y=\alpha X + \mathcal{E}$, then the minimizer of OLS is the maximum likelihood estimator of the model parameters. See also this post. So there is a beautiful connection between statistics, linear algebra, and optimization in this case.
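A sketch of why that holds (writing the model per data point): if $y_i = \alpha x_i + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, the log-likelihood of the dataset is
$$ \log L(\alpha) = \sum_{i=1}^{m} \log\!\left( \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - \alpha x_i)^2}{2\sigma^2}} \right) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{m} (y_i - \alpha x_i)^2, $$
so maximising the likelihood over $\alpha$ is exactly the same as minimising $\sum_i (y_i - \alpha x_i)^2$, the OLS objective.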
See also this thread, which talks a little more about other reasons to choose the sum of squares rather than the sum of absolute values.