First, we need to understand what a Markov chain is. Consider the following weather example from Wikipedia. Suppose that the weather on any given day can be classified into two states only: sunny and rainy. Based on past experience, we know the following:
$P(\text{Next day is Sunny}\,\vert\,\text{Today is Rainy}) = 0.50$
Since the next day's weather is either sunny or rainy, it follows that:
$P(\text{Next day is Rainy}\,\vert\,\text{Today is Rainy}) = 0.50$
Similarly, let:
$P(\text{Next day is Rainy}\,\vert\,\text{Today is Sunny}) = 0.10$
Therefore, it follows that:
$P(\text{Next day is Sunny}\,\vert\,\text{Today is Sunny}) = 0.90$
The above four numbers can be compactly collected into a transition matrix, whose entry in row $i$ and column $j$ is the probability of moving from state $i$ today to state $j$ tomorrow, with the states ordered (Sunny, Rainy):
$P = \begin{bmatrix}
0.9 & 0.1 \\
0.5 & 0.5
\end{bmatrix}$
We might ask several questions whose answers follow:
Q1: If the weather is sunny today then what is the weather likely to be tomorrow?
A1: Since we do not know for sure what is going to happen, the best we can say is that there is a $90\%$ chance it will be sunny and a $10\%$ chance it will be rainy.
Q2: What about two days from today?
A2: The one-day forecast is $90\%$ sunny, $10\%$ rainy. Two days from now, there are two ways for the weather to be sunny:
Tomorrow it is sunny and the day after it is also sunny. The chance of this happening is $0.9 \times 0.9$.
Or
Tomorrow it is rainy and the day after it is sunny. The chance of this happening is $0.1 \times 0.5$.
Therefore, the probability that the weather will be sunny in two days is:
$P(\text{Sunny 2 days from now}) = 0.9 \times 0.9 + 0.1 \times 0.5 = 0.81 + 0.05 = 0.86$
Similarly, the probability that it will be rainy is:
$P(\text{Rainy 2 days from now}) = 0.9 \times 0.1 + 0.1 \times 0.5 = 0.09 + 0.05 = 0.14$
In terms of linear algebra, these calculations enumerate all possible transitions from one step to the next (sunny-to-sunny, sunny-to-rainy, rainy-to-sunny, rainy-to-rainy) together with their probabilities. The probability of a future state (at time $t+1$ or $t+2$), given the probability mass function (PMF) over the states (sunny or rainy) at time $t_0$ (now), is then obtained by simple matrix multiplication: the row vector of current state probabilities times the transition matrix, once per step.
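These one- and two-step forecasts can be reproduced with a couple of lines of illustrative NumPy, treating the forecast as a row vector of state probabilities multiplied by the transition matrix:

```python
import numpy as np

# Transition matrix: rows = today's state, columns = tomorrow's state,
# with states ordered (Sunny, Rainy).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

today = np.array([1.0, 0.0])   # PMF at time 0: it is sunny today

tomorrow = today @ P           # one-step forecast: [0.9, 0.1]
two_days = today @ P @ P       # two-step forecast: [0.86, 0.14]
print(tomorrow, two_days)
```

Multiplying by $P$ once per step is exactly the sum over all transition paths computed by hand above.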
If you keep forecasting the weather like this, you will notice that eventually the $n$-th day forecast, where $n$ is large (say $30$), settles to the following 'equilibrium' probabilities:
$P(\text{Sunny}) = 0.833$
and
$P(\text{Rainy}) = 0.167$
In other words, your forecast for the $n$-th day and the $(n+1)$-th day are the same. In addition, you can check that these 'equilibrium' probabilities do not depend on the weather today: you get the same forecast whether you start off by assuming that the weather today is sunny or rainy.
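Both claims are easy to check numerically; a small sketch, iterating the chain 30 times from each starting state as described above:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

for start in ([1.0, 0.0], [0.0, 1.0]):   # sunny today vs. rainy today
    pmf = np.array(start)
    for _ in range(30):                  # iterate the chain 30 steps
        pmf = pmf @ P
    print(np.round(pmf, 3))              # [0.833 0.167] from either start
```

The exact equilibrium here is $(5/6, 1/6) \approx (0.833, 0.167)$, the left eigenvector of $P$ with eigenvalue $1$.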
The above example will only work if the state transition probabilities satisfy several conditions which I will not discuss here. But notice the following feature of this 'nice' Markov chain (nice = the transition probabilities satisfy those conditions):
Irrespective of the initial starting state we will eventually reach an equilibrium probability distribution of states.
Markov Chain Monte Carlo exploits the above feature as follows:
We want to generate random draws from a target distribution. We then identify a way to construct a 'nice' Markov chain such that its equilibrium probability distribution is our target distribution.
If we can construct such a chain then we arbitrarily start from some point and iterate the Markov chain many times (like how we forecast the weather $n$ times). Eventually, the draws we generate would appear as if they are coming from our target distribution.
We then approximate the quantities of interest (e.g., the mean) by taking the sample average of the draws after discarding a few initial draws; this averaging is the Monte Carlo component.
There are several ways to construct 'nice' Markov chains (e.g., Gibbs sampler, Metropolis-Hastings algorithm).
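As a concrete sketch of the idea, here is a minimal random-walk Metropolis sampler whose equilibrium distribution is a standard normal; the target, step size, chain length, and burn-in are all illustrative choices, not tuned values:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Log of the target density up to an additive constant
    # (here a standard normal).
    return -0.5 * x**2

def metropolis(n_draws, step=1.0, x0=10.0):
    """Random-walk Metropolis: a Markov chain whose equilibrium
    distribution is the target density."""
    x = x0
    draws = np.empty(n_draws)
    for i in range(n_draws):
        proposal = x + step * rng.normal()
        # Accept with probability min(1, target(proposal) / target(x)).
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal
        draws[i] = x
    return draws

draws = metropolis(50_000)
burned = draws[5_000:]              # discard initial draws ("burn-in")
print(burned.mean(), burned.std())  # roughly 0 and 1, as for N(0, 1)
```

Note the deliberately bad starting point ($x_0 = 10$): the early draws reflect it, but after enough iterations the draws look as if they come from the target, which is exactly the equilibrium property exploited above.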
It's not always wise to jump straight into taking derivatives (indeed there are a number of widely used densities where this will be useless, or if you're sufficiently rash in going about it, may lead you to select values that are not the mode).
Let's think a little first. That often pays handsome dividends.
The location of the mode of a density is given by the location of the mode of its log: a strictly monotonic transformation changes the height but not which arguments correspond to the largest value. Note further that shifting the function up or down by additive constants, or scaling it by positive multiplicative constants, doesn't alter the location of a mode either.
So in the case of the normal, after we take logs and then drop the additive and multiplicative constants, it suffices to find the modes of the function $h(x) = -(x-\mu)^2$.
Since this is the negative of a perfect square we can identify the (now plainly) unique mode by inspection, though if you feel compelled to use calculus at this stage it's quite straightforward.
Indeed, as whuber notes in comments, we can simply recognize that $(x-\mu)$ shifts the location of the function right by $\mu$. That shifts its mode by $\mu$, reducing the problem to finding the vertex of the parabola $h(x) = -x^2$, whose location is trivially implied by the fact that $x^2$ is non-negative (and only zero at $x=0$). [In more complicated problems, if all occurrences of $x$ are in the form $(x-\kappa)$ for some $\kappa$, consider $x^* = x-\kappa$, find the mode of the new function of $x^*$ ($x_0^*$, say), and then the original mode is at $x_0 = x_0^* + \kappa$.]
Such simple notions often greatly simplify finding the optimum.
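As a quick numerical sanity check (with an arbitrary illustrative $\mu$), the argmax of the simplified function $h(x) = -(x-\mu)^2$ does sit at $\mu$:

```python
import numpy as np

mu = 2.5                            # arbitrary illustrative mean
x = np.linspace(mu - 5, mu + 5, 100_001)

h = -(x - mu) ** 2                  # log-density with constants dropped
print(x[np.argmax(h)])              # ~2.5: the mode is at mu
```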
Have you considered using a nearest neighbour approach? E.g., build a list of the $k$ nearest neighbours for each of the 100'000 points, and then consider the data point with the smallest distance to its $k$-th neighbour a mode. In other words: find the point with the 'smallest bubble' containing $k$ other points around it. I'm not sure how robust this is, and the choice of $k$ obviously influences the results.
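A rough sketch of this idea on toy data ($k$ and the data are arbitrary illustrative choices; at 100'000 points you would want a spatial index such as `scipy.spatial.cKDTree` rather than the brute-force distances below):

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(0.0, 1.0, size=(2_000, 2))  # toy cloud, densest near (0, 0)

k = 20
# Brute-force pairwise distances (fine only at this toy scale).
diffs = points[:, None, :] - points[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))
dists.sort(axis=1)              # row i: sorted distances from point i;
radius_k = dists[:, k]          # column 0 is the self-distance (0),
                                # so column k is the k-th neighbour

mode = points[np.argmin(radius_k)]   # the point with the 'smallest bubble'
print(mode)                          # somewhere near the density peak (0, 0)
```

The smallest $k$-th-neighbour radius corresponds to the highest $k$-nearest-neighbour density estimate, so the winning point should land near the true peak of the cloud.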