The answer to the question that @amoeba referenced in the comment on your question covers this quite well, but I would like to make two points.
First, to expand on a point in that answer, the objective being minimized is not the negative log of the softmax function. Rather, it is a variant of noise-contrastive estimation (NCE), which boils down to a set of $K$ logistic regressions: one for the positive sample (i.e., the true context word given the center word), and the remaining $K-1$ for the negative samples (i.e., the false/fake context words given the center word).
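As a concrete illustration, here is a minimal numpy sketch (with illustrative variable names, not word2vec's actual code) of how that objective decomposes into one logistic loss for the true pair plus one logistic loss per negative pair:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center, true_context, negative_contexts):
    """Sum of logistic losses: one for the true (center, context) pair
    and one per corrupted pair, following the skip-gram
    negative-sampling objective (a variant of NCE)."""
    # positive term: push the inner product with the true context up
    loss = -np.log(sigmoid(center @ true_context))
    # negative terms: push the inner products with the noise words down
    for neg in negative_contexts:
        loss -= np.log(sigmoid(-(center @ neg)))
    return loss
```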
Second, the reason you would want a large negative inner product between the false context words and the center word is that this implies the words are maximally dissimilar. To see this, consider the formula for the cosine similarity between two vectors $x$ and $y$:
$$
s_{cos}(x, y) = \frac{x^Ty}{\|x\|_2\,\|y\|_2}
$$
This attains a minimum of $-1$ when $x$ and $y$ are oriented in opposite directions and equals $0$ when $x$ and $y$ are perpendicular. If they are perpendicular, they contain none of the same information, while if they are oriented oppositely, they contain opposite information. If you imagine word vectors in 2D, this is like saying that the word "bright" has the embedding $[1\;0],$ "dark" has the embedding $[-1\;0],$ and "delicious" has the embedding $[0\;1].$ In our simple example, "bright" and "dark" are opposites. Predicting that something is "dark" when it is "bright" would be maximally incorrect as it would convey exactly the opposite of the intended information. On the other hand, the word "delicious" carries no information about whether something is "bright" or "dark", so it is oriented perpendicularly to both.
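A tiny numpy sketch of this toy example, just to make the numbers concrete:

```python
import numpy as np

def cosine_similarity(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

bright    = np.array([ 1.0, 0.0])
dark      = np.array([-1.0, 0.0])
delicious = np.array([ 0.0, 1.0])

print(cosine_similarity(bright, dark))       # -1.0: opposite directions
print(cosine_similarity(bright, delicious))  #  0.0: perpendicular, unrelated
```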
This is also a reason why embeddings learned from word2vec perform well at analogical reasoning, which involves sums and differences of word vectors. You can read more about the task in the word2vec paper.
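For concreteness, analogies are usually solved by vector arithmetic followed by a nearest-neighbour search under cosine similarity. The sketch below assumes a hypothetical `embeddings` dictionary mapping words to trained vectors:

```python
import numpy as np

def most_similar(query, embeddings, exclude):
    """Return the word whose embedding has the highest cosine
    similarity with `query`, skipping the words in `exclude`."""
    best_word, best_score = None, -np.inf
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        score = (query @ vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# e.g. the classic "king - man + woman ~ queen" analogy from the paper,
# assuming `embeddings` holds trained word2vec vectors:
# answer = most_similar(
#     embeddings["king"] - embeddings["man"] + embeddings["woman"],
#     embeddings, exclude={"king", "man", "woman"})
```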
If you normalize the feature vectors, the dot product is the same as cosine similarity. As for the general question, there is usually no single approach that you should always use. The choice is typically empirical: you try different similarity measures and compare the results. Sometimes the choice is effectively arbitrary: if the metrics give very similar results in practice, you just pick one.
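A quick numpy check of that equivalence (the vectors here are arbitrary examples):

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, 2.0])

cosine = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
dot_of_normalized = (x / np.linalg.norm(x)) @ (y / np.linalg.norm(y))

print(np.isclose(cosine, dot_of_normalized))  # True
```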
Best Answer
The issue
There are some issues with learning word vectors using a "standard" neural network. In that approach, the word vectors are learned while the network learns to predict the next word given a window of words (the input of the network).
Predicting the next word is like predicting a class. That is, such a network is just a "standard" multinomial (multi-class) classifier, and it must have as many output neurons as there are classes. When the classes are actual words, the number of neurons is, well, huge.
A "standard" neural network is usually trained with a cross-entropy cost function which requires the values of the output neurons to represent probabilities - which means that the output "scores" computed by the network for each class have to be normalized, converted into actual probabilities for each class. This normalization step is achieved by means of the softmax function. Softmax is very costly when applied to a huge output layer.
The (a) solution
In order to deal with this issue, that is, the expensive computation of the softmax, Word2Vec uses a technique called noise-contrastive estimation. This technique was introduced by [A] (and reformulated by [B]), then used in [C], [D], and [E] to learn word embeddings from unlabelled natural language text.
The basic idea is to convert the multinomial classification problem (which is what predicting the next word amounts to) into a binary classification problem. That is, instead of using softmax to estimate a true probability distribution over the output word, binary logistic regression (binary classification) is used.
For each training sample, the enhanced (optimized) classifier is fed a true pair (a center word and another word that appears in its context) and $k$ randomly corrupted pairs (each consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from the corrupted ones, the classifier ultimately learns the word vectors.
This is important: instead of predicting the next word (the "standard" training technique), the optimized classifier simply predicts whether a pair of words is good or bad.
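A rough sketch of how such training pairs could be constructed (word ids and the uniform noise distribution here are simplifications; Word2Vec's actual noise distribution is described next):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_pairs(center, true_context, vocab_size, k):
    """One true pair (label 1) plus k corrupted pairs (label 0).
    The corrupted pairs reuse the center word with random vocabulary
    words; a uniform noise distribution is used here for simplicity."""
    pairs = [(center, true_context, 1)]
    noise_words = rng.integers(0, vocab_size, size=k)
    pairs += [(center, int(w), 0) for w in noise_words]
    return pairs

# e.g. center word id 42, true context word id 7, 5 negatives
print(make_training_pairs(42, 7, vocab_size=10_000, k=5))
```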
Word2Vec slightly customizes this process and calls it negative sampling. In Word2Vec, the words used for the negative samples (the corrupted pairs) are drawn from a specially designed distribution that favours less frequent words, so rarer words are drawn more often than their raw frequency would suggest.
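A small sketch of that distribution, assuming it is the unigram distribution raised to the 3/4 power as reported in the word2vec paper (`word_counts` is a hypothetical list of corpus frequencies):

```python
import numpy as np

def negative_sampling_distribution(word_counts, power=0.75):
    """Draw negative samples from the unigram distribution raised to
    the 3/4 power, which boosts rare words relative to their raw
    corpus frequency."""
    probs = np.asarray(word_counts, dtype=float) ** power
    return probs / probs.sum()

counts = [100, 10, 1]                          # toy corpus frequencies
print(negative_sampling_distribution(counts))  # rare words get a larger share than under raw counts
```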
References
[A] (2005) - Contrastive estimation: Training log-linear models on unlabeled data
[B] (2010) - Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
[C] (2008) - A unified architecture for natural language processing: Deep neural networks with multitask learning
[D] (2012) - A fast and simple algorithm for training neural probabilistic language models
[E] (2013) - Learning word embeddings efficiently with noise-contrastive estimation
The answer is based on some older notes of mine - I hope they were correct :)