Embeddings (in general, not only in Keras) are methods for learning vector representations of categorical data. They are most commonly used for textual data; Word2vec and GloVe are two popular frameworks for learning word embeddings. What an embedding does is simply learn to map one-hot encoded categorical variables to real-valued vectors of smaller dimensionality than the input vectors. For example, a one-hot vector representing a word from a vocabulary of size 50,000 is mapped to a real-valued vector of size 100. The embedding vector can then be used as features for whatever comes next:
one-hot vector $\to$ real-valued vector $\to$ (additional layers of
the network)
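To make the mapping concrete, here is a minimal numpy sketch (the vocabulary and embedding sizes are taken from the example above; the embedding matrix is random rather than learned). It shows that multiplying a one-hot vector by the embedding matrix is equivalent to looking up a single row, which is what an embedding layer actually does internally:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 50_000, 100            # sizes from the example above
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))  # would be learned

word_index = 42                                # hypothetical word id
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Multiplying the one-hot vector by the embedding matrix...
via_matmul = one_hot @ embedding_matrix

# ...is the same as indexing one row; real embedding layers do the
# lookup directly and never materialize the one-hot vector.
via_lookup = embedding_matrix[word_index]

assert np.allclose(via_matmul, via_lookup)
```

This is also why embedding layers are cheap despite the large vocabulary: the forward pass is a row lookup, not a full matrix multiplication.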
The difference lies in how Word2vec is trained, as compared to "usual" learned embedding layers. Word2vec is trained to predict whether a word belongs to a context, given the other words, e.g. to tell whether "milk" is a likely word given the sentence beginning "The cat was drinking...". By doing so, we expect Word2vec to learn something about the language, as in the quote by John Rupert Firth: "You shall know a word by the company it keeps." Using the above example, Word2vec learns that "cat" is something likely to appear together with "milk", but also with "house" or "pet", so it is somehow similar to "dog". As a consequence, embeddings created by Word2vec, or similar models, learn to represent words with similar meanings using similar vectors.
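"Similar vectors" is usually measured with cosine similarity. A toy sketch with made-up 4-dimensional vectors (real Word2vec embeddings have hundreds of dimensions and are learned from a corpus, but the geometry is the same):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings, hand-crafted so that "cat" and "dog" point
# in a similar direction while "car" points elsewhere.
cat = np.array([0.9, 0.8, 0.1, 0.0])
dog = np.array([0.8, 0.9, 0.2, 0.1])
car = np.array([0.0, 0.1, 0.9, 0.8])

assert cosine(cat, dog) > cosine(cat, car)
```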
On the other hand, with embeddings learned as a layer of a neural network, the network may be trained to predict whatever you want. For example, you can train your network to predict the sentiment of a text. In that case, the embeddings learn features that are relevant for this particular problem. As a side effect, they can also pick up some general properties of the language, but the network is not optimized for that task. Using the "cat" example, embeddings trained for sentiment analysis may learn that "cat" and "dog" are similar, because people often say nice things about their pets.
In practical terms, you can use pretrained Word2vec embeddings as features for any neural network (or other algorithm). They can give you an advantage if your data is small, since the pretrained embeddings were trained on large volumes of text. On the other hand, there are examples showing that learning the embeddings from your own data, optimized for a particular problem, may be more effective (Qi et al., 2018).
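A common, very simple way to turn pretrained word vectors into features is to average them over the words of a text. A sketch, assuming a hypothetical embedding table (in practice you would load the vectors from a Word2vec or GloVe file):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained table: word -> 100-d vector (stand-in for
# vectors loaded from a real Word2vec/GloVe model).
pretrained = {w: rng.normal(size=100)
              for w in ["the", "cat", "was", "drinking", "milk"]}

def sentence_features(tokens, table, dim=100):
    """Average the vectors of known words into one fixed-size feature
    vector; out-of-vocabulary words are simply skipped."""
    vecs = [table[t] for t in tokens if t in table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

features = sentence_features(["the", "cat", "likes", "milk"], pretrained)
assert features.shape == (100,)
```

The resulting fixed-size vector can be fed to any downstream classifier, which is what makes pretrained embeddings so convenient when your own dataset is small.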
Qi, Y., Sachan, D. S., Felix, M., Padmanabhan, S. J., & Neubig, G. (2018). When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation? arXiv preprint arXiv:1804.06323.
Best Answer
Short explanation
rho is the "Gradient moving average [also exponentially weighted average] decay factor" and decay is the "Learning rate decay over each update".
Long explanation
RMSProp is defined as follows (the standard formulation, with $g_t$ the gradient at step $t$ and $\eta$ the learning rate):

$$E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho)\, g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$
So RMSProp uses "rho" to calculate an exponentially weighted average over the square of the gradients.
Note that "rho" is a direct parameter of the RMSProp optimizer (it is used in the RMSProp formula).
Decay, on the other hand, handles learning rate decay. Learning rate decay is a mechanism generally applied independently of the chosen optimizer; Keras simply builds it into the RMSProp optimizer for convenience (as it does with other optimizers like SGD and Adam, which all have the same "decay" parameter). You may think of the "decay" parameter as "lr_decay".
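The two roles can be seen side by side in a minimal numpy sketch of the update loop (parameter names follow the Keras arguments discussed above; this is an illustration, not Keras's exact implementation):

```python
import numpy as np

def rmsprop_steps(grad_fn, theta, lr=0.001, rho=0.9, decay=0.0,
                  eps=1e-7, n_steps=100):
    """Toy RMSProp with Keras-style learning-rate decay.

    rho   -> decay factor of the moving average of squared gradients
    decay -> shrinks the learning rate itself after every update
    """
    avg_sq = np.zeros_like(theta)
    for t in range(n_steps):
        g = grad_fn(theta)
        # "rho" decays the running average of the squared gradients...
        avg_sq = rho * avg_sq + (1 - rho) * g ** 2
        # ...while "decay" decays the learning rate itself.
        lr_t = lr / (1 + decay * t)
        theta = theta - lr_t * g / np.sqrt(avg_sq + eps)
    return theta

# Minimize f(x) = x^2 starting from x = 1 (gradient is 2x).
x = rmsprop_steps(lambda th: 2 * th, np.array(1.0),
                  lr=0.05, decay=0.01, n_steps=200)
assert abs(x) < 0.1
```

Note how `rho` appears inside the per-step formula, while `decay` only rescales `lr` over time, which is exactly why the two "decays" never interact directly.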
It can be confusing at first that there are two decay parameters, but they are decaying different values.