You are absolutely correct in observing that even though $\mathbf{u}$ (one of the eigenvectors of the covariance matrix, e.g. the first one) and $\mathbf{X}\mathbf{u}$ (projection of the data onto the 1-dimensional subspace spanned by $\mathbf{u}$) are two different things, both of them are often called "principal component", sometimes even in the same text.
In most cases it is clear from the context what exactly is meant. In some rare cases, however, it can indeed be quite confusing, e.g. when some related techniques (such as sparse PCA or CCA) are discussed, where different directions $\mathbf{u}_i$ do not have to be orthogonal. In this case a statement like "components are orthogonal" has very different meanings depending on whether it refers to axes or projections.
I would advocate calling $\mathbf{u}$ a "principal axis" or a "principal direction", and $\mathbf{X}\mathbf{u}$ a "principal component".
I have also seen $\mathbf u$ called "principal component vector".
I should mention that the alternative convention is to call $\mathbf u$ "principal component" and $\mathbf{Xu}$ "principal component scores".
Summary of the two conventions:
$$\begin{array}{c|c|c} & \text{Convention 1} & \text{Convention 2} \\ \hline \mathbf u & \begin{cases}\text{principal axis}\\ \text{principal direction}\\ \text{principal component vector}\end{cases} & \text{principal component} \\ \mathbf{Xu} & \text{principal component} & \text{principal component scores} \end{array}$$
Note: Only eigenvectors of the covariance matrix corresponding to non-zero eigenvalues can be called principal directions/components. If the covariance matrix is low rank, it will have one or more zero eigenvalues; corresponding eigenvectors (and corresponding projections that are constant zero) should not be called principal directions/components. See some discussion in my answer here.
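The distinction in Convention 1 can be made concrete with a minimal NumPy sketch on synthetic data (nothing here is specific to any particular library's PCA implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X -= X.mean(axis=0)  # center the data

# Eigendecomposition of the covariance matrix
cov = X.T @ X / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]   # sort by decreasing eigenvalue
U = eigvecs[:, order]               # columns of U: principal axes / directions

scores = X @ U                      # principal components (Convention 1) / scores

# The axes are orthonormal; the components are the projections of the
# data, and their variances equal the corresponding eigenvalues.
print(np.allclose(U.T @ U, np.eye(3)))                         # True
print(np.allclose(scores.var(axis=0, ddof=1), eigvals[order])) # True
```

Note that `U` lives in feature space (one column per direction, length = number of features), while `scores` lives in sample space (one column per component, length = number of observations), which is another quick way to tell the two apart.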
What distributions can be chosen for $C$, $\gamma$ and $k$?
To reproduce results from other methods, define a box and sample uniformly in the box. This will parallel the procedure of grid search, or any other tuning method, since each point is equally likely a priori.
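As a sketch of that box-sampling idea (the bounds below are hypothetical, chosen only to resemble a typical SVM grid; sampling uniformly on the log scale mirrors the usual log-spaced grid for $C$ and $\gamma$, while $k$ is drawn uniformly over integers):

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 20

# Uniform on the log scale over an assumed box, so each order of
# magnitude is equally likely a priori (parallels a log-spaced grid).
C     = 10.0 ** rng.uniform(np.log10(1e-2), np.log10(1e3), size=n_draws)
gamma = 10.0 ** rng.uniform(np.log10(1e-4), np.log10(1e1), size=n_draws)

# k is an integer hyper-parameter: uniform over an assumed range 1..50.
k = rng.integers(1, 51, size=n_draws)
```

Each of the `n_draws` triples `(C[i], gamma[i], k[i])` would then be evaluated by cross-validation, exactly as the points of a grid would be.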
But if you want some distributions more informative than these, then you'll have to work that out for the problem at hand because that is inherently a context-dependent question: some problems have larger/smaller $\gamma$ and $C$ than others, which is why we tune hyper-parameters in the first place.
If you decide to make this a fully Bayesian problem with informative priors over the hyper-parameters, embedding the problem as a logistic regression gives a direct path to probability models.
Second, let's assume a random search optimization over parameters which have to sum to one. How could one incorporate this constraint into the search?
Use a stick-breaking process. You start with a unit interval and pick a point in it according to some probability distribution over the unit interval. Then you repeat the process $k-1$ times on the subinterval "to the right" (or left) of the chosen point. At the end, you will have $k$ values which sum to 1.
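A minimal sketch of that construction (using a uniform break point, though any distribution on $(0,1)$ would do):

```python
import numpy as np

def stick_breaking(k, rng):
    """Sample k nonnegative values summing to 1 by repeatedly
    breaking off a piece of the remaining stick."""
    remaining = 1.0
    parts = []
    for _ in range(k - 1):
        # Break the remaining interval at a uniform point; the piece to
        # the left becomes the next value, and we recurse on the right.
        frac = rng.uniform()
        parts.append(remaining * frac)
        remaining *= 1.0 - frac
    parts.append(remaining)  # the final leftover piece
    return np.array(parts)

rng = np.random.default_rng(0)
w = stick_breaking(4, rng)
print(w.sum())  # ~1.0, up to floating-point error
```

Each draw of `w` is then a valid point in the simplex to evaluate in the random search.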
You could also review the Stan documentation pertaining to sampling of simplex random variables for an alternative presentation of the concept.
Best Answer
A hyperparameter is a parameter for the (prior) distribution of some parameter.
So for a simple example, let's say we state that the variance parameter $\tau^2$ in some problem has a uniform prior on $(0,\theta)$.
(I personally would be unlikely to do such a thing, but it happens; I might in some very particular circumstance)
Then $\tau^2$ is a parameter (in the distribution of the data) and $\theta$ is a hyperparameter.
If we then in turn specify a (prior) distribution for $\theta$ (e.g. that it's Gamma with mean 100 and shape parameter 2), that's a hyperprior - a prior distribution on a parameter of a prior distribution.
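The whole hierarchy can be written out as a tiny generative sketch (the Normal likelihood is an assumption added purely for illustration; a Gamma with mean 100 and shape 2 has scale 100/2 = 50):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperprior: theta ~ Gamma(shape=2, mean=100), so scale = mean/shape = 50
theta = rng.gamma(shape=2.0, scale=50.0)

# Prior: tau^2 | theta ~ Uniform(0, theta)  (theta is the hyperparameter)
tau2 = rng.uniform(0.0, theta)

# Data model (assumed here): data | tau^2 ~ Normal(0, tau)
data = rng.normal(0.0, np.sqrt(tau2), size=10)
```

Reading bottom-up: `tau2` is a parameter of the data's distribution, `theta` is a hyperparameter (a parameter of `tau2`'s prior), and the Gamma on `theta` is the hyperprior.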