It sounds like you are asking a lot of different questions here.
My question is: how should I interpret the p-value? I don't understand what it refers to.
The null hypothesis for Fisher's exact test is that group and outcome are independent, i.e. that the groups do not affect the outcome. Rejecting the null hypothesis indicates that the outcome (a, b, or c) depends on the group.
dta <- matrix(c(2, 12, 1, 5, 3, 1),
              nrow = 2, ncol = 3, byrow = TRUE)
fisher.test(dta)
Fisher's Exact Test for Count Data
data: dta
p-value = 0.05082
alternative hypothesis: two.sided
In this case your $p$ value is approximately 0.05082. I will let you decide whether to reject the null.
Having the p-value, how can I say that one of the three forms is statistically significantly more represented than the others (if true)?
This is a separate question and I'm not sure what you are trying to ask.
You calculate a single $\kappa$ for all the categories at once. The formula for $\kappa$ is:
$$\kappa = \frac{p_{o}-p_{e}}{1-p_{e}} = \frac{N_{o}-N_{e}}{N-N_{e}}$$
This depends only on $p_{o}$ and $p_{e}$, respectively the observed agreement and the chance agreement, or equivalently on the corresponding counts $N_{o}$ and $N_{e}$ out of the $N$ total observations. The number of categories does not matter at all.
Say you have a table of observations $m_{i,j}$ with $k$ categories.
$\sum_i^k{m_{i,i}}$ is the sum of agreements in the table.
$\sum_j^k{m_{l,j}}$ is the sum of counts in the $l$-th row.
$\sum_i^k{m_{i,l}}$ is the sum of counts in the $l$-th column.
So, when you multiply the total count of a row by the total count of the corresponding column and divide by $N$, you get an estimate of the count of agreements expected by chance in that category. Summing this over all categories gives an overall estimate of chance agreement, which you can plug into the $\kappa$ formula.
$$N_o = \sum_i^k{m_{i,i}}$$
$$N_e = \frac{1}{N}\cdot\sum_l^k\left(\sum_j^k{m_{l,j}} \cdot \sum_i^k{m_{i,l}}\right)$$
$$\therefore \kappa = \frac{N_{o}-N_{e}}{N-N_{e}} = \frac{\sum_i^k{m_{i,i}}-\frac{1}{N}\cdot\sum_l^k\left(\sum_j^k{m_{l,j}} \cdot \sum_i^k{m_{i,l}}\right)}{N-\frac{1}{N}\cdot\sum_l^k\left(\sum_j^k{m_{l,j}} \cdot \sum_i^k{m_{i,l}}\right)}$$
And here's sample code in R:
kappa <- function(M) {
  Ne <- sum(rowSums(M) * colSums(M)) / sum(M)  # expected chance agreements, N_e
  (sum(diag(M)) - Ne) / (sum(M) - Ne)          # (N_o - N_e) / (N - N_e)
}
In case of perfect agreement:
> M = table(iris$Species, iris$Species)
> print(kappa(M))
[1] 1
In case of random predictions:
> M = table(sample(iris$Species, 150L), iris$Species)
> print(kappa(M))
[1] -0.09
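Since sample() is random, that last value will vary from run to run; set.seed() makes it reproducible. As a sanity check you can also compare against an existing implementation, for instance cohen.kappa() from the psych package (assuming you have it installed; it accepts a square count matrix):
set.seed(42)                 # make the random-predictions example reproducible
M <- table(sample(iris$Species, 150L), iris$Species)
kappa(M)                     # hand-rolled estimate from above
psych::cohen.kappa(M)$kappa  # unweighted kappa; should match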
Best Answer
I know I am answering the question two years later, but I hope future readers may find this helpful.
Cohen's $\kappa$ tests whether data fall on the diagonal of a classification table more often than chance would predict, whereas Fisher's exact test evaluates the association between two categorical variables.
In some cases, Cohen's $\kappa$ might appear to agree with Fisher's exact test, but a simple case will show why the Fisher test is not appropriate for rater agreement.
Imagine a $2 \times 2$ matrix like
$\begin{matrix} 10 & 20 \\ 20 & 10\end{matrix}$.
It is clear that there is an association between the two variables on the off-diagonal, but that the raters do not agree more than chance; in other terms, the raters systematically disagree. From the matrix, we should expect the Fisher test to be significant while Cohen's $\kappa$ should not be. Carrying out the analysis confirms this expectation: Fisher's test gives $p = 0.01938$, while $\kappa = -0.333$ with $z = -4.743$ and $p = 0.999$.
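As a quick check, this can be reproduced in R with fisher.test() and, for instance, the kappa() helper defined in the other answer above (a sketch; any Cohen's kappa implementation would do):
# 2x2 example: strong off-diagonal association, systematic disagreement
M <- matrix(c(10, 20,
              20, 10), nrow = 2, byrow = TRUE)
fisher.test(M)$p.value  # ~0.019: significant association
kappa(M)                # -1/3: agreement below chance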
We can also construct another example where the two tests diverge, with the following matrix:
$\begin{matrix} 20 & 10 & 10 \\ 20 & 20 & 20 \\ 20 & 20 & 20 \end{matrix}$,
which gives $p = 0.4991$ for Fisher's test, and $\kappa = 0.0697$ with $z = 1.722$ and $p = 0.043$. So the raters agree slightly more than chance, but there is no relation between the categorical variables.
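The same sketch applies here, again assuming the kappa() helper from above:
# 3x3 example: slight above-chance agreement, no significant association
M <- matrix(c(20, 10, 10,
              20, 20, 20,
              20, 20, 20), nrow = 3, byrow = TRUE)
fisher.test(M)$p.value  # ~0.50: no significant association
kappa(M)                # ~0.07: raters agree slightly more than chance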
I don't have a more formal mathematical explanation of when they should or should not converge, though.
Finally, given the actual state of knowledge on Cohen's $\kappa$ in the methodological literature (see this for instance), you might want to avoid it as a measure of agreement. The coefficient has a lot of issues. Careful training of raters and strong agreement on each category (rather than the overall agreement) is, I believe, the way to go.