The only problem with applying Fisher's exact test to tables larger than 2x2 is that the calculations become much more difficult. The 2x2 version is the only one that is feasible by hand, so I doubt that Fisher ever imagined the test being applied to larger tables; the computations would have been beyond anything he could have envisaged.
Nevertheless, the test can be applied to any m×n table, and some software, including Stata and SPSS, provides the facility. Even so, the calculation is often approximated using a Monte Carlo approach.
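To make the Monte Carlo idea concrete, here is a minimal sketch in Python (the counts are made up, and `scipy.stats.random_table` requires SciPy >= 1.10): it samples tables with the observed margins and orders them by Pearson's statistic, which is one common choice of ordering for the simulated p-value.

```python
import numpy as np
from scipy import stats

# Observed m x n table (made-up counts for illustration)
observed = np.array([[8, 3, 5],
                     [2, 9, 4]])

def pearson_stat(table):
    """Pearson's chi-squared discrepancy between a table and its expected counts."""
    expected = stats.contingency.expected_freq(table)
    return ((table - expected) ** 2 / expected).sum()

# Sample tables with the same row and column margins as the observed table
sampler = stats.random_table(observed.sum(axis=1), observed.sum(axis=0), seed=0)
sims = sampler.rvs(10_000)

obs_stat = pearson_stat(observed)
sim_stats = np.array([pearson_stat(t) for t in sims])

# Monte Carlo p-value; the +1 terms include the observed table itself
p_mc = (np.sum(sim_stats >= obs_stat) + 1) / (len(sim_stats) + 1)
print(f"Monte Carlo p-value: {p_mc:.4f}")
```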
Yes, if the expected cell counts are small, it is better to use an exact test, as the chi-squared test is no longer a good approximation in such cases.
Pearson's $\chi^2$ test is useful for a sample of $n$ observations cross-classified by two variables, say $A$ and $B$. It tests the null hypothesis that $A$ and $B$ are independent. For example, if you crossed two strains of D. melanogaster (fruit flies) with different mutations and observed the $F_2$ generation frequencies in $n$ progeny, the $\chi^2$ test tests for linkage of the two traits (i.e., are they on different chromosomes [the null] or on the same chromosome [i.e., linked, the alternative]).
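As a concrete illustration (with invented $F_2$ counts), Pearson's test is a one-liner with `scipy.stats.chi2_contingency`:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical F2 progeny counts, cross-classified by the two mutant
# phenotypes (numbers invented for illustration)
table = np.array([[230, 82],
                  [ 75, 13]])

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4g}")
```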
McNemar's test is used for paired data -- that is, each observation represents a pair of values. For example, consider a set of $n$ lung cancer patients, each with a spouse. You record the smoking habits of the patients and their spouses and cross-classify. Pearson's test would appear to have $2\,n$ observations, but in this case you only have $n$. McNemar's test makes this correction. The hypotheses tested are similar: "Is cancer status related to smoking status?"
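A sketch of this setup (hypothetical counts, assuming statsmodels is available) using `statsmodels.stats.contingency_tables.mcnemar`:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical pairs: rows = patient smokes (yes/no),
# columns = spouse smokes (yes/no); each count is one patient-spouse pair
pairs = np.array([[45, 22],
                  [ 8, 25]])

# exact=True uses the binomial distribution of the discordant cells (22 and 8)
result = mcnemar(pairs, exact=True)
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")
```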
I suppose that one could think of this as a "between subjects" vs "within subjects" difference, and there is no doubt that things are similar. I don't see them that way, but I'll confess to not having thought about it much.
In regard to your Question 2, the restriction is on expected cell counts, not observed cell counts. Observed counts are reality, while expected cell counts represent a model. You can think of the restrictions as helping to ensure a decent approximation under the null hypothesis. Reality can (and should) diverge from the model when necessary, but if the model is approximately correct, it would be bad to have a situation where discrepancies get inflated in small cells.
Finally, an exact test is precisely what it says it is. The distribution of the test statistic under the null hypothesis is known exactly. Pearson's $\chi^2$, McNemar's test, and the log-likelihood $\chi^2$ are all based on asymptotic approximations to the distribution of the test statistic under the null hypothesis. Fisher's test, by comparison, notes that conditionally on the marginal totals, the distributions in the two cells of any row (or column) of the table follow a hypergeometric distribution. This insight permits computation of an exact observed significance level ($p$-value) for any given number of observations in the $(1, 1)$ cell.
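To see that conditioning at work, here is a sketch (made-up 2x2 counts) that runs SciPy's built-in exact test and then computes the probability of the observed table directly from the hypergeometric distribution:

```python
import numpy as np
from scipy.stats import fisher_exact, hypergeom

# Made-up 2x2 table
table = np.array([[7, 2],
                  [3, 8]])

# SciPy's built-in exact test
oddsratio, p = fisher_exact(table, alternative="two-sided")
print(f"Fisher's exact test: p = {p:.4f}")

# The conditioning made explicit: with all margins fixed, the (1, 1) count
# follows a hypergeometric distribution
total = table.sum()       # total observations
row1 = table[0].sum()     # first row total
col1 = table[:, 0].sum()  # first column total
x = table[0, 0]           # observed (1, 1) count
# hypergeom.pmf(k, M, n, N): k successes in N draws from a population of
# size M containing n success states
print(f"P(X = {x} | margins) = {hypergeom.pmf(x, total, row1, col1):.4f}")
```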
Fisher's exact test tests the same null hypothesis as Pearson's $\chi^2$ and can be used whenever Pearson's test is appropriate, as well as in other situations where Pearson's approximation is believed to be unreliable. Pearson's test also makes use of the information in the marginal totals, and so is also conditional on those totals. Knowing the margins a priori (or even one margin) is unnecessary.
Best Answer
This is a good question, but a big one. I don't think I can provide a complete answer, but I will throw out some food for thought.
First, under your top bullet point, the correction you are referring to is known as Yates' correction for continuity. The problem is that we calculate a discrete inferential statistic:
$$ \chi^2=\sum\frac{(O-E)^2}{E} $$
(It is discrete because, with only a finite number of instances represented in a contingency table, there are a finite number of possible realized values that this statistic can take on.) Notwithstanding this fact, it is compared to a continuous reference distribution (viz., the $\chi^2$ distribution with degrees of freedom $(r-1)(c-1)$). This necessarily leads to a mismatch on some level. With a particularly small data set, and if some cells have expected values less than 5, it is possible that the p-value could be too small. Yates' correction adjusts for this.
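In SciPy, for instance, Yates' correction is applied to 2x2 tables via the `correction` flag of `chi2_contingency`; a quick sketch with made-up counts shows the corrected p-value is more conservative:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Small made-up table with low expected counts
table = np.array([[ 5, 9],
                  [12, 4]])

for correction in (False, True):
    chi2, p, dof, _ = chi2_contingency(table, correction=correction)
    label = "with" if correction else "without"
    print(f"{label} Yates' correction: chi2 = {chi2:.3f}, p = {p:.4f}")
```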
Ironically, the same underlying problem (the discrete-continuous mismatch) can lead to p-values that are too high. Specifically, the p-value is conventionally defined as the probability of getting data as extreme as or more extreme than the observed data. With continuous data, it is understood that the probability of getting any exact value is vanishingly small, and thus we really have the probability of data that are more extreme. However, with discrete data there is a finite probability of getting data just like yours. Only calculating the probability of getting data more extreme than yours yields nominal p-values that are too low (leading to increased type I errors), but including the probability of getting data the same as yours leads to nominal p-values that are too high (which would lead to increased type II errors). These facts prompt the idea of the mid p-value. Under this approach, the p-value is the probability of data more extreme than yours plus half the probability of data just the same as yours.
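A one-sided mid p-value is easy to compute once the null distribution of the $(1,1)$ cell is in hand; here is a sketch with made-up margins, using the hypergeometric distribution from Fisher's conditioning:

```python
from scipy.stats import hypergeom

# One-sided mid p-value for the (1, 1) cell of a 2x2 table with fixed
# margins (made-up numbers); the two-sided case additionally needs a rule
# for ordering "extreme" tables
total, row1, col1, x = 20, 9, 10, 7

p_more_extreme = hypergeom.sf(x, total, row1, col1)   # P(X > x)
p_exactly_ours = hypergeom.pmf(x, total, row1, col1)  # P(X = x)

p_conventional = p_more_extreme + p_exactly_ours      # P(X >= x)
p_mid = p_more_extreme + 0.5 * p_exactly_ours
print(f"conventional p = {p_conventional:.4f}, mid p = {p_mid:.4f}")
```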
As you point out, there are many possibilities for testing contingency table data. The most comprehensive treatment of the pros and cons of the various approaches is here. That paper is specific to 2x2 tables, but you can still learn a lot about the options for contingency table data by reading it.
I also think it's worth considering models seriously. Older tests like chi-squared are quick, easy, and understood by many people, but they do not leave you with as comprehensive an understanding of your data as building an appropriate model does. If it is reasonable to think of the rows [columns] of your contingency table as a response variable and the columns [rows] as explanatory / predictor variables, a modeling approach follows quite readily. For instance, if you have just two rows, you can build a logistic regression model; if there are several columns, you could use reference cell coding (dummy coding) to build an ANOVA-type model. On the other hand, if you have more than two rows, multinomial logistic regression can be used in the same manner. Should your rows have an intrinsic order, ordinal logistic regression would yield superior performance to multinomial. The log-linear model (Poisson regression) is probably less relevant unless you have contingency tables with more than two dimensions, in my opinion.
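As a sketch of the two-row case (made-up counts, assuming statsmodels is installed), a binomial GLM fits the logistic regression directly from the table's column totals, using the grouped-data `successes + failures` formula syntax:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# A hypothetical 2x3 table laid out as one row per column (group),
# with the two row counts as successes and failures
df = pd.DataFrame({
    "group":     ["A", "B", "C"],
    "successes": [30, 45, 55],
    "failures":  [70, 55, 45],
})

# Binomial GLM (logistic regression); statsmodels accepts grouped binomial
# data via a "successes + failures" left-hand side
fit = smf.glm("successes + failures ~ C(group)", data=df,
              family=sm.families.Binomial()).fit()
print(fit.summary())
```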
For a comprehensive treatment of topics like these, the best sources are the books by Agresti: either his full-scale treatment (more rigorous), his intro book (easier but still comprehensive and very good), or possibly also his ordinal book.
Update: Just for the sake of completeness of the list of possible tests, it occurs to me that we can add the likelihood ratio test (often called the '$G^2\text{-test}$'). It is:
$$ G^2=2\sum O\cdot\ln\left(\frac{O}{E}\right) $$
This statistic is also asymptotically distributed as chi-squared and will almost always yield the same decision. The realized values of the two statistics will typically be similar, but slightly different. The question of which will be more powerful in a given situation is quite subtle. I gather it is the default choice by tradition in some fields. I do not necessarily advocate its use over the traditional test; I'm only listing it for completeness, as I say.
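For what it's worth, SciPy computes this statistic through the `lambda_` argument of `chi2_contingency` (the Cressie-Read power-divergence parameter), which makes it easy to compare the two tests on the same (made-up) table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up table, analyzed with both statistics
table = np.array([[25, 15],
                  [10, 30]])

chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)
g2, p_g2, _, _ = chi2_contingency(table, correction=False,
                                  lambda_="log-likelihood")
print(f"Pearson chi2 = {chi2:.3f} (p = {p_chi2:.4f})")
print(f"G2           = {g2:.3f} (p = {p_g2:.4f})")
```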