First off, I would get yourself a copy of this classic and approachable article and read it: Anscombe FJ. (1973) Graphs in statistical analysis. The American Statistician. 27:17–21.
On to your questions:
Answer 1: Neither the dependent nor independent variable needs to be normally distributed. In fact they can have all kinds of loopy distributions. The normality assumption applies to the distribution of the errors ($Y_{i} - \widehat{Y}_{i}$).
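To see this concretely, here is a minimal sketch (Python, assuming numpy and scipy are available; all values are illustrative): the predictor is strongly skewed, and the response inherits that skewness, yet the errors -- the thing the assumption is actually about -- are normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Skewed (exponential) predictor -- nothing like a normal distribution.
x = rng.exponential(scale=2.0, size=500)

# True linear relationship with *normal* errors; Y inherits X's skewness.
y = 1.0 + 3.0 * x + rng.normal(scale=1.0, size=500)

# Fit OLS and inspect the residuals, which is where normality matters.
b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

print(f"skewness of X:         {stats.skew(x):.2f}")          # strongly skewed
print(f"skewness of Y:         {stats.skew(y):.2f}")          # strongly skewed
print(f"skewness of residuals: {stats.skew(residuals):.2f}")  # near 0
```

Both $X$ and $Y$ are heavily skewed here, but the residuals are approximately normal, and that is all the normality assumption asks for.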
Answer 2: You are actually asking about two separate assumptions of ordinary least squares (OLS) regression:
One is the assumption of linearity. This means that the trend in $\overline{Y}$ across $X$ is expressed by a straight line (Right? Straight back to algebra: $y = a +bx$, where $a$ is the $y$-intercept, and $b$ is the slope of the line.) A violation of this assumption simply means that the relationship is not well described by a straight line (e.g., $\overline{Y}$ is a sinusoidal function of $X$, or a quadratic function, or even a straight line that changes slope at some point). My own preferred two-step approach to address non-linearity is to (1) perform some kind of non-parametric smoothing regression to suggest specific nonlinear functional relationships between $Y$ and $X$ (e.g., using LOWESS, or GAMs, etc.), and (2) to specify a functional relationship using either a multiple regression that includes nonlinearities in $X$, (e.g., $Y \sim X + X^{2}$), or a nonlinear least squares regression model that includes nonlinearities in parameters of $X$ (e.g., $Y \sim X + \max{(X-\theta,0)}$, where $\theta$ represents the point where the regression line of $\overline{Y}$ on $X$ changes slope).
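Step (2) of that approach can be sketched as follows (Python with scipy; the data, the true change point, and the starting values are all made up for illustration -- step (1), the smoothing, would typically use something like statsmodels' `lowess` and is not shown here). The model is the change-of-slope form $Y \sim X + \max{(X-\theta,0)}$ with $\theta$ estimated as a parameter:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)

# Simulated data whose mean changes slope at theta = 4 (a "broken stick").
x = rng.uniform(0, 10, size=200)
y = 2.0 + 1.5 * x - 2.5 * np.maximum(x - 4.0, 0) + rng.normal(scale=0.5, size=200)

# Nonlinear least squares: Y ~ a + b*X + c*max(X - theta, 0),
# treating the change point theta as a parameter to estimate.
def broken_stick(x, a, b, c, theta):
    return a + b * x + c * np.maximum(x - theta, 0)

params, _ = curve_fit(broken_stick, x, y, p0=[0.0, 1.0, -1.0, 5.0])
a_hat, b_hat, c_hat, theta = params
print(f"estimated change point: {theta:.2f}")  # close to the true value 4
```

If instead the smoother had suggested a smooth curvature, the multiple-regression route ($Y \sim X + X^{2}$) is just an ordinary linear fit with an added squared term.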
Another is the assumption of normally distributed residuals. Sometimes one can validly get away with non-normal residuals in an OLS context; see for example, Lumley T, Emerson S. (2002) The Importance of the Normality Assumption in Large Public Health Data Sets. Annual Review of Public Health. 23:151–69. Sometimes, one cannot (again, see the Anscombe article).
However, I would recommend thinking about the assumptions in OLS not so much as desired properties of your data, but rather as interesting points of departure for describing nature. After all, most of what we care about in the world is more interesting than $y$-intercept and slope. Creatively violating OLS assumptions (with the appropriate methods) allows us to ask and answer more interesting questions.
The assumptions matter insofar as they affect the properties of the hypothesis tests (and intervals) you might use, whose distributional properties under the null are derived by relying on those assumptions.
In particular, for hypothesis tests, the things we might care about are how far the true significance level might be from what we want it to be, and whether power against alternatives of interest is good.
In relation to the assumptions you ask about:
1. Equality of variance
The variance of your dependent variable (residuals) should be equal in each cell of the design
This can certainly impact the significance level, at least when sample sizes are unequal.
(Edit:) An ANOVA F-statistic is the ratio of two estimates of variance (the partitioning and comparison of variances is why it's called analysis of variance). The denominator is an estimate of the supposedly-common-to-all-cells error variance (calculated from residuals), while the numerator, based on variation in the group means, will have two components: one from variation in the population means and one due to the error variance. If the null is true, the two variances being estimated will be the same (two estimates of the common error variance); this common but unknown value cancels out (because we took a ratio), leaving an F-statistic that depends only on the distributions of the errors (which, under the assumptions, we can show has an F distribution). (Similar comments apply to the t-test used for illustration below.)
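That partitioning is easy to verify by hand (a sketch in Python, assuming numpy and scipy; the three groups and their means are made up for illustration) -- the between-groups and within-groups variance estimates computed directly reproduce scipy's F-statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Three groups sharing a common error variance.
groups = [rng.normal(loc=m, scale=2.0, size=20) for m in (0.0, 0.5, 1.0)]

k = len(groups)                                  # number of groups
n = sum(len(g) for g in groups)                  # total sample size
grand_mean = np.mean(np.concatenate(groups))

# Numerator: variance estimate based on variation among the group means.
ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)

# Denominator: pooled estimate of the common error variance, from residuals.
ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)

F = ms_between / ms_within
F_scipy, p = stats.f_oneway(*groups)
print(f"F by hand: {F:.4f}, F from scipy: {F_scipy:.4f}")  # identical
```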
[There's a little bit more detail on some of that information in my answer here]
However, here the two population variances differ across the two differently-sized samples. Consider the denominator (of the F-statistic in ANOVA and of the t-statistic in a t-test) - it is composed of two different variance estimates, not one, so it will not have the "right" distribution (a scaled chi-square for the F and its square root in the case of a t - both the shape and the scale are issues).
As a result, the F-statistic or the t-statistic will no longer have the F- or t-distribution, but the manner in which it is affected is different depending on whether the large or the smaller sample was drawn from the population with the larger variance. This in turn affects the distribution of p-values.
Under the null (i.e. when the population means are equal), the p-values should be uniformly distributed. However, if the variances and the sample sizes are unequal but the means are equal (so we don't want to reject the null), the p-values are not uniformly distributed. I did a small simulation to show you what happens. In this case, I used only 2 groups, so the ANOVA is equivalent to a two-sample t-test with the equal-variance assumption. I simulated samples from two normal distributions, one with standard deviation ten times as large as the other, but with equal means.

For the left-side plot, the larger (population) standard deviation went with n=5 and the smaller standard deviation went with n=30. For the right-side plot, the larger standard deviation went with n=30 and the smaller with n=5. I simulated each case 10000 times and found the p-value each time. In each case you want the histogram to be completely flat (rectangular), since this means all tests conducted at some significance level $\alpha$ will actually get that type I error rate. In particular, it's most important that the leftmost parts of the histogram stay close to the grey line:
As we see, in the left-side plot (larger variance in the smaller sample) the p-values tend to be very small -- we would reject the null hypothesis very often (nearly half the time in this example) even though the null is true. That is, our significance level is much larger than we asked for. In the right-side plot we see the p-values are mostly large (and so our significance level is much smaller than we asked for) -- in fact not once in ten thousand simulations did we reject at the 5% level (the smallest p-value here was 0.055). [This may not sound like such a bad thing, until we remember that we will also have very low power to go with our very low significance level.]
That's quite a consequence. This is why it's a good idea to use a Welch-Satterthwaite type t-test or ANOVA when we don't have a good reason to assume the variances will be close to equal -- by comparison, the Welch test is barely affected in these situations (I simulated this case as well; the two distributions of simulated p-values - which I have not shown here - came out quite close to flat).
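The simulation above can be sketched like this (Python with scipy; not the original code, so the exact counts will differ, but the pattern is the same). It covers the bad left-side case -- larger standard deviation in the smaller sample -- and shows the Welch version alongside:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims = 10_000
alpha = 0.05

# Equal means, unequal variances: sd 10 in the *smaller* sample
# (the bad case from the left-side plot).
reject_pooled = reject_welch = 0
for _ in range(n_sims):
    small = rng.normal(0, 10, size=5)
    large = rng.normal(0, 1, size=30)
    _, p_pooled = stats.ttest_ind(small, large, equal_var=True)   # pooled t
    _, p_welch = stats.ttest_ind(small, large, equal_var=False)   # Welch t
    reject_pooled += p_pooled < alpha
    reject_welch += p_welch < alpha

print(f"pooled t-test type I error: {reject_pooled / n_sims:.3f}")  # far above 0.05
print(f"Welch t-test type I error:  {reject_welch / n_sims:.3f}")   # close to 0.05
```

Swapping the sample sizes (sd 10 with n=30) reproduces the right-side plot's conservative behaviour instead.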
2. Conditional distribution of the response (DV)
Your dependent variable (residuals) should be approximately normally distributed for each cell of the design
This is somewhat less directly critical - for moderate deviations from normality, the significance level is not much affected in larger samples (though the power can be!).
Here's one example, where the values are exponentially distributed (with identical distributions and sample sizes), where we can see this significance level issue being substantial at small $n$ but reducing with large $n$.
We see that at n=5 there are substantially too few small p-values (the significance level for a 5% test would be about half what it should be), but
at n=50 the problem is reduced -- for a 5% test in this case the true significance level is about 4.5%.
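A two-sample version of that simulation can be sketched as follows (Python with scipy; again not the original code, and the rate parameter and sample sizes are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_sims = 10_000
results = {}

# Two groups with *identical* exponential distributions (the null is true),
# at a small and a moderately large per-group sample size.
for n in (5, 50):
    rejections = 0
    for _ in range(n_sims):
        a = rng.exponential(1.0, size=n)
        b = rng.exponential(1.0, size=n)
        rejections += stats.ttest_ind(a, b).pvalue < 0.05
    results[n] = rejections / n_sims
    print(f"n={n}: true significance level of a nominal 5% test ~ {results[n]:.3f}")
```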
So we might be tempted to say "well, that's fine, if n is big enough to get the significance level to be pretty close", but we may also be throwing away a good deal of power. In particular, it's known that the asymptotic relative efficiency of the t-test relative to widely used alternatives can go to 0. This means that better test choices can get the same power with a vanishingly small fraction of the sample size the t-test would require. You don't need anything out of the ordinary to be going on to need more than, say, twice as much data to have the same power with the t as you would need with an alternative test - moderately heavier-than-normal tails in the population distribution and moderately large samples can be enough to do it.
(Other choices of distribution may make the significance level higher than it should be, or substantially lower than we saw here.)
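The power loss is easy to demonstrate by simulation (a sketch in Python with scipy, using the Mann-Whitney/Wilcoxon rank-sum test as one widely used alternative; the heavy-tailed distribution, shift, and sample size are all illustrative choices, not from the text above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_sims = 5_000
n = 30
shift = 1.0  # true location difference between the two groups

power_t = power_w = 0
for _ in range(n_sims):
    # Heavy-tailed populations: t-distribution with 3 degrees of freedom.
    a = rng.standard_t(3, size=n)
    b = rng.standard_t(3, size=n) + shift
    power_t += stats.ttest_ind(a, b).pvalue < 0.05
    power_w += stats.mannwhitneyu(a, b, alternative="two-sided").pvalue < 0.05

print(f"t-test power:       {power_t / n_sims:.3f}")
print(f"Mann-Whitney power: {power_w / n_sims:.3f}")  # noticeably higher
```

With heavier tails than these (or contaminated distributions), the gap widens further, which is the asymptotic-relative-efficiency point above.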
Best Answer
Here is an example where the variance of $\varepsilon$ is not constant (the variances of the residuals are larger for larger $x$):
and an example where $\varepsilon$ is not normally distributed (and so the residuals are not normally distributed):
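Since the original figures are not shown here, a sketch of how both kinds of violation look in the residuals (Python, assuming numpy and scipy; the particular models are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(1, 10, size=300))

# Example 1: non-constant variance -- the error spread grows with x.
y1 = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)
b1, a1 = np.polyfit(x, y1, 1)
res1 = y1 - (a1 + b1 * x)
lower, upper = res1[x < 5.5], res1[x >= 5.5]
print(f"residual sd for small x: {lower.std():.2f}")
print(f"residual sd for large x: {upper.std():.2f}")  # much larger

# Example 2: non-normal errors -- centered exponential noise, so the
# residuals are strongly right-skewed rather than normal.
y2 = 2.0 + 0.5 * x + (rng.exponential(1.0, size=300) - 1.0)
b2, a2 = np.polyfit(x, y2, 1)
res2 = y2 - (a2 + b2 * x)
print(f"skewness of residuals: {stats.skew(res2):.2f}")  # far from 0
```

A residuals-vs-fitted plot of `res1` would show the classic fan shape; a Q-Q plot of `res2` would bend away from the reference line.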