There is considerable overlap among these, but some distinctions can be made. Of necessity, I will have to over-simplify some things or give short shrift to others, but I will do my best to give some sense of these areas.
Firstly, Artificial Intelligence is fairly distinct from the rest. AI is the study of how to create intelligent agents. In practice, it is how to program a computer to behave and perform a task as an intelligent agent (say, a person) would. This does not have to involve learning or induction at all; it can just be a way to 'build a better mousetrap'. For example, AI applications have included programs to monitor and control ongoing processes (e.g., increase aspect A if it seems too low). Notice that AI can include darn near anything that a machine does, so long as it doesn't do it 'stupidly'.
In practice, however, most tasks that require intelligence require an ability to induce new knowledge from experience. Thus, a large area within AI is machine learning. A computer program is said to learn some task from experience if its performance at the task improves with experience, according to some performance measure. Machine learning involves the study of algorithms that can extract information automatically (i.e., without on-line human guidance). It is certainly the case that some of these procedures include ideas derived directly from, or inspired by, classical statistics, but they don't have to. Similarly to AI, machine learning is very broad and can include almost everything, so long as there is some inductive component to it. An example of a machine learning algorithm might be a Kalman filter.
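To make the 'improves with experience' idea concrete, here is a minimal sketch (Python with NumPy; the true value and noise level are made up purely for illustration) of a one-dimensional Kalman filter tracking a constant quantity. Each new noisy measurement shrinks the estimate's uncertainty and nudges the estimate toward the truth, so performance improves as experience accumulates.

```python
import numpy as np

# Hypothetical setup: a fixed true value observed through Gaussian noise.
rng = np.random.default_rng(0)
true_value = 5.0
measurements = true_value + rng.normal(scale=1.0, size=50)

estimate, error_var = 0.0, 1e6   # initial guess and its (huge) uncertainty
measurement_var = 1.0            # assumed variance of the observation noise

for z in measurements:
    # Kalman gain: how much to trust the new measurement vs. the prior estimate.
    gain = error_var / (error_var + measurement_var)
    # Update the estimate and its uncertainty with the new observation.
    estimate += gain * (z - estimate)
    error_var *= (1.0 - gain)

print(f"estimate after {len(measurements)} measurements: {estimate:.3f}")
```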
Data mining is an area that has taken much of its inspiration and techniques from machine learning (and some, also, from statistics), but is put to different ends. Data mining is carried out by a person, in a specific situation, on a particular data set, with a goal in mind. Typically, this person wants to leverage the power of the various pattern recognition techniques that have been developed in machine learning. Quite often, the data set is massive, complicated, and/or may have special problems (such as having more variables than observations). Usually, the goal is either to discover / generate some preliminary insights in an area where there really was little knowledge beforehand, or to be able to predict future observations accurately. Moreover, data mining procedures could be either 'unsupervised' (we don't know the answer--discovery) or 'supervised' (we know the answer--prediction). Note that the goal is generally not to develop a more sophisticated understanding of the underlying data generating process. Common data mining techniques would include cluster analyses, classification and regression trees, and neural networks.
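As a toy illustration of the 'unsupervised / discovery' end of that spectrum, the sketch below (Python with scikit-learn; the two-blob data set is synthetic and hypothetical) runs a k-means cluster analysis and simply reports which group each observation was assigned to, with no model of the underlying data generating process.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, hypothetical data: two blurry groups in two dimensions.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
group_b = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(100, 2))
X = np.vstack([group_a, group_b])

# 'Discovery': ask for two clusters without telling the algorithm which
# points belong together, then inspect what it found.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```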
I suppose I needn't say much to explain what statistics is on this site, but perhaps I can say a few things. Classical statistics (here I mean both frequentist and Bayesian) is a sub-topic within mathematics. I think of it as largely the intersection of what we know about probability and what we know about optimization. Although mathematical statistics can be studied as simply a Platonic object of inquiry, it is mostly understood as more practical and applied in character than other, more rarefied areas of mathematics. As such (and notably in contrast to data mining above), it is mostly employed towards better understanding some particular data generating process. Thus, it usually starts with a formally specified model, and from this are derived procedures to accurately extract that model from noisy instances (i.e., estimation--by optimizing some loss function) and to be able to distinguish it from other possibilities (i.e., inferences based on known properties of sampling distributions). The prototypical statistical technique is regression.
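A minimal sketch of that prototype (Python with statsmodels; the linear data-generating process below is invented purely for illustration): specify a model, estimate its coefficients by optimising a squared-error loss, and read off inferential quantities such as p-values based on the estimator's sampling distribution.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data-generating process: y = 2 + 3*x + noise.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=1.5, size=200)

# Estimation: ordinary least squares, i.e. optimising a squared-error loss.
model = sm.OLS(y, sm.add_constant(x)).fit()

# Inference: coefficient estimates with p-values derived from the
# sampling distribution of the estimator.
print(model.params)    # estimated intercept and slope
print(model.pvalues)   # p-values for each coefficient
```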
Basically, Decision Trees are pure classification techniques. These techniques aim at labelling records of unknown class by making use of their features. They map the set of record features $\mathcal{F} = \{F_1, \dots, F_m\}$ (attributes, variables) into the class attribute $C$ (target variable), the object of the classification. The relationship between $\mathcal{F}$ and $C$ is learned using a set of labelled records, defined as the training set. The ultimate purpose of classification models is to minimise the mis-classification error on unlabelled records, i.e., the cases where the class predicted by the model differs from the real one. The features $F_i$ can be categorical or continuous.
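A small sketch of that workflow (Python with scikit-learn and its bundled iris data; the particular split and tree depth are arbitrary choices for illustration): learn the mapping from features to class on a labelled training set, then measure the mis-classification error on records held out of training.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labelled records: four features and a class attribute C (the species).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Learn the F -> C relationship from the training set.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Mis-classification error on records the tree has not seen.
error = 1.0 - tree.score(X_test, y_test)
print(f"mis-classification error: {error:.3f}")
```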
The first applications of association analysis were in market basket analysis; in these applications you are interested in finding associations between items, with no particular focus on a target one. The datasets commonly used are transactional ones: a collection of transactions, where each of them contains a set of items. For example:
$$
\begin{aligned}
t_1 &= \{i_1, i_2\} \\
t_2 &= \{i_1, i_3, i_4, i_5\} \\
t_3 &= \{i_2, i_3, i_4, i_5\} \\
&\;\;\vdots \\
t_n &= \{i_2, i_3, i_4, i_5\}
\end{aligned}
$$
You are interested in finding out rules such as
$$ \{ i_3, i_5 \} \rightarrow \{ i_4 \} $$
It turns out that you can use association analysis for some specific classification tasks, for example when all your features are categorical. You just have to treat the items as features, but this is not what association analysis was born for.
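To make the rule above concrete, here is a small sketch in plain Python (the four transactions are just the ones listed in the toy example above; the elided rows are simply left out) that scores $\{ i_3, i_5 \} \rightarrow \{ i_4 \}$ by its support and confidence, the two measures association-rule miners typically threshold on.

```python
# Toy transactions mirroring the example above (the "..." rows are omitted).
transactions = [
    {"i1", "i2"},
    {"i1", "i3", "i4", "i5"},
    {"i2", "i3", "i4", "i5"},
    {"i2", "i3", "i4", "i5"},
]

antecedent, consequent = {"i3", "i5"}, {"i4"}

n_antecedent = sum(antecedent <= t for t in transactions)
n_rule = sum((antecedent | consequent) <= t for t in transactions)

# Support: fraction of all transactions containing both sides of the rule.
support = n_rule / len(transactions)
# Confidence: how often the consequent appears given the antecedent.
confidence = n_rule / n_antecedent

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```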
Best Answer
Jerome Friedman wrote a paper a while back: Data Mining and Statistics: What's the Connection?, which I think you'll find interesting.
Data mining was a largely commercial concern and driven by business needs (coupled with the "need" for vendors to sell software and hardware systems to businesses). One thing Friedman noted was that all the "features" being hyped originated outside of statistics -- from algorithms and methods like neural nets to GUI-driven data analysis -- and none of the traditional statistical offerings seemed to be a part of any of these systems (regression, hypothesis testing, etc.). "Our core methodology has largely been ignored." It was also sold as user-driven along the lines of what you noted: here's my data, here's my "business question", give me an answer.
I think Friedman was trying to provoke. He didn't think data mining had serious intellectual underpinnings where methodology was concerned, but that this would change and statisticians ought to play a part rather than ignoring it.
My own impression is that this has more or less happened. The lines have been blurred. Statisticians now publish in data mining journals. Data miners these days seem to have some sort of statistical training. While data mining packages still don't hype generalized linear models, logistic regression is well known among the analysts -- in addition to clustering and neural nets. Optimal experimental design may not be part of the data mining core, but the software can be coaxed to spit out p-values. Progress!