Nothing in Shannon's paper is incompatible with modern treatments, though much has since been cleaned up and streamlined* - C&T is perhaps one of the best books at this. Keep in mind, though, that Shannon wrote a paper, and papers are never as easy to read as a book if one is not used to them. That said, the paper is wonderful. Each time I read it I feel as if I've understood things better, and see them a little differently. Definitely pay close attention to the exposition in it whenever you do read it.
Be warned that the following is necessarily speculative, and a little unrelated to your direct question.
The reason that C&T don't go into why entropy is defined the way it is in Ch. 2 is philosophical. Usually (and this is the 'incentive' of Shannon that you mention), the justification for entropy is that there are a few natural properties one wants from a measure of information - key among them are continuity and that the 'information' of two independent sources is the sum of their individual 'information's. Once one posits these axioms, it is a simple theorem that entropy is the unique functional (up to scalar multiplication) that satisfies them.
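The additivity axiom is easy to check numerically from the definition $H(p) = -\sum_x p(x) \log_2 p(x)$. Here is a minimal sketch (the particular distributions are arbitrary illustrative choices, not from the text):

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a probability vector p."""
    return -sum(x * math.log2(x) for x in p if x > 0)

p_x = [0.5, 0.25, 0.25]   # distribution of X (arbitrary example)
p_y = [0.7, 0.3]          # distribution of Y (arbitrary example)

# Joint distribution of (X, Y) under independence: p(x, y) = p(x) * p(y)
p_xy = [px * py for px in p_x for py in p_y]

# Additivity for independent sources: H(X, Y) = H(X) + H(Y)
assert abs(entropy(p_xy) - (entropy(p_x) + entropy(p_y))) < 1e-12
```

Of course, a numerical check is not the axiomatic derivation itself (that is Problem 2.46 in C&T), but it shows concretely what the axiom asserts.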
However, there's a (large**) school in information theory that rejects the centrality of the above. It argues that the utility of any information measure lies in its operational consequences. (This, I think, arises from the fact that the origins - and practice! - of information theory are very much in engineering, not mathematics, and we engineers, even fairly mathematical ones, are ultimately interested in what one can do with wonderful maths, not content merely with how wonderful it is.***) According to this view, the basic reason we define entropy, and the reason it is such a natural object, is the Asymptotic Equipartition Property (and other nice properties). So you'll find that much of Chapter 2 is a (relatively) dry development of facts about various information measures (except perhaps the material on Fano's inequality, which is more directly applicable), and that the subject really comes alive in Chapter 3. I'd suggest reading that before you give up on the book - maybe even skip ahead and then go back to Ch. 2.
I'd argue that Cover and Thomas subscribe to the above view. See for instance, the concluding sentences of the introduction to Ch. 2:
In later chapters we show how these quantities arise as natural answers
to a number of questions in communication, statistics, complexity, and
gambling. That will be the ultimate test of the value of these definitions.
and the following from the bottom of page 14 (in the 2nd edition), a little after entropy is defined (following (2.3) ):
It is possible to derive the definition of entropy axiomatically by defining certain properties that the entropy of a random variable must satisfy. This approach is illustrated in Problem 2.46. We do not use the axiomatic approach to justify the definition of entropy; instead, we show that it arises as the answer to a number of natural questions, such as “What is the average length of the shortest description of the random variable?”
*: and some of the proofs have been called into question from time to time - unfairly, I think
**: I have little ability to make estimates about this, but in my experience this has been the dominant school.
***: The practice of this view is also something that Shannon exemplified. His 1948 paper, for instance, sets out to study communication, not to establish what a notion of information should be like - that's 'just' something he had to come up with on the way.
Best Answer
One good reason to call them bits is that this is the number of bits that you need on average to encode an outcome. Some Wikipedia articles you might want to take a look at are Huffman coding, arithmetic coding, entropy encoding and Shannon's source coding theorem.
To give a simple example, say outcome A has probability $1/2$ and outcomes B and C have probabilities $1/4$ each. Then you can encode A by $0$, B by $10$ and C by $11$. This is an optimal prefix-free code; the expected number of bits required to encode an outcome is $\frac12\cdot1+\frac14\cdot2+\frac14\cdot2=\frac32$, and since the length of each codeword is the self-information $\log_2(1/p)$ of the outcome it encodes, this expected number of bits is exactly the entropy of the distribution.
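A short sketch reproducing this example, checking that the expected codeword length of the prefix-free code above equals the entropy of the distribution:

```python
import math

# The code from the example: A (prob 1/2) -> 0, B and C (prob 1/4) -> 10, 11
code = {"A": "0", "B": "10", "C": "11"}
prob = {"A": 0.5, "B": 0.25, "C": 0.25}

expected_len = sum(prob[s] * len(code[s]) for s in code)
entropy = -sum(p * math.log2(p) for p in prob.values())

# Each codeword length equals the self-information log2(1/p) of its symbol,
# so the expected length matches the entropy exactly: 3/2 bits.
assert expected_len == entropy == 1.5
```

The match is exact here only because every probability is a power of $2$; in general, Huffman coding gets within one bit of the entropy per symbol, and arithmetic coding approaches it.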