A $(p,q)$ tensor field $T$ on a smooth manifold $M$ is defined as a multilinear map:
$$T : \underbrace{\Omega(M)\times\dots\times\Omega(M)}_{p\ times} \times\underbrace{\mathfrak{X}(M)\times\dots\times\mathfrak{X}(M)}_{q\ times} \to C^{\infty}(M)$$ where $\Omega(M)$ is the set of all covector fields on $M$ and $\mathfrak{X}(M)$ is the set of all vector fields on $M$.
Unpacking the above, $T$ sends $p$ covector fields $H_1,\ldots,H_p$ and $q$ vector fields $X_1,\ldots,X_q$ to some $C^{\infty}$ function $f$ on $M$; i.e., $T(H_1,\ldots,H_p,X_1,\ldots,X_q)$ is a $C^{\infty}$ function on $M$. Pointwise, its effect at a point $x\in M$ is
$$T_x(H_1(x),\ldots,H_p(x),X_1(x),\ldots,X_q(x))=f(x)$$
where $H_i(x)$ and $X_i(x)$ are respectively covectors and vectors in cotangent and tangent spaces at $x$. $T_x$ is defined locally at $x$ as:
$$T_x:\underbrace{T^*_xM\times\dots\times T^*_xM}_{p\ times} \times\underbrace{T_xM\times\dots\times T_xM}_{q\ times} \to \mathbb{R}$$
Thus you can see how the tensor field $T$ selects for each point $x$ a locally defined tensor acting on cotangent and tangent spaces at $x$.
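A minimal numerical sketch of this pointwise picture (not from the answer; the manifold $M=\mathbb{R}^2$, the component matrix `T_at`, and the two vector fields are all hypothetical choices, and the $(0,2)$ case stands in for general $(p,q)$):

```python
import numpy as np

# A toy (0,2) tensor field on M = R^2: at each point x it selects a
# bilinear map T_x on the tangent space, represented here by a matrix.
def T_at(x):
    # position-dependent component matrix (hypothetical example field)
    return np.array([[1.0 + x[0]**2, 0.0],
                     [0.0,           1.0 + x[1]**2]])

# Two vector fields X, Y (chosen constant for simplicity)
X = lambda x: np.array([1.0, 0.0])
Y = lambda x: np.array([0.0, 1.0])

# f(x) = T_x(X(x), Y(x)) is a smooth real-valued function of the point x
f = lambda x: X(x) @ T_at(x) @ Y(x)

print(f(np.array([1.0, 2.0])))  # 0.0 for this diagonal field
```

Feeding in vector fields and reading off a number point by point is exactly how the global map $T$ reduces to the local maps $T_x$.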
To address what you wrote in the question: "a tensor is defined as a linear multilinear map on a set of vector spaces and/or dual vector spaces to a field..." - you hopefully see why this isn't accurate. Similarly, "tensor field is defined as a linear multilinear map on a set of tangent vector spaces and/or dual tangent vector spaces to a field" isn't accurate either.
I'll first try to explain, in a bit more depth than Spivak, what tensors are and why they are cool, and then tell you why the inertia tensor is one.
Let's consider a vector space $V$ over, say, the real numbers $\mathbb{R}$. Before we talk about tensors you need to be familiar with the dual space of $V$, denoted $V^*$. $V^*$ is the space of all linear functions that assign numbers to vectors, so for $f \in V^*$ we have $f: V \rightarrow \mathbb{R}$. For a given basis $\{e_1, \dots, e_n \}$ of $V$ we get a basis of $V^*$, often denoted $\{e_1^*, \dots, e_n^* \}$; the elements of that basis are uniquely determined by the property $e_i^*(e_j)=\delta_{ij}$.
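Here is a quick numerical check of the defining property $e_i^*(e_j)=\delta_{ij}$ (a sketch, not from the answer; the basis matrix `B` is an arbitrary hypothetical choice): if the columns of a matrix $B$ are a basis of $\mathbb{R}^n$, the rows of $B^{-1}$ are exactly the dual basis functionals.

```python
import numpy as np

# Columns of B are a (hypothetical) basis e_1, e_2, e_3 of R^3
B = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# The dual basis element e_i^* acts on v by v -> (B^{-1} v)_i,
# i.e. the rows of B^{-1} are the dual basis covectors.
B_inv = np.linalg.inv(B)

# Check the defining property e_i^*(e_j) = delta_ij
print(np.allclose(B_inv @ B, np.eye(3)))  # True
```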
Now we can consider the space of multilinear maps from some copies of $V$ and $V^*$ to the reals, i.e.:
\begin{equation}
V^* \times \dots \times V^* \times V \times \dots \times V \rightarrow \mathbb{R}
\end{equation}
But multilinear maps are lame; we understand linear maps much better. Instead of having a multilinear map that takes many different vectors and dual vectors as input, we would like to have a linear map that takes one massive vector as input, and this one vector should encompass all the information of the other vectors combined. Where would this giant vector come from? From the tensor product space! In mathematics the tensor product space is even defined as the space that is so large that a single element in it can replace all the different vectors in the input of a multilinear map. This space is denoted:
\begin{equation}
V^* \otimes \dots \otimes V^* \otimes V \otimes \dots \otimes V
\end{equation}
And a multilinear map can thus be rewritten as a linear map:
\begin{equation}
V^* \otimes \dots \otimes V^* \otimes V \otimes \dots \otimes V \rightarrow \mathbb{R}
\end{equation}
But notice: we now have a linear map from a vector space (the tensor product space) to the real numbers. Such maps are elements of the dual space, so the space of these maps is actually nothing but:
\begin{equation}
(V^* \otimes \dots \otimes V^* \otimes V \otimes \dots \otimes V)^*
\end{equation}
Now we can use that the "$^*$" distributes over the tensor product and that $V^{**} \cong V$ (for finite-dimensional $V$), so the space above can actually be written as:
\begin{equation}
V \otimes \dots \otimes V \otimes V^* \otimes \dots \otimes V^*
\end{equation}
Elements of this space are called tensors of rank $(r,s)$, where $r$ is the number of copies of $V$ and $s$ is the number of copies of $V^*$.
Almost all objects in linear algebra can be described that way; for example, linear maps are tensors of rank $(1,1)$, i.e. elements of $V \otimes V^*$. Why is that?
Well, I can prove it to you; we only need to define how an object in $V \otimes V^*$ acts on vectors. Let's take a $v \in V$ and an $x \otimes y^* \in V \otimes V^*$, and define $(x \otimes y^*)(v):=y^*(v)\,x$. Why is this a linear map? Because for $v,w \in V$: $(x \otimes y^*)(v+w)=y^*(v+w)\,x=(y^*(v)+y^*(w))\,x=y^*(v)\,x+y^*(w)\,x=(x \otimes y^*)(v)+(x \otimes y^*)(w)$, where we used that $y^*$ is a linear map.
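This definition is easy to verify in coordinates (a numerical sketch with hypothetical vectors): representing the covector $y^*$ by its coefficient vector, the linear map $v \mapsto y^*(v)\,x$ has matrix $x\,y^T$, i.e. the outer product.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # a vector in V
y = np.array([0.5, -1.0, 2.0])  # coefficients of the covector y^* in the dual basis
v = np.array([4.0, 0.0, 1.0])

# (x (x) y^*)(v) := y^*(v) x
direct = y.dot(v) * x

# The matrix of this linear map is the outer product x y^T
M = np.outer(x, y)
print(np.allclose(M @ v, direct))  # True
```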
Remember the bases $\{e_1, \dots, e_n \}$ and $\{e_1^*, \dots, e_n^* \}$? It turns out that $\{e_1 \otimes e_1^*, e_1 \otimes e_2^*, \dots, e_2 \otimes e_1^*, \dots, e_n\otimes e_n^* \}$ is a basis of $V \otimes V^*$; general elements of $V \otimes V^*$ are linear combinations, i.e.:
$$
\sum\limits_{i,j} A_{ij}e_i \otimes e_j^*
$$
where the $A_{ij}$ are the coefficients of the linear combination, i.e. just numbers. In physics, people often write only the coefficients and ignore the basis vectors, putting the $j$ as a subscript, indicating that it belongs to a dual vector, and the $i$ as a superscript, indicating that it belongs to a "normal" vector:
$$
\sum\limits_{i,j} A^{i}_{\ j}
$$
And then they don't even write the sum, calling this the Einstein summation convention:
$$
A^{i}_{\ j}
$$
and call the object above a tensor of rank (1,1).
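The summation convention can be executed literally with `np.einsum` (a sketch with hypothetical components): the repeated index $j$ in $A^{i}{}_{j}v^{j}$ is summed over, which is exactly a matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))  # components A^i_j of a (1,1) tensor
v = rng.standard_normal(3)       # components v^j of a vector

# Einstein convention: w^i = A^i_j v^j (sum over the repeated index j)
w = np.einsum('ij,j->i', A, v)

print(np.allclose(w, A @ v))  # True: same as the usual matrix-vector product
```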
So why is the inertia tensor a tensor? Because it is a linear map, i.e. an element of $V \otimes V^*$, i.e. a rank $(1,1)$ tensor. It assigns to the vector that describes the object's rotation, the angular velocity $\omega$, the vector that describes its angular momentum, $L$. But it can also be used in a different way, as a $(0,2)$ tensor, i.e. an element of $V^* \otimes V^*$: if you use it to calculate an object's rotational energy you feed it two vectors. This operation is often written as $\omega^T I \omega$ (the two vectors just happen to be the same), and out comes a number, the rotational energy (up to a factor of $\frac{1}{2}$).
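Both readings of the inertia tensor can be seen in a small numerical sketch (the masses, positions, and $\omega$ below are hypothetical; the point-mass formula $I=\sum_k m_k(|r_k|^2\,\mathrm{Id}-r_k r_k^T)$ is standard):

```python
import numpy as np

# Inertia tensor of a (hypothetical) rigid body of point masses:
# I = sum_k m_k (|r_k|^2 * Id - r_k r_k^T)
masses = np.array([1.0, 2.0, 1.5])
positions = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 1.0],
                      [1.0, 1.0, 0.0]])

I = sum(m * (r @ r * np.eye(3) - np.outer(r, r))
        for m, r in zip(masses, positions))

omega = np.array([0.0, 0.0, 2.0])  # angular velocity

# As a (1,1) tensor: one vector in, one vector out
L = I @ omega                 # angular momentum

# As a (0,2) tensor: two vectors in, one number out
E = 0.5 * omega @ I @ omega   # rotational kinetic energy

print(L, E)
```

The same array `I` is used both times; only the way it is fed vectors (and, in index notation, the placement of its indices) changes.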
The definition about indexes transforming in a certain way is very much about the tensors built from vectors in the tangent space of a manifold. Not all vectors need to be defined in such a way.
Let there be some vector space $\mathcal{U}$ with basis vectors $\mathbf{u}_{i=1\dots n}$ and another vector space $\mathcal{V}$ with basis $\mathbf{v}_{j=1\dots m}$. Consider the space of all bilinear functionals $\mathcal{L}:\mathcal{U}\times\mathcal{V}\to\mathbb{R}$ that map a pair of vectors $\left(a^i\mathbf{u}_i,\,b^j\mathbf{v}_j\right)$ to real numbers. This space of maps can be spanned by the following functionals:
$$ \mathbf{l}^{ij}\left(\mathbf{u}_p,\,\mathbf{v}_r\right)=\begin{cases} 1, & i=p \text{ and } j=r \\ 0, & \text{otherwise}\end{cases} $$
Then every functional can be represented as $\omega_{ij}\mathbf{l}^{ij}$ and the application of the functional onto the pair of vectors will lead to:
$$ \left(\omega_{ij}\mathbf{l}^{ij}\right)\left(a^p\mathbf{u}_p,\,b^r\mathbf{v}_r\right)=\omega_{ij}a^i b^j $$
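This evaluation rule is easy to check numerically (a sketch with hypothetical components $\omega_{ij}$, $a^i$, $b^j$): $\omega_{ij}a^i b^j$ is just the matrix sandwich $a^T \Omega\, b$.

```python
import numpy as np

rng = np.random.default_rng(2)
omega = rng.standard_normal((2, 3))  # components w_ij of a functional on U x V
a = rng.standard_normal(2)           # components a^i  (dim U = 2)
b = rng.standard_normal(3)           # components b^j  (dim V = 3)

# (w_ij l^ij)(a^p u_p, b^r v_r) = w_ij a^i b^j, summing over both indices
value = np.einsum('ij,i,j->', omega, a, b)

print(np.isclose(value, a @ omega @ b))  # True
```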
There is clearly a natural (bilinear) map from the Cartesian product of the two vector spaces into the vector space of bilinear functionals (I will skip the proof that the latter is a vector space). Call it:
$$ \phi:\mathcal{U}\times\mathcal{V}\to\mathcal{L} $$
And define it as $\phi\left(a^p\mathbf{u}_p,\,b^r\mathbf{v}_r\right)=\sum_{p,r}a^p b^r \mathbf{l}^{pr}$ (yes this breaks the upstairs-downstairs convention, but this is temporary).
Next, since $\mathcal{L}$ is a vector space we can consider a vector space dual to it. Let this vector space $\mathcal{T}$ be spanned by basis $\mathbf{t}_{ij}$. By definition of the dual space:
$$ \left(w^{ij}\mathbf{t}_{ij}\right)\left(\omega_{pr}\mathbf{l}^{pr}\right)=w^{ij}\omega_{ij} $$
We can define another isomorphism: $\psi:\mathcal{L}\to\mathcal{T}$, where $\psi\left(\omega_{pr}\mathbf{l}^{pr}\right)=\sum_{pr}\omega_{pr}\mathbf{t}_{pr}$.
Finally, define the tensor product as:
$\otimes=\psi\circ\phi:\,\mathcal{U}\times\mathcal{V}\to\mathcal{L}\to\mathcal{T}$. In particular, by definition, the basis for $\mathcal{T}$ can be denoted by: $\psi\circ \phi\left(\mathbf{u}_i,\,\mathbf{v}_j\right)=\mathbf{u}_i\otimes\mathbf{v}_j$
It readily follows that $\psi\circ \phi\left(a^i\mathbf{u}_i,\,b^j\mathbf{v}_j\right)=a^i b^j\mathbf{u}_i\otimes\mathbf{v}_j$.
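In coordinates this is the Kronecker product (a sketch with hypothetical coefficient vectors): identifying $\mathbf{u}_i\otimes\mathbf{v}_j$ with the standard basis of $\mathbb{R}^{nm}$, the element $a^i b^j\,\mathbf{u}_i\otimes\mathbf{v}_j$ becomes `np.kron(a, b)`.

```python
import numpy as np

a = np.array([1.0, 2.0])        # coefficients a^i in U  (n = 2)
b = np.array([3.0, 0.0, -1.0])  # coefficients b^j in V  (m = 3)

# In coordinates, u_i (x) v_j becomes the standard basis of R^(n*m),
# and a (x) b = a^i b^j u_i (x) v_j becomes the Kronecker product:
t = np.kron(a, b)

# Its entries are exactly the products a^i b^j
print(np.allclose(t.reshape(2, 3), np.outer(a, b)))  # True
```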
The tensor product is not commutative because of the functional space you used in-between ($\mathcal{L}$). It was defined specifically for the pair $\mathcal{U}\times\mathcal{V}$ and not the other way round.
Note that the above procedure can be repeated and combined. For example, you can consider bilinear functionals on $\mathcal{U}\times\mathcal{V}^*$ and create a tensor with upstairs-downstairs indices. You can also chain tensor products together, i.e. $\mathcal{U}$ could itself be a tensor product space.
The difference between a Cartesian product $\mathcal{U}\times\mathcal{V}$ and tensor product $\mathcal{U}\otimes\mathcal{V}$ is that the latter is a vector space itself. In particular, you can meaningfully add members of $\mathcal{U}\otimes\mathcal{V}$ (thanks to space of bilinear functionals), whereas for Cartesian product such operation is not defined.
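One way to see this concretely (a sketch with hypothetical vectors, again identifying coordinate tensors with matrices): sums of pure tensors $\mathbf{a}\otimes\mathbf{b}$ correspond to sums of rank-1 matrices, and such a sum generally has rank greater than 1, so it cannot come from any single pair $(\mathbf{a},\mathbf{b})$ in the Cartesian product.

```python
import numpy as np

# Pure tensors a (x) b correspond to rank-1 matrices outer(a, b).
a1, b1 = np.array([1.0, 0.0]), np.array([1.0, 0.0])
a2, b2 = np.array([0.0, 1.0]), np.array([0.0, 1.0])

s = np.outer(a1, b1) + np.outer(a2, b2)  # a sum in U (x) V

# The sum has matrix rank 2, so it is NOT of the form a (x) b:
# addition genuinely enlarges the space beyond pairs (a, b).
print(np.linalg.matrix_rank(s))  # 2
```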