To add to all the answers above, there is a delightful example in the text "Mathematics for Physics" by Stone and Goldbart (Appendix A.3) that clarifies the difference between vectors and covectors, which I can't resist quoting here.
One way of driving home the distinction between $V$ and $V^*$ is to consider
the space $V$ of fruit orders at a grocer's. Assume that the grocer stocks only
apples, oranges and pears. The elements of $V$ are then vectors such as
$x = 3 \text{ kg apples }+ 4.5 \text{ kg oranges } + 2 \text{ kg pears.} $
Take $V^*$ to be the space of possible price lists, an example element being
$f = (\$3.00/\text{kg}) \text{ apples}^* + (\$2.00/\text{kg}) \text{ oranges}^* + (\$1.50/\text{kg}) \text{ pears}^*$
The evaluation of $f$ on $x$
$f(x) = 3 \times \$3.00 + 4.5 \times \$2.00 + 2 \times \$1.50 = \$21.00$
then returns the total cost of the order. You should have no difficulty in
distinguishing between a price list and a box of fruit!
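The pairing between a covector and a vector is just a sum of products over the basis directions. Here is a minimal sketch of the grocer example in Python (the function and variable names are illustrative, not from the text):

```python
# A fruit order is a vector in V; a price list is a covector in V*.
# Evaluating the covector on the vector sums (price/kg) * (kg ordered)
# over each basis "fruit" direction.

def evaluate(price_list, order):
    """Apply a covector (price list) to a vector (fruit order)."""
    return sum(price_list[fruit] * kg for fruit, kg in order.items())

order = {"apples": 3.0, "oranges": 4.5, "pears": 2.0}      # x in V
prices = {"apples": 3.00, "oranges": 2.00, "pears": 1.50}  # f in V*

total = evaluate(prices, order)
print(f"f(x) = ${total:.2f}")  # f(x) = $21.00
```

Note that `order` and `prices` live in different spaces: scaling every price by a constant changes the covector, not the box of fruit.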
My favorite way to interpret the trace is as the average value of an associated quadratic form. Here's how that works.
Let $V$ be an $n$-dimensional vector space, and let $T$ be a tensor on $V$. First let's consider the case in which $T$ is a tensor of type $(1,1)$, which we can also interpret as a linear map from $V$ to itself. Choose an inner product $\left< \cdot,\cdot\right>$ on $V$, and define the associated quadratic form $Q\colon V\to\mathbb R$ by
$$Q(x) = \left< x, Tx \right>.$$
Then a computation shows that the trace of $T$ is $n$ times the average value of $Q$ over the unit sphere in $V$.
(Here's a sketch of how this computation is done: Choose an orthonormal basis for $V$ and express $x$ in terms of that basis as an $n$-tuple $(x^1,\dots,x^n)$, with $(x^1)^2 + \dots + (x^n)^2 = 1$. Then
$$\int_{\mathbb S^{n-1}} Q(x)\,dA =
\sum_{i,j}T_i^j\int_{\mathbb S^{n-1}} x^ix^j\,dA.
$$
The integrals on the right with $i\ne j$ are all zero by symmetry, while the ones with $i=j$ are all equal, as can be seen by renaming the variables; adding them all up yields $\int_{\mathbb S^{n-1}} \big((x^1)^2+\dots+(x^n)^2\big)\,dA$, which is the total volume of the sphere, so each integral is $1/n$ times the volume.)
It's interesting to note that, because the trace is independent of basis, this result doesn't depend on the inner product chosen, even though the quadratic form will change depending on the inner product.
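This identity is easy to test numerically. Below is an illustrative Monte Carlo sketch (the matrix $T$, seed, and sample count are arbitrary choices, not from the answer): draw uniform points on the unit sphere by normalizing Gaussian vectors, average $Q(x) = \left<x, Tx\right>$, and multiply by $n$:

```python
import random
import math

random.seed(0)

n = 3
T = [[2.0, 1.0, 0.0],
     [0.5, -1.0, 3.0],
     [4.0, 0.0, 1.5]]          # trace = 2.0 - 1.0 + 1.5 = 2.5

def Q(x):
    """Quadratic form <x, Tx> in an orthonormal basis."""
    Tx = [sum(T[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(x[i] * Tx[i] for i in range(n))

samples = 200_000
total = 0.0
for _ in range(samples):
    g = [random.gauss(0.0, 1.0) for _ in range(n)]
    r = math.sqrt(sum(c * c for c in g))
    x = [c / r for c in g]     # uniform point on the unit sphere
    total += Q(x)

estimate = n * total / samples
print(estimate)                 # close to trace(T) = 2.5
```

Note that $T$ here is deliberately non-symmetric: the average only sees the symmetric part, whose trace equals the trace of $T$.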
The quadratic form may seem to capture only part of the information encoded in $T$. But note that once an inner product is chosen, there's a one-to-one correspondence between linear maps $T\colon V\to V$ and bilinear forms $B_T\colon V\times V\to\mathbb R$, given by $B_T(x,y) = \left<x,Ty\right>$. Each such bilinear form decomposes into a symmetric part and a skew-symmetric part: $B_T = B_T^{\text{sym}}+B_T^{\text{skew}}$. The trace of the skew part is zero, so the trace only "sees" the symmetric part; and the symmetric part can be reconstructed from the quadratic form by using the polarization identity $B_T(x,y) = \tfrac14(Q(x+y)-Q(x-y))$.
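The polarization step can be checked numerically as well. This sketch (arbitrary matrix and vectors, chosen just for illustration) compares $\tfrac14(Q(x+y)-Q(x-y))$ against the symmetric part of $B_T$ computed directly:

```python
# Check that polarization recovers the symmetric part of B_T:
# (Q(x+y) - Q(x-y))/4 = (B_T(x,y) + B_T(y,x))/2.

n = 3
T = [[2.0, 1.0, 0.0],
     [0.5, -1.0, 3.0],
     [4.0, 0.0, 1.5]]

def B(x, y):
    """Bilinear form B_T(x, y) = <x, T y> in an orthonormal basis."""
    Ty = [sum(T[i][j] * y[j] for j in range(n)) for i in range(n)]
    return sum(x[i] * Ty[i] for i in range(n))

def Q(x):
    return B(x, x)

x = [1.0, -2.0, 0.5]
y = [3.0, 0.0, 1.0]

sym_via_polarization = (Q([a + b for a, b in zip(x, y)])
                        - Q([a - b for a, b in zip(x, y)])) / 4
sym_direct = (B(x, y) + B(y, x)) / 2

print(abs(sym_via_polarization - sym_direct))  # ~0 up to rounding
```

Expanding $Q(x+y) - Q(x-y) = 2B_T(x,y) + 2B_T(y,x)$ shows why the skew part drops out: it cancels in both terms.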
Now if $T$ is a tensor of type $(k,l)$, the contraction on any pair of indices yields a tensor of type $(k-1,l-1)$, whose value on any set of arguments $x_1,\dots,x_{k-1}, x_1^*,\dots,x_{l-1}^*$ is just $n$ times the average value of the quadratic form determined by the $(1,1)$-tensor $T(x_1,\dots,x_{k-1},\ \cdot\ , x_1^*,\dots,x_{l-1}^*,\ \cdot\ )$.
First step: we need to define the trace of an endomorphism $f:V\to V$, where $\dim V=n$. One way is to take a basis $\beta=\{e_1,\dots, e_n\}$ of $V$, consider the associated matrix $[f]_{\beta}$, and define \begin{align} \text{trace}(f):=\text{trace}([f]_{\beta}):=\sum_{i=1}^n([f]_{\beta})_{ii}, \end{align} i.e. the sum of the diagonal entries of the matrix representation of $f$. This doesn't depend on the choice of basis: if you use a different basis $\gamma$, then $[f]_{\gamma}=P[f]_{\beta}P^{-1}$ for some invertible matrix $P$ (i.e. the two matrices are similar), and well-definedness follows from the cyclic property of traces, $\text{trace}(AB)=\text{trace}(BA)$. Using the isomorphism $\text{End}(V)\cong T^1_1(V)$, we see that $\text{trace}:\text{End}(V)\to\Bbb{R}$ induces a mapping (which by slight abuse of language we still call 'trace') $\text{trace}:T^1_1(V)\to\Bbb{R}$. If you carry out this isomorphism, you'll see that it amounts to taking a basis $\{e_1,\dots, e_n\}$ of $V$ and the dual basis $\{\epsilon^1,\dots, \epsilon^n\}$ of $V^*$; the trace of a $(1,1)$ tensor $F$ is then \begin{align} \text{trace}(F)&=\sum_{i=1}^nF(\epsilon^i,e_i). \end{align}
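The two facts used above, cyclic invariance and basis independence, can be checked numerically. A minimal pure-Python sketch (the matrices and the shear $P$ are arbitrary illustrative choices):

```python
# Cyclic property trace(AB) = trace(BA), which makes trace(f)
# independent of the basis used to represent f.

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, -1.0], [5.0, 2.0]]

print(trace(matmul(A, B)), trace(matmul(B, A)))  # 15.0 15.0

# Invariance under change of basis: [f]_gamma = P [f]_beta P^{-1}.
P = [[1.0, 1.0], [0.0, 1.0]]
P_inv = [[1.0, -1.0], [0.0, 1.0]]   # inverse of this shear, by hand
print(trace(A), trace(matmul(matmul(P, A), P_inv)))  # 5.0 5.0
```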
So, in the above paragraph, we've defined the trace of a $(1,1)$ tensor. A natural question arises as to whether we can define an analogous operation for higher order tensors. Let $F$ be a $(k+1,l+1)$ tensor in your notation, where $k,l\geq 0$. This means $F$ is a multilinear map $(V^*)^{l+1}\times V^{k+1}\to\Bbb{R}$. Let us fix two integers $i$ and $j$ such that $1\leq i\leq l+1$ and $1\leq j\leq k+1$. We can now define a map $C_{ij}:T^{k+1}_{l+1}(V)\to T^k_l(V)$, which we shall call the '$i,j$ contraction map', whose definition is: you take $\omega^1,\dots, \omega^l\in V^*$ and $v_1,\dots, v_k\in V$, and define $C_{ij}(F)\in T^k_l(V)$ such that its value on these arguments is \begin{align} \text{trace}\bigg( F(\omega^1,\dots, \omega^{i-1}, \underbrace{\star}_{\text{$i^{th}$ covector slot}}, \omega^i,\dots, \omega^l, v_1,\dots, v_{j-1},\underbrace{\star}_{\text{$j^{th}$ vector slot}},v_j,\dots, v_k)\bigg). \end{align} In words, we take $k$ vectors $v_1,\dots, v_k$ and $l$ covectors $\omega^1,\dots, \omega^l$, and we feed them into $F$ (which has $l+1$ open slots for covectors and $k+1$ open slots for vectors) in such a way that the $i^{th}$ covector slot and the $j^{th}$ vector slot are left empty. With these two slots left open, we now have a $(1,1)$ tensor, so by my first paragraph, you can take the trace and get a number.
So that's the definition. Here I've given this mapping the name $C_{ij}$ to mean the '$i,j$ contraction', but it's also common to call it $\text{tr}_{ij}$ to mean the trace over the $i,j$ slots. Often, we may dispense with notation like $C_{ij}$ or $\text{tr}_{ij}$, and simply say in words "take the trace/contraction of the tensor $F$ over its $i^{th}$ covector and $j^{th}$ vector slots".
For concreteness, let's say $F$ is a $(3,2)$ tensor, meaning a multilinear map $F:V^*\times V^*\times V\times V\times V\to\Bbb{R}$. And say I want to take the trace over the first covector slot and the second vector slot (i.e. $C_{12}$ or $\text{tr}_{12}$). Then, $\text{tr}_{12}(F):V^*\times V\times V\to\Bbb{R}$ is the map such that for all $\omega\in V^*,u,v\in V$, \begin{align} (\text{tr}_{12}F)(\omega, u,v)&:=\text{trace}\bigg(F(\star,\omega, u,\star, v)\bigg)=\sum_{i=1}^nF(\epsilon^i,\omega,u,e_i,v) \end{align}
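In components, this contraction is just a sum over a repeated index. The following sketch (with $V=\Bbb{R}^2$ and arbitrary made-up components, purely for illustration) computes $\text{tr}_{12}F$ both ways: by summing components over the repeated index, and by the slot-filling formula $\sum_i F(\epsilon^i,\omega,u,e_i,v)$:

```python
import itertools

# A (3,2) tensor F on V = R^2, stored by components
# F[a][b][c][d][e] = F(eps^a, eps^b, e_c, e_d, e_e)
# (slots ordered as in the answer: two covectors, then three vectors).
# tr_12 pairs the 1st covector slot with the 2nd vector slot:
# (tr_12 F)[b][c][e] = sum_i F[i][b][c][i][e].

n = 2
F = [[[[[float(a + 2*b + 3*c + 5*d + 7*e)      # arbitrary components
         for e in range(n)] for d in range(n)]
       for c in range(n)] for b in range(n)] for a in range(n)]

tr12_F = [[[sum(F[i][b][c][i][e] for i in range(n))
            for e in range(n)] for c in range(n)] for b in range(n)]

def eval_F(w1, w2, x, y, z):
    """Multilinear evaluation F(w1, w2, x, y, z) from components."""
    return sum(w1[a] * w2[b] * x[c] * y[d] * z[e] * F[a][b][c][d][e]
               for a, b, c, d, e in itertools.product(range(n), repeat=5))

def eval_tr12(w, u, v):
    return sum(w[b] * u[c] * v[e] * tr12_F[b][c][e]
               for b, c, e in itertools.product(range(n), repeat=3))

eps = [[1.0, 0.0], [0.0, 1.0]]   # dual basis components = standard basis
w, u, v = [2.0, -1.0], [0.5, 3.0], [1.0, 4.0]

lhs = sum(eval_F(eps[i], w, u, eps[i], v) for i in range(n))
rhs = eval_tr12(w, u, v)
print(abs(lhs - rhs))  # 0.0: the two computations agree
```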
For a slightly more abstract perspective on traces, see this answer of mine. The point is we can take any number of vector spaces $V_1,\dots, V_p$, and form the tensor product space $V_1\otimes\cdots\otimes V_p$. As long as we have one copy of $V$ and one copy of $V^*$ in the tensor product (i.e. there exist distinct indices $i,j\in\{1,\dots, p\}$ such that $V_i=V_j^*$), we can define a trace/contraction mapping over those spaces, thereby obtaining a linear map $V_1\otimes\cdots\otimes V_p\to V_1\otimes\cdots\widehat{V_i}\otimes\cdots\otimes \widehat{V_j}\otimes\cdots\otimes V_p$, where the hat means omit that space in the tensor product.
We can generalize this idea further. Suppose $V_1,\dots, V_p$ are any vector spaces. Suppose we fix distinct indices $i,j$, and that we have a bilinear map $\mu:V_i\times V_j\to\Bbb{R}$. Then, we can define a 'contraction with respect to $\mu$' to be the unique linear map $\tilde{\mu}:V_1\otimes\cdots\otimes V_p\to V_1\otimes\cdots\widehat{V_i}\otimes\cdots\otimes \widehat{V_j}\otimes\cdots\otimes V_p$ such that for all pure tensors, we have \begin{align} \tilde{\mu}(v_1\otimes\cdots\otimes v_p)&=\mu(v_i,v_j)\cdot v_1\otimes\cdots \otimes\widehat{v_i}\otimes\cdots\otimes\widehat{v_j}\otimes\cdots\otimes v_p. \end{align} The previous paragraph was the special case where $\mu:V\times V^*\to\Bbb{R}$ is the evaluation mapping on a pair of vector spaces.
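On pure tensors, this generalized contraction is easy to write down concretely. A sketch (the choice of $\mu$ as the standard dot product, and the representation of a pure tensor as a list of its factors, are my own illustrative conventions):

```python
# Contraction of a pure tensor v_1 (x) ... (x) v_p with respect to a
# bilinear map mu on slots i, j: scale the remaining factors by
# mu(v_i, v_j) and drop slots i and j.

def dot(v, w):
    """An arbitrary illustrative choice of bilinear map mu."""
    return sum(a * b for a, b in zip(v, w))

def contract(factors, i, j, mu=dot):
    """Return (scalar, remaining factors) for the mu-contraction."""
    scalar = mu(factors[i], factors[j])
    rest = [v for k, v in enumerate(factors) if k not in (i, j)]
    return scalar, rest

v1, v2, v3 = [1.0, 2.0], [3.0, -1.0], [0.0, 5.0]
scalar, rest = contract([v1, v2, v3], 0, 1)
print(scalar, rest)   # 1.0 [[0.0, 5.0]]  since mu(v1, v2) = 3 - 2 = 1
```

Extending this from pure tensors to all of $V_1\otimes\cdots\otimes V_p$ by linearity is exactly what the universal property guarantees is possible and unique.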