The choice of window length involves a balance between two opposing factors. A shorter window implies a smaller data set on which to perform your estimations. A longer window implies an increase in the chance that the data-generating process has changed over the time period covered by the window, so that the oldest data are no longer representative of the system's current behavior.
Suppose, for example, that you wanted to estimate the January mean temperature in New York. Due to climate change, data from 40 years ago are no longer representative of current conditions. However, if you use only data from the past 5 years, your estimate will have a large uncertainty due to natural sampling variability.
Analogously, if you were trying to model the behavior of the Dow Jones Industrial Average, you could pull in data going back over a century. But you may have legitimate reasons to believe that data from the 1920s will not be representative of the process that generates the DJIA values today.
To put it in other terms, shorter windows increase your parameter risk while longer windows increase your model risk. A short data sample increases the chance that your parameter estimates are way off, conditional on your model specification. A longer data sample increases the chance that you are trying to stretch your model to cover more cases than it can accurately represent. A more "local" model may do a better job.
Your selection of window size depends, therefore, on your specific application -- including the potential costs for different kinds of error. If you were certain that the underlying data-generating process was stable, then the more data you have, the better. If not, then maybe not.
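To make the trade-off concrete, here is a toy simulation (my own sketch, not part of the original argument): estimate the current mean of a slowly drifting series using rolling windows of different lengths. Short windows suffer mostly from sampling noise, long windows mostly from the drift, and the lowest error tends to sit at an intermediate window.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
drift = np.linspace(0.0, 3.0, n)                 # slow change in the data-generating process
series = drift + rng.normal(scale=1.0, size=n)   # observations = drifting mean + noise

for window in (10, 50, 200, 1000):
    errors = []
    for t in range(1000, n):                     # start late enough that every window fits
        estimate = series[t - window:t].mean()   # rolling-window estimate of the current mean
        errors.append(estimate - drift[t])       # true current mean at time t is drift[t]
    rmse = float(np.sqrt(np.mean(np.square(errors))))
    print(f"window={window:4d}  RMSE of estimate={rmse:.3f}")
```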
I'm afraid I can't offer more insight on how to strike this balance appropriately, without knowing more about the specifics of your application. Perhaps others can offer pointers to particular statistical tests.
What most people do in practice (not necessarily the best practice) is to eyeball it, choosing the longest window for which one can be "reasonably comfortable" that the underlying data-generating process has, during that period, not changed "much". These judgements are based on the analyst's heuristic understanding of the data-generating process.
Setting theoretical considerations aside, the Akaike Information Criterion is just the likelihood penalized by the degrees of freedom (the number of estimated parameters). It follows that AIC accounts for uncertainty in the data (the -2LL term) and assumes that more parameters lead to a higher risk of overfitting (the 2k term). Cross-validation just looks at the test-set performance of the model, with no further assumptions.
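For reference, the usual definition, with $\hat{L}$ the maximized likelihood and $k$ the number of estimated parameters:

$$\mathrm{AIC} = 2k - 2\ln\hat{L}.$$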
If you care mostly about making predictions and you can assume that the test set(s) will be reasonably similar to the real-world data, you should go for cross-validation. The possible problem is that when your data set is small, splitting it leaves you with small training and test sets. Less data for training is bad, and less data for the test set makes the cross-validation results more uncertain (see Varoquaux, 2018). If your sample is too small for this, you may be forced to use AIC, keeping in mind what it measures and what assumptions it makes.
On the other hand, as already mentioned in the comments, AIC's guarantees are asymptotic, and that is of little help with small samples. Small samples may also be misleading about the uncertainty in the data.
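As a rough illustration of the difference (my own sketch, not taken from the answer above): on a small regression problem you can select a polynomial degree either by a Gaussian AIC computed on the full sample or by K-fold cross-validation. The data, the candidate degrees, and the Gaussian form of AIC below are my choices, made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
n = 40                                        # deliberately small sample
x = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=n)

for degree in (1, 2, 3, 5, 8):
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          LinearRegression())
    model.fit(x, y)
    rss = np.sum((y - model.predict(x)) ** 2)
    k = degree + 1                            # polynomial coefficients plus intercept
    aic = n * np.log(rss / n) + 2 * k         # Gaussian AIC, dropping additive constants
    cv_mse = -cross_val_score(model, x, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    print(f"degree={degree}  AIC={aic:7.1f}  5-fold CV MSE={cv_mse:.3f}")
```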
Best Answer
There is no problem with using K-fold cross-validation on the entire time series in time series classification. It is commonly used with algorithms such as NN-DTW (Nearest Neighbour Dynamic Time Warping) to select the size of the warping window. The sktime package you referred to uses this type of cross-validation to select hyper-parameters for some of its classification algorithms - for example HIVE-COTE (Lines, Taylor, and Bagnall, "HIVE-COTE: The Hierarchical Vote Collective of Transformation-Based Ensembles for Time Series Classification").
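As a minimal sketch of what such a selection loop can look like (my code, not sktime's own tuning utility): it assumes sktime's KNeighborsTimeSeriesClassifier accepts distance="dtw" with a distance_params={"window": ...} dict, and that the small bundled load_unit_test dataset is available - check your installed sktime version's docs before relying on either.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sktime.datasets import load_unit_test  # small bundled dataset, used only as a stand-in

X, y = load_unit_test(split="train", return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Plain K-fold CV over whole series to pick the DTW warping window (fraction of series length).
for window in (0.0, 0.05, 0.1, 0.2):
    scores = []
    for train_idx, test_idx in kf.split(X):
        clf = KNeighborsTimeSeriesClassifier(
            n_neighbors=1, distance="dtw", distance_params={"window": window}
        )
        clf.fit(X.iloc[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], clf.predict(X.iloc[test_idx])))
    print(f"window={window:.2f}  mean CV accuracy={np.mean(scores):.3f}")
```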