I'm not an authoritative figure, so consider these brief practitioner notes:
More trees are always better, with diminishing returns.
Deeper trees are almost always better, subject to requiring more trees for similar performance.
The above two points are a direct result of the bias-variance tradeoff. Deeper trees reduce the bias; more trees reduce the variance.
The most important hyper-parameter is how many features to test for each split. The more useless features there are, the more features you should try. This needs to be tuned. You can roughly tune it via OOB estimates if you just want to know your performance on your training data and there is no twinning (~repeated measures); a sketch follows these notes. Even though this is the most important parameter, its optimum is still usually fairly close to the originally suggested defaults (sqrt(p) for classification, p/3 for regression).
Fairly recent research shows you don't even need to do exhaustive split searches inside a feature to get good performance. Just try a few cut points for each selected feature and move on. This makes training even faster (see Extremely Randomized Trees).
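As a sketch of the OOB-based tuning of mtry mentioned above (only an illustration; dat and the response column y are placeholders), the randomForest package's tuneRF steps mtry up and down from its default and keeps searching while the OOB error improves:
library(randomForest)
x <- dat[, setdiff(names(dat), "y")]      # predictors
res <- tuneRF(x, dat$y,
              ntreeTry   = 500,    # trees fit for each candidate mtry
              stepFactor = 1.5,    # factor by which mtry is increased/decreased
              improve    = 0.01)   # minimum relative OOB improvement to keep searching
res   # matrix of mtry values and their OOB error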
I think you are describing nested cross validation, and you can use it to select your best hyperparameters. R already has some packages implementing this. For example, for support vector machines you could use the package e1071 and do something like this, assuming you have two independent variables:
library(e1071)
## Cross-validated grid search over gamma and cost for a nu-regression SVM
## with a radial kernel.
svmTuning <- tune.svm(Y ~ X1 + X2, data = dat,
                      type = "nu-regression", kernel = "radial",
                      gamma = seq(from = 0, to = 3, by = 0.1),
                      cost  = seq(from = 2, to = 16, by = 2),
                      tunecontrol = tune.control(sampling = "cross", cross = 1000))
If you had 1000 observations, the previous call would perform leave-one-out cross validation, sweeping through the possible combinations of the selected gammas and costs (but only one kernel in this case). You can see the best parameters by doing:
svmTuning$best.parameters
I'm pretty sure the optimum is chosen using the cross-validated mean squared error in the case of regression, and the average classification error in the case of classification.
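If you want to inspect the whole grid of cross-validated errors rather than just the winner, the tune object from e1071 also stores them:
summary(svmTuning)          # best parameters, best performance and the full grid
svmTuning$performances      # data frame with one row per gamma/cost combination and its CV error
svmTuning$best.performance  # cross-validated error of the chosen combination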
Here's another example with kernel k-nearest neighbours, using the kknn package:
library(kknn)
## train.kknn performs leave-one-out cross-validation internally,
## so no fold count needs to be specified.
knnTuning <- train.kknn(Y ~ X1 + X2, data = dat,
                        kmax = 40, distance = 2,
                        kernel = c("rectangular", "triangular", "epanechnikov",
                                   "gaussian", "rank", "optimal"),
                        ykernel = NULL, scale = TRUE)
This sweeps through all combinations of numbers of neighbours up to 40 and the different kernels, using the Euclidean distance (distance = 2). You may plot all these results and again obtain the best parameters:
plot(knnTuning)
knnTuning$best.parameters
You could do the same for random forest:
library(randomForest)
library(e1071)
## Leave-one-out cross-validation over mtry for a random forest.
## mtry cannot exceed the number of predictors, so with only X1 and X2
## the meaningful values are 1 and 2.
rfTuning <- tune.randomForest(Y ~ X1 + X2, data = dat,
                              ntree = 1000,
                              mtry = seq(from = 1, to = 2, by = 1),
                              tunecontrol = tune.control(sampling = "cross", cross = 1000))
Here you just sweep through possible values for the number of variables considered as candidates at each split. Tuning like this is known to overfit the cross-validation estimate if not done carefully.
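As with the SVM example, the chosen value and the model refit with it can be read off the returned object:
rfTuning$best.parameters  # mtry with the lowest cross-validated error
rfTuning$best.model       # random forest refit on all of dat with that mtry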
And so on and so forth. Since you appear to have a small sample size, maybe leave-one-out is the way to go. You might also look into the caret package, which has good capabilities for model building; its documentation is very solid (theoretical descriptions and all).
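For completeness, here is a minimal sketch of the same mtry sweep done with caret and leave-one-out cross-validation (again assuming the hypothetical dat with response Y and predictors X1 and X2):
library(caret)
ctrl  <- trainControl(method = "LOOCV")            # leave-one-out cross-validation
rfFit <- train(Y ~ X1 + X2, data = dat, method = "rf",
               trControl = ctrl,
               tuneGrid  = data.frame(mtry = 1:2), # only 1 or 2 features with two predictors
               ntree = 1000)                       # passed through to randomForest
rfFit$bestTune   # selected mtry
rfFit$results    # cross-validated performance for each candidate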
Best Answer
Random forests have the reputation of being relatively easy to tune. This is because they only have a few hyperparameters, and aren't overly sensitive to the particular values they take. Tuning the hyperparameters can often increase generalization performance somewhat.
Tree size can be controlled in different ways depending on the implementation, including the maximum depth, maximum number of nodes, and minimum number of points per leaf node. Larger trees can fit more complex functions, but also increase the risk of overfitting. Some implementations don't impose any restrictions by default, and grow trees fully. Tuning tree size can improve performance by balancing between over- and underfitting.
Number of features to consider per split. Each time a node is split, a random subset of features is considered, and the best is selected to perform the split. Considering more features increases the chance of finding a better split. But, it also increases the correlation between trees, increasing the variance of the overall model. Recommended default values are the square root of the total number of features for classification problems, and 1/3 the total number for regression problems. As with tree size, it may be possible to increase performance by tuning.
Number of trees. Increasing the number of trees in the forest decreases the variance of the overall model, and doesn't contribute to overfitting. From the standpoint of generalization performance, using more trees is therefore better. But, there are diminishing returns, and adding trees increases the computational burden. Therefore, it's best to fit some large number of trees while remaining within the computational budget. Several hundred is typically a good choice, but it may depend on the problem. Tuning isn't really needed. But, it's possible to monitor generalization performance while sequentially adding new trees to the model, then stop when performance plateaus.
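To make these knobs concrete, here is a minimal sketch with the randomForest package (a hypothetical data frame dat with response y; the particular values are only for illustration) showing where each hyperparameter is set and how the error versus the number of trees can be inspected:
library(randomForest)
rf <- randomForest(y ~ ., data = dat,
                   ntree    = 500,   # number of trees in the forest
                   mtry     = 4,     # features considered at each split
                   nodesize = 5,     # minimum points per leaf node (limits tree size)
                   maxnodes = NULL)  # optional cap on leaves; NULL grows trees fully
plot(rf)       # error versus number of trees: look for a plateau
rf$err.rate    # per-tree cumulative OOB error (classification); use rf$mse for regression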
Choosing hyperparameters
Tuning random forest hyperparameters uses the same general procedure as other models: Explore possible hyperparameter values using some search algorithm. For each set of hyperparameter values, train the model and estimate its generalization performance. Choose the hyperparameters that optimize this estimate. Finally, estimate the generalization performance of the final, tuned model on an independent data set.
For many models, this procedure often involves splitting the data into training, validation, and test sets, using holdout or nested cross validation. However, random forests have a unique, convenient property: bootstrapping is used to fit the individual trees, which readily yields the out-of-bag (OOB) error. This is an unbiased estimate of the error on future data, and can therefore take the place of the validation or test set error. This leaves more data available for training, and is computationally cheaper than nested cross validation. See this post for more information.
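One possible sketch of this OOB-based tuning (classification case; names, grid values, and the 80/20 split are only illustrative): score a small grid over mtry and nodesize by OOB error, then reserve a separate test set for the final estimate.
library(randomForest)
set.seed(1)
test_idx  <- sample(nrow(dat), round(0.2 * nrow(dat)))  # independent test set
train_dat <- dat[-test_idx, ]
test_dat  <- dat[test_idx, ]
grid <- expand.grid(mtry = c(2, 4, 8), nodesize = c(1, 5, 10))
grid$oob <- NA
for (i in seq_len(nrow(grid))) {
  fit <- randomForest(y ~ ., data = train_dat, ntree = 500,
                      mtry = grid$mtry[i], nodesize = grid$nodesize[i])
  grid$oob[i] <- fit$err.rate[fit$ntree, "OOB"]   # OOB error of the full forest
}
best <- grid[which.min(grid$oob), ]
## Refit with the chosen values and estimate generalization error on the test set.
final <- randomForest(y ~ ., data = train_dat, ntree = 500,
                      mtry = best$mtry, nodesize = best$nodesize)
mean(predict(final, test_dat) != test_dat$y)      # test-set classification error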
Grid search is probably the most popular search algorithm for hyperparameter optimization. Random search can be faster in some situations. I mention more about this (and some other hyperparameter optimization issues) here. Fancier algorithms (e.g. Bayesian optimization) are also possible.
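A random search over the same space just samples candidate combinations instead of enumerating every grid cell; a minimal sketch (ranges and budget are arbitrary):
set.seed(2)
n_draws <- 20   # evaluation budget
cand <- data.frame(mtry     = sample(1:8,  n_draws, replace = TRUE),
                   nodesize = sample(1:20, n_draws, replace = TRUE))
## Each row of cand is then scored by OOB error exactly as in the grid-search loop above.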