Solved – setting neural network hidden layer sizes

deep learning, machine learning, neural networks, regression

I'm having trouble choosing the hidden layer sizes for my network.

I have a dataset of size [40,000,000 x 60], so it has a very large number of instances.

I tried regression with 4 hidden layers, each of size 300.

Someone told me that my network's poor performance could be due to the layers being too small, and that I should increase the capacity of the neural network.

1. How should I set the hidden layer size? It seems arbitrary, and I wonder if there is a rule of thumb.

I used the same size in all 4 layers, but accuracy might improve with different sizes per layer. 2. Is there a rule of thumb for setting the individual layer sizes? I'm dealing with a regression problem, and people say the highest (last) hidden layer should be large in regression problems; I wonder why.
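
For reference, here is roughly what my current network looks like, plus a tapered variant of the kind I mean by "different layer sizes" (a PyTorch sketch; the framework and the tapered sizes are just for illustration):

```python
import torch.nn as nn

def make_mlp(in_dim, hidden_sizes, out_dim=1):
    """Plain MLP regressor with the given hidden layer sizes."""
    layers, prev = [], in_dim
    for h in hidden_sizes:
        layers += [nn.Linear(prev, h), nn.ReLU()]
        prev = h
    layers.append(nn.Linear(prev, out_dim))  # linear output for regression
    return nn.Sequential(*layers)

# The constant-width setup from the question: 4 hidden layers of 300 units.
net_constant = make_mlp(60, [300, 300, 300, 300])
# A tapered alternative with a different size per layer.
net_tapered = make_mlp(60, [512, 256, 128, 64])
```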

— added —

It's quite complicated, but I'm trying to estimate a gene relevance score.

The inputs are gene expression correlation scores, so each regression instance looks like: features [0.9, 0.8, …, 0.5] (about 60 of them), relevance score [0.4].

The relevance score is what I want to predict.

There isn't much prior work on this. There are about 20,000 genes, and I train on all pairs, i.e. every combination of gene pairs; that's why I have so many data instances.

Currently I'm using a plain MLP, and 4 layers has been the best so far, but it's only a slight improvement over linear regression. One more odd thing is that linear regression performs better than random forest regression or gradient-boosted tree regression models.

Since I have so many rows, training and evaluation take a very long time.

Best Answer

I don't know a lot about statistical genomics, but I can give you a few suggestions.

Be wary of spurious correlations; they are a very common problem in statistical genomics. I suggest you keep a subset of the data separate from the rest and never use it to train or validate your architectures until you have selected the very final one. In other words, build different networks (different numbers of layers, different numbers of hidden units, etc.) without using the "reserved data", and choose the one with the smallest $k$-fold cross-validation error. Then, once you have fixed all the hyperparameters of your neural network, test it on the reserved data set. At the cost of some accuracy (your training set will be smaller), you gain some protection against the risk of mistaking noise for signal. Since the number of alternatives can be prohibitive, you can use automated machine learning frameworks which help you explore the space of possible networks, such as auto-sklearn and tpot.
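
A minimal sklearn sketch of that workflow; the arrays, candidate sizes, and metric below are stand-ins, not your actual data or a tuning recommendation:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neural_network import MLPRegressor

# Stand-in data: X is [n_instances, 60] correlation features, y the relevance score.
X, y = np.random.rand(2000, 60), np.random.rand(2000)

# Reserve a final test set and never touch it during model selection.
X_dev, X_res, y_dev, y_res = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "constant_300x4": MLPRegressor(hidden_layer_sizes=(300, 300, 300, 300), random_state=0),
    "tapered": MLPRegressor(hidden_layer_sizes=(512, 256, 128, 64), random_state=0),
}

# k-fold cross-validation on the development split only.
scores = {
    name: cross_val_score(model, X_dev, y_dev,
                          cv=KFold(n_splits=5, shuffle=True, random_state=0),
                          scoring="neg_mean_squared_error").mean()
    for name, model in candidates.items()
}

# Only after all hyperparameters are fixed, score once on the reserved set.
best = max(scores, key=scores.get)
final = candidates[best].fit(X_dev, y_dev)
print(best, final.score(X_res, y_res))  # R^2 on the reserved data
```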

Especially if you use these automated tools, you should not let them see the reserved data set. You're basically using a black box to define your architecture, and you may want some kind of insurance against overfitting.
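
For instance, with tpot (assuming it is installed; the search budget here is only illustrative), the optimizer fits on the development split alone, and the reserved split is scored a single time at the end:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tpot import TPOTRegressor

X, y = np.random.rand(2000, 60), np.random.rand(2000)  # stand-in data
X_dev, X_res, y_dev, y_res = train_test_split(X, y, test_size=0.2, random_state=0)

tpot = TPOTRegressor(generations=5, population_size=20, cv=5,
                     random_state=0, verbosity=2)
tpot.fit(X_dev, y_dev)           # the black-box search never sees X_res
print(tpot.score(X_res, y_res))  # one honest evaluation on the reserved set
tpot.export("best_pipeline.py")  # export the selected pipeline for inspection
```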

Also, more often than not, when deep learning has the same accuracy as linear regression on a large training set, it's a sign that you're doing something wrong. Read the "What's going on?" section here, and see also here for some common errors. Unfortunately, most of the material is geared towards classification, since that's where deep learning is used the most today.
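
A quick sanity check is to keep the linear baseline in the loop explicitly (stand-in data again):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = np.random.rand(2000, 60), np.random.rand(2000)  # stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

lin = LinearRegression().fit(X_tr, y_tr)
baseline_mse = mean_squared_error(y_te, lin.predict(X_te))
print(baseline_mse)
# A network that cannot clearly beat this on held-out data usually points
# to a training issue (feature scaling, learning rate, initialization)
# rather than insufficient capacity.
```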

Finally, if the ultimate goal of this genomic study is precision medicine, you may want to have a look at the Deep Review. It's a work in progress, but it may contain useful material for you.