I've been working on a regression problem where the input is an image and the label is a continuous value between 80 and 350. The images are of some chemicals after a reaction takes place. The resulting color indicates the concentration of another chemical that's left over, and that concentration is what the model should output. The images can be rotated, flipped, or mirrored, and the expected output should still be the same. This sort of analysis is done in real labs (very specialized machines output the concentration of the chemicals using color analysis, just like I'm training this model to do).
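Since every rotation and flip of an image shares the same label, one cheap way to exploit that invariance is to enumerate the eight symmetries of the square as augmentation. A minimal sketch, assuming images are NumPy arrays (`dihedral_variants` is a made-up helper name, not an existing API):

```python
import numpy as np

# Sketch: generate the 8 rotations/reflections (the dihedral group of the
# square) of an image; by the invariance described above they all share
# the same concentration label.
def dihedral_variants(img):
    variants = []
    for k in range(4):                   # 0, 90, 180, 270 degree rotations
        rot = np.rot90(img, k)
        variants.append(rot)
        variants.append(np.fliplr(rot))  # mirror of each rotation
    return variants

img = np.arange(16, dtype=float).reshape(4, 4)
augmented = dihedral_variants(img)
print(len(augmented))  # 8 variants per training image
```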
So far I've only experimented with models roughly based on VGG (multiple sequences of conv-conv-conv-pool blocks). Before experimenting with more recent architectures (Inception, ResNet, etc.), I thought I'd research whether other architectures are more commonly used for regression on images.
The dataset contains about 5,000 samples of 250×250 pixels, which I've resized to 64×64 so training is easier. Once I find a promising architecture, I'll experiment with larger-resolution images.
So far, my best models have a mean squared error on both training and validation sets of about 0.3, which is far from acceptable in my use case.
My best model so far looks like this:
// pseudo code
x = conv2d(x, filters=32, kernel=[3,3])->batch_norm()->relu()
x = conv2d(x, filters=32, kernel=[3,3])->batch_norm()->relu()
x = conv2d(x, filters=32, kernel=[3,3])->batch_norm()->relu()
x = maxpool(x, size=[2,2], stride=[2,2])
x = conv2d(x, filters=64, kernel=[3,3])->batch_norm()->relu()
x = conv2d(x, filters=64, kernel=[3,3])->batch_norm()->relu()
x = conv2d(x, filters=64, kernel=[3,3])->batch_norm()->relu()
x = maxpool(x, size=[2,2], stride=[2,2])
x = conv2d(x, filters=128, kernel=[3,3])->batch_norm()->relu()
x = conv2d(x, filters=128, kernel=[3,3])->batch_norm()->relu()
x = conv2d(x, filters=128, kernel=[3,3])->batch_norm()->relu()
x = maxpool(x, size=[2,2], stride=[2,2])
x = dropout(x)
x = conv2d(x, filters=128, kernel=[1,1])->batch_norm()->relu()
x = dropout(x)
x = conv2d(x, filters=32, kernel=[1,1])->batch_norm()->relu()
x = flatten(x)
y = dense(x, units=1)
// loss = mean_squared_error(y, labels)
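For reference, the pseudocode above corresponds roughly to the following Keras model. This is a sketch under assumptions the question doesn't state: TensorFlow/Keras, 64×64 RGB input, 'same' padding, a dropout rate of 0.5, and a flatten before the final dense layer.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel=3):
    # conv2d -> batch_norm -> relu, as in the pseudocode
    x = layers.Conv2D(filters, kernel, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inputs = layers.Input(shape=(64, 64, 3))
x = inputs
for filters in (32, 64, 128):            # three conv-conv-conv-pool blocks
    for _ in range(3):
        x = conv_bn_relu(x, filters)
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
x = layers.Dropout(0.5)(x)
x = conv_bn_relu(x, 128, kernel=1)
x = layers.Dropout(0.5)(x)
x = conv_bn_relu(x, 32, kernel=1)
x = layers.Flatten()(x)
outputs = layers.Dense(1)(x)             # single continuous output

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mean_squared_error")
```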
Question
What is an appropriate architecture for regression output from an image input?
Edit
I've rephrased my explanation and removed mentions of accuracy.
Edit 2
I've restructured my question so hopefully it's clear what I'm after
Best Answer
First of all, a general suggestion: do a literature search before you start running experiments on a topic you're not familiar with. You'll save yourself a lot of time.
In this case, looking at existing papers, you may have noticed that regression with CNNs is not a trivial problem. Looking again at the first paper, you'll see that they have a problem where they can basically generate infinite data. Their objective is to predict the rotation angle needed to rectify 2D pictures. This means that I can basically take my training set and augment it by rotating each image by an arbitrary angle, and I'll obtain a valid, bigger training set. Thus the problem seems relatively simple, as far as Deep Learning problems go. By the way, note the other data augmentation tricks they use.
I don't know your problem well enough to say whether it makes sense to consider variations in position, brightness and gamma noise for your pictures, carefully shot in a lab. But you can always try, and remove it if it doesn't improve your test set loss. Actually, you should really use a validation set or $k$-fold cross-validation for these kinds of experiments, and not look at the test set until you have defined your setup, if you want the test set loss to be representative of the generalization error.
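A minimal sketch of what that k-fold protocol looks like, in pure Python (in practice `sklearn.model_selection.KFold` does the same job):

```python
# Minimal k-fold index generator: each sample lands in the validation
# fold exactly once, and the test set is never touched during selection.
def k_fold_indices(n, k):
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, val))
        start += size
    return folds

folds = k_fold_indices(5000, 5)
print(len(folds), len(folds[0][1]))  # 5 folds of 1000 validation samples
```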
Anyway, even in their ideal conditions, the naive approach didn't work that well (section 4.2). They stripped out the output (softmax) layer and substituted it with a layer of two units which would predict the sine $y$ and cosine $x$ of the rotation angle. The actual angle would then be computed as $\alpha=\text{atan2}(y,x)$. The neural network was also pretrained on ImageNet (this is called transfer learning). Of course the training on ImageNet had been for a different task (classification), but still, training the neural network from scratch must have given such horrible results that they decided not to publish them.

So they had all the ingredients to make a good omelette: potentially infinite training data, a pretrained network and an apparently simple regression problem (predict two numbers between $-1$ and $1$). Yet, the best they could get with this approach was a 21° error. It's not clear whether this is an RMSE error, a MAD error or something else, but still it's not great: since the maximum error you can make is 180°, the average error is $>11\%$ of the maximum possible error. They did slightly better by using two networks in series: the first one would perform classification (predict whether the angle is in the $[-180°,-90°]$, $[-90°,0°]$, $[0°,90°]$ or $[90°,180°]$ class), then the image, rotated by the amount predicted by the first network, would be fed to a second neural network (for regression, this time), which would predict the final additional rotation in the $[-45°,45°]$ range.
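The sine/cosine trick is easy to sketch in isolation: the network outputs $(y,x)=(\sin\alpha,\cos\alpha)$ and the angle is recovered with atan2, which avoids the discontinuity at ±180° that a raw-angle target would have:

```python
import math

# The angle encoding described above: the network predicts (sin, cos)
# of the angle, and the angle is recovered with atan2.
def encode(angle_deg):
    rad = math.radians(angle_deg)
    return math.sin(rad), math.cos(rad)

def decode(y, x):
    return math.degrees(math.atan2(y, x))

# Round-trips for angles in (-180, 180]
for angle in (-135.0, -45.0, 0.0, 90.0, 179.0):
    y, x = encode(angle)
    assert abs(decode(y, x) - angle) < 1e-9

# Unlike a raw-angle target, this encoding has no discontinuity at
# +/-180 deg: 179 deg and -179 deg are close in (sin, cos) space.
y1, x1 = encode(179.0)
y2, x2 = encode(-179.0)
print(math.hypot(y1 - y2, x1 - x2))  # small distance
```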
On a much simpler problem (rotated MNIST), you can do better, but you still don't get below an RMSE which is $2.6\%$ of the maximum possible error.
So, what can we learn from this? First of all, 5000 images is a small data set for your task. The first paper used a network which was pretrained on images similar to those for which they wanted to learn the regression task: not only do you need to learn a different task from the one for which the architecture was designed (classification), but your training set doesn't look anything at all like the training sets on which these networks are usually trained (CIFAR-10/100 or ImageNet). So you probably won't get any benefit from transfer learning. The MATLAB example had 5000 images, but they were black and white and semantically all very similar (well, this could be your case too).
Then, how realistic is it to do better than 0.3? We must first of all understand what you mean by an average loss of 0.3. Do you mean that the RMSE is 0.3,
$$\sqrt{\frac{1}{N}\sum_{i=1}^N (h(\mathbf{x}_i)-y_i)^2}$$
where $N$ is the size of your training set (thus, $N< 5000$), $h(\mathbf{x}_i)$ is the output of your CNN for image $\mathbf{x}_i$ and $y_i$ is the corresponding concentration of the chemical? Since $y_i\in[80,350]$, then assuming that you clip the predictions of your CNN between 80 and 350 (or use a scaled sigmoid to make them fit in that interval), you're getting less than $0.12\%$ relative error: the label range is $350-80=270$, and $0.3/270\approx 0.11\%$. Seriously, what do you expect? It doesn't seem to me like a big error at all.
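The arithmetic behind that percentage, as a quick check (taking 0.3 as the RMSE and $[80, 350]$ as the label range):

```python
# Relative error of an RMSE of 0.3 over a label range of [80, 350]
rmse = 0.3
label_range = 350 - 80              # labels span 270 concentration units
relative_error = rmse / label_range
print(f"{relative_error:.2%}")      # 0.11%
```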
Also, just try to compute the number of parameters in your network. I'm in a hurry and I may be making silly mistakes, so by all means double check my computations with the `summary` function of whatever framework you may be using. However, roughly I would say you have (assuming same padding and a 64×64 input, so that the final feature map is 8×8×32)
$$9\times(3\times 32 + 2\times 32\times 32 + 32\times64+2\times64\times64+ 64\times128+2\times128\times128) +128\times128+128\times32+8\times8\times32=502624$$
(note I skipped the parameters of the batch norm layers, but they're just 4 parameters per channel, so they don't make a difference). You have half a million parameters and 5000 examples... what would you expect? Sure, the number of parameters is not a good indicator of the capacity of a neural network (it's a non-identifiable model), but still... I don't think you can do much better than this, but you can try a few things:
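A quick script to redo the count, under the same assumptions ('same' padding, 64×64 input so the final feature map is 8×8, biases and batch-norm ignored):

```python
# Rough weight count for the architecture in the question.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

total = 0
c_in = 3
for c_out in (32, 64, 128):          # three conv-conv-conv-pool blocks
    for _ in range(3):
        total += conv_params(3, c_in, c_out)
        c_in = c_out
# 1x1 convolutions
total += conv_params(1, 128, 128)
total += conv_params(1, 128, 32)
# dense layer: after three 2x2 poolings, 64x64 -> 8x8 feature maps
total += 8 * 8 * 32 * 1
# Around half a million either way; the exact figure depends on the
# assumed input size and padding.
print(total)  # 502624
```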