Using a neural network to learn linear regression variance

Tags: machine-learning, neural-networks, variance

In Deep Learning by Goodfellow, Bengio, and Courville, they show that minimizing MSE as a loss function gives the same result as maximizing the likelihood (MLE) of an assumed Gaussian model (Section 5.5.1):

$-m \log(\sigma) - \frac{m}{2}\log(2 \pi) - \sum_{i=1}^{m} \frac{\left\lVert \hat{y}^{(i)} - y^{(i)} \right\rVert^2}{2 \sigma^2}$

where $m$ is the number of examples, $\hat{y}^{(i)}$ is the output of a model (presumably a neural network) and $y^{(i)}$ is the ground truth. This seems pretty clear to me. In pyTorch, I can implement the MSE loss function as

def custom_loss(outputs, targets):
    """ MSE, but written by me """
    return torch.mean(torch.pow(outputs - targets, 2))

This function assumes a network with a single output node. There is a built-in MSE loss, but I want to walk before I run here. It works: I can easily fit linear functions that are 'corrupted' by noise with this loss function. Later, in Section 6.2.2.4, they write

For example, we may wish to learn the variance of a conditional Gaussian for y, given x. In the simple case, where the variance $\sigma^2$ is a constant, there is a closed form expression because the maximum likelihood estimator of variance is simply the empirical mean of the squared difference between observations y and their expected value.
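
Concretely, I read that closed-form expression as the empirical mean of the squared residuals, which I can compute after fitting the mean. A minimal sketch, where net, X, and Y are placeholders for an already-trained model and the training tensors:

# Closed-form MLE of a constant variance, computed after fitting the mean.
# `net`, `X`, `Y` are placeholders for a trained model and the training data.
with torch.no_grad():
    residuals = net(X) - Y
    sigma2_hat = torch.mean(torch.pow(residuals, 2))  # mean squared residual
    sigma_hat = torch.sqrt(sigma2_hat)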

Sure, but how do I build this into training in pyTorch (my preferred neural network language)? If I simply chuck it into the custom_loss function, the two terms are going to compete. Without much else to go on, I can try the following loss function:

def custom_loss(outputs, targets):
    """ MLE, but written by me """
    prec = 1 / torch.var(outputs)
    not_mse = 0.5 * prec * torch.mean(torch.pow(outputs - targets, 2))
    return not_mse - 0.5 * torch.log(prec)

This function simply uses the MLE formula from above (ignoring the constant term). Again I have a single output node. Of course this doesn't work.

[Figure: first failure]
[Figure: second failure]

The dashed line is the function to learn, the red circles are the network's output, and the blue/magenta dots are the training and test sets (they are hard to tell apart, but they are there). I have $x \in [-3, 3]$ and $y = m x + b$ with $m = 2.3$ and $b = 1.7$, to which I add Gaussian noise with $\sigma = 1$. I train on 10000 points and test on 1000. I arbitrarily chose a (1, 50, 50, 1) network with ReLU activations and bias nodes. This network easily fits the dashed line with the first custom_loss function. The text continues

A computationally more expensive approach that does not require writing special-case code is to simply include the variance as one of the properties of the distribution $p(\mathbf{y}|\mathbf{x})$ that is controlled by $\omega = f(\mathbf{x};\mathbf{\theta})$. The negative log-likelihood $-\log p(\mathbf{y};\mathbf{\omega}(\mathbf{x}))$ will then provide a cost function with the appropriate terms necessary to make our optimization procedure incrementally learn the variance.

First, I don't know what a "propert[y] of the distribution" is. But this really just looks like the first equation (multiplied by $-1$ so that we minimize a loss instead of maximizing a likelihood), which I have already used above, and it doesn't work. And it seems obvious that it shouldn't work, because of the term competition.

In the simple case where the standard deviation does not depend on the input, we can make a new parameter in the network that is copied directly into $\mathbf{\omega}$.

I have no idea what this means, nor what to do with it. I just pick a weight and say "this is the variance" and off the MSE optimizer goes to optimize things? This is supervised learning; how do I pick a target for $\sigma$? The text makes it seem like I don't have to.

At this point the text veers into computational issues in choosing among precision, variance, and standard deviation. Clearly the authors think their task of explaining how to fit the variance is done and only implementation details remain, but I have no good idea what to do.

How do I learn the variance of my Gaussian model using a neural network?


Some code that might make your life easier when answering.

A pyTorch dataset class that will make a line with noise:

# imports shared by all the code snippets below
from typing import List, Tuple, Type

import numpy as np
import torch
import matplotlib.pyplot as plt


class RegDataset(torch.utils.data.Dataset):
    def __init__(self, m: float, b: float, stdev: float = 1,
                 limits: List[float] = [-1, 1], n: int = 1000):
        self.m = float(m)
        self.b = float(b)
        self.stdev = abs(stdev)
        self.limits = self._check_limits(limits)
        self.n = int(n)
        self.dtype = torch.float32
        
        self.X = self.make_x()
        self.Y = self.add_noise(self.line(self.X))
        
    @staticmethod
    def _check_limits(limits: List[float]) -> List[float]:
        assert len(limits) == 2, 'limits must be a list of two numerics'
        assert limits[0] != limits[1], 'limits must not be equal'
        if limits[0] > limits[1]:
            return [limits[1], limits[0]]
        return limits
        
    def make_x(self) -> torch.tensor:
        u = torch.rand((self.n,1), dtype=self.dtype)
        return (self.limits[1] - self.limits[0]) * u + self.limits[0]
        
    def line(self, x: torch.tensor) -> torch.tensor:
        return self.m * x + self.b
        #return self.m * torch.pow(x, 3) + self.b
    
    def add_noise(self, y: torch.tensor) -> torch.tensor:
        return y + torch.randn((y.shape[0], 1), dtype=self.dtype) * self.stdev
    
    def make_line(self) -> (torch.tensor, torch.tensor):
        x = torch.linspace(self.limits[0], self.limits[1], 100)
        return x, self.line(x)
    
    def compute_mse(self) -> float:
        return torch.pow(self.Y - self.line(self.X), 2).sum() / self.n
    
    def __len__(self) -> int:
        return len(self.X)

    def __getitem__(self, idx) -> (torch.tensor, torch.tensor):
        return self.X[idx], self.Y[idx]

A pyTorch class that will make an MLP network:

class MLP(torch.nn.Module):
    """
    A feed-forward multilayer perceptron for regression.
    """

    def __init__(self,
                 input_size: int,
                 output_size: int = 1,
                 layers: Tuple[int] = (64,),
                 dropout_fraction: float = 0.5,
                 include_bias: bool = True
                 ):
        """
        A neural network for regression.

        Parameters
        ----------
        input_size : int
            The size of the input layer
        output_size : int
            The size of the output layer, default is 1
        layers : tuple
            The MLP feed forward network shape: unit counts for each layer
        dropout_fraction : float
            The fraction of connections that should be dropped during training
        include_bias : bool
            Should the linear layers have a bias?
        """
        super().__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.layers = layers
        self.dropout_fraction = dropout_fraction
        self.bias = include_bias

        # required so pytorch can find the parameters
        self.mlp = torch.nn.ModuleList()
        self.dropouts = torch.nn.ModuleList()

        for idx, l in enumerate(self.layers):
            if idx == 0:
                last_layer_numbers = self.input_size
            else:
                last_layer_numbers = self.mlp[idx - 1].out_features

            self.mlp.append(torch.nn.Linear(last_layer_numbers, l,
                                            bias=self.bias))

            if idx < len(self.layers) - 1:
                self.dropouts.append(torch.nn.Dropout(p=self.dropout_fraction))
            else:
                self.dropouts.append(None)
        self.mlp.append(torch.nn.Linear(self.layers[-1], self.output_size,
                                        bias=False))

    def forward(self, x: torch.tensor) -> torch.tensor:
        """
        Forward prop on the neural network.
        """
        for layer, dropout in zip(self.mlp[:-1], self.dropouts):
            x = layer(x)
            x = torch.nn.functional.relu(x)
            if dropout:
                x = dropout(x)
        x = self.mlp[-1](x)
        return x

An "environment" to do training and testing:

class MLPEnvironment:
    """
    Training and testing harness for an MLP.
    """

    def __init__(self,
                 train_loader: torch.utils.data.DataLoader,
                 test_loader: torch.utils.data.DataLoader,
                 net: MLP,
                 optimizer: Type[torch.optim.Optimizer] = None,
                 loss_function: Type[torch.nn.MSELoss] = None,
                 n_epochs: int = 3,
                 learning_rate: float = 0.01,
                 momentum: float = 0.5,
                 log_interval: int = 50,
                 random_seed: int = 17
                 ):
        """
        Environment for training a neural network.


        Parameters
        ----------
        train_loader: RegDataset
            The source of the training data.
        test_loader: RegDataset
            The source of the test data.
        net: MLP
            The network to train / test.
        optimizer: Type[torch.optim.Optimizer]
            The optimizer to use for training.  Default is Adam.
        loss_function: Type[torch.nn.MSELoss]
            Loss function to use.  Default is MSE.
        n_epochs: int
            How many epochs to train on top of any previous training.
        learning_rate: float
            The learning rate.
        momentum: float
            Learning momentum.
        log_interval: int
            How often to log learning progress.  Number of batches.
        random_seed: int
            Seed for the random number generator.
        """
        self.train_loader = train_loader
        self.test_loader = test_loader
        self.net = net
        self.n_epochs = n_epochs
        self.trained_epochs = 0
        self.learning_rate = learning_rate
        self.momentum = momentum
        if optimizer is None:
            self.optimizer = torch.optim.Adam(
                self.net.parameters(),
                lr=self.learning_rate
            )
        else:
            self.optimizer = optimizer
        if loss_function is None:
            self.loss_function = torch.nn.MSELoss(reduction='mean')
        else:
            self.loss_function = loss_function
        self.log_interval = log_interval
        self.random_seed = random_seed

        # disable cuDNN (for reproducibility)
        torch.backends.cudnn.enabled = False
        # de-randomize
        torch.manual_seed(self.random_seed)

        self.train_losses = []
        self.train_counter = []
        self.test_losses = []
        self.test_counter = []
        self.loss = None

    def train_net(self, new_epochs: int = 0, verbose=True) -> None:
        if new_epochs > 0:
            n_epochs = new_epochs
        else:
            n_epochs = self.n_epochs  # run the default
        start_epoch = self.trained_epochs + 1
        for epoch in range(start_epoch, start_epoch + n_epochs):
            self.train(epoch, verbose=verbose)
            self.test(verbose=verbose)

    def train(self, epoch: int, verbose=True) -> None:
        num_train = len(self.train_loader.dataset)
        train_seen = 0
        self.net.train()  # set the network to train mode
        for batch_idx, (data, target) in enumerate(self.train_loader):
            data = data.float()
            target = target.float()

            self.optimizer.zero_grad()  # reset gradients
            output = self.net.forward(data)  # compute predictions
            self.loss = self.loss_function(output, target)  # compute loss
            self.loss.backward()  # backprop
            self.optimizer.step()  # training step

            train_seen += len(data)

            if batch_idx % self.log_interval == 0:
                if verbose:
                    print(
                        'Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.2e}'
                            .format(epoch, train_seen, num_train,
                                    100. * train_seen / num_train,
                                    self.loss.item()
                                    )
                    )
                self.train_losses.append(self.loss.item())
                self.train_counter.append(
                    (train_seen) +
                    ((epoch - 1) * num_train)
                )
        self.trained_epochs += 1

    def test(self, verbose=True) -> None:
        self.net.eval()  # set the network to evaluation mode
        test_loss = 0
        number_test_sets = 0
        with torch.no_grad():  # don't compute gradients
            for data, target in self.test_loader:
                data = data.float()
                target = target.float()
                output = self.net.forward(data)
                test_loss += self.loss_function(output, target).item()
                number_test_sets += 1
        test_loss /= number_test_sets
        self.test_losses.append(test_loss)
        self.test_counter.append(self.trained_epochs *
                                 len(self.train_loader.dataset))
        if verbose:
            print(
                '\nTest set: Avg. loss: {:.2e}\n'.format(
                    test_loss)
            )

    def plot_performance(self, logx: bool = False) -> None:
        plt.figure()
        plt.semilogy([u / 1000 for u in self.train_counter], self.train_losses,
                     color='blue')
        plt.semilogy([u / 1000 for u in self.test_counter], self.test_losses,
                     color='red', linestyle='', marker='o')
        plt.legend(['Train Loss', 'Test Loss'], loc='upper right')
        plt.xlabel('number of training examples seen (1000s)')
        plt.ylabel('{}'.format(repr(self.loss_function)[:-2]))
        if logx:
            ax = plt.gca()
            ax.set_xscale('log')
        plt.show()

    def count_weights(self) -> int:
        """ counts the number of weights trained in the network """
        count = 0
        for p in self.net.mlp.parameters():
            if len(p.shape) == 2:
                count += int(np.prod(p.shape))
        return count

All of these make constructing the problem and playing around easy:

m = 2.3
b = 1.7
limits = [-3.0, 3.0]
stdev = 1.0
n_train = 10000
n_test = 1000
batch_size = 500


def custom_loss(outputs, targets):
    """ MLE, but written by me """
    prec = 1 / torch.var(outputs)
    not_mse = 0.5 * prec * torch.mean(torch.pow(outputs - targets, 2))
    return not_mse - 0.5 * torch.log(prec)

train_dataset = RegDataset(m, b, stdev, limits, n_train)
test_dataset = RegDataset(m, b, stdev, limits, n_test)

train_dataloader = torch.utils.data.DataLoader(train_dataset, 
                                               batch_size=batch_size, shuffle=False)
test_dataloader = torch.utils.data.DataLoader(test_dataset, 
                                              batch_size=batch_size, shuffle=False)

nnet = MLP(1, 1, (50, 50), dropout_fraction=0.0)

MLPEnv = MLPEnvironment(train_dataloader, test_dataloader, nnet, loss_function=custom_loss, 
                        n_epochs=1, log_interval=1)

print(nnet.mlp)
print('Number of weights to train: {:d}'.format(MLPEnv.count_weights()))
print('NOTE: count_weights only counts 2-D weight matrices, so bias weights are not included.')

Then train and plot:

for _ in range(10):
    MLPEnv.train_net(verbose=True)

MLPEnv.plot_performance(logx=True)

Best Answer

How do I learn the variance of my Gaussian model using a neural network?

The problems you outline arise because you're not estimating the variance of the target distribution; you're estimating the variance of the predictions.

You need a network that has two outputs, $\hat y$ and $ \hat \sigma$.

Then use $$ \sum_{i=1}^{m} \left[ \frac{\left(\hat{y}_i - y_i \right)^2}{2 \hat \sigma_i^2} + \log(\hat \sigma_i) \right] $$ as the loss function, taking three arguments: $\hat \sigma_i, \hat y_i, y_i$.
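
In PyTorch, a minimal sketch (using a mean over the batch rather than a sum, which only rescales the loss; the function name is just illustrative):

def gaussian_nll(sigma_hat, y_hat, y):
    """ Negative log-likelihood of a Gaussian with predicted mean and stdev. """
    return torch.mean(
        torch.pow(y_hat - y, 2) / (2 * torch.pow(sigma_hat, 2))
        + torch.log(sigma_hat)
    )

Recent PyTorch releases also provide torch.nn.GaussianNLLLoss, which takes the predicted variance as its third argument, if you'd rather use a built-in.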

(If each $y_i$ is a vector, then you'll generalize to the multivariate Gaussian case, which is straightforward to derive from the multivariate normal pdf.)

You'll either need to enforce that $\hat \sigma_i$ is positive or parameterize your loss (and model) in terms of $\gamma_i = \log \hat \sigma_i$ (which can be either positive or negative).
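
For example, with the log-parameterization the same loss becomes (again a sketch):

def gaussian_nll_logsigma(log_sigma, y_hat, y):
    """ Gaussian NLL parameterized by log(sigma); no positivity constraint needed. """
    return torch.mean(
        0.5 * torch.exp(-2.0 * log_sigma) * torch.pow(y_hat - y, 2)
        + log_sigma
    )

Another common choice is to pass the raw network output through torch.nn.functional.softplus to get a positive $\hat \sigma_i$.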

I just pick a weight and say "this is the variance" and off the MSE optimizer goes to optimize things?

Not a weight, an output of the model. But yeah, one of the outputs is $\hat \sigma_i$ and one of the outputs is $\hat y_i$.

This is supervised learning; how do I pick a target for $\sigma$?

You don't need targets for $\sigma$ because the negative log-likelihood does it all. If $\hat \sigma_i$ gets too large, the $\log \hat \sigma_i$ term grows; if $\hat \sigma_i$ gets too small, the squared-error term blows up. The compromise value that minimizes the loss is what you want, so you just optimize the loss.
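
To tie this back to the code in the question, one option (a sketch that keeps MLPEnvironment's two-argument loss interface) is to give the MLP two output columns and split them inside the loss:

def nll_loss(outputs, targets):
    """ Column 0 of `outputs` is y_hat, column 1 is log(sigma_hat). """
    y_hat = outputs[:, 0:1]
    log_sigma = outputs[:, 1:2]
    return torch.mean(
        0.5 * torch.exp(-2.0 * log_sigma) * torch.pow(y_hat - targets, 2)
        + log_sigma
    )

nnet = MLP(1, 2, (50, 50), dropout_fraction=0.0)  # outputs: [mean, log-stdev]
MLPEnv = MLPEnvironment(train_dataloader, test_dataloader, nnet,
                        loss_function=nll_loss, n_epochs=1, log_interval=1)

After training, torch.exp of the second output column should settle near the $\sigma = 1$ used to generate the data.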

See also: Improvement in NN regressor by Negative Log Likelihood loss vs MSE loss
