Gradient Descent – Solve Local Minima Problem Through Averaging in Neural Networks

boostinggradient descentmachine learningneural networksr

I am using neural networks (created in R: neuralnet) to predict county-level food insecurity. I want to use Olden's connection weight approach to analyze relative variable importance (NeuralNetTools::olden). However, the relative variable importance metrics vary quite wildly from trial to trial, due to the existence of local minima. Therefore, I am solving the problem through averaging.

In other words, I am creating the network 1000 times, randomizing the train/test splits each time, and calculating the olden metrics as well as MSE on each trial, then averaging them. I was wondering if this was a suitable approach, or if there is more justification to arrive at relative variable importance through another method, such as choosing the variable importance metrics from the single trial with the lowest MSE. For reference, I am justifying my chosen method with the following quotation from the literature:

"The first approach involves combining different local minima…by averaging the outputs of networks using the connection weights corresponding to different local minima…This approach would also presumably involve averaging the contributions of input variables"

Here is my code for reference. This code only creates the neural network 10 times, I don't want to waste any of your afternoons. RVI stands for relative variable importance.

library(neuralnet)
library(NeuralNetTools)

sctg_1 <- c(1493906.5, 222814.9, 2019421.3, 761410.3, 547919.1, 0, 13971201, 2413181, 50061.9, 3735876.2,      
            8741.4, 123914.7, 132759.9, 2742649.7, 4842.245, 5112489.5, 
            93346.04, 13110.432, 120822.6, 222947.3)

sctg_2 <- c(13034.8, 362991.4688, 0, 2559556.7, 1215809.62, 0, 0, 17308.4,
            53385.5, 125.8, 319.08, 0, 0, 98113.4, 0, 348682,
            679107.7, 0, 0, 0)

sctg_3 <- c(0, 834788.5, 0, 0, 1198, 0, 0, 3864978.8, 2292815, 19045.8,
            1539491.9, 0, 0, 3030195.8, 3180861.2, 114802.9, 5114067.5,
            0, 1135.6, 0)

sctg_4 <- c(13878012, 319332, 503917, 0, 39, 1861756, 2112385,  
            2968683, 8250070, 2964929, 2174953, 1218664, 0, 13522784,
            7391358, 9564713, 825841, 3948627, 7043228, 0)

sctg_5 <- c(23008, 198340, 911600, 17691, 216618, 1688342,
            9555588, 415008, 2633059, 780138, 317738, 1248618,
            1197830, 0, 0, 467849, 1506839, 0,1561499, 1272532)

sctg_6 <- c(0, 49421, 0, 0, 133588, 0, 0, 2496, 0, 0, 41474, 0, 0, 0,
            0, 0, 0, 0, 0, 0)

sctg_7 <- c(5010, 241192, 237927, 0, 895213, 370, 458, 2484810, 542846,   
            22578, 1692851, 6212, 0, 0, 3712074, 872365, 1186191, 1, 0, 7119802)

foodinsecurity <- c(0.157, 0.134, 0.207, 0.163, 0.145, 0.157, 0.165, 0.172,
                    0.154, 0.161, 0.172, 0.189, 0.186, 0.165, 0.182, 0.157,
                    0.165, 0.160, 0.150, 0.181)

sctg_data <- data.frame(sctg_1, sctg_2, sctg_3, sctg_4, sctg_5, sctg_6, sctg_7, foodinsecurity)

form1<- function(){
  index <- sample(1:nrow(sctg_data),round(0.75*nrow(sctg_data)))
  train <- sctg_data[index,]
  test <- sctg_data[-index,]
  maxs <- apply(sctg_data, 2, max) 
  mins <- apply(sctg_data, 2, min)
  scaled <- as.data.frame(scale(sctg_data, center = mins, scale = maxs - mins))
  train_ <- scaled[index,]
  test_ <- scaled[-index,]
  n <- names(train_)
  f1 <- as.formula(paste("foodinsecurity ~", paste(n[!n %in% "foodinsecurity"], collapse = " + ")))
  nn1 <- neuralnet(f1,data=train_, stepmax = 1e07,   hidden=c(4),linear.output=T)
  pr.nn <- neuralnet::compute(nn1,test_[,1:7])
  pr.nn_ <- pr.nn$net.result*(max(sctg_data$foodinsecurity)-min(sctg_data$foodinsecurity))+min(sctg_data$foodinsecurity)
  test.r <- (test_$foodinsecurity)*(max(sctg_data$foodinsecurity)-min(sctg_data$foodinsecurity))+min(sctg_data$foodinsecurity)
  MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
  MSE <- as.numeric(MSE.nn)
  RVI <- unlist(olden(nn1, bar_plot = F))
  RVI <- unname(RVI)
  
  Return <- c(MSE, RVI)
  Return 
  
}

df1 <- plyr::rdply(
  
  10,
  form1,
  .progress = "text"
  
)

m1 <- data.frame(df1$V1, df1$V2, df1$V3, df1$V4, df1$V5, df1$V6, df1$V7, df1$V8)
colnames(m1) <- c("MSE", "sctg_1", "sctg_2", "sctg_3", "sctg_4", "sctg_5", "sctg_6", "sctg_7")
means1 <- data.frame(mean(m1$sctg_1), mean(m1$sctg_2), mean(m1$sctg_3), mean(m1$sctg_4), mean(m1$sctg_5), mean(m1$sctg_6),mean(m1$sctg_7))
colnames(means1) <- c("mean_sctg_1", "mean_sctg_2", "mean_sctg_3", "mean_sctg_4", "mean_sctg_5", "mean_sctg_6", "mean_sctg_7")
best1 <- m1[which.min(m1$MSE),]
means1
best1
```

Best Answer

Taking random subsets of the data, training a model on each, and averaging their predictions is not a new idea. This is what random forest does, it is a very popular approach in machine learning in general. This however is usually about averaging the predictions of the models rather than the parameters.

You can also average the parameters, but this would not work in all the cases. It doesn't sound like a great idea for neural networks. Take as an example a convolutional neural network for visual images, because they are great for illustrating things. Early layers of such network (as for other networks) work as feature detectors. In the example below from Krizhevsky et al, each neuron (kernel) in the layer detects a different edge or texture.

Many gray boxes with textures and blurry shapes. Each box is the weight of a CNN kernel. Each is a feature detector for a different feature (diagonal edge, two edges, horizontal edge, checkers pattern, etc).

Say that you trained this network multiple times, on different random subsets of the data. Let's say that all the networks have learned exactly the same features (unlikely to happen, they would rather be similar but not the same), but by chance, they are just randomly shuffled for each model (due to data shuffling, weights initialization). If you averaged those kernels, all the boxes on the image below would be uniformly gray, they wouldn't do anything. This is because neurons of the neural networks are redundant.

Averaging the parameters could work if you averaged the parameters that have the same function in the model. For example, if you have a linear regression model

$$ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon $$

you could average $\beta_0,\beta_1,\beta_2$ parameters from different models using the same $X_1,X_2$ features.