Solved – Convergence issue with SGD for function approximation using polynomial linear regression

convergence, gradient descent, machine learning, regression, stochastic gradient descent

I was trying to learn a sine curve

$$ f_{target}(x) = \sin(2 \pi f_s x) $$

with $f_s = 4$, from 10 points, with linear regression and a polynomial of degree 9:

$$f_{model}(x) = \langle w, \Phi(x) \rangle = \sum^9_{i=0} w_i x^i$$

with stochastic gradient descent (SGD), or plain GD. I was doing this to play around and get some intuition about early stopping as a regularizer. However, the model doesn't seem to train at all: the SGD solution's error is nowhere near the least squares solution's. I can't figure out why, but at this point I am fairly confident it is not a programming issue, because I've implemented SGD in both pytorch and tensorflow and both get stuck in the same way.

Therefore, I am assuming I must be doing something else wrong, probably on the statistical/optimization side. I have tried the following to debug the issue:

  • chose the number of terms to match the number of data points, so that I know there is a unique minimizer (the problem is convex since I am minimizing $\frac{1}{N_{train}}\| Kw - Y \|^2_{2}$)
  • decreased the number of data points to see if there is any setting where the loss decreases; it does reach order $10^{-6}$, which makes me think nothing is wrong with the code
  • tried tensorflow and pytorch; both fail on the same example
  • computed the unique minimizer with linear algebra tools, so I know for sure empirically (i.e. by computing it) that the minimizer exists
  • printed various debugging quantities as I train, like the size of the gradients, checks that the parameters are being updated, etc.
  • played around with the batch size, the step size $\eta$ and the number of iterations
  • visualized the solutions: plotted the solution obtained by SGD vs the linear algebra one, and they don't match at all
  • tried different initializations (which shouldn't matter since the problem is convex)
  • compared the linear algebra solution vector with the one obtained via SGD
  • briefly tried changing the interval where the learning is done, without much success…

At this point I find it quite mysterious why it wouldn't work when the problem is supposed to be so simple. I should be able to fit the data exactly, but I can't. The problem is convex, so I find it extremely weird.
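To make the "it should be easy" intuition concrete, here is a minimal sanity check (a sketch, assuming the same 10 equispaced points on $[0,1]$ and degree-9 monomial features as in the code below): compute the exact minimizer with lstsq and look at the condition number of the design matrix. Convexity guarantees a minimizer exists, but it says nothing about how fast a first-order method reaches it.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# same setup as the script below: 10 equispaced points on [0, 1]
N_train, Degree_mdl, freq_sin = 10, 9, 4
X_train = np.linspace(0, 1, N_train).reshape(N_train, 1)
Y_train = np.sin(2*np.pi*freq_sin*X_train)

# degree-9 monomial features: columns 1, x, x^2, ..., x^9
K = PolynomialFeatures(degree=Degree_mdl).fit_transform(X_train)

# the convex problem has a minimizer, and lstsq finds it directly...
c_lstsq, *_ = np.linalg.lstsq(K, Y_train, rcond=None)
print('train MSE of lstsq solution:', np.mean((K @ c_lstsq - Y_train)**2))

# ...but the speed of (S)GD is governed by the conditioning of K
print('condition number of K:', np.linalg.cond(K))

If the condition number comes out huge (and for raw monomials on $[0,1]$ it does), gradient descent creeps along the flat directions of the bowl no matter how convex the problem is.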

I will paste the code; it should be completely self-contained and runnable (I omitted the tensorflow code for simplicity, since it already looks more complicated than it should…):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

import matplotlib.pyplot as plt

import torch
from torch.autograd import Variable

def index_batch(X,batch_indices,dtype):
    '''
    returns the batch indexed/sliced batch
    '''
    if len(X.shape) == 1: # i.e. dimension (M,) just a vector
        batch_xs = torch.FloatTensor(X[batch_indices]).type(dtype)
    else:
        batch_xs = torch.FloatTensor(X[batch_indices,:]).type(dtype)
    return batch_xs

def get_batch2(X,Y,M,dtype):
    '''
    get batch for pytorch model
    '''
    # TODO fix and make it nicer, there is pytorch forum question
    X,Y = X.data.numpy(), Y.data.numpy()
    N = len(Y)
    valid_indices = np.array( range(N) )
    batch_indices = np.random.choice(valid_indices,size=M,replace=False)
    batch_xs = index_batch(X,batch_indices,dtype)
    batch_ys = index_batch(Y,batch_indices,dtype)
    return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)

def get_sequential_lifted_mdl(nb_monomials,D_out, bias=False):
    return torch.nn.Sequential(torch.nn.Linear(nb_monomials,D_out,bias=bias))

def train_SGD(mdl, M,eta,nb_iter,logging_freq ,dtype, X_train,Y_train, X_test,Y_test,c_pinv):
    ##
    N_train,_ = tuple( X_train.size() )
    #print(N_train)
    for i in range(nb_iter):
        for W in mdl.parameters():
            W_before_update = np.copy( W.data.numpy() )
        # Forward pass: compute predicted Y using operations on Variables
        batch_xs, batch_ys = get_batch2(X_train,Y_train,M,dtype) # [M, D], [M, 1]
        ## FORWARD PASS
        y_pred = mdl.forward(batch_xs)
        ## LOSS + Regularization
        batch_loss = (1/M)*(y_pred - batch_ys).pow(2).sum()
        ## BACKWARD PASS
        batch_loss.backward() # Use autograd to compute the backward pass. Now w will have gradients
        ## SGD update
        for W in mdl.parameters():
            delta = eta*W.grad.data
            #W.data.copy_(W.data - delta)
            W.data -= delta
        ## train stats
        if i % (nb_iter/50) == 0 or i == 0:
        #if True:
        #if i % logging_freq == 0 or i == 0:
            current_train_loss = (1/N_train)*(mdl.forward(X_train) - Y_train).pow(2).sum().data.numpy()
            print('\n-------------')
            print(f'i = {i}, current_train_loss = {current_train_loss}')
            print(f'N_train = {N_train}')
            print(f'W_before_update={W_before_update}')
            print(f'W.data = {W.data.numpy()}')
            print(f'W.grad.data = {W.grad.data.numpy()}')
            diff = W_before_update - W.data.numpy()
            print(f' w_^(t) - w^(t-1) = {diff/eta}')
            diff_norm = np.linalg.norm(diff, 2)
            print(f'|| w_^(t) - w^(t-1) ||^2 = {diff_norm}')
            print(f'c_pinv = {c_pinv.T}')
            train_error_c_pinv = (1/N_train)*(np.linalg.norm(Y_train.data.numpy() - np.dot(X_train.data.numpy(),c_pinv) )**2)
            print(f'train_error_c_pinv = {train_error_c_pinv}')
        ## Manually zero the gradients after updating weights
        mdl.zero_grad()
##
logging_freq = 100
dtype = torch.FloatTensor
## SGD params
M = 5
eta = 0.03
nb_iter = 100*1000
##
lb,ub=0,1
freq_sin = 4
f_target = lambda x: np.sin(2*np.pi*freq_sin*x).reshape(x.shape[0],1)
N_train = 10
X_train = np.linspace(lb,ub,N_train).reshape(N_train,1)
Y_train = f_target(X_train)
N_test = 200
X_test = np.linspace(lb,ub,N_test).reshape(N_test,1)
Y_test = f_target(X_test)
## degree of mdl
Degree_mdl = 9
## pseudo-inverse solution
c_pinv = np.polyfit( X_train.reshape( (N_train,) ), Y_train , Degree_mdl )[::-1]
## linear mdl to train with SGD
nb_terms = c_pinv.shape[0]
mdl_sgd = get_sequential_lifted_mdl(nb_monomials=nb_terms,D_out=1, bias=False)
#mdl_sgd[0].weight.data.normal_(mean=0,std=0.0)
#mdl_sgd[0].weight.data.fill_(0)
print(f'mdl_sgd[0].weight.data={mdl_sgd[0].weight.data}')
## Make polynomial Kernel
poly_feat = PolynomialFeatures(degree=Degree_mdl)
Kern_train, Kern_test = poly_feat.fit_transform(X_train.reshape(N_train,1)), poly_feat.fit_transform(X_test.reshape(N_test,1))
Kern_train_pt, Y_train_pt = Variable(torch.FloatTensor(Kern_train).type(dtype), requires_grad=False), Variable(torch.FloatTensor(Y_train).type(dtype), requires_grad=False)
Kern_test_pt, Y_test_pt = Variable(torch.FloatTensor(Kern_test).type(dtype), requires_grad=False ), Variable(torch.FloatTensor(Y_test).type(dtype), requires_grad=False)
train_SGD(mdl_sgd, M,eta,nb_iter,logging_freq ,dtype, Kern_train_pt,Y_train_pt, Kern_test_pt,Y_test_pt,c_pinv)
##
legend_mdl = f'SGD solution standard parametrization, number of monomials={nb_terms}, batch-size={M}, iterations={nb_iter}, step size={eta}'
#### PLOTS
x_horizontal = np.linspace(lb,ub,1000).reshape(1000,1) # plotting grid (needed below, same as in the later script)
X_plot = poly_feat.fit_transform(x_horizontal)
X_plot_pytorch = Variable( torch.FloatTensor(X_plot), requires_grad=False)
##
fig1 = plt.figure()
##
p_sgd_tf, = plt.plot(x_horizontal, Y_tf ) # Y_tf is produced by the omitted tensorflow code
p_sgd_pt, = plt.plot(x_horizontal, [ float(f_val) for f_val in mdl_sgd.forward(X_plot_pytorch).data.numpy() ])
p_pinv, = plt.plot(x_horizontal, np.dot(X_plot,c_pinv))
p_data, = plt.plot(X_train,Y_train,'ro')
## legend
nb_terms = c_pinv.shape[0]
legend_mdl = f'SGD solution standard parametrization, number of monomials={nb_terms}, batch-size={M}, iterations={nb_iter}, step size={eta}'
plt.legend(
        [p_sgd_tf,p_sgd_pt,p_pinv,p_data],
        ['TF '+legend_mdl,'Pytorch '+legend_mdl,f'linear algebra soln, number of monomials={nb_terms}',f'data points = {N_train}']
    )
##
plt.xlabel('x'), plt.ylabel('f(x)')
plt.show()

I am not 100% sure what is wrong, but something I did find worrying is the size of the coefficients of the linear algebra solution:

    c_pinv = [[ -7.36275143e-11   9.94955061e+02  -2.27235773e+04   2.02776690e+05
   -9.45987901e+05   2.56477290e+06  -4.18670905e+06   4.05381875e+06
   -2.14321212e+06   4.76269361e+05]]

It makes me suspect that the fact that some are really large while others are really small might be a problem… but I would have expected SGD with a sufficiently small step size to work if run long enough. I don't know for sure what's wrong.
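One way to quantify that worry (a sketch, rebuilding the same kernel matrix as the script above): the Hessian of $\frac{1}{N}\|Kw - Y\|^2$ is $H = \frac{2}{N}K^T K$, plain gradient descent is only stable for $\eta < 2/\lambda_{max}(H)$, and the slowest direction contracts by roughly a factor $(1 - \eta\lambda_{min}(H))$ per step, so the number of iterations needed scales with the condition number of $H$:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# rebuild the same design matrix as the script above (assumed setup)
N_train, Degree_mdl = 10, 9
X_train = np.linspace(0, 1, N_train).reshape(N_train, 1)
K = PolynomialFeatures(degree=Degree_mdl).fit_transform(X_train)

# Hessian of (1/N)||Kw - Y||^2
H = (2.0/N_train) * K.T @ K
eigs = np.linalg.eigvalsh(H)
lam_min, lam_max = eigs[0], eigs[-1]

print('largest stable step size ~', 2.0/lam_max)
print('condition number of H    ~', lam_max/lam_min)

# with eta at (say) 1/lam_max, shrinking the error along the flattest
# direction by a factor of 1000 takes about log(1000)/(eta*lam_min) steps,
# i.e. on the order of the condition number of H
print('iterations for a 1000x decay ~', np.log(1e3)/((1.0/lam_max)*lam_min))

So "sufficiently small step size, run long enough" is technically true, but the number of iterations it implies here is astronomically larger than the $10^5$ iterations the script actually runs.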

Does anyone know how to make it work? Is there something trivially obvious I am doing wrong? This problem seems so simple that it's quite puzzling that it's not working.


Not sure if this is useful, but this is some of what gets printed to my console while debugging:

mdl_sgd[0].weight.data=
 0.2769  0.2238 -0.1786 -0.2836  0.0282 -0.2650  0.1517  0.0609 -0.1799  0.2518
[torch.FloatTensor of size 1x10]


-------------
i = 0, current_train_loss = [ 0.51122922]
N_train = 10
W_before_update=[[ 0.276916    0.22384584 -0.17859279 -0.28359878  0.02818507 -0.26502955
   0.15169969  0.06087267 -0.17991513  0.25179213]]
W.data = [[ 0.27278039  0.2223435  -0.17868967 -0.28320512  0.02860935 -0.26476261
   0.15175563  0.06072243 -0.18024531  0.251313  ]]
W.grad.data = [[ 0.13785343  0.05007789  0.00322947 -0.01312203 -0.01414278 -0.00889825
  -0.00186479  0.00500792  0.01100619  0.01597152]]
 w_^(t) - w^(t-1) = [[ 0.13785362  0.05007784  0.00322958 -0.01312196 -0.01414275 -0.00889798
  -0.00186463  0.00500791  0.011006    0.01597106]]
|| w_^(t) - w^(t-1) ||^2 = 0.004487781319767237
c_pinv = [[ -7.36275143e-11   9.94955061e+02  -2.27235773e+04   2.02776690e+05
   -9.45987901e+05   2.56477290e+06  -4.18670905e+06   4.05381875e+06
   -2.14321212e+06   4.76269361e+05]]
train_error_c_pinv = 0.00041026620352414134

-------------
i = 2000, current_train_loss = [ 0.45121056]
N_train = 10
W_before_update=[[ 0.05377455  0.14968246 -0.0918882  -0.18873887  0.0875883  -0.24442779
   0.14100061  0.02913089 -0.22231367  0.20818822]]
W.data = [[ 0.02684817  0.13449876 -0.10165974 -0.19549945  0.08267717 -0.24813652
   0.13810736  0.02681178 -0.22421438  0.20660225]]
W.grad.data = [[ 0.89754611  0.50612354  0.32571793  0.2253527   0.16370434  0.12362462
   0.0964416   0.07730356  0.06335653  0.05286586]]
 w_^(t) - w^(t-1) = [[ 0.89754611  0.50612342  0.32571805  0.22535275  0.16370441  0.1236245
   0.09644181  0.07730357  0.06335676  0.05286584]]
|| w_^(t) - w^(t-1) ||^2 = 0.03397814929485321
c_pinv = [[ -7.36275143e-11   9.94955061e+02  -2.27235773e+04   2.02776690e+05
   -9.45987901e+05   2.56477290e+06  -4.18670905e+06   4.05381875e+06
   -2.14321212e+06   4.76269361e+05]]
train_error_c_pinv = 0.00041026620352414134

I've done more extensive testing, and it seems that for any data set of size 10 or 30 we can't really approximate the function with gradient descent. Is this what is supposed to happen?

It seems really odd to me. Based on the intuition from the Nyquist-Shannon sampling theorem, getting more data points should make the task easier, not harder. I know the dictionary/basis used for Nyquist is different (i.e. the dictionary is sinusoids); however, polynomials are not that far from them, especially on a small interval. Or at least that's my intuition. It seems odd that more data points make the problem harder, even though we can just choose the number of features equal to the number of data points and have a totally well-defined convex problem.
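That intuition is easy to test directly (a sketch, assuming the degree stays at 9 and the points stay equispaced on $[0,1]$): as $N$ grows, $\frac{1}{N}K^T K$ converges to the matrix of moments $\int_0^1 x^{i+j}\,dx = \frac{1}{i+j+1}$, which is the $10 \times 10$ Hilbert matrix, a classic example of an ill-conditioned matrix. So adding samples does not repair the conditioning of the monomial basis:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# conditioning of the (scaled) Gram matrix as the number of samples grows
for N in [10, 30, 100, 1000]:
    X = np.linspace(0, 1, N).reshape(N, 1)
    K = PolynomialFeatures(degree=9).fit_transform(X)
    print(N, np.linalg.cond((1.0/N) * K.T @ K))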


New attempt:

I had time to try Hermite polynomials, but it didn't change anything as far as I can tell. I changed the step size all over the place, but now it either explodes to NaN easily or it still doesn't train. Not sure what to do anymore…

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from numpy.polynomial.hermite import hermvander

import torch
from torch.autograd import Variable

import matplotlib.pyplot as plt

from maps import NamedDict

from plotting_utils import *

def index_batch(X,batch_indices,dtype):
    '''
    returns the batch indexed/sliced batch
    '''
    if len(X.shape) == 1: # i.e. dimension (M,) just a vector
        batch_xs = torch.FloatTensor(X[batch_indices]).type(dtype)
    else:
        batch_xs = torch.FloatTensor(X[batch_indices,:]).type(dtype)
    return batch_xs

def get_batch2(X,Y,M,dtype):
    '''
    get batch for pytorch model
    '''
    # TODO fix and make it nicer, there is pytorch forum question
    X,Y = X.data.numpy(), Y.data.numpy()
    N = len(Y)
    valid_indices = np.array( range(N) )
    batch_indices = np.random.choice(valid_indices,size=M,replace=False)
    batch_xs = index_batch(X,batch_indices,dtype)
    batch_ys = index_batch(Y,batch_indices,dtype)
    return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)

def get_sequential_lifted_mdl(nb_monomials,D_out, bias=False):
    return torch.nn.Sequential(torch.nn.Linear(nb_monomials,D_out,bias=bias))

def train_SGD(mdl, M,eta,nb_iter,logging_freq ,dtype, X_train,Y_train):
    ##
    N_train,_ = tuple( X_train.size() )
    #print(N_train)
    for i in range(nb_iter):
        # Forward pass: compute predicted Y using operations on Variables
        batch_xs, batch_ys = get_batch2(X_train,Y_train,M,dtype) # [M, D], [M, 1]
        ## FORWARD PASS
        y_pred = mdl.forward(batch_xs)
        ## LOSS + Regularization
        batch_loss = (1/M)*(y_pred - batch_ys).pow(2).sum()
        ## BACKWARD PASS
        batch_loss.backward() # Use autograd to compute the backward pass. Now w will have gradients
        ## SGD update
        for W in mdl.parameters():
            delta = eta*W.grad.data
            W.data.copy_(W.data - delta)
        ## train stats
        if i % (nb_iter/10) == 0 or i == 0:
            current_train_loss = (1/N_train)*(mdl.forward(X_train) - Y_train).pow(2).sum().data.numpy()
            print('\n-------------')
            print(f'i = {i}, current_train_loss = {current_train_loss}\n')
            print(f'eta*W.grad.data = {eta*W.grad.data}')
            print(f'W.grad.data = {W.grad.data}')
        ## Manually zero the gradients after updating weights
        mdl.zero_grad()
##
logging_freq = 100
dtype = torch.FloatTensor
## SGD params
M = 3
eta = 0.002
nb_iter = 20*1000
##
lb,ub = 0,1
f_target = lambda x: np.sin(2*np.pi*x)
N_train = 5
X_train = np.linspace(lb,ub,N_train)
Y_train = f_target(X_train).reshape(N_train,1) # keep Y as a column vector so the loss doesn't broadcast to an (M,M) matrix
## degree of mdl
Degree_mdl = 4
## pseudo-inverse solution
c_pinv = np.polyfit( X_train, Y_train , Degree_mdl )[::-1]
## linear mdl to train with SGD
nb_terms = c_pinv.shape[0]
mdl_sgd = get_sequential_lifted_mdl(nb_monomials=nb_terms,D_out=1, bias=False)
mdl_sgd[0].weight.data.normal_(mean=0,std=0.001)
## Make polynomial Kernel
#poly_feat = PolynomialFeatures(degree=Degree_mdl)
#Kern_train = poly_feat.fit_transform(X_train.reshape(N_train,1))
Kern_train = hermvander(X_train,Degree_mdl)
Kern_train = Kern_train.reshape(N_train,Kern_train.shape[1])
Kern_train_pt, Y_train_pt = Variable(torch.FloatTensor(Kern_train).type(dtype), requires_grad=False), Variable(torch.FloatTensor(Y_train).type(dtype), requires_grad=False)
train_SGD(mdl_sgd, M,eta,nb_iter,logging_freq ,dtype, Kern_train_pt,Y_train_pt)

#### PLOTTING
x_horizontal = np.linspace(lb,ub,1000).reshape(1000,1)
#X_plot = poly_feat.fit_transform(x_horizontal)
X_plot = hermvander(x_horizontal,Degree_mdl)
X_plot = X_plot.reshape(1000,X_plot.shape[2])
X_plot_pytorch = Variable( torch.FloatTensor(X_plot), requires_grad=False)
##
fig1 = plt.figure()
#plots objs
p_sgd, = plt.plot(x_horizontal, [ float(f_val) for f_val in mdl_sgd.forward(X_plot_pytorch).data.numpy() ])
p_pinv, = plt.plot(x_horizontal, np.dot(X_plot,c_pinv))
p_data, = plt.plot(X_train,Y_train,'ro')
## legend
nb_terms = c_pinv.shape[0]
legend_mdl = f'SGD solution standard parametrization, number of monomials={nb_terms}, batch-size={M}, iterations={nb_iter}, step size={eta}'
plt.legend(
        [p_sgd,p_pinv,p_data],
        [legend_mdl,f'linear algebra soln, number of monomials={nb_terms}',f'data points = {N_train}']
    )
##
plt.xlabel('x'), plt.ylabel('f(x)')
plt.show()

The main issue seems to be that I can't get the SGD solution's error to match the linear algebra error (via the inverse or pseudo-inverse) with either parametrization of the data matrix $X$ (standard polynomials or Hermite polynomials).
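For what it's worth, the conditioning of the two design matrices in the 5-point, degree-4 setup above can be compared directly (a quick sketch) to check whether switching to hermvander actually changed anything on these particular points:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from numpy.polynomial.hermite import hermvander

# the training inputs from the second script: 5 equispaced points on [0, 1]
X_train = np.linspace(0, 1, 5)

K_poly = PolynomialFeatures(degree=4).fit_transform(X_train.reshape(-1, 1))
K_herm = hermvander(X_train, 4)

print('cond(monomial design matrix) =', np.linalg.cond(K_poly))
print('cond(Hermite  design matrix) =', np.linalg.cond(K_herm))

Hermite polynomials are orthogonal with respect to a Gaussian weight on the whole real line, not with respect to a handful of equispaced points on $[0,1]$, so there is no guarantee the columns of hermvander are any better behaved here.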

e.g.

With Hermite:

-----------------
train_error_pinv = 6.006056840733974e-09
final_sgd_error = [ nan]

With standard:

-----------------
train_error_pinv = 5.205485509746132e-20
final_sgd_error = [ 4.50123644]

Best Answer

Just so you can move beyond your doubts about my comments on the OP, here is the code I originally used, which led to my comments. This should show definitively that, yes, it works with Hermite polynomials, and therefore the problem has to do with the design matrix. This is stochastic gradient descent with a batch size of 2, compared with the least squares fit. You can also check for yourself that if you comment out the Hermite basis and uncomment the basis you are using, SGD no longer works.

set.seed(1234)

N <- 10
x <- seq(from = 0, to = 1, length = N)
mu <- sin(2 * pi * x * 4)
y <- mu
plot(x,y)

X <- cbind(1, poly(x = x, degree = 9))  # intercept plus degree-9 polynomial basis
# X <- sapply(0:9, function(i) x^i)     # the raw monomial basis from the question
w <- rnorm(10)

learning_rate <- function(t) .1 / t^(.6)  # decaying step size

n_samp <- 2
for(t in 1:100000) {
  mu_hat <- X %*% w
  idx <- sample(1:N, n_samp)
  X_batch <- X[idx,]
  y_batch <- y[idx]
  score_vec <- t(X_batch) %*% (y_batch - X_batch %*% w)  # negative gradient of the squared error on the batch

  change <- score_vec * learning_rate(t)
  w <- w + change
}

plot(mu_hat, ylim = c(-1, 1))
lines(mu)
fit_exact <- predict(lm(y ~ X - 1))
lines(fit_exact, col = 'red')
abs(w - coef(lm(y ~ X - 1)))

If you want to apply this fix in general, you can replace your design matrix with the Q matrix from the QR, but this is probably not realistic for problems large enough that you would want to use SGD in the first place.
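A minimal Python sketch of that fix, in the question's setup (10 points, degree-9 monomial features; plain GD instead of minibatches for brevity): train on the Q factor of the kernel matrix, then map back to the original basis by solving $Rc = w$, since $Kc = QRc = Qw$.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# same data and features as the question
N, degree, freq_sin = 10, 9, 4
x = np.linspace(0, 1, N)
y = np.sin(2*np.pi*freq_sin*x).reshape(N, 1)
K = PolynomialFeatures(degree=degree).fit_transform(x.reshape(-1, 1))

# replace the design matrix with Q from its (reduced) QR decomposition
Q, R = np.linalg.qr(K)

# plain gradient descent on (1/N)||Qw - y||^2, now a well-conditioned problem
eta, w = 0.5, np.zeros((Q.shape[1], 1))
for t in range(10000):
    w -= eta * (2.0/N) * Q.T @ (Q @ w - y)

print('GD train MSE in the Q basis:', np.mean((Q @ w - y)**2))

# recover the coefficients in the original monomial basis: R c = w
c = np.linalg.solve(R, w)
print('monomial coefficients:', c.ravel())

Since the minimizer is unique here, the recovered c should essentially reproduce the question's c_pinv, up to numerical error.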

Hopefully that is helpful, but I'm not going to debug your code.