Three types of gradient descent

Reference – Notes from this blog: https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/

The three types are: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

Batch gradient descent

During one epoch, evaluate the error for each sample in the training set, but update the model only after all errors have been evaluated.

Pros:

  • Calculation of the prediction errors and the model update are separated, so the algorithm can use parallel-processing implementations.

Cons:

  • Requires the entire training dataset to be in memory and available to the algorithm
  • The more stable error gradient may result in premature convergence of the model to a less optimal set of parameters.

Algo:

  • model.initialize()
  • for i in n_epochs:
    • training_data.shuffle()
    • X, Y = split(training_data)
    • error_sum = 0
    • for x, y in zip(X, Y):
      • y_pred = model(x)
      • error = get_error(y_pred, y)
      • error_sum += error
    • model.update(error_sum)
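As a concrete illustration of the pseudocode above, here is a minimal NumPy sketch for a linear model trained with mean squared error. The function name, the data X and y, the learning rate lr and n_epochs are illustrative assumptions, not something from the original blog post.

import numpy as np

def batch_gradient_descent(X, y, lr=0.1, n_epochs=100):
    w = np.zeros(X.shape[1])          # model.initialize()
    for epoch in range(n_epochs):
        y_pred = X @ w                # predictions for the whole training set
        error = y_pred - y            # error for every sample
        grad = X.T @ error / len(y)   # gradient accumulated over all samples
        w -= lr * grad                # one model update per epoch
    return w

# usage with synthetic data: y = 2 + 3*x plus noise
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.random(100)]
y = X @ np.array([2.0, 3.0]) + 0.05 * rng.standard_normal(100)
print(batch_gradient_descent(X, y))

Note that shuffling is omitted here: with a full-batch update the gradient does not depend on sample order.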

Stochastic gradient descent

During one epoch, evaluate the error for one sample at a time, then update the model immediately after that evaluation.

Pros:

  • immediate feedback on model performance and improvement rate
  • simple to implement
  • frequent updates, so faster learning
  • The noisy update process can allow the model to avoid local minima (e.g. premature convergence).

Cons:

  • frequent updates are computationally expensive
  • the updates add noise to the gradient signal, causing the parameters to jump around
  • that noise makes it hard for the model to settle on an error minimum

Algo:

  • model.initialize()
  • for i in n_epochs:
    • training_data.shuffle()
    • X, Y = split(training_data)
    • for x, y in zip(X, Y):
      • y_pred = model(x)
      • error = get_error(y_pred, y)
      • model.update(error)
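Again a minimal NumPy sketch under the same assumptions (linear model, mean squared error, illustrative names); the only change from the batch version is that the weights are updated after every single sample.

import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, n_epochs=50):
    w = np.zeros(X.shape[1])                        # model.initialize()
    for epoch in range(n_epochs):
        for i in np.random.permutation(len(y)):     # shuffle the sample order
            y_pred = X[i] @ w                       # prediction for a single sample
            error = y_pred - y[i]
            w -= lr * error * X[i]                  # update immediately after each sample
    return w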

 

Mini batch gradient descent

During one epoch, split the data into batches (which adds a batch size parameter). For each batch, evaluate the error for each sample, then update the model after all samples in that batch have been evaluated. Repeat for every batch, and repeat for every epoch.

Pros:

  • more robust convergence, compared to stochastic gradient descent
  • more frequent updates than batch gradient descent, so faster learning
  • efficiency: no need to have all training data in memory

Cons:

  • requires configuring an additional batch size hyperparameter

Algo:

  • model.initialize()
  • for i in n_epochs:
    • training_data.shuffle()
    • batches = training_data.split(batch_size)
    • for batch in batches:
      • X, Y = split(batch)
      • error_sum = 0
      • for x, y in zip(X, Y):
        • y_pred = model(x)
        • error = get_error(y_pred, y)
        • error_sum += error
      • model.update(error_sum)
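And a matching NumPy sketch under the same assumptions: the shuffled data is cut into batches of batch_size samples and the weights are updated once per batch. The batch_size default is an arbitrary illustrative choice.

import numpy as np

def minibatch_gradient_descent(X, y, lr=0.05, n_epochs=50, batch_size=16):
    w = np.zeros(X.shape[1])                          # model.initialize()
    for epoch in range(n_epochs):
        idx = np.random.permutation(len(y))           # shuffle the training data
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]     # indices of one mini-batch
            y_pred = X[batch] @ w                     # predictions for the batch
            error = y_pred - y[batch]
            grad = X[batch].T @ error / len(batch)    # gradient over the batch only
            w -= lr * grad                            # one update per batch
    return w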

 

 
