Contents

Contents
Barzilai Borwein Method
Gradient Descent with Fixed Step Size
Lipschitz Approximation (Malitsky & Mishchenko)
Weighted Norm Damping Gradient Method (WNGrad)
Nesterov's Accelerated Gradient Descent
Gradient Descent with Diminishing Step Size
Gradient Descent with Backtracking
Gradient Descent with Non-monotone Line Search
- Fixed Step
Gradient Descent with Non-sequential Armijo Line Search
Line search Helper Functions
- Backtracking
- Non-sequential Armijo
Index

Barzilai Borwein Method

OptimizationMethods.barzilai_borwein_gd — Function

barzilai_borwein_gd(optData::BarzilaiBorweinGD{T}, progData::P 
    where P <: AbstractNLPModel{T, S}) where {T,S}

Implements gradient descent with Barzilai-Borwein step size and applies the method to the optimization problem specified by progData.

Reference(s)

Barzilai and Borwein. "Two-Point Step Size Gradient Methods". IMA Journal of Numerical Analysis.

Method

Given iterates $\lbrace x_0,\ldots,x_k\rbrace$, the iterate $x_{k+1}$ is equal to $x_k - \alpha_k \nabla f(x_k)$, where $\alpha_k$ is one of two versions.

Long Step Size Version (if optData.long_stepsize==true)

If $k=0$, then $\alpha_0$ is set to optData.init_stepsize. For $k>0$,

\[\alpha_k = \frac{ \Vert x_k - x_{k-1} \Vert_2^2}{(x_k - x_{k-1})^\intercal (\nabla f(x_k) - \nabla f(x_{k-1}))}.\]

Short Step Size Version (if optData.long_stepsize==false)

If $k=0$, then $\alpha_0$ is set to optData.init_stepsize. For $k>0$,

\[\alpha_k = \frac{(x_k - x_{k-1})^\intercal (\nabla f(x_k) - \nabla f(x_{k-1}))}{\Vert \nabla f(x_k) - \nabla f(x_{k-1})\Vert_2^2}.\]

Arguments

optData::BarzilaiBorweinGD{T}, the specification for the optimization method.
progData<:AbstractNLPModel{T,S}, the specification for the optimization problem.

Warning

progData must have an initialize function that returns subtypes of AbstractPrecompute and AbstractProblemAllocate, where the latter has a grad argument.

OptimizationMethods.BarzilaiBorweinGD — Type

BarzilaiBorweinGD{T} <: AbstractOptimizerData{T}

A structure for storing data about gradient descent using the Barzilai-Borwein step size, and the progress of its application on an optimization problem.

Fields

name:String, name of the solver for reference.
init_stepsize::T, initial step size to start the method.
long_stepsize::Bool, flag for step size; if true, use the long version of Barzilai-Borwein. If false, use the short version.
threshold::T, gradient threshold. If the norm gradient is below this, then iteration stops.
max_iterations::Int64, max number of iterations (gradient steps) taken by the solver.
iter_diff::Vector{T}, a buffer for storing differences between subsequent iterate values that are used for computing the step size
grad_diff::Vector{T}, a buffer for storing differences between gradient values at adjacent iterates, which is used to compute the step size
iter_hist::Vector{Vector{T}}, a history of the iterates. The first entry corresponds to the initial iterate (i.e., at iteration 0). The k+1 entry corresponds to the iterate at iteration k.
grad_val_hist::Vector{T}, a vector for storing max_iterations+1 gradient norm values. The first entry corresponds to iteration 0. The k+1 entry correpsonds to the gradient norm at iteration k.
stop_iteration::Int64, the iteration number that the solver stopped on. The terminal iterate is saved at iter_hist[stop_iteration+1].

Constructors

BarzilaiBorweinGD(::Type{T}; x0::Vector{T}, init_stepsize::T, 
    long_stepsize::Bool, threshold::T, max_iterations::Int) where {T}

Constructs the struct for the Barzilai-Borwein optimization method

Arguments

T::DataType, specific data type used for calculations.

Keyword Arguments

x0::Vector{T}, initial point to start the solver at.
init_stepsize::T, initial step size used for the first iteration.
long_stepsize::Bool, flag for step size; if true, use the long version of Barzilai-Borwein, if false, use the short version.
threshold::T, gradient threshold. If the norm gradient is below this, then iteration is terminated.
max_iterations::Int, max number of iterations (gradient steps) taken by the solver.

Gradient Descent with Fixed Step Size

OptimizationMethods.fixed_step_gd — Function

fixed_step_gd(optData::FixedStepGD{T}, progData<:AbstractNLPModel{T,S})
    where {T,S}

Implements fixed step-size gradient descent for the desired optimization problem specified by progData.

Method

The iterates are updated according to the procedure

\[x_{k+1} = x_k - \alpha ∇f(x_k),\]

where $\alpha$ is the step size, $f$ is the objective function, and $∇f$ is the gradient function of $f$.

Arguments

optData::FixedStepGD{T}, the specification for the optimization method.
progData<:AbstractNLPModel{T,S}, the specification for the optimization problem.

Warning

progData must have an initialize function that returns subtypes of AbstractPrecompute and AbstractProblemAllocate, where the latter has a grad argument.

OptimizationMethods.FixedStepGD — Type

FixedStepGD{T} <: AbstractOptimizerData{T}

A structure for storing data about fixed step-size gradient descent, and the progress of its application on an optimization problem.

Fields

name::String, name of the solver for reference
step_size::T, the step-size selection for the optimization procedure
threshold::T, the threshold on the norm of the gradient to induce stopping
max_iterations::Int, the maximum allowed iterations
iter_hist::Vector{Vector{T}}, a history of the iterates. The first entry corresponds to the initial iterate (i.e., at iteration 0). The k+1 entry corresponds to the iterate at iteration k.
grad_val_hist::Vector{T}, a vector for storing max_iterations+1 gradient norm values. The first entry corresponds to iteration 0. The k+1 entry correpsonds to the gradient norm at iteration k
stop_iteration, iteration number that the algorithm stopped at. Iterate number stop_iteration is produced.

Constructors

FixedStepGD(::Type{T}; x0::Vector{T}, step_size::T, threshold::T, 
    max_iterations::Int) where {T}

Constructs the struct for the optimizer.

Arguments

T::DataType, specific data type for the calculations

Keyword Arguments

x0::Vector{T}, the initial iterate for the optimizers
step_size::T, the step size of the optimizer
threshold::T, the threshold on the norm of the gradient to induce stopping
max_iterations::Int, the maximum number of iterations allowed

Lipschitz Approximation (Malitsky & Mishchenko)

OptimizationMethods.lipschitz_approximation_gd — Function

lipschitz_approximation_gd(optData::FixedStepGD{T}, progData::P where P 
    <: AbstractNLPModel{T, S}) where {T, S}

Implements gradient descent with adaptive step sizes formed through a lipschitz approximation for the desired optimization problem specified by progData.

Warning

This method is designed for convex optimization problems.

References(s)

Malitsky, Y. and Mishchenko, K. (2020). "Adaptive Gradient Descent without Descent." Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:6702-6712.

Method

The iterates are updated according to the procedure,

\[x_{k+1} = x_{k} - \alpha_k \nabla f(x_{k}),\]

where $\alpha_k$ is the step size and $\nabla f$ is the gradient function of the objective function $f$.

The step size is computed depending on $k$. When $k = 0$, the step size is set to optData.init_stepsize. When $k > 0$,

\[\alpha_k = \min\left\lbrace \sqrt{1 + \theta_{k-1}}\alpha_{k-1}, \frac{\Vert x_k - x_{k-1} \Vert}{\Vert \nabla f(x_k) - \nabla f(x_{k-1})\Vert} \right\rbrace,\]

where $\theta_0 = \inf$ and $\theta_k = \alpha_k / \alpha_{k-1}$.

Arguments

optData::LipschitzApproxGD{T}, the specification for the optimization method.
progData<:AbstractNLPModel{T,S}, the specification for the optimization problem.

Warning

progData must have an initialize function that returns subtypes of AbstractPrecompute and AbstractProblemAllocate, where the latter has a grad argument.

OptimizationMethods.LipschitzApproxGD — Type

LipschitzApproxGD{T} <: AbstractOptimizerData{T}

A structure for storing data about adaptive gradient descent using a Lipschitz Approximation scheme (AdGD), and the progress of its application on an optimization problem.

Fields

name::String, name of the solver for reference
init_stepsize::T, the initial step size for the method
prev_stepsize::T, step size used at iter - 1 when iter > 1.
theta::T, element used in the computation of the step size. See the referenced paper for more information.
lipschitz_approximation::T, help the lipschitz approximation used in the computation of the step size. See the referenced paper for more information.
threshold::T, the threshold on the norm of the gradient to induce stopping
max_iterations::Int64, the maximum allowed iterations
iter_diff::Vector{T}, a buffer for storing differences between subsequent iterate values that are used for computing the step size
grad_diff::Vector{T}, a buffer for storing differences between gradient values at adjacent iterates, which is used to compute the step size
iter_hist::Vector{Vector{T}}, a history of the iterates. The first entry corresponds to the initial iterate (i.e., at iteration 0). The k+1 entry corresponds to the iterate at iteration k.
grad_val_hist::Vector{T}, a vector for storing max_iterations+1 gradient norm values. The first entry corresponds to iteration 0. The k+1 entry corresponds to the gradient norm at iteration k
stop_iteration::Int64, iteration number that the algorithm stopped at. Iterate number stop_iteration is produced.

Constructors

LipschitzApproxGD(::Type{T}; x0::Vector{T}, init_stepsize::T, threshold::T, 
    max_iterations::Int) where {T}

Constructs the struct for the optimizer.

Arguments

T::DataType, specific data type for the calculations

Keyword Arguments

x0::Vector{T}, the initial iterate for the optimizers
init_stepsize::T, the initial step size for the method
threshold::T, the threshold on the norm of the gradient to induce stopping
max_iterations::Int, the maximum number of iterations allowed

Weighted Norm Damping Gradient Method (WNGrad)

OptimizationMethods.weighted_norm_damping_gd — Function

weighted_norm_damping_gd(optData::WeightedNormDampingGD{T}, 
    progData::P where P <: AbstractNLPModel{T, S}) where {T, S}

Method that implements gradient descent with weighted norm damping step size using the specifications in optData on the problem specified by progData.

Reference

Wu, Xiaoxia et. al. "WNGrad: Learn the Learning Rate in Gradient Descent". arxiv, https://arxiv.org/abs/1803.02865

Method

Let $\theta_k$ be the $k^{th}$ iterate, and $\alpha_k$ be the $k^{th}$ step size. The optimization method generate iterates following

\[\theta_{k + 1} = \theta_{k} - \alpha_k \nabla f(\theta_k),\]

where $\nabla f$ is the gradient of the objective function $f$.

The step size depends on the iteration number $k$. For $k = 0$, the step size $\alpha_0$ is the reciprocal of optData.init_norm_damping_factor. For $k > 0$, the step size is iteratively updated as

\[\alpha_k = \left[ \frac{1}{\alpha_{k-1}} + \Vert\dot F(\theta_k)\Vert_2^2 \alpha_{k-1} \right]^{-1}.\]

Warning

When $\alpha_0 < \Vert\dot F(\theta_0)\Vert_2^{-1}$ and a globally Lipschitz smooth objective function is used, then the method is guaranteed to find an $\epsilon$-stationary point. It is recommended then that optData.init_norm_damping_factor exceed $\Vert\dot F(\theta_0)\Vert_2$.

Arguments

optData::WeightedNormDampingGD{T}, specification for the optimization algorithm.
progData::P where P <: AbstractNLPModel{T, S}, specification for the problem.

Warning

progData must have an initialize function that returns subtypes of AbstractPrecompute and AbstractProblemAllocate, where the latter has a grad argument.

OptimizationMethods.WeightedNormDampingGD — Type

WeightedNormDampingGD{T} <: AbstractOptimizerData{T}

A mutable struct that represents gradient descent using the weighted-norm damping step size. It stores the specification for the method and records values during iteration.

Fields

name::String, name of the optimizer for recording purposes
init_norm_damping_factor::T, initial damping factor. This value's reciprocal will be the initial step size.
threshold::T, norm gradient tolerance condition. Induces stopping when norm is at most threshold.
max_iterations::Int64, max number of iterates that are produced, not including the initial iterate.
iter_hist::Vector{Vector{T}}, store the iterate sequence as the algorithm progresses. The initial iterate is stored in the first position.
grad_val_hist::Vector{T}, stores the norm gradient values at each iterate. The norm of the gradient evaluated at the initial iterate is stored in the first position.
stop_iteration::Int64, the iteration number the algorithm stopped on. The iterate that induced stopping is saved at iter_hist[stop_iteration + 1].

Constructors

WeightedNormDampingGD(::Type{T}; x0::Vector{T}, init_norm_damping_factor::T, 
    threshold::T, max_iterations::Int64) where {T}

Constructs an instance of type WeightedNormDampingGD{T}.

Arguments

T::DataType, type for data and computation

Keyword Arguments

x0::Vector{T}, initial point to start the optimization routine. Saved in iter_hist[1].
init_norm_damping_factor::T, initial damping factor, which will correspond to the reciprocoal of the initial step size.
threshold::T, norm gradient tolerance condition. Induces stopping when norm at most threshold.
max_iterations::Int64, max number of iterates that are produced, not including the initial iterate.

Nesterov's Accelerated Gradient Descent

OptimizationMethods.nesterov_accelerated_gd — Function

nesterov_accelerated_gd(optData::NesterovAcceleratedGD{T}, progData::P
    where P <: AbstractNLPModel{T, S}) where {T, S}

Implements Nesterov's Accelerated Gradient Descent as specified by optData on the problem specified by progData.

Warning

This algorithm is designed for convex problems.

Reference(s)

See Algorithm 1 of Li et. al. "Convex and Non-convex Optimization Under Generalized Smoothness". arxiv, https://arxiv.org/abs/2306.01264.
See line-search based approach in Nesterov, Yurii. 1983. “A Method for Solving the Convex Programming Problem with Convergence Rate O(1/K^2).” Proceedings of the USSR Academy of Sciences 269:543–47.

Method

Let the objective function be denoted by $F(\theta)$ and $\nabla F( \theta)$ denote its gradient function. Given $\theta_0$ and a step size $\alpha$ (equal to optData.step_size), the method produces five sequences. At $k=0$,

\[\begin{cases} B_0 &= 0 \\ z_0 &= \theta_0 \\ \Delta_0 & = 1 \\ y_0 &= \theta_0; \end{cases}\]

and, for $k\in\mathbb{N}$,

\[\begin{cases} B_{k} &= B_{k-1} + \Delta_{k-1} \\ \theta_k &= y_{k-1} - \alpha \nabla F(y_{k-1}) \\ z_k &= z_{k-1} - \alpha\Delta_{k-1}\nabla F(y_{k-1}) \\ \Delta_k &= \frac{1}{2}\left( 1 + \sqrt{4 B_{k} + 1} \right) \\ y_k &= \theta_k + \frac{\Delta_{k}}{B_k + \Delta_k + \alpha^{-1}} (z_k - \theta_k). \end{cases}\]

The iterate sequence of interest is $\lbrace \theta_k \rbrace$.

Arguments

optData::NesterovAcceleratedGD{T}, the specification for the optimization method.
progData<:AbstractNLPModel{T,S}, the specification for the optimization problem.

Warning

progData must have an initialize function that returns subtypes of AbstractPrecompute and AbstractProblemAllocate, where the latter has a grad argument.

OptimizationMethods.NesterovAcceleratedGD — Type

NesterovAcceleratedGD{T} <: AbstractOptimizerData

A structure that represents Nesterov Accelerated Gradient Descent. Stores variables related to the method, and tracks quantities as the algorithm progresses.

Fields

name:String, name of the solver for reference.
step_size::T, step size used in the method.
z::Vector{T}, buffer array for auxiliary iterate sequence
y::Vector{T}, buffer array for convex combination of iterate and auxiliary sequence
B::T, auxiliary quadratic scaling term for computing acceleration weights and step size.
threshold::T, gradient threshold. If the norm gradient is below this, then iteration stops.
max_iterations::Int64, max number of iterations (gradient steps) taken by the solver.
iter_hist::Vector{Vector{T}}, a history of the iterates. The first entry corresponds to the initial iterate (i.e., at iteration 0). The k+1 entry corresponds to the iterate at iteration k.
grad_val_hist::Vector{T}, a vector for storing max_iterations+1 gradient norm values. The first entry corresponds to iteration 0. The k+1 entry corresponds to the gradient norm at iteration k.
stop_iteration::Int64, the iteration number that the solver stopped on. The terminal iterate is saved at iter_hist[stop_iteration+1].

Constructors

    NesterovAcceleratedGD(::Type{T}; x0::Vector{T}, step_size::T,
        threshold::T, max_iterations::Int64) where {T}

Arguments

T::DataType, specific data type used for calculations.

Keyword Arguments

x0::Vector{T}, initial point to start the solver at.
step_size::T, step size used in the method.
threshold::T, gradient threshold. If the norm gradient is below this, then iteration is terminated.
max_iterations::Int, max number of iterations (gradient steps) taken by the solver.

Gradient Descent with Diminishing Step Size

OptimizationMethods.DiminishingStepGD — Type

DiminishingStepGD{T} <: AbstractOptimizerData{T}

A structure for storing data about gradient descent using diminishing step sizes, and recording the progress of its application on an optimization problem.

Fields

name::String, name of the solver for reference.
step_size_function::Function, step size function. Should take the iteration number and return the step size for that iteration.
step_size_scaling::T, factor that is multipled with the amount of the step size function.
threshold::T, gradient threshold. If the norm gradient is below this, then iteration stops.
max_iterations::Int64, max number of iterations (gradient steps) taken by the solver.
iter_hist::Vector{Vector{T}}, a history of the iterates. The first entry corresponds to the initial iterate (i.e., at iteration 0). The k+1 entry corresponds to the iterate at iteration k.
grad_val_hist::Vector{T}, a vector for storing max_iterations+1 gradient norm values. The first entry corresponds to iteration 0. The k+1 entry correpsonds to the gradient norm at iteration k.
stop_iteration::Int64, the iteration number that the solver stopped on. The terminal iterate is saved at iter_hist[stop_iteration+1].

Warning

step_size_function should take in two arguments, the data type for computation T and the iteration number. For example, calling step_size_function(Float64, 1) should return the step size as a Float64 for the iteration 1.

Constructors

DiminishingStepGD(::Type{T}; x0::Vector{T}, step_size_function::Function,
    step_size_scaling::T, threshold::T, max_iterations::Int)

Constructs the struct for the diminishing step size gradient descent method.

Arguments

T::DataType, specific data type used for calculations.

Keyword Arguments

x0::Vector{T}, initial point to start the solver at.
step_size_function::Function, step size function. Should take the iteration number and return the step size for that iteration.
step_size_scaling::T, factor that is multipled with the amount of the step size function.
threshold::T, gradient threshold. If the norm gradient is below this, then iteration is terminated.
max_iterations::Int, max number of iterations (gradient steps) taken by the solver.

OptimizationMethods.diminishing_step_gd — Function

diminishing_step_gd(optData::DiminishingStepGD{T}, progData::P
    where P <: AbstractNLPModel{T, S}) where {T, S}

Implements gradient descent with diminishing step sizes and applies the method to the optimization problem specified by progData.

Reference(s)

Patel, Vivak, and Albert Berahas. “Gradient Descent in the Absence of Global Lipschitz Continuity of the Gradients.” SIAM 6 (3): 579–846. https://doi.org/10.1137/22M1527210.

Bertsekas, Dimitri. "Nonlinear Programming". 3rd Edition, Athena Scientific, Chapter 1.

Method

Given iterates $\lbrace x_0,\ldots,x_k\rbrace$, the iterate $x_{k + 1}$ is equal to $x_k - \alpha_k \nabla f(x_k)$, where $\alpha_k$ is equal to optData.step_size_scaling * optData.step_size_function(T, k).

Step Size Function

The step size function should satisfy several conditions. First, $\alpha_k > 0$ for all $k$. Second, $\lim_{k \to \infty} \alpha_k = 0.$ Finally, $\sum_{k=0}^{\infty} \alpha_k = \infty.$ See Patel and Berahas (2024) for details.

Arguments

optData::DiminishingStepGD{T}, the specification for the optimization method.
progData<:AbstractNLPModel{T,S}, the specification for the optimization problem.

Warning

progData must have an initialize function that returns subtypes of AbstractPrecompute and AbstractProblemAllocate, where the latter has a grad argument.

Below is a list of step size functions that are in the library. The step size sequence generated by these functions, $\lbrace \alpha_k \rbrace$ satisfies $\alpha_k > 0$, $\lim_{k \to\infty} \alpha_k = 0$ and $\sum_{k=0}^\infty \alpha_k = \infty$.

OptimizationMethods.inverse_k_step_size — Function

inverse_k_step_size(::Type{T}, k::Int64)

Return the inverse of k.

Method

The step size sequence generated for $k \in \mathbb{N}$ is

\[ \alpha_k = \frac{1}{k+1},\]

when using this method.

Arguments

T::Type, data type that the computations are done in.
k::Int64, index of the step size needed.

Returns

Returns a number of type T.

OptimizationMethods.inverse_log2k_step_size — Function

inverse_log2k_step_size(::Type{T}, k::Int64) where {T}

Method

The step size sequence generated when using this method is

\[ \alpha_k = \frac{1}{\lfloor \log_2(k+1) + 1 \rfloor}\]

for $k \in \mathbb{N}$.

Arguments

T::Type, data type that the computation are done in.
k::Int64, index of the step size needed.

Returns

Returns a number of type T.

OptimizationMethods.stepdown_100_step_size — Function

stepdown_100_step_size(::Type{T}, k::Int64) where {T}

Method

The step size sequence generated is

\[ \alpha_k = \frac{1}{2^i}\]

for $k \in [ (2^i-1)100, (2^{i+1}-1)100)$.

Arguments

T::Type, data type that the computation are done in.
k::Int64, index of the step size needed.

Returns

Returns a number of type T.

Gradient Descent with Backtracking

OptimizationMethods.backtracking_gd — Function

backtracking_gd(optData::BacktrackingGD{T},
    progData::P where P <: AbstractNLPModel{T, S}) where {T, S}

Reference(s)

Armijo, Larry. “Minimization of Functions Having Lipschitz Continuous First Partial Derivatives.” Pacific Journal of Mathematics 16.1 (1966): 1–3. Pacific Journal of Mathematics. Web.

Nocedal and Wright. "Numerical Optimization". Springer New York, NY.

Method

Let $\theta_{k-1}$ be the current iterate, and let $\alpha \in \mathbb{R}_{>0}$, $\delta \in (0, 1)$, and $\rho \in (0, 1)$. The $k^{th}$ iterate is generated as $\theta_k = \theta_{k-1} - \delta^t\alpha \dot F(\theta_{k-1})$ where $t + 1 \in \mathbb{N}$ is the smallest such number satisfying

\[ F(\theta_k) \leq F(\theta_{k-1}) - \rho\delta^t\alpha ||\dot F(\theta_{k-1})||_2^2,\]

where $||\cdot||_2$ is the L2-norm.

Note

Theoretically, there exists such a $t$, but it can be made arbitrarily large. Therefore, the line search procedure stops searching after optData.line_search_max_iteration. The current implementation terminates the procedure if the backtracking condition is not satisfied, and returns the previous iterate.

Arguments

optData::BarzilaiBorweinGD{T}, the specification for the optimization method.
progData<:AbstractNLPModel{T,S}, the specification for the optimization problem.

Warning

progData must have an initialize function that returns subtypes of AbstractPrecompute and AbstractProblemAllocate, where the latter has a grad argument.

OptimizationMethods.BacktrackingGD — Type

BacktrackingGD{T} <: AbstractOptimizerData{T}

Mutable sturcture representing gradient descent using backtracking. It also stores and keeps track of values during the optimization routine.

Fields

name:String, name of the solver for reference.
α::T, initial step size used for backtracking.
δ::T, backtracking decreasing factor applied to α when line search criterion is not satisfied.
ρ::T, factor involved in the acceptance criterion in the line search procedure. Larger values correspond to stricter descent conditions, and smaller values correspond to looser descent conditions.
line_search_max_iteration::Int64, maximum allowable iterations for the backtracking procedure.
threshold::T, gradient threshold. If the norm gradient is below this, then iteration stops.
max_iterations::Int64, max number of iterations (gradient steps) taken by the solver.
iter_hist::Vector{Vector{T}}, a history of the iterates. The first entry corresponds to the initial iterate (i.e., at iteration 0). The k+1 entry corresponds to the iterate at iteration k.
grad_val_hist::Vector{T}, a vector for storing max_iterations+1 gradient norm values. The first entry corresponds to iteration 0. The k+1 entry correpsonds to the gradient norm at iteration k.
stop_iteration::Int64, the iteration number that the solver stopped on. The terminal iterate is saved at iter_hist[stop_iteration+1].

Constructors

BacktrackingGD(::Type{T}; x0::Vector{T}, α::T, δ::T, ρ::T,
    line_search_max_iteration::Int64, threshold::T, max_iteration::Int64)
    where {T}

Returns a struct of type BacktrackingGD{T} with all the field initialized to either initialized to the values given, or their default values.

Arguments

T::DataType, specific data type used for calculations.

Keyword Arguments

x0::Vector{T}, initial point to start the solver at.
α::T, initial step size used for backtracking.
δ::T, backtracking decreasing factor applied to α when line search criterion not satisfied.
ρ::T, factor involved in the acceptance criterion in the line search procedure. Larger values correspond to stricter descent conditions, and smaller values correspond to looser descent conditions.
line_search_max_iteration::Int64, maximum allowable iterations for the backtracking procedure.
threshold::T, gradient threshold. If the norm gradient is below this, then iteration is terminated.
max_iterations::Int, max number of iterations (gradient steps) taken by the solver.

Gradient Descent with Non-monotone Line Search

Fixed Step

OptimizationMethods.fixed_step_nls_maxval_gd — Function

fixed_step_nls_maxval_gd(optData::FixedStepNonmonLSMaxValG{T},
    progData::P where P <: AbstractNLPModel{T, S}) where {T, S}

Implementation of gradient descent with non-monotone line search using the maximum value of a fixed number of previous objective values on an optimization problem specified by progData. This implementation initializes the line search procedure with optData.α every iteration.

Reference(s)

Grippo, L., et al. "A Nonmonotone Line Search Technique for Newton's Method." SIAM Journal on Numerical Analysis, vol. 23, no. 4, 1886, pp. 707-16. JSTOR, http://www.jstor.org/stable/2157617.

Method

Let $\theta_{k-1}$ be the current iterate, and let $\alpha \in \mathbb{R}_{>0}$, $\delta \in (0, 1)$, and $\rho \in (0, 1)$. The $k^{th}$ iterate is generated as $\theta_k = \theta_{k-1} - \delta^t\alpha \dot F(\theta_{k-1})$ where $t + 1 \in \mathbb{N}$ is the smallest such number satisfying

\[ F(\theta_k) \leq \max_{\max(0, k-M) \leq j < k} F(\theta_{k-1}) - \rho\delta^t\alpha||\dot F(\theta_{k-1})||_2^2,\]

where $||\cdot||_2$ is the L2-norm, and $M \in \mathbb{N}_{>0}$.

Arguments

optData::FixedStepNonmonLSMaxValG{T}, the specification for the optimization method.
progData<:AbstractNLPModel{T,S}, the specification for the optimization problem.

Warning

progData must have an initialize function that returns subtypes of AbstractPrecompute and AbstractProblemAllocate, where the latter has a grad argument.

OptimizationMethods.FixedStepNonmonLSMaxValG — Type

FixedStepNonmonLSMaxValGD{T} <: AbstractOptimizerData{T}

Mutable struct that represents gradient descent using non-monotone line search where the initial step size for line search is fixed. This struct also keeps track of values during the optimization procedure implemented in fixed_step_nls_maxval_gd.

Fields

name::String, name of the solver for reference.
α::T, the initial step size for line search.
δ::T, backtracking decreasing factor applied to α when the line search criterion is not satisfied
ρ::T, factor involved in the acceptance criterion in the line search procedure. Larger values correspond to stricter descent conditions, and smaller values correspond to looser descent conditions.
window_size::Int64, number of previous objective values that are used to construct the reference objective value for the line search criterion.
line_search_max_iteration::Int64, maximum number of iterations for line search.
objective_hist::CircularVector{T, Vector{T}}, buffer array of size window_size that stores window_size previous objective values.
max_value::T, maximum value of objective_hist. This is the reference objective value used in the line search procedure.
max_index::Int64, index of the maximum value that corresponds to the reference objective value.
threshold::T, gradient threshold. If the norm gradient is below this, then iteration stops.
max_iterations::Int64, max number of iterations (gradient steps) taken by the solver.
iter_hist::Vector{Vector{T}}, a history of the iterates. The first entry corresponds to the initial iterate (i.e., at iteration 0). The k+1 entry corresponds to the iterate at iteration k.
grad_val_hist::Vector{T}, a vector for storing max_iterations+1 gradient norm values. The first entry corresponds to iteration 0. The k+1 entry correpsonds to the gradient norm at iteration k.
stop_iteration::Int64, the iteration number that the solver stopped on. The terminal iterate is saved at iter_hist[stop_iteration+1].

Constructors

FixedStepNonmonLSMaxValG(::Type{T}; x0::Vector{T}, α::T, δ::T, ρ::T,
    window_size::Int64, line_search_max_iteration::Int64,
    threshold::T, max_iterations::Int64) where {T}

Arguments

T::DataType, specific data type used for calculations.

Keyword Arguments

α::T, the initial step size for line search.
δ::T, backtracking decreasing factor applied to α when the line search criterion is not satisfied
ρ::T, factor involved in the acceptance criterion in the line search procedure. Larger values correpsond to stricter descetn conditions, and smaller values correspond to looser descent conditions.
window_size::Int64, number of previous objective values that are used to construct the reference value for the line search criterion.
line_search_max_iteration::Int64, maximum number of iterations for line search.
threshold::T, gradient threshold. If the norm gradient is below this, then iteration stops.
max_iterations::Int64, max number of iterations (gradient steps) taken by the solver.

Gradient Descent with Non-sequential Armijo Line Search

OptimizationMethods.NonsequentialArmijoGD — Type

NonsequentialArmijoGD{T} <: AbstractOptimizerData{T}

A mutable struct that represents gradient descent with non-sequential armijo line search and triggering events. It stores the specification for the method and records values during iteration.

Fields

name::String, name of the optimizer for reference.
∇F_θk::Vector{T}, buffer array for the gradient of the initial inner loop iterate.
norm_∇F_ψ::T, norm of the gradient of the current inner loop iterate.
prev_∇F_ψ::Vector{T}, buffer array for the previous gradient in the inner loop. Necessary for updating the local Lipschitz approximation.
prev_norm_step::T, norm of the step between inner loop iterates. Used for updating the local Lipschitz approximation.
α0k::T, first step size used in the inner loop.
δk::T, scaling factor used to condition the step size.
δ_upper::T, upper limit imposed on the scaling factor when updating.
ρ::T, parameter used in the non-sequential Armijo condition. Larger numbers indicate stricter descent conditions. Smaller numbers indicate less strict descent conditions.
τ_lower::T, lower bound on the gradient interval triggering event.
τ_upper::T, upper bound on the gradient interval triggering event.
local_lipschitz_estimate::T, local Lipshitz approximation.
threshold::T, norm gradient tolerance condition. Induces stopping when norm is at most threshold.
max_iterations::Int64, max number of iterates that are produced, not including the initial iterate.
iter_hist::Vector{Vector{T}}, store the iterate sequence as the algorithm progresses. The initial iterate is stored in the first position.
grad_val_hist::Vector{T}, stores the norm gradient values at each iterate. The norm of the gradient evaluated at the initial iterate is stored in the first position.
stop_iteration::Int64, the iteration number the algorithm stopped on. The iterate that induced stopping is saved at iter_hist[stop_iteration + 1].

Constructors

NonsequentialArmijoGD(::Type{T}; x0::Vector{T}, δ0::T, δ_upper::T, ρ::T,
    threshold::T, max_iterations::Int64) where {T}

Constructs an instance of type NonsequentialArmijoGD{T}.

Arguments

T::DataType, type for data and computation.

Keyword Arguments

x0::Vector{T}, initial point to start the optimization routine. Saved in iter_hist[1].
δ0::T, starting scaling factor.
δ_upper::T, upper limit imposed on the scaling factor when updating.
ρ::T, parameter used in the non-sequential Armijo condition. Larger numbers indicate stricter descent conditions. Smaller numbers indicate less strict descent conditions.
threshold::T, norm gradient tolerance condition. Induces stopping when norm at most threshold.
max_iterations::Int64, max number of iterates that are produced, not including the initial iterate.

OptimizationMethods.nonsequential_armijo_gd — Function

nonsequential_armijo_gd(optData::NonsequentialArmijoGD{T},
    progData::P where P <: AbstractNLPModel{T, S}) where {T, S}

Implementation of gradient descent with non-sequential armijo and triggering events. The optimization algorithm is specified through optData, and applied to the problem progData.

Reference(s)

Method

In what follows, we let $||\cdot||_2$ denote the L2-norm. Let $\theta_{k}$ for $k + 1 \in\mathbb{N}$ be the $k^{th}$ iterate of the optimization algorithm. Let $\delta_{k}, \tau_{\mathrm{grad},\mathrm{lower}}^k, \tau_{\mathrm{grad},\mathrm{upper}}^k$ be the $k^{th}$ parameters for the optimization method. The $k+1^{th}$ iterate and parameters are produced by the following procedure.

Let $\psi_0^k = \theta_k$, and recursively define

\[ \psi_{j}^k = \psi_0^k - \sum_{i = 0}^{j-1} \delta_k \alpha_i^k \dot F(\psi_i^k).\]

Let $j_k \in \mathbb{N}$ be the smallest iteration for which at least one of the conditions are satisfied:

$||\psi_{j_k}^k - \theta_k||_2 > 10$,
$||\dot F(\psi_{j_k}^k)||_2 \not\in (\tau_{\mathrm{grad},\mathrm{lower}}^k, \tau_{\mathrm{grad},\mathrm{upper}}^k)$,
$j_k == 100$.

The next iterate and algorithmic parameters in optData are updated based on the result of the non-sequential Armijo condition

\[ F(\psi_{j_k}^k) \leq F(\theta_k) - \rho\delta_k\alpha_0^k||\dot F(\theta_k)||_2.\]

When this condition is not satisfied, the following quantities are updated.

The iterate $\psi_{j_k}^k$ is rejected, and $\theta_{k+1} = \theta_k$
The scaling factor $\delta_{k+1} = .5\delta_k$

When this condition is satisfied, the following quantities are updated.

The iterate $\psi_{j_k}^k$ is accepted, and $\theta_{k+1} = \psi_{j_k}^k$.
The scaling factor is updated as $\delta_{k+1} = \min(1.5*\delta_k, \bar\delta)$ when $||\dot F(\psi_{j_k}^k)||_2 > \tau_{\mathrm{grad},\mathrm{lower}}^k$, otherwise $\delta_{k+1} = \delta_k$.
If $||\dot F(\psi_{j_k}^k)||_2 \not\in (\tau_{\mathrm{grad},\mathrm{lower}}^k, \tau_{\mathrm{grad},\mathrm{upper}}^k)$, then $\tau_{\mathrm{grad},\mathrm{lower}}^{k+1} = ||\dot F(\psi_{j_k}^k)||_2/\sqrt{2}$ and $\tau_{\mathrm{grad},\mathrm{upper}}^{k+1} = \sqrt{10}||\dot F(\psi_{j_k}^k)||_2$. Otherwise, this parameters are held constant.

Arguments

optData::NesterovAcceleratedGD{T}, the specification for the optimization method.
progData<:AbstractNLPModel{T,S}, the specification for the optimization problem.

Warning

progData must have an initialize function that returns subtypes of AbstractPrecompute and AbstractProblemAllocate, where the latter has a grad argument.

Return

x::S, final iterate of the optimization algorithm.

The method above requires several utility functions. These are listed below.

OptimizationMethods.update_local_lipschitz_approximation — Function

update_local_lipschitz_approximation(j::Int64, k::Int64, djk::S, curr_grad::S,
    prev_grad::S, prev_approximation::T, prev_acceptance::Bool) where {T, S}

Given the previous approximation of the local Lipschitz constant, prev_approximation::T, update the current estimate. That is, return the Lipschitz approximation for inner loop iteration j and outer loop iteration k.

Method

The local Lipschitz approximation method is conducted as follows. Let $j$ be the current inner loop iteration counter, and let $k$ be the current outer loop iteration counter. Let $\psi_j^k$ be the $j^{th}$ iterate of the $k^{th}$ outer loop, and let $\hat L_{j}^k$ be the $j^th$ estimate of the $k^{th}$ outer loop of the local Lipschitz constant. The local estimate is updated according the following five cases.

When $j == 1$ and $k == 1$, this is the first iteration of the first inner loop, and as there is no information available we set it to 1.0.
When $j == 1$ and $k > 1$, this is the first iteration of the $k^{th}$ inner loop, and we return $L_{j_{k-1}}^{k-1}$ which is the local Lipschitz estimates formed using information at the terminal iteration of the $k-1^{th}$ inner loop (i.e., this is the latest estimate).
When $j > 1$ and $k == 1$, this is an inner loop iteration where we have possible taken multiple steps, so we return the most 'local' estimate of the local Lipschitz constant which is $\frac{ ||\dot F(\psi_{j}^k) - \dot F(\psi_{j-1}^k)||_2}{ ||\psi_{j}^k - \psi_{j-1}^k||_2}.$
When $j > 1$ and $k > 1$ and $\psi_{j_{k-1}}^{k-1}$ satisfied the descent condition, then we return $\frac{ ||\dot F(\psi_{j}^k) - \dot F(\psi_{j-1}^k)||_2}{ ||\psi_{j}^k - \psi_{j-1}^k||_2}.$
When $j > 1$ and $k > 1$ and $\psi_{j_{k-1}}^{k-1}$ did not satisfy the descent condition, then we return $\max \left( \frac{||\dot F(\psi_{j}^k) - \dot F(\psi_{j-1}^k)||_2}{ ||\psi_{j}^k - \psi_{j-1}^k||_2}, \hat L_{j-1}^k \right).$

Arguments

j::Int64, inner loop iteration.
k::Int6, outer loop iteration.
norm_djk::T, norm of difference between \psi_j^k and $\psi_{j-1}^k$. On the first iteration of the inner loop this will not matter.
curr_grad::S, gradient at $\psi_j^k$ (i.e., the current iterate).
prev_grad::S, gradient at $\psi_{j-1}^k$ (i.e., the previous iterate).
prev_approximation::T, the local Lipschitz approximation from the previous iteration
prev_acceptance::Bool, whether or not the previous inner loop resulted in an accepted iterate.

Return

estimate::T, estimate of the local Lipschitz constant.

OptimizationMethods.compute_step_size — Function

compute_step_size(τ_lower::T, norm_grad::T, local_lipschitz_estimate::T
    ) where {T}

Computes the step size for the inner loop iterates.

Method

The inner loop iterates are generated according to the formula

\[ \psi_{j+1}^k = \psi_j^k - \delta_k\alpha_j^k \dot F(\psi_j^k).\]

The step size, $\alpha_j^k$ is computed as

\[ \alpha_j^k = \min \left( (\tau_{\mathrm{grad}, \mathrm{lower}}^k)^2 / C_{j,1}^k, 1 / C_{j,2}^k \right)\]

where

\[ C_{j,1}^k = ||\dot F(\psi_j^k)||_2^3 + .5 \hat{L}_j^k ||\dot F(\psi_j^k)||_2^2 + 1e-16\]

and

\[ C_{j,2}^k = ||\dot F(\psi_j^k)||_2 + .5 \hat{L}_j^k + 1e-16.\]

Arguments

τ_lower::T, lower bound on the gradient.
norm_grad::T, norm of the gradient at $\psi_j^k$.
local_lipschitz_estimate::T, local lipschitz approximation at inner loop iteration j and outer loop iteration k.

Return

αjk::T, the step size.

OptimizationMethods.inner_loop! — Function

inner_loop!(ψjk::S, θk::S, optData::NonsequentialArmijoGD{T}, 
    progData::P1 where P1 <: AbstractNLPModel{T, S}, 
    precomp::P2 where P2 <: AbstractPrecompute{T}, 
    store::P3 where P3 <: AbstractProblemAllocate{T}, 
    past_acceptance::Bool, k::Int64; max_iteration = 100) where {T, S}

Conduct the inner loop iteration, modifying ψjk, optData, and store in place. ψjk gets updated to be the terminal iterate of the inner loop; the fields local_lipschitz_estimate, norm_∇F_ψ, and prev_∇F_ψ are updated in optData; the fields grad in store gets updated to be the gradient at ψjk.

Arguments

ψjk::S, buffer array for the inner loop iterates.
θk::S, starting iterate.
optData::NonsequentialArmijoGD{T}, struct that specifies the optimization algorithm. Fields are modified during the inner loop.
progData::P1 where P1 <: AbstractNLPModel{T, S}, struct that specifies the optimization problem. Fields are modified during the inner loop.
precomp::P2 where P2 <: AbstractPrecompute{T}, struct that has precomputed values. Required to take advantage of this during the gradient computation.
store::P3 where P3 <: AbstractProblemAllocate{T}, struct that contains buffer arrays for computation.
past_acceptance::Bool, flag indicating if the previous inner loop resulted in a success (i.e., $F(\theta_k) < F(\theta_{k-1})$).
k::Int64, outer loop iteration for computation of the local Lipschitz approximation scheme.

Optional Keyword Arguments

max_iteration = 100, maximum number of allowable iteration of the inner loop. Should be kept at 100 as that is what is specified in the paper, but is useful to change for testing.

Returns

j::Int64, the iteration for which a triggering event evaluated to true.

OptimizationMethods.update_algorithm_parameters! — Function

update_algorithm_parameters!(θkp1::S, optData::NonsequentialArmijoGD{T},
    achieved_descent::Bool, iter::Int64) where {T, S}

Given that the non-sequential Armijo condition is checked, update the parameters the optimization method. The method updates the following variables in place.

θkp1 is updated to be the next outer loop iterate.
optData has (potentially) the following fields updated: δk, τ_lower, τ_upper.

Arguments

θkp1::S, buffer array for the storage of the next iterate.
optData::NonsequentialArmijoGD{T}, struct that specifies the optimization algorithm.
achieved_descent::Bool, boolean flag indicating whether or not the descent condition was achieved.
iter::Int64, the current iteration of the method. The outer loop iteration. This is requried as it is used to overwrite θkp1 with the previous iterate.

Returns

A boolean flag equal to achieved_descent to indicate whether θkp1 is modified in-place.

Line search Helper Functions

Backtracking

OptimizationMethods.backtracking! — Function

backtracking!(θk::S, θkm1::S, F::Function, gkm1::S, step_direction::S,
    reference_value::T, α::T, δ::T, ρ::T; max_iteration::Int64 = 100
    ) where {T, S}

Implementation of backtracking which modifies θk in place. This method should be used for general step directions. If gkm1 is the step direction use the other backtracking!(...) method.

Reference(s)

Nocedal and Wright. "Numerical Optimization". Springer New York, NY.

Method

Let $\theta_{k-1}$ be the current iterate, and let $\alpha \in \mathbb{R}_{>0}$, $\delta \in (0, 1)$, and $\rho \in (0, 1)$. Let $d_k$ be the step direction, then $\theta_k = \theta_{k-1} - \delta^t\alpha d_k$ where $t + 1 \in \mathbb{N}$ is the smallest such number satisfying

\[ F(\theta_k) \leq O_{k-1} - \rho\delta^t\alpha \dot F(\theta_{k-1})^\intercal d_k,\]

where $O_{k-1}$ is some reference value.

Arguments

θk::S, buffer array for the next iterate.
θkm1::S, current iterate the optimization algorithm.
F::Function, objective function. Should take in a single argument and return the value of the objective at the input value.
gkm1::S, gradient value at θkm1.
step_direction::S, direction to move θkm1 to form θk.
reference_value::T, value to check the objective function at θk against.
α::T, initial step size.
δ::T, backtracking decrease factor.
ρ::T, factor involved in the acceptance criterion in θk. Larger values correspond to stricter descent conditions, and smaller values correspond to looser descent conditions on θk.

Optional Keyword Arguments

max_iteration::Int64 = 100, the maximum allowable iterations for the line search procedure. In the language above, when $t$ is equal to max_iteration the algorithm will terminate.

Return

backtracking_condition_satisfied::Bool, whether the backtracking condition is satisfied before the max iteration limit.

backtracking!(θk::S, θkm1::S, F::Function, gkm1::S, norm_gkm1_squared::T,
    reference_value::T, α::T, δ::T, ρ::T; max_iteration::Int64 = 100)
    where {T, S}

Implementation of backtracking which modifies θk in place. This method assumes that the step direction is -gkm1.

Reference(s)

Armijo, Larry. “Minimization of Functions Having Lipschitz Continuous First Partial Derivatives.” Pacific Journal of Mathematics 16.1 (1966): 1–3. Pacific Journal of Mathematics. Web.

Nocedal and Wright. "Numerical Optimization". Springer New York, NY.

Method

Let $\theta_{k-1}$ be the current iterate, and let $\alpha \in \mathbb{R}_{>0}$, $\delta \in (0, 1)$, and $\rho \in (0, 1)$. Let $\dot F(\theta_{k-1})$ be the step direction, then $\theta_k = \theta_{k-1} - \delta^t\alpha \dot F(\theta_{k-1})$ where $t + 1 \in \mathbb{N}$ is the smallest such number satisfying

\[ F(\theta_k) \leq O_{k-1} - \rho\delta^t\alpha ||\dot F(\theta_{k-1})||_2^2,\]

where $O_{k-1}$ is some reference value, and $||\cdot||_2$ is the L2-norm.

Arguments

θk::S, buffer array for the next iterate.
θkm1::S, current iterate the optimization algorithm.
F::Function, objective function. Should take in a single argument and return the value of the objective at the input value.
gkm1::S, gradient value at θkm1.
norm_gkm1_squared::T, norm squared for the value of gkm1.
reference_value::T, value to check the objective function at θk against.
α::T, initial step size.
δ::T, backtracking decrease factor.
ρ::T, factor involved in the acceptance criterion in θk. Larger values correspond to stricter descent conditions, and smaller values correspond to looser descent conditions on θk.

Optional Keyword Arguments

max_iteration::Int64 = 100, the maximum allowable iterations for the line search procedure. In the language above, when $t$ is equal to max_iteration the algorithm will terminate.

Return

backtracking_condition_satisfied::Bool, whether the backtracking condition is satisfied before the max iteration limit.

Non-sequential Armijo

OptimizationMethods.non_sequential_armijo_condition — Function

non_sequential_armijo_condition(F_ψjk::T, reference_value::T,
    norm_grad_θk::T, ρ::T, δk::T, α0k::T) where {T}

Check if F_ψjk satisfies the non-sequential armijo condition with respect to reference_value and the remaining parameters. Returns a boolean value indicating if the descent condition is satisfied or not.

Method

Let $F(\theta) : \mathbb{R}^n \to \mathbb{R}$ be the objective function. Suppose that $\theta_k \in \mathbb{R}^n$ is an iterate of an optimization routine. Let $\psi_0^k = \theta_k$ and for $j \in \lbrace 1,...,T \rbrace$ for some $T \in \mathbb{N}$, recursively define

\[ \psi_j^k = \psi_0^k - \sum_{t = 0}^{j-1} \delta_k \alpha_t^k \dot F(\psi_t^k).\]

Then, the (monotone) non-sequential armijo condition requires that

\[ F(\psi_j^k) < F(\psi_0^k) - \rho \delta_k \alpha_0^k ||\dot F(\psi_0^k)||_2^2,\]

where $||\cdot||_2$ is the L2-norm and $\rho \in (0, 1)$.

This function implements checking the inequality, where F_ψjk corresponds to $F(\psi_j^k)$, reference_value corresponds to $F(\psi_0^k)$, norm_grad_θk to $||\dot F(\psi_0^k)||_2$, ρ to $\rho$, δk to $\delta_k$, and α0k to $\alpha_0^k$. To see more about how this method is used read the documentation for gradient descent with non-sequential armijo

Arguments

F_ψjk::T, numeric value on the LHS of the inequality. In optimization context, this is the objective value of a trial iterate to check if sufficient descent is achieved.
reference_value::T, numeric value on the RHS of the inequality. In the optimization context, the value of the current iterate F_ψjk must be smaller than this to guarantee a sufficient descent criterion.
norm_grad_θk::T, numeric value forming the amount of descent that needs to be achieved. This value is usually the norm of the gradient of a previous iterate.
ρ::T, parameter in the line search criterion dictating how much descent should be required. Should be positive. Larger values indicate stricter conditions, and lower value indicate looser conditions.
δk::T, numeric value that corresponds to a scaling factor for the step size.
α0k::T, numeric value. In the context of non-sequential armijo gradient descent this is the first step size used in an inner loop.

Return

flag::Bool, true if the descent condition is satisfied, and false if the descent condition is not satisfied.

Index