# Running MadNLP on the GPU
MadNLP supports the solution of large-scale optimization problems on the GPU, with significant speedups reported on some instances. In this tutorial, we show how to solve a nonlinear program on the GPU with ExaModels and MadNLP.
## Generic principles
MadNLP has been designed to run entirely on the GPU, without data transfer between the host and the device. If the model is well specified, offloading the solution to an NVIDIA GPU is seamless, using:
- ExaModels to evaluate the nonlinear model and its derivatives on the GPU;
- NVIDIA cuDSS to solve the linear KKT systems on the GPU;
- KernelAbstractions to run MadNLP's internal computations on the GPU.
We import the aforementioned packages as:
```julia
using ExaModels
using MadNLPGPU
using CUDA
```

Unlike ExaModels, MadNLP does not yet support solving sparse optimization problems on AMD GPUs.
## Evaluating the model on the GPU with ExaModels
The first step is to implement the model with ExaModels. As a demonstration, we implement the model `airport` from the CUTEst benchmark. The code reads:
```julia
function airport_model(T, backend)
    N = 42
    # Data
    r = T[0.09, 0.3, 0.09, 0.45, 0.5, 0.04, 0.1, 0.02, 0.02, 0.07, 0.4, 0.045, 0.05, 0.056, 0.36, 0.08, 0.07, 0.36, 0.67, 0.38, 0.37, 0.05, 0.4, 0.66, 0.05, 0.07, 0.08, 0.3, 0.31, 0.49, 0.09, 0.46, 0.12, 0.07, 0.07, 0.09, 0.05, 0.13, 0.16, 0.46, 0.25, 0.1]
    cx = T[-6.3, -7.8, -9.0, -7.2, -5.7, -1.9, -3.5, -0.5, 1.4, 4.0, 2.1, 5.5, 5.7, 5.7, 3.8, 5.3, 4.7, 3.3, 0.0, -1.0, -0.4, 4.2, 3.2, 1.7, 3.3, 2.0, 0.7, 0.1, -0.1, -3.5, -4.0, -2.7, -0.5, -2.9, -1.2, -0.4, -0.1, -1.0, -1.7, -2.1, -1.8, 0.0]
    cy = T[8.0, 5.1, 2.0, 2.6, 5.5, 7.1, 5.9, 6.6, 6.1, 5.6, 4.9, 4.7, 4.3, 3.6, 4.1, 3.0, 2.4, 3.0, 4.7, 3.4, 2.3, 1.5, 0.5, -1.7, -2.0, -3.1, -3.5, -2.4, -1.3, 0.0, -1.7, -2.1, -0.4, -2.9, -3.4, -4.3, -5.2, -6.5, -7.5, -6.4, -5.1, 0.0]
    # Wrap all data in a single iterator for ExaModels
    data = [(i, cx[i], cy[i], r[i]) for i in 1:N]
    IJ = [(i, j) for i in 1:N-1 for j in i+1:N]
    # Write model using ExaModels
    core = ExaModels.ExaCore(T; backend=backend)
    x = ExaModels.variable(core, 1:N, lvar=-10.0, uvar=10.0)
    y = ExaModels.variable(core, 1:N, lvar=-10.0, uvar=10.0)
    ExaModels.objective(
        core,
        ((x[i] - x[j])^2 + (y[i] - y[j])^2) for (i, j) in IJ
    )
    ExaModels.constraint(core, (x[i] - dcx)^2 + (y[i] - dcy)^2 - dr for (i, dcx, dcy, dr) in data; lcon=-Inf)
    return ExaModels.ExaModel(core)
end
```

```
airport_model (generic function with 1 method)
```

The first argument `T` specifies the numerical precision (`Float32`, `Float64` or any `AbstractFloat`), whereas the second argument `backend` sets the device used to evaluate the model. We instantiate the model on the GPU with:
```julia
nlp = airport_model(Float64, CUDABackend())
```

```
An ExaModel{Float64, CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}, ...}

  Problem name: Generic
   All variables: ████████████████████ 84     All constraints: ████████████████████ 42
            free: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                 free: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
           lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
           upper: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                upper: ████████████████████ 42
         low/upp: ████████████████████ 84              low/upp: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
           fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
          infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0               infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
            nnzh: (-47.06% sparsity)   5250            linear: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
                                                    nonlinear: ████████████████████ 42
            nnzj: ( 97.62% sparsity)   84
        lin_nnzj: (------% sparsity)
        nln_nnzj: ( 97.62% sparsity)   84
```
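The same constructor can also target other precisions or devices. A minimal sketch (assuming ExaModels selects its default CPU backend when `backend=nothing`):

```julia
# Illustrative variants of the constructor defined above:
nlp_f32 = airport_model(Float32, CUDABackend())  # single precision on the GPU
nlp_cpu = airport_model(Float64, nothing)        # assumed default CPU backend of ExaCore
```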
By passing a `CUDABackend()`, we ensure that all the attributes in `nlp` are instantiated on the GPU. For example, the initial point becomes a `CuVector`:
```julia
using NLPModels
NLPModels.get_x0(nlp)
```

```
84-element CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
```

If your problem is implemented with JuMP in a variable `model`, ExaModels can load it directly on the GPU with `nlp = ExaModel(model; backend=CUDABackend())`.
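For illustration, a hedged sketch of this JuMP path (the small model below is a made-up example, not part of the tutorial):

```julia
using JuMP

# Build a small illustrative JuMP model
model = Model()
@variable(model, -10 <= x[1:10] <= 10)
@objective(model, Min, sum((x[i] - x[i+1])^2 for i in 1:9))
@constraint(model, [i in 1:10], x[i]^2 <= 4)

# Load it on the GPU through ExaModels, as described above
nlp_jump = ExaModel(model; backend=CUDABackend())
```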
## Solving the problem on the GPU with MadNLP
Once the model `nlp` is loaded on the GPU, you can solve it with the function `madnlp`:
```julia
results = madnlp(nlp; linear_solver=CUDSSSolver)
nothing
```

```
This is MadNLP version v0.8.12, running with cuDSS v0.7.1
Number of nonzeros in constraint Jacobian............: 84
Number of nonzeros in Lagrangian Hessian.............: 5250
Total number of variables............................: 84
variables with only lower bounds: 0
variables with lower and upper bounds: 84
variables with only upper bounds: 0
Total number of equality constraints.................: 0
Total number of inequality constraints...............: 42
inequality constraints with only lower bounds: 0
inequality constraints with lower and upper bounds: 0
inequality constraints with only upper bounds: 42
iter objective inf_pr inf_du inf_compl lg(mu) lg(rg) alpha_pr ir ls
0 0.0000000e+00 1.04e+02 1.00e+00 1.00e+01 -1.0 - 0.00e+00 0 0
1 1.4808378e+01 1.00e+02 2.64e+01 6.04e+00 -1.0 - 3.37e-02 1 1h
2 1.3660555e+04 2.50e+01 1.79e+02 1.87e+00 -1.0 - 1.00e+00 1 1h
3 1.5757708e+04 2.11e+01 1.55e+02 8.19e-01 -1.0 - 1.66e-01 1 1h
4 2.2700976e+04 1.19e+01 1.31e+02 3.46e-01 -1.0 - 4.94e-01 1 1h
5 3.6245598e+04 2.96e+00 2.13e+02 2.84e-01 -1.0 - 1.00e+00 1 1h
6 4.0274013e+04 1.53e+00 1.64e+02 1.19e-01 -1.0 - 5.59e-01 1 1h
7 4.5453607e+04 3.61e-01 1.98e+02 1.16e-01 -1.0 - 1.00e+00 1 1h
8 4.7405169e+04 7.22e-02 1.00e+02 1.00e-01 -1.0 - 1.00e+00 1 1h
9 4.7892154e+04 8.03e-03 3.84e+01 9.05e-02 -1.0 - 1.00e+00 1 1h
iter objective inf_pr inf_du inf_compl lg(mu) lg(rg) alpha_pr ir ls
10 4.7949829e+04 3.07e-04 6.16e+00 7.82e-02 -1.0 - 1.00e+00 1 1h
11 4.7951497e+04 1.23e-06 5.90e-02 7.81e-02 -1.0 - 1.00e+00 1 1h
12 4.7947424e+04 3.39e-06 1.21e-03 2.85e-03 -2.5 - 1.00e+00 1 1f
13 4.7947303e+04 4.19e-09 1.24e-06 7.80e-06 -5.0 - 1.00e+00 1 1h
Number of Iterations....: 13
(scaled) (unscaled)
Objective...............: 4.7947302735661709e+04 4.7947302735661709e+04
Dual infeasibility......: 1.2358625808371812e-06 1.2358625808371812e-06
Constraint violation....: 4.1921440756156718e-09 4.1921440756156718e-09
Complementarity.........: 7.8004282261755751e-06 7.8004282261755751e-06
Overall NLP error.......: 7.8004282261755751e-06 7.8004282261755751e-06
Number of objective function evaluations = 14
Number of objective gradient evaluations = 14
Number of constraint evaluations = 14
Number of constraint Jacobian evaluations = 14
Number of Lagrangian Hessian evaluations = 13
Total wall secs in initialization = 40.453
Total wall secs in linear solver = 4.273
Total wall secs in NLP function evaluations = 1.879
Total wall secs in solver (w/o init./fun./lin. alg.) = 59.726
Total wall secs = 106.331
EXIT: Optimal Solution Found (tol = 1.0e-04).
```

When solving an optimization problem on the GPU, MadNLP applies several automatic modifications. In order:
- It increases the parameter `bound_relax_factor` to `1e-4`.
- It relaxes each equality constraint $g(x) = 0$ into a pair of inequality constraints $-\tau \leq g(x) \leq \tau$, with $\tau$ equal to `bound_relax_factor`.
- It reduces the KKT system to a sparse condensed system, exploiting the fact that the relaxed problem has only (potentially tight) inequality constraints. Up to a given primal regularization, the resulting KKT system is positive definite and can be factorized using a pivoting-free factorization routine (e.g. a Cholesky or LDL' decomposition). This is the so-called Lifted-KKT formulation documented in this article.
- The condensed KKT system increases the ill-conditioning inherent to the interior-point method, amplifying the loss of accuracy when the iterates get close to a local solution. As a result, the termination tolerance `tol` is also relaxed to `1e-4` to recover convergence (see the sketch after this list).
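As a hedged sketch, these automatic settings roughly correspond to the following explicit options; `MadNLP.SparseCondensedKKTSystem` and `MadNLP.RelaxEquality` are assumed option names, not taken from this tutorial:

```julia
# Assumed explicit equivalent of the automatic GPU settings
# (option names are assumptions; MadNLP normally applies these for you):
results = madnlp(
    nlp;
    tol=1e-4,
    bound_relax_factor=1e-4,
    kkt_system=MadNLP.SparseCondensedKKTSystem,
    equality_treatment=MadNLP.RelaxEquality,
    linear_solver=CUDSSSolver,
)
```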
As a result of these modifications, the convergence on the GPU can differ significantly from what we observe on the CPU. In particular, relaxing the parameter `bound_relax_factor` has a non-marginal impact on the geometry of the feasible set. You can limit the loss of accuracy by passing the relaxation and tolerance parameters explicitly to MadNLP:
```julia
results = madnlp(
    nlp;
    tol=1e-7,
    bound_relax_factor=1e-7,
    linear_solver=CUDSSSolver,
)
nothing
```

```
This is MadNLP version v0.8.12, running with cuDSS v0.7.1
Number of nonzeros in constraint Jacobian............: 84
Number of nonzeros in Lagrangian Hessian.............: 5250
Total number of variables............................: 84
variables with only lower bounds: 0
variables with lower and upper bounds: 84
variables with only upper bounds: 0
Total number of equality constraints.................: 0
Total number of inequality constraints...............: 42
inequality constraints with only lower bounds: 0
inequality constraints with lower and upper bounds: 0
inequality constraints with only upper bounds: 42
iter objective inf_pr inf_du inf_compl lg(mu) lg(rg) alpha_pr ir ls
0 0.0000000e+00 1.04e+02 1.00e+00 1.00e+01 -1.0 - 0.00e+00 0 0
1 1.4806534e+01 1.00e+02 2.64e+01 6.04e+00 -1.0 - 3.37e-02 1 1h
2 1.3660710e+04 2.50e+01 1.79e+02 1.87e+00 -1.0 - 1.00e+00 1 1h
3 1.5757916e+04 2.11e+01 1.55e+02 8.19e-01 -1.0 - 1.66e-01 1 1h
4 2.2713696e+04 1.19e+01 1.31e+02 3.46e-01 -1.0 - 4.95e-01 1 1h
5 3.6254087e+04 2.96e+00 2.13e+02 2.84e-01 -1.0 - 1.00e+00 1 1h
6 4.0283958e+04 1.53e+00 1.64e+02 1.19e-01 -1.0 - 5.60e-01 1 1h
7 4.5459842e+04 3.60e-01 1.98e+02 1.16e-01 -1.0 - 1.00e+00 1 1h
8 4.7410250e+04 7.21e-02 1.00e+02 1.00e-01 -1.0 - 1.00e+00 1 1h
9 4.7897346e+04 8.02e-03 3.86e+01 9.02e-02 -1.0 - 1.00e+00 1 1h
iter objective inf_pr inf_du inf_compl lg(mu) lg(rg) alpha_pr ir ls
10 4.7955202e+04 3.12e-04 6.26e+00 7.81e-02 -1.0 - 1.00e+00 1 1h
11 4.7956891e+04 1.28e-06 6.14e-02 7.79e-02 -1.0 - 1.00e+00 1 1h
12 4.7952818e+04 3.39e-06 1.21e-03 2.85e-03 -2.5 - 1.00e+00 1 1f
13 4.7952696e+04 4.21e-09 1.24e-06 2.16e-06 -5.7 - 1.00e+00 1 1h
14 4.7952696e+04 2.16e-15 6.89e-13 7.07e-09 -8.0 - 1.00e+00 2 1h
Number of Iterations....: 14
(scaled) (unscaled)
Objective...............: 4.7952696276023358e+04 4.7952696276023358e+04
Dual infeasibility......: 6.8916601564313557e-13 6.8916601564313557e-13
Constraint violation....: 2.1550052366575115e-15 2.1550052366575115e-15
Complementarity.........: 7.0655894247843478e-09 7.0655894247843478e-09
Overall NLP error.......: 7.0655894247843478e-09 7.0655894247843478e-09
Number of objective function evaluations = 15
Number of objective gradient evaluations = 15
Number of constraint evaluations = 15
Number of constraint Jacobian evaluations = 15
Number of Lagrangian Hessian evaluations = 14
Total wall secs in initialization = 0.027
Total wall secs in linear solver = 0.028
Total wall secs in NLP function evaluations = 0.009
Total wall secs in solver (w/o init./fun./lin. alg.) = 0.056
Total wall secs = 0.119
EXIT: Optimal Solution Found (tol = 1.0e-07).
```

Decreasing the tolerance `tol` too much is likely to cause numerical issues inside the algorithm. In general, we recommend keeping the value of `bound_relax_factor` below `tol`.
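Once the solver returns, the solution can be queried from the returned object. A minimal sketch, assuming the usual fields of MadNLP's `MadNLPExecutionStats`:

```julia
# Inspect the returned statistics (field names assumed from MadNLPExecutionStats):
results.status     # termination status of the solver
results.objective  # objective value at the solution
results.solution   # primal solution, kept on the device
results.iter       # number of interior-point iterations
```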
## Solving the problem on the GPU with HyKKT
Some applications require accurate solutions. In that case, we recommend using the extension HybridKKT.jl, which implements the Golub & Greif augmented Lagrangian formulation detailed in this article. Compared to Lifted-KKT, the Hybrid-KKT strategy is more accurate (it does not relax the equality constraints in the problem) but slightly slower (it computes the descent direction using a conjugate gradient method at every IPM iteration).
Once the HybridKKT package is installed, the solution proceeds as:
```julia
using HybridKKT

results = madnlp(
    nlp;
    linear_solver=MadNLPGPU.CUDSSSolver,
    kkt_system=HybridKKT.HybridCondensedKKTSystem,
    equality_treatment=MadNLP.EnforceEquality,
    fixed_variable_treatment=MadNLP.MakeParameter,
)
nothing
```
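Since Hybrid-KKT does not relax the equality constraints, it can typically be run at a tighter tolerance than Lifted-KKT. A sketch, with an illustrative tolerance value:

```julia
# Tighter tolerance with Hybrid-KKT (the tol value here is illustrative)
results = madnlp(
    nlp;
    tol=1e-8,
    linear_solver=MadNLPGPU.CUDSSSolver,
    kkt_system=HybridKKT.HybridCondensedKKTSystem,
    equality_treatment=MadNLP.EnforceEquality,
    fixed_variable_treatment=MadNLP.MakeParameter,
)
nothing
```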