Running MadNLP on the GPU

MadNLP supports the solution of large-scale optimization problems on the GPU, with significant speedups reported on some instances. In this tutorial, we show how to solve a nonlinear program on an NVIDIA GPU. We start by importing MadNLPGPU and CUDA:

using MadNLPGPU
using CUDA
Precompiling packages...
  14040.0 ms  ✓ MadNLPGPU
  1 dependency successfully precompiled in 16 seconds. 123 already precompiled.

How to solve a problem on the GPU with ExaModels?

MadNLP has been designed to run entirely on the GPU, without data exchange between the host and the device. We recommend implementing the model with the modeler ExaModels whenever possible.

We import ExaModels as:

using ExaModels
Info

MadNLP does not yet support solving sparse optimization problems on AMD GPUs.

Evaluating the model on the GPU with ExaModels

The first step is to implement the model with ExaModels.

As a demonstration, we implement the model airport from the CUTEst benchmark using ExaModels. The code reads:

function airport_model(T, backend)
    N = 42
    # Data
    r = T[0.09 , 0.3, 0.09, 0.45, 0.5, 0.04, 0.1, 0.02, 0.02, 0.07, 0.4, 0.045, 0.05, 0.056, 0.36, 0.08, 0.07, 0.36, 0.67, 0.38, 0.37, 0.05, 0.4, 0.66, 0.05, 0.07, 0.08, 0.3, 0.31, 0.49, 0.09, 0.46, 0.12, 0.07, 0.07, 0.09, 0.05, 0.13, 0.16, 0.46, 0.25, 0.1]
    cx = T[-6.3, -7.8, -9.0, -7.2, -5.7, -1.9, -3.5, -0.5, 1.4, 4.0, 2.1, 5.5, 5.7, 5.7, 3.8, 5.3, 4.7, 3.3, 0.0, -1.0, -0.4, 4.2, 3.2, 1.7, 3.3, 2.0, 0.7, 0.1, -0.1, -3.5, -4.0, -2.7, -0.5, -2.9, -1.2, -0.4, -0.1, -1.0, -1.7, -2.1, -1.8, 0.0]
    cy = T[8.0, 5.1, 2.0, 2.6, 5.5, 7.1, 5.9, 6.6, 6.1, 5.6, 4.9, 4.7, 4.3, 3.6, 4.1, 3.0, 2.4, 3.0, 4.7, 3.4, 2.3, 1.5, 0.5, -1.7, -2.0, -3.1, -3.5, -2.4, -1.3, 0.0, -1.7, -2.1, -0.4, -2.9, -3.4, -4.3, -5.2, -6.5, -7.5, -6.4, -5.1, 0.0]
    # Wrap all data in a single iterator for ExaModels
    data = [(i, cx[i], cy[i], r[i]) for i in 1:N]
    IJ = [(i, j) for i in 1:N-1 for j in i+1:N]
    # Write model using ExaModels
    core = ExaModels.ExaCore(T; backend=backend)
    x = ExaModels.variable(core, 1:N, lvar = -10.0, uvar=10.0)
    y = ExaModels.variable(core, 1:N, lvar = -10.0, uvar=10.0)
    ExaModels.objective(
        core,
        ((x[i] - x[j])^2 + (y[i] - y[j])^2) for (i, j) in IJ
    )
    ExaModels.constraint(core, (x[i]-dcx)^2 + (y[i] - dcy)^2 - dr for (i, dcx, dcy, dr) in data; lcon=-Inf)
    return ExaModels.ExaModel(core)
end
airport_model (generic function with 1 method)

The first argument T specifies the numerical precision (Float32, Float64, or any AbstractFloat), whereas the second argument backend sets the device used to evaluate the model. We instantiate the model on the GPU with:

nlp = airport_model(Float64, CUDABackend())
An ExaModel{Float64, CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}, ...}

  Problem name: Generic
   All variables: ████████████████████ 84     All constraints: ████████████████████ 42    
            free: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                 free: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           upper: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                upper: ████████████████████ 42    
         low/upp: ████████████████████ 84             low/upp: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
          infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0               infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
            nnzh: (-47.06% sparsity)   5250            linear: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
                                                    nonlinear: ████████████████████ 42    
                                                         nnzj: ( 97.62% sparsity)   84    
                                                     lin_nnzj: (------% sparsity)         
                                                     nln_nnzj: ( 97.62% sparsity)   84    

By passing a CUDABackend(), we ensure that all the attributes in nlp are instantiated on the GPU. For example, the initial point becomes a CuVector:

NLPModels.get_x0(nlp)
84-element CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
Info

If your problem is implemented as a JuMP model, ExaModels can load it for you on the GPU simply with nlp = ExaModel(model; backend=CUDABackend()).

Solving the problem on the GPU with MadNLP

Once the model nlp is loaded on the GPU, you can solve it using the function madnlp:

results = madnlp(nlp; linear_solver=CUDSSSolver)
nothing
This is MadNLP version v0.9.1, running with cuDSS v0.7.1

Number of nonzeros in constraint Jacobian............:       84
Number of nonzeros in Lagrangian Hessian.............:     5250

Total number of variables............................:       84
                     variables with only lower bounds:        0
                variables with lower and upper bounds:       84
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:       42
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:       42

iter    objective    inf_pr   inf_du inf_compl lg(mu) lg(rg) alpha_pr ir ls
   0  0.0000000e+00 1.04e+02 1.00e+00 1.00e+01  -1.0     -   0.00e+00  0  0
   1  1.4808378e+01 1.00e+02 2.64e+01 6.04e+00  -1.0     -   3.37e-02  1  1h
   2  1.3660555e+04 2.50e+01 1.79e+02 1.87e+00  -1.0     -   1.00e+00  1  1h
   3  1.5757708e+04 2.11e+01 1.55e+02 8.19e-01  -1.0     -   1.66e-01  1  1h
   4  2.2700976e+04 1.19e+01 1.31e+02 3.46e-01  -1.0     -   4.94e-01  1  1h
   5  3.6245598e+04 2.96e+00 2.13e+02 2.84e-01  -1.0     -   1.00e+00  1  1h
   6  4.0274013e+04 1.53e+00 1.64e+02 1.19e-01  -1.0     -   5.59e-01  1  1h
   7  4.5453607e+04 3.61e-01 1.98e+02 1.16e-01  -1.0     -   1.00e+00  1  1h
   8  4.7405169e+04 7.22e-02 1.00e+02 1.00e-01  -1.0     -   1.00e+00  1  1h
   9  4.7892154e+04 8.03e-03 3.84e+01 9.05e-02  -1.0     -   1.00e+00  1  1h
iter    objective    inf_pr   inf_du inf_compl lg(mu) lg(rg) alpha_pr ir ls
  10  4.7949829e+04 3.07e-04 6.16e+00 7.82e-02  -1.0     -   1.00e+00  1  1h
  11  4.7951497e+04 1.23e-06 5.90e-02 7.81e-02  -1.0     -   1.00e+00  1  1h
  12  4.7947424e+04 3.39e-06 1.21e-03 2.85e-03  -2.5     -   1.00e+00  1  1f
  13  4.7947303e+04 4.19e-09 1.24e-06 7.80e-06  -5.0     -   1.00e+00  1  1h

Number of Iterations....: 13

                                   (scaled)                 (unscaled)
Objective...............:   4.7947302735661709e+04    4.7947302735661709e+04
Dual infeasibility......:   1.2358625874757045e-06    1.2358625874757045e-06
Constraint violation....:   4.1921440894934596e-09    4.1921440894934596e-09
Complementarity.........:   7.8004282261755734e-06    7.8004282261755734e-06
Overall NLP error.......:   7.8004282261755734e-06    7.8004282261755734e-06

Number of objective function evaluations              = 14
Number of objective gradient evaluations              = 14
Number of constraint evaluations                      = 14
Number of constraint Jacobian evaluations             = 14
Number of Lagrangian Hessian evaluations              = 13
Number of KKT factorizations                          = 13
Number of KKT backsolves                              = 13

Total wall secs in initialization                     = 48.551 s
Total wall secs in linear solver                      =  unavailable
Total wall secs in NLP function evaluations           =  2.519 s
Total wall secs in solver (w/o init./fun./lin. alg.)  =  unavailable
Total wall secs                                       = 129.560 s

EXIT: Optimal Solution Found (tol = 1.0e-04).

Note that we are using the sparse linear solver NVIDIA cuDSS to solve the primal-dual KKT system entirely on the GPU.

When solving an optimization problem on the GPU, MadNLP applies several automatic modifications. In order:

  1. It increases the parameter bound_relax_factor to 1e-4;
  2. It relaxes all the equality constraints $g(x) = 0$ as a pair of inequality constraints $-\tau \leq g(x) \leq \tau$, with $\tau$ being equal to bound_relax_factor;
  3. It reduces the KKT system down to a sparse condensed system, exploiting the fact that the relaxed problem has only (potentially tight) inequality constraints. Up to a primal regularization, the resulting KKT system is positive definite and can be factorized using a pivoting-free routine (e.g., a Cholesky or LDL' decomposition). This is the so-called Lifted-KKT formulation documented in this article;
  4. Since the condensed KKT system worsens the ill-conditioning inherent to the interior-point method, amplifying the loss of accuracy as the iterates get close to a local solution, the termination tolerance tol is also relaxed to 1e-4 to recover convergence.
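Schematically, the condensation in step 3 can be sketched as follows. The notation is ours, chosen for illustration (it does not match MadNLP's internal naming): $W$ is the Hessian of the Lagrangian, $J$ the constraint Jacobian, and $D_x$, $D_s$ positive diagonal matrices built from the barrier terms once every constraint is an inequality with a slack variable.

```latex
% Augmented KKT system at one interior-point iteration:
\begin{bmatrix} W + D_x & J^\top \\ J & -D_s^{-1} \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \end{bmatrix}
% Eliminating \Delta y = D_s (J \Delta x - r_2) condenses it to a system in \Delta x only:
\quad\Longrightarrow\quad
\left( W + D_x + J^\top D_s J \right) \Delta x \;=\; r_1 + J^\top D_s r_2 .
```

Since $D_s$ is positive, the condensed matrix is positive definite up to a primal regularization of $W + D_x$, which is what makes a pivoting-free Cholesky or LDL' factorization applicable.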

As a result, the convergence behavior can differ significantly from what we observe on the CPU. In particular, relaxing the parameter bound_relax_factor has a non-negligible impact on the geometry of the feasible set. You can limit the loss of accuracy by explicitly specifying the relaxation and tolerance parameters to MadNLP:

results = madnlp(
    nlp;
    tol=1e-7,
    bound_relax_factor=1e-7,
    linear_solver=CUDSSSolver,
)
nothing
This is MadNLP version v0.9.1, running with cuDSS v0.7.1

Number of nonzeros in constraint Jacobian............:       84
Number of nonzeros in Lagrangian Hessian.............:     5250

Total number of variables............................:       84
                     variables with only lower bounds:        0
                variables with lower and upper bounds:       84
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:       42
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:       42

iter    objective    inf_pr   inf_du inf_compl lg(mu) lg(rg) alpha_pr ir ls
   0  0.0000000e+00 1.04e+02 1.00e+00 1.00e+01  -1.0     -   0.00e+00  0  0
   1  1.4806534e+01 1.00e+02 2.64e+01 6.04e+00  -1.0     -   3.37e-02  1  1h
   2  1.3660710e+04 2.50e+01 1.79e+02 1.87e+00  -1.0     -   1.00e+00  1  1h
   3  1.5757916e+04 2.11e+01 1.55e+02 8.19e-01  -1.0     -   1.66e-01  1  1h
   4  2.2713696e+04 1.19e+01 1.31e+02 3.46e-01  -1.0     -   4.95e-01  1  1h
   5  3.6254087e+04 2.96e+00 2.13e+02 2.84e-01  -1.0     -   1.00e+00  1  1h
   6  4.0283958e+04 1.53e+00 1.64e+02 1.19e-01  -1.0     -   5.60e-01  1  1h
   7  4.5459842e+04 3.60e-01 1.98e+02 1.16e-01  -1.0     -   1.00e+00  1  1h
   8  4.7410250e+04 7.21e-02 1.00e+02 1.00e-01  -1.0     -   1.00e+00  1  1h
   9  4.7897346e+04 8.02e-03 3.86e+01 9.02e-02  -1.0     -   1.00e+00  1  1h
iter    objective    inf_pr   inf_du inf_compl lg(mu) lg(rg) alpha_pr ir ls
  10  4.7955202e+04 3.12e-04 6.26e+00 7.81e-02  -1.0     -   1.00e+00  1  1h
  11  4.7956891e+04 1.28e-06 6.14e-02 7.79e-02  -1.0     -   1.00e+00  1  1h
  12  4.7952818e+04 3.39e-06 1.21e-03 2.85e-03  -2.5     -   1.00e+00  1  1f
  13  4.7952696e+04 4.21e-09 1.24e-06 2.16e-06  -5.7     -   1.00e+00  1  1h
  14  4.7952696e+04 2.16e-15 6.89e-13 7.07e-09  -8.0     -   1.00e+00  2  1h

Number of Iterations....: 14

                                   (scaled)                 (unscaled)
Objective...............:   4.7952696276023358e+04    4.7952696276023358e+04
Dual infeasibility......:   6.8916601564313557e-13    6.8916601564313557e-13
Constraint violation....:   2.1550052101877319e-15    2.1550052101877319e-15
Complementarity.........:   7.0655894247826521e-09    7.0655894247826521e-09
Overall NLP error.......:   7.0655894247826521e-09    7.0655894247826521e-09

Number of objective function evaluations              = 15
Number of objective gradient evaluations              = 15
Number of constraint evaluations                      = 15
Number of constraint Jacobian evaluations             = 15
Number of Lagrangian Hessian evaluations              = 14
Number of KKT factorizations                          = 14
Number of KKT backsolves                              = 15

Total wall secs in initialization                     =  0.026 s
Total wall secs in linear solver                      =  unavailable
Total wall secs in NLP function evaluations           =  0.006 s
Total wall secs in solver (w/o init./fun./lin. alg.)  =  unavailable
Total wall secs                                       =  0.090 s

EXIT: Optimal Solution Found (tol = 1.0e-07).

Decreasing the tolerance tol too much is likely to cause numerical issues inside the algorithm. In general, we recommend keeping the value of bound_relax_factor below tol.

Solving the problem on the GPU with HyKKT

Some applications require accurate solutions. In that case, we recommend using the extension HybridKKT.jl, which implements the Golub & Greif augmented Lagrangian formulation detailed in this article. Compared to Lifted-KKT, the Hybrid-KKT strategy is more accurate (it does not relax the equality constraints in the problem) but slightly slower (it computes the descent direction using a conjugate gradient at every IPM iteration).
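Schematically, the Golub & Greif strategy keeps the equality constraints and shifts the KKT system with a parameter $\gamma > 0$. The notation below is ours, for illustration only: $K$ denotes the (regularized) Hessian block and $J$ the Jacobian of the equality constraints.

```latex
% KKT system with equality constraints kept as-is:
\begin{bmatrix} K & J^\top \\ J & 0 \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \end{bmatrix},
\qquad
K_\gamma := K + \gamma\, J^\top J .
% For \gamma large enough, K_\gamma is positive definite and admits a Cholesky
% factorization; \Delta y is recovered from the Schur-complement system
% (J K_\gamma^{-1} J^\top)\, \Delta y
%     = J K_\gamma^{-1} (r_1 + \gamma J^\top r_2) - r_2,
% which is solved iteratively with a conjugate gradient.
```

This explains the trade-off mentioned above: each IPM iteration requires several backsolves with $K_\gamma$ inside the conjugate gradient loop, but no relaxation of the feasible set is needed.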

Once the package HybridKKT is installed, the solution proceeds as:

using HybridKKT

results = madnlp(
    nlp;
    linear_solver=MadNLPGPU.CUDSSSolver,
    kkt_system=HybridKKT.HybridCondensedKKTSystem,
    equality_treatment=MadNLP.EnforceEquality,
    fixed_variable_treatment=MadNLP.MakeParameter,
)

How to solve a legacy CPU model on the GPU?

If your optimization problem is not instantiated on the GPU, you can still solve it there by wrapping your model in a SparseWrapperModel. Oftentimes, this is the most convenient solution if your problem cannot easily be formulated with ExaModels. In that case, the evaluation of the model runs on the CPU, but all of MadNLP's internals are instantiated on the GPU (including the sparse linear solver).

To give an example, suppose we have instantiated our airport model on the CPU:

nlp_cpu = airport_model(Float64, nothing)
An ExaModel{Float64, Vector{Float64}, ...}

  Problem name: Generic
   All variables: ████████████████████ 84     All constraints: ████████████████████ 42    
            free: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                 free: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           upper: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                upper: ████████████████████ 42    
         low/upp: ████████████████████ 84             low/upp: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
          infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0               infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
            nnzh: (-47.06% sparsity)   5250            linear: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
                                                    nonlinear: ████████████████████ 42    
                                                         nnzj: ( 97.62% sparsity)   84    
                                                     lin_nnzj: (------% sparsity)         
                                                     nln_nnzj: ( 97.62% sparsity)   84    

You can wrap it for use on the GPU with:

nlp_wrapped = MadNLP.SparseWrapperModel(CuArray, nlp_cpu)
MadNLP.SparseWrapperModel{Float64, CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}, Float64, Vector{Int64}, Vector{Float64}, ExaModels.ExaModel{Float64, Vector{Float64}, Nothing, ExaModels.Objective{ExaModels.ObjectiveNull, ExaModels.SIMDFunction{ExaModels.Node2{typeof(+), ExaModels.Node1{typeof(abs2), ExaModels.Node2{typeof(-), ExaModels.Var{ExaModels.ParIndexed{ExaModels.ParSource, 1}}, ExaModels.Var{ExaModels.ParIndexed{ExaModels.ParSource, 2}}}}, ExaModels.Node1{typeof(abs2), ExaModels.Node2{typeof(-), ExaModels.Var{ExaModels.Node2{typeof(+), ExaModels.ParIndexed{ExaModels.ParSource, 1}, Int64}}, ExaModels.Var{ExaModels.Node2{typeof(+), ExaModels.ParIndexed{ExaModels.ParSource, 2}, Int64}}}}}, ExaModels.Compressor{NTuple{4, Int64}}, ExaModels.Compressor{NTuple{6, Int64}}}, Vector{Tuple{Int64, Int64}}}, ExaModels.Constraint{ExaModels.ConstraintNull, ExaModels.SIMDFunction{ExaModels.Node2{typeof(-), ExaModels.Node2{typeof(+), ExaModels.Node1{typeof(abs2), ExaModels.Node2{typeof(-), ExaModels.Var{ExaModels.ParIndexed{ExaModels.ParSource, 1}}, ExaModels.ParIndexed{ExaModels.ParSource, 2}}}, ExaModels.Node1{typeof(abs2), ExaModels.Node2{typeof(-), ExaModels.Var{ExaModels.Node2{typeof(+), ExaModels.ParIndexed{ExaModels.ParSource, 1}, Int64}}, ExaModels.ParIndexed{ExaModels.ParSource, 3}}}}, ExaModels.ParIndexed{ExaModels.ParSource, 4}}, ExaModels.Compressor{Tuple{Int64, Int64}}, ExaModels.Compressor{Tuple{Int64, Int64}}}, Vector{Tuple{Int64, Float64, Float64, Float64}}, Int64}, Nothing}}
  Problem name: Generic
   All variables: ████████████████████ 84     All constraints: ████████████████████ 42    
            free: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                 free: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           upper: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                upper: ████████████████████ 42    
         low/upp: ████████████████████ 84             low/upp: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
          infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0               infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
            nnzh: (-47.06% sparsity)   5250            linear: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
                                                    nonlinear: ████████████████████ 42    
                                                         nnzj: ( 97.62% sparsity)   84    
                                                     lin_nnzj: (------% sparsity)         
                                                     nln_nnzj: ( 97.62% sparsity)   84    

  Counters:
             obj: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                 grad: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                 cons: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
        cons_lin: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0             cons_nln: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                 jcon: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           jgrad: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                  jac: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0              jac_lin: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
         jac_nln: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                jprod: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0            jprod_lin: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
       jprod_nln: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0               jtprod: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0           jtprod_lin: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
      jtprod_nln: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                 hess: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                hprod: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           jhess: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0               jhprod: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     

Then, we proceed as above to call MadNLP:

results = madnlp(nlp_wrapped; linear_solver=CUDSSSolver)
nothing
This is MadNLP version v0.9.1, running with cuDSS v0.7.1

Number of nonzeros in constraint Jacobian............:       84
Number of nonzeros in Lagrangian Hessian.............:     5250

Total number of variables............................:       84
                     variables with only lower bounds:        0
                variables with lower and upper bounds:       84
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:       42
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:       42

iter    objective    inf_pr   inf_du inf_compl lg(mu) lg(rg) alpha_pr ir ls
   0  0.0000000e+00 1.04e+02 1.00e+00 1.00e+01  -1.0     -   0.00e+00  0  0
   1  1.4808378e+01 1.00e+02 2.64e+01 6.04e+00  -1.0     -   3.37e-02  1  1h
   2  1.3660555e+04 2.50e+01 1.79e+02 1.87e+00  -1.0     -   1.00e+00  1  1h
   3  1.5757708e+04 2.11e+01 1.55e+02 8.19e-01  -1.0     -   1.66e-01  1  1h
   4  2.2700976e+04 1.19e+01 1.31e+02 3.46e-01  -1.0     -   4.94e-01  1  1h
   5  3.6245598e+04 2.96e+00 2.13e+02 2.84e-01  -1.0     -   1.00e+00  1  1h
   6  4.0274013e+04 1.53e+00 1.64e+02 1.19e-01  -1.0     -   5.59e-01  1  1h
   7  4.5453607e+04 3.61e-01 1.98e+02 1.16e-01  -1.0     -   1.00e+00  1  1h
   8  4.7405169e+04 7.22e-02 1.00e+02 1.00e-01  -1.0     -   1.00e+00  1  1h
   9  4.7892154e+04 8.03e-03 3.84e+01 9.05e-02  -1.0     -   1.00e+00  1  1h
iter    objective    inf_pr   inf_du inf_compl lg(mu) lg(rg) alpha_pr ir ls
  10  4.7949829e+04 3.07e-04 6.16e+00 7.82e-02  -1.0     -   1.00e+00  1  1h
  11  4.7951497e+04 1.23e-06 5.90e-02 7.81e-02  -1.0     -   1.00e+00  1  1h
  12  4.7947424e+04 3.39e-06 1.21e-03 2.85e-03  -2.5     -   1.00e+00  1  1f
  13  4.7947303e+04 4.19e-09 1.24e-06 7.80e-06  -5.0     -   1.00e+00  1  1h

Number of Iterations....: 13

                                   (scaled)                 (unscaled)
Objective...............:   4.7947302735661782e+04    4.7947302735661782e+04
Dual infeasibility......:   1.2358625376867822e-06    1.2358625376867822e-06
Constraint violation....:   4.1921440895476697e-09    4.1921440895476697e-09
Complementarity.........:   7.8004282261790513e-06    7.8004282261790513e-06
Overall NLP error.......:   7.8004282261790513e-06    7.8004282261790513e-06

Number of objective function evaluations              = 14
Number of objective gradient evaluations              = 14
Number of constraint evaluations                      = 14
Number of constraint Jacobian evaluations             = 14
Number of Lagrangian Hessian evaluations              = 13
Number of KKT factorizations                          = 13
Number of KKT backsolves                              = 13

Total wall secs in initialization                     =  1.257 s
Total wall secs in linear solver                      =  unavailable
Total wall secs in NLP function evaluations           =  0.004 s
Total wall secs in solver (w/o init./fun./lin. alg.)  =  unavailable
Total wall secs                                       =  8.150 s

EXIT: Optimal Solution Found (tol = 1.0e-04).

How to solve a JuMP model on the GPU?

If your model is implemented with JuMP, you can solve it on the GPU using the same trick as in the previous section: pass the option array_type to specify which array type should be used to instantiate the SparseWrapperModel.

This is demonstrated in the following example:

using CUDA
using JuMP
using MadNLP, MadNLPGPU

model = Model(MadNLP.Optimizer)
@variable(model, x1 <= 0.5)
@variable(model, x2)

@objective(model, Min, 100.0 * (x2 - x1^2)^2 + (1.0 - x1)^2)
@constraint(model, x1 * x2 >= 1.0)
@constraint(model, x1 + x2^2 >= 0.0)

JuMP.set_optimizer_attribute(model, "array_type", CuArray)
JuMP.set_optimizer_attribute(model, "linear_solver", CUDSSSolver)
JuMP.optimize!(model) # Solve the model on the GPU!