Packages Used: Functions from the following packages are used in this notebook: Optim, Plots, Statistics, DataFrames, CSV.
Multivariate optimization corresponds to the optimization of a function of two or more variables and is the most flexible type of nonlinear optimization.
$ \begin{eqnarray*} \quad \mathbf{x}^* &=& \mathrm{arg}\underset{\mathbf{x}}{\operatorname{min}} \bigl\{ f(\mathbf{x}) \bigr\} \\ T\!C^* &=& f(\mathbf{x}^*) \end{eqnarray*}$
Procedures for unconstrained multivariate optimization differ with respect to whether they require information regarding the gradient, where the gradient is the multivariate version of the derivative. The Nelder-Mead method (1965) is the most widely used direct, or gradient-free, method available, in contrast to gradient-based methods, which require gradients to be either provided or estimated, something that can be difficult or impossible for some functions. Gradient-based methods are typically used only when working in high-dimensional spaces, which is common when, for example, training a neural network.
Given a 2-D point x and m other 2-D points in P, create a function dist2 to determine the Euclidean (i.e., straight-line) distance d from x to each of the m points in P:
$ \begin{array}{l} \quad\mathbf{x} = \left[ \begin{array}{c} x_1 \\ x_2 \end{array} \right],\quad \mathbf{P} = \left[ \begin{array}{cc} p_{1,1} & p_{1,2} \\ \vdots & \vdots \\ p_{m,1} & p_{m,2} \end{array} \right] \\ \quad\mathbf{d} = \left[ \begin{array}{c} d_1 \\ \vdots \\ d_m \end{array} \right] = \left[ \begin{array}{c} \sqrt{\left(x_1 - p_{1,1}\right)^2 + \left(x_2 - p_{1,2}\right)^2} \\ \vdots \\ \sqrt{\left(x_1 - p_{m,1}\right)^2 + \left(x_2 - p_{m,2}\right)^2} \end{array} \right] \end{array} $
The best way to start is to create some example data for which you know the correct answer:
x = [3., 1.] # x is a 2-element array of real values
P = [1 1; 6 1; 6 5]; # P will be converted from integer to real when subtracted from x
In the figure, the sum of the distances from x to the three points in P is 10:
The first step is to transpose x since a 2-element array is interpreted as a column vector, and it needs to be a row vector so that it is compatible with P and can be broadcast so that it can be subtracted from each row in P:
x' .- P
3×2 Matrix{Float64}:
  2.0   0.0
 -3.0   0.0
 -3.0  -4.0
Square each element of the result:
(x' .- P).^2
3×2 Matrix{Float64}:
 4.0   0.0
 9.0   0.0
 9.0  16.0
Add the elements in each row:
sum((x' .- P).^2, dims = 2)
3×1 Matrix{Float64}:
  4.0
  9.0
 25.0
Then take the square root and assign the result to d:
d = sqrt.(sum((x' .- P).^2, dims = 2)) # dims = 2 => sum each row
3×1 Matrix{Float64}:
 2.0
 3.0
 5.0
A function can be created and then called to assign the result to d:
dist2(x,P) = sqrt.(sum((x' .- P).^2, dims = 2))
d = dist2(x,P)
3×1 Matrix{Float64}:
 2.0
 3.0
 5.0
As it is written, the function will work for points of any dimension, not just 2-D points. For n-dimensional points, x would be an n-element array and P an m × n matrix; e.g., for 4-D points:
@show [x; x]
@show [P P]
d = dist2([x; x],[P P])
[x; x] = [3.0, 1.0, 3.0, 1.0]
[P P] = [1 1 1 1; 6 1 6 1; 6 5 6 5]
3×1 Matrix{Float64}:
 2.8284271247461903
 4.242640687119285
 7.0710678118654755
The function can also be used for 1-D points; e.g., Ex 1 from NL Opt 1:
p = [1, 2, 5, 9]
d = dist2(3.,p)
4-element Vector{Float64}:
 2.0
 1.0
 2.0
 6.0
To determine the number of elements in an array:
@show length(x)
@show length(P);
length(x) = 2
length(P) = 6
To determine the dimensions of an array:
@show size(P)
@show size(P,1)
@show size(P,2);
size(P) = (3, 2)
size(P, 1) = 3
size(P, 2) = 2
The function dist2 can now be used inside of any expression; e.g., to calculate the sum of the distances from x to the points in P:
sum(dist2(x,P))
10.0
The function optimize performs general-purpose multivariate optimization. Instead of specifying a bounded range, as was done for univariate optimization with optimize, an initial location x0 is specified, from which the function determines the location x that minimizes the sum of the distances to each point in P:
using Optim
dist2(x,P) = sqrt.(sum((x' .- P).^2, dims = 2))
P = [1 1; 6 1; 6 5];
x0 = [0., 0.]
optimize(x -> sum(dist2(x,P)), x0)
* Status: success

* Candidate solution
   Final objective value:     8.697184e+00

* Found with
   Algorithm:     Nelder-Mead

* Convergence measures
   √(Σ(yᵢ-ȳ)²)/n ≤ 1.0e-08

* Work counters
   Seconds run:   0  (vs limit Inf)
   Iterations:    45
   f(x) calls:    89
To assign the optimal location and minimum total distance values to variables:
res = optimize(x -> sum(dist2(x,P)), x0)
xᵒ = res.minimizer
TDᵒ = res.minimum
@show xᵒ TDᵒ;
xᵒ = [5.089389040930581, 1.9663250607367784]
TDᵒ = 8.697184389632483
To return only the optimal location:
xᵒ = optimize(x -> sum(dist2(x,P)), x0).minimizer
2-element Vector{Float64}:
 5.089389040930581
 1.9663250607367784
The Nelder-Mead method (1965) is the most widely used procedure for unconstrained multivariate optimization (more details on Julia's Optim package implementation of Nelder-Mead). It uses a simplex of d + 1 vertices to move towards an optimal solution. In 2-D, the simplex is a triangle and, as different simplexes are created through reflection, expansion, etc. (see figure, below), it looks like an amoeba moving; as a result, Nelder-Mead is sometimes referred to as the "amoeba method" (animation of Nelder-Mead).
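To make the simplex operations concrete, below is a minimal sketch of a single reflection step on the total-distance function from above. The helper nm_reflect_step and the initial simplex are illustrative assumptions for this sketch, not Optim's implementation:

# Minimal sketch of one Nelder-Mead reflection step in 2-D (illustrative only; Optim's
# implementation also handles expansion, contraction, and shrinkage)
function nm_reflect_step(f, S)                 # S is a vector of 3 vertices (a 2-D simplex)
    S = sort(S, by = f)                        # Order vertices from best to worst
    centroid = (S[1] + S[2]) / 2               # Centroid of all vertices except the worst
    xr = centroid + (centroid - S[3])          # Reflect the worst vertex through the centroid
    if f(xr) < f(S[3])                         # Keep the reflected point if it improves
        S[3] = xr
    end
    return S
end

f2(x) = sum(dist2(x, P))                       # Total distance to the points in P
S = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]       # Initial simplex (3 vertices)
S = nm_reflect_step(f2, S)                     # One step of the simplex toward lower f2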
using Plots
scatter(P[:,1],P[:,2], legend=false)
In order to be able to see the locations x at which dist2 is evaluated during a run of the Nelder-Mead method, dist2 can be wrapped inside of another function, call it f2plot, that plots each location visited:
function f2plot(f,x)
    scatter!([x[1]],[x[2]], markersize=1, markerstrokecolor=:green)  # Mark location visited
    return f(x)                                                      # Return the function value
end
x0 = [0., 0.]
xᵒ = optimize(x -> sum(f2plot(x -> dist2(x,P),x)), x0).minimizer
scatter!([xᵒ[1]],[xᵒ[2]], markersize=4, markercolor=:red)
Gradient-based algorithms require the gradient either to be provided as a function to optimize or to be approximated using finite differences. Some algorithms (e.g., Newton's) also require the Hessian to be provided as a function.
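As an aside, a finite-difference approximation replaces each partial derivative with a difference quotient. Below is a minimal central-difference sketch; the helper fd_gradient and the test function g are assumptions for illustration, not the machinery Optim actually uses:

# Minimal central-difference gradient approximation (illustrative helper only)
function fd_gradient(f, x; h = 1e-6)
    G = similar(x)
    for i in eachindex(x)
        e = zeros(length(x)); e[i] = h         # Perturbation along coordinate i
        G[i] = (f(x + e) - f(x - e)) / (2h)    # Central-difference quotient
    end
    return G
end

g(x) = x[1]^2 + 3x[2]^2                        # Test function with exact gradient [2x₁, 6x₂]
fd_gradient(g, [1., 2.])                       # ≈ [2.0, 12.0]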
$\quad f(x,y) = 100\left( y - x^2 \right)^2 + \left( 1 - x \right)^2$
This function, known as the Rosenbrock function, is often used to illustrate numerical optimization methods because, although it is not difficult to determine the values for x and y that minimize the function using analytical procedures, many numerical procedures converge slowly to this solution.
The gradient of the function is
$\quad \nabla\!f(x,y) = \left[ \begin{array}{c} \dfrac{\partial f}{\partial x} \\ \dfrac{\partial f}{\partial y} \end{array} \right] = \left[ \begin{array}{c} 2x - 400x(y - x^2) - 2 \\ 200y - 200x^2 \end{array} \right] $
and its Hessian is
$\quad H\!f(x,y) = \left[ \begin{array}{cc} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \partial y} \\ \dfrac{\partial^2 f}{\partial y \partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{array} \right] = \left[ \begin{array}{cc} 1200x^2 - 400y + 2 & -400x \\ -400x & 200 \end{array} \right] $.
Solving $\partial f/\partial x = 0$ and $\partial f/\partial y = 0$ for $x$ and $y$ results in a single optimum of $[1, 1]$ with a value of $0$, which is a minimum since
$\quad \dfrac{\partial^2 f(1,1)}{\partial x^2} = 802 > 0$
and the determinant of the Hessian is positive,
$\quad \bigl| H\!f(1,1) \bigr| = \dfrac{\partial^2 f}{\partial x^2}(1,1) \dfrac{\partial^2 f}{\partial y^2}(1,1) - \left(\dfrac{\partial^2 f}{\partial x \partial y}(1,1) \right)^2 = 400 > 0$.
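As a quick numerical check of this derivation, the gradient and Hessian expressions above can be evaluated at (1, 1). The variable names below are illustrative; this is only a sketch of the check:

# Evaluate the gradient and Hessian entries of f at (x, y) = (1, 1) using the expressions above
xv, yv = 1.0, 1.0
∇f = [2xv - 400xv*(yv - xv^2) - 2, 200yv - 200xv^2]   # Expect [0.0, 0.0]
H11 = 1200xv^2 - 400yv + 2                            # ∂²f/∂x² at (1,1): 802
H12 = -400xv                                          # ∂²f/∂x∂y at (1,1): -400
H22 = 200.0                                           # ∂²f/∂y² at (1,1): 200
det_H = H11*H22 - H12^2                               # Determinant: 400
@show ∇f H11 det_H;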
using Optim, Plots
f(x) = 100(x[2] - x[1]^2)^2 + (1 - x[1])^2 # Using single input argument
xrng = -3:.1:3 # x-axis range
yrng = -2:.1:5 # y-axis range
contour(xrng, yrng, (x1,x2) -> f([x1,x2])) # Using multiple argument anonymous function
x0 = [-2., 2.]
optimize(f, x0) # Gradient not provided: Nelder-Mead default algorithm
* Status: success

* Candidate solution
   Final objective value:     7.355781e-10

* Found with
   Algorithm:     Nelder-Mead

* Convergence measures
   √(Σ(yᵢ-ȳ)²)/n ≤ 1.0e-08

* Work counters
   Seconds run:   0  (vs limit Inf)
   Iterations:    132
   f(x) calls:    246
function g!(G, x)                                   # In-place gradient of f
    G[1] = 2. * x[1] - 400. * x[1] * (x[2] - x[1]^2) - 2.
    G[2] = 200. * x[2] - 200. * x[1]^2
end
optimize(f, g!, x0) # Gradient provided: L-BFGS default algorithm
* Status: success

* Candidate solution
   Final objective value:     2.155724e-18

* Found with
   Algorithm:     L-BFGS

* Convergence measures
   |x - x'|               = 2.84e-07 ≰ 0.0e+00
   |x - x'|/|x'|          = 2.84e-07 ≰ 0.0e+00
   |f(x) - f(x')|         = 7.97e-13 ≰ 0.0e+00
   |f(x) - f(x')|/|f(x')| = 3.70e+05 ≰ 0.0e+00
   |g(x)|                 = 9.34e-09 ≤ 1.0e-08

* Work counters
   Seconds run:   1  (vs limit Inf)
   Iterations:    33
   f(x) calls:    103
   ∇f(x) calls:   103
optimize(f, x0, LBFGS()) # Gradient approximated using finite differences
* Status: success

* Candidate solution
   Final objective value:     5.378404e-17

* Found with
   Algorithm:     L-BFGS

* Convergence measures
   |x - x'|               = 2.05e-09 ≰ 0.0e+00
   |x - x'|/|x'|          = 2.05e-09 ≰ 0.0e+00
   |f(x) - f(x')|         = 1.75e-17 ≰ 0.0e+00
   |f(x) - f(x')|/|f(x')| = 3.25e-01 ≰ 0.0e+00
   |g(x)|                 = 4.05e-12 ≤ 1.0e-08

* Work counters
   Seconds run:   0  (vs limit Inf)
   Iterations:    36
   f(x) calls:    108
   ∇f(x) calls:   108
optimize(f, g!, x0, ConjugateGradient()) # Using Conjugate Gradient algorithm
* Status: success

* Candidate solution
   Final objective value:     1.838688e-18

* Found with
   Algorithm:     Conjugate Gradient

* Convergence measures
   |x - x'|               = 4.31e-10 ≰ 0.0e+00
   |x - x'|/|x'|          = 4.31e-10 ≰ 0.0e+00
   |f(x) - f(x')|         = 1.53e-17 ≰ 0.0e+00
   |f(x) - f(x')|/|f(x')| = 8.32e+00 ≰ 0.0e+00
   |g(x)|                 = 2.99e-09 ≤ 1.0e-08

* Work counters
   Seconds run:   0  (vs limit Inf)
   Iterations:    34
   f(x) calls:    83
   ∇f(x) calls:   58
function h!(H, x)                                   # In-place Hessian of f
    H[1,1] = 1200. * x[1]^2 - 400. * x[2] + 2
    H[1,2] = -400. * x[1]
    H[2,1] = -400. * x[1]
    H[2,2] = 200.
end
optimize(f, g!, h!, x0, NewtonTrustRegion()) # Gradient and Hessian provided
* Status: success

* Candidate solution
   Final objective value:     6.271614e-20

* Found with
   Algorithm:     Newton's Method (Trust Region)

* Convergence measures
   |x - x'|               = 5.98e-06 ≰ 0.0e+00
   |x - x'|/|x'|          = 5.98e-06 ≰ 0.0e+00
   |f(x) - f(x')|         = 2.61e-11 ≰ 0.0e+00
   |f(x) - f(x')|/|f(x')| = 4.16e+08 ≰ 0.0e+00
   |g(x)|                 = 2.61e-09 ≤ 1.0e-08

* Work counters
   Seconds run:   0  (vs limit Inf)
   Iterations:    30
   f(x) calls:    31
   ∇f(x) calls:   31
   ∇²f(x) calls:  27
Automatic differentiation (AD) evaluates the derivative using the function code. Julia has an active AD group. AD can be used to calculate the derivative for the optimization as long as the function is written only using Julia, without calls to non-Julia libraries.
optimize(f, x0, NewtonTrustRegion(), autodiff = :forward) # No gradient/Hessian provided
* Status: success

* Candidate solution
   Final objective value:     6.271440e-20

* Found with
   Algorithm:     Newton's Method (Trust Region)

* Convergence measures
   |x - x'|               = 5.98e-06 ≰ 0.0e+00
   |x - x'|/|x'|          = 5.98e-06 ≰ 0.0e+00
   |f(x) - f(x')|         = 2.61e-11 ≰ 0.0e+00
   |f(x) - f(x')|/|f(x')| = 4.16e+08 ≰ 0.0e+00
   |g(x)|                 = 2.61e-09 ≤ 1.0e-08

* Work counters
   Seconds run:   0  (vs limit Inf)
   Iterations:    30
   f(x) calls:    31
   ∇f(x) calls:   31
   ∇²f(x) calls:  27
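The ForwardDiff package, which the autodiff = :forward option draws on, can also be called directly to obtain derivatives from the function code. A minimal sketch, assuming ForwardDiff is installed:

using ForwardDiff

# Forward-mode AD evaluates exact derivatives (to machine precision) from the code of f
ForwardDiff.gradient(f, [-2., 2.])   # ≈ [-1606.0, -400.0], matching the analytical gradient
ForwardDiff.hessian(f, [1., 1.])     # ≈ [802.0 -400.0; -400.0 200.0], matching the Hessian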
contour(xrng, yrng, (x1,x2) -> f([x1,x2]), legend=false, xlims=(-3,3), ylims=(-2,5))
res = optimize(x -> f2plot(f,x), x0)
str = "\\nNelder-Mead: " * string(res.f_calls) * " function calls\\n"
scatter!([res.minimizer[1]],[res.minimizer[2]],markersize=4,markercolor=:red,title=str)
contour(xrng, yrng, (x1,x2) -> f([x1,x2]), legend=false, xlims=(-3,3), ylims=(-2,5))
res = optimize(x -> f2plot(f,x), g!, x0, ConjugateGradient())
str = "\\nConjugate Gradient w grad: " * string(res.f_calls) * " function calls\\n"
scatter!([res.minimizer[1]],[res.minimizer[2]],markersize=4,markercolor=:red,title=str)
contour(xrng, yrng, (x1,x2) -> f([x1,x2]), legend=false, xlims=(-3,3), ylims=(-2,5))
res = optimize(x -> f2plot(f,x), g!, h!, x0, NewtonTrustRegion())
str = "\\nNewton w grad + Hessian: " * string(res.f_calls) * " function calls\\n"
scatter!([res.minimizer[1]],[res.minimizer[2]],markersize=4,markercolor=:red,title=str)
Determining the maximum deflection $x_1$ and rotation $x_2$ at the tip of a cantilever beam can be formulated as a minimization of the potential energy of the beam, which reduces to the unconstrained minimization of
$ \quad f(x_1,x_2) = 12x_1^2 + 4x_2^2 - 12x_1x_2 + 2x_1 $,
which has a known solution of $(-1/3, -1/2)$. See N.H. Kim, p. 129, for more details.
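Because the objective is quadratic, the known solution can be verified analytically: setting the gradient to zero gives a 2 × 2 linear system that can be solved directly. A minimal sketch of this check:

# Setting ∇f = 0 for f(x₁,x₂) = 12x₁² + 4x₂² - 12x₁x₂ + 2x₁ gives the linear system
#   24x₁ - 12x₂ = -2
#  -12x₁ +  8x₂ =  0
A = [24.0 -12.0; -12.0 8.0]
b = [-2.0, 0.0]
A \ b                        # ≈ [-0.3333, -0.5], i.e., (-1/3, -1/2)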
using Optim, Plots
f(x) = 12x[1]^2 + 4x[2]^2 - 12x[1]*x[2] + 2x[1]
xrng = -3:.1:3 # x-axis range
yrng = -3:.1:3 # y-axis range
contour(xrng, yrng, (x1,x2) -> f([x1,x2])) # Using multiple argument anonymous function
x0 = [0., 0.]
res = optimize(f, x0)
* Status: success

* Candidate solution
   Final objective value:     -3.333333e-01

* Found with
   Algorithm:     Nelder-Mead

* Convergence measures
   √(Σ(yᵢ-ȳ)²)/n ≤ 1.0e-08

* Work counters
   Seconds run:   0  (vs limit Inf)
   Iterations:    33
   f(x) calls:    68
res.minimizer
2-element Vector{Float64}:
 -0.33336029354864666
 -0.5000221244148185
x0 = [0., 0.]
optimize(f, x0, ConjugateGradient())
* Status: success

* Candidate solution
   Final objective value:     -3.333333e-01

* Found with
   Algorithm:     Conjugate Gradient

* Convergence measures
   |x - x'|               = 5.00e-01 ≰ 0.0e+00
   |x - x'|/|x'|          = 1.00e+00 ≰ 0.0e+00
   |f(x) - f(x')|         = 2.50e-01 ≰ 0.0e+00
   |f(x) - f(x')|/|f(x')| = 7.50e-01 ≰ 0.0e+00
   |g(x)|                 = 1.83e-10 ≤ 1.0e-08

* Work counters
   Seconds run:   0  (vs limit Inf)
   Iterations:    2
   f(x) calls:    5
   ∇f(x) calls:   4
Regression is a method of modeling the relationship between a dependent variable and one or more independent variables. The model is expressed as a function that "best fits" the independent variables to the dependent variable. Determining the best fit involves solving a multivariate optimization problem, where the Nelder-Mead procedure can be used to determine the coefficients. If the coefficients of the model are linear in the function, then the method is termed linear regression. Different objective functions (termed loss functions) can be used to fit the model.
A new medication is being developed. Different doses of the medication have been given to thirteen subjects and the resulting responses were recorded. Determine the dose-response relationship.
Solution: It is always a good idea to first plot the data to see what the relationship might be between the independent dose variable, x, and the dependent response variable, y:
x = [121, 143, 63, 80, 198, 52, 160, 162, 85, 106, 129, 141, 106] # Dose
y = [237, 214, 161, 146, 331, 159, 423, 332, 139, 327, 152, 98, 116] # Response
using Plots
scatter(x, y, legend=false, xlabel="Dose", ylabel="Response", label="(x, y)")
Based on the plot, there seems to be some correlation between dose and response:
using Statistics
cor(x,y)
0.5994685344117657
With some correlation indicated, we will next use a linear model to represent the relationship. The model can then be used to predict what the response would be, $\hat{y}$, for any dose $x$:
$\quad \hat{y} = \alpha_1 + \alpha_2 x$.
The model will be fit by determining the values for the intercept $\alpha_1$ and slope $\alpha_2$ that minimize the sum of the squared distance between the predicted $\hat{y}$ and actual $y$ responses given the dose $x$ for the thirteen dose-response pairs of data:
$\quad \textrm{L2 loss: } \sum (\hat{y} - y)^2$
fŷ(α,x) = α[1] .+ α[2]x # Linear model
fL2(α) = sum((fŷ(α,x) .- y).^2) # L2 loss function
using Optim
αᵒ = optimize(fL2, [0., 1.]).minimizer # Starting α values = [0., 1.]
2-element Vector{Float64}:
 44.86442369548186
  1.4565087918592914
plot!(x -> fŷ(αᵒ,x), 0, maximum(x), label="L2")
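As a check on the Nelder-Mead result, the L2 (least-squares) coefficients can also be computed in closed form with the backslash operator, which solves the least-squares problem for an overdetermined system. A sketch; the variable names are illustrative:

# Closed-form least-squares fit: intercept and slope from the design matrix [1 x]
A = [ones(length(x)) x]      # 13×2 design matrix: a column of ones and the dose column
α_LS = A \ y                 # ≈ [44.86, 1.457], matching the Nelder-Mead estimate above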
Why is the L2 loss function being used? Might it be better to try to minimize the sum of the absolute deviations instead of the square of the deviations?
$\quad \textrm{L1 loss: } \sum \bigl| \hat{y} - y \bigr|$
fL1(α) = sum(abs.(fŷ(α,x) .- y)) # L1 loss function
αᵒ = optimize(fL1, [0., 1.]).minimizer
plot!(x -> fŷ(αᵒ,x), 0, maximum(x), label="L1",legend=true)
L2 loss is the only loss function for which minimizing the loss (minimum square error, MSE) yields the mean of the data; most importantly, any change in the data is reflected in a change in the mean, and so the mean allows full information recovery. Minimizing L1 loss (minimum absolute deviation, MAD) corresponds to the median of the data and does not allow full information recovery, which in some cases is a good feature because it makes the estimate more robust with respect to outliers.
x1, x2 = [1, 2, 6], [1, 2, 12]
@show sum(x1) sum(x2)
@show mean(x1),3*mean(x1) mean(x2),3*mean(x2)
@show median(x1),3*median(x1) median(x2),3*median(x2);
sum(x1) = 9
sum(x2) = 15
(mean(x1), 3 * mean(x1)) = (3.0, 9.0)
(mean(x2), 3 * mean(x2)) = (5.0, 15.0)
(median(x1), 3 * median(x1)) = (2.0, 6.0)
(median(x2), 3 * median(x2)) = (2.0, 6.0)
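The connection between loss minimization and the mean/median can be sketched directly with the bounded univariate form of optimize used previously: fitting a constant a to the data by minimizing the L2 loss recovers the mean, while minimizing the L1 loss recovers the median. The bounds below are illustrative assumptions:

# Fit a constant a to x2 = [1, 2, 12] under each loss using bounded univariate optimization
optimize(a -> sum((a .- x2).^2), -100., 100.).minimizer    # ≈ 5.0, the mean of x2
optimize(a -> sum(abs.(a .- x2)), -100., 100.).minimizer   # ≈ 2.0, the median of x2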
The DataFrames package can be used to work with structured tabular (i.e., row-column) data. To create a single-row table:
using DataFrames
df = DataFrame(Name = "Mike", Age = 44)
1 rows × 2 columns

|   | Name   | Age   |
|---|--------|-------|
|   | String | Int64 |
| 1 | Mike   | 44    |
To add a row to the table:
push!(df,("Bill",40))
2 rows × 2 columns

|   | Name   | Age   |
|---|--------|-------|
|   | String | Int64 |
| 1 | Mike   | 44    |
| 2 | Bill   | 40    |
To create a two-row table:
df = DataFrame(Name = ["Mike","Bill"], Age = [44, 40])
2 rows × 2 columns

|   | Name   | Age   |
|---|--------|-------|
|   | String | Int64 |
| 1 | Mike   | 44    |
| 2 | Bill   | 40    |
To access portions of a table:
name = df.Name
2-element Vector{String}:
 "Mike"
 "Bill"
df.Age
2-element Vector{Int64}:
 44
 40
df[2,:]
DataFrameRow (2 columns)

|   | Name   | Age   |
|---|--------|-------|
|   | String | Int64 |
| 2 | Bill   | 40    |
df[2,"Name"]
"Bill"
Typically, large tabular datasets are available in CSV (comma-separated value) format. As an example, the Dose-Response example data is available in the CSV file NL_Opt-2-Data.csv:
Dose,Response
121,237
143,214
63,161
80,146
198,331
52,159
160,423
162,332
85,139
106,327
129,152
141,98
106,116
using DataFrames, CSV
df = DataFrame(CSV.File("NL_Opt-2-Data.csv"))
13 rows × 2 columns

|    | Dose  | Response |
|----|-------|----------|
|    | Int64 | Int64    |
| 1  | 121   | 237      |
| 2  | 143   | 214      |
| 3  | 63    | 161      |
| 4  | 80    | 146      |
| 5  | 198   | 331      |
| 6  | 52    | 159      |
| 7  | 160   | 423      |
| 8  | 162   | 332      |
| 9  | 85    | 139      |
| 10 | 106   | 327      |
| 11 | 129   | 152      |
| 12 | 141   | 98       |
| 13 | 106   | 116      |
df.Dose
13-element Vector{Int64}:
 121
 143
  63
  80
 198
  52
 160
 162
  85
 106
 129
 141
 106
@show x, y = df.Dose, df.Response; # Assign to arrays
(x, y) = (df.Dose, df.Response) = ([121, 143, 63, 80, 198, 52, 160, 162, 85, 106, 129, 141, 106], [237, 214, 161, 146, 331, 159, 423, 332, 139, 327, 152, 98, 116])
scatter(x, y, legend=false, xlabel="Dose", ylabel="Response")