NL Opt 2: Multivariate Optimization

Packages Used: Functions from the following packages are used in this notebook for the first time:
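A plausible set of imports, inferred from the packages referenced later in the notebook (Optim, DataFrames, and CSV are named in the text; Plots and Statistics are assumptions for the plotting and correlation steps):

```julia
using Optim           # optimize: Nelder-Mead and gradient-based methods
using Plots           # scatter plots of points and data (assumed)
using Statistics      # cor: correlation coefficient (assumed)
using DataFrames      # tabular data
using CSV             # reading CSV files
```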

Multivariate optimization corresponds to the optimization of a function of two or more variables and is the most flexible type of nonlinear optimization.

$ \begin{eqnarray*} \quad \mathbf{x}^* &=& \mathrm{arg}\underset{\mathbf{x}}{\operatorname{min}} \bigl\{ f(\mathbf{x}) \bigr\} \\ T\!C^* &=& f(\mathbf{x}^*) \end{eqnarray*}$

Procedures for unconstrained multivariate optimization differ with respect to whether they require information regarding the gradient, where the gradient is the multivariate version of the derivative. The Nelder-Mead method (1965) is the most widely used direct, or gradient-free, method available, in contrast to gradient-based methods that require gradients to be either provided or estimated, which can be difficult or impossible for some functions. Gradient-based methods are typically only used when working in high-dimensional spaces, which is common when, for example, training a neural network.

1. Gradient-free Optimization

Ex 1: Euclidean 2-D distances

Given a 2-D point x and m other 2-D points in P, create a function dist2 to determine the Euclidean (i.e., straight-line) distance d from x to each of the m points in P:

$ \begin{array}{l} \quad\bf{x} = \left[ \begin{array}{cc} x_1 \\ x_2 \end{array} \right],\quad \bf{P} = \left[ \begin{array}{cc} p_{1,1} & p_{1,2} \\ \vdots & \vdots \\ p_{m,1} & p_{m,2} \end{array} \right] \\ \quad\bf{d} = \left[ \begin{array}{cc} d_1 \\ \vdots \\ d_m \end{array} \right] = \left[ \begin{array}{c} \sqrt{\left(x_1 - p_{1,1}\right)^2 + \left(x_2 - p_{1,2}\right)^2} \\ \vdots \\ \sqrt{\left(x_1 - p_{m,1}\right)^2 + \left(x_2 - p_{m,2}\right)^2} \end{array} \right] \end{array} $

The best way to start is to create some example data for which you know the correct answer:
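For example, a point x at the origin and three points in P chosen so that the distances are easy to verify (these particular values are an assumption: the distances are 3, 4, and 3, summing to 10):

```julia
x = [0, 0]       # 2-D point as a 2-element (column) vector
P = [0  3;       # three 2-D points, one per row (m × 2 matrix)
     4  0;
     0 -3]
```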

In the figure, the sum of the distances from x to the three points in P is 10:

[Figure: the point x and the three points in P]

The first step is to transpose x: a 2-element array is interpreted as a column vector, and it needs to be a row vector so that it can be broadcast against P and subtracted from each row of P:
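A sketch of this step, continuing with the x and P defined above:

```julia
x' .- P       # 1×2 row vector broadcast against each row of the 3×2 matrix P
```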

Square each element of the result:
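Continuing the same expression:

```julia
(x' .- P) .^ 2        # square each element
```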

Add the elements in each row:
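Summing across each row (dims=2) gives an m × 1 result:

```julia
sum((x' .- P) .^ 2, dims=2)
```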

Then take the square root and assign the result to d:
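Putting the steps together (a sketch, continuing with the x and P from above):

```julia
d = sqrt.(sum((x' .- P) .^ 2, dims=2))    # element-wise square root of each row sum
```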

A function can be created and then called to assign the result to d:
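One possible definition (the name dist2 is given in the text; the body follows the steps above):

```julia
dist2(x, P) = sqrt.(sum((x' .- P) .^ 2, dims=2))   # distances from x to each row of P

d = dist2(x, P)
```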

As it is written, the function will work for points of any dimension, not just 2-D points. For n-dimensional points, x would be an n-element array and P an m × n matrix; e.g., for 4-D points:
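For example, with hypothetical 4-D data:

```julia
x4 = [1, 2, 3, 4]       # a 4-D point (hypothetical values)
P4 = [0 0 0 0;          # two 4-D points, one per row
      1 2 3 0]
dist2(x4, P4)           # ≈ [5.477, 4.0]
```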

The function can also be used for 1-D points; e.g., Ex 1 from NL Opt 1:
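A sketch with hypothetical 1-D values (x is a 1-element array and P an m × 1 matrix; the data from NL Opt 1 is not shown here):

```julia
x1 = [2]                        # a 1-D point (hypothetical value)
P1 = reshape([0, 5, 9], 3, 1)   # three 1-D points as a 3×1 matrix
dist2(x1, P1)                   # [2.0, 3.0, 7.0]
```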

To determine the number of elements in an array:
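For example, continuing with the example data:

```julia
length(x)      # number of elements in x (2)
```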

To determine the dimensions of an array:
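```julia
size(P)        # dimensions of P as a tuple, e.g., (3, 2)
```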

The function dist2 can now be used inside of any expression; e.g., to calculate the sum of the distances from x to the points in P:
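For example (10.0 for the example data above):

```julia
sum(dist2(x, P))    # total distance from x to the points in P
```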

Ex 2: Minimum-distance location

The function optimize performs general-purpose multivariate optimization. Instead of specifying a bounded range, as was done for univariate optimization with optimize, an initial location x0 is specified for multivariate optimization; starting from x0, the function determines the location x that minimizes the sum of distances to each point in P:
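A sketch using Optim's optimize (the starting location x0 is an assumption; with only a function and a starting point, optimize defaults to Nelder-Mead):

```julia
using Optim

x0 = [0.0, 0.0]                               # initial location (assumed)
res = optimize(xy -> sum(dist2(xy, P)), x0)   # Nelder-Mead by default
```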

To assign the optimal location and minimum total distance values to variables:
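For example (the variable names are assumptions; TC mirrors the TC* notation above):

```julia
xopt = Optim.minimizer(res)    # optimal location
TC   = Optim.minimum(res)      # minimum total distance
```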

To return only the optimal location:
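```julia
Optim.minimizer(optimize(xy -> sum(dist2(xy, P)), x0))
```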

Nelder-Mead Method

The Nelder-Mead method (1965) is the most widely used procedure for unconstrained multivariate optimization (more details on Julia's Optim package implementation of Nelder-Mead). It uses a simplex of d + 1 vertices (for a d-dimensional problem) to move towards an optimal solution. In 2-D, the simplex is a triangle and, as different simplexes are created through reflection, expansion, etc. (see figure, below), it looks like an amoeba moving; as a result, Nelder-Mead is sometimes referred to as the "amoeba method" (animation of Nelder-Mead).

[Figure: Nelder-Mead simplex operations (reflection, expansion, etc.)]

In order to be able to see the locations x at which dist2 is evaluated during a run of the Nelder-Mead method, dist2 can be wrapped inside of another function, call it f2plot, that plots each location visited:
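A possible sketch of such a wrapper, assuming the Plots package is used for the figure:

```julia
using Plots

scatter(P[:, 1], P[:, 2], label = "P")    # plot the fixed points once

function f2plot(x, P)
    scatter!([x[1]], [x[2]], label = false, markersize = 2)   # mark each location visited
    return sum(dist2(x, P))
end

res = optimize(xy -> f2plot(xy, P), x0, NelderMead())
```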

2. Gradient-based Optimization

Gradient-based algorithms require the gradient either to be provided as a function to optimize or to be approximated using finite differences. Some algorithms (e.g., Newton's method) also require the Hessian to be provided as a function.

Ex 3: Rosenbrock's Banana Function

Rosenbrock's banana function

$\quad f(x,y) = 100\left( y - x^2 \right)^2 + \left( 1 - x \right)^2$

is often used to illustrate numerical optimization methods because, although it is not difficult to determine the values for x and y that minimize the function using analytical procedures, many numerical procedures converge slowly to this solution.

The gradient of the function is

$\quad \nabla\!f(x,y) = \left[ \begin{array}{c} \dfrac{\partial f}{\partial x} \\ \dfrac{\partial f}{\partial y} \end{array} \right] = \left[ \begin{array}{c} 2x - 400x(y - x^2) - 2 \\ 200y - 200x^2 \end{array} \right] $

and its Hessian is

$\quad H\!f(x,y) = \left[ \begin{array}{cc} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \partial y} \\ \dfrac{\partial^2 f}{\partial y \partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{array} \right] = \left[ \begin{array}{cc} 1200x^2 - 400y + 2 & -400x \\ -400x & 200 \end{array} \right] $.

Solving $\partial f/\partial x = 0$ and $\partial f/\partial y = 0$ for $x$ and $y$ results in a single optimum of $[1, 1]$ with a value of $0$, which is a minimum since

$\quad \dfrac{\partial^2 f(1,1)}{\partial x^2} = 802 > 0$

and the determinant of the Hessian is positive,

$\quad \bigl| H\!f(1,1) \bigr| = \dfrac{\partial^2 f}{\partial x^2}(1,1) \dfrac{\partial^2 f}{\partial y^2}(1,1) - \left(\dfrac{\partial^2 f}{\partial x \partial y}(1,1) \right)^2 = 400 > 0$.
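A sketch of minimizing the banana function with Optim, supplying the gradient and Hessian derived above (the starting point x0 is an assumption; Optim expects in-place gradient and Hessian functions):

```julia
using Optim

f(x) = 100(x[2] - x[1]^2)^2 + (1 - x[1])^2

function g!(G, x)                 # gradient, written in-place as Optim expects
    G[1] = 2x[1] - 400x[1]*(x[2] - x[1]^2) - 2
    G[2] = 200x[2] - 200x[1]^2
end

function h!(H, x)                 # Hessian, in-place
    H[1,1] = 1200x[1]^2 - 400x[2] + 2
    H[1,2] = -400x[1]
    H[2,1] = -400x[1]
    H[2,2] = 200
end

x0 = [-1.2, 1.0]                  # a commonly used starting point (assumption)

optimize(f, g!, h!, x0, Newton()) # uses both the gradient and the Hessian
optimize(f, g!, x0, LBFGS())      # uses only the gradient
optimize(f, x0, NelderMead())     # gradient-free, for comparison
```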

Automatic differentiation (AD) evaluates the derivative using the function code. Julia has an active AD group. AD can be used to calculate the derivative for the optimization as long as the function is written only using Julia, without calls to non-Julia libraries.
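For example, Optim can compute the gradient by forward-mode AD instead of finite differences (a sketch, assuming the f and x0 defined above):

```julia
optimize(f, x0, LBFGS(); autodiff = :forward)   # gradient supplied by ForwardDiff
```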

Ex 4: Cantilever Beam Design

Determining the maximum deflection $x_1$ and rotation $x_2$ at the tip of a cantilever beam can be formulated as a minimization of the potential energy of the beam, which reduces to the unconstrained minimization of

$ \quad f(x_1,x_2) = 12x_1^2 + 4x_2^2 - 12x_1x_2 + 2x_1 $,

which has a known solution of $(-1/3, -1/2)$. See N.H. Kim, p. 129, for more details.
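A sketch of minimizing the beam's potential-energy function with Optim (the starting point and the use of L-BFGS with forward-mode AD are assumptions):

```julia
using Optim

fbeam(x) = 12x[1]^2 + 4x[2]^2 - 12x[1]*x[2] + 2x[1]

res = optimize(fbeam, [0.0, 0.0], LBFGS(); autodiff = :forward)
Optim.minimizer(res)     # ≈ [-0.3333, -0.5], i.e., (-1/3, -1/2)
```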

3. Regression

Regression is a method of modeling the relationship between a dependent variable and one or more independent variables. The model is expressed as a function that "best fits" the independent variables to the dependent variable. Determining the best fit involves solving a multivariate optimization problem, where the Nelder-Mead procedure can be used to determine the coefficients. If the coefficients of the model are linear in the function, then the method is termed linear regression. Different objective functions (termed loss functions) can be used to fit the model.

Ex 5: Dose-Response Relationship of a Medication

A new medication is being developed. Different doses of the medication have been given to thirteen subjects, and the resulting responses were recorded. Determine the dose-response relationship.

Solution: It is always a good idea to first plot the data to see what the relationship might be between the independent dose variable, x, and the dependent response variable, y:
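A sketch of the plot, using the dose-response values listed in the CSV file at the end of this notebook:

```julia
using Plots

x = [121, 143, 63, 80, 198, 52, 160, 162, 85, 106, 129, 141, 106]    # dose
y = [237, 214, 161, 146, 331, 159, 423, 332, 139, 327, 152, 98, 116] # response

scatter(x, y, xlabel = "Dose", ylabel = "Response", legend = false)
```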

Based on the plot, there seems to be some correlation between dose and response:
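For example, the sample correlation coefficient (assuming the Statistics standard library):

```julia
using Statistics

cor(x, y)    # sample correlation between dose and response
```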

With some correlation indicated, we will next use a linear model to represent the relationship. The model can then be used to predict what the response would be, $\hat{y}$, for any dose $x$:

$\quad \hat{y} = \alpha_1 + \alpha_2 x$.

The model will be fit by determining the values for the intercept $\alpha_1$ and slope $\alpha_2$ that minimize the sum of the squared distance between the predicted $\hat{y}$ and actual $y$ responses given the dose $x$ for the thirteen dose-response pairs of data:

$\quad \textrm{L2 loss: } \sum (\hat{y} - y)^2$
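A sketch of fitting the model by minimizing the L2 loss with Nelder-Mead (the model function name and the starting coefficient values are assumptions; x and y are the dose and response vectors above):

```julia
using Optim

yhat(a, x) = a[1] .+ a[2] .* x                      # linear model: intercept + slope
L2(a) = sum((yhat(a, x) .- y) .^ 2)                 # sum of squared deviations

a2 = Optim.minimizer(optimize(L2, [0.0, 1.0]))      # Nelder-Mead by default
```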

Why is the L2 loss function being used? Might it be better to try to minimize the sum of the absolute deviations instead of the square of the deviations?

$\quad \textrm{L1 loss: } \sum \bigl| \hat{y} - y \bigr|$
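The same fit using the L1 loss, for comparison (a sketch, assuming the yhat, x, and y defined above):

```julia
L1(a) = sum(abs.(yhat(a, x) .- y))                  # sum of absolute deviations

a1 = Optim.minimizer(optimize(L1, [0.0, 1.0]))
```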

MSE vs. MAD

L2 loss is the only loss function for which minimizing the loss (minimum squared error, MSE) yields the mean of the data; most importantly, any change in the data is reflected in a change to the mean, and so the mean allows full information recovery. Minimizing the L1 loss (minimum absolute deviation, MAD) corresponds to the median of the data and does not allow full information recovery, which in some cases is a good feature because it makes the estimate more robust with respect to outliers.

[Figure: MSE vs. MAD]

4. DataFrames

The DataFrames package can be used to work with structured tabular (i.e., row-column) data. To create a single-row table:
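For example (the column names and values are assumptions based on the dose-response data):

```julia
using DataFrames

df = DataFrame(Dose = 121, Response = 237)
```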

To add a row to the table:
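```julia
push!(df, (143, 214))      # append a row as a tuple
```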

To create a two-row table:
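```julia
df2 = DataFrame(Dose = [121, 143], Response = [237, 214])
```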

To access portions of a table:
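A few common forms (a sketch, using the df2 table above):

```julia
df2.Dose            # a column, by name
df2[1, :]           # the first row
df2[1, :Response]   # a single element
df2[1:2, [:Dose]]   # a sub-table of rows 1–2 and the Dose column
```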

Reading Data from a CSV File

Typically, large tabular datasets are available in CSV (comma-separated values) format. As an example, the Dose-Response example data is available in the CSV file _NLOpt-2-Data.csv, listed below (reading it into a DataFrame is sketched after the listing):

Dose,Response
121,237
143,214
63,161
80,146
198,331
52,159
160,423
162,332
85,139
106,327
129,152
141,98
106,116
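A sketch of reading the file into a DataFrame (assuming the file is in the working directory):

```julia
using CSV, DataFrames

df = CSV.read("_NLOpt-2-Data.csv", DataFrame)
x, y = df.Dose, df.Response       # dose and response as vectors
```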