Multivariate optimization

From Teachwiki
Jump to: navigation, search


In this thesis we want to continue with optimization problematic, which appears in many real situation. So in the beginning we recall optimization problem. We have to arrange our notation.


Use the following throughout the course:

  • g(x) := \triangle_x f(x) gradient of f\!
  • H(x) := \triangle_{xx} f(x) Hessian matrix of f\!
  • f_k := f(x_k ), \, g_k := g(x_k ), \, H_k := H(x_k)
  • x*\! is a local minimizer
  •  m_{k}^{L}(s)= f_{k} + s^{\top}g_{k} is linear model we will mention it later
  •  m_{k}^{Q}(s)= f_{k} + s^{\top}g_{k} + \frac1{2} s^{\top} B_k s, where B_k\! is some appropriate matrix

Recall optimization problem[edit]

f(x),\,x\in D_f, where D_f \! is domain of definition,

subject to  c_i (x) \leq b_i,\, i= 1 \dots m .

  • x=(x_1,\dots ,x_n): denote optimization variable.
  • f: {\mathbb R }^n \rightarrow {\mathbb R }: objective function.
  •  c_i: {\mathbb R }^n \rightarrow {\mathbb R},\, i= 1 \dots m : constraint functions.

Optimal solution x*\! has the smallest value of f\! among all vectors that satisfy the constraints.

Minimization X maximization[edit]

We concentrate only on the problem how to minimize our function. Why? We can suppose that we would like to solve this problem

 \max f(x), \, x \in D_{f}

subject to   c_{i}(x) \leq b_{i}, \, i=1\cdots m. If we put  g(x):= - f(x) \! and then we solve the minimization problem for function g(x)\! we have the result for maximization problem (maximum respectively minimum is in the same point).

Which method should we use[edit]

Optimization tree

We have to know which kind of problem we want to solve. We divide optimization problem into many classes according our problem looks like e.g.

  • Which kind of function we have (discreet or continuous)?
  • Can we use derivations (is our function smooth enough)?
  • Do we want to find global or local minimum?
  • Have we any constraints?

Amount of possible methods is very high. We can concentrate only on some of them, because to show all methods in one work is impossible. But in practice we should find the most optimal one. This tree can help us to find the best method.

Cholesky decomposition[edit]

In other text we will need Cholesky decomposition in our algorithms. We note only that there is good theory and in which cases we can use this method. If A\! is symmetrical and positive definite, then A\! can be decomposed as

 A = L L^\top\!

where L\! is a lower triangular matrix with strictly positive diagonal entries. This is the Cholesky decomposition. The Cholesky decomposition is unique. Cholesky factorizations for positive semidefinite matrices are not unique in general. There is an algorithm, which can compute Cholesky decomposition!

Methods and algorithms[edit]

In this part we will describe some methods. But we will talk only about some chosen examples. This will not be self-contained theory.

Generic linesearch method[edit]

This will be only basic idea the further description we will see below. We would like to reduce  f ( x_k + \alpha p_k) \! Given an initial guess x_0\!, let k :=0\! until convergence:

  1. Find a descent direction p_k\! at x_k\!.
  2. Compute a stepsize \alpha_k\! using a backtracking-Armijo (later) linesearch along p_k\!.
  3. Set x_{k+1}:= x_k + \alpha_k p_k,\! and increase a k\! by 1.

We will occupy with method how to find \alpha_k\! very briefly, but we focus on looking for descent direction p_k\!.

Backtracking linesearch[edit]

Here is the Backtracking-Armijo linesearch method to find \alpha_k\!, where \beta\ \in (0,1) \!:

1. Given  \alpha_{ \mbox{init} } > 0  \! (e.g., \alpha_{ \mbox{init} } = 1\!), let \alpha^{(0)} := \alpha_{\mbox{init}}\! and l = 0.\!
2. Until f(x_k + \alpha^{(l)} p_k) < f(x_k) + \alpha^{(l)} \beta \langle p_k,g_k \rangle\!
set \alpha^{(l+1)} = \tau \alpha^{(l)}\! ,
where \tau \in (0,1)\! (e.g., \tau = \frac1{2}\!)
and increase l\! by 1.
3. Set \alpha_k = \alpha^{(l)}\!.

We will not mention why we use this method, but we can mention that this method is very good in empirical tests.

Method of steepest descent[edit]

The search direction:


called steepest-descent direction (It follows from the theory about gradient). p_k\! is a descent direction and solves the problem

\min_{p \in \mathbb{R}^{n}} m_k^{L}(x_k + p):= f_k + g_{k}^{\top}p \mbox{  subject to  }\| p \|_{2} = \| g_k \|_{2}


Any method that uses the steepest-descent direction is a method of steepest descent.

Some properties of steepest descent:

  • From the theory follows that this method converges.
  • Many other methods resort to steepest descent in bad cases.
  • Unfortunately this method is not scale invariant, as re-scaling variables can lead to widely different directions.
  • Convergence is usually very (very!) slow (linear).
  • Numerically often not convergent at all.

On the picture on the right side we can see contours for the objective function  f(x,y) = 10(y-x^2 )^2 + (x-1)^2 \! and the iterations generated by the Generic Linesearch steepest-descent method.

More general descent methods[edit]

Because we saw that method of steepest descent have many imperfections, we will find other possibilities how to solve our problem. We will investigate a generalization of descent methods.


Let B_k\! be a symmetric, positive definite matrix, and define the search direction p_k\! so that

B_k p_k=-g_k.\!

Then p_k\! is a descent direction and solves the problem

\min_{p \in \mathbb{R}^{n}} m_k^{Q}(x_k + p):= f_k + g_{k}^{\top}p + \frac1{2} p^{\top}B_k. p\!

If the Hessian H_k\! is positive definite, and B_{k} = H_{k}\! then this method calls Newton's method. General descent methods may be viewed as scaled steepest descent. Convergence is often faster than steepest descent. These methods can be made scale invariant.

On the right picture can we notice contours for the objective function  f(x,y) = 10(y-x^2 )^2 + (x-1)^2 \! and the iterations generated by the Generic Linesearch Newton method. We see that algorithm converges really faster than in the previous case.

Modified Newton methods[edit]

If H_k\! is indefinite, it is usual to solve instead

(H_k +M_k)p_k \equiv B_k p_k = - g_k\!


  • M_k\! is chosen so that B_k = H_k +M_k\! is sufficiently positive definite.
  • M_k = 0\! when H_k\! is itself sufficiently positive definite.

Now we show some ways to find M_k\!, or replace H_k\!.

  • If H_k\! has the spectral decomposition H_k = Q_k D_k Q_{k}^\top\! then

 B_k \equiv H_k + M_k = Q_k \max( \epsilon I; |D_k | )Q_{k}^\top \!

This will shift all the insufficiently small eigenvalues by as little as needed to make the matrix positive definite.

  •  M_k = \max(0; \epsilon - \lambda_{\min} H_k )) I\! This is cheaper to compute.
  • Modified Cholesky decomposition (We attempt a Cholesky factorization of H_k,\! and alter factors if there is an evidence that the factorization will otherwise fail) : B_k \equiv H_k +M_k = L_k L_{k}^\top\!

Quasi-Newton methods[edit]

Normally we do not know matrix H_k\!. Now we show various attempts to approximate H_k\!:

  • Finite-difference approximations:

(H_k ) e_i \approx h^{-1} (g(x_k + h e_i ) - g_k ) = (B_k ) e_i

for some small scalar h > 0.\!

  • Secant approximations: try to ensure the secant condition B_{k+1} s_k = y_k \approx H_{k+1} s_k,\! where s_k = x_{k+1} - x_{k}\! and y_k = g_{k+1} - g_k\!. There are two methods, which try to solve the problem under secant condition.
1. Symmetric Rank-1 method (but may be indefinite or even fail):

 B_{k+1} = B_k + \frac{(y_k - B_k s_k) (y_k - B_k s_k )^\top}{ (y_k - B_k s_k )^\top s_k}.

2. BFGS method: (symmetric and positive definite if y^{\top}_k s_k > 0):

 B_{k+1} = B_k + \frac{y_k y^{\top}_k}{y^{\top}_k s_k} - \frac{B_k s_k s^{\top}_k B_k}{s^{\top}_k B_k s_k}.

Trust region methods[edit]

Now we focus on trust region methods. For better understanding we compare it with linear search method. We describe the basic ideas of both algorithms. We start with linear search method:

  • Pick descent direction p_k.\!
  • Pick stepsize \alpha_k\! to reduce f(x_k+\alpha p_k).\!
  • x_{k+1}:= x_k + \alpha_k p_k.\!

And now we are describing Trust-region methods

  • Pick step s_k\! to reduce model of f(x_k+s).\! We are looking only for one parameter.
  • Accept x_{k+1}:= x_k +s_k\! if decrease in model inherited by f(x_k+s_k).\!
  • Otherwise set x_{k+1}:= x_k \!, refine model. We are not able to find better approach so we have to change our model. Thus, while a linesearch method recovers from a poor step by retreating along a parametric (usually linear) curve, a trust-region method recovers by reconsidering the whole step-finding procedure.

Trust region model[edit]

To work with f(x)\! should be very complicated, for this we will apply approximation. We can use different models for f(x_k +s )\!, but in this work we will use these two, which are built with help of Taylor's polynomial:

  • linear model

 m_{k}^{L}(s)= f_{k} + s^{\top}g_{k}

  • quadratic model where B_k\! have to be symmetric

 m_{k}^{Q}(s)= f_{k} + s^{\top}g_{k} + \frac1{2} s^\top B_k s.

But if we apply these models it appears some difficulties. They may not resemble f(x_k+s)\! if s\! is large models and may be unbounded from below. Linear model is unbounded from below always unless g_k=0\! and quadratic model always if B_k\! is indefinite, possibly if B_k\! is only positive semi-definite.

The trust region[edit]

We prevent model m_k(s)\! from unboundedness by imposing a trust-region constraint \| s \|
\leq \triangle_k\! for some suitable scalar radius


\Rightarrow\! trust-region subproblem

\mbox{approx }\, \min_{s \in \mathbb{R}^n} m_k (s)
\mbox{ subject to } \| s \| \leq

in theory does not depend on norm \| \cdot \| \! in practice it might! Before we have not any constraints on our minimization problem but now we have. For simplicity, concentrate on the second-order (Newton-like) model

 m_{k}=m_{k}^{Q}(s)= f_{k} +
s^{\top}g_{k} + \frac1{2} s^\top B_k s\!

and the Euclid's region norm \| \cdot \| = \| \cdot
\|_2 \! Note:

  • B_k = H_k\! is allowed if H_k\! is positive definite.
  • Analysis for other trust-region norms simply adds extra constants in following results.

Basic trust-region method[edit]

And now we are going to describe a basic algorithm connecting with trust region method.

Given k = 0, \triangle_0 > 0\! and x_0\!, until convergence do:

Build the second-order model m(s)\! of  f(x_k + s). \! Solve the trust-region subproblem to find s_k\! for which m(s_k) < f_k \! and \|s_k \| \leq \triangle_k \! , and define

 \rho_k = \frac{f_k - f( x_k + s_k)}{f_k - m_k ( s_k)}.

If  \rho_k \geq \eta_v \! then our step is very successful ( \eta_v \in (0,1) \! ).
Set  x_{k+1}:= x_k +s_k \! and  \triangle_{k+1}:= \gamma_i \triangle_k, \! where  \gamma_i \geq 1. \!
Otherwise if \rho_k \geq s_k then our step is successful, where  0 < \eta_s \leq \eta_v < 1. \!
Set  x_{k+1}:= x_k +s_k \! and  \triangle_{k+1}:=  \triangle_k. \!
Otherwise is our model unsuccessful.
Set  x_{k+1}:= x_k  \! and  \triangle_{k+1}:= \gamma_d \triangle_k, \! where  0 < \gamma_d < 1. \!

Trust region subproblem[edit]

From the algorithm above we saw that the main important question will be how to solve trust region subproblem or how to determine s_k\! for which m(s_k) < f_k \! and \|s_k \| \leq \triangle_k \! . At this point we make only some comments about it and we will pursue it later. We want to solve new aim. It means to achieve as much reduction in the model as would an iteration of steepest descent. We introduce Cauchy point: s_{k}^c = -\alpha_{k}^c g_k\!

 \alpha_{k}^{c}=\mbox{argmin}_{\alpha>0} m_k(-\alpha g_k)
\mbox{ subject to    }\, \, \alpha \| g_k \| < \triangle_k

 = \mbox{argmin}_{0<\alpha <
\frac{\triangle_k}{\| g_k \|}} m_k(-\alpha g_k) \!

Minimize quadratic on line segment \Rightarrow
\! very easy! We shall require that

 m_k (s_{k}) \leq m_k (s_{k}^c) \mbox{ and }  \| s_k \| < \triangle_k\!

In practice we can do far better than this.



We have a function f(x)=x_1 +x_1 x_2 + (1 + x_2)^2. \! The contours of the original function are shown as dotted line and the contours of trust-region model appear as solid lines. The trust region is black circle. The first plot shows quadratic model with positive definite Hessian. We see that in this case the model fits the function very good. It means the minimum of the model is very near to minimum of our function. The second plot shows linear model about the same point. The third plot shows a quadratic model with indefinite Hessian. The final plot is a quadratic model with positive definite Hessian whose minimizer lies outside the trust region.

Some theoretical results[edit]

We can guarantee reasonable reduction in the model at the Cauchy point.

Theorem 1.: If m_k (s)\! is the second-order model and s_{k}^c\! is its Cauchy point within the trust-region \| s \| \leq \triangle_k, \!

 f_k - m_k(s_{k}^c) \geq \frac1{2} \| g_k \| \min \left[
\frac{\| g_k \|}{1 + \| B_k \|},\triangle_k\right].\!

Since our step s_k\! does at least as well as Cauchy point (we choose it so) we get this consequence.

Corollary 2.: If m_k (s)\! is the second-order model, and s_{k}\! is an improvement on the Cauchy point within the trust-region \| s \| \leq
\triangle_k, \!

 f_k - m_k(s_{k}) \geq \frac1{2}
\| g_k \| \min \left[ \frac{\| g_k \|}{1 + \| B_k

Lemma 3.: Suppose that f \in C^2\! and that the true and model Hessians satisfy the bounds \| H(x) \|
\leq \kappa_h\! for all x\! and \|
B_k \| \leq \kappa_b\! for all k\! and some \kappa_h \geq 1\! and \kappa_b \geq
0.\! Then

 |f(x_k + s_k)- m_k (s_k ) | \leq
\kappa_d \triangle_{k}^{2} ,\!

where  \kappa_d = \frac{1}{2} ( \kappa_b + \kappa_h),\! for all k\!.

This lemma shows the maximal difference between model and function. Since we are using a second-order model for which the first two terms are exactly from the Taylor's approximation this imply that difference between model and function will vary like the square of the norm of s_k\!. So this result is consequence of theory about Taylor's polynomial.

Lemma 4.: Suppose that f \in C^2\! and that the true and model Hessians satisfy the bounds \| H(x) \|
\leq \kappa_h\! for all x\! and \|
B_k \| \leq \kappa_b\! for all k\! and some \kappa_h \geq 1\! and \kappa_b\geq
0\! and that  \kappa_d = \frac{1}{2} ( \kappa_b +
\kappa_h).\! Suppose furthermore that g_k \neq
0\! and that

 \triangle_k \leq  \| g_k \| \min
\left( \frac{1}{\kappa_h + \kappa_b}, \frac{1-\eta_v}{2\kappa_d}

Then iteration k\! is very successful and  \triangle_{k+1} \geq

This result is fairly intuitive, since when the radius is small the model looks more and more like its first order Taylor's expansion. From this follows that we make good progress because we approximate our function well.

Lemma 5.: Suppose that f \in C^2\! and that the true and model Hessians satisfy the bounds \| H(x) \|
\leq \kappa_h\! for all x\! and \|
B_k \| \leq \kappa_b\! for all k\! and some \kappa_h \geq 1\! and \kappa_b\geq
0\! and that  \kappa_d = \frac{1}{2} ( \kappa_b +
\kappa_h).\! Suppose furthermore that there exists a constant  \epsilon>0\! such that  \| g_k
\| \geq \epsilon \! for all k\!. Then

 \triangle_k \geq \kappa_{\epsilon} := \epsilon
\gamma_{d} \min \left( \frac{1}{\kappa_h + \kappa_b},
\frac{1-\eta_v}{2\kappa_d} \right)\!

for all k\!.

This result says that the radius uniformly bounded away from zero if the same is true of the sequence of gradients (non-optimal points).

Lemma 6.: Suppose that f \in C^2\! and that both the true and model Hessian remain bounded for all k.\! Then either

 g_l = 0 \, \mbox{ for some } \, l \geq 0\!


 \lim_{k\rightarrow \infty} f_k = - \infty\!


 \lim_{k \rightarrow \infty} g_k = 0.\!

In conclusion, we have seen that trust region methods have a very rich underlying convergence. We now turn to the outstanding practical issue.

Trust region subproblem[edit]

Now we remind trust region subproblem and we will concern with it. So we want to approximately  \min_{ s \in \mathbb{R}^n} q(s)
\equiv s^\top g + \frac1{2}s^\top B s\! subject to  \| s \| \leq \triangle\! Our aim is to find s_*\! so that

 q ( s_* ) \leq q ( s^c ) \, \mbox{ and } \, \| s_* \| \leq \triangle \!

We might solve it exactly \Rightarrow\!with use of Newton-like method, or approximately \Rightarrow\! then we apply steepest descent method.

Theorem 7.: Any global minimizer s_*\! of q(s)\! subject to  \| s \| \leq
\triangle\! satisfies the equation

 (B + \lambda_* I ) s_* = -g,\!

where B+ \lambda_*
I\! is positive semi-definite, \lambda_* \geq
0\! and  \lambda_* ( \| s_* \| -
\triangle ) =0.\! If B + \lambda_* I\! is positive definite, s_*\! is unique.

These optimality conditions suggest how we might solve the problem. We divide our problem into two cases:

  1. B\! is positive-semidefinite and the solution s\! to B s = - g\! satisfies  \| s \| \leq \triangle \Rightarrow    s_* = s\! It means that we already have found the solution. We must not interest in this case further.
  2. B\! is indefinite or B s = -g \! satisfies  \| s \| > \triangle.\! In this case satisfies s_* :\! (B + \lambda_* I ) s = -g\! and  s_{*}^\top s_{*} = \triangle^2.\! We have non-linear (quadratic) system of equation in s\! and \lambda.\! We will concentrate on this case. We should mention that all these results are easy consequences of theorem 7. For instance in the second case is  \lambda_* \neq 0 \Rightarrow \| s_* \| -
\triangle  =0 \Rightarrow s_{*}^\top s_{*} = \triangle^2.\! Why is  \lambda_* \neq 0\!? In case that B\! is indefinite is everything clear. In case that B s = -g \! satisfies  \| s \| > \triangle,\! we have to find another solution then B s = - g\! (for s\!) from this follows that  \lambda_* \neq 0.\!

Suppose B\! has spectral decomposition

 B=U^T \Lambda U\!

where U\! is matrix of eigenvectors (orthonormal) and \Lambda\! diagonal matrix of ranked eigenvalues: \lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_n.\! We require B + \lambda I\! positive semi-definite \Rightarrow \lambda \geq -
\lambda_1.\! Let

 s(\lambda) = - (B + \lambda I)^{-1} g\!

From the notes before follows that we require

 \psi(\lambda ) := \| s (\lambda) \|^2 = \triangle^2.\!

Define \gamma_i = e^{T}_i U g.\! Next formula follows from linear algebra.

 \psi(\lambda ) = \| U^T (\Lambda + \lambda I)^{-1} U g \|^2 =
\sum_{i=1}^n \frac{\gamma_{i}^2 }{(\lambda_i + \lambda

If we solve the equation  \psi(\lambda )= \triangle^2\! we find the solution of trust region subproblem.

Example of solution[edit]

A plot of \psi( \lambda ) \! as \lambda\! varies from -8 to 6. Note the poles at the negatives are the eigenvalues of H.\! The dashed vertical component corresponds to interior solutions that occur for all \triangle^2\! larger than roughly 1,15, while the remaining segment indicates boundary solutions.

The secular equation[edit]

Instead of \| s ( \lambda ) \| = \triangle\! solve the secular equation

 \phi ( \lambda ) := \frac{1}{ \| s (\lambda) \|} - \frac{1}{\triangle } = 0\!

We use the secular equation because it has no poles, takes the smallest values at eigenvalues, it is an analytic function \Rightarrow\! ideal for Newton method and it guarantees global convergence but we need to safeguard to protect Newton from the interior solution cases.

Now we will concentrate on finding solution of the secular equation with use of Newton method:

Let  \lambda > - \lambda_1\! and 
\triangle > 0 \! be given.

Until convergence do:

  • Factorize B + \lambda I = L L^T.\!
  • Solve  L L^T s = -g.\!
  • Solve Lw = s\!
  • Replace \lambda\! by

 \lambda + \left( \frac{ \| s \| - \triangle}{\triangle } \right) \left( \frac{ \| s \|^2 }{ \| w \|^2 } \right)\!

Conjugate gradient method[edit]

We just introduce another algorithm to minimize q(s)\!, which is appropriate when the number of dimension of our function is high.

Given s^0 = 0,\! set  g^0 = g
,\!  d^0 = - g \! and 

until breakdown or  g^i \! small, iterate:

  •  \alpha^i := \| g^i \|^2 / \langle d^i , B d^i \rangle\!
  •  g^{i+1} := g^i + \alpha^i B d^i \!
  •  \beta^i := \| g^{i+1} \|^2 / \| g^{i} \|^2 \!
  •  d^{i+1} := -g^{i+1} + \beta^i d^i\!

and increase i\! by 1.

Breakdown is intended to cover the fatal case when 
\langle d^i, B d^i \rangle = 0\! (or in practice is close to zero), for which the iteration is undefined, and non-fatal case when  \langle d^i, B d^i \rangle < 0\! for which is  q(s)\! unbounded from below.


Maple 8 and Mathematica 4.1[edit]

This software use methods mentioned in this presentation (not only).


On the first picture have we polynomial f(x,y) = x^4 + x^2 y^2 + y^2 + x^2 \! where  x \in [-1,1], \, \, y \in [-1,1]. \! In Mathematica I used command FindMinimum and this program found the point (0,0), or the point very near from this. In Mathematica is important the point from which we start. In Maple I used command minimize (only on the region, which is shown on the picture). And this program found exact solution.


On the second picture have we function f(x,y) = e^{-x^2 } + e^{-y^2 } \! where  x \in [-2,2], \, \, y \in [-2,2]. \! Mathematica found, like the minimum, the point (0,0) with the value 2, but it started from this point. It wrote: it may be a maximum or saddle point. I tried alternative methods as well, but it stopped, because it came out of the domain of definition (it did not show any point). Maple found all correct solutions (4) on the domain of definition.


Now we have the function  f(x,y) = - \max ( |x| , |y| ) \! where x \in [-1,1], \, \, y \in [-1,1]\! When I used the same command in Mathematica like in the previous cases. It found nothing. It wrote: could not symbolically find the gradient. Maple did not find the solution as well. It wrote: location = false. But when I tested the similar cases in univariate case Maple was more successful.

In the last case have we function f(x,y) = \cos \left( \frac1{x} \right) y,\! where x \in [0,0.001], \, \, y \in [-1,0]. \! I tested this function only in Maple, but the results were surprisingly good. Maple said that the minimal value is -1 and that this is when y=-1\!, only for x\! it did not found anything, but there is infinitely many x\! where \cos(\frac1{x}) = 1.\!

Generalized linear model[edit]

Now we show other algorithm, where we will solve optimization problem and then we draw some graphs. A generalized linear model (GLM) is a regression model of the form

 EY=G(X^\top \beta )\!

where  EY\! is the expected value of the dependent variable Y,\! X\! is a matrix of explanatory variables, \beta\! unknown parameter vector, and G\! known link function. We have to find maximum of likelihood function of Y\! with respect to \beta\! for maximum likelihood estimation suppose we, in our case, that the Y\! is normal distributed and Y_i\! are i.i.d.  G\! will be in our case exponential, quadratic and linear function. It is known that the least squares estimator 
\widehat{{{\beta}}}\! in the classical linear model is also the maximum-likelihood estimator for normally distributed errors and G(x)=x\! in this model. In our case is log-likelihood function \displaystyle
n\log\left(\frac{1}{\sqrt{2\pi}\sigma}\right) -\frac
{1}{2\sigma^2} \sum_{i=1}^n (Y_i-\mu_i)^2\!

but we will use only this function

 \ell^* ({{Y}},{{\mu}}) =\sum_{i=1}^n \left(Y_i \mu_i -
\frac{\mu_{i}^{2}}{2}  \right) \!

where EY = \mu = G( X^\top \beta).\!

Let we set

  D_{{\beta}} =\frac{\partial}{\partial \beta} \ell^* ({{Y}},{{\mu}})= \sum_{i=1}^n \left( Y_i - \mu_{i}  \right) \frac{\partial}{\partial \beta} \mu_i.\!

We have to solve

 D_{{\beta}} = 0.\!

We will use Newton-Raphson algorithm. Until convergence do

\widehat{{{\beta}}}^{new} = \widehat{{{\beta}}}^{old} - \left(
H_{\widehat{{{\beta}}}^{old}} \right)^{-1}
D_{\widehat{{{\beta}}}^{old}} \!

where H\! is the Hessian of the \ell^* ({{Y}},{{\mu}})\! with respect to all components of  \beta.\! H\! is suppose to be positive definite (then algorithm converges).


In this part we show previous problematic on a pollution data. We will work with 60 observations from the years 1959 to 1961 from United States. Every observation represents one town. My source was MD BASE. The first explanatory variable describes Average January temperature in degrees of F. The second denotes population per sq. mile in urbanized areas in 1960. The third says how many percent of families has income lower then 3000 $. Dependent variable is Relative hydrocarbon pollution potential.


On Y\! axis is dependent variable and on X\! axis  X^\top b\! where X\! is the matrix of explanatory variables. Now we have 
G(x)=x\! (linear regression)  b= ( -32.447,5.6499,0.00676381,-10.2916)\! r^2\! (how good is our regression model r^2\leq 1\!) is 0.2924.

In the second case we have  G(x)=x^2\!, b=(-3.07994,0.572368,0.00131756,-1.29606)\! and r^2\! is 0.5353, so this model is better than previous.

In the last case we have  G(x)= e^x\! then b= (-3.12182,0.224662,0.000669758,-0.439031)\! and r^2\! is 0.858, so this model is the best one from all our models. It appears that this because of outliers, that is possible, but if we want to know why tends some town to be very polluted we have to use these outliers too. Of course would be better to make analysis without outliers too.


  • N.I.M. Gould, S. Leyer, An introduction to algorithms for nonlinear optimization, Frontiers in Numerical Analysis (Durham 2002)
  • S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press 2004
  • P. Lachout, Matematické programování
  • B. P. Flannery, W.H. Press, S. A. Teukolsky, W.T. Vetterling, Numerical Recipes in C,The Art of Scientific Computinion,Second Edition, Cambridge University Press 1988, 1992
  • Härdle, W., Klinke, S. and Müller, M. (2000). XploRe – Learning Guide. Springer-Verlag Berlin Heidelberg
  • Wikipedia different pages