I consider here the solution to a regression problem with linear equality constraints:

:<math>\min_a || B a - y ||^2</math>

:subject to <math>C a = d</math>

where <math>B</math> is the <math>n \times k</math> basis matrix, <math>C</math> is an <math>m \times k</math> matrix where <math>m</math> is the number of equality constraints and <math>k</math> is the number of terms, and <math>d</math> is an m-vector. The target data <math>y</math> is an n-vector (column), and the solution <math>a</math> is a k-vector (column).

When <math>B</math> and <math>C</math> are of full rank with <math>k \ge n + m</math>, a solution should pass through all points. The presence of the constraints means that, even in the full-rank case, it may require more terms than points to reach a perfect fit.

The Theil and van der Panne (1960) solution to this constrained regression is

:<math>\lambda = \left( C (B^T B)^{-1} C^T \right)^{-1} (C a_{ols} - d)</math>

:<math>a^* = a_{ols} - (B^T B)^{-1} C^T \lambda</math>

where <math>a_{ols}</math> is the unconstrained (ordinary least squares) solution. When <math>k>n</math>, <math>B^T B</math> is singular. It does not work to simply replace the inverse with a Moore-Penrose pseudoinverse in that case, because doing so selects a single solution that in general is not the optimal solution under the constraints.
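As a sanity check (not part of the derivation), here is a minimal numerical sketch of the Theil and van der Panne formulas in the well-conditioned case <math>k < n</math>, where <math>B^T B</math> is invertible. All data and names below (<code>B</code>, <code>C</code>, <code>y</code>, <code>d</code>, <code>a_ols</code>) are made-up placeholders for illustration only.

<syntaxhighlight lang="python">
# Toy demonstration of the constrained least-squares formulas above,
# assuming k < n and B^T B nonsingular. All data are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 20, 4, 2           # data points, terms, equality constraints
B = rng.normal(size=(n, k))  # basis matrix
y = rng.normal(size=n)       # target data
C = rng.normal(size=(m, k))  # constraint matrix
d = rng.normal(size=m)       # constraint right-hand side

BtB_inv = np.linalg.inv(B.T @ B)
a_ols = BtB_inv @ B.T @ y                                # unconstrained OLS solution
lam = np.linalg.solve(C @ BtB_inv @ C.T, C @ a_ols - d)  # Lagrange multipliers lambda
a_star = a_ols - BtB_inv @ C.T @ lam                     # constrained solution a*

print(np.allclose(C @ a_star, d))                        # constraints satisfied: True
</syntaxhighlight>

The solve for <math>\lambda</math> uses the <math>m \times m</math> matrix <math>C (B^T B)^{-1} C^T</math>, which is nonsingular when <math>C</math> has full row rank.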

My intention here is to document the general solution that continues to work when <math>k>n</math>, and hopefully even when <math>k+m > n</math>.

== The null space ==

Let <math>r = rank(B) \le n</math> be the rank of <math>B</math>. Then there is a subspace of <math>\Re^k</math>, called the ''null space'' of <math>B</math>, with dimension <math>k-r</math>, with the property that you can add any vector from this space to a solution for <math>a</math> and still have a solution that minimizes the loss. Denote the null-space basis as

:<math>\{ \upsilon_1,...,\upsilon_{k-r} \}</math>

where each <math>\upsilon_i</math> is a k-vector. Given any solution <math>a_{part}</math> that minimizes the loss, any choice of coefficients <math>\{\alpha_j\}</math> gives another <math>a</math> that also minimizes the loss using

:<math>a = a_{part} + \sum_{j=1}^{k-r} \alpha_j \upsilon_j</math>.
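A small numerical illustration of this solution family, with made-up data where <math>k > n</math> so that the null space is nontrivial; here <code>a_part</code> is taken to be the minimum-norm least-squares solution, used only as one convenient minimizer.

<syntaxhighlight lang="python">
# Check that adding a null-space combination to a minimizer leaves the loss unchanged.
# B and y are random placeholders with k > n, so rank(B) = n < k.
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 8
B = rng.normal(size=(n, k))
y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(B)         # full SVD: B = U @ Sigma @ Vt
r = int(np.sum(s > 1e-12))          # numerical rank of B
nullspace = Vt[r:].T                # k x (k - r) basis of the null space of B

a_part = np.linalg.pinv(B) @ y      # one particular loss minimizer
alpha = rng.normal(size=k - r)      # arbitrary coefficients alpha_j
a_alt = a_part + nullspace @ alpha  # another member of the solution family

loss = lambda a: np.sum((B @ a - y) ** 2)
print(np.isclose(loss(a_part), loss(a_alt)))   # True: same minimal loss
</syntaxhighlight>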

Denote B's SVD as

:<math>B = U \Sigma V^T</math>

so that

:<math>B^T B = (V \Sigma^T U^T) ( U \Sigma V^T) = V \Sigma^T \Sigma V^T</math>

with <math>\Sigma^T \Sigma</math> containing <math>\sigma_1^2,...,\sigma_r^2</math> on the diagonal and zeros elsewhere. <math>V</math> is orthonormal <math>(V^T V = I)</math>, so we write

:<math>a = V \beta</math> for some <math>\beta \in \Re^k</math>

:<math>B a = U \Sigma V^T ( V \beta) = U \Sigma \beta</math>

and the loss is

:<math>|| B a - y ||^2 = || \Sigma \beta - U^T y ||^2</math>
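A quick numerical check of this rotation step with toy data; it relies on the ''full'' SVD, so that <math>U</math> is square and <math>||U z|| = ||z||</math> for any <math>z</math>.

<syntaxhighlight lang="python">
# Verify ||B a - y||^2 == ||Sigma beta - U^T y||^2 for a = V beta.
# Random placeholder data; Sigma is the rectangular n x k singular value matrix.
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 8
B = rng.normal(size=(n, k))
y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(B)             # full SVD
Sigma = np.zeros((n, k))
Sigma[:len(s), :len(s)] = np.diag(s)

beta = rng.normal(size=k)               # any coefficient vector in the rotated basis
a = Vt.T @ beta                         # a = V beta

lhs = np.linalg.norm(B @ a - y) ** 2
rhs = np.linalg.norm(Sigma @ beta - U.T @ y) ** 2
print(np.isclose(lhs, rhs))             # True
</syntaxhighlight>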

Use <math>u := U^T y</math> to rotate <math>y</math> into this basis. Partition <math>\beta</math> so that the first <math>r</math> rows correspond to the nonzero singular values of <math>B</math>, and the last <math>k-r</math> rows correspond to the zero part of <math>\Sigma</math>. Denote this partition as

:<math>\beta = \left( \begin{array}{c} \beta_1 \\ \beta_2 \end{array}\right)</math>

where

* <math>\beta_1\in \Re^r</math> corresponds to the ''nonzero'' singular values <math>\sigma_1,...,\sigma_r</math>.
* <math>\beta_2\in \Re^{k-r}</math> corresponds to the zero part of <math>\Sigma</math>.

then

:<math>\Sigma \beta = \left(\begin{array}{c} D \beta_1 \\ 0\end{array}\right), \qquad u = \left(\begin{array}{c} u_1 \\ u_2\end{array}\right)</math>

where <math>D = \mathrm{diag}(\sigma_1,...,\sigma_r)</math> is the <math>r \times r</math> nonzero diagonal block of <math>\Sigma</math>, <math>u_1 \in \Re^r</math> and <math>u_2 \in \Re^{n-r}</math>.
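From these block forms and the rotated loss above, <math>|| \Sigma \beta - u ||^2 = || D \beta_1 - u_1 ||^2 + || u_2 ||^2</math>, so the <math>\beta_2</math> block never affects the loss. A small numerical check of this split, again with made-up (deliberately rank-deficient) data:

<syntaxhighlight lang="python">
# Check that the loss splits into a beta_1 term plus the constant ||u_2||^2.
# B is built rank-deficient on purpose so that r < n; all data are placeholders.
import numpy as np

rng = np.random.default_rng(3)
n, k, true_rank = 6, 8, 3
B = rng.normal(size=(n, true_rank)) @ rng.normal(size=(true_rank, k))
y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(B)            # full SVD; U is n x n, Vt is k x k
r = int(np.sum(s > 1e-10))             # numerical rank, here 3
D = np.diag(s[:r])                     # D = diag(sigma_1, ..., sigma_r)

u = U.T @ y
u1, u2 = u[:r], u[r:]                  # partition of the rotated target

beta = rng.normal(size=k)              # any coefficient vector
beta1 = beta[:r]

Sigma = np.zeros((n, k))
Sigma[:len(s), :len(s)] = np.diag(s)

full_loss = np.linalg.norm(Sigma @ beta - u) ** 2
split_loss = np.linalg.norm(D @ beta1 - u1) ** 2 + np.linalg.norm(u2) ** 2
print(np.isclose(full_loss, split_loss))   # True: beta_2 does not enter the loss
</syntaxhighlight>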