Regression analysis




Release: 4.6  •  5.0  •  5.1  •  5.2  •  5.3  •  5.4  •  6.0  •  6.1  •  6.2  •  6.3  •  6.4  •  6.5


Regression is a widely used statistical method to estimate the effects of a set of inputs (independent variables) on an output (the dependent variable). It is a powerful method to estimate the sensitivity of the output to a set of uncertain inputs. Like the rank correlation used in importance analysis, it is a global measure of sensitivity, in that it averages the sensitivity over the joint distribution of the inputs. This is unlike Tornado analysis, which is local: it varies each variable one at a time, leaving all others fixed at a nominal value.

Regression(y, b, i, k)

Generalized linear regression. Finds the best-fit (least squared error) curve to a set of data points. Regression finds the parameters «ak» in an equation of the form:

[math]\displaystyle{ y=\sum_{k} a_{k} b_{k}(\bar x) }[/math]

The data points are contained in «y» (the dependent variable) and «b» (the independent variables), both of which must be indexed by «i». «b» is the basis set and is indexed by «i» and «k». The function returns the set of parameters «ak» indexed by «k». Any data point having «y» = Null is ignored. The «k» index can be omitted when there is only a single basis term, or when a list of expressions is provided for «b».

With the generalized form of linear regression, it is possible to have several independent variables, and your basis set might even contain non-linear transformations of your independent variables. Regression can be used to find the best-fit planes or hyperplanes, best-fit polynomials, and more complicated functions.
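For instance, here is a minimal sketch of a basis that mixes a linear term, a quadratic term, and a logarithmic transformation of a single input x. The index name K and the data variables x, y and i follow the pattern of the examples below; the particular basis terms are just hypothetical choices.

Index K := ['linear', 'quadratic', 'log']
Variable b := Array(K, [x, x^2, Ln(x)])
Regression(y, b, i, K)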

Regression uses a state-of-the-art algorithm based on singular-value decomposition that is numerically stable, even if the basis set contains redundant terms.

Bias term

Sometimes it is convenient to include a bias term outside of the sum

[math]\displaystyle{ y=a_0 + \sum_{k} a_k b_k(\bar x) }[/math]

This is equivalent to starting [math]\displaystyle{ k }[/math] at 0 and using [math]\displaystyle{ b_0(\bar x)=1 }[/math] for all [math]\displaystyle{ \bar x }[/math].

If the expression that calls Regression captures the second return value, the bias is returned as that second value, in which case you don't have to include a constant term in the basis. An example of an expression that captures the bias, a0, as the second return value is

Local (a, a0) := Regression( y, b, i, k );
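With the coefficients and bias captured this way, the fitted value of y at each data point is the basis-weighted sum plus the bias (the same fitted-value formula used under Residuals below):

Sum(a * b, k) + a0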

Example 1:

y = mx + y0

Suppose you have a set of (x, y) points, contained in variables x and y, both indexed by i, and you wish to find the parameters m and y0 of the best-fit line y = mx + y0. You can calculate the coefficients using

Local (m, y0) := Regression( y, x, I );

The "fit" value of y at each x is then given by

m * x + y0

When calculated in this fashion, m and y0 are local values. You'll often want to store these in separate global variables. A convenient way to do this is to create two variable nodes defined as follows

Variable y0 := ComputedBy(m)
Variable m :=
       Local a;
       (a, y0) := Regression( y, x, I )

The last line captures the offset term in the global y0, then returns the slope, a, as the final result for variable m. Instead of keeping the bias term in a separate variable, you can treat it as the coefficient of a constant basis term. For this, create an index:

Index K := ['m', 'y0']

This index defines the terms or coefficients in your basis, indicating that you want two coefficients in your result, which you’ve named 'm' and 'y0'. Next, define your basis, b, as a table indexed by k:

Variable b := Array(K, [x, 1]) →

k ▶        m        y0
           x        1

This variable b is called the basis for the regression. The x in the m column says that the m coefficient is the one associated with x, while the 1 in the y0 column specifies that the y0 coefficient is the constant term. Regression(y, b, i, k) returns the coefficients m and y0 as an array indexed by k.

Regression(y, b, i, k) →

k ▶        m           y0
        -0.1896      0.03904

Note: The data set used for this result is in the “Regression Examples” example model.

Example 2

We wish to fit the following polynomial to (x, y) data: [math]\displaystyle{ y = a_5x^5 + a_4x^4 + a_3x^3 + a_2x^2 + a_1x + a_0 }[/math]

Start by creating an index, K, that defines the terms for this basis:

Index K := 5..0

Next compute the basis:

Variable b := x^K

The coefficients are obtained using Regression(y, b, i, k).

We used a very compact expression here to compute the basis, i.e., x^K. In fact, it is so compact that you can omit the extra variable for b and simply use:

Regression(y, x^K, i, k)

You can also type out each component of the basis in a table:

Variable b := Array(K, [x^5, x^4, x^3, x^2, x, 1])

Although this is entirely equivalent to the compact formula for the order-5 polynomial, it is less flexible. If you edit K to change the order of the polynomial, you may need to populate new cells in the table, whereas the x^K expression requires no modification.
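For example, a cubic fit requires only a change to the index definition; the call itself is unchanged. This is a sketch reusing the same y, x and i:

Index K := 3..0
Regression(y, x^K, i, K)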

Example 3

Find the best-fit hyperplane [math]\displaystyle{ z=c_xx+c_yy+c_0 }[/math], where your data is in variables x, y and z, all indexed by i. Once again, start by defining the basis terms:

Index K := ['cx', 'cy', 'c0']

and then the basis:

Variable b := Array(K, [x, y, 1])

Regression(z, b, i, k) returns the coefficients of the best-fit hyperplane.

Residuals

When c is the result returned by Regression, the fitted value of y for your data points is computed using

Sum(c*b, k)

and the residual is

y - Sum(c*b, k)
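If you want to keep these in your model as named results, one option is to define global variables for them. The names Fit_y and Residual here are hypothetical, and c is assumed to be the variable holding the Regression result:

Variable Fit_y := Sum(c * b, k)      { fitted value at each data point }
Variable Residual := y - Fit_y       { residual at each data point }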

This holds for any basis, even ones containing non-linear terms. To predict values on new data points, x_new, you need to compute the basis values, b_new, in the same manner that you computed b. For example, in Example 2 you would define b_new as

Variable b_new := Array(K, [x_new^5, x_new^4, x_new^3, x_new^2, x_new, 1])

Then the predicted values are

Sum(c*b_new, k)
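Since the basis in Example 2 was computed as x^K, an equivalent and more compact definition of the new basis is Variable b_new := x_new^K.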

Secondary statistics

Regression analyses often include numerous secondary statistics, such as the coefficient covariances, the R² value (a.k.a. squared correlation, or percentage of variance explained), and so on. See Regression for how to compute these, as well as for numerous other Regression-related topics.
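As one illustration, the R² value can be computed directly from the residuals described above. This is a minimal sketch, assuming the coefficients are stored in a variable c and using the hypothetical name R_squared; the Regression page describes the standard ways to obtain these statistics:

Variable R_squared := 1 - Sum((y - Sum(c*b, k))^2, i) / Sum((y - Mean(y, i))^2, i)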

More examples

The “Regression Examples” example model, included with Analytica in the “Data Analysis” example models folder, provides several examples of Regression with a range of bases, from simple linear fits to autoregressive time series.

Video tutorial

An in-depth, 1-hour tutorial video webinar on Using Regression.

See Also

