

Regression( Y,B,I,K )

Generalized Linear Regression.

  • Y : The value of the dependent (output) variable for each data point. Indexed by I.
  • B : Basis values (independent variables). Indexed by I and K.
  • I : Index for data points. Each element of I corresponds to a different data point.
  • K : Basis index. The result has one coefficient for each element of K.

Finds a set of coefficients, C, fitting the data so that Y at each data point is estimated by

Sum(C*B,K)

(see user guide for basic usage)
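
For example, a quadratic fit can be set up by using powers of an input variable as the basis. This is a minimal sketch, assuming X and Y are existing data variables indexed by I; the names Deg, B, C, and Y_fit are illustrative only:

Index Deg := [0, 1, 2]                   { one basis term per power of X }
Variable B := X^Deg                      { basis values, indexed by I and Deg }
Variable C := Regression(Y, B, I, Deg)   { fitted coefficients, indexed by Deg }
Variable Y_fit := Sum(C*B, Deg)          { predicted Y at each data point }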

Details

(The following details go beyond the basic usage covered in the user guide.)

Underconstrained Problems

When you do a regression fit, the number of data points, Size(I), should be greater than the number of basis terms, Size(K). When the number of data points is less than the number of basis terms, the problem is under-constrained. Provided that no two data points have the same basis values but different Y values, the fitted curve in an under-constrained problem passes perfectly through all data points; however, the coefficients in that case are not unique. In the under-constrained case, Analytica issues a warning, since this most likely indicates that the I and K index parameters were inadvertently swapped. If you ignore the warning, embed the call within an IgnoreWarnings function call, or have the "Show Result Warnings" preference disabled, a set of coefficients that passes through the existing data points is arbitrarily chosen and returned. The algorithm used is computationally inefficient in the under-constrained case where Size(I) << Size(K), i.e., where the number of basis terms is much larger than the number of data points. If you know your problem is highly under-constrained, then regression is probably not the method you intend to use.
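
As a minimal sketch (assuming Y, B, I, and K are defined as above), one way to guard a model against an accidentally under-constrained fit is to check the index sizes before calling Regression; returning Null when there is too little data is just one possible fallback:

Variable C_safe :=
   If Size(I) > Size(K)
   Then Regression(Y, B, I, K)
   Else Null   { too few data points for a well-constrained fit }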

Secondary Statistics

The Regression function computes the coefficients for the best-fit curve, but it does not compute secondary statistics such as parameter covariances, R-value correlation, or goodness-of-fit.

In what follows, we'll assume that Variable C is the computed regression coefficients, e.g.

Variable C := Regression(Y,B,I,K)

For each data point, the predicted expected value (from the regression) is given by

Sum( C*B, K )

However, this prediction provides only the expected value. The RegressionDist function may be used to obtain a distribution over C, and hence a probabilistic estimate of Y.
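
As a sketch of that approach (assuming RegressionDist accepts the same four parameters as Regression; see its own page for the full parameter list), the uncertain coefficients can be used directly in the prediction:

Variable C_unc := RegressionDist(Y, B, I, K)   { coefficients as an uncertain quantity }
Variable Y_unc := Sum(C_unc * B, K)            { probabilistic estimate of Y at each data point }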

The R-squared value is given by:

Correlation( Y, Sum(C*B,K), I )^2

If your basis B might contain NaN or INF values, the corresponding coefficient in C will generally be zero (reliably so in release 4.1.2). However, because 0*NaN and 0*INF are indeterminate, the expression Sum(C*B,K) will return NaN in those cases. To avoid this, use the following expression instead:

Sum( if C=0 then 0 else C*B, K )

Several of the remaining statistics require the measurement noise, S, i.e., the standard deviation of the error at each data point. If you know the measurement noise in advance, then S is given, and may (optionally) be indexed by I if the measurement noise varies by data point. If you do not know S in advance, it can be obtained from the RegressionNoise function as:

RegressionNoise(Y,B,I,K,C)

Alternatively, S may be estimated as

var y2 := Sum(C*B, K);
Sqrt( Sum( (Y - y2)^2, I ) / (Size(I) - Size(K)) )

Estimating S in either of these ways assumes that the noise level is the same for each data point.

In a generalized linear regression, the goodness of fit, or merit, is often characterized using a Chi-squared statistic, computed as:

Sum( (Y-Sum(C*B,K))^2 / S^2, I )

Denoting the above as chi2, the probability that a fit as poor as this would occur by chance is given by:

GammaI( Size(I)/2 - 1, chi2 / 2 )

This metric can be conveniently obtained using the RegressionFitProb function.
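
Putting the two expressions above together as named variables (a sketch only; chi2 and fitProb are illustrative names, and C and S are assumed to be defined as earlier in this section):

Variable chi2 := Sum( (Y - Sum(C*B, K))^2 / S^2, I )   { goodness-of-fit statistic }
Variable fitProb := GammaI( Size(I)/2 - 1, chi2/2 )    { fit probability, as given by the expression above }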

Another set of secondary statistics is the covariances of the fitted parameters. The covariance is an estimate of the amount of uncertainty in the parameter estimates given the available data. As the number of data points increases (for a given basis), the variances and covariances tend to decrease. To compute the covariances, a copy of Index K is required (since the covariance matrix is square in K); hence, you need to create a new index node defined as:

Index K2 := CopyIndex(K)

The covariances are then computed as:

Invert( Sum(B * B[K=K2] / S^2, I ), K, K2 )

The diagonal elements of this matrix give the variance in each parameter. Since there is only a finite number of samples, the parameter estimate may be off a bit due to random chance, even if the linear model assumption is correct; this variance indicates how much error exists from random chance at the given data set size.
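
Defining the result of the Invert expression above as CV_C, the per-coefficient variances and standard errors can be read off the diagonal. The subscript CV_C[K2=K] is assumed here to pick out the diagonal (which it does because K2 is a copy of K); the names C_variance and C_stdErr are illustrative:

Variable CV_C := Invert( Sum(B * B[K=K2] / S^2, I), K, K2 )
Variable C_variance := CV_C[K2=K]       { diagonal: variance of each coefficient }
Variable C_stdErr := Sqrt(C_variance)   { standard error of each coefficient }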

With S and CV_C (the covariance of the parameters C), a distribution on the predicted value of Y can be obtained for given basis values X (indexed by K, and also by I if predicting at multiple points), using:

 Variable Coef := Gaussian( C, CV_C, K, K2 )
 Variable Y_predicted := Sum(Coef * X, K ) + Normal(0,S)

The RegressionDist function returns the uncertain coefficients directly, and is the more convenient function to use when you wish to estimate the uncertainty in your predicted value.

Numeric Limitations

The Regression function is highly robust to the presence of redundant basis terms. For example, if one of the basis terms is a linear combination of a subset of the other basis terms, the coefficients are not uniquely determined. In many other implementations of regression (e.g., in other products), this can lead to numeric instabilities, with very large coefficients and losses of precision from numeric round-off. Analytica uses an SVD-based method for Regression that is extremely robust to these effects and produces good results even when basis terms are redundant.

Prior to build 4.0.0.59, this robustness held only for basis values up to about 100M, so basis values exceeding 100M should be scaled before calling Regression. Starting with build 4.0.0.59, Analytica automatically scales basis values, so large values are handled robustly as well.
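
For older builds, a minimal sketch of such pre-scaling (the factor 1e6 is illustrative; any factor that brings the basis values into range works). Dividing the returned coefficients by the same factor recovers coefficients for the original, unscaled basis:

Variable scale := 1e6                                  { illustrative scale factor }
Variable C := Regression(Y, B/scale, I, K) / scale     { coefficients for the original basis B }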

Weighted Regression

Weighted regression assigns each data point a non-negative weight; points that should not contribute at all are assigned a weight of zero. The weight, w, is therefore indexed by I.

A weighted regression can be interpreted as an indication that the noise level is not the same for every point, and that we have information about the relative noise level for each point: each point is assumed to be drawn from a linear model with zero-mean noise distributed as Normal(0, s/w_i), where s is an unknown global noise level and w_i is the weight of the ith data point. Compare this to a non-weighted regression, where all points are assumed to be measured with the same amount of noise, Normal(0, s). A point with weight 0 has infinite standard deviation, and thus contributes no usable information.

The coefficients for a weighted regression are given by

Regression( Y * w, B * w, I, K )

where Y,B,I, and K are the customary parameters to regression, and w is the relative weighting which is indexed by I.
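
For example, if you have an estimate of the relative noise standard deviation at each point (call it noiseSd, indexed by I; the name is illustrative and not part of the library), the weights follow directly from the interpretation above:

Variable w := 1 / noiseSd                    { noiseSd: assumed per-point noise standard deviation; larger weight for less noisy points }
Variable Cw := Regression( Y*w, B*w, I, K )  { weighted regression coefficients }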

Using weights of 0 and 1 makes it possible to ignore certain points. However, because 0*NaN is NaN, to ignore points where a basis term or the Y value might be NaN you need to test w explicitly:

var Y2 := if w=0 then 0 else Y*w;
var B2 := if w=0 then 0 else B*w;
Regression( Y2, B2, I, K )

Plotting Regression Lines Compared to Data

To overlay the regressed curves on the original data, a good method is to continue using a scatter plot for both data and curves. Create a variable defined as a list of identifiers, with the first identifier being the original Y-value data and the second being the fitted Y value (i.e., Sum(C*B, K)). Then plot the result as an XY plot, using your X-value variable as an XY Comparison Source. The model RegressionCompare.ana demonstrates this, with the plot shown here:

[Image: Regression comparison plot.png]
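
A minimal sketch of the comparison variable described above, assuming C holds the regression coefficients and Y and B are the original data and basis (the names Y_fit and Y_compare are illustrative):

Variable Y_fit := Sum(C*B, K)      { fitted value at each data point }
Variable Y_compare := [Y, Y_fit]   { list of identifiers: original data and fit, to plot against X }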


Comparison of Alternative Bases

Adding more basis terms to your regression model will improve the fit on your training data, but you may reach a point where the extra terms decrease the quality of predictions on new data. This phenomenon is referred to as over-fitting. As a general rule, the more data points you have, the more basis terms you can include before overfitting becomes a problem.

One approach often employed to evaluate whether the improvement justifies the additional basis terms is an F-test. The output of this test is a p-value, which gives the probability of seeing an improvement in fit equal to or exceeding the actual improvement found by the regression, under the assumption that the extra basis terms contribute no additional information. A small p-value means that there is statistically significant support for the hypothesis that the extra basis terms improve the goodness of fit. A common standard is to accept the model with more basis terms when the p-value is less than 5%.

To compute the F-test p-value, we'll use these variables:

  • Basis1 indexed by K1 and I : The smaller basis (simpler model)
  • Basis2 indexed by K2 and I : The larger basis (more complex model)
  • Y indexed by I : The observed values for the dependent variable

Then the regression coefficients for each model are:

Variable c1 := Regression(Y,Basis1,I,K1)
Variable c2 := Regression(Y,Basis2,I,K2)

The forecasted values from each model are:

Variable y1 := Sum(c1*Basis1,K1)
Variable y2 := Sum(c2*Basis2,K2)

And the sums of squared residuals are:

Variable Rss1 := Sum( (y1-Y)^2, I )
Variable Rss2 := Sum( (y2-Y)^2, I )

The F-statistic is given by:

Variable Fstat := (Rss1-Rss2)/Rss2 * (Size(I)-Size(K2))/(Size(K2)-Size(K1))

And the p-value is

Variable pValue := 1 - CumFDist(Fstat, Size(K2)-Size(K1), Size(I)-Size(K2))

Note: The CumFDist function is located in the Distribution Densities Library.
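
Following the 5% convention mentioned above, the decision can be captured directly in the model (the 0.05 threshold and the name useLargerBasis are illustrative choices, not fixed rules):

Variable useLargerBasis := pValue < 0.05   { True when the larger basis is statistically justified }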
