Uncertainty in regression results

These functions help estimate the uncertainty in the results from a regression analysis, including uncertainty in the regression coefficients and the noise. Together they are useful for generating a probability distribution that represents the uncertainty in the predictions from a regression model. When applying regression to make projections into the future based on historical data, there might be additional sources of uncertainty because the future might be different from the past. These functions estimate uncertainty due to noise and imperfect fit to the historical data. You might wish to add further uncertainty for projections into the future to reflect these additional differences.

RegressionDist(y, b, i, k)

RegressionDist() estimates the uncertainty in linear regression coefficients, returning probability distributions on them. Suppose you have data where «y» was produced as:

y = Sum(c*b, k) + Normal(0, s)

«s» is the measurement noise. You have the data b[i, k] and y[i]. You might or might not know the measurement noise «s». So you perform a linear regression to obtain an estimate of «c». Because your estimate is obtained from a finite amount of data, your estimate of «c» is itself uncertain. This function returns the coefficients «c» as a distribution (i.e., in sample mode, it returns a sampling of coefficients indexed by Run and «k»), reflecting the uncertainty in the estimation of these parameters.

Library: Multivariate Distributions

Examples: If you know the noise level «s» in advance, then you can use historical data as a starting point for building a predictive model of «y», as follows:

{ Your model of the dependent variables: }
Variable y := { your historical dependent data, indexed by «i» }
Variable b := { your historical independent data, indexed by «i» and «k» }
Variable x := { indexed by «k». Maybe others. Possibly uncertain }
Variable s := { the known noise level }
Chance c := RegressionDist(y, b, i, k)
Variable Predicted_y := Sum(c*x, k) + Normal(0, s)

If you don’t know the noise level, then you need to estimate it. You’ll need it for the Normal term of Predicted_y anyway, and estimating it requires running a regression. Since both RegressionNoise and RegressionDist accept the resulting expected coefficients as an optional parameter, you can compute them once with Regression and pass them to both calls. The last three lines above become:

Variable e_c := Regression(y, b, i, k)
Variable s := RegressionNoise(y, b, i, k, e_c)
Chance c := RegressionDist(y, b, i, k, e_c)
Variable Predicted_y := Sum(c*x, k) + Normal(0, s)

If you use RegressionNoise() to compute «s» and pass it to RegressionDist’s optional «s» parameter, use Mid(RegressionNoise(...)) so that a single deterministic noise level is supplied. Don’t take Mid of the «s» used in the Normal term of Predicted_y, however; there you want the uncertain estimate. Better still, if you don’t know the measurement noise in advance, don’t supply «s» as a parameter at all.
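If you do choose to supply «s» to RegressionDist, a sketch of the pattern is below. It assumes the optional «s» follows «c» in the parameter list, as it does for RegressionFitProb, and the variable names are illustrative:

Variable e_c := Regression(y, b, i, k)
Variable s := RegressionNoise(y, b, i, k, e_c)
Chance c := RegressionDist(y, b, i, k, e_c, Mid(s))  { deterministic noise level for the coefficient distribution }
Variable Predicted_y := Sum(c*x, k) + Normal(0, s)   { uncertain noise estimate retained in the prediction }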

RegressionFitProb(y, b, i, k, c, s)

When you’ve obtained regression coefficients «c» (indexed by «k») by calling the Regression function, this function returns the probability that a fit this poor would occur by chance, given the assumption that the data was generated by a process of the form:

y = Sum(c*b, k) + Normal(0, s)

If this result is very close to zero, it probably indicates that the assumption of linearity is bad. If it is very close to one, then it validates the assumption of linearity. See RegressionFitProb for more.

Library: Multivariate Distributions

This is not a distribution function; it does not return a sample when evaluated in sample mode. However, it complements the multivariate RegressionDist function that is also included in this library.

Example: To use it, first call the Regression function; then you must either know the measurement noise a priori or obtain an estimate of it using the RegressionNoise function.

Var e_c := Regression(y, b, i, k);
Var s := RegressionNoise(y, b, i, k, e_c);
Var PrThisPoor := RegressionFitProb(y, b, i, k, e_c, s)
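In a model, you might then use this probability to flag a questionable fit; the 1% threshold below is purely illustrative:

Var LinearityOK := PrThisPoor > 0.01  { False when a fit this poor would arise by chance less than 1% of the time, suggesting the linearity assumption is suspect }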

RegressionNoise(y, b, i, k, c)

When you have data y[i] and b[i, k], generated from an underlying model with unknown coefficients c[k] and unknown noise level «s» of the form:

y = Sum(c*b, k) + Normal(0, s)

this function computes an estimate for «s» by assuming that the sample noise is the same for each point in the data set. See RegressionNoise.

When using it in conjunction with RegressionDist, it is most efficient to provide the optional parameter «c» to both routines, where «c» is the expected value of the regression coefficients obtained from calling Regression(y, b, i, k). Doing so avoids unnecessary internal calls to the built-in Regression function.
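As an illustrative sketch of that difference (variable names are hypothetical):

{ Each call runs its own regression internally: }
Variable s1 := RegressionNoise(y, b, i, k)
Chance c1 := RegressionDist(y, b, i, k)

{ Passing the expected coefficients computed once by Regression avoids the repeated work: }
Variable e_c := Regression(y, b, i, k)
Variable s2 := RegressionNoise(y, b, i, k, e_c)
Chance c2 := RegressionDist(y, b, i, k, e_c)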

Library: Multivariate Distributions

These functions express uncertainty in the coefficients of a linear regression. If you are using results from a linear regression, you can use them to estimate uncertainty in predictive distributions.

These uncertainties reflect only the degree to which the regression model fits the observations to which it was fit. They do not reflect any possible systematic differences between the past process that generated those observations and the process generating the results being predicted, usually in the future. In this way, they are lower bounds on the true uncertainty.
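For example, when projecting into the future you might widen the predictive distribution with a judgmental term; «extra_s» below is a hypothetical, subjectively assessed allowance rather than something these functions estimate:

Variable extra_s := { your judgmental allowance for differences between the historical and future process }
Variable Future_y := Predicted_y + Normal(0, extra_s)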

See Also

