# RegressionDist

## RegressionDist(Y, B, I, K, C, S)

RegressionDist is similar to Regression(Y, B, I, K), but it returns the linear regression coefficients as a probability distribution rather than as mid values. The distribution reflects the uncertainty in the regression fit and the measurement noise. You can use the uncertain coefficients from RegressionDist to generate a predictive probability distribution on «Y» that reflects this uncertainty.

Suppose you have data where «Y» was produced as:

Y = Sum(C*B, K) + Normal(0, S)

«S» is the standard deviation of the measurement noise. You have the data («B[I, K]» and «Y[I]»), and you might or might not know «S». You perform a linear regression to obtain an estimate of «C». Because that estimate comes from a finite amount of data, it is itself uncertain. RegressionDist returns the coefficients «C» as a distribution (in Sample mode, a sample of coefficients indexed by Run and «K») that reflects the uncertainty in estimating these parameters.
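The relationship between the data, the fitted coefficients, and their uncertainty can be illustrated outside Analytica. The Python sketch below is not RegressionDist's implementation. It assumes the textbook ordinary-least-squares result that, for a known noise level S, the coefficient estimate is approximately normal with mean (BᵀB)⁻¹BᵀY and covariance S²(BᵀB)⁻¹, and it is restricted to two basis terms to stay short. The function names `fit_coefficients` and `sample_coefficients` are hypothetical.

```python
import math
import random

def fit_coefficients(y, b):
    """Least-squares fit for a two-column basis b (a list of [b0, b1] rows).

    Returns (estimate, inv), where inv is the inverse of B^T B.
    """
    s00 = sum(r[0] * r[0] for r in b)
    s01 = sum(r[0] * r[1] for r in b)
    s11 = sum(r[1] * r[1] for r in b)
    t0 = sum(r[0] * yi for r, yi in zip(b, y))
    t1 = sum(r[1] * yi for r, yi in zip(b, y))
    det = s00 * s11 - s01 * s01
    if det == 0:
        # Happens, for example, when one basis column is zero everywhere.
        raise ValueError("basis is singular")
    inv = [[s11 / det, -s01 / det], [-s01 / det, s00 / det]]
    est = [inv[0][0] * t0 + inv[0][1] * t1,
           inv[1][0] * t0 + inv[1][1] * t1]
    return est, inv

def sample_coefficients(est, inv, s, n, rng):
    """Draw n coefficient pairs from Normal(est, s^2 * inv) using a 2x2 Cholesky factor."""
    c00, c01, c11 = s * s * inv[0][0], s * s * inv[0][1], s * s * inv[1][1]
    l00 = math.sqrt(c00)
    l10 = c01 / l00
    l11 = math.sqrt(c11 - l10 * l10)
    draws = []
    for _ in range(n):
        z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
        draws.append([est[0] + l00 * z0, est[1] + l10 * z0 + l11 * z1])
    return draws

# Synthetic data generated as Y = Sum(C*B, K) + Normal(0, S).
rng = random.Random(1)
true_c = [2.0, 0.5]          # assumed "true" coefficients for the demo
s = 0.3                      # assumed known noise level
b = [[1.0, float(x)] for x in range(20)]
y = [true_c[0] * r[0] + true_c[1] * r[1] + rng.gauss(0, s) for r in b]

est, inv = fit_coefficients(y, b)
draws = sample_coefficients(est, inv, s, 1000, rng)
```

Each entry of `draws` plays the role of one Run-indexed sample of «C»: the spread across entries shrinks as more data points are added to `b` and `y`.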

## Library

Multivariate Distributions library functions (Multivariate Distributions.ana)

## Examples

If you know the noise level «S» in advance, then you can use historical data as a starting point for building a predictive model of «Y», as follows:

{ Your model of the dependent variables: }
Variable Y := your historical dependent data, indexed by I
Variable B := your historical independent data, indexed by I, K
Variable X := { indexed by K. Maybe others. Possibly uncertain }
Variable S := { the known noise level }
Chance C := RegressionDist(Y, B, I, K, S: S)
Variable Predicted_Y := Sum(C*X, K) + Normal(0, S)
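The last line of the model combines each sampled coefficient vector with «X» and adds a fresh noise draw. A minimal Python sketch of that step follows, with plain lists standing in for Analytica arrays. The names `predict_samples` and `coeff_draws` are hypothetical, and the coefficient draws are made-up numbers for illustration only.

```python
import random

def predict_samples(coeff_draws, x, s, rng):
    """For each coefficient draw, return Sum(C*X, K) plus one Normal(0, s) noise term."""
    return [sum(c * xi for c, xi in zip(draw, x)) + rng.gauss(0, s)
            for draw in coeff_draws]

# Three illustrative coefficient draws (one per sample run) and one new input X.
coeff_draws = [[2.0, 0.50], [2.1, 0.48], [1.9, 0.52]]
x = [1.0, 10.0]
predicted_y = predict_samples(coeff_draws, x, 0.3, random.Random(0))
```

The spread of `predicted_y` then comes from two sources: the variation among the coefficient draws and the added measurement noise.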

If you don't know the noise level, you need to estimate it. You need it for the Normal term of Predicted_Y anyway, and estimating it requires a regression. Since the regression result is then already available, you can pass it to RegressionDist through its optional parameters. The last three lines above become:

Variable E_C := Regression(Y, B, I, K)
Variable S := RegressionNoise(Y, B, I, K, E_C)
Chance C := RegressionDist(Y, B, I, K, E_C)
Variable Predicted_Y := Sum(C*X, K) + Normal(0, S)

If you use RegressionNoise to compute «S», use Mid(RegressionNoise(...)) for the «S» parameter. However, when computing «S» for your prediction, don't use RegressionNoise in context. Better still, if you don't know the measurement noise in advance, don't supply it as a parameter.
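To make the idea of estimating the noise level from the fit concrete, here is a small Python sketch. This page does not state the formula RegressionNoise uses, so the sketch assumes the usual residual standard deviation with n - k degrees of freedom (n data points, k coefficients). The name `estimate_noise` is hypothetical.

```python
import math

def estimate_noise(y, b, coeffs):
    """Residual standard deviation with n - k degrees of freedom (assumed estimator)."""
    n, k = len(y), len(coeffs)
    if n <= k:
        raise ValueError("need more data points than coefficients")
    # Residual = observed value minus the value predicted by the fitted coefficients.
    resid = [yi - sum(c * bij for c, bij in zip(coeffs, row))
             for yi, row in zip(y, b)]
    return math.sqrt(sum(r * r for r in resid) / (n - k))

# Three data points, two coefficients: residuals are 0, 0 and 1.
y = [1.0, 2.0, 4.0]
b = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
noise = estimate_noise(y, b, [1.0, 1.0])
print(noise)  # 1.0
```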

## Errors That Might Result

Evaluation Error in C:
Array is not symmetric in System Function Decompose.
while evaluating function Gaussian.
Call stack:
Gaussian
RegressionDist
C

Possible causes:

• One of your independent variables might be zero for every data point. As of Analytica 4.2, RegressionDist is not robust to this singularity. The singularity is a genuine problem: the mean coefficient value for that variable is undefined, and the variance of the coefficient uncertainty is infinite.
Remedy: Eliminate independent variables that are everywhere zero from the basis before calling.
• Your data (most likely in the basis) contains NaN values.
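Both causes can be checked on the raw data before the regression is run. The following Python sketch shows one way to do that on an exported basis, with a list of rows standing in for «B[I, K]». The helper name `check_basis` and the column names are hypothetical.

```python
import math

def check_basis(b, names):
    """Return (zero_columns, nan_rows) for a basis given as a list of rows.

    zero_columns: names of the columns that are zero in every row.
    nan_rows: positions of the rows that contain at least one NaN.
    """
    zero_columns = [name for j, name in enumerate(names)
                    if all(row[j] == 0 for row in b)]
    nan_rows = [i for i, row in enumerate(b)
                if any(math.isnan(v) for v in row)]
    return zero_columns, nan_rows

basis = [[1.0, 0.0, 3.0],
         [1.0, 0.0, float("nan")],
         [1.0, 0.0, 5.0]]
zero_columns, nan_rows = check_basis(basis, ["const", "unused", "x"])
print(zero_columns, nan_rows)  # ['unused'] [1]
```

Columns reported in `zero_columns` are the ones to drop from the basis, and rows reported in `nan_rows` need to be repaired or excluded.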