# Logistic regression functions

You can use the functions in this section to estimate the probability (or probability distribution) of a binary or categorical dependent (output) variable as a function of known values for independent (input) variables. This is similar to linear regression, which predicts the value of a dependent variable as a function of known values for independent variables. Logistic regression is the best-known example of generalized linear regression. Technically, the term logistic regression refers to one specific form of generalized linear regression, with probit and Poisson regression being other instances. Even so, it is not uncommon to hear the term logistic regression functions used synonymously with generalized linear regression, as we have done with the title of this section.

The functions LogisticRegression() and ProbitRegression() predict the probability of a Bernoulli (i.e., 0,1-valued) random variable from a set of continuous independent variables. Both functions apply to the same scenarios and accept identical parameters; the final models differ slightly in their functional form. The function PoissonRegression predicts the probability distribution for the number of events that occur, where the dependent (output) variable is a non-negative integer.

All three functions accept the same parameters as the Regression function. As with that function, you construct a basis from your independent variables, and will usually want to include the constant term (a 1 in the basis). In addition, these functions have two more parameters, «priorType» and «priorDev», which allow you to specify a Bayesian prior.

### Bayesian priors

The regression methods in this section are highly susceptible to overfitting. The problem is particularly bad when there are few data points or many basis terms. An overfit model produces probability estimates that are too close to zero or one; in other words, its predictions are overconfident. To avoid overfitting, you will usually want to employ a Bayesian prior. You do this by specifying the «priorType» parameter, which recognizes these options:

0 = Maximum likelihood (default)
1 = Exponential L1 prior
2 = Normal L2 prior

Maximum likelihood corresponds to having no prior. The L1 and L2 priors penalize larger coefficient weights. The prior distributions on the individual coefficients are statistically independent, each having the shape of a decaying exponential function in the case of the L1 prior or of a half-normal distribution in the case of the L2 prior.

You can also optionally specify the strength of the prior using the «priorDev» parameter, which specifies the standard deviation of the marginal prior distribution on each coefficient. Cross-validation techniques vary this parameter to find the optimal prior strength for a given problem, as demonstrated in the Logistic Regression prior selection.ana example model included with Analytica in the Data Analysis example models folder. If you omit the «priorDev» parameter, the function makes a reasonable guess, which will usually be superior to maximum likelihood.
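To make the effect of a prior concrete, here is a minimal Python sketch (not Analytica code) of the kind of objective a penalized fit minimizes: the negative log-likelihood plus a penalty that depends on `prior_type` and `prior_dev`. The function names, the Laplace-style treatment of the L1 prior, and the mapping from standard deviation to penalty scale are all assumptions made for illustration; the text above does not specify how Analytica parameterizes its priors internally.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def neg_log_likelihood(coefs, basis_rows, outcomes):
    """Negative log-likelihood of 0/1 outcomes under a logistic model."""
    total = 0.0
    for row, y in zip(basis_rows, outcomes):
        z = sum(c * b for c, b in zip(coefs, row))
        # Clip to keep log() finite when the model is extremely confident.
        p = min(max(sigmoid(z), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def prior_penalty(coefs, prior_type, prior_dev):
    """Negative log-prior, up to an additive constant (assumed forms)."""
    if prior_type == 0:   # maximum likelihood: no prior
        return 0.0
    if prior_type == 1:   # L1: exponential decay in |c| (assumed Laplace form)
        scale = prior_dev / math.sqrt(2)  # Laplace std dev = sqrt(2) * scale
        return sum(abs(c) for c in coefs) / scale
    if prior_type == 2:   # L2: normal-shaped decay with std dev prior_dev
        return sum(c * c for c in coefs) / (2.0 * prior_dev ** 2)
    raise ValueError("prior_type must be 0, 1 or 2")

def objective(coefs, basis_rows, outcomes, prior_type, prior_dev):
    """Quantity a penalized fit would minimize."""
    return (neg_log_likelihood(coefs, basis_rows, outcomes)
            + prior_penalty(coefs, prior_type, prior_dev))

# Two perfectly separable (made-up) data points: basis = [1, x], outcome y.
rows, ys = [[1, 0], [1, 1]], [0, 1]
moderate, extreme = [-2.0, 4.0], [-50.0, 100.0]

# Without a prior the extreme (overconfident) coefficients score better;
# with an L2 prior the moderate coefficients score better.
ml_prefers_extreme = (objective(extreme, rows, ys, 0, 1.0)
                      < objective(moderate, rows, ys, 0, 1.0))
l2_prefers_moderate = (objective(moderate, rows, ys, 2, 1.0)
                       < objective(extreme, rows, ys, 2, 1.0))
```

With separable data like this, maximum likelihood keeps rewarding ever larger coefficients, which is the overconfidence described above. A smaller `prior_dev` strengthens the penalty and pulls the coefficients toward zero.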

## LogisticRegression(y, b, i, k, priorType, priorDev)

Logistic regression is a technique for predicting a Bernoulli (i.e., 0,1-valued) random variable from a set of continuous independent variables. See the Wikipedia article on logistic regression for a simple description. The LogisticRegression() function finds the parameters $c_k$ that fit a model of the form

$\displaystyle{ Logit(p(x)) = \sum_k c_kb_k(x) }$

where p(x) is the probability that the outcome «y» is 1 for a data point x, and $b_k(x)$ is the basis vector for that data point, indexed by «k». To understand how to put together a basis from your independent variables, read the section on the Regression function; it works exactly the same way here. Notice that the right-hand side of the Logit equation above is the same as for the standard Regression equation, but the left-hand side involves the Logit function. The inverse of the Logit function is the Sigmoid function, so once you’ve obtained the result from LogisticRegression(), you can use it to predict the probability for a new data point using

Sigmoid(Sum(c*B(x), k))

where B(x) is a user-defined function that returns the basis vector for the data point.
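The relationship between Logit and Sigmoid can be checked numerically. The following Python sketch (an illustration outside Analytica) defines both functions and applies the prediction formula to one data point. The coefficient and basis values are made up for the demonstration and are not the output of any fit.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: maps a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(coefs, basis):
    """Sigmoid of the linear predictor sum_k c_k * b_k(x)."""
    return sigmoid(sum(c * b for c, b in zip(coefs, basis)))

# Hypothetical coefficients and basis vector (constant term first).
coefs = [-1.0, 0.5, 0.25]
basis = [1.0, 2.0, 4.0]          # linear predictor = -1 + 1 + 1 = 1

p = predict_probability(coefs, basis)
recovered_score = logit(p)       # applying logit recovers the predictor
```

Because the two functions are inverses, `recovered_score` equals the linear predictor up to rounding error, which is the model equation read from right to left.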

Example: Suppose you have Height, Weight and Gender information for a set of people, with these three variables indexed by Person. A logistic regression model might estimate the probability that a given person is male based on height and weight, encoded as follows:

Index K := ['b', 'height', 'weight']
Function PersonBasis(height, weight) :=
Array(K, [1, height, weight])
Variable coef :=
LogisticRegression(Gender = 'M', PersonBasis(Height, Weight),
Person, K, priorType: 2)

With these coefficients, the probability that an 85 kg, 170 cm tall person is male is

Sigmoid(Sum(coef*PersonBasis(170, 85), k))

## ProbitRegression(y, b, i, k, priorType, priorDev)

A probit model relates a continuous vector of independent measurements to the probability of a Bernoulli (i.e., 0,1-valued) outcome. In econometrics, this model is sometimes called the Harvard model. The ProbitRegression() function finds the parameters $c_k$ that fit a model of the form

$\displaystyle{ CumNormalInv(p(x)) = \sum_k c_kb_k(x) }$

where p(x) is the probability that the outcome «y» is 1 for a data point x, and $b_k(x)$ is the basis vector for that data point, indexed by «k». To understand how to put together a basis from your independent variables, read the section on the Regression function; it works exactly the same way here. Notice that the right-hand side of the ProbitRegression equation is the same as for the standard Regression equation, but the left-hand side involves the CumNormalInv function, the inverse of the standard normal cumulative distribution function CumNormal. Once you’ve obtained the result from ProbitRegression(), you can use it to predict the probability for a new data point using

CumNormal(Sum(c*B(x), k))

where B(x) is a user-defined function that returns the basis vector for the data point.
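In a probit model the link between probability and linear predictor is the standard normal distribution: the inverse cumulative distribution function maps a probability to a real-valued score, and the cumulative distribution function maps a score back to a probability in (0, 1). The Python sketch below (an illustration outside Analytica) shows this round trip using the standard library's `statistics.NormalDist`. The helper names mirror CumNormal and CumNormalInv but are defined here, and the coefficient values are made up.

```python
from statistics import NormalDist

_std_normal = NormalDist()  # mean 0, standard deviation 1

def cum_normal(z):
    """Standard normal CDF: real-valued score -> probability."""
    return _std_normal.cdf(z)

def cum_normal_inv(p):
    """Inverse standard normal CDF: probability in (0, 1) -> score."""
    return _std_normal.inv_cdf(p)

def predict_probability(coefs, basis):
    """Probit prediction: normal CDF of the linear predictor."""
    score = sum(c * b for c, b in zip(coefs, basis))
    return cum_normal(score)

# Hypothetical coefficients and basis vector (constant term first).
coefs = [0.5, -1.0, 0.25]
basis = [1.0, 0.5, 2.0]          # linear predictor = 0.5 - 0.5 + 0.5 = 0.5

p = predict_probability(coefs, basis)
recovered_score = cum_normal_inv(p)   # approximately 0.5
```

Note that the inverse CDF only accepts arguments strictly between 0 and 1, so it cannot be applied to an arbitrary linear predictor; the prediction step must use the CDF itself.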

Example: Suppose you want to predict the probability that a particular treatment for diabetes is effective given several lab test results. Data is collected for patients who have undergone the treatment, where the variable Test_results contains lab test data, indexed by Patient_ID and Lab_test, and Treatment_effective is set to 0 or 1 depending on whether the treatment was effective or not for that patient.

Using the data directly as the regression basis, the probit regression coefficients are computed as follows:

Variable c := ProbitRegression(Treatment_effective,
Test_results, Patient_ID, Lab_test, priorType: 2)

We can obtain the predicted probability for each patient in this data set using:

Variable Prob_Effective :=
CumNormal(Sum( c*Test_results, Lab_Test ))

If we have lab tests for a new patient, say New_Patient_Tests, in the form of a vector indexed by Lab_Test, we can predict the probability that treatment will be effective using:

CumNormal(Sum(c*New_patient_tests, Lab_test))

## PoissonRegression(y, b, i, k, priorType, priorDev)

A Poisson regression model is used to predict the number of events that occur, «y», from a vector of independent data, «b», indexed by «k». The PoissonRegression() function computes the coefficients, «c», from a set of data points, («b», «y»), both indexed by «i», such that the expected number of events is predicted by this formula:

$\displaystyle{ E(Y) = exp (\sum_k c_kb_k) }$

The random component in the prediction is assumed to be Poisson-distributed, so that given a new data point «b», the distribution for that point is

Poisson(Exp(Sum(c*b, K)))
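As a numeric illustration of the two formulas above, the following Python sketch (outside Analytica) computes the expected event count from a coefficient vector and a basis vector, then evaluates the Poisson probability of each possible count. The coefficients and basis values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def expected_events(coefs, basis):
    """E(Y) = exp(sum_k c_k * b_k): the log link keeps the mean positive."""
    return math.exp(sum(c * b for c, b in zip(coefs, basis)))

def poisson_pmf(n, rate):
    """P(Y = n) for a Poisson distribution with the given mean rate."""
    # Computed in log space so large counts do not overflow.
    return math.exp(n * math.log(rate) - rate - math.lgamma(n + 1))

# Hypothetical coefficients and basis vector (constant term first).
coefs = [0.1, 0.02]
basis = [1, 30]                  # linear predictor = 0.1 + 0.6 = 0.7

rate = expected_events(coefs, basis)             # exp(0.7), about 2.01
distribution = [poisson_pmf(n, rate) for n in range(51)]
total_probability = sum(distribution)            # very close to 1
```

Because the coefficients act inside an exponential, a one-unit increase in a basis term multiplies the expected count by the exponential of its coefficient rather than adding a fixed amount.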

Example: You have data collected from surveys on how many times TV viewers were exposed to your ads in a given week, and on how many times you ran ads in each time slot on those weeks. You want to fit a model to this data so that you can predict the distribution of exposures that you can expect in the future for a given allocation of ads to each time slot.

Each data point used for training is one survey response (from one person) taken at the end of one particular week (Training_exposures indexed by Survey_response). The basis includes a constant term plus the number of times ads were run in each time slot that week (Training_basis indexed by Time_slot_k and Survey_response).

Index Time_Slot_K := [1, 'Prime time', 'Late night', 'Day time']
Variable exposure_coefs :=
PoissonRegression(Training_exposures, Training_basis,
Survey_response, Time_slot_K)

To estimate the distribution for how many times a viewer will be exposed to your ads next week if you run 30 ads in prime time, 20 in late night and 50 during the day, use

Decision AdAllocation := Table(Time_slot_K)(1, 30, 20, 50)
Chance ViewersExposed :=
Poisson(Exp(Sum(Exposure_coefs*AdAllocation, Time_slot_K)))

This example can be found in the Example Models / Data Analysis folder in the model file Poisson regression ad exposures.ana.