Difference between revisions of "Kernel Density Smoothing"

Revision as of 20:22, 29 June 2011

In Analytica 4.4, if you go to the Uncertainty Options dialog (an option under the Results drop down menu), and select Probability Density from the Analysis option drop down menu, you will see the Probability Density panel has changed. There are two new radio buttons: "Histogram" and "Smoothing."

"Histogram" will give you a Probability Density Function (PDF) graph which is a histogram, or step function, as before.

However, "Smoothing" will give you a smoothed, continuous curve for your PDF. This is a new feature in 4.4.

When you generate a random variable, either from a built-in distribution or from a sequence of calculations based on random distributions, there is an underlying theoretical PDF. Before 4.4, you could graph this PDF as a histogram and use that histogram for further calculations. The histogram gives you an indication of what the underlying PDF is, but it can be quite sensitive to your random sampling methodology.

If you want to use the random sample to get an idea of the underlying PDF, you can improve on the histogram using some kind of smoothing technique. There are various smoothing techniques you might try, but one that produces very good results in some cases is Kernel Density Smoothing, which is based on Kernel Density Estimation (KDE). Clicking the "Smoothing" radio button activates this technique.

Kernel Density Estimation is a general approach to the smoothing problem. In 4.4 we are using one variation, based on what is called the Fast Gaussian Transform. In essence, we replace each sample point x from your random sample by a smear over a normal distribution of values that sample point x might have had. This normal distribution is centered on x and has a standard deviation, or bandwidth; let's call it h. To get the KDE curve, we simply add up these normal distributions for all the sample points, scaled by the number of sample points so that the total area under the curve is 1. So, if you select "Smoothing," the curve you see plotted is this KDE curve.
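The idea of adding up one normal curve per sample point can be sketched in a few lines of Python. This is only a minimal illustration using direct summation; it is not the Fast Gaussian Transform and not Analytica's own implementation. The function name `gaussian_kde`, the sample size, the bandwidth of 0.3, and the evaluation grid are all assumptions chosen for the example.

```python
import math
import random

def gaussian_kde(samples, h, xs):
    """Evaluate a Gaussian kernel density estimate at each point in xs.

    Direct (naive) summation: the cost is len(samples) * len(xs).
    """
    n = len(samples)
    # Each kernel is a normal PDF with standard deviation h; dividing by n
    # makes the total area under the summed curve equal to 1.
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    density = []
    for x in xs:
        total = 0.0
        for s in samples:
            z = (x - s) / h
            total += math.exp(-0.5 * z * z)
        density.append(norm * total)
    return density

# Hypothetical example: 500 draws from a standard normal, bandwidth h = 0.3.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(500)]
step = 0.05
xs = [-6.0 + step * i for i in range(241)]  # grid from -6 to 6
pdf = gaussian_kde(samples, 0.3, xs)

# Riemann-sum check that the curve integrates to roughly 1.
area = sum(pdf) * step
```

The double loop is what makes the naive approach expensive when the sample is large.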

This is a Fast Gaussian Transform because if you try to calculate the sum of all these Gaussians in a naive fashion, and your sample size is large (e.g., 1,000,000) computation time can be huge. But through a trick involving such esoteric techniques as Hermite Series and Taylor Series, computation time can be reduced significantly, and that is the secret of the Fast Gaussian Transform.

You will notice that if you click on "Smoothing," the panel changes. The "Samples per PDF Step interval" text field disappears, as do the radio buttons for "Equal X axis steps" and so on. These no longer apply.

Instead, you see a drop down menu for "Smoothing factor." Above, I referred to the bandwidth h, the standard deviation shared by all the Gaussians I am adding up. Analytica 4.4 computes a default value for h. However, if you want to try different h values, the "Smoothing factor" menu allows you to adjust the system determined h value. These are the options:

maximum detail: minimum h value

medium detail: medium low h value

default: system determined h value

medium smoothing: medium high h value

maximum smoothing: maximum h value

As the bandwidth h decreases, the KDE smoothed PDF gets more sensitive to the random variation in your random sample, and can get quite wavy. As the bandwidth h increases, the KDE smoothed PDF gets less sensitive to random variation and gets a lot smoother; however, the match to the true underlying PDF may not be as good. If your underlying PDF is a simple normal distribution, a higher bandwidth h will increase the variance of the KDE curve and lower its peak.
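The effect on a normal distribution can be made precise with a short derivation. It assumes the sample is drawn from a normal distribution with mean \mu and standard deviation \sigma (symbols introduced here for illustration) and that the kernel is a Gaussian with bandwidth h. On average, the KDE curve is then the true PDF convolved with the kernel, which is again a normal PDF but with a larger variance:

```latex
\mathbb{E}\bigl[\hat f_h(x)\bigr]
  = \int \frac{1}{h\sqrt{2\pi}}\, e^{-(x-y)^2/(2h^2)}
         \,\frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y-\mu)^2/(2\sigma^2)} \, dy
  = \frac{1}{\sqrt{2\pi(\sigma^2+h^2)}}\,
    e^{-(x-\mu)^2 / \left(2(\sigma^2+h^2)\right)}

\text{variance: } \sigma^2 + h^2 > \sigma^2,
\qquad
\text{peak height: } \frac{1}{\sqrt{2\pi(\sigma^2+h^2)}} < \frac{1}{\sigma\sqrt{2\pi}}
```

For example, with \sigma = 1 and h = 0.5, the expected peak drops from about 0.399 to about 0.357.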

Determining the optimal value of the bandwidth h is a difficult problem in general. The smoothing factor is offered as a way to try different h values and judge, by eyeballing the graphs produced, what looks best.
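For readers who want a starting point outside Analytica, one widely used heuristic is Silverman's rule of thumb. The sketch below is offered only as an illustration: the text does not say how Analytica computes its default h, and the multipliers standing in for the five menu options are purely hypothetical.

```python
import random
import statistics

def silverman_bandwidth(samples):
    """Silverman's rule-of-thumb bandwidth for a Gaussian kernel.

    h = 0.9 * min(sd, IQR / 1.34) * n ** (-1/5)
    """
    n = len(samples)
    sd = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    # Fall back to the standard deviation if the IQR collapses to zero.
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-0.2)

# Hypothetical multipliers for the menu options; the factors Analytica
# actually applies are not given in the text.
FACTORS = {
    "maximum detail": 0.25,
    "medium detail": 0.5,
    "default": 1.0,
    "medium smoothing": 2.0,
    "maximum smoothing": 4.0,
}

rng = random.Random(1)
samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]
h_default = silverman_bandwidth(samples)
bandwidths = {name: f * h_default for name, f in FACTORS.items()}
```

Plotting the KDE curve for each entry in `bandwidths` reproduces the kind of eyeball comparison described above, from wavy at the low end to over-smoothed at the high end.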

One clue here: compare the KDE smoothed graph with the histogram, to determine what smoothing factor seems to smooth the original histogram best. Also, in this process, for the histogram, try different "Samples per PDF step interval" values, since the histogram's random variation is sensitive to this. Find the best fit, between combinations of the smoothing factor and the samples per PDF step interval, and that might well be your best estimate of the underlying PDF.
