Analytica represents the uncertain value (prob value) of an uncertain quantity as a random sample from its probability distribution, an array of samples indexed by Run. When viewing an uncertain result, you can use the uncertainty view to see it as a probability density, cumulative probability distribution, or even the underlying random sample. Whether you use Monte Carlo simulation, Latin Hypercube sampling or another method, the accuracy of probabilistic results depends on the sample size. A larger sample size gives more accurate results with less random noise, but takes longer to compute and uses more memory. Here are some suggestions for how to think about the trade-offs when choosing a sample size.
To set the sample size
The default sample size is 100 (unless you've modified your preferences). You can change the SampleSize in the Uncertainty Setup dialog, available from the Result menu or control+U on the keyboard:
Choosing the Sample Size
It's usually a good idea to start with a small sample size, such as the default 100, for initial building and testing a model, so you don't have to wait for long computations. When you are ready to generate results to show a client or for a report, you may want to use a larger sample to provide greater accuracy. How large?
It depends on what you want the results for. Sometimes you just want to get a rough idea of the range, say 10th to 90th percentile of the resulting distributions. In that case, a sample of 100 to 1000 may be sufficient. Often modelers want to visualize the shape of the resulting distributions. For a given sample size, the cumulative probability distribution shows less noise than the probability density function.
If you want to see or show the shape of the density function without spurious noise due to random sampling, you can use the Smoothing option in the Probability density tab of the Uncertainty Setup dialog:
We generally recommend using the default smoothing factor. Higher levels of smoothing can give misleading results, e.g. with tails inappropriately wide.
Convergence and statistics
Even experienced risk analysts often resort to what they call "convergence" testing to decide on sample size: They run simulations with increasing sample sizes to see "when the results seem to converge" to a consistent value. Or they compare multiple runs to see how well they agree. This kind of empirical exploration of sample size can be time consuming. But, contrary to common practice, it is often unnecessary.
The key point is that the results generated by Monte Carlo simulation are a random sample from the "true" distribution, assuming all the input distributions are well-chosen. This means that you can use simple statistics to select the sample size you need, provided you can specify well-defined goals -- for example, that the true mean of the distribution should have a 95% probability of being within 1% of the estimated mean, or that the estimated median (50th percentile) should have a 95% probability of being between the estimated 49th and 51st percentiles. Below we show how to estimate the sample sizes needed to obtain results with the specified accuracy. You can also download an Analytica library with functions to help you do these calculations.
Note that these results assume you are using simple Monte Carlo simulation. Analytica also offers median and random Latin hypercube sampling as other options in the Uncertainty Setup dialog. These sometimes provide faster convergence than Monte Carlo -- i.e. they provide more accurate results for a given sample size. But, you can use the same statistical methods to estimate needed sample size, and be confident that they will be adequate whichever sample method you use.
Function library for choosing a sample size
You can download this library from here: It includes functions to estimate sample size using the three methods described below. You may include it in your model if you want to use any of them to help select a sample size for your model.
Estimating sample size for a confidence interval on the Mean
First, suppose you are primarily interested in the precision of the mean of your output variable «y». Assume you have a random sample of «m» output values generated by Monte Carlo simulation:
- [math]\displaystyle{ (y_1, y_2, y_3, …y_m) }[/math](1)
Here's how to estimate the mean and standard deviation of «y»:
- [math]\displaystyle{ \bar y=\sum_{i=1}^m \frac { y_i }{m} }[/math](2)
- [math]\displaystyle{ s^2=\sum_{i=1}^m \frac { (y_i - \bar y)^2} {m - 1} }[/math](3)
This gives us a confidence interval with confidence «α», where «c» is the deviation for the unit normal enclosing probability «α»:
- [math]\displaystyle{ \left (\bar y-c\frac {s} {\sqrt m}, \bar y + c\frac {s} {\sqrt m}\right ) }[/math](4)
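If you would like to check equations (2) through (4) numerically, the following is a minimal sketch in Python with NumPy and SciPy (not Analytica, and not part of the library mentioned above; the function name is purely illustrative):

 import numpy as np
 from scipy.stats import norm
 
 def mean_confidence_interval(y, alpha=0.95):
     """Equations (2)-(4): alpha-level confidence interval for the mean of sample y."""
     y = np.asarray(y, dtype=float)
     m = len(y)                        # sample size, equation (1)
     y_bar = y.mean()                  # equation (2)
     s = y.std(ddof=1)                 # square root of equation (3)
     c = norm.ppf(0.5 + alpha / 2)     # deviation enclosing probability alpha
     half_width = c * s / np.sqrt(m)
     return y_bar - half_width, y_bar + half_width   # equation (4)
 
 # Example: a stand-in "Monte Carlo sample" of 1000 values.
 rng = np.random.default_rng(0)
 print(mean_confidence_interval(rng.normal(100, 40, size=1000)))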
Suppose you want to estimate the mean of «y» with an «α» confidence interval smaller than «w» units wide. What sample size do you need? You need to make sure that:
- [math]\displaystyle{ w\gt 2c \frac {s}{\sqrt m} }[/math](5)
Rearranging the inequality:
- [math]\displaystyle{ m\gt \left ( \frac {2cs}{w}\right ) ^2 }[/math](6)
To use this, first make a small Monte Carlo run with, say, 10 values to get an initial estimate of the variance «s²» of «y». You can then use equation (6) to estimate how many samples are needed to reduce the confidence interval to the requisite width «w».
For example, suppose you want a 95% confidence interval for the mean that is less than 20 units wide. Suppose your initial sample of 10 gives s = 40. The deviation «c» enclosing a probability of 95% for a unit normal is about 2. Substituting these numbers into equation (6), we get:
- [math]\displaystyle{ m\gt \left ( \frac {2 \times 2 \times 40}{20} \right )^2 = 8^2 = 64 }[/math](7)
So, to get the desired precision for the mean, you should set the sample size to about 64.
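As a cross-check on this arithmetic, here is a small Python sketch of equation (6) (illustrative only; it is not the downloadable Analytica library, and the function name is hypothetical):

 from math import ceil
 from scipy.stats import norm
 
 def sample_size_for_mean(s, w, alpha=0.95):
     """Equation (6): sample size so the alpha confidence interval for the mean is narrower than w."""
     c = norm.ppf(0.5 + alpha / 2)     # about 1.96 for alpha = 0.95; the text rounds this to 2
     return ceil((2 * c * s / w) ** 2)
 
 # Worked example above: s = 40, w = 20. With c rounded to 2 this gives 64;
 # with the exact c = 1.96 it gives a slightly smaller sample size.
 print(sample_size_for_mean(s=40, w=20, alpha=0.95))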
Estimating Confidence Intervals for Fractiles
Another criterion for selecting sample size is the precision of the estimate of the median and other fractiles, or more generally, the precision of the estimated cumulative distribution. Assume that the sample «m» values of «y» are relabeled so that they are in increasing order:
- [math]\displaystyle{ y_1 \le y_2 \le \ldots \le y_m }[/math]
Let «Yp» denote the true pth fractile of the distribution, and let «c» be the deviation enclosing probability «α» of the unit normal. Then the following pair of sample values constitutes a confidence interval for «Yp» with confidence «α»:
- [math]\displaystyle{ (y_i, y_k) }[/math]
where:
- [math]\displaystyle{ i = \left \lfloor mp - c \sqrt {mp (1-p)} \right \rfloor }[/math](8)
- [math]\displaystyle{ k = \left \lceil mp + c \sqrt {mp (1-p)} \right \rceil }[/math](9)
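To illustrate how equations (8) and (9) pick out the bracketing sample values, here is a hedged Python sketch (again not Analytica; the function name and the clamping of the indices to the sample range are assumptions):

 import math
 import numpy as np
 from scipy.stats import norm
 
 def fractile_confidence_interval(y, p, alpha=0.95):
     """Equations (8)-(9): order statistics (y_i, y_k) bracketing the p-th fractile."""
     y = np.sort(np.asarray(y, dtype=float))   # relabel the sample in increasing order
     m = len(y)
     c = norm.ppf(0.5 + alpha / 2)
     i = math.floor(m * p - c * math.sqrt(m * p * (1 - p)))   # equation (8)
     k = math.ceil(m * p + c * math.sqrt(m * p * (1 - p)))    # equation (9)
     i, k = max(i, 1), min(k, m)               # keep the 1-based indices inside the sample
     return y[i - 1], y[k - 1]
 
 # Example: 95% confidence interval for the median of 1000 draws.
 rng = np.random.default_rng(0)
 print(fractile_confidence_interval(rng.normal(100, 40, size=1000), p=0.5))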
Suppose you want to achieve sufficient precision such that the «α» confidence interval for the pth fractile «Yp» is given by (yi, yk), where «yi» is an estimate of [math]\displaystyle{ Y_{p-\Delta p} }[/math] and «yk» is an estimate of [math]\displaystyle{ Y_{p+\Delta p} }[/math]. In other words, you want «[math]\displaystyle{ \alpha }[/math]» confidence of «Yp» being between the sample values used as estimates of the [math]\displaystyle{ (p-\Delta p) }[/math] and [math]\displaystyle{ (p+\Delta p) }[/math] fractiles. What sample size do you need? Ignoring the rounding, you have approximately:
- [math]\displaystyle{ i = m(p-{\Delta}p), k = m(p+{\Delta} p) }[/math](10)
Thus:
- [math]\displaystyle{ k - i = 2m{\Delta}p }[/math](11)
From equations (8) and (9) above, you have:
- [math]\displaystyle{ k - i = 2c\sqrt {mp(1-p)} }[/math](12)
Equating the two expressions for k - i, you obtain:
- [math]\displaystyle{ 2m{\Delta}p = 2c \sqrt {mp(1-p)} }[/math](13)
Squaring both sides and solving for «m» gives:
- [math]\displaystyle{ m=p(1- p) \left (\frac {c}{\Delta p} \right )^2 }[/math](14)
For example, suppose you want to be 95% confident that the estimated fractile Y.90 is between the estimated fractiles Y.85 and Y.95. So you have [math]\displaystyle{ \Delta p }[/math] = 0.05, and c ≈ 2. Substituting the numbers into equation (14), you get:
- [math]\displaystyle{ m = 0.90 \times (1 - 0.90) \times \left (\frac {2}{0.05} \right )^2 = 144 }[/math](15)
On the other hand, suppose you want the 95% confidence interval for the least precise estimated percentile (the 50th percentile) to span no more than plus or minus one estimated percentile. Then:
- [math]\displaystyle{ m = 0.5 \times (1 - 0.5) \times \left (\frac {2}{0.01} \right )^2 = 10,000 }[/math](16)
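For reference, a short Python sketch of equation (14) reproduces both of these worked examples (illustrative only, not the Analytica library; using the exact value of «c» rather than 2 gives slightly smaller sample sizes):

 from math import ceil
 from scipy.stats import norm
 
 def sample_size_for_fractile(p, dp, alpha=0.95):
     """Equation (14): sample size so the alpha confidence interval for the p-th fractile
     spans the estimated (p - dp) and (p + dp) fractiles."""
     c = norm.ppf(0.5 + alpha / 2)     # about 1.96; the text rounds this to 2
     return ceil(p * (1 - p) * (c / dp) ** 2)
 
 print(sample_size_for_fractile(p=0.90, dp=0.05))   # about 139 (144 with c = 2)
 print(sample_size_for_fractile(p=0.50, dp=0.01))   # about 9,604 (10,000 with c = 2)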
These results are completely independent of the shape of the distribution. If you find this an appropriate way to state your requirements for the precision of the estimated distribution, you can determine the sample size before doing any runs to see what sort of distribution it might be.
See Also
- SampleSize
- SampleType
- Sample
- Monte Carlo sampling
- Evaluation Modes
- Uncertainty Setup dialog
- Distribution Densities Library