Selecting the Sample Size

Each probabilistic value is simulated by computing a random sample of values from the actual probability distribution.

You can control the sampling method and sample size by using Uncertainty Setup. This appendix briefly discusses how to select a sample size.

Choosing an Appropriate Sample Size

There is a clear trade-off for using a larger sample size in calculating an uncertainty variable. When you set the sample size to a large value, the result is less noisy, but it takes a longer time to compute the distribution. For an initial probabilistic calculation, a sample size of 20 to 50 is usually adequate.

How should you choose the sample size m? It depends both on the cost of each model run, and what you want the results for. An advantage of the Monte Carlo method is that you can apply many standard statistical techniques to estimate the precision of estimates of the output distribution. This is because the generated sample of values for each output variable is a random sample from the true probability distribution for that variable.

Uncertainty about the Mean

First, suppose you are primarily interested in the precision of the mean of your output variable y. Assume you have a random sample of m output values generated by Monte Carlo simulation:

(y₁, y₂, y₃, …y_m)

(1)

You can estimate the mean and standard deviation of y using the following equations:

[math]\displaystyle{ \bar y=\sum_{i=1}^m \frac { y_i }m }[/math]

(2)

[math]\displaystyle{ s^2=\sum_{i=1}^m \frac { (y_i - \bar y)^2} {m - 1} }[/math]

(3)

This leads to the following confidence interval for the mean with confidence α, where c is the deviation for the unit normal enclosing probability α:

[math]\displaystyle{ \left (\bar y-c\frac {s} {\sqrt m}, \bar y + c\frac {s} {\sqrt m}\right ) }[/math]

(4)
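As an illustration, equations (2) through (4) can be sketched in a few lines of Python. This is a minimal sketch rather than anything Analytica itself runs: the function name `mean_confidence_interval` and the test sample are hypothetical, and the exact normal deviate from the standard library is used where the text rounds c to about 2.

```python
import math
import random
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, alpha=0.95):
    """Return the alpha confidence interval for the mean, as in equation (4)."""
    m = len(sample)
    y_bar = statistics.fmean(sample)   # equation (2)
    s = statistics.stdev(sample)       # square root of equation (3), m - 1 denominator
    # c is the deviation of the unit normal enclosing probability alpha
    # (about 1.96 for alpha = 0.95; the text rounds this to 2).
    c = NormalDist().inv_cdf(0.5 + alpha / 2)
    half_width = c * s / math.sqrt(m)
    return y_bar - half_width, y_bar + half_width


# Hypothetical stand-in for a Monte Carlo output sample of size m = 64.
random.seed(0)
sample = [random.gauss(100, 40) for _ in range(64)]
low, high = mean_confidence_interval(sample)
print(f"95% confidence interval for the mean: ({low:.1f}, {high:.1f})")
```

The width of the interval shrinks in proportion to 1/√m, which is why halving the width requires roughly four times as many samples.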

Suppose you wish to obtain an estimate of the mean of y with an α confidence interval smaller than w units wide. What sample size do you need? You need to make sure that:

[math]\displaystyle{ w\gt 2c \frac {s}{\sqrt m} }[/math]

(5)

Or, rearranging the inequality:

[math]\displaystyle{ m\gt \left ( \frac {2cs}{w}\right ) ^2 }[/math]

(6)

To use this, first make a small Monte Carlo run with, say, 10 values to get an initial estimate of the variance of y, that is, s². You can then use equation (6) to estimate how many samples are needed to reduce the confidence interval to the requisite width w.

For example, suppose you wish to obtain a 95% confidence interval for the mean that is less than 20 units wide. Suppose your initial sample of 10 gives s = 40. The deviation c enclosing a probability of 95% for a unit normal is about 2. Substituting these numbers into equation (6), you get:

[math]\displaystyle{ m\gt \left ( \frac {2 \times 2 \times 40}{20} \right )^2 = 8^2 = 64 }[/math]

(7)

So, to get the required precision for the mean, you should set the sample size to about 64.
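The same calculation can be written as a small helper. This is a sketch under stated assumptions: the name `sample_size_for_mean` is hypothetical, c defaults to the rounded value 2 used in the example, and the bound from equation (6) is simply rounded up to an integer, since s is itself only a rough estimate from the pilot run.

```python
import math


def sample_size_for_mean(s, w, c=2.0):
    """Sample size suggested by equation (6), rounded up to an integer.

    s: standard deviation estimated from a small pilot run
    w: desired full width of the confidence interval for the mean
    c: unit-normal deviation enclosing the desired confidence (about 2 for 95%)
    """
    return math.ceil((2 * c * s / w) ** 2)


# Worked example from the text: s = 40, w = 20, c = 2.
print(sample_size_for_mean(40, 20))  # 64
```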

Estimating Confidence Intervals for Fractiles

Another criterion for selecting sample size is the precision of the estimate of the median and other fractiles, or more generally, the precision of the estimated cumulative distribution. Assume that the m sample values of y are relabeled so that they are in increasing order:

[math]\displaystyle{ y_1 \le y_2 \le \dots \le y_m }[/math]

Let c be the deviation enclosing probability α of the unit normal. Then the following pair of sample values constitutes the α confidence interval for the pth fractile Y_p:

(y_i, y_k)

where:

[math]\displaystyle{ i = \left \lfloor mp - c \sqrt {mp (1-p)} \right \rfloor }[/math]

(8)

[math]\displaystyle{ k = \left \lceil mp + c \sqrt {mp (1-p)} \right \rceil }[/math]

(9)

Note

The brackets in equations (8) and (9) above mean round down (⌊ ⌋) and round up (⌈ ⌉), respectively, since they are computing numbers that need to be integers.
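The index calculation can be sketched in Python as follows, taking the lower index as i = ⌊mp − c√(mp(1−p))⌋ and the upper index as k = ⌈mp + c√(mp(1−p))⌉. The function name `fractile_confidence_interval` is hypothetical, and clamping the ranks to the range 1 to m is an added assumption for small samples or extreme fractiles, which the text does not discuss.

```python
import math


def fractile_confidence_interval(sample, p, c=2.0):
    """Return (y_i, y_k), the confidence interval for the p-th fractile.

    c is the unit-normal deviation enclosing the desired confidence
    (about 2 for 95%).
    """
    ys = sorted(sample)                      # relabel so y_1 <= y_2 <= ... <= y_m
    m = len(ys)
    spread = c * math.sqrt(m * p * (1 - p))
    i = math.floor(m * p - spread)           # lower rank, rounded down
    k = math.ceil(m * p + spread)            # upper rank, rounded up
    # Assumption: clamp the 1-based ranks so they stay inside the sample.
    i = max(i, 1)
    k = min(k, m)
    return ys[i - 1], ys[k - 1]              # convert 1-based ranks to 0-based indexes


# With m = 100 and p = 0.5, the ranks are i = 40 and k = 60.
print(fractile_confidence_interval(range(1, 101), 0.5))  # (40, 60)
```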

Suppose you want to achieve sufficient precision such that the α confidence interval for the pth fractile Y_p is given by (y_i, y_k), where y_i is an estimate of [math]\displaystyle{ Y_{p-\Delta p} }[/math], and y_k is an estimate of [math]\displaystyle{ Y_{p+\Delta p} }[/math]. In other words, you want α confidence that Y_p lies between the sample values used as estimates of the [math]\displaystyle{ (p-\Delta p) }[/math] and [math]\displaystyle{ (p+\Delta p) }[/math] fractiles. What sample size do you need? Ignoring the rounding, you have approximately:

[math]\displaystyle{ i = m(p-{\Delta}p), k = m(p+{\Delta} p) }[/math]

(10)

Thus:

[math]\displaystyle{ k - i = 2m{\Delta}p }[/math]

(11)

From equations (8) and (9) above, you have:

[math]\displaystyle{ k - i = 2c\sqrt {mp(1-p)} }[/math]

(12)

Equating the two expressions for k − i, you obtain:

[math]\displaystyle{ 2m\Delta p = 2c \sqrt {mp(1-p)} }[/math]

(13)

Solving for m gives:

[math]\displaystyle{ m=p(1- p) \left (\frac {c}{\Delta p} \right )^2 }[/math]

(14)

For example, suppose you want to be 95% confident that the estimated fractile Y_0.90 is between the estimated fractiles Y_0.85 and Y_0.95. So you have p = 0.90, [math]\displaystyle{ \Delta p }[/math] = 0.05, and c ≈ 2. Substituting the numbers into equation (14), you get:

[math]\displaystyle{ m = 0.90 \times (1 - 0.90) \times \left (\frac {2}{0.05} \right )^2 = 144 }[/math]

(15)

On the other hand, suppose you want the least precisely estimated percentile (the 50th percentile) to have a 95% confidence interval of plus or minus one estimated percentile, so that p = 0.5 and [math]\displaystyle{ \Delta p }[/math] = 0.01. Then:

[math]\displaystyle{ m = 0.5 \times (1 - 0.5) \times \left (\frac {2}{0.01} \right )^2 = 10,000 }[/math]

(16)
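Both worked examples follow from one short function. This is only a sketch of equation (14): the name `sample_size_for_fractile` is hypothetical, and c defaults to the rounded value 2 used in the examples.

```python
def sample_size_for_fractile(p, delta_p, c=2.0):
    """Sample size from equation (14): m = p (1 - p) (c / delta_p)^2.

    p: the fractile of interest (for example 0.5 for the median)
    delta_p: half-width of the confidence interval, in probability units
    c: unit-normal deviation enclosing the desired confidence (about 2 for 95%)
    """
    return p * (1 - p) * (c / delta_p) ** 2


# The two examples from the text, rounded to whole samples.
print(round(sample_size_for_fractile(0.90, 0.05)))  # about 144, as in equation (15)
print(round(sample_size_for_fractile(0.50, 0.01)))  # about 10000, as in equation (16)
```

Because p(1 − p) is largest at p = 0.5, the median is the most demanding fractile for a given Δp, so sizing the sample for the median covers every other fractile at the same precision.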

These results are completely independent of the shape of the distribution. If you find this an appropriate way to state your requirements for the precision of the estimated distribution, you can determine the sample size before doing any runs to see what sort of distribution it might be.