Analytica represents the uncertain value ([[prob value]]) of an uncertain quantity as a random sample from its probability distribution -- an array of sample values indexed by [[Run]]. You can use the [[Uncertainty view of a result|uncertainty view]] to show each uncertain quantity as a probability density function (PDF), cumulative distribution function (CDF), selected statistics, fractiles (percentiles), or even the underlying random sample. By default, it uses a sample size of 1000. A larger sample size reduces random noise in the estimated distribution, increasing the accuracy of estimates of the mean, median, and other statistics. But the computation time and memory used go up roughly linearly with the sample size.
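For example, here is a minimal sketch (the variable name <code>Revenue</code> is hypothetical): define an uncertain quantity by giving it a probability distribution, and Analytica evaluates its prob value as a sample indexed by [[Run]], which you can summarize with the usual statistical functions.

:<code>Variable Revenue := Normal(100, 20)</code> -- an uncertain quantity defined by a distribution
:<code>Sample(Revenue)</code> -- the underlying random sample, an array indexed by [[Run]]
:<code>Mean(Revenue)</code> -- the estimated mean, computed over the [[Run]] index
:<code>GetFract(Revenue, 0.9)</code> -- the estimated 90th percentile (fractile)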
  
== Changing the sample size ==
 
  
You can change the sample size from its default of 1000 in the [[Uncertainty Setup dialog]] from the [[Result menu]] or just press control+U ("U" for uncertainty):
[[File:Uncertainty setup.png|centre|frameless|388x388px]]
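The current sample size is also available in expressions as the system variable [[SampleSize]], and the system index [[Run]], which identifies the sample iterations, runs from 1 to [[SampleSize]]. For example:

:<code>SampleSize</code> -- the current setting, e.g. <code>1000</code>
:<code>Size(Run)</code> -- the same number, the length of the [[Run]] index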
  
== Sample Size and Smooth Probability Distributions ==
  
When you are first building a model, it's usually best to start with a moderate sample size, such as the default 1000 -- or even smaller if you have a large model. That way, you can test it out as you build without having to wait for long calculation times.  When you are happy with your model and ready to generate results for a client or report, you might then increase the sample size to provide greater accuracy.
  
How many samples do you need? It depends on what you want the results for. If you just want a rough idea of the range of key results, say 10th to 90th percentile, a sample of 100 to 1000 may be plenty. You can visualize distributions by selecting one of the [[Uncertainty views]] in the Result window. The cumulative probability distribution shows less noise -- i.e. roughness due to random sampling that does not reflect anything real about the actual distribution -- than the probability density function.
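For instance, to get a quick 10th to 90th percentile range for a result (a sketch; <code>Npv</code> is a hypothetical result variable):

:<code>GetFract(Npv, 0.1)</code> -- estimated 10th percentile
:<code>GetFract(Npv, 0.9)</code> -- estimated 90th percentile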
 
If you find the density view has too much noise in your initial results, you can reduce the noise -- without using a larger sample size -- with the Smoothing option in the Probability density tab of the [[Uncertainty Setup dialog]]:

[[File:Uncertainty Setup KDE.png|frameless|378x378px]]

We generally recommend using the default smoothing factor. Higher levels of smoothing may give misleading results, e.g. with inappropriately wide tails.

== Function library for choosing a sample size ==

This Analytica library includes functions to estimate the sample size that you need to estimate the mean or a fractile (percentile) of a probability distribution with a specified confidence interval. It also contains a function to create a meta sample -- that is, to rerun a Monte Carlo simulation multiple times if you want to estimate the variability of a statistic over multiple runs for a given sample size.

Download library: [[Media:Choose sample size.ANA|Choose sample size.ANA]] to help select a sample size to meet your needs.
:[[File:Diagram for model Choose sample size.ANA.png|880px]]

The next sections explain the rationale and statistics underlying the methods in this model.

=== Convergence and statistics ===

Even experienced risk analysts sometimes resort to "convergence" testing to decide on sample size: they run simulations with increasing sample sizes to see when the results seem to converge to a consistent value, or they compare multiple runs to see how well they agree. This kind of empirical exploration of sample size can be very time consuming. But it is usually unnecessary.

The key point is that the results generated by [[Monte Carlo]] simulation are a random sample from the "true" distribution, assuming all the input distributions are well-chosen. This means that you can use simple statistics to select the sample size you need, provided you can specify well-defined goals -- for example, if you want the true mean of the distribution to have a 95% probability of being within 1% of the estimated mean, or if you want to know that the estimated median (50th percentile) has a 95% probability of being between the estimated 49th and 51st percentiles. Below we show how to estimate the sample sizes needed to obtain results with the specified accuracy. You can also download an Analytica library with functions to help you do these calculations.

Note that these results assume you are using simple [[Monte Carlo]] simulation. Analytica also offers median and random [[Latin Hypercube|Latin hypercube]] sampling as other options in the [[Uncertainty Setup dialog]]. These often converge a little faster than Monte Carlo -- i.e. they provide more accurate results for a given sample size. But you can use the same statistical methods to estimate the needed sample size, and be confident that it will be adequate whichever sampling method you use.

=== Estimating sample size for a confidence interval on the Mean ===

First, suppose you are interested in the precision of the [[mean]] of a result variable «y» that your client cares about. Suppose you have a random sample of «m» values from «y» generated by [[Monte Carlo]] simulation:
  
 
:<math>(y_1, y_2, y_3, …y_m) </math><div class="floatright">(1)</div>

Here's how to estimate the mean and standard deviation of «y»:

:<math>\vec y=\sum_{i=1}^m \frac { y_i }m </math><div class="floatright">(2)</div>

:<math>s^2=\sum_{i=1}^m \frac { (y_i - \vec y)^2} {m - 1}</math><div class="floatright">(3)</div>
  
The Central Limit Theorem says that the sampling distribution of the mean tends to normal for large «m», for any distribution of «y» with finite variance. If «c» is the deviation of the unit normal that encloses probability «α», this gives us a confidence interval on the true mean with probability «α»:
  
 
:<math>\left (\vec y-c\frac {s} {\sqrt m}, \vec y + c\frac {s} {\sqrt m}\right ) </math><div class="floatright">(4)</div>
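In Analytica terms, you could compute this interval directly from the sample. This is only a sketch, where <code>Y</code> stands for your own result variable and «α» = 95%:

:<code>Var c := -CumNormalInv((1 - 0.95)/2)</code> -- ≈ 1.96, the deviation enclosing 95% of the unit normal
:<code>Var se := SDeviation(Y)/Sqrt(SampleSize)</code> -- estimated standard error of the mean
:<code>[Mean(Y) - c*se, Mean(Y) + c*se]</code> -- the approximate 95% confidence interval of equation (4)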
  
What sample size do you need to estimate the mean of «y» with an «α» confidence interval smaller than «w» units wide? You need to make sure that:
  
 
:<math>w>2c \frac {s}{\sqrt m}</math><div class="floatright">(5)</div>

Rearranging the inequality:

:<math>m> \left ( \frac {2cs}{w}\right ) ^2</math><div class="floatright">(6)</div>
  
To use this, first make a small Monte Carlo run with, say, 10 values to get an initial estimate of the variance of «y» -- that is, «s<sup>2</sup>». You can then use equation (6) to estimate how many samples you need to reduce the confidence interval to the desired width «w».
  
 
For example, suppose you want a 95% confidence interval for the mean that is less than 20 units wide. Suppose your initial sample of 10 gives ''s = 40''. The deviation «c» enclosing a probability of 95% for a unit normal is about 2. Substituting these numbers into equation (6), we get:

:<math>m> \left ( \frac {2 \times 2 \times 40}{20} \right )^2 = 8^2 = 64</math><div class="floatright">(7)</div>

So, to get the desired precision for the mean, you should set the sample size to about 64.
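You could automate this estimate with a small expression -- a sketch only, not one of the library functions above, where <code>Y</code> is your result variable evaluated with a small pilot sample:

:<code>Var c := 2</code> -- deviation enclosing about 95% of the unit normal, as in the text
:<code>Var s := SDeviation(Y)</code> -- standard deviation from the pilot run
:<code>Var w := 20</code> -- desired width of the confidence interval
:<code>Ceil((2*c*s/w)^2)</code> -- required sample size; 64 when ''s'' = 40, matching equation (7)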
  
=== Estimating Confidence Intervals for Fractiles ===
  
Another way to select a sample size is to obtain a confidence interval for the median or other fractile (percentile) of the probability distribution for an uncertain result of interest. Suppose that we label the sample «m» values of «y» so that they are in increasing order:
  
:<math> y_1 \le y_2 \le ...y_m</math>
  
Define «c» as the deviation enclosing probability «α» of the unit normal (<code>c := -[[CumNormalInv]]((1-alpha)/2)</code>). Then these two sample values enclose the confidence interval on the pth percentile «Y<sub>p</sub>»:
  
 
:<math>(y_i, y_k)</math>

where:

:<math>i = \left \lfloor mp - c \sqrt {mp (1-p)} \right \rfloor</math><div class="floatright">(8)</div><br />
  
:<math>k = \left \lceil mp + c \sqrt {mp (1-p)} \right \rceil</math><div class="floatright">(9)</div>
  
 
<Tip Title="Note"> The brackets in equations (8) and (9) above mean round up [[File:lceil.png]] and round down[[File:rfloor.png]], since they are computing numbers that need to be integers.</Tip>
 
<Tip Title="Note"> The brackets in equations (8) and (9) above mean round up [[File:lceil.png]] and round down[[File:rfloor.png]], since they are computing numbers that need to be integers.</Tip>
  
Suppose you want to achieve sufficient precision such that the «<math>\alpha</math>» confidence interval for the ''p''th fractile «Y<sub>p</sub>» is given by (y<sub>i</sub>, y<sub>k</sub>), where «y<sub>i</sub>» is an estimate of <math>Y_{p-\Delta p}</math> and «y<sub>k</sub>» is an estimate of <math>Y_{p+\Delta p}</math>. In other words, you want «<math>\alpha</math>» confidence that «Y<sub>p</sub>» lies between the sample values used as estimates of the <math>(p-\Delta p)</math> and <math>(p+\Delta p)</math> fractiles. What sample size do you need? Ignoring the rounding, you have approximately:
  
 
:<math>i = m(p-{\Delta}p), k = m(p+{\Delta} p)</math><div class="floatright">(10)</div><br />


Thus:

:<math>k - i = 2m{\Delta}p</math><div class="floatright">(11)</div>

From equations (8) and (9) above, you have:

:<math>k - i = 2c\sqrt {mp(1-p)}</math><div class="floatright">(12)</div>

Equating the two expressions for ''k - i'', you obtain:

:<math>2m\Delta p = 2c \sqrt {mp(1-p)}</math><div class="floatright">(13)</div>

:<math>m=p(1- p) \left (\frac {c}{\Delta p} \right )^2</math><div class="floatright">(14)</div>

For example, suppose you want to be 95% confident that the estimated fractile Y<sub>.90</sub> is between the estimated fractiles Y<sub>.85</sub> and Y<sub>.95</sub>. So you have <math>\Delta p</math> = 0.05, and ''c'' ≈ 2. Substituting the numbers into equation (14), you get:

:<math>m = 0.90 \times (1 - 0.90) \times \left (\frac {2}{0.05} \right )^2 = 144</math><div class="floatright">(15)</div>

On the other hand, suppose you want the credible interval for the least precise estimated percentile (the 50th percentile) to have a 95% confidence interval of plus or minus one estimated percentile. Then:

:<math>m = 0.5 \times (1 - 0.5) \times \left (\frac {2}{0.01} \right )^2 = 10,000</math><div class="floatright">(16)</div>
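As a sketch, equation (14) is simple to evaluate directly. The values below reproduce the first example; changing <code>p</code> to 0.5 and <code>dp</code> to 0.01 reproduces the second:

:<code>Var c := 2</code> -- deviation enclosing about 95% of the unit normal
:<code>Var p := 0.90</code> -- the fractile of interest
:<code>Var dp := 0.05</code> -- the acceptable error in the fractile
:<code>Ceil(p*(1 - p)*(c/dp)^2)</code> -- required sample size, 144 as in equation (15)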

These results are completely independent of the shape of the distribution. If you find this an appropriate way to state your requirements for the precision of the estimated distribution, you can determine the sample size before doing any runs to see what sort of distribution it might be.
