Sample size calculation: Cross-sectional studies

Let us consider the estimation of sample size for a cross-sectional study.

In order to estimate the required sample size, we need to know the following:

p: The expected prevalence of the condition/health state. A prevalence of 32% may be entered either as a percentage (32) or in its decimal form (0.32), provided q and d are expressed in the same form.

q: i. When p is in percentage terms: (100-p)

    ii. When p is in decimal terms: (1-p)

d (or l): The precision of the estimate, that is, the margin of error allowed on either side of p. This could either be the relative precision, or the absolute precision. This will be discussed later in this post.

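Za [Z alpha]: The value of z from the standard normal tables for the chosen confidence level. If the values are normally distributed, then 95% of the values will fall within 1.96 (approximately 2) standard deviations of the mean; for an estimate such as a prevalence, this is 1.96 standard errors on either side of the estimate. The value of z corresponding to 95% confidence is therefore 1.96.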

The formula for estimating sample size is given as:

N = (Za)^2 * p * q / d^2

where the symbol ^ means ‘to the power of’ and * means ‘multiplied by’; that is, “Z-alpha squared into pq; upon d-square”.

Substituting the value of Za, we get:

N = (1.96)^2 * p * q / d^2

We can round off the value of Za (1.96) to 2, to obtain:

N = (2)^2 * p * q / d^2

or, N = 4pq / d^2, that is, “4 pq by d-square”.
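The formula can be sketched in a few lines of Python. This is only an illustration: the function name `sample_size`, the `total` parameter (100 for percentages, 1 for decimals) and the input values (prevalence 32%, absolute precision 5) are assumptions made for the example, not part of the formula itself.

```python
import math


def sample_size(p, d, z=1.96, total=100):
    """Sample size for estimating a prevalence: N = z^2 * p * q / d^2.

    p and d must be in the same units: percentages (total=100)
    or decimals (total=1). q is derived as total - p.
    """
    if not 0 < p < total:
        raise ValueError("p must lie strictly between 0 and total")
    if d <= 0:
        raise ValueError("d must be positive")
    q = total - p
    n = (z ** 2) * p * q / (d ** 2)
    # Round up to whole subjects; the tiny tolerance stops floating-point
    # noise (e.g. 400.0000000001) from adding a spurious extra subject.
    return math.ceil(n - 1e-9)


# Hypothetical inputs: prevalence 32%, absolute precision 5 percentage points.
print(sample_size(32, 5))                 # exact z = 1.96
print(sample_size(32, 5, z=2))            # z rounded to 2, i.e. 4pq/d^2
print(sample_size(0.32, 0.05, total=1))   # same inputs in decimal form
```

Rounding Za up to 2 gives a slightly larger (more conservative) sample size than using 1.96.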

 

Example:

I wish to conduct a cross-sectional study on awareness of Hepatitis B among school children. A literature search reveals that other investigators have reported knowledge to range from 5% to 20% among students of grades 6 through 8. What should the size of my sample be?

 

The formula requires us to input the value of d (precision). If the absolute precision is known, there is no problem. However, often we can only input a relative precision. Where do we get the value of relative precision from?

Typically, relative precision is expressed as a proportion of ‘p’. By convention, it should not exceed 20% of ‘p’.

In the above example, if ‘p’ is 20%, then ‘d’ will be (20/100)*20 = 0.2*20 = 4 percentage points {taking a relative precision of 20%}.

This means that the estimate is expected to lie within 4 percentage points of the true prevalence, with 95% confidence {‘d’ applies on either side of ‘p’: +/- 4%, that is, 16% to 24%}.

That is, by taking a relative precision of 20% of ‘p’, the study will estimate the true awareness level to within +/- 4 percentage points, provided the actual prevalence is close to 20%. If the actual prevalence is much lower than 20%, however, the margin of error from the same sample will be larger than 20% of the true value, and the estimate will be less accurate in relative terms.

Therefore, keeping the relative precision fixed (say, at 20% of ‘p’), the larger the value of ‘p’ (prevalence), the larger the absolute value of ‘d’. If the prevalence is 50%, ‘d’ (20% of ‘p’) would then be 0.2*50 = 10 (as compared to ‘d’ = 4 when ‘p’ = 20%).

The reverse is also true: the smaller the value of ‘p’, the smaller the value of ‘d’. A smaller ‘d’ implies a larger sample size. Therefore, the choice of ‘p’ is crucial. 
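The conversion from relative to absolute precision is simple arithmetic, sketched here in Python. The helper name `absolute_precision` and its `relative_pct` parameter are illustrative assumptions; the prevalences are the ones used in this post.

```python
def absolute_precision(p, relative_pct=20):
    """Convert a relative precision (a percentage of p) into an absolute d,
    expressed in the same units as p."""
    return relative_pct * p / 100


# Prevalences (in %) discussed in the text, at 20% relative precision.
for p in (5, 20, 50):
    print(f"p = {p}%  ->  d = {absolute_precision(p)}")
```

Because ‘d’ enters the formula as d^2 in the denominator, halving ‘d’ multiplies the required sample size by four.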

We can now input the values in the formula to obtain the sample size:

For the calculation we will take ‘p’ as 20 (so ‘q’ = 80) and ‘d’ as 4. This yields:

N = (4*20*80)/(4*4)

  = 400. This sample size will estimate the prevalence to within +/- 4 percentage points (16% to 24%) if the true prevalence is about 20%.

If we took ‘p’ = 5, then the sample size would be:

N = (4*5*95)/(1*1)    [‘d’ = 0.2*5 = 1]

  = 1900. This sample size will estimate the prevalence to within +/- 1 percentage point (4% to 6%) if the true prevalence is about 5%.
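One way to compare the two scenarios is to solve the same formula for ‘d’, giving d = Za * sqrt(pq/N), and ask what precision each sample size would deliver at each prevalence. The sketch below assumes Za rounded to 2 and values in percentage terms; the function name `achieved_precision` is hypothetical.

```python
import math


def achieved_precision(p, n, z=2):
    """Margin of error d (in percentage points) obtained by solving
    N = z^2 * p * q / d^2 for d, with p given as a percentage."""
    q = 100 - p
    return z * math.sqrt(p * q / n)


for n in (400, 1900):
    for true_p in (20, 5):
        d = achieved_precision(true_p, n)
        relative = 100 * d / true_p
        print(f"N = {n}, true p = {true_p}%: d = {d:.2f} points ({relative:.0f}% of p)")
```

With N = 400 and a true prevalence of 5%, the margin of error is about 2.2 percentage points, which is roughly 44% of ‘p’ rather than the intended 20%. With N = 1900, the margin stays within 20% of ‘p’ at either prevalence.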

So should I take ‘p’= 20% or ‘p’=5%?

That depends upon:

1. The location of the original study: if you are planning to conduct the study in an urban area, use the prevalence reported by studies conducted in urban areas, and likewise for rural areas.

2. The available resources (time, manpower, money, etc.). Aim for the largest feasible sample size that achieves the required precision; where hypotheses are to be tested, the size should also be adequate to yield 80% power. Do not increase the sample size beyond this unless the intention is to obtain greater precision or power. If so, please mention the same in the methodology section.

3. The results of your pilot study. If you have conducted a pilot study, the prevalence obtained from that study should be taken as ‘p’. This will be much more accurate than any other external value.

 

Note 1: If you have multiple objectives, you must calculate the required sample size for each objective, then choose the largest sample size thus obtained. This will ensure adequate power for all objectives, else the study will lack power for one or more objectives. That is, you may not be able to detect a significant result where it actually exists because you failed to include enough subjects to detect it.
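The rule in Note 1 can be sketched as follows. The objective names, prevalences and precisions below are purely hypothetical, and the helper names are illustrative; the calculation uses the 4pq/d^2 form with values in percentage terms.

```python
import math


def sample_size(p, d, z=2):
    """N = z^2 * p * q / d^2 with p and d in percentage terms, rounded up."""
    q = 100 - p
    # Small tolerance guards against floating-point noise before rounding up.
    return math.ceil(z ** 2 * p * q / d ** 2 - 1e-9)


def required_sample_size(objectives):
    """Return the per-objective sample sizes and the largest of them."""
    per_objective = {name: sample_size(p, d) for name, (p, d) in objectives.items()}
    return per_objective, max(per_objective.values())


# Hypothetical objectives: (expected prevalence %, absolute precision in points)
objectives = {
    "objective A": (20, 4),
    "objective B": (10, 2),
    "objective C": (5, 1),
}
per_objective, required = required_sample_size(objectives)
print(per_objective)
print("Sample size to use:", required)
```

The objective with the lowest expected prevalence drives the overall sample size here, because its absolute precision is the smallest.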

Note 2: It is advisable to mention a range rather than a single value for sample size. This is standard practice in the west, but not in India. A range may be obtained by calculating the sample size for different values of ‘p’.
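A range of the kind described in Note 2 could be produced as sketched below, using the 5% to 20% prevalence range from the example and a relative precision of 20% of ‘p’. The function name `sample_size_range` and the intermediate prevalences (10%, 15%) are assumptions made for illustration.

```python
import math


def sample_size_range(prevalences, relative_pct=20, z=2):
    """Sample size for each prevalence (in %), with d taken as a fixed
    percentage of p, plus the smallest and largest sizes obtained."""
    sizes = {}
    for p in prevalences:
        d = relative_pct * p / 100          # absolute precision for this p
        n = z ** 2 * p * (100 - p) / d ** 2
        sizes[p] = math.ceil(n - 1e-9)      # tolerance avoids float noise
    return sizes, (min(sizes.values()), max(sizes.values()))


sizes, (smallest, largest) = sample_size_range([5, 10, 15, 20])
for p, n in sizes.items():
    print(f"p = {p}%: N = {n}")
print(f"Sample size range: {smallest} to {largest}")
```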

 


135 thoughts on “Sample size calculation: Cross-sectional studies”

  1. Dear Dr Roopesh,

    I would like to conduct a cross-sectional study, and I am having difficulty finding the formula to calculate my sample size because the population is quite large, about 211,857. I am going to survey the knowledge, health beliefs and intentions of female adolescents towards HPV vaccination, and no previous study has been done on this topic in my country. Could you please give me some advice on this matter?

    Your help is greatly appreciated.

    Sincerely,
    Sekartaji

    • Dear Sekartaji,

      If I understand the question correctly, you want to know how to compute sample size from a population of 211,857 individuals.

      Please use the prevalence from the following (and similar) articles to estimate the required sample size using the formula for cross-sectional studies:
      https://www.ncbi.nlm.nih.gov/pubmed/24188759

      In order to obtain your sample, you might consider cluster or multi-stage sampling.

      Hope this helps.

      Regards,
      Dr. Roopesh

    • Dear David,

      It is not ethical or practical to unnecessarily inflate the sample size for any study.

      The commonest reason for wanting to do so would be to increase the power of the study to detect even minor differences of interest.

      Another reason could be the desire to capture as much variation in the population as possible. However, this could be achieved by adopting a good sampling method.

      Regards,
      Dr. Roopesh
