Sample size calculation: Cross-sectional studies

Let us consider the estimation of sample size for a cross-sectional study.

In order to estimate the required sample size, we need to know the following:

p: The prevalence of the condition/ health state. If the prevalence is 32%, it may be either used as such (32%), or in its decimal form (0.32).

q: i. When p is in percentage terms: (100-p)

    ii. When p is in decimal terms: (1-p)

d (or l): The precision of the estimate, expressed in the same units as p (percentage or decimal). This could be stated either as a relative precision or as an absolute precision; the formula itself takes the absolute value. This is discussed later in this post.

Za [Z alpha]: The value of z from the standard normal probability tables for the chosen confidence level. If the values are normally distributed, then 95% of the values will fall within 1.96 (roughly 2) standard deviations of the mean. The value of z for 95% confidence is therefore 1.96.

The formula for estimating sample size is given as:

N = [(Za)^2 * p * q] / d^2

where the symbol ^ means ‘to the power of’ and * means ‘multiplied by’; that is, “Z-alpha squared into pq; upon d-square”.

Substituting the value of Za, we get:

N = [(1.96)^2 * p * q] / d^2

We can round off the value of Za (1.96) to 2, to obtain:

N = [(2)^2 * p * q] / d^2

or, N = 4pq / d^2      that is, “4 pq by d-square”
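
The formula can be sketched in Python as follows. This is only a minimal illustration: the function name sample_size, the rounding up to the next whole subject, and the precision of 5 in the usage lines are assumptions made here, not part of the formula. ‘p’ and ‘d’ are in percentage terms, so q = 100 - p.

```python
import math

def sample_size(p, d, z=1.96):
    """Sample size for estimating a prevalence p (in %) with absolute precision d (in %)."""
    q = 100 - p                      # complement of the prevalence, in percentage terms
    n = (z ** 2) * p * q / d ** 2    # N = (Za)^2 * p * q / d^2
    return math.ceil(n)              # assumption: round up to a whole subject

# Prevalence of 32% with a hypothetical absolute precision of 5
print(sample_size(32, 5, z=2))   # Za rounded off to 2, i.e. 4pq/d^2
print(sample_size(32, 5))        # Za = 1.96
```

Rounding Za off to 2 gives a slightly larger (more conservative) sample size than using 1.96.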

 

Example:

I wish to conduct a cross-sectional study on awareness of Hepatitis B among school children. A literature search reveals that other investigators have reported knowledge to range from 5% to 20% among students of grades 6 through 8. What should the size of my sample be?

 

The formula requires us to input the value of d (precision). If the absolute precision is known, there is no problem. However, often we can only input a relative precision. Where do we get the value of relative precision from?

Typically, relative precision is taken as a proportion of ‘p’. The maximum permissible limit is 20% of ‘p’.

In the above example, if ‘p’ is 20%, then ‘d’ will be (20/100)*20= 0.2*20= 4 {Taking a relative precision of 20%}.

This means that the estimate is expected to fall within ‘d’ on either side of ‘p’ {+/- 4 percentage points: 16% to 24%}.

That is, by taking a relative precision of 20% of ‘p’, the study will be able to estimate the true awareness level to within 4 percentage points, provided the actual prevalence is close to the assumed 20%. If the actual prevalence is much lower than assumed, however, a ‘d’ of 4 will be too coarse relative to the true value, and the study will be unable to estimate it accurately.

Therefore, the larger the value of ‘p’ (prevalence), the larger the value of ‘d’ (absolute precision), keeping the relative precision fixed (say, at 20% of ‘p’). If the prevalence is 50%, ‘d’ (20% of ‘p’) would then be 0.2*50= 10 (as compared to ‘d’ = 4 when ‘p’ = 20%).

The reverse is also true: the smaller the value of ‘p’, the smaller the value of ‘d’. A smaller ‘d’ implies a larger sample size. Therefore, the choice of ‘p’ is crucial. 
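
The relationship between ‘p’, ‘d’ and sample size can be illustrated with a short Python sketch. The helper name absolute_precision is made up for this illustration; it assumes a relative precision of 20% of ‘p’, with Za rounded off to 2.

```python
def absolute_precision(p, relative_pct=20):
    """Convert a relative precision (as % of p) into an absolute d, in the same units as p."""
    return relative_pct * p / 100

for p in (50, 20, 5):
    d = absolute_precision(p)
    n = 4 * p * (100 - p) / d ** 2   # N = 4pq/d^2, with Za rounded off to 2
    print(f"p = {p}%  d = {d}  N = {n:.0f}")
```

The loop shows ‘d’ shrinking with ‘p’, and the sample size growing as ‘d’ shrinks.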

We can now input the values in the formula to obtain the sample size:

For the calculation we will take ‘d’ as 4. This yields:

N= (4*20*80)/ (4*4)

  = 400. This sample size will enable us to detect the truth if the prevalence is between 16-24% (or more) [‘p’ +/- ‘d’ = 20 +/- 4].

If we took ‘p’= 5, then the sample size would be:

N= (4*5*95)/(1*1)                                           [‘d’= 0.2*5= 1]

  = 1900. This sample size will enable us to detect the truth if the prevalence is between 4-6% (or more).
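
As a cross-check, the formula can be turned around to recover ‘d’ from a given sample size: d = Za * sqrt(pq/N). The sketch below assumes Za rounded off to 2, as in the calculations above; the function name margin_of_error is made up for this illustration.

```python
import math

def margin_of_error(p, n, z=2):
    """Half-width of the interval around p (in %) achieved with n subjects."""
    q = 100 - p
    return z * math.sqrt(p * q / n)   # d = Za * sqrt(pq/N)

for p, n in ((20, 400), (5, 1900)):
    d = margin_of_error(p, n)
    print(f"p = {p}%, N = {n}: {p - d:g}% to {p + d:g}%")
```

For the two sample sizes above this gives ‘p’ +/- ‘d’ in each case, that is, 20 +/- 4 and 5 +/- 1.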

So should I take ‘p’= 20% or ‘p’=5%?

That depends upon:

1. The location of the original study: if you are planning to conduct the study in an urban area, use the prevalence reported by studies conducted in urban areas, and vice versa.

2. The available resources (time, manpower, money, etc.). Aim for the largest feasible sample size. The size should be adequate to yield 80% power. Do not unnecessarily increase the sample size unless the intention is to obtain greater power. If so, please mention the same in the methodology section.

3. The results of your pilot study. If you have conducted a pilot study, the prevalence obtained from that study should be taken as ‘p’. This will be much more accurate than any other external value.

 

Note 1: If you have multiple objectives, you must calculate the required sample size for each objective, then choose the largest sample size thus obtained. This will ensure adequate power for all objectives, else the study will lack power for one or more objectives. That is, you may not be able to detect a significant result where it actually exists because you failed to include enough subjects to detect it.
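
Note 1 could be sketched in Python like this. The second objective and its values are entirely made up for illustration, and Za is rounded off to 2.

```python
import math

def sample_size(p, d, z=2):
    """N = (Za)^2 * p * q / d^2, with p and d in percentage terms."""
    return math.ceil(z ** 2 * p * (100 - p) / d ** 2)

# Hypothetical objectives: (assumed prevalence in %, absolute precision in %)
objectives = {
    "awareness of Hepatitis B": (20, 4),
    "a second, made-up outcome": (40, 5),
}

sizes = {name: sample_size(p, d) for name, (p, d) in objectives.items()}
required = max(sizes.values())   # the largest N covers every objective
print(sizes)
print("Required sample size:", required)
```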

Note 2: It is advisable to mention a range rather than a single value for sample size. This is standard practice in the west, but not in India. A range may be obtained by calculating the sample size for different values of ‘p’.
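
One way to obtain such a range is sketched below in Python. It assumes a relative precision of 20% of ‘p’ and Za rounded off to 2; the intermediate prevalences of 10% and 15% are picked here only to fill in the 5% to 20% range from the example.

```python
import math

def sample_size(p, relative_pct=20, z=2):
    """N for prevalence p (in %), with d taken as relative_pct % of p."""
    d = relative_pct * p / 100
    return math.ceil(z ** 2 * p * (100 - p) / d ** 2)

sizes = {p: sample_size(p) for p in (5, 10, 15, 20)}
print(sizes)
print(f"Sample size range: {min(sizes.values())} to {max(sizes.values())}")
```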

 


147 thoughts on “Sample size calculation: Cross-sectional studies”

  1. Dear Dr Roopesh,

    I would like to conduct a cross-sectional study, and I am having difficulty finding the formula to calculate my sample size because the population is quite large, about 211,857. I am going to survey the knowledge, health beliefs and intentions of female adolescents towards HPV vaccination, and no previous study has been done on this topic in my country. Could you please give me some advice on this matter?

    Your help is greatly appreciated.

    Sincerely,
    Sekartaji

    • Dear Sekartaji,

      If I understand the question correctly, you want to know how to compute sample size from a population of 211,857 individuals.

      Please use the prevalence from the following (and similar) articles to estimate the required sample size using the formula for cross-sectional studies:
      https://www.ncbi.nlm.nih.gov/pubmed/24188759

      In order to obtain your sample, you might consider cluster or multi-stage sampling.

      Hope this helps.

      Regards,
      Dr. Roopesh

    • Dear David,

      It is not ethical or practical to unnecessarily inflate the sample size for any study.

      The commonest reason for wanting to do so would be to increase the power of the study to detect even minor differences of interest.

      Another reason could be the desire to capture as much variation in the population as possible. However, this could be achieved by adopting a good sampling method.

      Regards,
      Dr. Roopesh

  2. How do I calculate the sample size for a study in which cases will be matched with controls, given that a previous study reported a prevalence of 32%?

    • Dear Achanya,

      Do you intend to have 1:1 matching, or higher?

      I hope you realize that in a case control study one is comparing proportions of outcome between cases and controls.
      Therefore, for sample size calculation, you need to provide proportions for both cases and controls.

      Regards,
      Dr. Roopesh

  3. When calculating sample size for three communities using Slovin’s formula, if you add the totals for the three (for example 1474) and calculate, you get about half the size (315), and you can then use the proportion formula to redistribute. However, if you were to calculate for each of the communities, with populations 350, 774 and 350, you get a total of 624. Now, if I am using mixed methods, what number should I interview: 315 or 624?

  4. In fact, the design is exploratory sequential, so I will do a questionnaire survey, generalise the results, and based on that select my qualitative methods (FGDs, in-depth interviews, etc.). The three communities are made up of farmers who all practice rainfed farming, but farmers from 2 of the communities also practice dry season farming because they use small-scale dams during the dry season. Again, what are my justifications for interviewing 315 and not 624? Is it okay, so that I do not incur unnecessary cost?

  5. I would like to conduct a study which hasn’t been done in my country, so how can I estimate a sample size? My study is on the influence of body mass index on liver size.
    Regards,

    • Dear Qusay,

      Even though the study hasn’t been conducted in your country, it is possible to estimate sample size.

      From literature, identify the findings reported by other investigators. They would likely have reported several measures- AP diameter/ Transverse diameter/ Volume, etc. Determine which measure is of importance to your study, and note the relationship between BMI and that specific measure.

      Identify a study that was conducted in a setting similar to your own (even if in another country, factors like setting (rural/ urban); economic status (developing/ developed); etc. could be similar).

      Then determine what proportion of subjects in that study have the relationship of interest. Use that to estimate sample size using the formula provided in the article above.

      Hope this helps.

      Regards,
      Dr. Roopesh

  6. Hi
    I am going to conduct a cross-sectional study on the prevalence of cancer in women around the age of menopause with an ovarian cyst, looking at a biochemical marker called Ca 125.
    I am still unable to calculate the sample size.

    • Dear Someone,

      Please perform a detailed review of literature and determine what proportion of perimenopausal women with ovarian cysts have elevated Ca 125 levels.

      Use that proportion to estimate sample size by substituting in the formula provided in the article above.

      If you get a range, estimate sample size using the lowest proportion, and use that to conduct your study if feasible.

      Regards,
      Dr. Roopesh
