Sample size calculation: Cross-sectional studies

Let us consider the estimation of sample size for a cross-sectional study.

In order to estimate the required sample size, we need to know the following:

p: The prevalence of the condition/ health state. If the prevalence is 32%, it may be either used as such (32%), or in its decimal form (0.32).

q: The complement of p:

    i. When p is in percentage terms: (100-p)

    ii. When p is in decimal terms: (1-p)

d (or l): The precision of the estimate. This could either be the relative precision, or the absolute precision. This will be discussed later in this post.

Za [Z alpha]: The value of z from the standard normal tables for the chosen confidence level. If the values are normally distributed, then 95% of the values will fall within 1.96 standard deviations of the mean. The value of z corresponding to a 95% confidence level is therefore 1.96.
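The z value can also be computed instead of being read from tables. The following is a minimal sketch using Python's standard library; the function name z_for_confidence is only illustrative and is not part of the original post.

```python
from statistics import NormalDist

def z_for_confidence(confidence: float) -> float:
    """Two-sided z value for a confidence level given as a fraction (e.g. 0.95)."""
    alpha = 1.0 - confidence
    # Half of alpha lies in each tail of the standard normal distribution.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for level in (0.95, 0.99, 0.999):
    print(f"{level:.1%} confidence: z = {z_for_confidence(level):.2f}")
```

For a 95% confidence level this gives 1.96, the value used in the rest of this post.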

The formula for estimating sample size is given as:

N = [(Za)^2 * p * q] / d^2

where the symbol ^ means ‘to the power of’ and * means ‘multiplied by’; that is, “Z-alpha squared into pq; upon d-square”.

Substituting the value of Za, we get:

N = [(1.96)^2 * p * q] / d^2

We can round off the value of Za (1.96) to 2, to obtain:

N = [(2)^2 * p * q] / d^2

or, N = 4pq / d^2      that is, “4 pq by d-square”
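As a rough illustration, the formula can be written as a small Python function. This is a sketch, not a validated calculator: the function name sample_size and the choice to round up to the next whole subject are assumptions made here, and p and d are both taken in percentage terms.

```python
import math

def sample_size(p: float, d: float, z: float = 1.96) -> int:
    """n = z^2 * p * q / d^2, with p and d both in percent; rounded up."""
    q = 100.0 - p
    return math.ceil(z ** 2 * p * q / d ** 2)

print(sample_size(20, 4))         # using z = 1.96
print(sample_size(20, 4, z=2.0))  # using z rounded to 2, i.e. 4pq/d^2
```

With p = 20 and d = 4, rounding z up to 2 gives 400 instead of 385, so the simplified formula errs slightly on the side of a larger sample.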



I wish to conduct a cross-sectional study on awareness of Hepatitis B among school children. A literature search reveals that other investigators have reported knowledge to range from 5% to 20% among students of grades 6 through 8. What should the size of my sample be?


The formula requires us to input the value of d (precision). If the absolute precision is known, there is no problem. However, often we can only input a relative precision. Where do we get the value of relative precision from?

Typically, relative precision is taken as a proportion of ‘p’. The maximum permissible limit is 20% of ‘p’.

In the above example, if ‘p’ is 20%, then ‘d’ will be (20/100)*20 = 0.2*20 = 4 percentage points {taking a relative precision of 20%}.

Because ‘d’ in the formula is the margin of error on either side of ‘p’, this means that the 95% confidence interval around an estimate of 20% will run from about 16% to 24% {‘p’ +/- ‘d’: 20 +/- 4}. A range of 18% to 22% would require ‘d’ = 2.

That is, by taking a relative precision of 20% of ‘p’, the study will estimate the true awareness level to within 4 percentage points, provided the actual prevalence is close to 20%. If the actual prevalence is much lower, the same sample size will no longer give a relative precision of 20%.

Therefore, keeping the relative precision fixed (say, at 20% of ‘p’), the larger the value of ‘p’ (prevalence), the larger the value of ‘d’ in absolute terms. If the prevalence is 50%, ‘d’ (20% of ‘p’) would then be 0.2*50= 10 (as compared to ‘d’ = 4 when ‘p’ = 20%).

The reverse is also true: the smaller the value of ‘p’, the smaller the value of ‘d’. A smaller ‘d’ implies a larger sample size. Therefore, the choice of ‘p’ is crucial. 

We can now input the values in the formula to obtain the sample size:

For the calculation we will take ‘d’ as 4. This yields:

N= (4*20*80)/ (4*4)

  = 400. This sample size will allow us to estimate the prevalence to within 4 percentage points (16% to 24%) if the true prevalence is about 20%.

If we took ‘p’= 5, then the sample size would be:

N= (4*5*95)/(1*1)                                           [‘d’= 0.2*5= 1]

  = 1900. This sample size will allow us to estimate the prevalence to within 1 percentage point (4% to 6%) if the true prevalence is about 5%.
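The two calculations above can be reproduced with a short sketch. The helper name sample_size_relative and the parameter relative_percent are hypothetical; the code uses the simplified formula 4pq/d^2 with ‘d’ taken as a percentage of ‘p’.

```python
def sample_size_relative(p_percent, relative_percent=20):
    """Return (d, n) for n = 4pq/d^2 with d set to a percentage of p."""
    d = p_percent * relative_percent / 100  # absolute precision, in percentage points
    q = 100 - p_percent
    n = 4 * p_percent * q / d ** 2
    return d, n

for p in (20, 5):
    d, n = sample_size_relative(p)
    print(f"p = {p}%: d = {d:g}, n = {n:.0f}")
```

This prints d = 4, n = 400 for p = 20% and d = 1, n = 1900 for p = 5%, matching the hand calculations.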

So should I take ‘p’= 20% or ‘p’=5%?

That depends upon:

1. The location of the original study: if you are planning to conduct the study in an urban area, use the prevalence reported by studies conducted in urban areas; for a rural study, use a rural estimate.

2. The available resources (time, manpower, money, etc.). Aim for the largest feasible sample size. The size should be adequate to yield 80% power. Do not unnecessarily increase the sample size unless the intention is to obtain greater power. If so, please mention the same in the methodology section.

3. The results of your pilot study. If you have conducted a pilot study, the prevalence obtained from that study should be taken as ‘p’. This will be much more accurate than any other external value.


Note 1: If you have multiple objectives, you must calculate the required sample size for each objective, then choose the largest sample size thus obtained. This will ensure adequate power for all objectives, else the study will lack power for one or more objectives. That is, you may not be able to detect a significant result where it actually exists because you failed to include enough subjects to detect it.

Note 2: It is advisable to mention a range rather than a single value for sample size. This is standard practice in the west, but not in India. A range may be obtained by calculating the sample size for different values of ‘p’.
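One way to obtain such a range, and to pick a single number that covers several objectives, is sketched below. The list of candidate prevalences is purely illustrative (values spanning the 5% to 20% reported in the example), and the relative precision of 20% is the same assumption used earlier.

```python
import math

def sample_size(p, relative_percent=20):
    """n = 4pq/d^2 with p in percent and d a percentage of p; rounded up."""
    d = p * relative_percent / 100
    return math.ceil(4 * p * (100 - p) / d ** 2)

# Hypothetical candidate prevalences, e.g. one per objective or per source study.
candidate_prevalences = (5, 10, 15, 20)
sizes = {p: sample_size(p) for p in candidate_prevalences}

for p, n in sizes.items():
    print(f"p = {p}%: n = {n}")
print(f"Range: {min(sizes.values())} to {max(sizes.values())}")
print(f"Sample size covering every scenario: {max(sizes.values())}")
```

Reporting the whole range shows the reader how sensitive the sample size is to the assumed prevalence, while the largest value is the one that satisfies every scenario.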


109 thoughts on “Sample size calculation: Cross-sectional studies”

    • Dear Sangeeta,

      That depends upon your objectives, study design, power, effect size (expected), etc.

      I recommend consulting a local statistician for further inputs as it is not feasible to discuss this matter here.

      I’d be glad to clarify any doubts you may have subsequent to such a consultation.

      Dr. Roopesh

  1. dear sir..

    I am using only questionnaire forms to obtain all the data for my research… Is this formula applicable to my study? How did you choose 18%? Is it an estimated value?

    • Dear Shazwan,

      This formula can be used for the kind of studies you’ve described.

      What I have described is a hypothetical situation where the prevalence is 20%.
      The question any researcher has to face is, “What if my prevalence estimate is wrong?”. That is, what if the true prevalence is not 20%?

      In the example, taking a relative precision of 20% yields d=4. This indicates the range of prevalence that can be captured by the study.

      If we were to consider that d gives us the 95% limits of prevalence, then the range of prevalence captured by the study is d on either side of the prevalence estimate (here 20%).

      Thus, the range of prevalence lies from 20-4 to 20+4: 16% to 24% respectively. (A range of 18% to 22% would apply only if d were 2.)

      In other words, if the true prevalence is close to 20%, the study will estimate it to within about 4 percentage points. If the true prevalence is much lower, the sample size may be too small to estimate it with the same relative precision.
      I hope this clarifies the matter.

      Dr. Roopesh

  2. Dear Farouk,

    If you plan to conduct a study on macular thickness, I assume it is with regard to a specific condition. That is, your objective would be something like this: “To determine macular thickness among individuals with/ without XYZ..”.

    In order to have adequate power for the study (that is, to be able to detect a difference in macular thickness between the subjects of interest and others [or any two groups]), one needs to factor in the prevalence of the condition of interest among them.

    Failure to do so may result in the study having inadequate power. Thus, no difference may be detected although there actually is a difference between the two groups.

    Bottomline: The prevalence is required to estimate sample size.

    The only exception to this rule is an in-vitro study or a study in lab animals, where a rule of thumb usually applies.

    I hope that helps clear your doubt.

    Dr. Roopesh

  3. Dear Roopesh,
    I have a doubt about using absolute and relative precision.
    For example, I want to assess the prevalence of oral malodour in a city. In a pilot study I got a prevalence of 20%, and I would like to have a 95% confidence interval. Which precision should I consider for the sample size calculation: absolute or relative?

    Kindly let me know under what conditions we should use absolute precision and under what conditions relative precision.

  4. Dear Dr,

    I’ve been reading your post and really appreciate your concern for spreading knowledge. I am new to research and I am conducting a study on the incidence of an acute infection in one particular hospital. The only previous estimate is an incidence of 11.2%, from another hospital in another state of the same country; no other study has been conducted on this specific topic in this country so far. What should my expected sample size be?
    Does a cross-sectional study mean a duration-based study with the minimum sample size calculated using this formula?

    I will be grateful for you guidance! thanks.

    • Dear Omaid,
      A cross-sectional study is not suitable to determine incidence- it can only help estimate the prevalence. In order to obtain incidence (new cases), one needs to undertake a longitudinal (cohort) study.
      When one calculates sample size for a cross-sectional study, the usual input parameter is prevalence.
      A cross-sectional study simply means a study that captures a snap-shot of the population at a point in time. It has nothing to do with sample size. The key thing is to interview each subject only once during the study duration. You may read about cross-sectional studies in my post by the same name.
      In view of the above, I am unable to respond appropriately to your query on sample size at present. Maybe I will be able to help you estimate sample size if you modify the question suitably.

      Hope this helped.
      Dr. Roopesh

      • Thanks for your guidance Dr. I studied and got the idea of study designs. I’ve changed my design to prospective observational study.
        Now kindly let me know what should be the appropriate sample size to calculate the incidence by keeping in mind the prevalence of 11.2% & CI 95%?

        • Dear Omaid,

          A prospective observational study could be longitudinal (cohort), or otherwise (cross-sectional). However, traditional teaching indicates that one can determine incidence (new cases) only via a cohort study. Are you planning to conduct a cohort study?

          I will assume that you actually plan on conducting a cross-sectional study (simply because it is the easiest and least expensive). In that case, the sample size would be:
          {4*11.2*(100-11.2)}/ {(11.2*0.2)^2}

          This works out to be: 3978.24/ 5.02= 792.9, or approximately 800.

          Again, please note that a cross sectional study is not appropriate to obtain incidence.

          As a rule of thumb, the lower the prevalence, the larger the sample size.

          I hope this was useful.

          Dr. Roopesh

          • Dr. I have to find the incidence of an acute disease in a current hospital population. So I am conducting a prospective observational study which can be termed a cohort study, because I’ll be comparing two groups with similar characteristics: those who develop the disease and those who do not. I am also looking for risk factors.
            So, I calculated my sample size by Daniel’s formula, and it gave me a sample size of 186 including a 25% dropout rate. Is it wrong?

            • Daniel’s formula is the one given in Cochran’s book, Sampling Techniques. The formula is: n = Z^2 x p(1-p)/d^2
              where Z (z value for the confidence level) = 1.96 for 95%
              d (margin of error) = 0.05 for 5%
              p (prevalence) = 0.11 for 11.2%

              This gives me n=151
              adding 25% drop off rate it becomes 189 which is my final sample size.
              HOWEVER, i will be conducting a duration based study like 4-5 months study in which i will be taking all the registered patients during that time period.
              Finding out the sample size is only so that i have an idea of at least what should be the minimum number of subjects to be taken into account.
              Please let me know what u think over it??
              Thanks again.. u r doing a great deed, no one now a days takes out time and help people like this. Really appreciable!

              • Dear Omaid,

                The formula is correct, but your application of the formula is not.
                You need to calculate ‘d’ as a proportion of ‘p’- 5% of 11.2.

                That should yield the required sample size.
                Of course, the sample size will skyrocket, but it is expected with a prevalence as low as 11.2%.

                First calculate the required sample size using my suggestion above, then calculate how many more you’d need assuming 25% loss to follow up.
                That will give you the grand total required.

                Typically, one does not compute sample size using an estimated loss to follow up of more than 20%.
                Any loss to follow up more than 20% is considered unacceptably high and reflects poor patient selection or other methodological flaws.

                Hope this helps clarify things.

                Dr. Roopesh
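To illustrate the difference discussed in this exchange, here is a minimal sketch comparing ‘d’ taken as 5 percentage points with ‘d’ taken as 5% of ‘p’, followed by an allowance for loss to follow-up. The function names are hypothetical, and the inflation rule (dividing by the proportion expected to remain) is one common convention assumed here; the thread does not specify which rule was intended.

```python
import math

def sample_size(p, d, z=1.96):
    """n = z^2 * p * q / d^2 with p and d in percent (not rounded)."""
    return z ** 2 * p * (100 - p) / d ** 2

def inflate_for_loss(n, loss_percent):
    """Enrol enough subjects that about n remain after the expected loss."""
    return math.ceil(n * 100 / (100 - loss_percent))

p = 11.2
n_absolute = math.ceil(sample_size(p, d=5))            # d = 5 percentage points
n_relative = math.ceil(sample_size(p, d=p * 5 / 100))  # d = 5% of p = 0.56 points
print(n_absolute, n_relative)
print(inflate_for_loss(n_relative, loss_percent=20))
```

The absolute version needs roughly 150 subjects while the relative version needs more than 12,000, which is the "skyrocketing" referred to above.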

    • Dear Fatima,
      Using the formula for a cross-sectional study, and assuming precision to be 10%, the sample size would be between 84 and 100.
      I would recommend that the larger sample size be used, though.

      Dr. Roopesh

    • Dear Fatima,

      Let us assume that the true population value is 100 units.

      When we estimate something, there is always a chance of error. The amount of error indicates the precision of the estimate- smaller the error, more precise the estimate. The margin of error merely refers to the magnitude of error we would like to have.

      If the error margin is 5%, then in the above example the estimate should be accurate to within +/-5% of 100 (the true population value).

      The 95% confidence level indicates that if we drew 100 samples and computed a confidence interval from each, about 95 of those intervals would contain the true population value.

      Combining the two:
      The sample estimate will not differ from the true population value by more than +/-5 percent (margin of error) in about 95 percent of samples (confidence level).

      I hope this helps clarify your doubt.

      Dr. Roopesh
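The idea of "95 percent of the time" can be checked with a small simulation. This is only a sketch under stated assumptions: a true prevalence of 20%, a sample of 400, the simple normal-approximation (Wald) interval, and a fixed random seed; none of these values come from the thread.

```python
import math
import random

def wald_interval(successes, n, z=1.96):
    """Normal-approximation 95% interval for a proportion."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def coverage(true_p, n, repeats, seed=0):
    """Fraction of simulated samples whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        # Draw one sample of size n and count subjects with the condition.
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = wald_interval(successes, n)
        hits += low <= true_p <= high
    return hits / repeats

print(coverage(true_p=0.20, n=400, repeats=2000))
```

The printed fraction should be close to 0.95, although it will vary a little with the seed and the number of repeats.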

  5. Dear Dr. Roopesh
    Hello and good day.
    It is with great pleasure that I write to you.
    I am a PhD student preparing to do my research, so I would like to ask you: how can I calculate the sample size if the prevalence in a previous study is 30%?

      • Dear Ammar,

        The sample size calculation does not depend upon the type of variable.

        However, the number of variables should not be excessive- that would decrease the power.

        Please read the related article on a general rule of thumb for sample size calculation for more details.

        Dr. Roopesh

    • Dear Ammar,

      I presume the study design is a cross sectional study. If so, the sample size would be:
      233 (relative precision 20%) to 933 (relative precision 10%).
      Please note that the calculation is based on the formula 4pq/ (d^2), where
      p= (prevalence)=30
      d= (precision)= either 10% or 20% of p
      I would recommend the larger sample size as the possibility of having low power is reduced. However, practical considerations might dictate otherwise.

      I hope this helps.

      Dr. Roopesh

  6. Dear Dr. Roopesh
    Hello and good-day
    Many thanks for your information; I appreciate it. I asked before about quantitative variables because I had read some articles talking about “quantitative variables”, so I ask you again: what is the difference?
    Also, how do you calculate the precision? Please explain in detail. Thank you.

    • Dear Ammar,

      I have described quantitative and qualitative variables in this post:

      I have discussed precision in this post:

      Please go through the above links and let me know if you continue to have queries.

      Dr. Roopesh

        • Dear Ammar,

          I’m afraid I don’t understand what exactly you mean by the term ‘equation satiable’.

          I have not come across any sample size estimation formula that is dependent upon the type of variable.
          I have mentioned the elements required for sample size calculation in the articles on the topic.
          Unless you specify/ clarify your question further, I’m afraid I will be unable to respond appropriately.

          Dr. Roopesh

            • Dear Ammar,

              Thanks for the clarification and link. I disagree with the authors. I learnt the same formulae as:
              ‘Formula to calculate sample size when proportions are known;
              Formula to calculate sample size when mean, SD is known’, etc.

              The reason for my disagreement is that numeric variables can also be expressed as a proportion/ percentage. Therefore, to claim that the first formula (using proportions) is only applicable to qualitative data does not make sense to me.
              Moreover, often, the only details available are the relative proportions, not mean/ SD. If the formula were applicable only to qualitative variables, most studies would never have seen the light of day!
              You can test the veracity of my statement by searching for journal articles mentioning percentages/ proportions for outcome variables as opposed to mean/ SD.
              You should find that the former far outnumber the latter; and that most of those articles describe quantitative variables.

              In addition, you could seek expert opinion from someone else as well.

              Do let me know what you discover- I might learn something in the process.

              Dr. Roopesh

            • Dear Ammar,

              I haven’t heard from you since I posted my response.
              I wish to clarify that I wasn’t being sarcastic when I said I could learn something through your efforts. I truly believe that every interaction is a learning experience, and that one could learn from anyone. Besides, I don’t profess to know everything about everything, either. Naturally, there is a distinct possibility of learning something through your response(s).

              Dr. Roopesh

              • Dear Dr. Roopesh
                Good day
                I apologize for the delay in replying to you.
                Thank you very much for the detailed explanation. This has been useful in my research.
                I will continue to search on the subject until I get a satisfactory result,
                and I’ll tell you all my findings.

                With best regards

                Ammar Elmajzoup

  7. Dear Dr. Roopesh
    As sample size = 4PQ/L^2, where P = prevalence, Q = 100-P and
    L = margin of error:
    If P = 20% and L = 10%, then Q = 100-20 = 80%
    Sample size = 4PQ/L^2
    = 4*20*80/(10*10) = 64
    Or, as some books mention,
    L = 10% of P = 10*20/100 = 2
    Then sample size = 4PQ/L^2
    = 4*20*80/4
    = 1600
    In that case, which answer is correct? Is a sample size of 1600 needed, or 64?

    • Dear Haripal,

      I would recommend the second approach.
      With a sample size of 64, you may encounter issues with power. The same is unlikely with a sample size of 1600.
      However, practical considerations would supervene.

      Dr. Roopesh
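The two answers in this question differ only in how L is interpreted. A minimal sketch of both readings, using the simplified formula 4PQ/L^2 with values in percent (the function name is illustrative):

```python
def sample_size(p, d):
    """n = 4pq/d^2 with p and d both in percent."""
    return 4 * p * (100 - p) / d ** 2

p = 20
absolute = sample_size(p, d=10)            # L = 10 percentage points
relative = sample_size(p, d=p * 10 / 100)  # L = 10% of p = 2 percentage points
print(absolute, relative)  # 64.0 1600.0
```

An absolute margin of 10 points around a prevalence of 20% is a very loose estimate (10% to 30%), which is why it needs so few subjects.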

  8. Dear Dr. Roopesh,

    You mentioned in the article that the sample size should be adequate to yield 80% power. May I know which part of the formula is related to power? For example, what value do I change if I want to increase or decrease the power?

    We want to conduct a cross sectional study and from previous study the prevalence of disease is 6%. I would like to know how many samples I should include in the study so it does capture “adequate” number of people with the disease. Is the formula given a right formula to use in this context?

    Many thanks and your help is much appreciated.


    • Dear Betsy,

      The formula mentioned in the article is for the situation when a proportion is the parameter of study. The actual formula is:
      n= (Za^2 *p*q)/ d^2
      Za (Z alpha)= Standard Normal Deviate (Z value)
      p= proportion/ prevalence of interest
      q= 100-p
      d= expected precision

      Alpha refers to the Type I error rate. This is usually kept at 5% (ie 0.05).
      The corresponding value of Z (for alpha of 0.05) is 1.96.
      The formula given in the article is a simplification- as 1.96 is ~2,
      Za^2 = (2^2)= 4,
      yielding the formula
      n= 4pq/d^2
      Alpha error of 0.05 (5%) corresponds to Confidence Interval of 95%.
      If you wish to alter the Confidence Interval up or down, one merely needs to change the value of Za in the formula.
      For 99% CI, Za= 2.58
      For 99.9% CI, Za= 3.29

      The above formula does not permit alteration of power. For that, one may use the following:
      n= [(Za+Zb)^2*(p1q1+p2q2)]/ d^2
      Za (Z alpha)= Z value for alpha error
      Zb (Z beta)= Z value for beta error
      p1= proportion in first group;
      q1= 1-p1
      p2= proportion in second group
      q2= 1-p2
      d= clinically meaningful difference between two groups

      Zb influences the power
      Za influences the Confidence Interval

      Beta is usually =20% (that is, power = 1 - beta = 80%)

      The n obtained yields the required number for each arm/ group

      With a prevalence of 6%, the sample size would be
      4*6*94/ (6*0.2)^2
      = 2256/1.44
      = 1566.7 or 1567

      I hope this helps.
      Dr. Roopesh
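The two-group formula in this reply can be sketched as follows. The code reads the formula as (Za+Zb)^2 multiplied by the sum (p1q1 + p2q2), takes d as the difference p1 - p2, and works with proportions in decimal form; these readings, the function name, and the example proportions of 20% and 10% are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Subjects per group for comparing two proportions (decimal form)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    d = p1 - p2                                    # difference to be detected
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(numerator / d ** 2)

print(n_per_group(0.20, 0.10))              # 80% power
print(n_per_group(0.20, 0.10, power=0.90))  # 90% power needs more subjects
```

Raising the power increases Zb and therefore n, while changing alpha alters Za, which is the distinction drawn in the reply.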

      • Dear Dr. Roopesh,

        Many thanks for your prompt reply. Please correct me if I am wrong: if we think of the setup as hypothesis testing, the first formula is for H0: p = 6% vs H1: p not equal to 6%, and the second formula is for comparison between two groups?

        For the first formula, does that mean that power is fixed at 80%, or does it have nothing to do with power at all?

        The prevalence of disease in my population is 6%. I understand that I need an n that is big enough to contain some people with the disease. I was asked to get the value of n such that the study has 80% power to capture an adequate number of people with the disease. Is this the same as using your first formula? Or can the 80% power just not be used in this way?

        Sorry for the long question. Your help is much appreciated.


        • Dear Betsy,

          If my memory serves me right, the first formula is fixed at 80%. You could alter the denominator (precision) to improve your chances of capturing individuals with the disease.
          I am travelling at present, so am unable to provide a better response.
          If you will bear with me, I will provide a detailed/ more accurate response upon my return.
          Dr. Roopesh

          • Dear Dr. Roopesh,

            Many thanks for your quick reply. I am looking forward to hear from you again.


            • Dear Betsy,

              Based on my research, the first formula does not include beta, so power cannot be altered (or estimated). However, power will increase with decrease in alpha- larger n.
              A statistician friend told me that one does not compute power for the first formula.
              According to him, power computations are restricted to hypothesis testing situations (where one is dealing with two proportions).
              I have managed to obtain a formula for estimating beta from alpha:
              Zb = [n(p1-p0)^2/(2pq)]^(1/2) - Za

              Zb = Z beta
              n= sample size
              p1= proportion as per Ha (alternative hypothesis)
              p0= proportion as per H0 (null hypothesis)

              The reference for the formula is:
              Case Control Studies design, conduct, analysis by
              James J Schlesselman
              Chapter: Sample Size
              Page: 149

              One always has the option of performing post-hoc power analysis. In addition, if you have an estimate of the effect size, it may be possible to estimate power.

              Apologies for the delay.

              I hope this helps.

              Dr. Roopesh
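As a rough sketch, the formula for Zb quoted in this reply can be turned into an approximate power calculation. The reply does not define p and q; the code below assumes p is the average of p0 and p1 and q = 1 - p, which should be checked against the cited reference. The function name and example values are hypothetical.

```python
import math
from statistics import NormalDist

def approximate_power(n, p0, p1, alpha=0.05):
    """Power implied by Zb = sqrt(n * (p1 - p0)^2 / (2pq)) - Za."""
    p_bar = (p0 + p1) / 2  # assumption: p is the average of the two proportions
    q_bar = 1 - p_bar
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = math.sqrt(n * (p1 - p0) ** 2 / (2 * p_bar * q_bar)) - z_alpha
    # Power is the standard normal probability below Zb.
    return NormalDist().cdf(z_beta)

print(round(approximate_power(197, 0.10, 0.20), 2))
print(round(approximate_power(400, 0.10, 0.20), 2))
```

Power rises with n for a fixed difference between the two proportions, which is why a larger sample is the usual remedy for low power.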

  9. Hello Sir,
    Kindly, inform me whether I can use this formula to determine the sample size in a community based study in social science, particularly Psychology. As I have to find the prevalence of a syndrome in the population.
    Can this formula be used for any other social science subject, like: Tourism, Management etc. ?


      • Respected Sir,
        Thank you so much for the quick response. It is really going to help me out a lot.

        Regards and best wishes,
        Vipasha Kashyap,
        Doctoral Student,
        Department of Psychology,
        Himachal Pradesh University

      • Hello Sir,

        I have one more query. Would it be appropriate to calculate the denominator (precision) as 12% of ‘p’. Because in my study, if I am calculating it, at 5 or 10% the sample size is coming out very large. Kindly, suggest me a reason which I could write in my methodology as an explanation for calculating the precision at 12%. I have to find the ‘p’ first by conducting a pilot study.
        I shall be highly thankful to you.

        Best wishes and regards,

        • Dear Vipasha,

          The maximum acceptable level for precision is 20% of p. It should not be very difficult to obtain references for the same. Try looking up good epidemiology and biostatistics books. Research methodology books may also help.

          Dr. Roopesh

    • Dear Najihah,
      You will need to first perform a literature review to ascertain the prevalence of one or more gut microbes of interest in the specific population. Calculate sample size for each of them, then select the largest sample size as the required sample size for your study.

      Dr. Roopesh

  10. Dear Dr Roopesh
    I am carrying out a study of dry eye among diabetics in comparison to non-diabetics. My study design is a comparative cross-sectional design. I wonder if this is right and if so, what formula should I use to calculate my sample size. Thanks

    • Dear Anonymous,

      The study design depends upon your research question.

      If you are attempting to determine the prevalence of dry eye, it would be a cross-sectional study.

      If you are trying to compare the two- occurrence of dry eye in diabetic vs non-diabetic, it would have to be a case-control study.

      The sample size calculation depends upon the study design.

      I hope this helps.

      Dr. Roopesh


  12. Dr Roopesh, I am finding it difficult to identify an estimated proportion with which to calculate the sample size for a purely cross-sectional descriptive study on decentralization of PHC services. I am using purposive sampling.

    • Dear Uche,
      The sample size will depend upon your objective(s).
      I might be able to better help you if you provide a sample objective.
      The sampling method will affect validity, not sample size.

      Dr. Roopesh

  13. Dear Dr. Roopesh,

    Currently, I’m doing a cross-sectional study involving measurements of parameters value (numerical) from MRI images. I’m having a problem in calculating the number of sample needed for validation process in order to calculate the inter-rater agreement using intraclass correlation coefficient (ICC)

    The questions that I want to ask are :

    1. Is the calculation method more or less same like the calculation for determination of sample size for my study ?

    2. Is there any way that the number of sample for validation can be determined just by random assumption in case of no previous study done before ?


  14. Dear Dr. Roopesh,
    i am trying to conduct a study about “Frequency of diabetes in pregnant women at first antenatal visit” in our hospital. according to literature the prevalence rate is 1%. what formula should i use to calculate sample size and margin of error?

    thank you,

  15. Dear Dr. Roopesh,
    please can you provide the sample size and margin of error calculation formulas for the following:
    1) Randomized controlled clinical trial (RCT)
    2) Cohort study
    3) Cross-sectional studies (I got this from your article above, but I am a little confused about margin of error. Should I consider “d” as the margin of error?)

    thank you
    Salman Karim

  16. Dr, my research is prevalence of diarrhoea in rural areas and using only questionnaire form. if the prevalence is 45 % what should be the sample size. the confidence interval be 95 %.

  17. dear Roopesh,
    I am doing a comparative cross-sectional study in which I am planning to use cluster sampling to choose sampling units. I want to keep the SD difference between my two study settings at 0.5, with a 95% CI, a design effect of 3 and power of 80%. I would like your insight on my sampling technique.
    hoping to hear soon
    with regards,

    • Dear Ms. Acharya,

      The calculation seems fine- except the use of design effect of 3, that is.

      Such a high design effect indicates very high inter and/or intra-cluster variability.

      I would recommend evaluating the necessity for such a high design effect, preferably through a pilot study.

      The reason for my suggestion is simple- if you can obtain your answers with a lower sample, you have no reason to waste resources by taking a larger sample. In the specific case of your proposed study, a design effect of 3 would imply trebling the initial sample size, while lowering the design effect to 2 would mean sampling ‘only’ twice the initial sample size.

      You have not mentioned the number of clusters and size of each cluster.
      My recommendation would be to increase the number of clusters while reducing the size of each cluster. This will have the effect of increasing power. You may discuss this with your statistician/ epidemiologist (to learn what happens and/ or how this occurs).
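The effect of cluster size on the design effect can be sketched with the commonly used approximation: design effect = 1 + (m - 1) * ICC, where m is the cluster size and ICC is the intra-cluster correlation. This formula is not given in the reply and is introduced here as an assumption, as are the illustrative values (an initial sample of 400 and an ICC of 0.05).

```python
import math

def design_effect(cluster_size, icc):
    """Approximate design effect for equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc

def cluster_plan(n_srs, cluster_size, icc):
    """Return (total subjects, number of clusters) after applying the design effect."""
    n_total = n_srs * design_effect(cluster_size, icc)
    return math.ceil(n_total), math.ceil(n_total / cluster_size)

# Hypothetical values: n_srs is the sample size before adjusting for clustering.
for m in (41, 21):
    total, clusters = cluster_plan(n_srs=400, cluster_size=m, icc=0.05)
    print(f"cluster size {m}: deff = {design_effect(m, 0.05):.1f}, "
          f"total = {total}, clusters = {clusters}")
```

Under these assumptions, halving the cluster size lowers the design effect from about 3 to about 2, so fewer subjects are needed in total even though more clusters are visited.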


  18. Dr,
    Currently I am trying to look into family functioning and other exposures such as coping mechanisms and prevalence of PTSD in the family, and their relation to PTSD in children.
    My desired CI is 95% with power 0.8, but I am confused about which formula will be best for calculating the sample size. I would like your insight, Dr.
    With Regards,
    Shneha Acharya

  19. Hello sir , I’m doing a research project for my studies on the association between stress with dietary intake and anthropometric measurement among undergraduate , may i know what kind of formula shall i use for this ?

  20. Hi! I would like to apologize ahead of time if this post will be lengthy. I have been really troubled lately about the reliability of my study and I am about to have my Final defense on my paper (on March 14).

    I conducted a study entitled “Situation of Drug Resistant Tuberculosis in the Municipalities of Molave and Tambulig, Zamboanga del Sur”. At the beginning of the study, I did not apply any form of sampling design. What I did was just considered the entire population and did purposive sampling. My study was also cross sectional.

    Anyway, if I would have gone back and computed my sample size, how would I do it? The following are considered.

    1) Total population where my samples were taken is 1,208 previously treated tuberculosis patients.
    2) My objectives are the following:
    a. Identify prevalence of TB symptoms among previously treated TB patients (there are no statistics given for this)
    b. Identify prevalence of MDRTB among previously treated patients (there is a national incidence rate of 5.7% among patients being treated for Tuberculosis)

    In the course of my study, I was able to interview 368 out of the 1,208 patients. Out of the 368, 124 turned out to be positive for symptoms. I required all 124 for testing for MDRTB but only 83 showed up and were hence tested. Out of the 83, only 1 turned out positive for MDRTB.

    Can you please help? I’m sorry if I am hardly making any sense. Haha. I have been told that my research was a mess (and I believed that – having had no experience prior and having too little of a guidance doing it). Thank you so much!

    • Dear Kerwin,

      I’m afraid I was unable to respond earlier, so this response is probably not of much benefit to you, considering that your thesis defense is over.
      Firstly, you could have applied cluster sampling to obtain the sample size.
      Next, I have several questions for you:
      Were the patients treated for pulmonary TB only, or did you include extra-pulmonary TB as well?
      What was the time frame under consideration- those who were treated within the last year/ last 2 years/ last 5 years?
      Were HIV positive individuals (HIV-TB co-infection) included? This would affect the probability that one would continue to be symptomatic after treatment completion; it would also influence the risk of developing MDR-TB.
      What was the treatment outcome of the subjects- In India following first line treatment, sputum+ve pulmonary TB patients may either be ‘cured’ (sputum+ve at start of treatment, but sputum-ve at end of intensive phase and end of treatment), ‘treatment completed’ (sputum+ve initially, then sputum-ve after intensive phase, but not at end of treatment), or ‘failure’ (initially sputum+ve, continues to remain sputum+ve till end of treatment)?
      What was the minimum time after cure/treatment completion for subjects to be included?
      Were smokers/ ex-smokers included- COPD/ post-TB Bronchiectasis excluded (“symptoms of TB”)?
      What is the mortality rate due to TB in your country- did you factor that in your calculations/ estimations regarding how many would be alive/ available?
      Was Diabetes considered as a major risk factor during study design/ analysis?

      Hopefully the above queries will help clarify things for you.

      Apologies for the delay in responding.

      Dr. Roopesh

  21. I am doing a study comparing bronchial wash and biopsy with bronchial brushing and bronchial biopsy. I want to calculate my sample size.

  22. Dear Dr. Roopesh

    I am planning a study titled ‘Estimation of measles antibody levels in healthy children aged 0 to 9 months’, and I want to compare the titre levels between monthly age groups (0-1, 1.1-2, 2.1-3, 3.1-4, etc.) as well as between three-month age groups (0-3 months, 4-6 months, 7-9 months).
    Please suggest the sample size needed and which formula to use.

    Thanking you

  23. Hello Dr.
    I’m doing a cross-sectional analytical study on physical activity and postpartum depression (PPD) in women. I’m trying to find the prevalence of PPD and the association between PPD and physical activity. What sample size formula am I expected to use?

    • Dear Grace,

      The prevalence of moderate-vigorous physical activity (MVPA) among women with postpartum depression is around 32% according to a study.

      You will have to perform a detailed literature search to obtain prevalence rates from several studies.

      Then, compute the sample size requirement using the lowest prevalence and highest prevalence reported in literature. That will give you a range within which your own study’s sample size should fall.

      The estimation may be further refined by performing a pilot study in your area and using the prevalence thus obtained for sample size estimation.

      If you want to play it safe with regard to power considerations, simply use the smallest reported prevalence for sample size estimation. When the precision is taken as a proportion of ‘p’ (relative precision), it should yield the largest required sample size.

      Hope this helps.

      Dr. Roopesh

      PS: You may need to frame your research question carefully, as the prevalence of MVPA varies with time after delivery, as does the occurrence of postpartum depression.
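
The advice above (compute the sample size at the lowest and highest reported prevalence) can be sketched in a few lines of Python. This is only an illustration, not part of the original reply: it assumes the article's formula N = 4pq/d^2 with p in percentage terms and a relative precision of 20% of p. The function name is made up for the example. The 5% and 20% figures are borrowed from the article's Hepatitis B example as stand-ins for a literature range, and 32% is the MVPA figure quoted above.

```python
import math

def sample_size_relative(p_percent, relative_precision=0.20):
    """N = 4pq / d^2, with p and q in percent and d a fraction of p.

    Hypothetical helper for illustration; relative_precision=0.20 is the
    20%-of-p limit mentioned in the article.
    """
    q_percent = 100 - p_percent
    d = relative_precision * p_percent
    n = 4 * p_percent * q_percent / d ** 2
    # Round up to a whole subject; the tiny offset guards against
    # floating-point noise pushing an exact integer over the edge.
    return math.ceil(n - 1e-9)

# Stand-in literature range (5% to 20%) plus the 32% quoted in the reply.
for p in (5, 20, 32):
    print(f"p = {p}%  ->  N = {sample_size_relative(p)}")
```

Under these assumptions the smallest prevalence gives the largest N, which is why the lowest reported value is the conservative choice when relative precision is used.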

  24. Dear Dr Roopesh,
    I want to conduct a simple descriptive assessment of the healthcare behaviors of patients in a specific department of a public sector health facility (specifically, I would like to know, e.g., whether the patients come to this facility for a 2nd opinion, whether this facility is their first choice, why they choose it, whether they would go to a private facility if they could afford it, etc.). I am struggling a bit with: 1) the sample size, as I am not sure which population size I should choose (is it the average number of patients admitted to this department per day, per week, or per month?); 2) the period over which I should conduct the study (should I choose one day per week and question all patients coming into the department during that day for, say, a month or 2, or maybe 2 days a week over a period of 1 month, etc.?).
    Hope you would be able to help!

  25. Respected Sir,
    Kindly suggest a few references for cluster sampling and the formula used to form the clusters (from the population of the concerned area). I want to form clusters for one of my research studies.
    Vipasha Kashyap

  26. Dear Dr Roopesh,


    I am now doing a comparison study among vegetarians and non-vegetarians.
    Purpose: to compare lifestyle factors, dietary intake and physical activity among vegetarians and non-vegetarians.
    In the meantime, I plan to use a cluster sampling design in choosing respondents.

    My question is: should I name my study a comparative cross-sectional design or just a comparison study design?

    Thank you.

    • Dear Qi,

      A cross-sectional study is any study that engages with study subjects only once during the period of study.

      Thus, each subject contributes only one set of responses/ values to the study during its tenure.

      If the subjects are interviewed/ investigated on more than one occasion, the study design then changes to a longitudinal study.

      In such studies, subjects contribute more than one set of values to the study- obtained at different points in time.

      If your study involves interviewing subjects just once during the study period, it is a cross-sectional study. It doesn’t matter if the total time taken to interview all subjects is 1 or even 2 years, as long as each subject was interviewed only once.

      How you obtain/ recruit the subjects is the purview of sampling, and does not affect the study design.

      Please note that all epidemiological studies involve comparisons. Therefore, to call your study a comparison study design would be of no benefit (there is no such study design).

      I wonder why you wish to use cluster sampling, though, as it is a less than robust method- unless you have a large sample size/ geographical area to cover, and desire the convenience of cluster sampling.

      Hope this helped.

      Dr. Roopesh

  27. Dear Dr. Roopesh,

    I’m new to research and now I need to conduct a study for my thesis. It is about the prevalence of optic neuritis in a particular hospital, and also its clinical presentation (e.g. visual acuity, visual field, color vision). The study design is cross-sectional, with consecutive sampling from medical records. May I know what formula I’m expected to use to calculate the sample size? Please explain in detail. Thanks in advance for your guidance.

    • Dear Someone,

      The calculation of sample size for cross sectional studies requires the use of the formula mentioned in the above article: 4pq/ l^2

      Since you have more than one objective, you will need to perform a detailed review of literature and obtain prevalence values from published studies in similar settings. Identify the setting that is most like your own, then take the lowest value of prevalence for optic neuritis. Input the value into the formula to obtain a sample size estimate for optic neuritis.

      Next, establish threshold levels for visual acuity, visual field and color vision depending upon your hypotheses- visual acuity less than a/b; etc.

      Perform the same procedure as for optic neuritis- review of literature, then selection of prevalence value to estimate sample size.

      Once you have calculated sample size for each objective, select the largest sample size as the required sample size for your study. That way the study will be adequately powered for all objectives.

      I hope this helps.

      Dr. Roopesh
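
The procedure described in this reply (one sample size per objective, then take the largest) might look like the sketch below. The prevalence and precision values are placeholders invented purely for illustration; they are not taken from any published study and would have to be replaced by figures from the literature review. The function and dictionary names are likewise hypothetical.

```python
import math

def sample_size(p_percent, d_percent):
    """N = 4pq / l^2 with p, q and the precision all in percentage terms."""
    q_percent = 100 - p_percent
    n = 4 * p_percent * q_percent / d_percent ** 2
    # Round up; the small offset guards against floating-point noise.
    return math.ceil(n - 1e-9)

# PLACEHOLDER inputs: (prevalence %, absolute precision %) per objective.
# These numbers are made up for the example, not real literature values.
objectives = {
    "optic neuritis": (5, 1),
    "reduced visual acuity": (40, 8),
    "visual field defect": (30, 6),
    "color vision defect": (25, 5),
}

per_objective = {name: sample_size(p, d) for name, (p, d) in objectives.items()}
for name, n in per_objective.items():
    print(f"{name}: N = {n}")

# The study-wide requirement is the largest of the per-objective sizes.
required = max(per_objective.values())
print(f"Required sample size for the study: {required}")
```

With these placeholder inputs the rarest condition drives the overall requirement, which matches the reasoning in the reply.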

  28. Hello sir,
    At the outset, thank you so much for the valuable discussion. I am planning to do a comparative cross sectional study among obese and non-obese children to see if there is any association (of course not causal) between dental caries and obesity. How should I calculate the sample size? What data do I need to derive an appropriate sample size? My prespecified hypothesis is that obesity and dental caries can have common risk factors and hence there could be an association between dental caries and obesity.

    Thanks a lot.

      • Hello sir,
        To check association between obesity as identified by BMI scores and caries experience…. so caries experience of normal weight individuals will be compared with that of obese individuals.

        • Dear Someone/ Dhyan,

          You will need to state the study population.
          In that population, you will have to determine the prevalence of obesity and caries (separately) from published literature.
          You will typically have a few prevalence values for obesity in that study population, and a similar number for caries.
          Calculate sample size for each value, and take the largest sample size thus obtained.
          If there is a similar study, use the smaller value for prevalence of dental caries among obese subjects/ non-obese subjects to calculate the sample size.
          Hope this helps.

          Dr. Roopesh

  29. Hello sir,
    In my work I am going to compare the serum levels of a certain protein in obese and non-obese adults. I am having difficulty with the sample size calculation. Please can you help?

    • Dear Mahpara,

      Thank you for the words of encouragement.

      I’m glad you find this blog useful.

      Do spread the word about this blog so that many others may benefit as well.

      Dr. Roopesh

  30. Dear Dr. Roopesh,

    Thank you for your post. This is indeed helpful.

    I am trying to estimate the sample size of a national oral health survey.

    From the previous report, the prevalence rates of dental caries and gum diseases were 90% and 85%, respectively. Using the formula you provided, I came up with n=385 for dental caries and 545 for gum diseases, with a margin of error of 3% and 95% CI.

    I would like to ask, how could I make sure that the power of this study is 80% or more?

    Thank you.


    • Dear Qin,

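      The formula provided above is built from the precision of the estimate and the confidence level, and contains no explicit power term. It is often presumed to correspond to a power of 80%. I say presumed because I have been unable to find something to support that assumption.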

      I would recommend that you try to use the largest sample size obtained.

      The only way to assure yourself of a power of 80% or more is to take a small value for ‘L’. This will cause inflation of the sample size, and increase power.

      In your case, I would take a smaller margin of error to be certain of the power- provided it is feasible.

      I hope this helps.

      Dr. Roopesh
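
The figures quoted in the question above can be reproduced with a short Python sketch. It assumes the unrounded form of the article's formula, N = (1.96)^2 pq / d^2, with p and d in decimal terms and d treated as an absolute margin of error. The function name is hypothetical, and the 2% margin is included only to illustrate the effect of a smaller ‘L’.

```python
import math

Z_95 = 1.96  # z value for a 95% confidence level, as in the article

def sample_size_absolute(p, d, z=Z_95):
    """N = z^2 * p * q / d^2 with p and d as decimals (absolute precision)."""
    q = 1.0 - p
    return math.ceil(z ** 2 * p * q / d ** 2)

# Prevalences from the question; 0.03 is the margin used there,
# 0.02 is an illustrative smaller margin.
for label, p in [("dental caries", 0.90), ("gum diseases", 0.85)]:
    for d in (0.03, 0.02):
        print(f"{label}: p = {p}, d = {d}  ->  N = {sample_size_absolute(p, d)}")
```

At a 3% margin this gives 385 and 545, matching the question; tightening the margin to 2% raises the requirement to 865 and 1225. The match also shows that the quoted figures come from z = 1.96 rather than the rounded value of 2 (which would give 400 for dental caries).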
