Sample size calculation: Cross-sectional studies

Let us consider the estimation of sample size for a cross-sectional study.

In order to estimate the required sample size, we need to know the following:

p: The prevalence of the condition/ health state. If the prevalence is 32%, it may be either used as such (32%), or in its decimal form (0.32).

q: i. When p is in percentage terms: (100-p)

    ii. When p is in decimal terms: (1-p)

d (or l): The precision of the estimate. This could either be the relative precision, or the absolute precision. This will be discussed later in this post.

Za [Z alpha]: The value of z from the standard normal tables for the chosen confidence level. If the values are normally distributed, then 95% of the values will fall within 1.96 (roughly 2) standard deviations of the mean. The value of z corresponding to 95% confidence is therefore 1.96.

The formula for estimating sample size is given as:

N = [(Za)^2 * p * q] / d^2

where the symbol ^ means ‘to the power of’ and * means ‘multiplied by’; that is, “Z-alpha squared into pq; upon d-square”

Substituting the value of Za, we get:

N = [(1.96)^2 * p * q] / d^2

We can round off the value of Za (1.96) to 2, to obtain:

N = [(2)^2 * p * q] / d^2

or, N = 4pq / d^2      that is, “4 pq by d-square”
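
As a minimal sketch of the formula in Python: the function name `sample_size` and its `percent` flag are illustrative assumptions, not part of the formula itself. The result is rounded up, since a fraction of a subject cannot be recruited.

```python
import math

def sample_size(p, d, z=1.96, percent=True):
    """Sample size N = z^2 * p * q / d^2 for estimating a prevalence.

    p and d must be on the same scale: both in percentage terms
    (percent=True) or both in decimal terms (percent=False).
    """
    q = (100 - p) if percent else (1 - p)
    n = (z ** 2) * p * q / (d ** 2)
    # Round first to absorb floating-point noise, then round up to a whole subject.
    return math.ceil(round(n, 6))

print(sample_size(20, 4))                           # Za = 1.96
print(sample_size(20, 4, z=2))                      # Za rounded to 2, i.e. 4pq/d^2
print(sample_size(0.20, 0.04, z=2, percent=False))  # same inputs in decimal form
```

With p = 20 and d = 4, this gives 385 when Za = 1.96 and 400 when Za is rounded to 2, so the rounded formula errs slightly on the side of a larger sample.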

 

Example:

I wish to conduct a cross-sectional study on awareness of Hepatitis B among school children. A literature search reveals that other investigators have reported knowledge to range from 5% to 20% among students of grades 6 through 8. What should the size of my sample be?

 

The formula requires us to input the value of d (precision). If the absolute precision is known, there is no problem. However, often we can only input a relative precision. Where do we get the value of relative precision from?

Typically, relative precision is taken as a proportion of ‘p’. The maximum permissible limit is 20% of ‘p’.

In the above example, if ‘p’ is 20%, then ‘d’ will be (20/100)*20= 0.2*20= 4 {Taking a relative precision of 20%}.

This means that the estimate is expected to fall within ‘d’ on either side of ‘p’ –> +/- 4%: 16% to 24% (with 95% confidence).

That is, by taking a relative precision of 20% of ‘p’, the study will be able to estimate the awareness level to within 4 percentage points, provided the actual prevalence is close to the assumed 20%. If the actual prevalence is much lower than assumed, however, the same sample will give a less precise estimate in relative terms.

Therefore, keeping the relative precision fixed (say, at 20% of ‘p’), the larger the value of ‘p’ (prevalence), the larger the value of ‘d’ (absolute precision). If the prevalence is 50%, ‘d’ (20% of ‘p’) would then be 0.2*50= 10 (as compared to ‘d’ = 4 when ‘p’ = 20%).

The reverse is also true: the smaller the value of ‘p’, the smaller the value of ‘d’. A smaller ‘d’ implies a larger sample size. Therefore, the choice of ‘p’ is crucial. 
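
To illustrate how ‘d’ and the sample size move with ‘p’ when the relative precision is held at 20%, here is a small Python sketch. The helper names are hypothetical; Za is rounded to 2 as in the simplified formula, and ‘p’ is in percentage terms.

```python
import math

def absolute_precision(p, relative=0.20):
    """Absolute precision d, taken as a fraction (relative) of the prevalence p."""
    return relative * p

def sample_size_4pq(p, d):
    """Simplified formula N = 4pq/d^2, with p and d in percentage terms."""
    n = 4 * p * (100 - p) / d ** 2
    return math.ceil(round(n, 6))  # round away float noise, then round up

for p in (5, 20, 50):
    d = absolute_precision(p)
    print(f"p = {p}%  d = {d:g}  N = {sample_size_4pq(p, d)}")
```

For p = 5, 20 and 50 this gives d = 1, 4 and 10, and N = 1900, 400 and 100 respectively: the smaller the assumed prevalence, the larger the sample needed.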

We can now input the values in the formula to obtain the sample size:

For the calculation we will take ‘d’ as 4. This yields:

N= (4*20*80)/ (4*4)

  = 400     This sample size will enable us to estimate the prevalence to within +/- 4% (16% to 24%) if the true prevalence is about 20%.

If we took ‘p’= 5, then the sample size would be:

N= (4*5*95)/(1*1)                                           [‘d’= 0.2*5= 1]

  = 1900     This sample size will enable us to estimate the prevalence to within +/- 1% (4% to 6%) if the true prevalence is about 5%.
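
The two calculations can be reproduced with a short Python sketch, which also shows the effect of rounding Za from 1.96 to 2. The function name is an illustrative assumption; ‘p’ and ‘d’ are in percentage terms.

```python
import math

def sample_size(p, d, z):
    """N = z^2 * p * (100 - p) / d^2, with p and d in percentage terms."""
    n = z ** 2 * p * (100 - p) / d ** 2
    return math.ceil(round(n, 6))  # round away float noise, then round up

# (prevalence p, absolute precision d), where d is 20% of p
for p, d in [(20, 4), (5, 1)]:
    n_rounded = sample_size(p, d, z=2)    # the 4pq/d^2 shortcut
    n_exact = sample_size(p, d, z=1.96)   # unrounded Za
    print(f"p = {p}%, d = {d}: N = {n_rounded} (Za = 2) or {n_exact} (Za = 1.96)")
```

The shortcut gives 400 and 1900, as in the worked example; the unrounded Za gives 385 and 1825.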

So should I take ‘p’= 20% or ‘p’=5%?

That depends upon:

1. The location of your study: if you are planning to conduct the study in an urban area, use the prevalence reported by studies conducted in urban areas; likewise, for a rural study, use the prevalence reported from rural areas.

2. The available resources (time, manpower, money, etc.). Aim for the largest feasible sample size. The size should be adequate to yield 80% power. Do not unnecessarily increase the sample size unless the intention is to obtain greater power. If so, please mention the same in the methodology section.

3. The results of your pilot study. If you have conducted a pilot study, the prevalence obtained from that study should be taken as ‘p’. This will be much more accurate than any other external value.

 

Note 1: If you have multiple objectives, you must calculate the required sample size for each objective, then choose the largest sample size thus obtained. This will ensure adequate power for all objectives, else the study will lack power for one or more objectives. That is, you may not be able to detect a significant result where it actually exists because you failed to include enough subjects to detect it.
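
A sketch of Note 1 in Python: compute the sample size for each objective separately, then take the largest. The objectives and their ‘p’ and ‘d’ values below are hypothetical, chosen only for illustration.

```python
import math

def sample_size_4pq(p, d):
    """Simplified formula N = 4pq/d^2, with p and d in percentage terms."""
    n = 4 * p * (100 - p) / d ** 2
    return math.ceil(round(n, 6))  # round away float noise, then round up

# Hypothetical objectives: name -> (expected prevalence %, absolute precision %)
objectives = {
    "awareness of Hepatitis B": (20, 4),
    "awareness of transmission routes": (10, 2),
}

sizes = {name: sample_size_4pq(p, d) for name, (p, d) in objectives.items()}
required = max(sizes.values())  # the largest N covers every objective

for name, n in sizes.items():
    print(f"{name}: N = {n}")
print("Sample size to use:", required)
```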

Note 2: It is advisable to mention a range rather than a single value for sample size. This is standard practice in the west, but not in India. A range may be obtained by calculating the sample size for different values of ‘p’.
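
A range such as the one Note 2 describes could be produced along these lines. This is a sketch only: the prevalence values are picked from the 5% to 20% reported in the example, the relative precision is assumed to be 20% of ‘p’, and the function name is hypothetical.

```python
import math

def sample_size_relative(p, relative=0.20):
    """N = 4pq/d^2 with d taken as a fraction (relative) of p; p in percentage terms."""
    d = relative * p
    n = 4 * p * (100 - p) / d ** 2
    return math.ceil(round(n, 6))  # round away float noise, then round up

prevalences = [5, 10, 15, 20]  # spanning the reported 5% to 20%
sizes = {p: sample_size_relative(p) for p in prevalences}

for p, n in sizes.items():
    print(f"p = {p}%: N = {n}")
print("Range:", min(sizes.values()), "to", max(sizes.values()))
```

For these inputs the range runs from 400 (at p = 20%) to 1900 (at p = 5%).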

 


165 thoughts on “Sample size calculation: Cross-sectional studies”

  1. Dear Dr Roopesh,

    I would like to conduct a cross-sectional study, but I am having difficulty finding the formula to calculate my sample size because the population is quite large (about 211,857). I am going to survey the knowledge, health beliefs and intentions of female adolescents towards HPV vaccination, and no previous study has been done on this topic in my country. Could you please advise me on this matter?

    Your help is greatly appreciated.

    Sincerely,
    Sekartaji

    • Dear Sekartaji,

      If I understand the question correctly, you want to know how to compute sample size from a population of 211,857 individuals.

      Please use the prevalence from the following (and similar) articles to estimate the required sample size using the formula for cross-sectional studies:
      https://www.ncbi.nlm.nih.gov/pubmed/24188759

      In order to obtain your sample, you might consider cluster or multi-stage sampling.

      Hope this helps.

      Regards,
      Dr. Roopesh

    • Dear David,

      It is not ethical or practical to unnecessarily inflate the sample size for any study.

      The commonest reason for wanting to do so would be to increase the power of the study to detect even minor differences of interest.

      Another reason could be the desire to capture as much variation in the population as possible. However, this could be achieved by adopting a good sampling method.

      Regards,
      Dr. Roopesh

  2. How do I calculate the sample size for a study in which cases will be matched with controls, given that a previous study reported a prevalence of 32%?

    • Dear Achanya,

      Do you intend to have 1:1 matching, or higher?

      I hope you realize that in a case control study one is comparing proportions of outcome between cases and controls.
      Therefore, for sample size calculation, you need to provide proportions for both cases and controls.

      Regards,
      Dr. Roopesh

  3. When calculating sample size for three communities using Slovin’s formula, if you add up the total for the three (for example, 1474) and calculate, you get about half the size (315), and you can then use the proportion formula to redistribute. However, if you calculate separately for each of the communities, with populations 350, 774 and 350, you get a total of 624. Now, if I am using mixed methods, what number should I interview: 315 or 624?

  4. In fact the design is exploratory sequential, so I will do a questionnaire survey, generalise the results, and based on that select my qualitative methods (FGDs, in-depth interviews, etc.). The three communities are made up of farmers who all practice rainfed farming, but farmers from 2 of the communities also practice dry season farming because they use small-scale dams during the dry season. Again, what are my justifications for interviewing 315 and not 624? Is it okay to do so, so that I do not incur unnecessary cost?

  5. I would like to conduct a study which hasn’t been done in my country, so how can I estimate a sample size? My study is on the influence of body mass index on liver size.
    Regards,

    • Dear Qusay,

      Even though the study hasn’t been conducted in your country, it is possible to estimate sample size.

      From literature, identify the findings reported by other investigators. They would likely have reported several measures- AP diameter/ Transverse diameter/ Volume, etc. Determine which measure is of importance to your study, and note the relationship between BMI and that specific measure.

      Identify a study that was conducted in a setting similar to your own (even if in another country, factors like setting (rural/ urban); economic status (developing/ developed); etc. could be similar).

      Then determine what proportion of subjects in that study have the relationship of interest. Use that to estimate sample size using the formula provided in the article above.

      Hope this helps.

      Regards,
      Dr. Roopesh

  6. Hi
    I am going to conduct a cross-sectional study on the prevalence of cancer in ladies around the age of menopause with an ovarian cyst, looking at a biochemical marker called Ca 125.
    I am still unable to calculate the sample size.

    • Dear Someone,

      Please perform a detailed review of literature and determine what proportion of perimenopausal women with ovarian cysts have elevated Ca 125 levels.

      Use that proportion to estimate sample size by substituting in the formula provided in the article above.

      If you get a range, estimate sample size using the lowest proportion, and use that to conduct your study if feasible.

      Regards,
      Dr.Roopesh

  7. I am going to do a survey on the banking industry, but the population sizes differ from one bank to another. How am I going to deal with that? Please help.

    • Dear Solomon,

      You could try using cluster sampling method to conduct your survey. Each Bank would constitute a cluster, and you could perform sampling proportionate to size.

      If restricted to branches of a single bank, clusters could be determined on the basis of zones or regions, with business handled (in money terms- $, ₹, etc.) determining the proportionate size of each cluster.

      Hope this helps.
      Regards,
      Dr. Roopesh

  8. Dear Dr Roopesh,

    I am conducting a cross-sectional study on the prevalence of cardiomyopathy among diabetes patients. A similar study done in my country showed a prevalence of 40%. I used the above formula for cross-sectional studies with a relative precision of 20% (of 40%). My university research committee asked me why I chose relative precision instead of absolute precision. Initially, when I was writing my proposal, I tried absolute precision, and it gave me a high sample size of 334. When I used a relative precision of 20% (of 40%), it gave me 144, which I preferred (due to the limited study budget). How do you think I should answer the above question? Could you help me specifically with reasons for using relative precision instead of absolute precision?

  9. I am conducting research on sleep disorders in children with enlarged adenoids and tonsils in a hospital in Nigeria. Kindly help me with the type of study design and sample size calculation, since I could not find a similar study or prevalence.

  10. Hello Dr Roopesh, I am planning to conduct a cross-sectional study of the clinico-pathological and demographic profile of TB cervical lymphadenopathy, without follow-up, for a minimum of 1 year … I don’t know how to calculate the sample size. Previous studies exist, but they have differing sample sizes. Please help me to calculate a sample size of around 100.

      • The objectives concern the demographic and clinico-pathological profile; the study population is OP patients and in-ward patients; the outcome measures are based on final reports.

      • THESIS PROTOCOL

        CLINICO-PATHOLOGICAL AND DEMOGRAPHIC PROFILE OF
        TUBERCULAR CERVICAL LYMPHADENOPATHY
        Thesis Protocol Submitted For
        DIPLOMATE OF NATIONAL BOARD
        (RESPIRATORY MEDICINE)

        AIMS AND OBJECTIVES

        PRIMARY OUTCOME

        • TO STUDY THE CLINICO-PATHOLOGICAL AND DEMOGRAPHIC
        PROFILE OF TUBERCULAR CERVICAL LYMPHADENOPATHY PATIENTS

        MATERIAL AND METHODS

        STUDY DESIGN

        The present study is proposed to be a cross-sectional study conducted at the NATIONAL INSTITUTE OF TB AND RESPIRATORY DISEASES among patients in both the OPD and IPD. Patients enrolled between Aug 2017 and Dec 2018 will be part of the study.

        STUDY METHOD

        Patients attending the OPD and patients in the IPD will be asked for a detailed history, and a thorough clinical examination will be done. This will be followed by all routine investigations and special tests such as the Mantoux test, USG abdomen, and FNAC of the lymph node with direct smear, cytopathological examination and culture for MTB, done at NITRD. Finally, the reports will be analyzed as in the proforma.

        SAMPLE SIZE AND STUDY PERIOD

        The study will include patients seen between Aug 2017 and Dec 2018 who give consent and are eligible for the study.

        CRITERIA FOR SELECTION OF PATIENTS

        Inclusion criteria;
        • All patients who agree to participate in the study.
        Exclusion criteria;
        • Patients who are not willing to participate in the study.
        • Patients with a primary diagnosis of other diseases (e.g. cancer, sarcoidosis, pyogenic infections, etc.).
        REVIEW OF LITERATURE
        DEMOGRAPHIC INCIDENCE;
        Mm rahman et al Out of 60 patients 40 were female and 20 were male and female male ratio was 2: 1. The most vulnerable age group was the 2nd decade 23(38.33%). The present study shows that the peak age incidence is 2nd decade of life (38.3%) and the 2nd highest incidence 3rd decade with 30%.

        Hussain et al out of 50 patients Male to female ratio is 2.1:1 most common during 2nd and 3rd decade of life (52% )with a peak incidence in the 2nd decade (32%).

        Devendra et al Out of 118 cases was found to be more prevalent in females as 30 out of 54(55.55%). In this study, we found out that TBL are commoner in 13-30 age groups, 83.33% .

        Vasuda et al out of 227 There were 113 (49.7%) female and 114 (50.3%) The maximum number [167 (73.6%)] of cases suggestive of cytomorphology of tubercular lymphadenitis were aged in the range of 11–30 years.

        Shaukat et al total 110 cases Out of these 42(38.1%) were males and 68(61.8%) were female. The majority of patients were in the age range between 10 to 30 years and next group belong to the 4th decade.
        Rasool et al Total 46 of which cases Female gender was found in the majority 28(61.87%) while male gender was 18(39.13%).
        Soumya et al A total of 63 patients were enrolled in the study of which 25 were males and 38 females The most commonly affected group in the study was 15–24 years age comprising of 57.1% (36 cases).

        Mohammed ali et al Of 115 cases, there were 71 males and 44 females. The male to female ratio in the present study was 1.61:1. The majority of patients affected were in the age group of 13 to 20 years (39.13%), followed by 21 to 30 years (28.70%). The least affected age group was 61 to 70 years (1.74%).

        Chaitali et al Data of 80 patients was analyzed in this study.Gender wise 57 (71.3%) were females and remaining 23 (28.7%) were males.

        Naresh et al Males 48% and females 52%. In 50 cases the disease commonly affected the affected were 2nd decade 18% and 3rd decade 8% respectively. Commonest age group affected is between 11and 20> 21, and 30 closely followed by 31 and 40 years .

        CLINICAL PRESENTATION

        Karthi et al, Majority did not have symptoms 16 cases (31.4%) out of 51 showed symptoms fever was the most common , seen in 31% of cases, followed by malaise in 18% . It was observed 8 cases (15.6%) out of 51 cases had a positive history contact with tb . It was observed that posterior triangle was the commonest to get involved (31.3%) followed by upper deep jugular (21.5%). Levels 1, 3 and 4 were equally involved.And the majority of nodes (78.4%) were 4 cm. It was seen in 41 cases out of total 51 cases (80.3%) had U/L involvement. The remaining (19.7%) had bilateral involvement. and multiple node involvement in 39 cases (76.5%) while 12 cases (23.5%) showed single. Matting was observed in 14 of the 51 cases (27.4%). discrete lymph nodes which was present in 37 of the 51 cases (29.7%).

        Mohankumar et al 18 cases (27.69%) out of 65 cases of tubercular showed presence of symptoms. It was observed that only 4 cases (6.15%) out of 65 cases had a positive history.It was observed that the majority of nodes affected in tuberculosis (80%) were less than 4 cm in size it was observed that Upper jugular group (level-2) was the commonest to get involved in tuberculosis (30.76%) .2-5 Among the cases only 15.39% cases presented with bilateralnode

        mmrahman et al Out of 60 patients BCG vaccination had a significant protective role; 19(31.67%) were vaccinated and 41 (68.33%) wereTuberculin test was positive in 44(73.34%) and negative in 2 (3.33%) and doubtful in 14 (23.33%).The common presentations were neck swelling 60 (100%), fever 40 (66.67%) and night sweat in 30(50%), wt loss 21(35%).

        Devendra et al In this study 1-2 cm size group were found to be having equal chances of tubercular and non-specific reactive lymphadenitis but 78.94% lymph nodes with size >2 cm were positive for tubercular lymphadenitis .Fever> anorexia>malaise>night sweats & weight loss was commoner symptoms in TBL

        Vasuda et al The study having 227 tb cervical lymphadenopathy pts
        The majority of the patients were otherwise healthy adults, and constitutional symptoms were present in 13% only. All the groups of cervical lymph node were involved including right and left cervical, posterior triangle, submental, submandibular, and supraclavicular regions.

        Zyedzulfiquer et al Study having 242 cases of tb cervical lymphadenopathy
        Most common constitutional symptoms are fever as wt loss(75%), night sweats(72%), LOA(45%).Most of the patients don’t have active contact only 28% had contact and 28% had past h/o tb treatment duration of lymphadenopathy in most of cases was less than 3 months.The size of Lymph Node was more than 1 cm and less than 2 cms in 70% of the patients. Gross appearance of Lymphadenopathy was multiple mattered in 65% of the patients with no tenderness in 78%

        Salman et al study population is 50 patients.Symptoms vary from 6 months to 2 yrs but m/c 7 wks to 3months 39 patients didn’t have any constitutional symptoms and remaining m/c had fever>malaise> LOA. H/O tb contact history was present in 19 patients. Examination showed b/l seen in 60% and location m/c post triangle(70%) f/b upper deep cervical(24%) and most of the lymphnode size was <1.15cm.

        Shaukat et al study population was 80 patients. In our study fever and weight loss are common complaint 52.7% and 63.6% respectively And b/l more common than unilateral and anterior group of nodes are more common than post group of nodes
        Rasool et al Multiple lymphadenitis was found in majority of the cases 26(56.53%), while 20(43.47%) cases were found with presentation.We found lymph node less than 3 CM found in 31(67.39%) cases and more on of single lymphadenitis than 3 CM were in15 (32.61%) cases. Fever was commonest clinical feature in 76% cases, following by swelling, abscess, solid nodes, weight loss, loss of appetite and others were noted with percentage of 55.69%, 39.13%, 45.65%, 58.69% and 21.73% respectively

        CYTO PATHOLOGICAL, CULTURE AND DIRECT SMEAR EXAMINATION
        Karthikeyan et al Out of the 51 histopathologically confirmed cases of tuberculous cervical lymphadenitis, a diagnosis of tuberculosis was made in 43 cases by FNAC. The other 7 cases were diagnosed as chronic non-specific lymphadenitis. There were no false positive cases on FNAC. 44 cases were true negative for tuberculosis. The sensitivity and specificity of FNAC for diagnosing tuberculous lymphadenitis is therefore 86% and 100% respectively .

        Mohan kumar et al In the present study, the sensitivity of FNAC for tuberculosis was only 86.20% and the specificity was 100%.

        Mm rahman et al In this study among 60 patients 44 (73.34%) were tuberculin positive (more than 10 mm induration), 14 (23.33%) were doubtful (between 1-10 mm) and 2 (3.33%) were negative (no induration seen). Among the 60 patients of tuberculous cervical lymphadenitis 51 (85%) had caseation.

        Vasuda et al In this study, the cytomorphological features observed in the cases were caseating epithelioid granulomas [47.6% (108/227)], granulomatous lymphadenitis [33.9% (77/227)], necrotizing lymphadenitis [1.8% (4/227)], and necrotizing suppurative lymphadenitis [16.7% (38/227)]. ZN staining for AFB was done in all the cases. Smear positivity for Mycobacterium sp. by conventional ZN method was 19.4% (44/227). AFB positivity was the maximum (44.7%) in necrotizing suppurative lymphadenitis.
        The appearance of aspirates found more commonly was blood mixed in 68.3% cases, followed by whitish cheesy material in 21.1%, pus-like in 6.2%, and yellowish in 4.4%. AFB positivity was the maximum (42.8%) in pus-like aspirate.

        Salman et al The study having population of 50 cases of which 41(82%) cases have been confirmed by FNAC. AFB seen in by direct smear examination in 12 cases and 9(18%) needed excisinal biopsy to confirm the diagnosis.

        Soumyajit et al FNAC was diagnostic in 42 cases (73.7%) where epitheloid granuloma and Langhan’s cells with or without necrosis was seen. The aspirate from affected lymph nodes did not reveal AFB in most of the cases. Only 23 samples (40.4%) revealed AFB after ZN staining. FNAC was non specific in 15 samples which further required incision/ excision biopsy for diagnosis.

        PROFORMA

        CASE NO: OPD REG NO:
        NAME: FATHER/HUSBAND NAME:
        AGE: SEX:
        OCCUPATION: MARITAL STATUS:
        AREA:

        PRESENTING COMPLAINT: DURATION
        LYMPHNODE ENLARGEMENT:
        FEVER:
        COUGH:
        WEIGHT LOSS:
        LOSS OF APPETITE:
        CHEST PAIN:
        OTHERS POSITIVE HISTORY:

        PAST HISTORY:
        TUBERCULOSIS:
        HYPERTENSION:
        DIABETES:
        HIV:
        SURGICAL INTERVENTION:
        BLOOD TRANSFUSION:
        OTHER PAST SIGNIFICANT HISTORY:

        PERSONAL HISTORY:
        H/O SMOKING:
        H/O ALCOHOL:
        H/O DRUG ABUSE:
        BLADDER AND BOWEL COMPLAINT:
        H/O CONTACT WITH TB:
        NO OF CHILDREN:

        TREATMENT HISTORY:
        H/O ATT:
        ANY OTHER MEDICATION:

        GENERAL EXAMINATION:
        TEMPERATURE:
        B.P: PULSE: RESPIRATORY RATE:
        PALLOR: ICTERUS: CLUBBING: CYANOSIS: PEDAL EDEMA:
        BCG SCAR:
        LYMPHNODE :

        SYSTEMIC EXAMINATION
        CVS:

        RS:

        P/A:

        CNS:

        INVESTIGATIONS REPORTS;
        HB: TLC: DLC: ESR:
        Blood sugar(random): UREA: CREATININE:
        S.BILIRUBIN:Total- Direct- SGOT/SGPT/ALP:
        S.PROTEIN:Total- Albumin-
        URINE:Albumin- sugar- microscopy
        Sputum for AFB(D/S):
        X-ray CHEST:
        USG abdomen:
        FNAC report:
        AFB by D/S:
        CULTURE report:

        • I am not sure I understand what exactly you intend to do.

          You will recruit patients with tuberculous cervical lymphadenopathy, and obtain some information- this much is clear.

          What is not clear is what question you are trying to answer by collecting that information. That is why I requested you to provide your research question in PICO format.

          Please note that unless you provide an answerable research question, I will be unable to provide additional assistance.

          Regards,
          Dr. Roopesh

  11. Hello, please, I am trying to correlate two variables in estimating the severity of chronic liver disease. How do I go about calculating my sample size, since it is a cross-sectional study I am conducting? Thanks.

    • Dear Adaze,

      Please use the formula provided above: 4pq/ l^2.

      If you provide details of your objectives and outcome variables, I might be able to provide specific guidance.

      Please note that I will be very busy this week, so might not be able to respond before the weekend.

      Regards,
      Dr. Roopesh

  12. hello dr.
    My study is to identify the number of stem cells in a diabetic patient group and a non-diabetic group, and then compare the two groups. So is it a comparative cross-sectional design or case-control? And how can I estimate the sample size?

    • Dear Sara,

      What is your research question? The study design is determined by the research question.

      Please formulate your research question using the PICO criteria and revert to me.

      Please note that I will be very busy over the coming week, hence might be unable to respond before the weekend.

      Regards,
      Dr. Roopesh

      • thanks dr. for replying…
        my research question is:
        in mild gestational diabetic women, is the number and quality of the haematopoietic stem cells of umbilical cord blood affected compared to non-gestational diabetic women?

  13. Dear Roopesh,

    I am still confused about sample size calculation. My study is on the prevalence of, and factors associated with, cardiomyopathy among diabetic patients. I wanted to use a prevalence of 67.8 (from a similar study done in my country). Please show me what your sample size would be, so that I can compare it with what I got (which I think is not correct). Use absolute precision and a 95% confidence interval.

    With regards,
    Boniface
