Disclaimer: This article is primarily intended for my students. However, others may also find it useful.
Distribution: This indicates how values are spread. If all values were plotted on a graph, one would be able to see how and where the values are spread about. This is the ‘distribution’ in question.
When values are plotted, they may fall into one of several patterns: toward one side, irregular, or bell-shaped. When most values fall toward one side or the other, the distribution is said to be ‘skewed’. When values fall so that a bell shape is created, the values are said to be ‘normally’ distributed.
The use of the term ‘normal’ does not suggest that other distributions are somehow ‘abnormal’. Here, ‘normal’ is simply the name of the distribution, not a judgement about the data.
The normal distribution is also called a ‘Gaussian distribution’, after Carl Friedrich Gauss, who contributed to its understanding. It consists of a curve, below which all values are located.
In a normal distribution, the mean = median = mode. That is, the average value is also the middle value, and the most frequently occurring value. This property of the normal distribution is responsible for the peak in the middle, because the most frequently occurring value is the middle value. Another consequence of this property is that 50% of the values are on one side of the mean, and 50% of the values are on the other side of the mean (since mean = median). Moreover, the distribution of values is identical on either side of the mean. This is how the normal distribution becomes bilaterally symmetrical ‘around the mean’ (meaning, with the mean in the middle), and acquires its characteristic ‘bell shape’. In other words, one half of the normal distribution is the mirror image of the other half.
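The property above can be checked with a small simulation. This is only a sketch: the mean of 50, the standard deviation of 10 and the number of values are arbitrary choices made for illustration, and the mode is left out because simulated continuous values almost never repeat exactly.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Assumed illustration: 100,000 values drawn from a normal distribution
# with mean 50 and standard deviation 10 (arbitrary numbers).
values = [random.gauss(50, 10) for _ in range(100_000)]

mean = statistics.fmean(values)
median = statistics.median(values)

# Fraction of values lying below the mean; symmetry suggests about one half.
share_below_mean = sum(v < mean for v in values) / len(values)

print(f"mean   = {mean:.2f}")
print(f"median = {median:.2f}")
print(f"share of values below the mean = {share_below_mean:.3f}")
```

The mean and median should come out very close to each other, with roughly half of the values on either side.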
This distribution is important because it forms the basis for several key statistical theorems. Moreover, it is useful in describing real-world data because of its characteristics.
When many samples of sufficient size are taken from a population, the distribution of the sample means approximately follows a normal distribution, whatever the shape of the population's own distribution (this is the central limit theorem). You can check this for yourself using the following applet developed by Rice University:
Tip (Applet use): To save time, simply click on ‘Begin’, then click on the drop-down menu next to the first panel (on the right of the first panel) and select ‘custom’. Create a distribution by clicking the left mouse button and dragging it in an arbitrary manner from left to right. Then, click on the ‘100,000’ button to the right of the second panel. See what emerges on the third panel. You can choose any of the other available options, but in each case, as the number of samples increases, the distribution of sample means (third panel) assumes a bell shape, that is, a normal distribution.
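If the applet is not available, the same idea can be tried in a few lines of Python. This is a rough sketch and not a reproduction of the applet: the skewed population (an exponential distribution), the sample size of 30 and the 10,000 repetitions are all assumptions chosen for illustration. Skewness is used as a simple measure of lopsidedness; it is zero for a perfectly symmetrical distribution.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

def skewness(data):
    """Moment-based skewness: 0 for a perfectly symmetrical distribution."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return statistics.fmean([((x - m) / s) ** 3 for x in data])

def sample_mean(sample_size):
    """Mean of one sample drawn from a strongly skewed (exponential) population."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])

# Individual values from the skewed population.
population_draws = [random.expovariate(1.0) for _ in range(10_000)]

# Means of 10,000 samples, each containing 30 values.
sample_means = [sample_mean(30) for _ in range(10_000)]

print(f"skewness of individual values: {skewness(population_draws):.2f}")
print(f"skewness of sample means:      {skewness(sample_means):.2f}")
```

The individual values should be clearly skewed, while the sample means should be much closer to symmetrical. Increasing the sample size should bring the second figure closer to zero.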
Unfortunately, not all normal distribution curves are identical in the height of the middle ‘peak’ or in their width around the mean: some are tall and narrow, others short and wide (see image above). This poses a problem because statistical calculations would have to vary with the nature of the normal distribution. To solve this problem, statisticians devised a Standard Normal Distribution, which has fixed properties. This involves standardizing the values by converting observations into z-scores.
Standard Normal Distribution
A standard normal distribution is a special type of normal distribution that has a mean of zero and a standard deviation of one.
Like any other normal curve, it is bilaterally symmetrical, and has a bell shape. The area under the standard normal curve is equal to one.
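To see where the mean of zero and standard deviation of one come from, here is a minimal sketch of standardization. The eight observations are made up purely for illustration, and the set is treated as the whole population, so the population standard deviation (`pstdev`) is used.

```python
import statistics

# Made-up observations, treated here as the entire population.
observations = [8, 10, 11, 12, 12, 13, 14, 16]

mu = statistics.fmean(observations)      # mean of the original values
sigma = statistics.pstdev(observations)  # population standard deviation

# Standardize: subtract the mean, then divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in observations]

print(f"original:     mean = {mu:.2f}, sd = {sigma:.2f}")
print(f"standardized: mean = {statistics.fmean(z_scores):.2f}, "
      f"sd = {statistics.pstdev(z_scores):.2f}")
```

Whatever the original mean and standard deviation were, the standardized values end up with a mean of zero and a standard deviation of one.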
Due to the fixed properties of the standard normal curve, statisticians are able to estimate probabilities for locations under the curve.
We use z-scores to indicate the location of any value under the curve. However, the distance from the mean is expressed not in the original units of measurement, but in terms of how many standard deviations away from the mean the value lies.
To calculate the z-score, one needs the following:
- a number of observations
- the mean of those observations
- the population standard deviation
With the above, the z-score for any observation is calculated as follows:
z-score = (observation – mean) / standard deviation of the population
Naturally, z-scores may take both negative and positive values, depending on whether the observed value is smaller or larger than the mean. Negative values fall to the left of the mean, while positive values fall to the right of the mean. An observation equal to the mean has a z-score of zero.
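The formula and the sign rule can be written as a short function. This is just a sketch: the function name `z_score` and the numbers used (a mean of 100 and a standard deviation of 5) are made up for illustration.

```python
def z_score(observation, mean, population_sd):
    """Number of standard deviations between an observation and the mean."""
    if population_sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (observation - mean) / population_sd

# Hypothetical values: mean = 100, population standard deviation = 5.
print(z_score(110, 100, 5))  # observation above the mean -> 2.0
print(z_score(90, 100, 5))   # observation below the mean -> -2.0
print(z_score(100, 100, 5))  # observation equal to the mean -> 0.0
```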
A researcher measured the hemoglobin of 100 students, and found that the mean was 12, with standard deviation of 2. What is the z-score for a hemoglobin of 15?
We know that z-score is given as
z= (observation – mean)/ standard deviation
Substituting, we have
z = (15 – 12)/ 2
z = 3/2 = 1.5
What does the z-score mean?
It means that a hemoglobin of 15 is located 1.5 standard deviations above the mean (the z-score is positive, so the value lies to the right of the mean).
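The same example can be reproduced with Python's built-in `statistics.NormalDist`. Note the assumption: the example only gives a mean and a standard deviation, so treating hemoglobin values as normally distributed is something added here for illustration. Under that assumption, we can also ask what share of values would lie below 15.

```python
from statistics import NormalDist

# Assumption for illustration: hemoglobin is normally distributed
# with the mean (12) and standard deviation (2) from the example.
hemoglobin = NormalDist(mu=12, sigma=2)

# z-score of an observation of 15, using the formula from the text.
z = (15 - hemoglobin.mean) / hemoglobin.stdev

# Share of values expected to lie below 15 under the assumed distribution.
share_below = hemoglobin.cdf(15)

print(f"z-score = {z}")                       # 1.5
print(f"share below 15 = {share_below:.3f}")  # 0.933
```

So, if the normal assumption holds, a hemoglobin of 15 would be higher than about 93% of values.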
How is all this useful?
Thanks to the work of statisticians, we know that in a standard normal curve,
- about 68% of all values fall within one standard deviation of the mean;
- about 95% of all values fall within 2 standard deviations of the mean; and
- about 99.7% of all values fall within 3 standard deviations of the mean.
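These percentages need not be taken on faith. A minimal sketch using Python's `statistics.NormalDist` computes the area under the standard normal curve between -k and +k standard deviations; the helper name `share_within` is made up for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Area under the standard normal curve between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

The figures in the list above are rounded versions of these values.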
Knowing this enables us to determine how extreme a particular value is with respect to its location under the curve. The farther a value lies from the mean, the more extreme it is. This property has been used to create growth charts, for instance, where more extreme values indicate more severe malnutrition. We will discuss growth charts in the next article.
Link to a simple explanation of normal distribution:
Link to Wikipedia article on Standard Normal Table (contains description of how to calculate z-score):