RELATIVE VALUES AND THEIR TYPES
If the same
conditions always produce the same result, you don’t need Statistics.
But things do vary. Some variation is meaningful,
some is not. Often the biggest challenge in science is to tell the difference between meaningful pattern and chance-like
variability. (What's the signal? What's the noise?) Graphs
and numerical summaries like averages and percentages can often reveal meaningful patterns that might otherwise remain hidden because of
variability. We have described our Seven-step Method for statistical investigation. Step 4 is to
describe your data. The goal of that
step is to find summaries and plots that show patterns and help separate meaningful patterns from nuisance variation.
We start with
a standard format for data: A data table (statistical spreadsheet) has one row
for each observational unit, one column for each
variable. Our goal as statisticians is to go from that table to numerical summaries and graphs that give us information
about variability and pattern.
Example 0.2A: World Mental Health Survey
In Section 0.1
we talked about how Statistics is a discipline that guides us in weighing
evidence about phenomena in the world
around us. Typically, Statistics weighs evidence that comes in the form of data stored in a data file. The rows of the data file represent the observational units, which are the individuals (not necessarily people) being measured in
the study. The columns represent the variables, the characteristics of the observational units.
So, each entry in the data file gives the
value of the variable for the observational unit of interest.
One effort of the World Mental Health Organization (WMHO) is to evaluate the
frequency of mental health disorders
and their impact on individuals in countries around the world.
Table 0.1 gives an example data file from a survey conducted by the WMHO of residents of
the United States. The survey was conducted on a
representative sample of 1,860 individuals living in the United States in 2001 to 2002. Notice the file is organized so
each observational unit (in this case a person) occurs
on a single row of the data file. For example, the first row is an 18-year-old Hispanic male. The names or identifiers of the observational
units are provided on the left-hand side of the table;
in this case, they are ID numbers. The number of observational units is 1,860 because that is the sample size. So, if Table 0.1 were complete, it
would have 1,860 rows, one for each person.
Notice how each column of the data file gives information on a different
characteristic of each observational unit. The
names of the variables for this data table are provided in blue at the top of each column of the data table. As these data are from a survey,
most columns represent the answer to a
single question on the survey. For example, the second column is the answer to a question asking for the respondent's sex. It is important to
note that sometimes variables don't have
information on all of the observational units. Notice the "Years Married" variable above: it only has information if a person's marital
status is "Married"; otherwise it contains
nothing. Sometimes variables don't contain values for some observational units for legitimate reasons, for example
if the variable does not apply to all observational units. In other cases, a variable might not have values for all
observational units for less legitimate reasons, such as a
person accidentally skipping a question on a survey. Different statistical analysis programs have different ways of representing
"missing" data, including "NA", "." or just leaving an empty box.
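The row-and-column layout and the two kinds of missing value can be sketched in a few lines of Python. This is only an illustration: the variable names are our own, and every value is invented except that the first row is an 18-year-old Hispanic male. Here `None` plays the role of the empty box.

```python
# Hypothetical mini data table: one dict per observational unit (row),
# one key per variable (column). Values are invented for illustration.
rows = [
    {"id": 1, "age": 18, "sex": "Male", "ethnicity": "Hispanic",
     "marital_status": "Never married", "years_married": None},
    {"id": 2, "age": 45, "sex": "Female", "ethnicity": "White",
     "marital_status": "Married", "years_married": 20},
    {"id": 3, "age": 33, "sex": "Female", "ethnicity": "Black",
     "marital_status": "Married", "years_married": None},
]

def missing_ids(rows, variable):
    """Return the ids of observational units with no value for `variable`."""
    return [r["id"] for r in rows if r[variable] is None]

# Missing for a legitimate reason: the variable does not apply to the person.
not_applicable = [r["id"] for r in rows
                  if r["years_married"] is None
                  and r["marital_status"] != "Married"]

# Missing although the variable applies (for example, a skipped question).
skipped = [r["id"] for r in rows
           if r["years_married"] is None
           and r["marital_status"] == "Married"]

print(missing_ids(rows, "years_married"), not_applicable, skipped)
```

Separating the two lists matters in practice, because only the second kind of gap signals a possible problem with data collection.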
OF VARIABLES, THEIR VALUES, AND THEIR DISTRIBUTIONS
As the name
suggests, a variable varies, that is, it takes on different values for
different cases. Depending on its values, a
variable is either quantitative or categorical. For a quantitative variable, it
makes sense to do arithmetic (add, subtract, etc.) with the values. Examples
are height, weight, distance and time. For a categorical
variable the values are
labels for which arithmetic does not make
sense. Examples are sex, ethnicity, and eye color. The two kinds of variables
lead to different kinds of summaries. For example, you can compute an average value or median for a quantitative variable like height, but
not for a categorical variable like ethnicity.
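As a small sketch of that difference, the Python lines below compute an average and a median for a quantitative variable and proportions for a categorical one. The values are made up for illustration and are not taken from any survey.

```python
from collections import Counter
from statistics import mean, median

# Invented values for illustration only.
heights_cm = [160, 172, 168, 181, 175]                      # quantitative
eye_colour = ["brown", "blue", "brown", "green", "brown"]   # categorical

# Arithmetic makes sense for a quantitative variable.
avg_height = mean(heights_cm)
med_height = median(heights_cm)

# For a categorical variable we count labels and report proportions.
counts = Counter(eye_colour)
proportions = {label: n / len(eye_colour) for label, n in counts.items()}

print(avg_height, med_height, proportions)
```

Asking for `mean(eye_colour)` would raise an error, which is the software's way of saying that arithmetic on labels does not make sense.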
Much of the rest of this section illustrates some useful summaries, but first, you need the key idea of a distribution. Statistics relies on looking at a lot of cases all
at once, rather than one case at a time. The key idea
is the distribution of a variable:
For large datasets
like the WMHO survey, it is hard to detect patterns among the thousands of
cases just by looking at a list of values. By thinking instead of the
distribution as a whole, we are led to various ways to describe, summarize and
compare distributions, much as a naturalist would describe and compare
different plants or animals.
The most common
summaries for distributions are either numerical or graphical.
You don't need a definition, because the names mean what you would expect, and
you can get the idea from examples. Here are several based on the WMHO survey:
Numerical summaries, categorical variables:
The proportion of females in the survey is 0.553.
The proportion of Hispanics in the survey is 0.097.
Graphical summaries, categorical variable: [figure not reproduced]
Numerical summaries, quantitative variables:
Average age for married individuals is 52.
Average age for those who have never married is 42.
Graphical summaries, quantitative variables: [figure not reproduced]
Types of data
Just as a farmer gathers
and processes a crop, a statistician gathers and processes data. For this
reason the logo for the UK Royal Statistical Society is a sheaf of wheat. Like
any farmer who knows instinctively the difference between oats, barley and
wheat, a statistician becomes an expert at discerning different types of data.
Some sections of this book refer to different data types and so we start by
considering these distinctions. Figure 1.2 shows a basic summary of data types,
although some data do not fit neatly into these categories.
Categorical or qualitative data
Nominal or categorical data are data that one can name and put into categories. They are
not measured but simply counted.
They often consist of unordered ‘either–or’ type observations which have two categories
and are often known as binary. For example: Dead or Alive; Male or
Female; Cured or Not Cured; Pregnant or Not Pregnant. In Table 1.1
having a first-degree relative with cancer, or taking regular exercise are binary
variables. However, categorical data can often have more than two categories, for example: blood group O, A, B, AB, country of origin,
ethnic group or eye colour. In Table 1.1 marital status is of this type. The
methods of presentation of nominal data are limited in scope. Thus,
Table 1.1 merely gives the number and percentage of people by marital status.
If there are more than two categories of classification it may be possible to order them in
some way. For example, after treatment a patient may be either improved, the
same or worse; a woman may never have conceived, conceived but spontaneously
aborted, or given birth to a live infant. In Table 1.1 education is
given in three categories: none or elementary school, middle school,
college and above. Thus someone who has been to middle school has more
education than someone from elementary school but less than someone from
college. However, without further knowledge it would be wrong to ascribe a
numerical quantity to position; one cannot say that someone who had middle
school education is twice as educated as someone who had only
elementary school education. This type of data is also known as ordered
categorical data. In some studies it may be appropriate to assign ranks. For example, patients with rheumatoid
arthritis may be asked to order their preference for four dressing aids. Here,
although numerical values from 1 to 4 may be assigned to each aid, one
cannot treat them as numerical values. They are in fact only codes for best,
second best, third choice and worst.
Numerical or quantitative data
Table 1.1 gives details of the number of pregnancies each woman had had, and this is termed
count data. Other examples are often counts per unit of time such as the
number of deaths in a hospital per year, or the number of attacks of asthma
a person has per month. In dentistry, a common measure is the number of
decayed, filled or missing teeth (DFM).
Measured or numerical continuous
data are measurements that can, in
theory at least, take any value within a given range. These data contain the most information, and
are the ones most commonly used in statistics. Examples of continuous data
in Table 1.1 are: age, years of menstruation and body mass index.
For simplicity, it is often the case in medicine that continuous data are dichotomised
to make nominal data. Thus diastolic blood pressure, which is continuous, is
converted into hypertension (>90 mmHg) and normotension (≤90 mmHg). This
clearly leads to a loss of information. There are two main reasons for doing
this. It is easier to describe a population by the proportion of people
affected (for example, the proportion of people in the population with hypertension
is 10%). Further, one often has to make a decision: if a person has
hypertension, then they will get treatment, and this too is easier if the population
is grouped. One can also divide a continuous variable into more than two groups. In Table 1.1 per capita income
is a continuous variable and it has been divided into four groups to summarise it, although a better choice may have been to split at the more convenient
and memorable intervals of 4000, 6000 and 8000 yuan. The authors give no
indication as to why they chose these cut-off points, and a reader has to
be very wary to guard against the fact that the cuts may be chosen to make a particular point.
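The loss of information from grouping can be seen in a short Python sketch. The 90 mmHg cut-off and the 4000, 6000 and 8000 yuan cut points come from the discussion above; the function names, the readings and the choice of which group a boundary value joins are our own assumptions.

```python
from bisect import bisect_right

def classify_diastolic(dbp_mmhg, cutoff=90):
    """Dichotomise a diastolic pressure: above the cut-off is hypertension."""
    return "hypertension" if dbp_mmhg > cutoff else "normotension"

def income_group(income, cuts=(4000, 6000, 8000)):
    """Return the group index 0..3; a value equal to a cut joins the upper group
    (an assumed convention)."""
    return bisect_right(cuts, income)

# Invented readings (mmHg) for illustration.
readings = [72, 85, 90, 91, 104]
labels = [classify_diastolic(x) for x in readings]
proportion_hypertensive = labels.count("hypertension") / len(labels)

# 91 and 104 now carry the same label: the size of the excess is lost.
print(labels, proportion_hypertensive)
print([income_group(v) for v in (3500, 4000, 7999, 9000)])
```

The single proportion is easy to report, but two patients who differ by 13 mmHg have become indistinguishable.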
Interval and ratio scales
One can distinguish between interval and ratio scales. In an interval
scale, such as body temperature or calendar dates, a difference between two measurements has meaning, but their ratio
does not. If we measure temperature in degrees centigrade, then
we cannot say that a temperature of 20°C is twice as hot as a temperature of 10°C. In a
ratio scale, however, such as
body weight, a 10% increase implies the same weight increase whether expressed in kilograms or
pounds. The crucial difference is that in a ratio scale, the value of zero has
real meaning, whereas in an interval scale, the position of zero is
arbitrary. One difficulty with giving ranks to ordered categorical data is that one cannot assume that the scale
is interval. Thus, as we have indicated when discussing ordinal data, one
cannot assume that risk of cancer for an individual educated to middle school
level, relative to one educated only to primary school level is the
same as the risk for someone educated to college level, relative to someone
educated to middle school level. Were Xu et al (2004) simply to score the
three levels of education as 1, 2 and 3 in their subsequent analysis, then
this would imply in some way that the intervals have equal weight.
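A quick way to see the difference between interval and ratio scales is to change the units and check whether a ratio survives. The sketch below uses the standard Celsius to Fahrenheit conversion and an approximate kilogram to pound factor; the 70 kg body weight is an invented example.

```python
def c_to_f(celsius):
    """Convert degrees centigrade to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def kg_to_lb(kg):
    """Convert kilograms to pounds (approximate factor)."""
    return kg * 2.20462

# Interval scale: the ratio depends on the (arbitrary) position of zero.
ratio_celsius = 20 / 10
ratio_fahrenheit = c_to_f(20) / c_to_f(10)

# Ratio scale: a 10% increase is a 10% increase in any unit.
ratio_kg = 77 / 70
ratio_lb = kg_to_lb(77) / kg_to_lb(70)

print(ratio_celsius, ratio_fahrenheit)
print(ratio_kg, ratio_lb)
```

The temperature ratio changes from 2 to 1.36 when the unit changes, so "twice as hot" has no meaning, whereas the weight ratio stays at 1.1 in either unit.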
How a statistician can help
Statistical ideas relevant to good design and analysis are not easy and we would always advise an
investigator to seek the advice of a statistician at an early stage of an
investigation. Here are some ways the medical statistician might help.
Sample size and power
One of the commonest questions asked of a consulting statistician is: How large should my study be? If
the investigator has a reasonable amount of knowledge as to the likely
outcome of a study, and potentially large resources of finance and time, then
the statistician has tools available to enable a scientific answer to be made
to the question. However, the usual scenario is that the investigator has
either a grant of a limited size, or limited time, or a limited pool of patients.
Nevertheless, given certain assumptions the medical statistician is still able
to help. For a given number of patients the probability of obtaining effects of a
certain size can be calculated. If the outcome variable is simply success or
failure, the statistician will need to know the anticipated percentage of successes in
each group so that the difference between them can be judged of potential
clinical relevance. If the outcome variable is a quantitative measurement, he
will need to know the size of the difference between the two groups, and
the expected variability of the measurement. For example, in a survey to
see if patients with diabetes have raised blood pressure the medical
statistician might say, ‘with 100 diabetics and 100 healthy subjects in this survey and
a possible difference in blood pressure of 5 mmHg, with standard deviation of
10 mmHg, you have a 20% chance of obtaining a statistically significant
result at the 5% level’. This statement means that one would anticipate that in only one
study in five of the proposed size would a statistically significant result be
obtained. The investigator would then have to decide whether it was sensible or
ethical to conduct a trial with such a small probability of success. One
option would be to increase the size of the survey until success (defined as a
statistically significant result if a difference of 5 mmHg or more does truly
exist) becomes more probable.
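The kind of calculation behind such a statement can be sketched as follows. This is a rough normal approximation for a two-sided comparison of two equal-sized groups, not necessarily the method a consulting statistician would use; the function name and its arguments are our own.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n_per_group, difference, sd, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal groups."""
    z = NormalDist()
    se = sd * sqrt(2 / n_per_group)       # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)     # about 1.96 for alpha = 0.05
    shift = difference / se
    # Probability that the test statistic falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Power for a true difference of 5 mmHg with standard deviation 10 mmHg.
for n in (10, 25, 50, 100):
    print(n, round(approx_power(n, difference=5, sd=10), 2))
```

Under this approximation, a difference of 5 mmHg with a standard deviation of 10 mmHg gives a power of about 20% with roughly 10 subjects per group and above 90% with 100 per group. The quoted statement is therefore best read as an example of how such advice is worded, and the loop shows how enlarging the survey makes success more probable.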
et al (2004), in their survey of original articles in three UK general practice journals, found
that the most common design was that of a cross-sectional or questionnaire survey,
with approximately one third of the articles classified as such.
For all but the smallest data sets it is desirable to use a computer for statistical analysis.
The responses to a questionnaire will need to be easily coded for computer analysis and a
medical statistician may be able to help with this. It is important to ask for
help at an early stage so that the questionnaire can be piloted and modified
before use in a study.
Choice of sample and of control subjects
The question of whether one has a representative sample is a typical problem faced by statisticians. For
example, it used to be believed that migraine was associated with
intelligence, perhaps on the grounds that people who used their brains were more
likely to get headaches, but a subsequent population study failed to reveal any
social class gradient and, by implication, any association with
intelligence. The fallacy arose because intelligent people were more likely to consult
their physician about migraine than the less intelligent.
In many studies an investigator will wish to compare patients suffering from a certain disease with
healthy (control) subjects. The choice of the appropriate control
population is crucial to a correct interpretation of the results.
Design of study
It has been emphasised that design deserves as much
consideration as analysis, and
a statistician can provide advice on design. In a clinical trial, for example, what is known as a
double-blind randomised design is nearly always preferable, but not always
achievable. If the treatment is an
intervention, such as a surgical procedure, it might be impossible to prevent
individuals knowing which treatment they are receiving, but it should be
possible to shield their assessors from knowing.
Investigators often appreciate the effect that biological variation has in patients, but overlook or
underestimate its presence in the laboratory. In dose–response studies, for
example, it is important to assign treatment at random, whether the
experimental units are humans, animals or test tubes. A statistician can also
advise on quality control of routine laboratory measurements and the
measurement of within- and between-observer variation.
A well-chosen figure or graph can summarise the results
of a study very concisely. A
statistician can help by advising on the best methods of displaying data. For example, when
plotting histograms, choice of the group interval can affect the shape of the
plotted distribution; with too wide an interval important features of the data will be
obscured; too narrow an interval and random variation in the data may
distract attention from the shape of the underlying distribution.
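The effect of the group interval is easy to try out. The sketch below draws crude text histograms of the same invented measurements with three interval widths; the helper name and the data are illustrative assumptions.

```python
from collections import Counter

def histogram_counts(values, width, start=0):
    """Count values per bin [start + k*width, start + (k+1)*width)."""
    bins = Counter((v - start) // width for v in values)
    return {start + k * width: bins[k] for k in sorted(bins)}

# Invented integer measurements for illustration.
values = [61, 63, 64, 66, 67, 67, 68, 70, 71, 73, 74, 79, 88]

for width in (2, 5, 20):
    print(f"width {width}:")
    for left, n in histogram_counts(values, width).items():
        print(f"  {left:>3}-{left + width - 1:<3} {'#' * n}")
```

With a width of 20 nearly everything falls into one bar, while a width of 2 produces a ragged row of short bars; the middle choice shows the main cluster and the outlying value most clearly.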
Choice of summary
statistics and statistical analysis
The summary statistics used
and the analysis undertaken must reflect the basic design of the study
and the nature of the data. In some situations, for example, a median is a
better measure of location than a mean. In a matched study, it is important to
produce an estimate
of the difference between matched pairs, and an estimate of the reliability of that
difference. For example, in a study to examine blood pressure measured in a seated patient
compared with that measured when he is lying down, it is insufficient simply to report
statistics for seated and lying positions
separately. The important statistic is the change in blood pressure as the patient changes
position and it is the mean and variability of this difference that we are interested in.
This is further discussed in Chapter 8. A statistician can advise on the choice
of summary statistics, the type of analysis and the presentation of the results.
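For the matched blood pressure example, the summary of interest can be sketched like this. The readings are invented for five hypothetical patients, and the standard error is used here as one simple measure of the reliability of the mean change.

```python
from math import sqrt
from statistics import mean, stdev

# Invented blood pressures (mmHg) for five patients, for illustration only.
seated = [128, 135, 142, 120, 131]
lying = [124, 133, 136, 121, 126]

# One change per patient: this keeps the pairing intact.
differences = [s - l for s, l in zip(seated, lying)]

mean_change = mean(differences)
sd_change = stdev(differences)
se_change = sd_change / sqrt(len(differences))

print(differences, mean_change, round(sd_change, 2), round(se_change, 2))
```

Reporting only the separate means of `seated` and `lying` would discard the pairing and say nothing about how variable the change is from one patient to the next.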
Every statistics book provides a listing of statistical
distributions, with their properties, but browsing through these choices can be
frustrating to anyone without a statistical background, for two reasons. First,
the choices seem endless, with dozens of distributions competing for your
attention, with little or no intuitive basis for differentiating between them.
Second, the descriptions tend to be abstract and emphasize statistical
properties such as the moments, characteristic functions and cumulative
distributions. In this appendix, we will focus on the aspects of distributions
that are most useful when analyzing raw data and trying to fit the right
distribution to that data.
When confronted with data that needs to be characterized by a distribution, it is
best to start with the raw data and answer four basic questions about the data
that can help in the characterization. The first relates to whether the data
can take on only discrete values or
whether the data is continuous; whether a new pharmaceutical drug gets FDA
approval or not is a discrete value but the revenues from the drug represent a
continuous variable. The second looks at the symmetry
of the data and if there is
asymmetry, which direction it lies in; in other words, are positive and
negative outliers equally likely or is one more likely than the other. The
third question is whether there are upper
or lower limits on the data; there are some data items like revenues that
cannot be lower than zero whereas there are others like operating margins that
cannot exceed a value (100%). The final and related question relates to the likelihood of observing extreme values in the distribution; in some data, the extreme
values occur very infrequently whereas in others, they occur more often.
Is the data discrete or continuous?
The first and most obvious categorization of data
should be on whether the data is restricted to taking on only discrete values
or if it is continuous. Consider the inputs into a typical project analysis at
a firm. Most estimates that go into the analysis come from distributions that
are continuous; market size, market share and profit margins, for instance, are
all continuous variables. There are some important risk factors, though, that
can take on only discrete forms, including regulatory actions and the threat of
a terrorist attack; in the first case, the regulatory authority may dispense
one of two or more decisions which are specified up front and in the latter,
you are subjected to a terrorist attack or you are not.
With discrete data, the entire distribution can
either be developed from scratch or the data can be fitted to a pre-specified
discrete distribution. With the former, there are two steps to building the
distribution. The first is to identify the possible outcomes and the second is
to estimate the probability of each outcome. As we noted in the text, we can draw
on historical data or experience as well as specific knowledge about the
investment being analyzed to arrive at the final distribution. This process is relatively
simple to accomplish when there are a few outcomes with a well-established
basis for estimating probabilities but becomes more tedious as the number of
outcomes increases. If it is difficult or impossible to build up a customized
distribution, it may still be possible to fit the data to one of the following
discrete distributions:
a. Binomial distribution: The binomial distribution measures the probabilities of the number of successes over a
given number of trials with a
specified probability of success in each try. In the simplest scenario of a
coin toss (with a fair coin), where the probability of getting a head with each
toss is 0.50 and there are a hundred trials, the binomial distribution will
measure the likelihood of getting anywhere from no heads in a hundred tosses
(very unlikely) to 50 heads (the most likely) to 100 heads (also very
unlikely). The binomial distribution in this case will be symmetric, reflecting
the even odds; as the probabilities shift from even odds, the distribution will
get more skewed. Figure 6A.1 presents binomial distributions for three
scenarios – two with 50% probability of success and one with a 70% probability
of success and different trial sizes.
As the probability of success is varied
(from 50%) the distribution will also shift its shape, becoming positively
skewed for probabilities less than 50% and negatively skewed for probabilities
greater than 50%.
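The coin toss case and the direction of the skew can be checked with a few lines of Python. The sketch uses the standard binomial probability formula and the standard expression for binomial skewness; the function names are our own.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_skewness(n, p):
    """Skewness of the binomial: (1 - 2p) / sqrt(n p (1 - p))."""
    return (1 - 2 * p) / sqrt(n * p * (1 - p))

n = 100
fair = [binomial_pmf(k, n, 0.5) for k in range(n + 1)]
most_likely = max(range(n + 1), key=lambda k: fair[k])

print(most_likely, fair[0], fair[50])
for p in (0.3, 0.5, 0.7):
    print(p, round(binomial_skewness(n, p), 3))
```

The most likely count is 50 heads, the chance of no heads at all is vanishingly small, and the skewness is positive at 30%, zero at 50% and negative at 70%.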
b. Poisson distribution: The Poisson
distribution measures the likelihood of a number of events occurring
within a given time interval, where the key parameter that is required is
the average number of events in the given interval (λ). The resulting
distribution looks similar to the binomial, with the skewness
being positive but decreasing with λ. Figure 6A.2 presents three
Poisson distributions, with λ ranging from 1 to 10.
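A minimal sketch of the Poisson probabilities and their skewness follows. It uses the standard formulas (the skewness of a Poisson distribution is one over the square root of its mean); the parameter is called `lam` because `lambda` is a reserved word in Python.

```python
from math import exp, factorial, sqrt

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average number is lam."""
    return exp(-lam) * lam**k / factorial(k)

def poisson_skewness(lam):
    """Skewness of the Poisson distribution: 1 / sqrt(lam)."""
    return 1 / sqrt(lam)

for lam in (1, 4, 10):
    probs = [round(poisson_pmf(k, lam), 3) for k in range(6)]
    print(lam, round(poisson_skewness(lam), 3), probs)
```

The skewness stays positive but shrinks as the average number of events grows, which is why the distribution looks more symmetric for larger values of the parameter.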
c. Negative Binomial distribution: Returning
again to the coin toss example, assume that you hold the number of successes
fixed at a given number and estimate the number of tries you will have
before you reach the specified number of successes. The resulting
distribution is called the negative binomial and it very closely resembles the
Poisson. In fact, the negative binomial distribution converges on the Poisson
distribution, but will be more skewed to the right (positive values) than the
Poisson distribution with similar parameters.
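As a sketch, the negative binomial probabilities for the coin toss can be computed directly. We assume here the convention that the count of tries includes the try on which the final success occurs; other texts count only the failures, so the function below is one possible parametrisation rather than the only one.

```python
from math import comb

def negative_binomial_pmf(n_tries, r, p):
    """P(the r-th success occurs exactly on try number n_tries)."""
    if n_tries < r:
        return 0.0
    # The last try is a success; the other r - 1 successes fall anywhere before it.
    return comb(n_tries - 1, r - 1) * p**r * (1 - p)**(n_tries - r)

# Fair coin, waiting for the second head.
for n_tries in range(2, 8):
    print(n_tries, negative_binomial_pmf(n_tries, r=2, p=0.5))
```

The probabilities tail off slowly to the right, which is the positive skew described above.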
d. Geometric distribution: Consider
again the coin toss example used to illustrate the binomial. Rather than focus
on the number of successes in n trials, assume that you were measuring
the likelihood of when the first success will occur. For instance,
with a fair coin toss, there is a 50% chance that the first success will occur
at the first try, a 25% chance that it will occur on the second try and a 12.5%
chance that it will occur on the third try. The resulting distribution is
positively skewed and looks as follows for three different probability
scenarios (in figure 6A.3):
Note that the distribution is steepest
with high probabilities of success and flattens out as the probability
decreases. However, the distribution is always positively skewed.
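The percentages quoted for the fair coin follow from one line of arithmetic, sketched here in Python with a function name of our own choosing.

```python
def geometric_pmf(k, p):
    """Probability that the first success occurs on try number k (k = 1, 2, ...)."""
    return (1 - p)**(k - 1) * p

# Fair coin: chance that the first head arrives on try 1, 2, 3.
print([geometric_pmf(k, 0.5) for k in (1, 2, 3)])

# A higher probability of success gives a steeper start and a faster decline.
for p in (0.2, 0.5, 0.8):
    print(p, [round(geometric_pmf(k, p), 3) for k in range(1, 6)])
```

Each probability is the previous one multiplied by the chance of failure, so the largest value always sits at the first try and the distribution is positively skewed for any probability of success.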
e. Hypergeometric distribution: The hypergeometric distribution measures the probability of a
specified number of successes in n trials, without replacement,
from a finite population. Since the sampling is without replacement, the
probabilities can change as a function of previous draws. Consider, for
instance, the possibility of getting four face cards in a hand of ten, over
repeated draws from a pack. Since there are 16 face cards (which holds if aces are counted along with jacks, queens and kings) and the total pack
contains 52 cards, the probability of getting four face cards in a hand of ten
can be estimated. Figure 6A.4 provides a graph of the hypergeometric distribution.
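The card example can be worked through with the standard hypergeometric formula. The sketch takes the count of 16 face cards from the text as given; the function and argument names are our own.

```python
from math import comb

def hypergeometric_pmf(k, population, successes, draws):
    """P(exactly k successes in `draws` draws without replacement)."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))

# Four face cards in a hand of ten, with 16 face cards in a pack of 52.
p_four = hypergeometric_pmf(4, population=52, successes=16, draws=10)
print(round(p_four, 3))

# The whole distribution over 0..10 face cards in the hand.
distribution = [hypergeometric_pmf(k, 52, 16, 10) for k in range(11)]
```

With these inputs the probability of exactly four face cards is a little over one in five, and the eleven probabilities in `distribution` sum to one.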
f. Discrete uniform distribution: This is the simplest of discrete distributions and
applies when all of the outcomes have an equal probability of
occurring. Figure 6A.5 presents a uniform discrete distribution with
five possible outcomes, each occurring 20% of the time:
The discrete uniform distribution is best reserved for
circumstances where there are multiple possible outcomes, but no information that
would allow us to expect that one outcome is more likely than the others.
With continuous data, we cannot specify all
possible outcomes, since they are too numerous to list, but we have two
choices. The first is to convert the continuous data into a discrete form and
then go through the same process that we went through for discrete
distributions of estimating probabilities. For instance, we could take a
variable such as market share and break it down into discrete blocks – market
share between 3% and 3.5%, between 3.5% and 4% and so on – and consider the
likelihood that we will fall into each block. The second is to find a
continuous distribution that best fits the data and to specify the parameters
of the distribution. The rest of the appendix will focus on how to make these choices.
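The first choice, converting continuous data into blocks, can be sketched as follows. The block width of half a percentage point follows the market share example above; the observations and the helper name are invented for illustration.

```python
from collections import Counter
from math import floor

def block_probabilities(values, start, width):
    """Estimate the probability of each block as its share of the observations."""
    counts = Counter(floor((v - start) / width) for v in values)
    n = len(values)
    return {(start + k * width, start + (k + 1) * width): counts[k] / n
            for k in sorted(counts)}

# Invented market-share observations (in %), for illustration only.
shares = [3.1, 3.2, 3.4, 3.6, 3.7, 3.8, 3.9, 4.2, 4.4, 4.6]
probs = block_probabilities(shares, start=3.0, width=0.5)

for (low, high), p in probs.items():
    print(f"{low:.1f}% to {high:.1f}%: {p:.1f}")
```

Once the blocks are defined, the variable can be treated like any other discrete variable, at the cost of ignoring where within a block a value falls.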
How symmetric is the data?
There are some datasets that exhibit symmetry, i.e.,
the upside is mirrored by the downside. The symmetric distribution that most
practitioners have familiarity with is the normal distribution, shown in Figure
6A.6, for a range of parameters:
The normal distribution has several features that make it popular. First, it can be
fully characterized by just two parameters – the mean and the standard
deviation – and thus reduces estimation pain. Second, the probability of any
value occurring can be obtained simply by knowing how many standard deviations
separate the value from the mean; the probability that a value will fall within 2
standard deviations of the mean is roughly 95%. The normal
distribution is best suited for data that, at the minimum, meets the following conditions:
There is a strong tendency for the data to take
on a central value.
Positive and negative deviations from this
central value are equally likely.
The frequency of the deviations falls off rapidly
as we move further away from the central value.
The last two conditions show up when we compute the
parameters of the normal distribution: the symmetry of deviations leads to zero
skewness and the low probabilities of large
deviations from the central value reveal themselves in no excess kurtosis.
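The link between standard deviations and probabilities can be verified with Python's standard library. The function name and the example mean of 100 and standard deviation of 15 are our own choices.

```python
from statistics import NormalDist

def prob_within(k, mean=0.0, sd=1.0):
    """Probability that a normal value lies within k standard deviations of the mean."""
    d = NormalDist(mean, sd)
    return d.cdf(mean + k * sd) - d.cdf(mean - k * sd)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))

# The answer depends only on k, not on the particular mean or standard deviation.
print(round(prob_within(2, mean=100, sd=15), 4))
```

About 95% of values fall within two standard deviations whatever the two parameters are, which is why the mean and standard deviation alone are enough to describe the distribution.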
There is a cost we pay, though, when we use a normal
distribution to characterize data that is non-normal since the probability
estimates that we obtain will be misleading and can do more harm than good. One
obvious problem is when the data is asymmetric but another potential problem is
when the probabilities of large deviations from the central value do not drop
off as precipitously as required by the normal distribution. In statistical
language, the actual distribution of the data has fatter tails than the normal.
While all of the symmetric distributions in the family are like the normal in terms
of the upside mirroring the downside, they vary in terms of shape, with some
distributions having fatter tails than the normal and the others more
accentuated peaks. These distributions are characterized as
leptokurtic and you can consider two examples. One is the logistic
distribution, which has longer tails and a higher kurtosis (1.2, as compared to
0 for the normal distribution) and the other is the family of Cauchy distributions, which
also exhibit symmetry and heavier tails and are characterized by a scale
variable that determines how fat the tails are. Figure 6A.7 presents a series of
Cauchy distributions that exhibit the bias towards fatter tails or more
outliers than the normal distribution.
Either the logistic or the Cauchy distributions can be used if
the data is symmetric but with extreme values that occur more frequently than
you would expect with a normal distribution.
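One way to compare the tails is to ask how often each distribution produces a value more than a given distance from its centre. The sketch below uses the standard closed-form tail probabilities. As assumptions of our own, the logistic is scaled to unit variance so that it is comparable with the standard normal, and the Cauchy uses a scale of 1, an arbitrary choice since it has no finite variance.

```python
from math import atan, exp, pi, sqrt
from statistics import NormalDist

def normal_tail(x):
    """P(|Z| > x) for a standard normal."""
    return 2 * (1 - NormalDist().cdf(x))

def logistic_tail(x):
    """P(|X| > x) for a logistic distribution scaled to unit variance."""
    s = sqrt(3) / pi  # scale that gives variance 1
    return 2 / (1 + exp(x / s))

def cauchy_tail(x, scale=1.0):
    """P(|X| > x) for a Cauchy distribution centred at zero."""
    return 1 - 2 * atan(x / scale) / pi

for x in (2, 3, 4):
    print(x, normal_tail(x), logistic_tail(x), cauchy_tail(x))
```

At a distance of 3 the logistic tail is several times heavier than the normal tail, and the Cauchy tail is far heavier still, so extreme values that would be surprising under a normal assumption are routine under a Cauchy.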
As the probabilities of extreme values increase relative to
the central value, the distribution will flatten out. At its limit, assuming
that the data stays symmetric and we put limits on the extreme values on both
sides, we end up with the uniform distribution, shown in figure 6A.8:
When is it appropriate to assume a uniform distribution for a
variable? One possible scenario is when you have a measure of the highest and
lowest values that a data item can take but no real information about where
within this range the value may fall. In other words, any value within that
range is just as likely as any other value.