**RELATIVE VALUES AND THEIR TYPES**

For large datasets
like the WMHO survey, it is hard to detect patterns among the thousands of
cases just by looking at a list of values. By thinking instead of the
distribution as a whole, we are led to various ways to describe, summarize and
compare distributions, much as a naturalist would describe and compare
different plants or animals.

**Summaries for distributions**

The most common summaries for distributions are either **numerical** or **graphical**.

Numerical summaries, categorical variables:

The proportion of females
in the survey is 0.553.

The proportion of
Hispanics in the survey is 0.097.

Graphical summaries, categorical variables:

Numerical summaries, quantitative variables:

Average age for
married individuals is 52.

Average age for
those who have never married is 42.
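Summaries like these can be computed directly from raw records. Here is a minimal sketch using a made-up handful of records; the field names and values below are illustrative, not taken from the survey:

```python
# Hypothetical mini-dataset; fields and values are illustrative only.
people = [
    {"sex": "F", "marital": "married", "age": 54},
    {"sex": "M", "marital": "married", "age": 50},
    {"sex": "F", "marital": "never married", "age": 40},
    {"sex": "M", "marital": "never married", "age": 44},
    {"sex": "F", "marital": "married", "age": 52},
]

# Numerical summary of a categorical variable: a proportion.
prop_female = sum(p["sex"] == "F" for p in people) / len(people)

# Numerical summary of a quantitative variable within groups: a mean.
married_ages = [p["age"] for p in people if p["marital"] == "married"]
avg_married = sum(married_ages) / len(married_ages)

print(prop_female)   # 0.6
print(avg_married)   # 52.0
```

The same two operations (counting category membership, averaging within groups) underlie the survey proportions and group averages quoted above.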

Graphical summaries, quantitative variables:

**Types of data**

Just as a farmer gathers
and processes a crop, a statistician gathers and processes data. For this
reason the logo for the UK Royal Statistical Society is a sheaf of wheat. Like
any farmer who knows instinctively the difference between oats, barley and
wheat, a statistician becomes an expert at discerning different types of data.
Some sections of this book refer to different data types and so we start by
considering these distinctions. Figure 1.2 shows a basic summary of data types,
although some data do not fit neatly into these categories.

Every statistics book provides a listing of statistical
distributions, with their properties, but browsing through these choices can be
frustrating to anyone without a statistical background, for two reasons. First,
the choices seem endless, with dozens of distributions competing for your
attention, with little or no intuitive basis for differentiating between them.
Second, the descriptions tend to be abstract and emphasize statistical
properties such as the moments, characteristic functions and cumulative
distributions. In this appendix, we will focus on the aspects of distributions
that are most useful when analyzing raw data and trying to fit the right
distribution to that data.

When
confronted with data that needs to be characterized by a distribution, it is
best to start with the raw data and answer four basic questions about the data
that can help in the characterization. The first relates to whether the data
can take on only __discrete values or
whether the data is continuous__; whether a new pharmaceutical drug gets FDA
approval or not is a discrete value but the revenues from the drug represent a
continuous variable. The second looks at the __symmetry of the data__ and, if there is asymmetry, which direction it lies in; in other words, are positive and negative outliers equally likely, or is one more likely than the other? The
third question is whether there are __upper
or lower limits on the data__; there are some data items like revenues that
cannot be lower than zero whereas there are others like operating margins that
cannot exceed a value (100%). The final, related question concerns the __likelihood of observing extreme values__ in the distribution; in some data, the extreme
values occur very infrequently whereas in others, they occur more often.

a. __Binomial distribution__: The binomial distribution measures the probabilities of __the number of successes over a given number of trials__ with a
specified probability of success in each try. In the simplest scenario of a
coin toss (with a fair coin), where the probability of getting a head with each
toss is 0.50 and there are a hundred trials, the binomial distribution will
measure the likelihood of getting anywhere from no heads in a hundred tosses
(very unlikely) to 50 heads (the most likely) to 100 heads (also very
unlikely). The binomial distribution in this case will be symmetric, reflecting
the even odds; as the probabilities shift from even odds, the distribution will
get more skewed. Figure 6A.1 presents binomial distributions for three
scenarios – two with 50% probability of success and one with a 70% probability
of success and different trial sizes.

As the probability of success is varied
(from 50%) the distribution will also shift its shape, becoming positively
skewed for probabilities less than 50% and negatively skewed for probabilities
greater than 50%.
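As a sketch, the coin-toss probabilities above can be computed from the binomial formula P(k) = C(n, k) p^k (1 - p)^(n - k), using only the standard library:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Fair coin, 100 tosses: 50 heads is the most likely single outcome,
# but still occurs only about 8% of the time.
print(binom_pmf(50, 100, 0.5))   # ~0.0796
print(binom_pmf(0, 100, 0.5))    # ~7.9e-31 (no heads: very unlikely)
```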

b. __Poisson distribution__: The Poisson distribution measures __the likelihood of a number of events occurring within a given time interval__, where the key parameter that is required is the average number of events in the given interval (λ). The resulting distribution looks similar to the binomial, with the skewness being positive but decreasing with λ. Figure 6A.2 presents three Poisson distributions, with λ ranging from 1 to 10.
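A minimal sketch of the Poisson pmf, P(k) = e^(-λ) λ^k / k!; the rate λ = 4 below is illustrative, not a value taken from the figures:

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events in an interval with average rate lam (λ)."""
    return exp(-lam) * lam**k / factorial(k)

# With λ = 4 events per interval on average:
print(poisson_pmf(4, 4))   # ~0.195 (near the mode)
print(poisson_pmf(0, 4))   # ~0.018
```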

c. __Negative Binomial distribution__: Returning
again to the coin toss example, assume that you hold the number of successes
fixed at a given number and __estimate the number of tries you will have
before you reach the specified number of successes__. The resulting
distribution is called the negative binomial and it very closely resembles the
Poisson. In fact, the negative binomial distribution converges on the Poisson
distribution, but will be more skewed to the right (positive values) than the
Poisson distribution with similar parameters.
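Under the convention used here (the trial on which a fixed number of successes is reached), the probability that the r-th success arrives exactly on trial x is C(x - 1, r - 1) p^r (1 - p)^(x - r); a sketch:

```python
from math import comb

def neg_binom_pmf(x: int, r: int, p: float) -> float:
    """Probability that the r-th success occurs exactly on the x-th trial."""
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)

# Fair coin: probability the 3rd head arrives exactly on the 5th toss.
print(neg_binom_pmf(5, 3, 0.5))   # C(4, 2) * 0.5**5 = 0.1875
```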

d. __Geometric distribution__: Consider
again the coin toss example used to illustrate the binomial. Rather than focus
on the number of successes in n trials, assume that you are measuring
the __likelihood of when the first success will occur__. For instance,
with a fair coin toss, there is a 50% chance that the first success will occur
at the first try, a 25% chance that it will occur on the second try and a 12.5%
chance that it will occur on the third try. The resulting distribution is
positively skewed and looks as follows for three different probability
scenarios (in figure 6A.3):

Note that the distribution is steepest
with high probabilities of success and flattens out as the probability
decreases. However, the distribution is always positively skewed.
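The probabilities quoted above follow from the geometric pmf P(k) = (1 - p)^(k - 1) p; a minimal sketch:

```python
def geom_pmf(k: int, p: float) -> float:
    """Probability that the first success occurs exactly on the k-th trial."""
    return (1 - p) ** (k - 1) * p

# Fair coin: matches the probabilities in the text.
print(geom_pmf(1, 0.5))   # 0.5
print(geom_pmf(2, 0.5))   # 0.25
print(geom_pmf(3, 0.5))   # 0.125
```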

e. __Hypergeometric distribution__: The hypergeometric distribution measures the probability of a
specified number of successes in n trials, __without replacement__,
from a finite population. Since the sampling is without replacement, the
probabilities can change as a function of previous draws. Consider, for
instance, the possibility of getting four face cards in a hand of ten, over
repeated draws from a pack. Since there are 16 face cards and the total pack
contains 52 cards, the probability of getting four face cards in a hand of ten
can be estimated. Figure 6A.4 provides a graph of the hypergeometric
distribution:
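The face-card example can be computed directly from the hypergeometric formula P(k) = C(K, k) C(M - K, N - k) / C(M, N); a sketch:

```python
from math import comb

def hypergeom_pmf(k: int, M: int, K: int, N: int) -> float:
    """Probability of exactly k successes in N draws without replacement
    from a population of M items containing K successes."""
    return comb(K, k) * comb(M - K, N - k) / comb(M, N)

# Exactly 4 face cards in a 10-card hand from a 52-card deck with 16 face cards.
print(hypergeom_pmf(4, 52, 16, 10))   # ~0.224
```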

f. __Discrete
uniform distribution__: This is the simplest of discrete distributions and
applies when all of the outcomes have an equal probability of
occurring. Figure 6A.5 presents a uniform discrete distribution with
five possible outcomes, each occurring 20% of the time:

The discrete uniform distribution is best reserved for
circumstances where there are multiple possible outcomes, but no information that
would allow us to expect that one outcome is more likely than the others.
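A sketch of the five-outcome case (the outcomes 1 to 5 are illustrative): each outcome has probability 1/n, so the pmf is flat:

```python
# Five equally likely outcomes; the labels are illustrative.
outcomes = [1, 2, 3, 4, 5]
pmf = {x: 1 / len(outcomes) for x in outcomes}

print(pmf[3])                              # 0.2 for every outcome
print(sum(pmf.values()))                   # ≈ 1.0
mean = sum(x * p for x, p in pmf.items())
print(mean)                                # ≈ 3.0 (midpoint of the outcomes)
```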

The
normal distribution has several features that make it popular. First, it can be
fully characterized by just two parameters – the mean and the standard
deviation – and thus reduces estimation pain. Second, the probability of any
value occurring can be obtained simply by knowing how many standard deviations
separate the value from the mean; the probability that a value will fall within 2 standard deviations of the mean is roughly 95%. The normal
distribution is best suited for data that, at the minimum, meets the following
conditions:

a.
There is a strong tendency for the data to take
on a central value.

b.
Positive and negative deviations from this
central value are equally likely.

c.
The frequency of the deviations falls off rapidly
as we move further away from the central value.

The last two conditions show up when we compute the parameters of the normal distribution: the symmetry of deviations leads to zero skewness, and the low probabilities of large deviations from the central value reveal themselves in zero excess kurtosis.
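The "roughly 95%" rule can be checked with the standard normal CDF, which the standard library expresses through the error function: P(|Z| ≤ z) = erf(z/√2). A sketch:

```python
from math import erf, sqrt

def prob_within(z: float) -> float:
    """Probability a normal value falls within z standard deviations of its mean."""
    return erf(z / sqrt(2))

print(prob_within(1))   # ~0.683
print(prob_within(2))   # ~0.954 (the "roughly 95%" rule)
print(prob_within(3))   # ~0.997
```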

There is a cost we pay, though, when we use a normal
distribution to characterize data that is non-normal since the probability
estimates that we obtain will be misleading and can do more harm than good. One
obvious problem is when the data is asymmetric but another potential problem is
when the probabilities of large deviations from the central value do not drop
off as precipitously as required by the normal distribution. In statistical
language, the actual distribution of the data has fatter tails than the normal.
While all symmetric distributions resemble the normal in having the upside mirror the downside, they vary in shape, with some distributions having fatter tails than the normal and others more accentuated peaks. Fat-tailed distributions are characterized as leptokurtic, and you can consider two examples. One is the logistic distribution, which has longer tails and a higher kurtosis (an excess kurtosis of 1.2, compared to 0 for the normal distribution); the other is the Cauchy distribution, which is also symmetric but has such heavy tails that its kurtosis (and even its variance) is undefined, and which is characterized by a scale parameter that determines how fat the tails are. Figure 6A.7 presents a series of Cauchy distributions that exhibit this bias towards fatter tails, or more outliers, than the normal distribution.

Either the logistic or the Cauchy distribution can be used if the data is symmetric but extreme values occur more frequently than you would expect with a normal distribution.
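The difference in tail weight can be illustrated by comparing two-sided tail probabilities of the three standard distributions (note that their standard scales differ, so this is an illustration rather than a like-for-like comparison):

```python
from math import erf, exp, atan, pi, sqrt

x = 3.0  # how often do values more than 3 units from the center occur?

normal_tail   = 1 - erf(x / sqrt(2))      # two-sided standard normal tail
logistic_tail = 2 / (1 + exp(x))          # two-sided standard logistic tail
cauchy_tail   = 2 * (0.5 - atan(x) / pi)  # two-sided standard Cauchy tail

print(round(normal_tail, 4))    # ~0.0027
print(round(logistic_tail, 4))  # ~0.0949
print(round(cauchy_tail, 4))    # ~0.2048
```

The ordering (normal thinnest, Cauchy fattest) is exactly the "more outliers than the normal" behavior described above.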

As the probabilities of extreme values increase relative to
the central value, the distribution will flatten out. At its limit, assuming
that the data stays symmetric and we put limits on the extreme values on both
sides, we end up with the uniform distribution, shown in figure 6A.8:

When is it appropriate to assume a uniform distribution for a
variable? One possible scenario is when you have a measure of the highest and
lowest values that a data item can take but no real information about where
within this range the value may fall. In other words, any value within that
range is just as likely as any other value.
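For a uniform distribution on [lo, hi], the probability of any subinterval is simply its length divided by the range; a sketch with illustrative bounds:

```python
def uniform_prob(a: float, b: float, lo: float, hi: float) -> float:
    """P(a <= X <= b) for X uniform on [lo, hi]."""
    a, b = max(a, lo), min(b, hi)
    return max(0.0, (b - a) / (hi - lo))

# Illustrative: a quantity known only to lie between 0 and 100.
print(uniform_prob(0, 25, 0, 100))    # 0.25 (each quarter equally likely)
print(uniform_prob(40, 60, 0, 100))   # 0.2
```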