TYPES OF CORRELATION. MEASURING CORRELATION.
Correlation is a measure
of mutual correspondence between two variables and is denoted by the
coefficient of correlation.
Applications and
characteristics
a) The simple correlation coefficient, also called the Pearson product-moment correlation coefficient, is used to indicate the extent to which two variables change with one another in a linear fashion.
b) The correlation coefficient can range from −1 to +1 and is unitless (Fig. A, B, C).
c) When the correlation coefficient approaches −1, a change in one variable is more highly, or strongly, associated with an inverse linear change (i.e., a change in the opposite direction) in the other variable (Fig. A).
d) When the correlation coefficient equals zero, there is no association between the changes of the two variables (Fig. B).
e) When the correlation coefficient approaches +1, a change in one variable is more highly, or strongly, associated with a direct linear change in the other variable (Fig. C).
A correlation
coefficient can be calculated validly only when both variables are subject to
random sampling and each is chosen independently.
Although useful as one of the determinants
of scientific causality, correlation by itself is not equivalent to causation.
For example, two correlated variables
may be associated with another factor that causes them to appear correlated
with each other.
A correlation
may appear strong but be insignificant because of a small sample size.
Table 2.12
Correlation connection
There are the following types of connection (relation) between phenomena and signs in nature:
a) the cause-and-effect connection: the connection between factors and phenomena, between factor and resultant signs;
b) the dependence of parallel changes of several signs on some third quantity.
The quantitative types of connection:
a functional connection is one in which a strictly defined value of one sign corresponds to any value of the other (for example, a definite area of a circle corresponds to a given radius of the circle);
a correlation connection is one in which several values of one sign correspond to each value of the other, grouped around its average (for example, it is known that the height and body mass of a person are linked with each other; in a group of persons of identical height there are different values of body mass; however, these values of body mass vary within certain limits around their average).
Correlation is a concept which means an interconnection between signs.
A correlative connection presupposes a dependence between phenomena which does not have a clearly functional character.
A correlative connection shows up only in a mass of observations, that is, in an aggregate. The establishment of a correlative connection presupposes the exposure of a causal connection, which confirms the dependence of one phenomenon on the other.
By direction (character), a correlative connection can be direct or reverse. The coefficient of correlation that characterizes a direct connection is marked by the plus sign (+), and the coefficient of correlation that characterizes a reverse one is marked by the minus sign (−).
By force, a correlative connection can be strong, middle, or weak; it can be complete, and it can be absent.
THE SCHEME OF THE ESTIMATION OF CORRELATIVE CONNECTION BY THE COEFFICIENT OF CORRELATION

The force of connection   | Direct (+)        | Reverse (−)
Complete                  | +1                | −1
Strong                    | from +1 to +0.7   | from −1 to −0.7
Middle                    | from +0.7 to +0.3 | from −0.7 to −0.3
Weak                      | from +0.3 to 0    | from −0.3 to 0
The connection is absent  | 0                 | 0
The correlative connection can be:
1. By direction:
– direct (+): as one sign increases, the average value of the other increases;
– reverse (−): as one sign increases, the average value of the other decreases.
2. By character:
– rectilinear: relatively even changes in the average values of one sign are accompanied by equal changes in the other (e.g., minimal and maximal arterial pressure);
– curvilinear: with an even change of one sign, the average values of the other sign may increase or decrease.
Methods of determination of the coefficient of correlation:
The coefficient of correlation (r_xy) gives, in a single number, a picture of the direction and force of the connection between the explored phenomena.
The method of squares (Pearson's method) is frequently used for the determination of the coefficient of correlation:

r_xy = Σ(d_x × d_y) / √(Σd_x² × Σd_y²), where:

x and y are the signs between which the connection is determined;
d_x and d_y are the deviations of each variant from the arithmetic means (M_x and M_y) calculated in the series of sign x and in the series of sign y;
Σ is the sum.
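To make the arithmetic concrete, here is a minimal Python sketch of the method of squares; the paired values are hypothetical, not data from the text:

# Hypothetical paired values (e.g., height in cm and body mass in kg).
x = [167, 170, 172, 175, 178, 180]
y = [60, 64, 63, 70, 72, 75]

M_x = sum(x) / len(x)          # arithmetic mean of sign x
M_y = sum(y) / len(y)          # arithmetic mean of sign y
d_x = [xi - M_x for xi in x]   # deviations of each variant from M_x
d_y = [yi - M_y for yi in y]   # deviations of each variant from M_y

numerator = sum(dx * dy for dx, dy in zip(d_x, d_y))   # sum of d_x * d_y
denominator = (sum(dx ** 2 for dx in d_x) * sum(dy ** 2 for dy in d_y)) ** 0.5
r_xy = numerator / denominator
print(round(r_xy, 3))          # close to +1: a strong direct connection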
The second method of determination of the coefficient of correlation is the method of ranks (grades), or Spearman's method. It is used when n < 30 and when approximate information is sufficient for estimating the character (direction) and force of the connection:

r_xy = 1 − 6Σd² / (n(n² − 1)), where:

x and y are the signs between which the connection is determined;
6 is a constant coefficient;
d is the difference of ranks;
n is the number of observations.
The determination of the error of the coefficient of rank correlation (determined by Spearman's method) and of the criterion t:

m_r = √((1 − r²) / (n − 2)) and t = r / m_r

The criterion t must be 2 or more, so that P = 95.5% or more.
Confidence of the correlation coefficient: the criterion t should be 3 or more, which corresponds to a probability of faultless prognosis (p) ≥ 99.7%.
Student's tests (t), based on the t distribution, which reflects greater variation due to chance than the normal distribution, are used to analyze small samples.
The t distribution is a continuous, symmetrical, unimodal distribution of infinite range, which is bell-shaped, similar in shape to the normal distribution but more spread out.
As the sample size increases, the t distribution more closely resembles the normal distribution. At infinite degrees of freedom, the t and normal distributions are identical, and the t values equal the critical ratio values.
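As a minimal sketch (assuming scipy is available; not part of the original text), this convergence can be checked numerically:

from scipy import stats

# Two-sided 95% critical values of the t distribution approach the normal
# critical ratio (1.96) as the degrees of freedom increase.
for df in (1, 2, 10, 30, 100, 1000):
    print(df, round(stats.t.ppf(0.975, df), 3))
print("normal:", round(stats.norm.ppf(0.975), 3))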
Table 2.13 Table of Critical Ratios (abbreviated)

                 Probability that value lies
Critical ratio | between the mean and the critical ratio | within ± the critical ratio | outside ± the critical ratio
1.0            | .341                                    | .683                        | .317
1.645          | .450                                    | .900                        | .100
1.96           | .475                                    | .950                        | .050
2.0            | .477                                    | .954                        | .046
2.567          | .495                                    | .990                        | .010
3.0            | .499                                    | .997                        | .003
Fig. 2. The standardized normal distribution shown
with the percentage of values included between critical ratios from the mean.
A. Student's test for a
single small sample
Student's t test for a single small sample compares a single sample with
a population.
Student's t tests are used to evaluate the null hypothesis for continuous
variables for sample sizes less than 30.
The t table.
Probability values are derived from the t value and the number of degrees of freedom by using the t table. For each degree of freedom, a row of increasing t values corresponds to a row of decreasing probabilities of accepting the null hypothesis.
Confidence intervals.
In small samples, especially sample sizes less than 30, the t distribution is used to calculate
confidence intervals around the sample mean.
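As a minimal sketch of such a t-based confidence interval (hypothetical observations, not data from the text; assumes scipy):

import math
from scipy import stats

sample = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9]   # hypothetical observations, n < 30
n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))   # sample standard deviation
sem = sd / math.sqrt(n)             # standard error of the mean
t_crit = stats.t.ppf(0.975, n - 1)  # 95% two-sided t value for df = n - 1
print(mean - t_crit * sem, mean + t_crit * sem)   # 95% confidence limits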
The t table (abbreviated)

Degrees of freedom (df) | Probability
                        | .10  | .05   | .01
1                       | 6.31 | 12.71 | 63.66
2                       | 2.92 | 4.30  | 9.93
8                       | 1.86 | 2.31  | 3.36
9                       | 1.83 | 2.26  | 3.25
10                      | 1.81 | 2.23  | 3.17
∞                       | 1.64 | 1.96  | 2.58
The method of rank (grade) correlation is used when: there is a small number of observations; exact calculations are not needed; the variation series are open-ended; the signs have verbal (descriptive) expression (for example, the diagnosis of a disease).
The order of determination of the rank correlation coefficient (a worked sketch in code follows this list):
1) make variation series from the paired signs;
2) replace every value of a variant by a rank (index) number;
3) define the difference of ranks: d = x − y;
4) square the difference of ranks: d²;
5) obtain the sum of the squared differences of ranks: Σd²;
6) define r_xy by the formula;
7) define the direction and force of the connection;
8) define the error m_rxy and the criterion t, and estimate the authenticity of faultless prognosis, p.
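A minimal Python sketch of these steps, with hypothetical paired values and no handling of tied ranks:

import math

# Hypothetical paired values.
x = [12, 15, 11, 18, 20, 14, 16, 19]
y = [36, 40, 33, 42, 50, 38, 39, 46]

def ranks(values):
    # Step 2: replace each value by its rank number (1 = smallest).
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

rx, ry = ranks(x), ranks(y)
d = [a - b for a, b in zip(rx, ry)]          # step 3: difference of ranks
sum_d2 = sum(di ** 2 for di in d)            # steps 4-5: sum of squared differences
n = len(x)
r_xy = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))   # step 6: rank correlation formula

m = math.sqrt((1 - r_xy ** 2) / (n - 2))     # step 8: error of the coefficient
t = r_xy / m                                 # criterion t
print(round(r_xy, 3), round(m, 3), round(t, 1))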
1. The standard error of a measure is based on a sample of a population and is an estimate of the standard deviation of the measure for the population.
2. The standard error of a mean, one of the most
commonly used types of standard error, is a measure of the accuracy of the
sample mean as an estimate of the population mean. In comparison, the standard
deviation is a measure of the variability of the observations.
Applications
a) The standard error of the mean is used to construct confidence limits
around a sample mean.
b) Standard errors are used in Student's t test.
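A minimal sketch contrasting the two measures on hypothetical observations:

import math

sample = [120, 126, 131, 118, 124, 129, 122, 127]   # hypothetical observations
n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))   # variability of the observations
sem = sd / math.sqrt(n)   # accuracy of the sample mean as an estimate of the population mean
print(round(sd, 2), round(sem, 2))   # sem is smaller than sd by a factor of sqrt(n)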
Confidence limits of a
mean
The upper and lower confidence limits define the range
of probability, that is, the confidence
interval, for a measure of the population based on a measure of a sample
and the measure's standard error.
1. Confidence intervals are expressed in terms of
probability based on the error.
2. The confidence limits of a mean define the
confidence interval for the population mean based on a sample mean.
For large samples, confidence limits are based
on the critical ratio for the associated probability.
For a 95% confidence interval, the estimated sampling error is multiplied by 1.96; the chances are 95% (19 out of 20) that the interval includes the average result of all possible samples of the same size.
For small samples (less
than 30), confidence
limits are based on the t value for the number of degrees of freedom and the
associated probability.
(a)
Confidence limits of a mean are used to estimate a population
mean based on a sample from the population. The confidence interval is the
margin of error of the point estimate.
(b)
A repeated random sample from the population will yield
another point estimate similar to, but not necessarily the same as, the first
sample. The 95% confidence interval probably will cluster in the same area.
(c) The most commonly used confidence limits are 95% confidence limits, which indicate that there is a 95% probability that the population mean lies within the upper and lower confidence limits and a 5% probability that it lies outside these limits (p = 0.05).
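A minimal simulation sketch (assuming numpy; the population parameters are hypothetical) of this repeated-sampling interpretation:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 100.0, 15.0, 50   # hypothetical population mean, SD, and sample size
covered = 0
trials = 10000
for _ in range(trials):
    sample = rng.normal(mu, sigma, n)
    sem = sample.std(ddof=1) / np.sqrt(n)   # estimated sampling error
    lower = sample.mean() - 1.96 * sem      # large-sample 95% limits
    upper = sample.mean() + 1.96 * sem
    covered += (lower <= mu <= upper)
print(covered / trials)   # close to 0.95: about 19 intervals in 20 include mu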
Screening is the initial examination of an individual to
detect disease not yet under medical care. Screening may be concerned with a
single disease or with many diseases (called multiphase screening).
1. Purpose. Screening separates apparently healthy individuals into groups with either a high or a low probability of developing the disease for which the screening test is used.
2. Types of diseases. Screening may be concerned
with many different types of diseases, including:
a) Acute
communicable diseases (e.g., rubella)
b) Chronic
communicable diseases (e.g., tuberculosis)
c) Acute noncommunicable diseases (e.g., lead toxicity)
d) Chronic noncommunicable diseases (e.g., glaucoma)
THE COEFFICIENT OF DETERMINATION
In earlier chapters we have been
concerned with the statistical analysis of observations on a single variable. In
some problems data were divided into two groups, and the dichotomy could,
admittedly, have been regarded as defining a second variable. These two-sample
problems are, however, rather artificial examples of the relationship between
two variables.
In this chapter we examine more
generally the association between two quantitative variables. We shall
concentrate on situations in which the general trend is linear; that is, as one
variable changes the other variable follows on
the average a trend which can be represented
approximately by a straight line.
The basic graphical technique for the two-variable situation is the scatter diagram,
and it is good practice to plot the data in this form before attempting any
numerical analysis. An example is shown in Fig. 7.1. In general the data refer
to a number of individuals,
each of which provides observations on two variables. In the scatter diagram
each variable is allotted one of the two coordinate axes, and each individual
thus defines a point, of which the coordinates are the observed values of the
two variables. In Fig. 7.1 the individuals are towns and the two variables are
the infant mortality rate and a certain index of overcrowding.
The
scatter diagram gives a compact illustration of the distribution of each variable
and of the relationship between the two variables. Further statistical analysis
serves a number of purposes. It provides, first, numerical measures of some of
the basic features of the relationship, rather as the mean and standard
deviation provide concise measures of the most important features of the distribution
of a single variable. Secondly, the investigator may wish to make a prediction
of the value of one variable when the value of the other variable is known. It
will normally be impossible to predict with complete certainty, but we may hope
to say something about the mean value and the variability of the predicted
variable. From Fig. 7.1, for instance, it appears roughly that a town with 0.6 persons per room was in 1961 likely to have an infant mortality rate of about 20 per 1000 live births on average, with a likely range of about 14–26. A
proper analysis might be expected to give more reliable figures than these
rough guesses.
Thirdly,
the investigator may wish to assess the significance of the direction of an
apparent trend. From the data of Fig. 7.1, for instance, could it safely be
asserted that infant mortality increases on the average as the overcrowding
index increases, or could the apparent trend in this direction have arisen
easily by chance?
Yet
another aim may be to correct the measurements of one variable for the effect
of another variable. In a study of the forced expiratory volume (FEV) of
workers in the cadmium industry who had been exposed for more than a certain
number of years to cadmium fumes, a comparison was made with the FEV of other
workers who had not been exposed. The mean FEV of the first group was lower
than that of the second. However, the men in the first group tended to be older
than those in the second, and FEV tends to decrease with age. The question
therefore arises whether the difference in mean FEV could be explained purely
by the age difference. To answer this question the relationship between FEV and
age must be studied in some detail.
We must be careful to distinguish
between association
and causation. Two variables are
associated if the distribution of one is affected by a knowledge of the value
of the other. This does not mean that one variable causes
the other. There is a strong association between the number of divorces made
absolute in the United Kingdom during the first half of this century and the
amount of tobacco imported (the ‘individuals’ in the scatter diagram here being
the individual years). It does not follow either that tobacco is a serious
cause of marital discontent, or that those whose marriages have broken down
turn to tobacco for solace. Association does not imply causation.
A further distinction is between
situations in which both variables can be thought of as random variables, the
individuals being selected randomly or at least without reference to the values
of either variable, and situations in which the values of one variable are
deliberately selected by the investigator. An example of the first situation
would be a study of the relationship between the height and the blood pressure
of schoolchildren, the individuals being restricted to one sex and one age
group. Here, the sample may not have been chosen strictly at random, but it can
be thought of as roughly representative of a population of children of this age
and sex from the same area and type of school. An example of the second
situation would arise in a study of the growth of children between certain
ages. The nature of the relationship between height and age, as illustrated by
a scatter diagram, would depend very much on the age range chosen and the
distribution of ages within this range.
The Regression Model; Analysis of Residuals
Before we can perform
statistical inferences in regression and correlation, we must know whether the variables
under consideration satisfy certain conditions. In this section, we discuss
those conditions and examine methods for deciding whether they hold.
The Regression Model
Let’s return to the Orion illustration used throughout Chapter 4. In
Table 14.1, we reproduce the data on age and price for a sample of 11 Orions.
With age as the predictor variable and price as the response variable,
the regression equation for these data is ˆy = 195.47 − 20.26x, as we found in Chapter 4 on page
153. Recall that the regression equation can be used to predict the price of an
Orion from its age. However, we cannot expect such predictions to be completely
accurate because prices vary even for Orions of the
same age.
For instance, the sample data in Table 14.1 include four 5-year-old Orions. Their prices are $8500, $8200, $8900, and $9800. We expect this variation in price for 5-year-old Orions because such cars generally have different mileages, interior conditions, paint quality, and so forth.
We use the population of all 5-year-old Orions
to introduce some important regression terminology. The distribution of their
prices is called the conditional distribution of the
response variable “price” corresponding to the value 5 of the predictor variable
“age.” Likewise, their mean price is called the conditional mean of the response variable “price” corresponding to the value 5 of the
predictor variable “age.” Similar terminology applies to the standard deviation
and other parameters.
Of course, there is a
population of Orions for each age. The distribution,
mean, and standard deviation of prices for that population are called the conditional distribution,
conditional mean, and conditional standard deviation, respectively, of the response variable “price” corresponding to the
value of the predictor variable “age.”
The terminology of conditional distributions, means, and standard
deviations is used in general for any predictor variable and response variable.
Using that terminology, we now state the conditions required for applying
inferential methods in regression analysis.
Note: We
refer to the line y = β0
+ β1x—on
which the conditional means of the response variable lie—as the population regression line and
to its equation as the population
regression equation.
The
inferential procedures in regression are robust to moderate violations of
Assumptions 1–3 for regression inferences. In other words, the inferential
procedures work reasonably well provided the variables under consideration
don’t violate any of those assumptions too badly.
Assumptions for
Regression Inferences
Age and Price of Orions
For Orions, with age as the predictor variable
and price as the response variable, what would it mean for the
regressioninference Assumptions 1–3 to be satisfied? Display those assumptions
graphically.
Solution
Satisfying regressioninference Assumptions 1–3 requires that there are
constants β0, β1, and σ so that for each age, x, the prices of all Orions of that age are
normally distributed with mean β0 + β1x and standard deviation σ. Thus the prices of all 2-year-old Orions must be normally distributed with mean β0 + β1 · 2 and standard deviation σ, the prices of all 3-year-old Orions must be normally distributed with mean β0 + β1 · 3 and standard deviation σ, and so on.
To display the assumptions for regression inferences graphically, let’s
first consider Assumption 1. This assumption requires that for each age, the
mean price of all Orions of that age lies on the line
y
= β0 + β1x, as shown in Fig. 14.1.
Assumptions 2 and 3 require that the price distributions for the various
ages of Orions are all normally distributed with the
same standard deviation, σ. Figure 14.2 illustrates those two assumptions for the price
distributions of 2-, 5-, and 7-year-old Orions. The
shapes of the three normal curves in Fig. 14.2 are identical because normal
distributions that have the same standard deviation have the same shape.
Assumptions 1–3 for regression inferences, as they pertain to the
variables age and price of Orions, can be portrayed
graphically by combining Figs. 14.1 and 14.2 into a three-dimensional graph, as
shown in Fig. 14.3. Whether those assumptions actually hold remains to be seen.
Estimating the
Regression Parameters
Suppose that we are considering two variables, x and y, for which the assumptions for
regression inferences are met. Then there are constants β0, β1, and σ so that, for each value x of the predictor variable, the
conditional distribution of the response variable is a normal distribution with
mean β0 + β1x and standard deviation σ.
Because
the parameters β0, β1, and σ are usually
unknown, we must estimate them from sample data. We use the y-intercept and slope of a sample regression line as point estimates of the y-intercept and slope, respectively, of the population regression line; that is, we use b0 and b1 to estimate β0 and β1,
respectively. We note that b0 is an unbiased
estimator of β0 and that b1 is an
unbiased estimator of β1.
Equivalently,
we use a sample regression line to estimate the unknown population regression
line. Of course, a sample regression line ordinarily will not be the same as
the population regression line, just as a sample mean generally will not equal
the population mean. In Fig. 14.4, we illustrate this situation for the Orion
example. Although the population regression line is unknown, we have drawn it
to illustrate the difference between the population regression line and a
sample regression line.
In Fig. 14.4, the sample regression line (the dashed line) is the best
approximation that can be made to the population regression line (the solid
line) by using the sample data in Table 14.1 on page 551. A different sample of
Orions would almost certainly yield a different
sample regression line.
The statistic used to obtain a point estimate for the common conditional
standard deviation σ is called the standard error of the
estimate.
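A minimal sketch of these point estimates and of the standard error of the estimate; the age-price pairs below are illustrative stand-ins for Table 14.1 (only the four 5-year-old prices quoted earlier come from the text), with prices in hundreds of dollars:

import math

# Illustrative (age, price) pairs standing in for Table 14.1; prices are in
# hundreds of dollars, and only the four 5-year-old prices come from the text.
ages   = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
prices = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(prices) / n
sxx = sum((x - mean_x) ** 2 for x in ages)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, prices))

b1 = sxy / sxx              # slope: point estimate of beta1
b0 = mean_y - b1 * mean_x   # y-intercept: point estimate of beta0

# Standard error of the estimate: point estimate of the common conditional
# standard deviation sigma.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, prices))
s_e = math.sqrt(sse / (n - 2))
print(round(b0, 2), round(b1, 2), round(s_e, 2))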
Linear regression
Suppose that observations are made on
variables x and y
for each of a large number of individuals, and that we are interested in the
way in which y
changes on the average as x assumes different values. If it is appropriate to
think of y
as a random variable for any given value of x, we can enquire how the
expectation of y
changes with x. The probability distribution of y
when x is known is referred to as a conditional
distribution, and the conditional expectation is denoted by E(y|x). We make no assumption at this stage as to whether x
is a random variable or not. In a study of heights and blood pressures of
randomly chosen individuals both variables would be random; if x and y
were respectively the age and height of children selected according to age,
then only y
would be random.
The conditional expectation, E(y|x), depends in general on x. It is called the regression function of y on x. If E(y|x) is drawn as a function of x it forms the regression curve.
Two examples are shown in Fig. 7.2. The regression in Fig. 7.2(b) differs in two ways from that in Fig. 7.2(a). First, the curve in Fig. 7.2(b) is a straight line, the regression line of y on x. Secondly, the variation of y for fixed x is constant in Fig. 7.2(b), whereas in Fig. 7.2(a) the variation changes as x increases. The regression in (b) is called homoscedastic, that in (a) heteroscedastic.
Fig. 7.2 Two regression curves of y on x: (a) nonlinear and heteroscedastic; (b) linear and homoscedastic. The distributions shown are those of values of y at certain values of x.
The
situation represented by Fig. 7.2(b) is important not only because of its
simplicity, but also because regressions which are approximately linear and
homoscedastic occur frequently in scientific work. In the present discussion we
shall make one further simplifying assumption—that the distribution of y
for given x is normal.
The model may, then, be described by saying that, for a given x, y follows a normal distribution with mean

E(y | x) = α + βx

(the general equation of a straight line) and variance σ² (a constant). A set of data consists of n pairs of observations, denoted by (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), each y_i being an independent observation from the distribution N(α + βx_i, σ²). How can we estimate the parameters α, β and σ², which characterize the model?
An intuitively attractive proposal is to draw the regression line through the n points on the scatter diagram so as to minimize the sum of squares of the distances, y_i − Y_i, of the points from the line, these distances being measured parallel to the y-axis (Fig. 7.3). This proposal is in accord with theoretical arguments leading to the least squares estimators of α and β, namely the values a and b which minimize the residual sum of squares Σ(y_i − Y_i)², where Y_i is given by the estimated regression equation

Y_i = a + bx_i
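A minimal sketch, with hypothetical data, checking that the standard closed-form least squares estimates b = Σ(x_i − x̄)(y_i − ȳ)/Σ(x_i − x̄)² and a = ȳ − bx̄ do minimize the residual sum of squares:

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

def rss(a0, b0):
    # Residual sum of squares for a candidate line Y = a0 + b0 * x.
    return sum((yi - (a0 + b0 * xi)) ** 2 for xi, yi in zip(x, y))

print(rss(a, b))                    # the minimum
print(rss(a, b + 0.1) > rss(a, b))  # a perturbed slope does worse
print(rss(a + 0.5, b) > rss(a, b))  # a perturbed intercept does worse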
Inferences
in Correlation
Frequently, we want to decide whether two variables
are linearly correlated, that is, whether there is a linear relationship
between the two variables. In the context of regression, we can make that
decision by performing a hypothesis test for the slope of the population
regression line.
Alternatively, we can perform a hypothesis test for
the population linear correlation coefficient, ρ (rho). This parameter measures the linear correlation of
all possible pairs of observations of two variables in the same way that a
sample linear correlation coefficient, r, measures the linear correlation of a sample of pairs.
Thus, ρ actually
describes the strength of the linear relationship between two variables; r is only an estimate of ρ obtained from sample data.
The population linear correlation coefficient of two
variables x and y always lies between −1 and 1. Values of ρ near −1 or 1 indicate a strong linear relationship between
the variables, whereas values of ρ near 0 indicate a weak linear relationship between the variables.
Note the following:
If ρ = 0, the variables are linearly uncorrelated, meaning that there is no linear relationship between the variables.
If ρ > 0, the variables are positively linearly
correlated, meaning that y tends to increase linearly as x increases
(and vice versa), with the tendency being greater the closer ρ is to 1.
If ρ < 0, the variables are negatively linearly
correlated, meaning that y tends to decrease linearly as x increases
(and vice versa), with the tendency being greater the closer ρ is to −1.
If ρ ≠ 0,
the variables are linearly correlated. Linearly
correlated variables are either positively linearly correlated or negatively
linearly correlated. As we mentioned, a sample linear correlation coefficient, r , is an estimate of the
population linear correlation coefficient, ρ.
Consequently, we can use r as a basis for performing
a hypothesis test for ρ. To do so, we require the
following fact.
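The fact in question is not reproduced in this excerpt; in standard treatments the test statistic is t = r√(n − 2)/√(1 − r²), which has the t distribution with df = n − 2 when ρ = 0. A minimal sketch under that assumption, with hypothetical values:

import math
from scipy import stats

r = 0.62   # hypothetical sample linear correlation coefficient
n = 25     # hypothetical number of observation pairs

t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # test statistic for H0: rho = 0
p = 2 * stats.t.sf(abs(t), n - 2)                  # two-sided P-value
print(round(t, 2), round(p, 4))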
Correlation
and Causality
A
major goal of many statistical studies is to determine whether one factor causes
another. For example, does smoking cause lung cancer? In this unit, we will
discuss how statistics can be used to search for correlations that might
suggest a cause-and-effect relationship. Then we’ll explore the more difficult
task of establishing causality.
Seeking
Correlation
What
does it mean when we say that smoking causes lung cancer? It certainly
does not mean that you’ll get lung cancer if you smoke a single
cigarette. It does not even mean that you’ll definitely get lung cancer if you
smoke heavily for many years, since some heavy smokers do not get lung cancer.
Rather, it is a statistical statement meaning that you are much more
likely to get lung cancer if you smoke than if you don’t smoke.
Let’s
try to understand how researchers learned that smoking causes lung cancer.
Before they could investigate cause, researchers first needed to
establish correlations between smoking and cancer. The process of establishing
correlations began with observations. The early observations were informal.
Doctors noticed that smokers made up a surprisingly high proportion of their
patients with lung cancer. This suggestion of a linkage led to carefully
conducted studies in which researchers compared lung cancer rates among smokers
and nonsmokers. These studies showed clearly that heavier smokers were more
likely to get lung cancer. In more formal terms, we say that there is a
correlation between the variables amount of smoking and incidence of
lung cancer. A correlation is a special type of relationship between
variables, in which a rise or fall in one goes along with a corresponding rise
or fall in the other.
Establishing a
correlation between two variables does not mean that a change in one
variable causes a change in the other. Thus, finding the correlation
between smoking and lung cancer did not by itself prove that smoking causes
lung cancer. We could imagine, for example, that some gene predisposes a person
both to smoking and to lung cancer. Nevertheless, identifying the correlation
was the crucial first step in learning that smoking causes
lung cancer.
Time
out to think
Suppose
there really were a gene that made people prone to both smoking and lung cancer.
Explain why we would still find a strong correlation between smoking and lung
cancer in that case, but would not be able to say that smoking caused lung
cancer.
Scatter
Diagrams
Table
5.6 shows the production cost and gross receipts (total revenue from ticket
sales) for the 15 biggest-budget science fiction and fantasy movies of all time (through mid-2006). Movie executives presumably hope there is a favorable
correlation between the production budget and the receipts. That is, they hope
that spending more to produce a movie will result in higher box office
receipts. But is there such a correlation? We can look for a correlation by
making a scatter diagram showing the relationship between the variables production
cost and gross receipts.
The following
procedure describes how we make the scatter diagram, which is shown in Figure
5.40:
1. We assign
one variable to each axis, and we label each axis with values that comfortably
fit the data. Here, we assign production cost to the horizontal axis and
gross receipts to the vertical axis. We choose a range of $50 to $250
million for the production cost axis and $0 to $450 million for the gross
receipts axis.
2. For each
movie in Table 5.6, we plot a single point at the horizontal position
corresponding to its production cost and the vertical position corresponding to
its gross receipts. For example, the point for the movie Waterworld
goes at a position of $175 million on the horizontal axis and $88 million
on the vertical axis. The dashed lines on Figure 5.40 show how we locate this
point.
3. (Optional) If we wish, we can label data points, as is done for
selected points in Figure 5.40.
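A minimal matplotlib sketch of this procedure; apart from the Waterworld point (175, 88) quoted above, the cost-receipts pairs are hypothetical:

import matplotlib.pyplot as plt

# Hypothetical (production cost, gross receipts) pairs in millions of dollars;
# only the Waterworld point (175, 88) is taken from the text.
costs = [175, 200, 140, 110, 230, 95, 150]
receipts = [88, 310, 260, 150, 420, 75, 180]

plt.scatter(costs, receipts)                   # step 2: one point per movie
plt.xlabel("Production cost (millions of $)")  # step 1: one variable per axis
plt.ylabel("Gross receipts (millions of $)")
plt.xlim(50, 250)                              # axis ranges chosen to fit the data
plt.ylim(0, 450)
plt.annotate("Waterworld", (175, 88))          # step 3 (optional): label points
plt.show()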
Time out to think
By studying Table 5.6,
associate each of the unlabeled data points in Figure 5.40 with a particular
movie.
Types of Correlation
Look carefully at the scatter diagram
for movies in Figure 5.40. The dots seem to be scattered about with no apparent
pattern. In other words, at least for these big-budget movies, there appears to
be little or no correlation between the amount of
money spent producing the movie and the amount of money it earned in gross
receipts.
Now consider the scatter diagram in
Figure 5.41, which shows the weights (in carats) and retail prices of 23
diamonds. Here, the dots show a clear upward trend, indicating that larger
diamonds generally cost more. The correlation is not perfect. For example, the
heaviest diamond is not the most expensive. But the overall trend seems fairly
clear. Because the prices tend to increase with the weights, we say that Figure
5.41 shows a positive correlation.
In contrast, Figure 5.42 shows a
scatter diagram for the variables life expectancy and infant
mortality in 16 countries. We again see a clear trend, but this time it is
a negative correlation: Countries with higher life expectancy
tend to have lower infant mortality.
Besides stating whether a correlation
exists, we can also discuss its strength. The more closely the data follow the
general trend, the stronger is the correlation.
EXAMPLE
1 Inflation and
Unemployment
Prior to the 1990s, most economists assumed that the unemployment rate
and the inflation rate were negatively correlated. That is, when unemployment
goes down, inflation goes up, and vice versa. Table 5.7 shows
unemployment and inflation data for the period 1990–2006. Make a scatter
diagram for these data. Based on your diagram, does it appear that the data
support the historical claim of a link between the unemployment and inflation
rates?
SOLUTION
We make the scatter
diagram by plotting the variable unemployment rate on the horizontal
axis and the variable inflation rate on the vertical axis. To make the
graph easy to read, we use values ranging from 3.5% to 8% for the unemployment
rate and from 0 to 6% for the inflation rate. Figure 5.43 shows the result. To
the eye, there does not appear to be any obvious correlation between the two
variables. (A calculation confirms that there is no appreciable correlation.)
Thus, these data do not support the historical claim of a negative
correlation between the unemployment and inflation rates.
EXAMPLE 2
Accuracy of Weather
Forecasts
The scatter diagrams in Figure
5.44 show two weeks of data comparing the actual high temperature for the day
with the same-day forecast (left diagram) and the three-day forecast (right diagram). Discuss the types of correlation on each diagram.
SOLUTION
Both scatter diagrams show
a general trend in which higher predicted temperatures mean higher actual
temperatures. Thus, both show positive correlations. However, the points in the
left diagram lie more nearly on a straight line, indicating a stronger
correlation than in the right diagram. This makes sense, because we expect
weather forecasts to be more accurate on the same day than three days in
advance.
Possible Explanations for a Correlation
We began by stating that correlations can help us search for cause-and-effect relationships. But we’ve already seen that causality is not the only possible explanation for a correlation. For example, the predicted temperatures on the horizontal axis of Figure 5.44 certainly do not cause the actual temperatures on the vertical axis. The following box summarizes three possible explanations for a correlation.
EXAMPLE 3 Explanation for a
Correlation
Consider the correlation between infant mortality and life expectancy
in Figure 5.42. Which of the three possible explanations for a correlation
applies? Explain.
SOLUTION
The negative correlation between infant mortality and life
expectancy is probably an example of common underlying cause. Both variables
respond to an underlying variable that we might call quality of health care.
In countries where health care is better in general, infant mortality is
lower and life expectancy is higher.
EXAMPLE 4 How to Get Rich in the
Stock Market (Maybe)
Every financial advisor has a strategy for predicting the
direction of the stock market. Most focus on fundamental economic data, such as
interest rates and corporate profits. But an alternative strategy relies on a
remarkable correlation between the Super Bowl winner in January and the
direction of the stock market for the rest of the year: The stock market tends
to rise when a team from the old, pre-1970 NFL wins the Super Bowl, and tends
to fall otherwise. This correlation successfully matched 28 of the first 32
Super Bowls to the stock market. Suppose that the Super Bowl just ended and the
winner was the Detroit Lions, an old NFL team. Should you
invest all your spare cash (and maybe even some that you borrow) in the stock
market?
SOLUTION
Based on the reported correlation, you might be tempted to
invest, since the old-NFL winner suggests a rising stock market over the rest
of the year. However, this investment would make sense only if you believed
that the Super Bowl result actually causes the stock market to move in a
particular direction. This belief is clearly preposterous, and the correlation
is undoubtedly a coincidence. If you are going to invest, don’t base your
investment on this correlation.
ESTABLISHING
CAUSALITY
Suppose you have
discovered a correlation and suspect causality. How can you test your
suspicion? Let’s return to the issue of smoking and lung cancer. The strong
correlation between smoking and lung cancer did not by itself prove that
smoking causes lung cancer. In principle, we could have looked for proof with a
controlled experiment. But such an experiment would be unethical, since it
would require forcing a group of randomly selected people to smoke cigarettes.
So how was smoking established as a cause of lung cancer?
The answer involves
several lines of evidence. First, researchers found correlations between
smoking and lung cancer among many groups of people: women, men, and people of
different races and cultures. Second, among groups of people that seemed
otherwise identical, lung cancer was found to be rarer in nonsmokers. Third,
people who smoked more and for longer periods of time were found to have higher
rates of lung cancer. Fourth, when researchers accounted for other potential
causes of lung cancer (such as exposure to radon gas or asbestos), they found
that almost all the remaining lung cancer cases occurred among smokers.
These four lines of evidence made a strong case, but still
did not rule out the possibility that some other factor, such as genetics,
predisposes people both to smoking and to lung cancer. However, two additional
lines of evidence made this possibility highly unlikely. One line of evidence
came from animal experiments. In controlled experiments, animals were divided
into randomly chosen treatment and control groups. The experiments still found
a correlation between inhalation of cigarette smoke and lung cancer, which
seems to rule out a genetic factor, at least in the animals. The final line of
evidence came from biologists studying cell cultures (that is, small samples of
human lung tissue). The biologists discovered the basic process by which
ingredients in cigarette smoke can create cancercausing mutations. This
process does not appear to depend in any way on specific genetic factors,
making it all but certain that lung cancer is caused by smoking and not by any
preexisting genetic factor.
The following box summarizes these ideas about establishing
causality. Generally speaking, the case for causality is stronger when more of
these guidelines are met.
Time out to think
There’s a great deal of controversy concerning whether animal
experiments are ethical. What is your opinion of animal experiments? Defend
your opinion.
CASE STUDY Air Bags and Children
By the mid-1990s, passenger-side air bags had become commonplace in cars. Statistical studies showed that the air bags saved many lives in moderate- to high-speed collisions. But a disturbing pattern also appeared. In at least some cases, young children, especially infants and toddlers in child car seats, were killed by air bags in low-speed collisions.
At first, many safety advocates found it difficult to believe
that air bags could be the cause of the deaths. But the observational evidence
became stronger, meeting the first four guidelines for establishing causality.
For example, the greater risk to infants in child car seats fit Guideline 3,
because it indicated that being closer to the air bags increased the risk of
death. (A child car seat sits on top of the builtin seat, thereby putting a
child closer to the air bags than the child would be otherwise.)
To seal the case, safety experts undertook experiments using
dummies. They found that children, because of their small size, often sit where
they could be easily hurt by the explosive opening of an air bag. The experiments
also showed that an air bag could impact a child car seat hard enough to cause
death, thereby revealing the physical mechanism by which the deaths occurred.
CASE STUDY What Is Causing Global
Warming?
Statistical measurements show that the global average
temperature—the average temperature everywhere on Earth’s surface—has risen
about 1.5°F in the past century, with more than half
of this warming occurring in just the past 30 years. But what is causing this
socalled global warming?
Scientists have for decades suspected that the temperature
rise is tied to an increase in the atmospheric concentration of carbon dioxide
and other greenhouse gases. Comparative studies of Earth and other
planets, particularly Venus and Mars, show that the greenhouse gas concentration
is the single most important factor in determining a planet’s average
temperature. It is even more important than distance from the Sun. For example,
Venus, which is about 30% closer than Earth to the Sun, would be only about 45°F warmer than Earth if it had an Earth-like atmosphere.
But because Venus has a thick atmosphere made almost entirely of carbon
dioxide, its actual surface temperature is about 880°F—hot
enough to melt lead. The reason greenhouse gases cause
warming is that they slow the escape of heat from a planet’s surface, thereby
raising the surface temperature.
In other words, the physical mechanism by which greenhouse
gases cause warming is well understood (satisfying Guideline 6 on our list),
and there is no doubt that a large rise in carbon dioxide concentration would eventually
cause Earth to become much warmer. Nevertheless, as you’ve surely heard,
many people have questioned whether the current period of global warming really
is due to humans or whether it might be due to natural variations in the carbon
dioxide concentration or other natural factors.
In an attempt to answer
these questions, the United States and other nations have devoted billions of
dollars over the past two decades to an unprecedented effort to understand Earth’s
climate. We still have much more to learn, but the research to date makes a
strong case for human input of greenhouse gases as the cause of global warming.
Two lines of evidence make the case particularly strong.
The first line of
evidence comes from careful measurements of past and present carbon dioxide
concentrations in Earth’s atmosphere. Figure 5.45 shows the data. Notice that
past changes in the carbon dioxide concentration correlate clearly with
temperature changes, confirming that we should expect a rising greenhouse gas
concentration to cause rising temperatures. Moreover, while the past data show
that the carbon dioxide concentration does indeed vary naturally, they also show
that the recent rise is much greater than any natural increase during the past
several hundred thousand years. Human activity is the only viable explanation
for the huge recent increase in carbon dioxide concentration.
The second line of
evidence comes from experiments. We cannot perform controlled experiments with
our entire planet, but we can run experiments with computer models that
simulate the way Earth’s climate works. Earth’s climate is incredibly complex,
and many uncertainties remain in attempts to model the climate on computers.
However, today’s models are the result of decades of work and refinement. Each
time a model of the past failed to match real data, scientists sought to
understand the missing (or incorrect) ingredients in the model and then tried
again with improved models. Today’s models are not perfect, but they match real
climate data quite well, giving scientists confidence
that the models have predictive value. Figure 5.46 compares model data and real data,
showing good agreement and clearly suggesting that human activity is the cause
of global warming. If you include the effects of the greenhouse gases put into
the atmosphere by humans, the models agree with the data, but if you leave out
these effects, the models fail.
Time out to think
Check the idea that human activity causes global warming
against each of the six guidelines for establishing causality.
Confidence in Causality
If human activity is causing global warming, we’d be
wise to change our activities so as to stop it. But while we have good reason
to think that this is the case, not everyone is yet convinced. Moreover, the
changes needed to slow global warming might be very expensive. How do we decide
when we’ve reached the point where something like global warming requires steps
to address it?
In an ideal world, we would continue to study the issue until
we could establish for certain that human activity is the cause of global
warming. However, we have seen that it is difficult to establish causality and
often impossible to prove causality beyond all doubt. We are therefore
forced to make decisions about global warming, and many other important issues,
despite remaining uncertainty about cause and effect.
In other areas of mathematics, accepted techniques help us deal with uncertainty by allowing us to calculate
numerical measures of possible errors. But there are no accepted ways to assign
such numbers to the uncertainty that comes with questions of causality.
Fortunately, another area of study has dealt with practical problems of
causality for hundreds of years: our legal system. You may be familiar with the
following three broad ways of expressing a legal level of confidence.
Time out to think
Given what you know about
global warming, do you think that human activity is a possible cause, probable
cause, or cause beyond reasonable doubt? Defend your opinion. Based on your
level of confidence in the causality, how would you recommend setting policies
with regard to global warming?
References:
1. Machin D., Campbell M. J., Walters S. J. Medical Statistics: A Textbook for the Health Sciences. – John Wiley & Sons, Ltd., 2007. – 346 p.
2. Tintle N., Chance B., Cobb G., Rossman A., Roy S., Swanson T., VanderStoep J. Introduction to Statistical Investigations. – UCSD BIEB100, Winter 2013. – 540 p.
3. Armitage P., Berry G., Matthews J. Statistical Methods in Medical Research. – Blackwell Science, 2002. – 826 p.
4. Winner L. Introduction to Biostatistics. – Department of Statistics, University of Florida, July 8, 2004. – 204 p.
5. Weiss N. A. Elementary Statistics. – 8th ed., 2012. – 774 p.