CHAPTER 9

of A Judge's Deskbook on the Basic Philosophies and Methods of Science,
by Shirley A. Dobbin, Ph.D, and Sophia I. Gatowski, Ph.D

Data Analysis: An Introduction to Statistics

Statistics is the science and art of gaining information from data -- of collecting, organizing, and interpreting numerical facts. The field of statistics includes the methods and procedures used to summarize, analyze, and draw inferences from data.

Statistics and Measurement Scales

Measurement is essentially the assigning of numbers to observations according to certain rules. The way in which the numbers are assigned to observations determines the scale of measurement being used. The choice as to which statistical test can legitimately be used for data analysis rests largely on which scale of measurement has been employed. Further, the inferences that can be drawn from a study cannot, or at least should not, outrun the data being used.

  • Nominal Scale - Categorical Data: The single property of nominal measurement is classification -- that is, the sorting of observations into different classes or categories; using numbers to label categories, sorting observations into these categories, and then noting their frequencies of occurrence. Nominal measurement reflects only differences in kind, not differences in degree or amount.
  • Ordinal Scale - Ranked Data: The distinctive property of ordinal measurement is order. Differences on an ordinal scale not only reflect differences in kind (as with nominal measurement), but also differences in degree. Information regarding greater-than or less-than status is contained in ordinal data, but information as to how much greater or less than is not -- that is, no claim can be made about the amount of difference between adjacent categories.
  • Interval Scale - Measurement Data: The distinctive property of interval measurement is equal intervals -- that is, scales in which distances between successive scale points are assumed to be equal. Interval data do contain information as to 'how much greater than' or 'how much less than.'
  • Ratio Scale: The distinctive property of ratio measurement is a true zero. This is a special form of interval scale for which an absolute zero can be determined. Ratio data allow for such ratio comparisons as 'one measure is twice as great as another.'

Learning Objectives for Chapter 9

Upon completion of this chapter, the reader should be able to:

  • understand the difference between descriptive and inferential statistics;
  • understand the difference between different levels of measurement;
  • understand the measures of central tendency, measures of variability, and measures of relationship;
  • understand the concepts of "population" and "sample" in inferential statistics;
  • understand parameter estimates and hypothesis-testing;
  • understand the concept of error rate, including Type I and Type II, false positive and false negative errors; and
  • identify common problems associated with the presentation of statistics in court.

Levels of Measurement

Level       Properties                                           Observations Reflect                      Examples
Ratio       true zero, equal intervals, order, classification    measurable differences in total amount    weight, income
Interval    equal intervals, order, classification               measurable differences in magnitude       Fahrenheit temperature, IQ score, GPA
Ordinal     order, classification                                differences in degree                     attitudes, letter grade, movie ratings
Nominal     classification                                       differences in kind                       ethnicity, political affiliation

Measures of Central Tendency and the Effects of the Scale of Measurement Used

Interval and Ratio Data: Because the distances between successive scale points are equal for interval and ratio data, these data allow for the calculation of the mean, median, and mode.

Ordinal Data: Since ordinal data provide no information regarding the distance between the scale points, calculating an ordinal mean is inappropriate and misleading. When ordinal data are used, a median should be calculated -- that is, ordinal data can be ranked, and the median is the middle score.

Nominal Data: With nominal data, neither the mean nor the median can be used, since each of these measures implies comparisons of greater than and less than. The only measure of central tendency permissible for nominal data is the mode, the most frequently occurring score.

I. Descriptive Statistics

Descriptive statistics refers to a set of procedures used to describe and summarize samples of data.

Graphing Data

The process of graphing data usually begins with the creation of a distribution. To extract some meaning from the original data, the researcher begins by bringing order to the data. The first step is to form a distribution of scores. A distribution is the arrangement of any set of scores in order of magnitude.

Frequency distributions allow the researcher to see general trends more readily than does an ordered set of raw data. A frequency distribution is a listing, in order of magnitude, of each score achieved, together with the number of times that score occurred. Frequency distributions can be presented in both tabular and graphic form (e.g., bar graphs or line graphs).
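As a rough illustration of the definition above, the following Python sketch builds a frequency distribution from a small set of scores. The scores are invented for the example and do not come from any study discussed in this chapter; the variable names are likewise only illustrative.

```python
from collections import Counter

# Hypothetical test scores, invented purely for illustration.
scores = [70, 85, 85, 90, 70, 85, 100, 90, 85, 70]

# Count how many times each score occurred.
counts = Counter(scores)

# List each score in order of magnitude together with its frequency.
frequency_distribution = sorted(counts.items())

for score, frequency in frequency_distribution:
    # The row of asterisks acts as a crude text-only bar graph.
    print(f"{score:>5}  {frequency:>2}  {'*' * frequency}")
```

The printed rows are the tabular form of the distribution; the asterisks give a simple stand-in for the bar graph form.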

Distribution: arrangement of any set of scores in order of magnitude

Frequency Distribution: a listing, in order of magnitude, of each score achieved, and the number of times each score occurred.

Samples of bar and line graphs

 

There are two main branches of statistical methods:

Descriptive Statistics

statistics that summarize, describe, and make understandable the numbers generated by a research study

Inferential Statistics

statistics used to draw conclusions and inferences which are based upon, but go beyond, the numbers generated by a research study

Measures of Central Tendency

Measures of central tendency are designed to give information concerning the average, or typical, score of a large number of scores - that is, which single score best represents an entire set of scores. There are three methods for obtaining a measure of the central tendency:

  1. The Mean
  2. The Median
  3. The Mode
  • The Mean

The mean is the arithmetic average of all the scores. It is calculated by adding all the scores together and then dividing by the total number of scores involved. It is important to realize that in some cases the mean can give a very distorted picture of the average value of a distribution of scores. That is, when there are extreme scores (called outliers) the average score will give a distorted picture of the distribution of scores.

  • The Median

The median is the exact midpoint of any distribution. To calculate the median, the scores must first be arranged in order of magnitude (e.g., from lowest to highest); the middle score is the median (with an even number of scores, the median is the average of the two middle scores). In certain cases, the median is better than the mean as a typical or representative value for a group of scores. This happens when there are a few extreme scores (called outliers) that would strongly affect the mean but would not affect the median.

  • The Mode

The mode is the most common single number in the distribution; in a perfectly symmetrical unimodal distribution, the mode is the same as the mean. However, when the two differ, the mode is often not a good representative value of the distribution. A distribution having a single mode is called a unimodal distribution. A distribution having two modes is called a bimodal distribution; one with more than two modes is called multimodal.
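The effect of an outlier on the three measures can be seen in a short Python sketch. The income figures are hypothetical and chosen only so that one extreme value is present; the standard library's statistics module is used for the calculations.

```python
import statistics

# Hypothetical annual incomes in thousands of dollars.
# The last value is a deliberate outlier.
incomes = [30, 32, 35, 35, 38, 40, 500]

mean_income = statistics.mean(incomes)      # arithmetic average of all scores
median_income = statistics.median(incomes)  # middle score once the data are ordered
mode_income = statistics.mode(incomes)      # most frequently occurring score

print(f"mean   = {mean_income:.1f}")
print(f"median = {median_income}")
print(f"mode   = {mode_income}")
```

In this invented data set the single extreme income pulls the mean above 100, while the median and the mode both stay at 35, a value much closer to what most of the group earns.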

unimodal and bimodal distribution charts

Statistics: A Practical History of Craps and Beer

During the seventeenth century, the birth of statistics finally took place. It happened one night in France. The scene was a gambling table, and the main character was the Chevalier de Mere, a noted gambler of his time. He had been having a disastrous run of losing throws. To find out whether his losses were indeed the product of bad luck or simply of unrealistic expectations, he sought the advice of the great French mathematician and philosopher Blaise Pascal (1623-1662). Pascal worked out the probabilities for the various dice throws, and the Chevalier de Mere discovered that he had been making some very bad bets indeed. Thus, the father of probability theory was Pascal.

Another milestone for statistics occurred at the turn of the century in Ireland at the famous Guinness brewery, now known worldwide for the record books of the same name. In 1906, to produce the best beverage possible, the Guinness Company decided to select a sample of people from Dublin to do a little beer tasting. Since there turned out to be no shortage of individuals willing to participate in this taste test, the question of just how large a sample would be required became financially crucial to the brewery. They turned the problem over to the mathematician William Sealy Gossett. In 1908, under the pen name "Student," Gossett produced the formula for specifying how large a sample must be to generalize the results to the entire beer-drinking population.

So that's the history - craps and beer.... The point is that the hallmark of statistics is the very practicality that gave rise to its existence in the first place. The field is not an area of mysticism or sterile speculations. It is a no-nonsense area of here-and-now pragmatism.

-- Richard C. Sprinthall, Basic Statistical Analysis, 5th Edition. Allyn and Bacon (1997), pg. 13.

Measures of Variability

A measure of central tendency (i.e., mean, median, or mode) is a single number that describes a hypothetical, typical person. A statistic that describes the extent to which scores differ from one another in a distribution, and the extent to which they differ from the mean, is called a measure of variability. Just as measures of central tendency give information about similarity among scores, measures of variability give information about how scores differ or vary.

There are three major measures of variability:

1. The Range

2. The Standard Deviation

3. The Variance

  • The Range

The range is the measurement of the width or spread of an entire distribution and is found simply by calculating the difference between the highest and lowest scores. The range is a limited measure of variability. For example, distributions can have identical means and ranges and yet vary widely in terms of other important measures of variability.

  • The Standard Deviation

The standard deviation is one of the most important measures of variability and it takes into account all scores in a distribution. The standard deviation is defined as a measure of the variability that indicates by how much all of the scores in the distribution typically deviate or vary from the mean. Since the standard deviation is always calculated with reference to the mean, its calculation demands the use of interval or ratio data. The standard deviation is the typical deviation of a given distribution. The larger the value of the standard deviation, the more the scores are spread out around the mean; the smaller the value of the standard deviation, the less the scores are spread out around the mean. That is, a distribution with a small standard deviation indicates that the group being measured is homogeneous; their scores are clustered very close to the mean. A distribution with a large standard deviation indicates that the group is heterogeneous; their scores are more widely dispersed from the mean.

Normal Curve chart

Normal Curve: a theoretical distribution; a unimodal frequency distribution with scores plotted on the X axis (the horizontal axis) and frequency plotted on the Y axis (the vertical axis); most of the scores cluster around the middle of the distribution; curve is symmetrical and all three measures of central tendency (mean, median, mode) fall precisely at the middle of the distribution.

Positively Skewed chart

Positively Skewed Distribution: distribution in which scores are concentrated near the bottom of the distribution; tail of the distribution points to the top or positive end.

Negatively Skewed Curve chart

Negatively Skewed Distribution: distribution in which scores are concentrated near the top of the distribution; tail of the distribution points to the low or negative end.

 


 

Inferential Statistics: statistical procedures used to draw conclusions and inferences which are based upon, but go beyond, the numbers generated by a research study

  • The Variance

The variance of a distribution is the square of the standard deviation. It is a useful term because many analyses are framed in terms of variance: they ask how much of the variability between people on one characteristic (e.g., income) can be explained by knowing where they stand on another characteristic (e.g., education).
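The three measures of variability can be compared in a brief Python sketch. The two groups of scores are invented so that they share the same mean and the same range; the sketch also assumes the scores make up the whole group of interest, so it uses the population formulas (pstdev and pvariance) rather than the sample formulas.

```python
import statistics

# Two hypothetical groups with the same mean (50) and the same range (80).
group_a = [10, 50, 50, 50, 90]
group_b = [10, 20, 50, 80, 90]

def describe(scores):
    """Return the range, standard deviation, and variance of a list of scores."""
    value_range = max(scores) - min(scores)  # highest score minus lowest score
    sd = statistics.pstdev(scores)           # typical deviation from the mean
    variance = statistics.pvariance(scores)  # square of the standard deviation
    return value_range, sd, variance

for name, group in (("A", group_a), ("B", group_b)):
    value_range, sd, variance = describe(group)
    print(f"group {name}: range={value_range}, sd={sd:.2f}, variance={variance}")
```

Although the range cannot tell the two groups apart, the standard deviation and variance are larger for group B, whose scores sit farther from the mean.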

Chart of standard normal distribution

The Normal Curve and Z-Scores

The normal curve is a theoretical distribution. However, many distributions of people-related measurements come close to approximating the normal curve and thus it is of crucial significance for describing data.

The normal curve is a unimodal frequency distribution with scores plotted on the X axis (the horizontal axis) and frequency plotted on the Y axis (the vertical axis). In a normal curve, most of the scores cluster around the middle of the distribution (where the curve is at its highest). As the distance from the middle increases, in either direction, there are fewer and fewer scores. The normal curve is symmetrical - both sides are mirror images of the other - and all three measures of central tendency (the mean, median, and mode) fall precisely at the same point, the exact middle of the distribution. In a skewed distribution, scores tend to pile up at one end or the other. The direction of skewness is indicated by the "tail" of the curve. The curve is positively skewed when most of the scores pile up near the bottom (the tail points toward the high or positive end). The curve is negatively skewed when most of the scores pile up near the top (the tail points toward the low or negative end).

The normal curve has a constant relationship with the standard deviation. When the normal curve is marked off in units of standard deviation, a series of constant percentages under the normal curve is formed (for example, about 68% of scores fall within one standard deviation of the mean). Once the curve is plotted according to standard deviation units, it is called the standard normal curve, or z-distribution.

A z-distribution is a normally distributed set of specially scaled scores whose mean is always equal to zero and whose standard deviation must equal 1.00. Z-scores take into account both the mean of the distribution and the amount of variability, the standard deviation. Thus, z-scores can be used to assess an individual's relative performance compared to the performance of the entire group being measured. The z-score is the number of standard deviations the observed value is from the mean.
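The z-score calculation and the constant percentages under the standard normal curve can be sketched in Python. The scale used here (mean 100, standard deviation 15) and the helper names z_score and share_within are assumptions made for the example; NormalDist is part of Python's standard statistics module.

```python
from statistics import NormalDist

def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Hypothetical scale with mean 100 and standard deviation 15.
z = z_score(130, mean=100, sd=15)
print(f"z = {z}")

# The standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of the curve lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

A score of 130 on this assumed scale lies two standard deviations above the mean, and the loop reproduces the familiar figures of roughly 68%, 95%, and 99.7% of scores falling within one, two, and three standard deviations.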

Part II. Inferential Statistics

The primary goal of inferential statistics is to measure a few and generalize to many. That is, observations are made of a small segment of the group, and then, from these observations, the characteristics of the entire group are inferred. Inferential statistics are procedures used to reach conclusions (generalizations) about larger populations from a small sample of data with a minimal degree of error.

There are usually two issues to be explored:

1. Does the mean of a sample actually reflect the mean of the larger population of interest?

2. Is a difference found between two means (e.g., between an experimental group and a control group) a real and important difference, or is it merely the result of chance?

Measures of Relationship: Correlation

Measures of central tendency and variability are basic descriptive statistics that tell us something about the distribution of a variable. Measures of relationships provide information about what relationship the variable has to other variables. The association between one variable and any other variable is described as a correlation.

If two variables have a perfect correlation (their data points fall along a straight line), then r = 1.0 (Fig 1) or r = -1.0 (Fig 2) ("r" is the correlation coefficient). The positive and negative values simply show the direction of the relationship. When two variables are positively correlated, as one increases, the other also increases. When they are negatively correlated, as one increases, the other decreases. Two variables with less than a perfect correlation will have an "r value" between 0 and 1.0 or 0 and -1.0. If no relationship exists between two variables, r = 0. Figure 1 depicts a positive correlation between Variable X and Variable Y. That is, as Variable X increases, Variable Y also increases. Figure 2 depicts a negative correlation between the two variables. That is, as Variable X increases, Variable Y decreases.
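To make the coefficient concrete, here is a minimal Python sketch that computes r directly from its definition. The function name pearson_r and the three small data sets are invented for illustration; they are constructed to produce a perfect positive, a perfect negative, and a zero linear relationship.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of paired deviations from each mean.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # increases as x increases
falling = [10, 8, 6, 4, 2]  # decreases as x increases
unrelated = [2, 1, 3, 1, 2] # no linear trend with x

print(pearson_r(x, rising))
print(pearson_r(x, falling))
print(pearson_r(x, unrelated))
```

The first two results correspond to the straight-line patterns of Figures 1 and 2; the third is zero (up to rounding) because the invented values have no linear trend.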

Measures of Relationship: Regression

Regression analysis estimates the extent to which the value of one or more variables can be predicted by knowing the value of other variables. A linear regression predicts the magnitude of the expected change in variable Y given a change in variable X. A simple linear regression is designed to determine whether there is a linear relationship between a response variable and a possible predictor variable. A multiple linear regression is designed to examine the relationship between a response variable and several possible predictor variables. Nonlinear regression is designed to describe the relationship between a response variable and one or more explanatory variables in a non-linear fashion.
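A simple linear regression can be sketched in a few lines of Python. The sketch assumes the ordinary least-squares method, which is the usual way of fitting a straight line but is not named in this chapter; the function fit_line and the data on hours studied and test scores are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: expected change in y for a one-unit change in x.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: predicted y when x is zero.
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours of study (predictor) and test score (response).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 59, 61, 64]

slope, intercept = fit_line(hours, scores)
predicted = intercept + slope * 6  # predicted score for 6 hours of study

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"predicted score at 6 hours = {predicted:.1f}")
```

The slope is the quantity the text describes: the expected change in variable Y (here, the score) for a given change in variable X (here, one additional hour).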

Key Concepts of Inferential Statistics

Population (or universe): an entire group of persons, things, or events having at least one trait in common

Sample: a smaller number of observations taken from the total number making up the population; in typical applications of inferential statistics, the sample size is small relative to the population size

To make accurate predictions, the sample should be representative of the population. In a sense, a good representative sample provides the researcher with a miniature mirror with which to view the entire population. Recall that you have seen these concepts before in the chapter on surveys.


Correlation: an association between two variables; can be positive or negative

Correlation does not equal Causation.

Correlation Coefficient: a number between -1 and 1 which measures the degree to which two variables are linearly related. If there is a perfect positive linear relationship, r = 1 (i.e., an increase or decrease in one variable is associated with an increase or decrease, respectively, in the other variable); if there is a perfect negative linear relationship, r = -1 (i.e., an increase or decrease in one variable is associated with a decrease or increase, respectively, in the other variable); if r = 0, there is no linear relationship between the variables

Pearson's Product Moment Correlation Coefficient: Pearson's product moment correlation, usually denoted by r, is one example of a correlation coefficient; a measure of the linear association between two variables that have been measured on interval or ratio scales (e.g., the relationship between height in inches and weight in pounds)


Sampling Revisited

Sampling techniques were briefly discussed in the chapter on survey methodology. They are briefly revisited here.

  • Random Sampling

Random sampling demands that each member of the entire population has an equal chance of being included and that no member of the population may be systematically excluded. It is important to note that randomness describes the selection process (i.e., the procedures by which the sample is selected), and not the particular pattern of observations in the sample.

  • Stratified Sampling

To obtain this kind of sampling, the researcher must know beforehand what some of the major population characteristics are and, then, deliberately select a sample that shares these same characteristics in the same proportions.

Whenever the sample differs systematically from the population of interest, a bias has occurred. Bias is a constant difference, in one direction, between the mean of the sample and the mean of the population. Bias occurs when most of the sampling error loads up on one side, so that the sample means are constantly either over- or under-estimating the population mean.

  • Sampling Distributions

Each distribution discussed so far has been a distribution of individual scores - each point in the distribution represents a measure of a characteristic or performance of an individual. In sampling distributions, each point represents a measure of a characteristic or performance of a sample of individuals. The mean income of a sample of U.S. adults is an example; it would be one data point in the sampling distribution of mean income. Sampling distributions are important in testing hypotheses.
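A sampling distribution can be approximated by simulation, as in the Python sketch below. Everything in it is an assumption made for illustration: the population is generated artificially (10,000 scores with a mean near 50 and a standard deviation near 10), the sample size is 25, and 1,000 random samples are drawn.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Hypothetical population of 10,000 individual scores.
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Draw 1,000 random samples of 25 individuals and record each sample's mean.
# Each recorded mean is one data point in the sampling distribution.
sample_means = [statistics.mean(rng.sample(population, 25))
                for _ in range(1_000)]

spread_of_scores = statistics.pstdev(population)
spread_of_means = statistics.pstdev(sample_means)

print(f"population mean      = {statistics.mean(population):.2f}")
print(f"mean of sample means = {statistics.mean(sample_means):.2f}")
print(f"SD of individual scores = {spread_of_scores:.2f}")
print(f"SD of sample means      = {spread_of_means:.2f}")
```

In a simulation of this kind the sample means cluster around the population mean and vary much less than individual scores do, which is what makes a sampling distribution useful as a yardstick in hypothesis testing.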

Part III. Parameter Estimates and Hypothesis-Testing

Criminal suspects are presumed innocent until proven guilty. Under hypothesis-testing procedures, the null hypothesis is presumed to be true until proven false. Once all the evidence has been considered, a verdict is reached, and the null hypothesis is either retained (failure to reject) or it is rejected.

Evidence for testing a hypothesis about a sample statistic is based on the relationship between the observed sample statistic and the sampling distribution of that statistic. For example, if a researcher predicts that the mean weight of rats in an experimental group is greater than the mean weight in a control group, then the statistic at issue is the difference between the two means. The experimental or research hypothesis is that the two means represent different populations and that the difference between them is dependable. The null hypothesis is that the two means come from the same population and that the difference between them would not hold up under repeated replications of the experiment. The difference between the means is compared to the sampling distribution of such differences, the mean of which is usually zero (no difference). If a difference as large as or larger than the obtained difference is very unlikely for groups coming from the same population, then the difference will be judged to be an improbable outcome under the null hypothesis of no dependable difference and the null hypothesis will be rejected. On the other hand, if the observed difference is not so large as to be highly improbable, the null hypothesis will be retained (that is, it will not be rejected).

An observed sample statistic will qualify as a probable outcome if the difference between its value and that of the hypothesized population statistic is small enough to be attributed to chance. For example, a sample mean will qualify as a probable outcome if the difference between its value and that of the hypothesized population mean is small enough to be attributed to chance. Under these circumstances, because there is no compelling reason to reject the hypothesis, the null hypothesis is tentatively accepted.

An observed sample statistic will qualify as an improbable outcome if the difference between its value and the hypothesized value is too large to be attributed to chance. That is, a sample mean will qualify as an improbable outcome if it deviates too far from the hypothesized mean and appears to emerge from the sparse concentration of possible sample means in either "tail" of the sampling distribution. Under these circumstances, because there are grounds for suspecting the hypothesis, the hypothesis is rejected.

The decision to reject the null hypothesis involves a degree of risk. Having rejected a null hypothesis, we can never be absolutely certain whether the decision is correct or incorrect, unless, of course, the entire population was surveyed. Even if the null hypothesis is true, there is a slight possibility that, just by chance, the one observed sample mean really originates from the rejection regions (the tails) of the hypothesized sampling distribution, thus causing the true null hypothesis to be erroneously rejected.
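The logic of comparing an observed difference with the distribution of differences expected under the null hypothesis can be sketched with an exact permutation test. This is one of several possible procedures and is not the method prescribed by this chapter; the rat weights are invented, and the test is one-sided because the research hypothesis predicts that the experimental group is heavier.

```python
from itertools import combinations
from statistics import mean

# Hypothetical rat weights in grams.
experimental = [312, 318, 325, 330, 341]
control = [295, 301, 306, 310, 320]

# The statistic at issue: the difference between the two group means.
observed = mean(experimental) - mean(control)

# Under the null hypothesis all ten rats come from the same population,
# so every way of splitting them into two groups of five is equally likely.
pooled = experimental + control
n = len(experimental)
total = sum(pooled)

count_all = 0
count_extreme = 0
for indices in combinations(range(len(pooled)), n):
    group_sum = sum(pooled[i] for i in indices)
    diff = group_sum / n - (total - group_sum) / (len(pooled) - n)
    count_all += 1
    # Count splits whose difference is as large as or larger than the observed one
    # (the small tolerance guards against floating-point rounding).
    if diff >= observed - 1e-9:
        count_extreme += 1

p_value = count_extreme / count_all
print(f"observed difference = {observed:.1f} grams")
print(f"{count_extreme} of {count_all} splits are at least that large (p = {p_value:.3f})")
```

With these invented weights only 4 of the 252 possible splits produce a difference as large as the observed one, so the observed difference would be judged an improbable outcome under the null hypothesis.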

Regression: estimates the extent to which the value of one or more variables can be predicted by knowing the value of other variables

Linear Regression: predicts the magnitude of the expected change in variable Y given a change in variable X

Simple Linear Regression: designed to determine whether there is a linear relationship between a response variable and a possible predictor variable

Multiple Linear Regression: designed to examine the relationship between a response variable and several possible predictor variables

Nonlinear Regression: designed to describe the relationship between a response variable and one or more explanatory variables in a non-linear fashion


Bias: a constant difference, in one direction, between the mean of the sample and the mean of the population; occurs when most of the sampling error loads up on one side, so that the sample means are constantly either over- or under-estimating the population mean



Sampling Error -- Whenever a sample is selected, it must be assumed that the sample measures will not precisely match those that would be obtained if the entire population were measured. The sampling error reflects, or is an index of, the difference between the sample value and the population value.

Sampling error is not a mistake. Any sample mean should be expected to deviate from the mean of the whole population, but the deviation will hopefully be random and should not be large.

Part IV. Error Rates

In determining the admissibility of expert opinion regarding a particular scientific technique, the court ordinarily should consider known or potential rates of error, and existence and maintenance of standards controlling the technique's operation.(1)

To assess known or potential rates of error, the judiciary must be prepared to carefully and critically evaluate the methodology and underlying assumptions of proffered scientific evidence. Such an evaluation would entail examination of whether the research hypothesis was appropriately articulated and tested, whether appropriate controls were utilized, whether threats to validity were controlled for, or at least severely minimized, and so forth.

The likelihood with which a measurement device or a technological procedure leads to an incorrect classification is the error rate. Whereas formal testing of hypotheses usually relies on theoretical sampling distributions for estimating the likelihood that the decision based on the data is erroneous (especially Type I error), the likelihood of an incorrect classification is usually assessed in terms of error rates. Several rates should be taken into account, typically termed "true positive," "true negative," "false positive," and "false negative" rates. For example, if a laboratory claims that a particular test reliably identifies the existence of a serious disease, it is necessary to consider the proportion of people with the disease who were correctly identified as having it (true positive) and those who were correctly identified as not having it (true negative). It is also important to consider the proportion of individuals without the disease who were incorrectly identified as having it (false positive) and the proportion of individuals with the disease who were incorrectly identified as not having it (false negative). False positives could lead to unnecessary further expense and painful medical interventions; false negatives could lead to further and perhaps fatal progression of the disease. It usually is essential to examine both types of erroneous classification rates; if proffered evidence does not include both error rates, it is likely to be of little value.

Error rates are generally stated as percentages or proportions. In the above case, for example, the data might have been drawn from people who visited their physicians because of certain bothersome symptoms, and when the physicians conducted the diagnostic test, the results for 104 patients might have been:

                              Actually has disease    Actually free of disease    Total
Test says has disease         90                      10                          100
Test says free of disease     2                       2                           4
Total                         92                      12                          104

The true positive rate is .98 (90/92), with only two diseased patients misdiagnosed (2/92, a false negative rate of .02). There were 12 patients without the disease, 10 of whom were misdiagnosed as having the disease, for a false positive rate of .83 (10/12).
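The arithmetic behind these rates can be reproduced in a short Python sketch using the counts from the 104-patient example. The variable names are illustrative only.

```python
# Counts taken from the 104-patient example in the text.
true_positive = 90   # has disease, test says has disease
false_positive = 10  # free of disease, test says has disease
false_negative = 2   # has disease, test says free of disease
true_negative = 2    # free of disease, test says free of disease

diseased = true_positive + false_negative       # 92 patients with the disease
disease_free = false_positive + true_negative   # 12 patients without it

# Rates among patients who actually have the disease.
true_positive_rate = true_positive / diseased
false_negative_rate = false_negative / diseased

# Rates among patients who are actually free of the disease.
false_positive_rate = false_positive / disease_free
true_negative_rate = true_negative / disease_free

print(f"true positive rate  = {true_positive_rate:.2f}")
print(f"false negative rate = {false_negative_rate:.2f}")
print(f"false positive rate = {false_positive_rate:.2f}")
print(f"true negative rate  = {true_negative_rate:.2f}")
```

Each pair of rates is computed against its own base (92 diseased patients or 12 disease-free patients), which is why a test can look very accurate on one pair and very poor on the other.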

This example illustrates two points. First, the rate of correct classifications has to be compared to the rates of both false positive and false negative classifications. The relative importance of the two types of errors will depend on what they lead to -- false security, expensive or painful further intervention, and so on. Second, although proportions and percentages are very useful modes of presenting data, sometimes the raw numbers underlying the percentages are equally important. In the example above, only 12 of the 104 patients were actually free of the disease, and that base of 12 is too small to draw firm conclusions about the false positive rate. We would be much more confident if the number of disease-free patients who were tested was larger. In general, if we were told that 50% of people held a certain opinion, we would want to know if the reference was 50% of 2 people or 50% of 2,000.

Hypothesis-Testing in Statistical Terms

The purpose of a hypothesis test is to determine the likelihood that a particular sample could have originated from a population with a hypothesized characteristic.

The null hypothesis supplies the value about which the hypothesized sampling distribution is centered. It always makes a statement about a characteristic of the population, never about a characteristic of the sample.

The null hypothesis always makes the claim about a single numerical value, never a range of values.

The experimental hypothesis asserts the opposite of the null hypothesis. A decision to accept the null hypothesis (or a failure to reject the null hypothesis) implies a lack of support for the experimental or research hypothesis, and a decision to reject the null hypothesis implies support for the experimental or research hypothesis.

A decision rule specifies precisely when the null hypothesis should be rejected.



Error Rate: the likelihood with which a measurement device or a technological procedure leads to an incorrect classification

True Positive: correctly classifying someone as possessing a particular characteristic or falling into a particular category (e.g., person has disease and is classified as having disease)

True Negative: correctly classifying someone who does not possess a particular characteristic or who does not fall into a particular category (e.g., person does not have disease and is classified as not having the disease)

False Positive Error: incorrectly classifying someone without a particular characteristic as possessing that characteristic (e.g., person does not have disease, but is incorrectly classified as having disease)

False Negative Error: incorrectly classifying someone who has a particular characteristic as someone who does not possess that characteristic (e.g., person has disease, but is incorrectly identified as not having it)

Type I and Type II Errors

The decision to reject the null hypothesis is based on probabilities rather than on certainties. The decision is made without direct knowledge of the true state of affairs in the population.

There are two possible decisions: (1) reject the null hypothesis, or (2) fail to reject (accept) the null hypothesis. There are also two possibilities that may be true in the population: (1) the null hypothesis is true, or (2) the experimental hypothesis is true. Thus, there are two kinds of correct decisions and two kinds of errors.

Most scientists begin with the assumption that the phenomenon they are studying does not cause the effect they expect -- the null hypothesis. In other words, the standard method of science is to presume 'innocence' and only with strong proof reject that assumption.

Scientific conventions have developed regarding the strength of this presumption; that is, how much evidence is needed before rejecting the null hypothesis and accepting an alternative hypothesis that the experimental manipulation caused the observed effect (this will be discussed further in this chapter). It is important to realize, however, that an attempt to decrease one type of error results in an increased likelihood of making the other type of error.

Type I Error: when the researcher rejects the null hypothesis but the null hypothesis is actually true (e.g., the researcher claims that there is a causal relationship between variable A and variable B when, in fact, there is not)

Type II Error: when the researcher fails to reject the null hypothesis (i.e., accepts the null hypothesis) when in actuality the experimental hypothesis is true (e.g., the researcher claims there is no causal relationship between variable A and variable B when, in fact, there is one)

                                  Researcher's Decision

True State of Nature              Reject the Null Hypothesis    Failure to Reject the Null Hypothesis

Null Hypothesis is true           Type I Error                  Correct Decision

Null Hypothesis is false          Correct Decision              Type II Error
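The four cells of the decision table can also be written as a small lookup, which some readers may find easier to follow than the grid. This is an illustrative sketch only; the function name and its two arguments are hypothetical.

```python
def decision_outcome(null_is_true, null_rejected):
    """Return the cell of the decision table for one combination."""
    if null_rejected:
        # Rejecting a true null hypothesis is a Type I error.
        return "Type I Error" if null_is_true else "Correct Decision"
    # Failing to reject a false null hypothesis is a Type II error.
    return "Correct Decision" if null_is_true else "Type II Error"


for null_is_true in (True, False):
    for null_rejected in (True, False):
        print(null_is_true, null_rejected, decision_outcome(null_is_true, null_rejected))
```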

Consider the decision made by a juror in a criminal trial. As is the case with statistics, a decision must be made on the basis of evidence: Is the defendant innocent or guilty? However, the decision is the juror's and does not necessarily reflect the true state of affairs (whether the person really is innocent or guilty). Assume the null hypothesis is that the defendant is innocent. Rejection of the null hypothesis is to decide, based upon the evidence, that the defendant is guilty. Failure to reject the null hypothesis is to decide, based upon the evidence, that the defendant is innocent.

                                                     Juror's Decision

True State of Nature                                 Reject the Null Hypothesis    Failure to Reject the Null Hypothesis
                                                     (find guilty)                 (find not guilty)

Null Hypothesis is true (Defendant is not guilty)    Type I Error                  Correct Decision

Null Hypothesis is false (Defendant is guilty)       Correct Decision              Type II Error

Confidence Level

An alternative indicator of the probability of a Type I error is the confidence level. It specifies the range of values around the empirically obtained result (usually called the confidence interval) within which the "true" or population value is likely to lie. Confidence levels are frequently reported in sample surveys. For example, it might be reported that, at the 95% confidence level, an obtained percentage of 20% is accurate to within plus or minus 3 percentage points (that is, the population value is likely to lie between 17% and 23%). The higher the confidence level, the lower the probability of a Type I error, but the broader the range of values within which the "true" or population value might actually lie.
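The "plus or minus" figure reported with a survey percentage can be approximated with a standard large-sample formula for a proportion. The sketch below is illustrative only: the sample size of 683 is an assumption chosen so that the margin comes out near 3 percentage points, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(proportion, sample_size, confidence_level):
    """Approximate margin of error for a sample proportion (normal approximation)."""
    # Two-sided critical value, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    standard_error = sqrt(proportion * (1 - proportion) / sample_size)
    return z * standard_error


p_hat = 0.20   # obtained percentage of 20%
n = 683        # hypothetical number of survey respondents
for level in (0.95, 0.99):
    moe = margin_of_error(p_hat, n, level)
    print(f"{level:.0%} confidence level: {p_hat:.0%} plus or minus {moe:.1%}")
```

Under these assumptions, moving from a 95% to a 99% confidence level widens the range around the same obtained percentage, which is the trade-off described above.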

Both types of error are important. For example, in a toxic tort case, a Type I error could mean that the frequency of occurrence of a symptom among workers could be accepted as indicating that the symptom was caused by a toxic substance found in the plant environment, whereas in fact that frequency of occurrence was just at the outer extreme of random fluctuation and was not a reflection of a causal link. The firm might improperly be held accountable. But with a Type II error, the frequency of occurrence of the symptom would be taken as well within the normal range of fluctuation, and no causal link between substance and symptom frequency would be inferred. The firm could be erroneously exonerated. The less likely a Type I error, the more likely a Type II error.


To what extent do judges around the country find the concept of error rate a useful criterion for critically evaluating scientific evidence?
[Figure: Pie chart of error rate acceptance]

All judges in the survey sample, even those not in FRE/Daubert states, were asked how useful they thought the concept of error rate is for admissibility decision-making (N = 400).

The majority (91%) indicated that a consideration of error rate was useful when determining the admissibility of scientific evidence, with 54% of those judges rating error rate as very useful.

Focusing just on responses from judges in states which follow the FRE/Daubert standards, the vast majority of judges rated error rate as a useful guideline for evaluating the admissibility of scientific evidence.

Even though the vast majority of judges rated error rate as a useful guide, the results of the survey indicate that judges do not fully understand the scientific meaning of error rates and that, as a result, they are unsure how to utilize the concept as a guideline for determining admissibility.

When asked a question about how they would apply the concept of error rate, the majority of judges expressed some hesitancy or uncertainty. In order for a response to be coded as "judge understands concept," the response had to include reference to an evaluation of the variety of sources of error, or refer to a number or percent of instances in which the classification procedure misclassified. From the answers provided, the researchers could only infer a true understanding of the concept in 4% of the responses (N = 400).

In 86% of the responses the judge's understanding of the concept was questionable. In 10% of the responses, the judge relied solely upon a low error vs. high error heuristic (or rule of thumb) when explaining how the concept of error rate is applied to admissibility (i.e., if there is a high rate of error then the judge is more likely to exclude the evidence than if there is a low rate of error).

Significance and P-Values

In order to decide whether the observed result differs significantly from what the null hypothesis predicts, a standard or criterion for deciding whether to accept or reject the null hypothesis must be established. Statisticians typically use two levels of significance: .05 and .01. These levels have been established by convention. When a significance level of p<.05 is chosen, the decision rule is that the null hypothesis will be rejected if the data are so unlikely that, were the null hypothesis true, they could have occurred by chance less than 5 times out of 100. If a significance level of p<.01 is chosen, the null hypothesis will be rejected only if the probability of the observed value occurring by chance is less than 1 in 100.

The probability of making a Type I error (rejecting the null hypothesis when it is true) is equal to the value chosen for the significance level. That is, if a researcher has chosen a significance level of .05, the probability of a Type I error is .05 -- when the null hypothesis is true, 5 times out of 100 (5%) the researcher will nonetheless reject it. In other words, there will be 5 times out of 100 when extreme differences are due to chance and not to some experimental manipulation.
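The link between the significance level and the Type I error rate can be checked by simulation. The sketch below is an illustration under stated assumptions, not a method taken from this chapter: it repeatedly draws samples from a population in which the null hypothesis is true (mean 0, known standard deviation 1), applies a two-sided z-test, and counts how often the null hypothesis is rejected. The function name, sample size, number of experiments, and seed are all hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def type_i_error_rate(alpha, n_experiments=10_000, sample_size=30, seed=0):
    """Share of simulated experiments that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    # Two-sided cutoff: reject when |z| exceeds this value.
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_experiments):
        # The null hypothesis is true here: the population mean really is 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        sample_mean = sum(sample) / sample_size
        z = sample_mean * sqrt(sample_size)  # population sd assumed known and equal to 1
        if abs(z) > critical_z:
            rejections += 1
    return rejections / n_experiments


for alpha in (0.05, 0.01):
    print(f"significance level {alpha}: rejected in {type_i_error_rate(alpha):.1%} of experiments")
```

Because every simulated rejection is a Type I error by construction, the printed shares should fall close to the chosen significance levels, with some variation from run to run if the seed is changed.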

Can the probability of making a Type I error be reduced by choosing a more stringent significance level (e.g., p<.01)? Yes, but there is a trade-off: an increased likelihood of making a Type II error (failure to reject the null hypothesis when it is false) -- the researcher concludes that the results were caused by chance when they were in fact caused by the experimental manipulation.
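The trade-off can be made concrete with a simple calculation. The sketch below assumes a one-sided z-test with a known standard deviation, a true effect of half a standard deviation, and a sample of 20; these values and the function name are hypothetical and serve only to show the direction of the trade-off.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error_probability(alpha, effect_size, sample_size):
    """Probability of failing to reject the null hypothesis when a real effect exists.

    Assumes a one-sided z-test with known standard deviation; effect_size is the
    true shift in the mean, expressed in standard deviation units.
    """
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)  # cutoff implied by the significance level
    # Under the alternative, the test statistic is centered on effect_size * sqrt(n).
    return std_normal.cdf(critical_z - effect_size * sqrt(sample_size))


for alpha in (0.05, 0.01):
    beta = type_ii_error_probability(alpha, effect_size=0.5, sample_size=20)
    print(f"significance level {alpha}: probability of a Type II error is about {beta:.2f}")
```

With everything else held fixed, the stricter significance level yields the larger Type II error probability. In practice the usual way to reduce both kinds of error at once is to collect more data.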

Statistical Significance and Legal Significance

The scientist's concept of statistical error does not translate directly into the judge's concept of legal error. It cannot be said, therefore, that a study that is statistically significant at the .05 level of confidence will lead judges, if they admit the evidence, to make only 5 errors (Type I errors) out of 100. There is no true correspondence between statistical confidence and legal burdens of proof.

Statistical Significance and Importance

The significance of a finding (the probability of a Type I error) does not have a clear relationship to the importance of the finding, either. A small difference, or a small correlation, could still be highly significant statistically, if the sample were large enough. A finding of small magnitude would still be reliably replicated on repeated investigations, if large portions of the population were included in each investigation - but although dependable, the finding might not have practical or theoretical importance.
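The effect of sample size on significance can be illustrated with a small correlation. The sketch below is a rough illustration under assumptions: it converts a correlation r into a test statistic and uses the normal distribution as an approximation to the t distribution when computing the p-value, which is reasonable only for fairly large samples. The function name and the values of r and n are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_p_value_for_correlation(r, n):
    """Approximate two-sided p-value for a correlation r observed in a sample of size n."""
    # Standard test statistic for a correlation coefficient.
    t = r * sqrt((n - 2) / (1 - r * r))
    # Assumption: normal approximation to the t distribution (adequate for large n).
    return 2 * (1 - NormalDist().cdf(abs(t)))


r = 0.05  # a very weak relationship
for n in (100, 10_000):
    print(f"r = {r}, n = {n}: p is approximately {approx_p_value_for_correlation(r, n):.4g}")
```

The same weak correlation is far from significant in the smaller sample and highly significant in the larger one, even though its magnitude, and therefore its practical importance, has not changed.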

Limitations of moving from statistical significance to legal significance:

  • a confidence level is a statistical statement and does not incorporate the variety of factors that judges must take into account in making a decision
  • most scientific research examines the general relationship between variables, while trial courts are usually concerned with specific effects on specific individuals


Confidence Level: Specifies a range of values around the empirically obtained result, within which the "true" or population value is likely to lie.

Endnote:

1. Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579, 113 S.Ct. 2786 at 508.

Glossary

bias a constant difference, in one direction, between the mean of the sample and the mean of the population; occurs when most of the sampling error loads up on one side, so that the sample means are constantly either over- or under-estimating the population mean

bimodal distribution a distribution of scores with two modal scores (two commonly occurring scores)

confidence level specifies a range of values around the empirically obtained result within which the "true" or population value is likely to lie

correlation an association between two variables; can be positive or negative; correlation does not equal causation

correlation coefficient a number between -1 and 1 which measures the degree to which two variables are linearly related; if there is a perfect positive linear relationship, r = 1 (i.e., an increase (decrease) in one variable is associated with an increase (decrease) in the other variable); if there is a perfect negative linear relationship, r = -1 (i.e., an increase (decrease) in one variable is associated with a decrease (increase) in the other variable); if r = 0 there is no linear relationship between the variables

decision rule specifies precisely when the null hypothesis should be rejected

descriptive statistics statistics that summarize, describe, and make understandable the numbers generated in a research study

distribution the arrangement of any set of scores or values in order of magnitude

error rate the likelihood with which a measurement device or a technological procedure leads to an incorrect classification

false negative error incorrectly classifying someone who has a particular characteristic as someone who does not possess that characteristic (e.g., person has disease, but is incorrectly identified as not having it)

false positive error incorrectly classifying someone without a particular characteristic as possessing that characteristic (e.g., person does not have disease, but incorrectly classified as having disease)

frequency distribution a listing, in order of magnitude, of each score and how many times that score occurred

inferential statistics statistics used to draw conclusions and inferences which are based upon, but go beyond, the numbers generated by a research study

interval scale a unit of measurement characterized by equal intervals; measures differences in amount (e.g., I.Q. score)

linear regression predicts the magnitude of the expected change in variable Y given a change in variable X

mean the arithmetic average of all the scores; calculated by adding all the scores together and then dividing by the total number of scores involved

measures of central tendency measures that provide information about the average, or typical, score of a large number of scores; which single score (mean, median, mode) best represents an entire set of scores

measures of variability procedures used to describe the extent to which scores differ from one another in a distribution; range, standard deviation, and variance statistics

median the exact midpoint of any distribution; often a more accurate representation of central tendency than the mean when the distribution is skewed; to calculate the median, the scores must first be arranged in order of magnitude (e.g., from lowest to highest); the middle score is the median

mode a measure of central tendency; the most common single number in the distribution; in a perfectly symmetrical unimodal distribution, the mode is the same as the mean; when it is not the same, the mode is not really a good representative value of the distribution

multiple linear regression designed to examine the relationship between a response variable and several possible predictor variables

negatively skewed distribution distribution in which scores are concentrated near the top of the distribution; tail of the distribution points to the low or negative end

nominal scale a unit of measurement based on classification; measures differences in kind (e.g., ethnicity)

nonlinear regression designed to describe the relationship between a response variable and one or more explanatory variables in a non-linear fashion

normal curve a theoretical distribution; a unimodal frequency distribution with scores plotted on the X axis (the horizontal axis) and frequency plotted on the Y axis (the vertical axis); most of the scores cluster around the middle of the distribution; curve is symmetrical and all three measures of central tendency (mean, median, mode) fall precisely at the middle of the distribution

ordinal scale unit of measurement characterized by order and classification; measures differences in degree (e.g., attitudes)

pearson's product moment correlation coefficient a measure of the linear association between two variables that have been measured on interval or ratio scales (e.g., the relationship between height in inches and weight in pounds); usually denoted by r, is an example of a correlation coefficient

population an entire group of persons, things, or events having at least one trait in common; the larger group of all people of interest from which the sample is selected

positively skewed distribution distribution in which scores are concentrated near the bottom of the distribution; tail of the distribution points to the top or positive end

range a measure of variability; the width or spread of an entire distribution; found simply by calculating the difference between the highest and lowest scores

regression estimates the extent to which the value of one or more variables can be predicted by knowing the value of other variables

ratio scale a unit of measurement characterized by a true zero and equal intervals; measures differences in total amount (e.g., income)

sample a smaller number of observations taken from the total number making up the population; in typical applications of inferential statistics, the sample size is small relative to the population size

simple linear regression designed to determine whether there is a linear relationship between a response variable and a possible predictor variable

skewed distribution a distribution of scores where the majority of scores in the distribution bunch up at one end of the distribution

standard deviation a measure of variability that indicates by how much all of the scores in the distribution typically deviate or vary from the mean

standard normal curve the normal curve is marked off in units of standard deviation; a normally distributed set of scaled scores whose mean is always equal to zero and whose standard deviation equals 1.00

true positive error correctly classifying someone as possessing a particular characteristic or falling into a particular category (e.g., person has disease and is classified as having disease)

true negative error correctly classifying someone who does not possess a particular characteristic or who does not fall into a particular category (e.g., person does not have disease, and is classified as not having the disease)

type I error when the researcher rejects the null hypothesis when the null hypothesis is true

type II error when the researcher fails to reject the null hypothesis when the null hypothesis is false

unimodal distribution a distribution of scores with a single modal score

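variance a measure of variability; the average of the squared deviations of the scores from the mean (the square of the standard deviation); in correlation and regression, used to describe how much of the variation between people on one characteristic can be explained by where they stand on another characteristic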

Common Problems with the Use of Statistical Evidence in Court

Statistics in court are not presented in their natural form

Statistics presented in court are rarely presented in a single, complete presentation (i.e., one side presents statistical evidence that is challenged on cross examination and then, at a later point in the trial, the other side proffers opposing statistical conclusions). In court, statistics are often presented in graphic form and there is rarely a detailed discussion of the statistical techniques and models used, their assumptions and shortcomings.

Improper inferences are drawn

When statistics are presented in court, improper inferences are often drawn about what the data mean and what conclusions can be drawn. This problem typically occurs in three ways: (1) by extrapolating results of a statistical analysis to a population that is different from the population defined in the study; (2) by inferring, within the correct population, something beyond what is statistically correct given the available data and analysis; and (3) by misinterpreting statistical significance and the burden of proof.

Improper methodologies used

Methodological problems that undermine the scientific validity or relevance of statistical results occur at many stages of the research: study design, data collection, and data analysis.

Iancu, C. A., and Steitz, P. W. (1997). Guide to Statistics.



Questions to consider when evaluating scientific evidence...
  • Were the appropriate statistical tests conducted?
  • Were proper statistical inferences drawn from the data?
  • Were there methodological problems that undermined the scientific validity, reliability, or relevancy of the statistical results?

Suggested Readings:

Barnes, D.W. (1983). Statistics as Proof: Fundamentals of Quantitative Evidence. Boston: Little, Brown and Company.

DeVore, J. and Peck, R. (1997). Statistics: The Exploration and Analysis of Data, 3rd Edition. San Francisco, CA: Duxbury.

Hagg, R.V. and Craig, A.T. (1995). Introduction to Statistics, 5th Edition. Englewood Cliffs, N.J.: Prentice Hall.

Iancu, C.A., and Steitz, P.W. (1997). "Guide to Statistics." In Expert Evidence: A Practitioner's Guide to Law, Science, and the FJC Manual. Washington, D.C.: Government Printing Office. pgs. 298-310.

Kaye, D.H., and Freedman, D.A. (1994). "Reference Guide to Statistics." In the Federal Judicial Center's Reference Manual on Scientific Evidence. St. Paul, Minnesota: West. pgs. 331-414.

Saville, D.J. and Wood, G.R. (1996). Statistical Methods: A Primer. New York: Springer
