Shiken: JALT Testing & Evaluation SIG Newsletter, Vol. 9 No. 2. Oct. 2005 (p. 29 - 30) [ISSN 1881-5537]
Statistical Analyses for Language Assessment
by Lyle F. Bachman (2004)
Cambridge: Cambridge University Press (364 + xiv pages)
This three-part work highlights key concepts and methods used in statistics for the benefit
of language teachers and researchers interested in test development. This work represents something of a hybrid: some
of it deals with the technical details needed to churn out statistical data, but it is also deeply concerned with the principles behind hard figures.
Part One lays the groundwork by covering elementary statistical concepts. Instead of going through lots of jargon,
the author shows how a few dozen terms are rooted in a limited number of fundamental concepts. One of the first key
concepts introduced is test usefulness, which presupposes reliability, validity, authenticity, interactiveness,
impact, and practicality. Though many tests attend to some of these features, Bachman emphasizes that tests should seek
to address all of those factors systematically. Variables limiting effective measurement such as underspecification,
subjectivity, and relativeness are mentioned along with ways to limit their deleterious influence.
Next Bachman addresses the issue of how to interpret test scores. After describing the qualities of univariate scores
(supposedly measuring a single variable), ways to describe scores measuring two or more things are considered.
Basic types of score distributions and measures of central tendency and variability are clarified, along with correlation
measures widely used in classical test theory. The advice about overcoming common correlation measurement
errors on pages 93-94 is especially helpful.
Part Two delves into ways to analyze and improve tests. The first chapter in that section explains what sort of
correlation studies to use to ascertain the item discrimination of a test – in other words, how well it
distinguishes between different groups of test takers. Though classroom teachers can safely pass over much of
the information about norm-referenced tests in this chapter, the information about how to analyze items for
criterion-referenced tests is likely to be helpful.
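To make the idea of item discrimination concrete, here is a minimal sketch of one common index, the upper-lower discrimination index D. It is offered only as an illustration and is not necessarily the procedure Bachman presents; the function name, the 27% group size, and the data are all invented for this example.

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Upper-lower discrimination index D for one dichotomous item.

    item_correct: list of 0/1 values, one per test taker
    total_scores: total test scores, in the same order
    fraction: share of test takers in each extreme group (27% is an assumption)
    """
    if len(item_correct) != len(total_scores):
        raise ValueError("item_correct and total_scores must be the same length")
    n_group = max(1, round(len(total_scores) * fraction))
    # Rank test takers from highest to lowest total score.
    ranked = sorted(zip(total_scores, item_correct),
                    key=lambda pair: pair[0], reverse=True)
    upper = [correct for _, correct in ranked[:n_group]]
    lower = [correct for _, correct in ranked[-n_group:]]
    # D = proportion correct in the upper group minus that in the lower group.
    return sum(upper) / n_group - sum(lower) / n_group


# Invented data: ten test takers answering a single item.
totals = [48, 45, 44, 40, 37, 33, 30, 27, 22, 18]
item = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
d = discrimination_index(item, totals)
print(round(d, 2))
```

Values of D near 1 suggest that high scorers answer the item correctly far more often than low scorers, values near 0 suggest the item does not separate the groups, and negative values flag an item worth inspecting.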
Attention then turns to estimating reliability. After explaining how reliability estimates are calculated for norm-referenced tests (NRTs),
ways of estimating reliability for criterion-referenced tests (CRTs) are described. One of the key concepts outlined is measurement error,
which can be loosely conceived of as the antithesis of reliability.
The way that measurement error relates to other common reliability estimates is also elucidated.
Bachman not only covers the most common reliability statistics from classical test theory,
but also mentions how generalizability theory (G-theory) and item response theory (IRT) can contribute to our understanding of reliability.
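The inverse relationship between reliability and measurement error can be sketched with two standard classical test theory formulas: Cronbach's alpha as the reliability estimate and the standard error of measurement, SEM = SD × sqrt(1 − reliability). This is an illustrative sketch rather than a reproduction of the book's worked examples; the function names and the response data are assumptions made up for the purpose.

```python
import math
import statistics


def cronbach_alpha(item_scores):
    """Cronbach's alpha from a list of per-person item-score lists."""
    k = len(item_scores[0])
    # Sample variance of each item across test takers.
    item_vars = [statistics.variance([row[i] for row in item_scores])
                 for i in range(k)]
    # Sample variance of the total scores.
    total_var = statistics.variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


# Invented data: six test takers, four dichotomously scored items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
alpha = cronbach_alpha(responses)
sd_total = statistics.stdev([sum(row) for row in responses])
sem = standard_error_of_measurement(sd_total, alpha)
print(f"alpha = {alpha:.3f}, SEM = {sem:.3f}")
```

As the reliability estimate approaches 1, the SEM shrinks toward zero; as it approaches 0, the SEM approaches the standard deviation of the scores themselves.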
Addressing the quandary of how to tell whether differences in test performance are 'real' or due to random chance,
the first chapter in the final section of this work discusses the role of hypothetical inferences. A key concept
is the distinction between theoretical, operational, and research hypotheses, all of which
illuminate different layers of analysis. After clarifying the distinction between these various types of hypotheses,
the author turns his attention to data collection designs and assumptions that can be made about various sampling distributions.
The next chapter considers ways to check the statistical significance of observed differences among variables.
Fairly detailed discussions about the standard error of the mean, confidence intervals, and t-ratios are offered.
Classroom teachers will find the discussion of how to use t-tests particularly helpful. This chapter concludes
by discussing ways to investigate differences through analysis of variance (ANOVA) procedures.
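For readers who want to see how these three quantities fit together, the following is a minimal sketch using only the Python standard library. Two choices here are assumptions rather than anything taken from the book: the confidence interval uses a normal approximation (small classroom samples would call for a critical value from the t distribution instead), and the t-ratio is Welch's version for two independent groups. The scores are invented.

```python
import math
import statistics
from statistics import NormalDist


def standard_error_of_mean(scores):
    """Sample standard deviation divided by the square root of n."""
    return statistics.stdev(scores) / math.sqrt(len(scores))


def confidence_interval(scores, level=0.95):
    """Confidence interval for the mean, using a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    mean = statistics.mean(scores)
    margin = z * standard_error_of_mean(scores)
    return mean - margin, mean + margin


def t_ratio(group_a, group_b):
    """Welch's t-ratio for two independent groups (unequal variances allowed)."""
    se_a = statistics.variance(group_a) / len(group_a)
    se_b = statistics.variance(group_b) / len(group_b)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    return diff / math.sqrt(se_a + se_b)


# Invented scores for two classes that took the same test.
class_a = [72, 85, 78, 90, 66, 81, 77, 88]
class_b = [65, 70, 74, 68, 80, 62, 71, 69]
print("SE of mean (class A):", round(standard_error_of_mean(class_a), 2))
print("95% CI (class A):", confidence_interval(class_a))
print("t-ratio (A vs. B):", round(t_ratio(class_a, class_b), 2))
```

The t-ratio is then compared against a critical value for the chosen significance level and degrees of freedom, a step that is left out of this sketch.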
The book then explores test validation. The importance of linking theoretical aspects of validation
with practical procedures is well underscored. Adopting legal metaphors, the author notes that "validation
can be seen as the process of building a case – articulating an argument and collecting evidence –
in support of a particular interpretation of test scores" (p. 262). Concrete ways to link test scores to
generalized 'universe scores' and extrapolated 'target scores' are discussed.
The book concludes by discussing how to report and interpret test results. Pointing out the need to report
test scores in ways that are meaningful to test users, Bachman highlights some common problems test takers have
in interpreting raw scores. The question of what to do if a test taker receives a high score on one part of a test
but a low score on another is discussed at length. Not surprisingly, the author asks us to reflect on broader test purposes.
This leads naturally into the question of weighting (how many points to assign to a particular score).
Weights can be assigned nominally by test developers or derived from statistical procedures that estimate the
self-weight of each component and how much it should contribute to an overall test score.
There are a number of small things to quibble about with this book. For example, the list of abbreviations on page
xiv seems incomplete. At times this work also attempts to cover more terrain than it can adequately explain:
the discussions about effect size and non-parametric tests of significance on page 254 are so
cursory that they are unfulfilling. Despite these limitations, Bachman basically succeeds in accomplishing an
ambitious task. This volume shows how to apply and interpret a wide range of statistical theory. One of the best
features of this work is its accompanying CD-ROM and workbook. These allow readers to practice using many of
the statistical procedures outlined in the book. Another nice feature is the way it focuses on the rationale
and appropriate use of most formulas, rather than merely listing the formulas themselves. In short, this book
is not always an easy read – but it is a useful one for those seeking a better grasp of the language testing literature.
- Reviewed by Tim Newfields