Test Collection (Information retrieval)
A test collection is a set of documents together with a matching set of
queries and relevance judgments. Because it is thus determined which documents
are relevant to which queries, different algorithms and retrieval strategies
can be tested using recall and precision as measures. Among the best known
test collections are those used in the Cranfield experiments and in the
Text REtrieval Conference (TREC).
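As an illustration of how relevance judgments make such testing possible, the following is a minimal Python sketch rather than any standard evaluation tool. The names qrels, run and precision_recall, and the toy document identifiers, are assumptions made for this example only.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical relevance judgments: query id -> set of relevant document ids.
qrels: Dict[str, Set[str]] = {
    "q1": {"d1", "d4", "d7"},
    "q2": {"d2"},
}

# Hypothetical output of one retrieval strategy: query id -> retrieved document ids.
run: Dict[str, List[str]] = {
    "q1": ["d1", "d2", "d4", "d9"],
    "q2": ["d3", "d5"],
}


def precision_recall(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a single query."""
    unique = set(retrieved)          # ignore duplicate document ids
    hits = len(unique & relevant)    # relevant documents that were retrieved
    precision = hits / len(unique) if unique else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# One (precision, recall) pair per query in the test collection.
results = {qid: precision_recall(run.get(qid, []), rel) for qid, rel in qrels.items()}
```

Because the judgments are fixed, a second strategy can be scored against the same qrels and compared query by query.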
Shaw, Burgin & Howell (1997a) found that typical levels of operational
performance in cluster-based retrieval can be explained on the basis of
chance. Indeed, most operational results in retrieval test collections are lower
than those predicted by random graph theory. A tentative explanation for the
poor performance of cluster-based retrieval reveals weaknesses in both
fundamental assumptions and operational implementations. The cluster hypothesis
offers no guarantee that relevant documents are naturally grouped together,
clustering algorithms may not reveal the inherent structure in a set of
documents, and retrieval strategies do not reliably retrieve the most effective
cluster or clusters of documents. The poor performance is compounded by the
fact that most cluster-based retrieval implementations implicitly treat topical
relatedness as equivalent to a relevance relationship.
Shaw, Burgin & Howell (1997b) compared operational levels
of performance for vector-space, ad-hoc-feature-based, probabilistic, and other
retrieval models. The effectiveness of these techniques in small, traditional
test collections can be explained by retrieving a few more relevant documents
for most queries than expected by chance, and the effectiveness of retrieval
techniques in the large TREC test collections can only be explained by
retrieving many more relevant documents for most queries than expected by
chance. The discrepancy between deviations from chance in traditional and TREC
test collections is due to a decrease in performance standards for large test
collections, not to an increase in operational performance. Retrieving a few
more relevant documents than expected by chance leads to mediocre levels of
performance; recall and precision are rarely greater than
0.50 for any retrieval strategy in any test collection. However, marginal
improvements to expectations based on chance may be sufficient to initiate
successful interactions between an end-user and the next generation of retrieval
systems, in which relevance judgments will be automatically translated into
progressively improving estimates of the capacity of terms and other features to
discriminate between relevant and non-relevant documents. Realization of such
systems would be enhanced by abandoning uninformative performance summaries and
focusing on effectiveness and improvements in effectiveness of individual
queries.
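The comparison with chance, and the focus on individual queries, can be sketched as follows. This is a minimal Python illustration under a stated assumption: "chance" is modelled here as drawing the retrieved documents uniformly at random without replacement, so the expected number of relevant documents is the hypergeometric mean n * R / N. This is not necessarily the chance model used by Shaw, Burgin & Howell, and all names and figures below are hypothetical.

```python
from typing import Dict, NamedTuple


class QueryOutcome(NamedTuple):
    num_relevant: int        # relevant documents in the collection for this query (> 0)
    num_retrieved: int       # documents the system returned (> 0)
    relevant_retrieved: int  # returned documents that were judged relevant


def expected_hits_by_chance(collection_size: int, num_relevant: int,
                            num_retrieved: int) -> float:
    """Expected relevant documents under random retrieval: n * R / N."""
    return num_retrieved * num_relevant / collection_size


def per_query_report(collection_size: int,
                     outcomes: Dict[str, QueryOutcome]) -> Dict[str, Dict[str, float]]:
    """Report effectiveness per query instead of one averaged summary."""
    report = {}
    for qid, o in outcomes.items():
        expected = expected_hits_by_chance(collection_size, o.num_relevant, o.num_retrieved)
        report[qid] = {
            "precision": o.relevant_retrieved / o.num_retrieved,
            "recall": o.relevant_retrieved / o.num_relevant,
            "expected_by_chance": expected,
            "excess_over_chance": o.relevant_retrieved - expected,
        }
    return report


# Hypothetical figures for a collection of 1000 documents.
outcomes = {
    "q1": QueryOutcome(num_relevant=20, num_retrieved=50, relevant_retrieved=6),
    "q2": QueryOutcome(num_relevant=10, num_retrieved=50, relevant_retrieved=1),
}
report = per_query_report(1000, outcomes)
```

In this toy data, q1 retrieves five more relevant documents than the random baseline predicts, yet its recall is only 0.3, which matches the pattern described above: a few documents better than chance still yields mediocre recall and precision.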
Lin & Katz (2006) contrast traditional information retrieval systems, which return ranked lists of documents that users must browse manually, with question answering systems, which attempt to answer natural language questions posed by the user directly. Although question answering systems possess language-processing capabilities, they still rely on traditional document retrieval techniques to generate an initial candidate set of documents. The authors argue that document retrieval for question answering is a different task from retrieving documents in response to more general retrospective information needs. Thus, to guide future system development, specialized question answering test collections must be constructed. They show that the current evaluation resources have major shortcomings; to remedy the situation, they have manually created a small, reusable question answering test collection for research purposes. They describe their methodology for building this test collection and discuss issues they encountered regarding the notion of "answer correctness."
Literature:
Bailey, P.; Craswell, N. & Hawking, D. (2003). Engineering
a multi-purpose test collection for Web retrieval experiments. Information
Processing & Management, 39(6), 853-871.
Hersh, W. R.; Hickam, D. H.; Haynes, R. B. & McKibbon, K. A. (1994). A
performance and failure analysis of SAPHIRE with a medical test collection.
Journal of the American Medical Informatics Association, 1(1), 51-60.
Lin, J. & Katz, B. (2006). Building a reusable test collection for question
answering. Journal of the American Society for Information Science and
Technology, 57(7), 851-861.
Markkula, M.; Tico, M.; Sepponen, B.; Nirkkonen, K. & Sormunen, E. (2001). A
test collection for the evaluation of content-based image retrieval algorithms -
A user and task-based approach. Information Retrieval, 4(3-4), 275-293.
Robertson, S. E.; Walker, S. & Hancock-Beaulieu, M. M. (1995). Large test
collection experiments on an operational, interactive system: OKAPI at TREC.
Information Processing & Management, 31(3), 345-360.
Shaw, W. M.; Burgin, R. & Howell, P. (1997a). Performance standards and
evaluations in IR test collections: Cluster-based retrieval models.
Information Processing & Management, 33(1), 1-14.
Shaw, W. M.; Burgin, R. & Howell, P. (1997b). Performance standards and
evaluations in IR test collections: Vector-space and other retrieval models.
Information Processing & Management, 33(1), 15-36. (Review).
Soboroff, I. (2002). Test collections. http://www.csee.umbc.edu/~ian/irF02/lectures/10Test-Collections.pdf
Sparck-Jones, K. (1975). Performance yardstick for test collections. Journal of Documentation, 31(4), 266-272.
Sparck-Jones, K. & van Rijsbergen, C. J. (1976). Information retrieval test collections. Journal of Documentation, 32(1), 59-75.
Weeber, M.; Mork, J. G. & Aronson, A. R. (2001). Developing
a test collection for biomedical word sense disambiguation. Journal of the
American Medical Informatics Association, S, 746-750.
See also: Experiment in Information Science
Birger Hjørland
Last edited: 17-09-2006