Test Collection (Information retrieval)

A test collection is a set of documents together with a matching set of queries and relevance judgments. Because the judgments establish which documents are relevant to which queries, different algorithms and retrieval strategies can be tested on the collection using recall and precision as measures. Among the best-known test collections are those of the Cranfield experiments and TREC.
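As an illustration of how relevance judgments turn into recall and precision figures, the following is a minimal sketch in Python. The query and document identifiers, the `qrels` mapping, and the function name `precision_recall` are hypothetical and chosen only for this example; they do not come from any particular test collection.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical relevance judgments: for each query, the set of documents
# that assessors judged relevant. All identifiers are illustrative.
qrels: Dict[str, Set[str]] = {
    "q1": {"d1", "d3"},
    "q2": {"d2"},
}


def precision_recall(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / documents retrieved
    recall    = relevant documents retrieved / relevant documents in the collection
    """
    retrieved_set = set(retrieved)  # ignore duplicate entries in the result list
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# A hypothetical system returns three documents for query q1.
p, r = precision_recall(["d1", "d2", "d4"], qrels["q1"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Here one of the three retrieved documents is relevant (precision 1/3), and one of the two relevant documents was found (recall 1/2). Running the same function over every query in the collection gives the per-query figures on which different retrieval strategies can be compared.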
 

Shaw, Burgin & Howell (1997a) found that typical levels of performance of operational cluster-based retrieval can be explained on the basis of chance. Indeed, most operational results in retrieval test collections are lower than those predicted by random graph theory. A tentative explanation for the poor performance of cluster-based retrieval points to weaknesses in both fundamental assumptions and operational implementations: the cluster hypothesis offers no guarantee that relevant documents are naturally grouped together, clustering algorithms may not reveal the inherent structure in a set of documents, and retrieval strategies do not reliably retrieve the most effective cluster or clusters of documents. A further contributing factor is that most cluster-based retrieval implementations implicitly treat topical relatedness as equivalent to relevance.
 

Shaw, Burgin & Howell (1997b) compared operational levels of performance for vector-space, ad-hoc-feature-based, probabilistic, and other retrieval models. The effectiveness of these techniques in small, traditional test collections can be explained by retrieving a few more relevant documents for most queries than expected by chance, whereas their effectiveness in the large TREC test collections can only be explained by retrieving many more relevant documents for most queries than expected by chance. The discrepancy between deviations from chance in traditional and TREC test collections is due to a decrease in performance standards for large test collections, not to an increase in operational performance. Retrieving a few more relevant documents than expected by chance leads to mediocre levels of performance; recall and precision are rarely greater than 0.50 for any retrieval strategy in any test collection. However, marginal improvements over expectations based on chance may be sufficient to initiate successful interactions between an end-user and the next generation of retrieval systems, in which relevance judgments will be automatically translated into progressively improving estimates of the capacity of terms and other features to discriminate between relevant and non-relevant documents. Realization of such systems would be enhanced by abandoning uninformative performance summaries and focusing on the effectiveness, and improvements in effectiveness, of individual queries.
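To make the idea of "expected by chance" concrete, the sketch below assumes the simplest possible chance model: the retrieved documents are drawn uniformly at random, without replacement, from the collection. This is an assumption made for illustration and is not necessarily the exact model used by Shaw, Burgin & Howell. Under it, the number of relevant documents retrieved follows a hypergeometric distribution whose mean is the retrieved-set size times the proportion of relevant documents. The function names and the numbers are hypothetical.

```python
def expected_relevant_by_chance(collection_size: int,
                                num_relevant: int,
                                num_retrieved: int) -> float:
    """Mean number of relevant documents in a random sample.

    Assumes (for illustration only) that num_retrieved documents are drawn
    uniformly at random, without replacement, from a collection that holds
    num_relevant relevant documents for the query (hypergeometric mean).
    """
    if collection_size <= 0:
        raise ValueError("collection_size must be positive")
    if not 0 <= num_relevant <= collection_size:
        raise ValueError("num_relevant must lie in [0, collection_size]")
    if not 0 <= num_retrieved <= collection_size:
        raise ValueError("num_retrieved must lie in [0, collection_size]")
    return num_retrieved * num_relevant / collection_size


def excess_over_chance(observed_relevant: int,
                       collection_size: int,
                       num_relevant: int,
                       num_retrieved: int) -> float:
    """How many more relevant documents were retrieved than chance predicts."""
    expected = expected_relevant_by_chance(collection_size, num_relevant, num_retrieved)
    return observed_relevant - expected


# Hypothetical query: 1000 documents, 20 of them relevant, 50 retrieved,
# of which 4 turned out to be relevant.
expected = expected_relevant_by_chance(1000, 20, 50)
excess = excess_over_chance(4, 1000, 20, 50)
print(f"expected by chance: {expected}, excess over chance: {excess}")
```

In this made-up case chance alone would yield one relevant document, so retrieving four is "a few more than expected by chance", yet precision is only 4/50 and recall only 4/20. Computing the excess per query, rather than averaging across queries, is in keeping with the recommendation to focus on individual queries.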

 

 

Lin & Katz (2006) contrast traditional information retrieval systems, which return ranked lists of documents that users must browse manually, with question answering systems, which attempt to answer natural language questions posed by the user directly. Although question answering systems possess language-processing capabilities, they still rely on traditional document retrieval techniques to generate an initial candidate set of documents. The authors argue that document retrieval for question answering is a different task from retrieving documents in response to more general retrospective information needs. Thus, to guide future system development, specialized question answering test collections must be constructed. They show that current evaluation resources have major shortcomings; to remedy the situation, they manually created a small, reusable question answering test collection for research purposes. They describe their methodology for building this test collection and discuss issues they encountered regarding the notion of "answer correctness."

 

 

 

 

 

Literature:

 

Bailey, P.; Craswell, N. & Hawking, D. (2003). Engineering a multi-purpose test collection for Web retrieval experiments. Information Processing & Management, 39(6), 853-871. 

Hersh, W. R.; Hickam, D. H.; Haynes, R. B. & McKibbon, K. A. (1994). A performance and failure analysis of SAPHIRE with a medical test collection. Journal of the American Medical Informatics Association, 1(1), 51-60.

Lin, J. & Katz, B. (2006). Building a reusable test collection for question answering. Journal of the American Society for Information Science and Technology, 57(7), 851-861.

Markkula, M.; Tico, M.; Sepponen, B.; Nirkkonen, K. & Sormunen, E. (2001). A test collection for the evaluation of content-based image retrieval algorithms - A user and task-based approach. Information Retrieval, 4(3-4), 275-293. 

Robertson, S. E.; Walker, S. & Hancock-Beaulieu, M. M. (1995). Large test collection experiments on an operational, interactive system: OKAPI at TREC. Information Processing & Management, 31(3), 345-360.
 
Shaw, W. M.; Burgin, R. & Howell, P. (1997a). Performance standards and evaluations in IR test collections: Cluster-based retrieval models. Information Processing & Management, 33(1), 1-14.

Shaw, W. M.; Burgin, R. & Howell, P. (1997b). Performance standards and evaluations in IR test collections: Vector-space and other retrieval models. Information Processing & Management, 33(1), 15-36. (Review).
 

Soboroff, I. (2002). Test collections. http://www.csee.umbc.edu/~ian/irF02/lectures/10Test-Collections.pdf

 

Sparck-Jones, K. (1975). Performance yardstick for test collections. Journal of Documentation, 31(4), 266-272.

 

Sparck-Jones, K. & van Rijsbergen, C. J. (1976). Information retrieval test collections. Journal of Documentation, 32(1), 59-75.

 

Weeber, M.; Mork, J. G. & Aronson, A. R. (2001). Developing a test collection for biomedical word sense disambiguation. Journal of the American Medical Informatics Association, S, 746-750.
 

 

 

 

 

See also: Experiment in Information Science

 

 

Birger Hjørland

Last edited: 17-09-2006
