“Bag of words approach”
Treatment of search and index terms as independent, atomic units. Each document may be preprocessed using a stop list and a stemming algorithm, and is then represented as a bag of processed terms.
This approach contrasts with systems based on, for example, thesauri or ontologies.
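The preprocessing described above can be illustrated with a minimal Python sketch. It is not taken from any particular retrieval system: the stop list `STOP_WORDS`, the toy suffix stripper `crude_stem` (standing in for a real stemming algorithm such as Porter's) and the function name `bag_of_words` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; real systems use longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}


def crude_stem(token):
    """Toy suffix stripper; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Strip the first matching suffix, but keep at least three letters.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bag_of_words(text):
    """Return a term -> frequency mapping; word order is discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)


doc = "The guest ordered the soup and waited for the soup."
bag = bag_of_words(doc)
print(bag)
```

Under these assumptions the document is reduced to the counts soup: 2, guest: 1, order: 1, wait: 1. Each term is treated as an independent unit, so any reordering of the same words yields the same bag.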
It is a "knowledge-poor" approach and is therefore limited to the mechanical, surface components of text. Advancing to semantic, syntactic or discourse-level understanding requires ontologies, thesauri and a deep understanding of text structure. Eduard Hovy pointed out that a program will scarcely understand "that the sequence enter + order + wait + eat + pay + leave can be summarized as restaurant visit."
Literature:
Hovy, E., & Lin, C.-Y. (1999). Automated text summarization in Summarist. In I. Mani and M. Maybury (Eds.), Advances in automatic text summarization (pp. 81-94). Cambridge, MA: MIT Press.
Scheffer, T., & Wrobel, S. (2002). Text classification beyond the bag-of-words representation. In Proceedings of the ICML’02 Workshop on Text Learning. http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/26772/http:zSzzSzwww-ai.ijs.sizSzDunjaMladeniczSzTextML02zSzpaperszSzScheffer.pdf/text-classification-beyond-the.pdf
Sparck Jones, K. (2005). Revisiting classification for retrieval. Journal of Documentation, 61(5), 598-601. http://www.db.dk/bh/Core%20Concepts%20in%20LIS/Sparck%20Jones_reply%20to%20Hjorland%20&%20Nissen.pdf
Birger Hjørland
Last edited: 31-01-2007