“Bag of words approach”

Treatment of search and index terms as independent, atomic units. Each document may be preprocessed using a stop list and a stemming algorithm and is then represented as a bag (an unordered collection) of the processed terms. The approach contrasts with systems based on, for example, thesauri or ontologies.
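
The preprocessing described above can be illustrated with a minimal Python sketch. The stop list and the suffix-stripping function are hypothetical stand-ins chosen for brevity (they are not a published stop list or a real stemming algorithm such as Porter's), and the names STOP_LIST, naive_stem and bag_of_words are assumptions made for this example only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; real systems use much longer ones.
STOP_LIST = {"a", "an", "and", "as", "in", "is", "of", "the", "to"}


def naive_stem(term: str) -> str:
    """Crude suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the term remain.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def bag_of_words(text: str) -> Counter:
    """Lowercase, tokenize, drop stop words, stem, and count the terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOP_LIST)


bag = bag_of_words("The indexing of documents and the indexed document")
print(bag)
```

In this sketch "indexing" and "indexed" collapse to one term, as do "documents" and "document", and the result records only how often each processed term occurs, not where it occurred.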

It is a "knowledge-poor" approach and is therefore limited to the mechanical components of text. Advancing to semantic, syntactic or discourse-level understanding requires ontologies and thesauri as well as a deep understanding of text structure. Eduard Hovy pointed out that a program will scarcely understand "that the sequence enter + order + wait + eat + pay + leave can be summarized as restaurant visit."
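
The limitation can be made concrete with a small Python sketch built on Hovy's example. The helper name bag is an assumption for illustration; it simply counts whitespace-separated terms. The sketch suggests two things: reordering the events leaves the bag unchanged, and the summarizing concept "restaurant" never appears among the terms, so it could only come from knowledge outside the text.

```python
from collections import Counter


def bag(text: str) -> Counter:
    """Count whitespace-separated terms, ignoring their order."""
    return Counter(text.lower().split())


events = bag("enter order wait eat pay leave")
shuffled = bag("leave pay eat wait order enter")

# Term order is discarded, so both sequences yield the same representation.
same_bag = events == shuffled

# The summarizing concept is not among the terms of the document.
has_restaurant = "restaurant" in events

print(same_bag, has_restaurant)
```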

Literature:

Hovy, E., & Lin, C.-Y. (1999). Automated text summarization in Summarist. In I. Mani & M. Maybury (Eds.), Advances in automatic text summarization (pp. 81-94). Cambridge, MA: MIT Press.

Scheffer, T., & Wrobel, S. (2002). Text classification beyond the bag-of-words representation. In Proceedings of the ICML’02 Workshop on Text Learning. http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/26772/http:zSzzSzwww-ai.ijs.sizSzDunjaMladeniczSzTextML02zSzpaperszSzScheffer.pdf/text-classification-beyond-the.pdf

Sparck Jones, K. (2005). Revisiting classification for retrieval. Journal of Documentation, 61(5), 598-601. http://www.db.dk/bh/Core%20Concepts%20in%20LIS/Sparck%20Jones_reply%20to%20Hjorland%20&%20Nissen.pdf

Birger Hjørland

Last edited: 31-01-2007
