Contribution details

Author: Michael PIOTROWSKI

Towards Concept-Based Information Retrieval for Historical Texts

Abstract: Standard information retrieval (IR) methods, as used by Web search engines, assume that queries and documents use standardized orthography and that they agree in their verbalization of concepts. Historical text, however, is characterized by a high degree of spelling variation as well as by lexical and semantic differences from modern language. Consequently, the effectiveness of standard IR methods designed for modern-language text drops significantly when they are applied to historical texts. There have been numerous attempts to mitigate this problem by normalizing the spelling of historical words so that they can be matched to modern words. Spelling canonicalization can help to improve retrieval results if the language of the historical texts is relatively homogeneous and close to modern language, i.e., if spelling differences and spelling variation are indeed the main problem. For older texts and for diachronic collections, spelling canonicalization is not sufficient, as the vocabulary and the meaning of individual words, even of cognates, may differ widely from modern language. We propose to tackle the problem by using concept-based IR. In concept-based IR, documents and queries are represented using semantic concepts instead of (or in addition to) keywords, and retrieval is then performed by matching in the concept space. The basic idea is that the use of high-level concepts to represent documents and queries eliminates the requirement that the same words be used in queries and documents: a concept-based IR system can therefore find relevant documents even when the words used in the query do not occur in the documents, i.e., when the verbalization of a concept differs between the query and the target documents. Concept-based IR systems require a thesaurus for mapping words to concepts. This requirement has long been the main bottleneck in the construction of concept-based IR systems.
More recently, however, Wikipedia has been discovered as a resource that can be used to create a concept thesaurus automatically, eliminating the need for manual creation of thesauri. There is no Wikipedia for historical languages, though. We therefore propose to exploit the information contained in critical editions of historical texts (summaries, apparatuses, indices, etc.) in a similar way to automatically construct thesauri for historical languages. The idea is thus to address the problems posed by spelling variation and by lexical and semantic change through concept-based retrieval, and to meet the need for historical-language thesauri by automatically extracting them from critical editions.
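The matching step described in the abstract can be illustrated with a minimal sketch. It is not the system proposed here: the thesaurus entries, the concept identifiers, the sample documents, and the function names (`to_concepts`, `score`, `search`) are all hypothetical, and the thesaurus is written by hand rather than extracted from a critical edition. The sketch only shows how a modern-language query might retrieve a historical-language document once both are mapped into a shared concept space.

```python
from collections import Counter

# Hypothetical toy thesaurus: surface forms (modern and historical) -> concept IDs.
# In the proposed approach such a mapping would be derived automatically;
# here it is hand-written purely for illustration.
THESAURUS = {
    "horse": "C_HORSE",
    "hors": "C_HORSE",
    "steed": "C_HORSE",
    "king": "C_KING",
    "kyng": "C_KING",
    "sword": "C_SWORD",
    "swerd": "C_SWORD",
}

# Illustrative documents with non-standard spellings (assumed, not from a corpus).
DOCUMENTS = [
    "the kyng drew his swerd",
    "a hors stood in the feld",
    "the prest sang in the chirche",
]


def to_concepts(text):
    """Map each known token to its concept ID; tokens not in the thesaurus are dropped."""
    tokens = text.lower().split()
    return Counter(THESAURUS[t] for t in tokens if t in THESAURUS)


def score(query, document):
    """Count the concept occurrences shared by the query and the document."""
    q = to_concepts(query)
    d = to_concepts(document)
    return sum(min(q[c], d[c]) for c in q)


def search(query, documents):
    """Return documents sharing at least one concept with the query, best match first."""
    scored = [(score(query, doc), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc for s, doc in ranked if s > 0]


if __name__ == "__main__":
    # None of the query words occurs literally in the matching document.
    print(search("king sword", DOCUMENTS))
```

A plain keyword match for "king sword" would find nothing in these documents, since neither word occurs in them. The concept-level overlap is what retrieves the first document, and the quality of the result depends entirely on the coverage of the thesaurus.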