Detail of contribution
Author: Marcos ZAMPIERI
Co-author(s): Sascha DIWERSY, University of Cologne, Germany; Binyam GEBRE, University of Cologne, Germany
Title:
Automatic Classification of Language Varieties: What does it tell us about grammar?
Abstract: The automatic identification of language varieties is an emerging field of research in which computational methods are applied to classify varieties of pluricentric languages. Recent studies include Zampieri/Gebre (2012) for Brazilian and European Portuguese and Trieschnigg et al. (2012) for Dutch. Beyond the classification itself, these methods can provide a quantitative overview of the differences between languages and varieties. So far, however, they have relied on knowledge-poor features such as character and word n-grams, which do not allow researchers to examine convergences and divergences between these varieties at a more abstract level of grammar. In this study we go one step further and use knowledge-rich features to classify different national varieties of three Romance languages: French, Portuguese and Spanish. Based on automatically annotated samples of journalistic texts, we carried out classification tasks involving sequences of morphosyntactic features. The results of our first experiments should be considered from two main perspectives. On the one hand, classification models using morphosyntactic n-grams are generally outperformed in accuracy by knowledge-poor feature models such as word n-grams. Newspaper corpora, however, are to some extent thematically biased: they contain many named entities referring to regionally specific places, personalities and institutions, which may favour word-based models. On the other hand, classification models based on sequences of morphosyntactic features still yield satisfactory results in identifying texts that represent different national varieties. For instance, most of our classification experiments using sequences of three or more morphosyntactic features yielded accuracy scores higher than those obtained when classifying texts taken exclusively from subcorpora representing a single variety.
This suggests that analysing the highest-ranked discriminating morphosyntactic n-grams in classification experiments can provide useful information on grammatical divergences between national varieties and constitute an important resource for empirical exploration.
REFERENCES
Trieschnigg, D.; Hiemstra, D.; Theune, M.; de Jong, F. and Meder, T. (2012) An exploration of language identification techniques for the Dutch Folktale Database. Proceedings of LREC2012. Istanbul, Turkey.
Zampieri, M. and Gebre, B. (2012) Automatic Identification of Language Varieties: The Case of Portuguese. Proceedings of KONVENS2012. Vienna, Austria.
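ILLUSTRATIVE SKETCH
The abstract does not name the classifier, the tagset or the ranking method used, so the following is only a minimal sketch of the general idea under stated assumptions: texts are reduced to sequences of morphosyntactic tags, n-grams over those tags serve as features, and the n-grams that best separate two varieties are then ranked. The multinomial naive Bayes model, the tag names, the labels "variety_A" and "variety_B", and the toy data are all hypothetical choices made for illustration, not the authors' setup.

```python
from collections import Counter
import math


def tag_ngrams(tags, n):
    """Return all contiguous n-grams (as tuples) over a sequence of tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


class TagNgramNaiveBayes:
    """Multinomial naive Bayes over tag n-grams with add-one smoothing.

    Hypothetical helper class; an assumption, not the method of the abstract.
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = {}             # label -> Counter of n-gram frequencies
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()           # all n-grams seen in training

    def fit(self, docs, labels):
        for tags, label in zip(docs, labels):
            grams = tag_ngrams(tags, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_prob(self, gram, label):
        """Smoothed log P(gram | label)."""
        c = self.counts[label]
        total = sum(c.values())
        return math.log((c[gram] + 1) / (total + len(self.vocab)))

    def predict(self, tags):
        grams = tag_ngrams(tags, self.n)
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.counts:
            prior = math.log(self.doc_counts[label] / n_docs)
            scores[label] = prior + sum(self.log_prob(g, label) for g in grams)
        return max(scores, key=scores.get)

    def top_ngrams(self, label, other, k=5):
        """Rank n-grams by log-ratio of smoothed probabilities (label vs other)."""
        ranked = sorted(
            self.vocab,
            key=lambda g: self.log_prob(g, label) - self.log_prob(g, other),
            reverse=True,
        )
        return ranked[:k]


# Toy tag sequences, invented for illustration only.
train_docs = [
    ["PRON", "CLITIC", "VERB", "DET", "NOUN"],
    ["NOUN", "CLITIC", "VERB", "ADJ"],
    ["PRON", "VERB", "CLITIC", "DET", "NOUN"],
    ["NOUN", "VERB", "CLITIC", "ADJ"],
]
train_labels = ["variety_A", "variety_A", "variety_B", "variety_B"]

model = TagNgramNaiveBayes(n=2).fit(train_docs, train_labels)
print(model.predict(["DET", "NOUN", "CLITIC", "VERB", "ADV"]))
print(model.top_ngrams("variety_A", "variety_B", k=1))
```

Ranking n-grams by the ratio of their smoothed class probabilities is one simple way to surface candidate grammatical divergences for manual inspection; other feature-weighting schemes would serve the same purpose.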