


Language and computation

Week 1 Time: 11:00 - 12:30 Aula A

Building large corpora from the web

Roland Schäfer, Felix Bildhauer


Content of the course:

The World Wide Web most likely constitutes the largest existing source of texts written in a great variety of languages. A feasible and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language. This has several advantages: (i) It avoids the problems encountered when using internet search engines in quantitative linguistic research, such as non-transparent ranking algorithms. (ii) Creating a corpus from web data is free. (iii) The size of corpora compiled from the WWW may exceed by several orders of magnitude the size of (usually expensive) language resources offered elsewhere. (iv) The data is locally available to users, who can post-process and query it with the linguistic tools they prefer. We will address a number of theoretical and practical issues in each step of creating a web corpus of up to gigatoken size.

The course aims to enable participants to build large-scale ad-hoc web corpora on their own, going beyond the use of one-click tools (Baroni and Bernardini 2004). Equally important, participants should come away from the course aware of the theoretical implications of the techniques they use in creating the corpus.

The subject of the course is highly relevant to ESSLLI. Over the last decades, many theoretical linguists have become increasingly interested in backing up their theories with substantial amounts of empirical data. Alongside experimental methods, corpus linguistics is one of the two major cornerstones of this empirical turn. Large corpora are being used in traditional areas of linguistic research like (gradual) productivity and grammaticalization, but they also provide relevant evidence in psycholinguistic research and have even led to diverse and entirely new data-driven approaches in syntax and semantics. Huge corpora are required for such research, as empirical approaches usually imply statistical methods. Since researchers should not (and sometimes cannot) rely entirely on available corpora of the required size, knowledge of how such resources are built is an essential skill for anyone working at the interface between theory and empirical research.

Tentative outline

Session 1: Search engines, crawlers, types and formats of the retrieved documents

Session 2: HTML stripping, conversion of character encodings, paragraph detection

Session 3: Boilerplate recognition, language classification and filtering

Session 4: Duplicate and near-duplicate filtering

Session 5: Corpus evaluation, limitations of the resulting corpora
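To give a flavour of the steps named for Session 2, the following is a minimal sketch of HTML stripping, character encoding conversion, and a naive form of paragraph detection, using only the Python standard library. It is not the tooling used in the course: the names `TextExtractor` and `html_to_paragraphs`, the tag sets, and the assumption that the source encoding is already known are all choices made for this illustration.

```python
from html.parser import HTMLParser

# Tags whose content is never running text (an assumption of this sketch).
SKIP_TAGS = {"script", "style"}
# Tags treated as paragraph boundaries (an assumption of this sketch).
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "td"}


class TextExtractor(HTMLParser):
    """Collect visible text, starting a new paragraph at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self.paragraphs = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Join buffered text, normalise whitespace, keep non-empty paragraphs.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def html_to_paragraphs(raw_bytes, encoding="utf-8"):
    """Decode raw bytes from a known encoding and return text paragraphs."""
    html = raw_bytes.decode(encoding, errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    page = "<body><p>Größe &amp; Form</p><script>x=1</script><div>Second  one</div></body>"
    for paragraph in html_to_paragraphs(page.encode("latin-1"), "latin-1"):
        print(paragraph)
```

In real web data the encoding is often undeclared or declared wrongly, and block-level tags are only a rough proxy for paragraphs, which is part of why these steps need separate treatment.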

References (selected)

M. Baroni and S. Bernardini. BootCaT: Bootstrapping corpora and terms from the web. In Proceedings of LREC 2004, pages 1313-16, 2004.

M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209-26, 2009.

A. Kilgarriff and G. Grefenstette. Introduction to the special issue on the web as corpus. Computational Linguistics, 29:333-47, 2003.

C. D. Manning and H. Schütze. Foundations of statistical natural language processing. MIT Press, Cambridge, MA, USA, 1999.

C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. CUP, Cambridge, 2008.


As this is a foundational course, participants are not expected to have any specific technical skills such as programming experience. However, familiarity with basic Unix/GNU command line tools, pipes, and a text editor capable of handling huge files (like VIM, Emacs or Notepad++) is an advantage. Participants should also bring their own laptops if possible, as the course will be in part practical.
The course requires less mathematical background than the standard textbook closest to its subject (Manning et al. 2008). Even near-duplicate detection in the form of shingling will be explained in a way that makes it accessible to undergraduate students of general or theoretical linguistics.
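As an indication of how little mathematics shingling requires, here is a minimal sketch of near-duplicate detection: each document is reduced to its set of overlapping word n-grams (shingles), and two documents count as near-duplicates when the Jaccard overlap of their shingle sets reaches a threshold. The function names, the shingle length of 3, and the threshold of 0.5 are assumptions chosen for this illustration, not values prescribed by the course.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Very short texts yield a single shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of intersection divided by size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, n=3, threshold=0.5):
    """Return index pairs (i, j) of documents whose similarity >= threshold."""
    sets = [shingles(doc, n) for doc in docs]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy cat",
        "web corpora are built by crawling and cleaning pages",
    ]
    print(near_duplicates(docs))
```

Comparing every pair of documents is quadratic in the number of documents, so this direct form only suits small collections; for corpora of the size discussed here, the shingle sets are usually compressed into short sketches before comparison (Manning et al. 2008).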
