Talks & posters



Induced lexical categories enhance cross-situational learning of word meanings
Afra Alishahi (Tilburg University) and Grzegorz Chrupała (Tilburg University)

Room 4, 10.30-10.50

In this paper we bring together two sources of information that have been proposed as clues used by children acquiring word meanings. One mechanism is cross-situational learning which exploits co-occurrences between words and their referents in perceptual context accompanying utterances. The other mechanism is distributional semantics where meanings are based on word-word co-occurrences.

We propose an integrated incremental model which learns lexical categories from linguistic input as well as word meanings from simulated cross-situational data. The co-occurrence statistics between the learned categories and the perceptual context enable the cross-situational word learning mechanism to form generalizations across word forms.

Through a number of experiments we show that our automatically and incrementally induced categories significantly improve the performance of the word learning model, and are closely comparable to a set of gold-standard, manually-annotated part of speech tags. We perform further analyses to examine the impact of various factors, such as word frequency and class granularity, on the performance of the hybrid model of word and category learning.

Furthermore, we simulate guessing the most probable semantic features for a novel word from its sentential context in the absence of perceptual cues, an ability which is beyond the reach of a pure cross-situational learner.


Rule learning in humans
Raquel G. Alhama, Remko Scha & Willem Zuidema (Institute of Logic, Language and Computation; Universiteit van Amsterdam)

Poster session, 12.15-13.20

Peña et al. (2002) studied people's tendency to induce "grammar rules" from a continuous sequence of syllable triplets that consistently display non-adjacent dependencies. After 100 exposures to a sequence of triplets sampled from an artificial language, human adults show no preference between unattested triplets that fit the grammar (‘rule-words’) and attested triplets that cross word boundaries (‘part-words’). After 300 exposures, however, subjects prefer part-words (time effect). These results are usually interpreted as evidence for two different processes: a statistical mechanism that tracks transitional probabilities, and a rule mechanism for structure detection.

We investigate through computational modelling whether the presented empirical results really rule out a single mechanism account. We propose a probabilistic model that computes subjective counts of the subsequences found in the familiarization stream, and we quantify the subject's willingness to generalize using the Simple Good-Turing formula.
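
To make the generalization component concrete, here is a minimal sketch (not the authors' model) of the basic Good-Turing intuition behind the Simple Good-Turing formula: the probability mass reserved for unseen items, a rough proxy for the willingness to generalize, shrinks as exposure grows. The syllable triplets and stream sizes are invented.

```python
from collections import Counter

def unseen_mass(observed_triplets):
    """Basic Good-Turing estimate of the probability mass reserved for unseen
    items: N1 / N, where N1 is the number of types seen exactly once and N is
    the total number of observations."""
    counts = Counter(observed_triplets)
    n_total = sum(counts.values())
    n_once = sum(1 for c in counts.values() if c == 1)
    return n_once / n_total if n_total else 0.0

# Hypothetical familiarization streams of different lengths:
short_stream = ["puliki", "beraga", "talidu", "puliki"]
long_stream = short_stream * 75
print(unseen_mass(short_stream))  # larger -> more willingness to generalize
print(unseen_mass(long_stream))   # approaches 0 -> attested items preferred
```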

Figure 1 shows that for a range of parameter settings the willingness to generalize decreases with the number of exposures. Figure 2 illustrates that the gap between subjective frequencies of part-words and rule-words initially increases, before decreasing again.


Figure 1: Willingness to generalize.


Figure 2: Words and part-words in memory.

The interplay between these two effects yields the observed time effect. Thus, our model suggests that claims about multiple mechanisms underlying artificial language learning are premature.

References

Peña, M., Bonatti, L. L., Nespor, M., & Mehler, J. (2002). Signal-driven computations in speech processing. Science, 298(5593), 604-607.


Using Syntactic Roles in Hierarchical Phrase-Based SMT
Sophie Arnoult (ILLC) and Khalil Simaan (UvA)

Room 3, 14.20-14.40

Hierarchical Phrase-Based (HPB) models are appealing for Statistical Machine Translation because, being synchronous CFGs, they can readily model a bitext; this contrasts with Phrase-Based models, which model phrase translation and reordering separately. HPB models are, however, highly ambiguous, which subjects them to constraints that limit their expressiveness; for instance, a limit on the length their rules may span negatively affects their reordering scope. Much research on HPB models is thus concerned with disambiguation and/or improving reordering, often through the use of linguistic information. This information generally consists of syntactic parses or predicate-argument structures, and as it is monolingual, it must be adapted for use in a synchronous model. This is a difficult task that requires balancing linguistic constraints with model expressiveness and complexity.

We will present ongoing work on the use of syntactic roles for HPB. Restricting linguistic information to heads, adjuncts and complements may guide an HPB model with little added complexity compared with syntactic labelling. Moreover, syntactic roles may be informative at the clause level, as they reflect predicate-argument structure. We will present heuristics to identify adjuncts and complements given dependency parses, discuss ideas to integrate syntactic roles in an HPB model, and possibly report on experiments, depending on progress.


Dat we ons daar nog mee kunnen bezig houden. Looking for cluster creepers in Dutch treebanks
Liesbeth Augustinus (CCL, KU Leuven) and Frank Van Eynde (CCL, KU Leuven)

Room 5, 14.20-14.40

In Dutch V-final clauses the verbs tend to form a cluster which cannot be split up by nonverbal material, as in (1).

(1) ... dat hij het haar gisteren had verteld.
*... dat hij het haar had gisteren verteld.
"... that he had told her that yesterday."

However, the Algemene Nederlandse Spraakkunst (1997), as well as other studies on the phenomenon, lists several cases in which the verb cluster may be interrupted by so-called cluster creepers. The most common examples are constructions with separable verb particles, but examples with nouns, adjectives, and adverbs are attested as well, cf. (2).

(2) ... dat we ons daar nog mee kunnen bezig houden.
"... that we can still keep ourselves busy with that."

Since most of the data in previous studies were collected through introspection and elicitation, it is interesting to compare those findings to corpus data. The corpus analysis is based on data from Dutch treebanks (CGN, LASSY, SoNaR), which make it possible to take regional and/or stylistic variation into account. This is an important aspect of the analysis, since cluster creeping is reported to be typical of spoken and regional variants of Dutch.

The goal of this corpus-based investigation is, on the one hand, to provide insight into the frequency of the phenomenon and, on the other hand, to classify the types of cluster creepers. Besides the linguistic analysis, methodological issues regarding the extraction of the relevant data from the treebanks will be addressed as well.


Generating from Meaning Representations: Lexical Choice and Inflection Realisation
Valerio Basile (University of Groningen) and Johan Bos (University of Groningen)

Room 3, 10.30-10.50

Natural Language Generation is the area of Computational Linguistics that investigates how to produce natural language utterances to express information encoded in abstract, formal formats.

An NLG system has to deal with the problem of “finding the right words” to express the concepts or data given as input; in the case of a robust, open-domain NLG component, where concepts and events are represented as WordNet synsets (Fellbaum 1998), the lexicalization task is not trivial. The problem of lexicalizing WordNet concepts can be subdivided into two steps, namely i) choosing the right lemma from the synset corresponding to the concept, and ii) generating morphological information such as word inflections. The former task can be cast as a lexical choice problem and could be viewed as the inverse of Word Sense Disambiguation, although it requires different context features.

We present the lexical choice and the inflection generation problems in the framework of surface realisation from the logical forms collected in the Groningen Meaning Bank (Basile et al. 2012). We investigate how the alignment between the Discourse Representation Graphs of the GMB and their respective surface strings (Basile and Bos 2013) can be exploited to automatically predict the most appropriate word forms. Furthermore, we present the results of a pilot study in which four possible English inflection particles (-s, -ed, -ing and -en) are predicted from a model built on (deep) semantic features.


Araneum Nederlandicum Maius: A New Family Member
Vladimír Benko (Slovak Academy of Sciences, Ľ. Štúr Institute of Linguistics, Slovakia)

Poster session, 12.15-13.20

Web corpora are generally recognized as a valuable source of authentic language data that can be utilized in various areas of linguistics and NLP. Moreover, due to the multilingual nature of the Internet, building large-scale comparable corpora is much easier from web data than from any other source.

Our presentation describes the Aranea* Project, in the framework of which we build a family of medium-sized comparable web corpora for languages spoken and/or taught in Slovakia and the neighbouring countries. Having already covered the "inner circle" and the main foreign languages, we are taking the opportunity to present Aranea in Leiden by creating a billion-token Dutch corpus.

The software tools used include BootCaT for harvesting the seed URLs and the SpiderLing web crawler, which integrates a language recognition module, a boilerplate removal utility and a document deduplication procedure. A series of filters cleans the data of several types of "noise", and the resulting text is tokenized and morphosyntactically annotated with TreeTagger. The data is further deduplicated at the sentence level by our own tool and made accessible through the NoSketch Engine corpus management system. The source corpus data is also provided for download. Note, however, that the copyright status of the data is not clear, and users from countries where this might cause legal problems will have to solve this issue themselves.
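
As a generic illustration of the sentence-level deduplication step (a sketch of the idea only, not the project's own tool), one can hash a whitespace- and case-normalized form of each sentence and keep the first occurrence:

```python
import hashlib

def dedupe_sentences(sentences):
    """Keep only the first occurrence of each normalized sentence."""
    seen, unique = set(), []
    for sent in sentences:
        key = hashlib.sha1(" ".join(sent.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(sent)
    return unique

print(dedupe_sentences(["Dit is een zin.", "Dit  is  een zin.", "Een andere zin."]))
# -> ['Dit is een zin.', 'Een andere zin.']
```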

All the tools mentioned are either open-source or free for academic use. In addition to this, users having an account in the Sketch Engine site can enjoy Aranea processed by compatible sketch grammars that facilitate their use in bilingual lexicography and contrastive studies.

* Araneum (pl. aranea, n.) is the Latin expression for (cob)web

Towards a Parallel Meaning Bank
Johannes Bjerva, Kilian Evang & Johan Bos (University of Groningen)

Poster session, 12.15-13.20

Arguably, meaning ought to be language independent. Hence, the meaning representation of a translated text should be consistent with that of the original text. The Groningen Meaning Bank (Basile et al, 2012b) is a corpus of English texts annotated with meaning representations and also provides tools to annotate new English texts. We can utilize this resource to obtain meaning representations for languages for which we have parallel texts. This addition will allow us to experiment with, e.g., transferring semantic annotation to other languages, and resolving semantic ambiguities by exploiting the fact that, e.g., polysemic patterns often do not overlap exactly between languages.

Parallel text for such a multilingual resource should be of sufficient size, and preferably also be composed of as many languages as possible. We plan to include the New Testament in nearly 1,000 languages, as well as the Europarl corpora, which are approximately two orders of magnitude larger but are only available for a handful of (European) languages.

We need to align sentences between the parallel texts in order to correctly map the non-English texts to the corresponding meaning. This will be done by combining automatic alignment tools with an extension of our wiki-like corpus editor GMB Explorer (Basile et al, 2012a) that adds an interface for parallel text. It will display the alignment of sentences within a document, using English as a pivot language. The community will be able to correct the sentence alignment simply and intuitively. The corpus editor and first results will be presented and exemplified with Dutch-English examples.


Large-scale analysis of order variation in Dutch verbal clusters
Jelke Bloem, Arjen Versloot and Fred Weerman (University of Amsterdam)

Room 5, 16.10-16.30

In this work, we take a quantitative approach to explaining word order variation in Dutch verbal clusters. In Dutch subordinate clauses, the auxiliary verb can be positioned before or after the past participle:

(1) Ik denk dat ik het [begrepen heb]
(2) Ik denk dat ik het [heb begrepen]

The choice between these structures is considered to be free, but speakers are forced to choose between them and appear to do so in systematic ways. Several explanations of this variation have been proposed in the linguistic literature (Coussé et al., 2008), including semantic, discourse, mode, and regional factors, based on analyses of texts annotated for this purpose.

We conduct a systematic analysis of some of the proposed factors, using the existing, automatically annotated Lassy Large corpus. Because no additional manual annotation is required, we can analyze far more clusters, albeit with a limited feature set. This large scale lets us examine cluster features down to the lexical level and provide a more detailed semantic analysis.

Using logistic regression models, we find significant correlations between verbal cluster order and auxiliary type in the corpus, confirming previous findings of De Sutter (2005). We also extend these findings to main clause verbal clusters, which previous work has not included, and find different distributions of orders there which are not simply explained by the different frequencies of auxiliary types in main clauses. Some other previously neglected cluster types such as pure infinitival clusters and clusters with particle verbs are also discussed.
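
The kind of model used can be illustrated with a toy logistic regression; the observations and the exact feature set below are invented for illustration and are not the authors' corpus or model.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical cluster observations: order = 1 for the auxiliary-first order
# ("heb begrepen"), 0 for the participle-first order ("begrepen heb").
data = pd.DataFrame({
    "order":       [1, 0, 1, 1, 0, 0, 0, 1],
    "aux_type":    ["hebben", "zijn", "hebben", "worden",
                    "zijn", "hebben", "worden", "zijn"],
    "main_clause": [0, 0, 1, 1, 0, 1, 0, 0],
})

# Logistic regression of cluster order on auxiliary type and clause type.
model = smf.logit("order ~ C(aux_type) + main_clause", data=data).fit(disp=0)
print(model.summary())
```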


Personal dictionary for text prioritization and selection
Francisco Bonachela Capdevila, Maribel Montero Perez & Stefan De Wannemacker (K.U. Leuven)

Poster session, 12.15-13.20

Finding high quality texts of an appropriate difficulty level has always been a bottleneck in language learning. In this project, we present the use of a personal dictionary as a source of information for text prioritization and eventually for text classification and selection.

The personal dictionary has been designed in the context of the iMinds iRead+ ICON project, a joint effort between industry and academic partners. This project, among other objectives, was devoted to the development of a language learning application where user interaction through automatically generated exercises is needed.

The personal dictionary, a user specific and dynamic vocabulary, is used by the application to select new texts of an appropriate difficulty for every user. It contains a set of words along with metadata that will be used for text prioritization and selection by an algorithm that compares the dictionary vocabulary with every text of the pool.

Dictionary words can be added by the user at any time, by the application based on the errors of the learner and by the teacher to guide the progress of the student. Different weights are used to balance the prioritization algorithm and to reflect the progressive learning of the student.


The first results of the Groot Nationaal Onderzoek Woordenschat
Marc Brysbaert, Emmanuel Keuleers, Pawel Mandera and Michael Stevens (Ghent University)

Room 1, 13.20-13.40

In this talk we will present the design and the first results of the nationwide Dutch vocabulary test we developed. Close to 400 thousand people participated, for a total of 600 thousand tests. As the different tests covered nearly the entire Dutch vocabulary, we have recognition data for 52 thousand words, with some 750 observations per word, coming from the Netherlands and Belgium.


Language-agnostic processing of microblog data with text embeddings
Grzegorz Chrupała (Tilburg University)

Poster session, 12.15-13.20

A raw stream of posts from a microblogging platform such as Twitter contains text written in a large variety of languages and writing systems, in registers ranging from formal to internet slang. A significant amount of effort has been expended in recent years to adapt standard NLP pipelines to deal with such content. In this paper we suggest a less labor-intensive approach to processing multilingual user-generated content.

We induce low-dimensional distributed representations of text by training a recurrent neural network on the raw bytestream of a microblog feed. Such representations have been recently shown to be effective when used as learned features for sequence labeling tasks such as word, sentence and text segmentation (Chrupała 2013, Evang et al. 2013).

In the current work we propose two new scenarios for using such representations. Firstly we employ them in a sequence transduction setting for tweet normalization. Secondly we propose a simple way to build a distributed bag-of-words analog using byte-level text embeddings, and apply it in a hashtag recommendation model.
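
A rough sketch of the proposed bag-of-words analog, with a randomly initialized embedding table standing in for the representations learned by the recurrent network, and invented example tweets:

```python
import numpy as np

# Hypothetical lookup table: one vector per byte value (0-255), standing in
# for embeddings learned by a recurrent network over the raw bytestream.
rng = np.random.default_rng(0)
byte_embeddings = rng.normal(size=(256, 50))

def embed_tweet(text):
    """Distributed bag-of-words analog: average the byte-level embeddings."""
    byte_ids = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    return byte_embeddings[byte_ids].mean(axis=0)

def recommend_hashtag(tweet, tagged_tweets):
    """Recommend the hashtag of the most similar previously seen tweet."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    v = embed_tweet(tweet)
    return max(tagged_tweets, key=lambda tw: cosine(v, embed_tweet(tw[0])))[1]

examples = [("great match tonight!", "#worldcup"), ("new phone looks amazing", "#tech")]
print(recommend_hashtag("what a match", examples))
```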

References

Kilian Evang, Valerio Basile, Grzegorz Chrupała, Johan Bos. 2013. Elephant: Sequence Labeling for Word and Sentence Segmentation. EMNLP.

Grzegorz Chrupała. 2013. Text segmentation with character-level text embeddings. ICML Workshop on Deep Learning for Audio, Speech and Language Processing.


Computing ellipsis
Crit Cremers (Universiteit Leiden LUCL)

Room 5, 10.30-10.50

Elliptic sequences like "Nobody knows when and where" and "Mary probably near Cleveland" can only contribute to the meaning of a text if the parser is able to complete these sequences semantically. These sequences act as propositions without providing all the lexical material necessary to establish that proposition. Computing the meaning(s) of a full-fledged sentence already requires a smart grammar, extensive lexical data and deep semantics. Computing the meaning of elliptic sentences requires even more: for the resolution of ellipsis to be effective, precise and reliable, a few additional procedures must converge. First, the sequences that require elliptic resolution must be identified and delimited. Second, the functional type of the linguistic information that is to resolve the sequence has to be determined. Third, the context must be explored to liberate the proper antecedent for the resolution.

In this talk, I will present the outlines of algorithms to do the job. They connect ellipsis resolution to generalized coordination, dynamic binding and to the semantic algebras of types. Because ellipsis resolution is almost deterministic, the strategy explored by the procedures is grammar based, post-derivational and type-driven. The strategy accounts for polarity and veridicality effects in determining antecedents. Since it is parasitic on non-local analysis, I will also discuss the consequences of ellipsis resolution procedures for the architecture of semantic processing and analysis. These consequences apply to every branch of propositional semantics, and pass beyond what distributional models can offer.


Identifying the MT error types that are most problematic for student post-editors
Joke Daems (LT3, Ghent University), Lieve Macken (LT3, Ghent University) and Sonia Vandepitte (Ghent University)

Room 3, 16.30-16.50

Post-editing machine translation (MT) is an important step towards high-quality translations. In order to better understand the post-editing process, we need a better understanding of the relationship between MT output and the post-editing of this output. While automatic evaluation metrics such as the widely used BLEU can be used to compare the overall quality of different MT systems, a more detailed error analysis is necessary to identify the typical errors that appear in MT output and in the subsequent post-editing.

By using a fine-grained Translation Quality Assessment approach and grouping translation errors into source text-related error sets, we link MT errors to errors after post-editing to examine their relationship. We are mainly interested in answering the following questions: What (and how many) MT errors are solved by post-editors, what (and how many) problems occur in post-edited MT and which (and how many) of these originate from MT?

We present the results of a pilot study in which student translators post-edited newspaper articles and user documentation from English into Dutch. We found that the MT errors that student post-editors most easily correct are grammatical errors, whereas e.g. wrong collocations, word sense disambiguation errors and the misspelling of compounds prove to be more challenging. As such, we can identify the types of errors that post-editor training should pay more attention to.


Robustness of state-of-the-art Dutch text processing tools on Twitter data
Orphée De Clercq and Véronique Hoste (Ghent University)

Room 1, 16.50-17.10

In today's society, where people continuously write about their personal lives and opinions online, understanding how products, companies, etcetera are perceived in these social media creates a wealth of opportunities for companies and other organisations. One major challenge is how social media can be mined to send individual users personalized ads, which is also the main goal of the PARIS project.

One way to find out what individual users are interested in is to look at the text they produce online. This requires deep linguistic text processing. In this presentation we present a Dutch Twitter dataset on which a full NLP pipeline, comprising Dutch state-of-the-art tools, was tested. This pipeline consists of both preprocessing (tokenization, lemmatization and part-of-speech tagging) and deep semantic processing (named entity recognition, coreference resolution and semantic role labeling). The results of this test revealed that the preprocessing is quite robust to user-generated content, whereas the deep processing requires some adaptations. We therefore retrained our modules and compared the output when first normalizing our data using an SMT-based normalisation system.


Unsupervised learning of structure in large datasets
Karlijn Dinnissen, Chris Emmery, Kim Groeneveld, Ákos Kádár, Sander Maijers, Judith Rooswinkel, Lucas Vergeest and Menno van Zaanen (Tilburg University)

Room 2, 9.30-9.50

The task of grammatical inference (GI) is to identify a finite, compact representation of a language, given a set of sentences from the language to be learned. Empirical GI (in contrast to formal GI, which deals with mathematical learnability proofs) designs and implements practical learning systems with the aim of advancing our knowledge of learnability in tasks where the generative power of the language is unknown.

In the area of empirical GI, Alignment-Based Learning (ABL) (van Zaanen, 2002) is one of the state-of-the-art systems. Given a corpus, a collection of unstructured sentences, ABL identifies regularities, which results in a tree structure for each sentence.

The framework used in the ABL system forms the basis for a range of alternative approaches. Initial experiments have shown that using different approaches results in quite different treebanks (van Zaanen & van Noord, 2012). Learning curves show that the amount of training data also has a large impact on the results.

We take a look at algorithms that can practically handle larger amounts of data. The output of the GI systems is still evolving when they are trained on the amounts of training data used in previous publications. We will show how the systems react when more training data is available. Additionally, we will present a qualitative analysis of the learned treebanks, which indicates what kinds of structures are learned.
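
To make the alignment idea concrete, here is a minimal sketch of the kind of alignment step that ABL-style systems build on (not van Zaanen's implementation): the parts in which two sentences differ, occurring in identical contexts, are hypothesized constituents.

```python
from difflib import SequenceMatcher

def hypothesize_constituents(sent_a, sent_b):
    """Align two token sequences; the unequal parts are hypothesized
    constituents, since they occur in identical contexts."""
    a, b = sent_a.split(), sent_b.split()
    hypotheses = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            hypotheses.append((a[i1:i2], b[j1:j2]))
    return hypotheses

print(hypothesize_constituents("she saw the big dog", "she saw a cat"))
# -> [(['the', 'big', 'dog'], ['a', 'cat'])]
```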

References

van Zaanen, M. (2002). Bootstrapping Structure into Language: Alignment-Based Learning. PhD thesis, University of Leeds, Leeds, UK.

van Zaanen, M., & van Noord, N. (2012). Model Merging versus Model Splitting Context-Free Grammar Induction. In J. Heinz, C. de la Higuera, & T. Oates (Eds.), Proceedings of the Eleventh International Conference on Grammatical Inference (pp. 224–236). Washington, DC, USA.


Named Entity Recognition and Resolution for Literary Studies
Jesse de Does (Institute for Dutch Lexicology), Maarten Marx (Informatics Institute, University of Amsterdam) and Karina van Dalen-Oskam (Huygens ING)

Room 3, 17.10-17.30

The project Namescape: Mapping the Landscape of Names in Modern Dutch Literature was funded by CLARIN-NL. The aim of the project was on the one hand to adapt existing named entity recognition software for modern Dutch fiction, and on the other hand to fine-tune named entity resolution by linking the names to Wikipedia entries. The background of the project is research into literary onomastics, the study of the usage and functions of proper names in literary (i.e. fictional) texts. For the named entity recognition, existing tools had to be trained on literary texts and the standard list of name categories had to be extended, since the analysis of the usage of proper names in literature needs to distinguish e.g. between first names and family names, which is not usually done in standard NER tools. The named entity resolution task was done to explore the possibilities of labelling the names in fiction in another way: it could help to categorize a name as referring to persons or locations that only ‘exist’ in the fiction (‘plot internal names’), or names referring to persons or locations in the ‘real’ world (‘plot external names’). This distinction is linked to the hypothesis that plot internal and plot external names can have different (stylistic and narratological) functions. Marking them up is the first step toward verifying that hypothesis. In the paper we will describe the results of these main tasks.


Activating Qualified Thesaurus Terms for Automatic Indexing with taxonomy-based WSD
Marius Doornenbal, Inga Kohlhof, Boris Kozlov (Elsevier)

Room 2, 14.20-14.40

All thesauri contain a number of descriptors consisting of the term proper plus a suffix in brackets meant to explain the intended interpretation of the term. Examples include:

  1. UV-B (ultraviolet radiation) (NALT)
  2. files (tools) (NASA)
  3. Vital Energy (Philosophy) (MeSH)

For automatic indexing, these terms (1%-5% of all thesaurus descriptors) are practically lost. Matching these terms with their qualifiers stripped off (the ‘bare’ terms) results in frequent wrong interpretations. Without decent constraints, ambiguity is rife.

We investigated to what extent these terms can be disambiguated during text annotation by looking for concepts in their textual environment which are ontologically related to the concepts they represent and/or to the qualifying concepts.

Targeting the agricultural domain, we created a set-up with the NAL thesaurus (NALT, agclass.nal.usda.gov) to test a set of 30 qualified NAL thesaurus terms of different types indexed in approximately 1500 Scopus (www.Scopus.com) scientific abstracts. Our framework is the Elsevier Fingerprint Engine, primarily used to compute concept vector representations of scientific abstracts, which provides a toolkit of NLP functionalities including a developing system of WSD techniques.

The tested technique is a knowledge-intensive but unsupervised WSD algorithm according to the classification of [1]. We compare its design and performance with prior approaches to taxonomy-based WSD (cf. [1], [2], [3]), and discuss lessons learnt and future directions.

References

  1. Roberto Navigli. Word sense disambiguation: a survey. ACM Computing Surveys, 41(2), ACM Press, 2009, 1-69.
  2. Roman Prokofyev, Gianluca Demartini, Alexey Boyarsky, Oleg Ruchayskiy, and Philippe Cudré-Mauroux. 2013. Ontology-Based word sense disambiguation for scientific literature. In Proceedings of the 35th European conference on Advances in Information Retrieval (ECIR'13), Moscow, Russia, March 2013.
  3. Antonio Jimeno Yepes and Alan R. Aronson. 2012. Knowledge-based and knowledge-lean methods combined in unsupervised word sense disambiguation. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium (IHI '12). ACM, New York, NY, USA, 733-736.

Analyzing reactive digital genres
Denise Eraßme, Bianka Trevisan & Eva Maria Jakobs (Textlinguistics and Technical Communication, Aachen, Germany)

Room 4, 16.30-16.50

From a linguistic perspective, the Web can be described as a collection of genre ecologies. Parts of them are “reactive” genres (e.g. the genre pair news article - news comment). The communicative purpose of a news article is to introduce a topic. The purpose of a news comment is to offer users the opportunity to react to the news article by commenting on it. In doing so, the news article serves as the starting point for (first-order) news comments or comment threads.

This article presents a methodological approach to processing large corpora with a focus on the topic relation between an article and its comments: do all comments react to the article topic, and how do they react (explicitly or implicitly; with a positive, negative, or neutral attitude)?

The approach is based on linguistics-driven Text Mining methods. It uses a linguistic multi-level annotation model (Trevisan et al. 2012) that was extended for our research questions. The extension covers categories for the annotation of topic, relation types between article topic and comment topic, how a comment reacts (explicitly vs. implicitly) and with which attitude (positive, negative, or neutral). The corpus contains 13 articles and 39 comments (7,548 tokens).

The results show that the approach works. It delivers valuable insights, e.g. that not all news comments react directly to the news article topic. Methodological challenges concern topic recognition and automation.

References

Trevisan, B., Neunerdt, M., & Jakobs, E.-M. (2012). A Multi-level Annotation Model for Fine-grained Opinion Detection in Blog Comments. In: KONVENS 2012, 179-188.


A Text Denormalization Algorithm Producing Training Data for Text Segmentation
Kilian Evang, Valerio Basile & Johan Bos (University of Groningen)

Poster session, 12.15-13.20

As a first step of processing, text often has to be split into sentences and tokens. We call this process segmentation. It is often desirable to replace rule-based segmentation tools with statistical ones that can learn from examples provided by human annotators who fix the machine’s mistakes. Such a statistical segmentation system is presented in Evang et al. (2013).

As training data, the system requires the original raw text as well as information about the boundaries between tokens and sentences within this raw text. Although raw as well as segmented versions of text corpora are available for many languages, this required information is often not trivial to obtain, because the segmented version differs from the raw one in other respects as well. For example, punctuation marks and diacritics have been normalized to canonical forms by human annotators or by rule-based segmentation and normalization tools. This is the case with e.g. the Penn Treebank, the Dutch Twente News Corpus and the Italian PAISA corpus. This problem of missing alignment between raw and segmented text is also noted by Dridan and Oepen (2012). We present a heuristic algorithm that recovers the alignment and thereby produces standoff annotations marking token and sentence boundaries in the raw text. The algorithm is based on the Levenshtein algorithm and is general in that it does not assume any language-specific normalization conventions. Examples from Dutch and Italian text are shown.
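
A simplified illustration of the kind of character-level alignment involved (not the authors' algorithm; difflib's matcher stands in for the Levenshtein-based alignment, and the example strings are invented):

```python
from difflib import SequenceMatcher

def align_tokens(raw, tokens):
    """Recover standoff (start, end) offsets of tokens in the raw text by
    aligning the raw string with the concatenation of the tokens."""
    segmented = "".join(tokens)
    # Map each character position in `segmented` back to a position in `raw`.
    mapping = {}
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, raw, segmented).get_opcodes():
        if tag == "equal":
            for k in range(j2 - j1):
                mapping[j1 + k] = i1 + k
    offsets, pos = [], 0
    for tok in tokens:
        start, end = mapping.get(pos), mapping.get(pos + len(tok) - 1)
        # Tokens whose characters were normalized away need extra heuristics.
        offsets.append((start, None if end is None else end + 1))
        pos += len(tok)
    return offsets

raw = "Dr. Smith arrived.  He sat down."
tokens = ["Dr.", "Smith", "arrived", ".", "He", "sat", "down", "."]
print(align_tokens(raw, tokens))
```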


BiographyNet: Provenance and text mining for historic research
Antske Fokkens, Niels Ockeloen, Serge Ter Braake, Piek Vossen, Guus Schreiber and Susan Legêne (VU University Amsterdam)

Room 3, 10.10-10.30

BiographyNet is a multidisciplinary project bringing together history, computational linguistics and the Semantic Web. The role of NLP in BiographyNet is to extract information from biographical dictionaries. The extracted information is represented in RDF triples. If historians want to use this information for their research, they need (1) to be able to go back to the original source, (2) to get an indication of the overall performance of the NLP pipeline and (3) to be able to check whether the NLP pipeline introduced biases. In other words, historians need insight into the full process of information extraction.

We use the PROV-DM (Moreau et al 2012) to model the provenance of extracted information. This allows us to represent the data at various stages, the processes applied to the data, and the programs and people responsible for creating both data and processes. We combine our provenance model with a representation from the P-PLAN ontology (Garijo and Gil, 2013). This allows us to model what is supposed to happen at each step. A model of what is supposed to happen, together with a model of what actually happened, can be useful in debugging and error analysis. The information they provide is useful for replicating results. As such, they also contribute to a clean experimental setup of our NLP pipeline. We will illustrate the idea through the provenance information of a basic supervised machine learning model.
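
As an illustration of what such provenance statements can look like in PROV, here is a small sketch using rdflib; the entity names and the example namespace are invented and do not reflect the project's actual schema.

```python
from rdflib import Graph, Namespace, Literal

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/biographynet/")  # hypothetical namespace

g = Graph()
g.bind("prov", PROV)

# An extracted statement, the source it was derived from, and the NLP run
# that generated it (identifiers are purely illustrative).
g.add((EX.statement_42, PROV.wasDerivedFrom, EX.biography_entry_7))
g.add((EX.statement_42, PROV.wasGeneratedBy, EX.ner_run_1))
g.add((EX.ner_run_1, PROV.wasAssociatedWith, EX.ner_module_v1))
g.add((EX.ner_run_1, PROV.endedAtTime, Literal("2014-01-15T10:00:00")))

print(g.serialize(format="turtle"))
```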


A standard for prioritised and dynamic hyphenation definitions
Sander van Geloven (Stichting OpenTaal)

Poster session, 12.15-13.20

This presentation describes a standard for hyphenation definitions enabling the generation of prioritised and dynamic hyphenation patterns.

In the early nineteen-eighties, automatic hyphenation of lexical items was made possible by a hyphenator using language-specific hyphenation patterns. These patterns are generated by the hyphenation software community from hyphenated word lists.

The initial design was based on the English orthography and limited character encoding. Support for extended encodings was added in the 1990s mostly for Western languages. However, the hyphenated word list format remained rather unchanged. This complicated the support of specific morphological or phonological structures, requiring hyphenation priority in compounds or dynamic hyphenation resulting in altered spelling.

Although over 70 languages are supported now, hyphenation is suboptimal, and it is impossible for languages that rely on a universal character encoding. This limited method of hyphenation has been catering to digital typesetting for over three decades. Unfortunately, recently implemented hyphenation in layout engines for web page rendering is built upon the same outdated technology.

An improved hyphenator and extended hyphenation patterns are necessary to overcome current limitations and support a wider range of languages. To achieve this, the software community needs a standard format for hyphenation definitions in universal human-readable hyphenated word lists. A context-free grammar was developed with unambiguous and fine-grained control allowing enhanced hyphenation. All language-specific cases are illustrated with examples and lexicological theory.

Our standard for hyphenation definitions enables improved automatic hyphenation for printed media and web documents.


Translation assistance - translating L1 fragments in an L2 context
Maarten van Gompel (Radboud University Nijmegen)

Room 3, 14.00-14.20

We present new research in translation assistance. We describe a system capable of translating native language (L1) fragments to foreign language (L2) fragments in an L2 context. Practical applications of this research can be framed in the context of second language learning.

We study the feasibility of exploiting cross-lingual context to obtain high-quality translation suggestions that improve over statistical language modelling and word-sense disambiguation baselines. A classification-based approach is presented that is indeed found to improve significantly over these baselines by making use of a contextual window spanning a small number of neighbouring words.


The efficacy of terminology-extraction software for the translation of documentaries
Sabien Hanoulle (University of Antwerp)

Room 2, 14.00-14.20

This research investigates whether the terminology used in documentaries is specific enough to be detected accurately by automatic terminology-extraction systems. Furthermore, it aims to determine whether the integration of the resulting term lists into the translation process reduces the translator’s workload, while contributing to a qualitatively appropriate end product.

The first part of the paper examines the results of two pilot tests aimed at checking the reliability of the automatic terminology-extraction systems. In pilot test 1, three annotators manually labelled the terminology, while pilot test 2 concerned the automatic terminology extraction by three systems: Trados MultiTerm Extract 2011, Similis and TExSIS, which is being developed at Ghent University. The outputs of both tests were compared using the F-score.
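
For reference, the F-score comparison boils down to precision and recall of the extracted terms against the manually labelled gold standard, as in this small sketch with invented term lists:

```python
def precision_recall_f1(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return precision, recall, f1

gold_terms = {"coral reef", "photosynthesis", "tectonic plate"}
system_terms = {"coral reef", "photosynthesis", "plate", "ocean"}
print(precision_recall_f1(system_terms, gold_terms))  # (0.5, 0.666..., 0.571...)
```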

The second part of the paper describes the set-up and results of a proof of concept experiment conducted with master students in translation at the University of Antwerp. For the experiment the students were divided into two groups and asked to translate a selection of documentary texts containing domain-specific terminology from English into Dutch. Both groups were first asked to translate without terminology support. Two months later, the first group translated the same texts with manually labelled terminology lists while the second group performed the task with the automatically extracted terminology lists. The translation process was monitored with Inputlog, a research tool for logging and analysing writing processes: www.inputlog.net. Both a quantitative and a qualitative analysis of the results will be presented, focusing on pause time before terms, process time, research behaviour and terminology errors.


TermWise: A Computer Assisted Translation Tool with Context-Sensitive Terminological Support
Kris Heylen (QLVL, Linguistics Department, KU Leuven), Stephen Bond (QLVL, Linguistics Department, KU Leuven), Dirk De Hertog (QLVL, Linguistics Department, KU Leuven), Ivan Vulić (LIIR, Department of Computer Science, KU Leuven) & Hendrik Kockaert (QLVL, Linguistics Department, KU Leuven / University of the Free State, Bloemfontein)

Poster session, 12.15-13.20

This poster with demo presents TermWise, a prototype for a Computer Assisted Translation (CAT) tool that offers additional terminological support for domain-specific translations. Compared to existing CAT-tools, TermWise has an extra database, a Term&Phrase Memory, that provides context-sensitive suggestions of translations for individual terms and domain-specific expressions. The Term&Phrase Memory has been compiled by applying newly developed statistical knowledge acquisition algorithms to large parallel corpora. Although the algorithms are language- and domain-independent, the tool was developed in a project with translators from the Belgian Federal Justice Department (FOD Justitie/SPF Justice) as end-user partners. Therefore the tool is demonstrated in a case study of bidirectional Dutch-French translation in the legal domain. On the poster, we first describe the specific needs that our end-user group expressed and how we translated them into the new Term&Phrase Memory functionality. Next, we summarize the term extraction and term alignment algorithms that were developed to compile the Term&Phrase Memory from large parallel corpora. In our case study we worked on the online available official Belgian Journal (Belgisch Staatsblad/Moniteur Belge). The poster then describes the server-client architecture that integrates the Term&Phrase Memory’s server database with a CAT user-interface to provide context-sensitive terminological support. In conclusion we also discuss the evaluation scheme that was set up with two end-user groups, viz. students of Translation Studies at KU Leuven (campus Antwerp) and the professional translators at the Belgian Federal Justice Department. The demo will show the use of the TermWise tool for the translation of Belgian legal documents from Dutch to French and vice versa.


Natural Language Processing for Internet Security: the AMiCA project
Véronique Hoste (LT3, Ghent University), Walter Daelemans (CLiPS, University of Antwerp), Guy De Pauw (University of Antwerp), Els Lefever (LT3, Ghent University), Bart Desmet (LT3, Ghent University), Sarah Schulz (LT3, Ghent University), Ben Verhoeven (CLiPS, University of Antwerp) & Cynthia Van Hee (LT3, Ghent University)

Poster session, 12.15-13.20

The AMiCA project, sponsored by the Flemish government, targets monitoring and maintaining the internet security of children and young people. In this project, research teams from the Universities of Antwerp, Ghent and Leuven cooperate in the development of multimodal detection of threatening situations in social networks by means of text and image analysis. We focus on three critical situations: cyberbullying, sexually transgressive behaviour, and depression and suicidal behaviour.

In this paper, we introduce the Natural Language Processing issues encountered in the text analysis part of the project. More specifically, we focus on social media language normalisation, frame-based detection of temporal and other aspects of the critical situations, adaptation of computational stylometry techniques to the goals of the project, and deep text analysis to provide useful features for detection.


Corpora and Features for Web Query Intent Classification
Max Ionov (Moscow State University, Russian Federation), Svetlana Toldova (National Research University Higher School of Economics, Russian Federation)

Poster session, 12.15-13.20

Web search queries can be differentiated by their underlying intent or user goal. According to Broder [Broder 2002] and Rose and Levinson [Rose and Levinson 2004], search queries can be divided into three types: informational, navigational or transactional (referred to as 'resource' by Rose and Levinson). Informational intent is “to acquire some information assumed to be present on one or more web pages”, navigational intent is “to reach a particular web site” and transactional intent is “to perform some web-mediated activity” (as defined in [Broder 2002]).

Studies have shown that system performance can be significantly increased when query intent is taken into account. The aim of the presented research is to suggest and test two methods for query corpus construction and annotation, and to investigate the contribution of various features to the quality of classification.

Our work consists of three parts: 1) building corpora of web queries annotated by their intent, 2) analyzing structural and lexical features of queries with different intents and 3) classifying queries by their intents using several ML techniques.

It was shown in [Lewandowski 2013] that annotating queries by their intent by hand leads to a low level of inter-annotator agreement, so two methods for acquiring corpora are proposed to overcome this problem.

For the second task, we compare different types of features (lexical, morphological and simple syntactic features), analyzing their impact on the overall result.

Using two ML techniques for classification, decision trees and Naive Bayes, we show that classifiers trained on our corpora with the proposed features achieve state-of-the-art results.
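
A toy version of the classification set-up (invented queries and labels; the actual corpora, features and tuning are those described above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Invented labelled queries: informational / navigational / transactional.
queries = ["how does a solar panel work", "facebook login page",
           "download free antivirus", "history of the roman empire",
           "youtube homepage", "buy cheap flight tickets"]
intents = ["informational", "navigational", "transactional",
           "informational", "navigational", "transactional"]

for clf in (MultinomialNB(), DecisionTreeClassifier(random_state=0)):
    model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), clf)
    model.fit(queries, intents)
    print(type(clf).__name__, model.predict(["open gmail", "what is a treebank"]))
```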


Mining the Twentieth Century's History from the TIME Magazine Corpus
Folgert Karsdorp (Meertens Institute) and Mike Kestemont (University of Antwerp)

Room 1, 9.30-9.50

In this paper we report on quantitative research conducted on the complete archive of "TIME Magazine", containing over 260,000 articles. This well-known American weekly news magazine has had a continuous publication history since 1923, making this collection an exceptionally rich and balanced textual resource for the study of the history of the twentieth century. Because of the sheer size of this so-called "Big Data", we must resort to automated, computational analyses. We apply state-of-the-art techniques from language technology and text mining, for instance from the recent "deep learning" movement. Among researchers, there is widespread acceptance that cultural evolution is somehow reflected in language use; yet, there exists no standard methodology to study such phenomena beyond the naive plotting of individual word frequencies through time. In this paper we attempt to move to more advanced analysis techniques for the computational study of history based on a large, diachronic textual corpus. Although TIME's archive naturally offers a strongly America-centric view on history, we will demonstrate how large-scale events such as World Wars I and II, the Moon Landing, or the rise of the Internet have found an interesting and complex reflection in the evolution of TIME's vocabulary. Of particular interest to us is the notorious "TIME100", a highly mediatized list of the most influential people in the world which the magazine brings out yearly. Moreover, in 2003, TIME published such a list for the entire twentieth century, singling out the well-known theoretical physicist Albert Einstein as the single most influential “Person of the Century”. In our research we have paid special attention to the intriguing interplay between this list of influential personalities and the manner in which they are discussed in the magazine's own archive.


Time-Aware Chi-squared for Document Filtering over Time
Tom Kenter (University of Amsterdam), David Graus (University of Amsterdam), Edgar Meij (Yahoo! Research, Spain) and Maarten de Rijke (University of Amsterdam)

Room 4, 16.10-16.30

Traditional ad hoc information retrieval systems focus on producing a ranked list of documents relevant to a user’s query. Users might, however, persist in being interested in a concept and might want to track it over time. In this paper we propose a method for filtering a stream of documents for the ones relevant to a certain topic. The task is modeled as a multi-class classification task. Topics may evolve over time, making a classical approach of training a classifier on a set of examples and keeping it fixed at testing time unsuitable. We present a topic filtering algorithm that is capable of adapting to drift in topics by periodically selecting features for a multinomial Naive Bayes classifier. Two versions of Pearson’s χ2 test are proposed for feature selection. Both incorporate the notion of time, which is not considered in the original formula. This is achieved by incorporating a time-dependent function in the χ2 equations, which provides an elegant method for applying different weighting and windowing schemes. Experiments on the TREC KBA 2012 data set show improvements of our approach over a non-adaptive baseline, in a realistic setting with limited amounts of training data.
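
The following sketch shows one way a time-dependent weighting can be folded into a χ2 term-topic association score; the exponential decay and the toy documents are illustrative only and differ from the two formulations proposed in the paper.

```python
import math

def time_weighted_chi2(docs, term, topic, now, half_life=7.0):
    """Chi-squared association between `term` and `topic`, with every document
    weighted by recency (exponential decay with the given half-life in days).
    `docs` is a list of (timestamp_in_days, set_of_terms, topic_label)."""
    # Weighted 2x2 contingency table: term absent/present x other/topic.
    table = [[0.0, 0.0], [0.0, 0.0]]
    for t, terms, label in docs:
        weight = math.exp(-math.log(2) * (now - t) / half_life)
        table[term in terms][label == topic] += weight
    total = sum(sum(row) for row in table)
    chi2 = 0.0
    for i in (0, 1):
        for j in (0, 1):
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / total
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

docs = [(0, {"reactor", "fukushima"}, "nuclear"), (1, {"reactor"}, "nuclear"),
        (9, {"election"}, "politics"), (10, {"reactor", "election"}, "politics")]
print(time_weighted_chi2(docs, "reactor", "nuclear", now=10))
```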


Are They True to Their Style? Determining Characteristic Elements in Authorship and Evaluating Their Felicity
Carmen Klaussner (Trinity College Dublin, Ireland), John Nerbonne (University of Groningen) & Çağrı Çöltekin (University of Groningen)

Poster session, 12.15-13.20

Detection of stylistic elements in authorship studies is hampered by the lack of a gold standard that would otherwise enable us to clearly evaluate our findings. In the absence thereof, one generally resorts to choosing items for which an author shows a characteristic usage compared to other writers. In this line of work, we present a measure for determining characteristic elements of an author, along with methods for evaluating those elements.

In order to select an author's consistent features, we propose the measure of Representativeness & Distinctiveness, which seeks to identify those elements that are both representative of an author over a given set of his texts and distinctive with respect to an opposing author's sample. The method thus bears similarities to both Burrows' Delta and Zeta in favouring consistent terms that are irregular in the opposing author's set.

For evaluation, we test the separation ability of the selected features by clustering the two authors' documents followed by computing the Adjusted Rand Index given the ideal clustering result. Further, we measure the consistency of the author profiles over different document subsets with a high degree of overlap being indicative of the method's ability to detect stylistic elements.
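
The evaluation-by-clustering step can be sketched as follows; the tiny invented documents stand in for the Dickens and Collins texts represented by the selected features.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Invented snippets for two "authors"; the real experiments cluster document
# subsets represented by the selected characteristic terms.
docs = ["the fog rolled over the dim river bank",
        "the dim courtyard lay silent in the fog",
        "her letter revealed the secret of the diamond",
        "the secret letter was hidden with the diamond"]
true_author = [0, 0, 1, 1]

features = TfidfVectorizer().fit_transform(docs)
predicted = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(adjusted_rand_score(true_author, predicted))  # 1.0 = perfect separation
```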

Applying both of the above criteria to the results of a comparison between Charles Dickens and Wilkie Collins showed both a fair agreement over different profiles and a high degree of separation ability in clustering.


Social, geographical, and lexical influences on dialect pronunciations in Dutch
Vinnie Ko (Utrecht University), Ernst Wit (University of Groningen), Wim Krijnen (University of Groningen), John Nerbonne (University of Groningen) and Martijn Wieling (University of Groningen)

Room 4, 9.30-9.50

Wieling, Nerbonne, et al. (2011) [1] combined generalized additive modeling (GAM) with mixed-effects models to find social, lexical, and geographical effects on the linguistic variation of different Dutch pronunciations. The conclusion of their study was that the phonetic distance from standard Dutch is greater if the location has a smaller population, the location has a higher average age, the word has a higher frequency of use, and the word has relatively many vowels. When they performed this quantitative study in 2011, they could not analyze the dataset in one go in the programming language R because of its size. What they did instead was first apply GAM and then fit mixed models to the GAM results. In June 2013, a new method called BAM became available in R, which made it possible to analyze the dataset in one go. In addition, we made some modifications to Wieling's model. The biggest adjustment is that we created a separate geographical map for verbs. The fREML score [2] showed that this modified model explains the data better. Figure 2 shows these two geographical distributions. It is interesting that although Frisian pronunciations are in general the least like standard Dutch, the pronunciations of verbs are most unlike standard Dutch in Drenthe, where a Dutch dialect is spoken. Further, the new model produced some new findings that differ from general linguistic beliefs. In Figure 1, as the vowel/consonant ratio of a word gets higher, the phonetic distance from standard Dutch gets greater, but if a word contains extremely few vowels, the opposite holds. The age of the speaker shows a very generation-specific and nonlinear curve.

Original thesis available at: scripties.fwn.eldoc.ub.rug.nl


Figure 1: Effect of independent variables on the dependent variable. Pho_Dist stands for phonetic distance from the standard Dutch pronunciation. Dotted lines indicate 95% confidence interval.



Figure 2: Left: geographical distribution for verbs; right: geographical distribution for other word categories. The darker the colour, the smaller the value of Pho_Dist, which is the dependent variable of the model. Pho_Dist stands for phonetic distance from the standard Dutch pronunciation.

References

  1. Martijn Wieling, John Nerbonne, and R. Harald Baayen. Quantitative social dialectology: Explaining linguistic variation geographically and socially. PLoS ONE, 6(9):e23613, 09 2011.
  2. S.N. Wood. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(1):3-36, January 2011.

Application transfer between close languages using HLT-methods
Oele Koornwinder (GridLine)

Room 2, 17.10-17.30

In this presentation, we will inform you about our NTU/DAC-funded HLT demonstration project "Automatic writing support for South African languages: examples of a demonstrator and resources based on proven Dutch HLTs". For this project, main applicant GridLine has formed an international project consortium with CTexT (North-West University, SA), SASNEV (a Dutch/Afrikaans-oriented NGO for educational projects) and prof. Willy Martin (VU Amsterdam/Unieboek, chief editor of ANNA).

Our aim is to build a number of HLT demonstrators for South Africa, departing from existing Dutch HLT applications of GridLine and the bilingual dictionary ANNA. In doing so, we will demonstrate HLT-based language conversion and extraction methods for the language pair Dutch-Afrikaans and some Bantu languages. Apart from knowledge development, educational applications and dissemination of best practices, we will investigate the South African market for these products.

Our main deliverable is a Klinkende Taal demonstrator for Afrikaans (KTA), providing online writing support for officials, teachers and students. This demonstrator will be extended with a special module for word-choice improvement. We will also produce a QuickScan report for a subdomain of the government site. Further, we will realize a number of reusable term lists for Afrikaans, Setswana, Siswati and Xitsonga by applying GridLine's term extractor to the government site.

The demonstrator will be based on the dictionary ANNA, extracting reusable wordlists with cognates, near-cognates and non-cognates for each word-meaning. These wordlists will be used to enrich the existing language converter of CTexT, making it much more reliable. We will evaluate the translation system by using it for some planned conversion tasks with human supervision.


Opinion lexicon organisation in a rule-based sentiment analysis system
Sergey Kulikov (Institute of Linguistics, Russian Academy of Sciences, Russian Federation)

Poster session, 12.15-13.20

Sentiment analysis has long focused on the opinion lexicon, including its POS characteristics, compositionality and strength. The majority of these studies [cf. Liu, 2012] concern word classes rather than their place within an NLP system. In this paper we discuss where each type of opinion word should be placed in a rule-based, general-domain sentiment analysis system for Russian.

Studies of Russian within the Meaning-Text approach [e.g. Mel’chuk, 1974] have revealed that the compositional properties between verbs and noun phrases, verbs and adverbs, and nouns and adjectives are of the same type. However, the frequencies of some of these groups differ considerably from others. For a rule-based system this means that the number of rules dealing with one group is not the same as for another.

Some compositional types, such as polarity intensifiers, are more common than polarity diminishers. Polarity shifters, including negation words, are the least common of all the opinion word types. These word groups, i.e. shifters and diminishers, roughly correspond to function words; they are known to form a closed word class and should be hand-coded into the rules.

This differentiation between polarity words coded into the rules and those kept as separate word lists or other dictionary organization types helps to minimize the number of compositional rules and makes it possible to maintain fewer expandable dictionaries.
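
A toy illustration (in English, with invented scores) of how an opinion lexicon can be combined with small closed classes of shifters, intensifiers and diminishers through compositional rules; the actual system targets Russian and is far more elaborate.

```python
# Toy opinion lexicon and closed-class modifier lists (invented values).
POLARITY = {"good": 1.0, "helpful": 1.0, "bad": -1.0, "awful": -2.0}
INTENSIFIERS = {"very": 2.0, "extremely": 3.0}    # strengthen polarity
DIMINISHERS = {"slightly": 0.5, "somewhat": 0.7}  # weaken polarity
SHIFTERS = {"not", "never"}                       # flip polarity

def phrase_polarity(tokens):
    """Compose the polarity of a token sequence with simple left-to-right rules."""
    score, scale, flip = 0.0, 1.0, 1.0
    for tok in tokens:
        if tok in SHIFTERS:
            flip *= -1.0
        elif tok in INTENSIFIERS:
            scale *= INTENSIFIERS[tok]
        elif tok in DIMINISHERS:
            scale *= DIMINISHERS[tok]
        elif tok in POLARITY:
            score += flip * scale * POLARITY[tok]
            scale, flip = 1.0, 1.0  # reset modifiers after each opinion word
    return score

print(phrase_polarity("not very helpful".split()))  # -2.0
print(phrase_polarity("slightly bad".split()))      # -0.5
```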


Leveraging anticipatory tweets to dynamically build a social event calendar
Florian Kunneman and Antal van den Bosch (Radboud University Nijmegen)

Room 1, 17.10-17.30

We aim to continuously update a calendar of social events, such as openings, parties, demonstrations, and strikes, providing an overview of future events for target groups such as tourists and journalists. While a lot of these events can be found on dedicated web pages, we argue that event information on Twitter forms a generic and strong basis for this task. Twitter is a source in which a lot of events are mentioned in a highly dynamic fashion; though implicitly, it also harbours interesting information such as sentiment, opinions, and quantifiable 'trending topics'.

In order to extract event information from the Twitter stream, tweets referring to the same event are filtered, after which information about the event is extracted from these tweets. In this research, we skip the first step and experiment with methods to extract specifically the start time of an event, based on a near-optimal input of tweets (almost exclusively referring to the same event) over time. The most favourable method is the one that determines the start date of events most accurately and as early as possible.

We compare different approaches and their combinations. The first approach is to focus on temporal expressions in tweets and use these to deduce the date of an event in a rule-based fashion. The second approach is to look at the frequency of tweets over time, signalling significant increases in tweet frequency. The third and last approach is to apply machine learning to map the language in tweets to a time-to-event, which is then translated to a calendar date.
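
As a sketch of the second, frequency-based approach (thresholds and dates invented), one can flag the first day on which the tweet frequency rises well above the running average:

```python
from collections import Counter
from datetime import date

def estimate_start_date(tweet_dates, factor=3.0):
    """Return the first day whose tweet count is at least `factor` times the
    running daily average up to that day (a crude burst detector)."""
    counts = Counter(tweet_dates)
    days = sorted(counts)
    running_total = 0
    for i, day in enumerate(days):
        if i and counts[day] >= factor * (running_total / i):
            return day
        running_total += counts[day]
    return None

dates = [date(2014, 3, 1)] * 2 + [date(2014, 3, 2)] * 3 + [date(2014, 3, 5)] * 40
print(estimate_start_date(dates))  # 2014-03-05
```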


Evaluation of Automatic Hypernym Extraction from Technical Corpora in English and Dutch
Els Lefever (LT3, Ghent University), Marjan Van de Kauter (LT3, Ghent University) & Veronique Hoste (LT3, Ghent University)

Poster session, 12.15-13.20

In this research, we evaluate different approaches for the automatic extraction of hypernym relations from English and Dutch technical text. The starting point for the hypernym detection system is the set of domain-specific terms and named entities contained in the technical text at hand. The extracted hypernym relations should enable us to semantically structure automatically obtained term lists from domain- and user-specific data.

We investigated three different hypernym extraction approaches for Dutch and English: a lexico-syntactic pattern-based approach, a distributional model and a morpho-syntactic method. To test the performance of the different approaches on domain-specific data, we collected and manually annotated English and Dutch data from two technical domains, viz. the dredging and financial domains. The experimental results show that the morpho-syntactic approach in particular obtains good results for the automatic extraction of hypernyms from technical and domain-specific text collections.
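
A lexico-syntactic pattern-based approach of the kind compared here is commonly implemented with Hearst-style patterns. The minimal sketch below uses two toy English patterns and a made-up example sentence; the actual patterns and preprocessing used in this work are not described in the abstract and are not reproduced here.

    import re

    # Two Hearst-style patterns (English); each match yields a (hyponym, hypernym) pair.
    PATTERNS = [
        (re.compile(r"(\w[\w ]*?) such as (\w[\w ]*)"), "reversed"),            # "X such as Y"
        (re.compile(r"(\w[\w ]*?) is a (?:kind|type) of (\w[\w ]*)"), "in_order"),
    ]

    def extract_hypernyms(sentence):
        pairs = []
        for pattern, order in PATTERNS:
            for m in pattern.finditer(sentence.lower()):
                a, b = m.group(1).strip(), m.group(2).strip()
                pairs.append((b, a) if order == "reversed" else (a, b))
        return pairs

    print(extract_hypernyms("Dredging equipment such as a cutter suction dredger ..."))
    # [('a cutter suction dredger', 'dredging equipment')]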


Parsing with the Recovery of the Empty Nodes
Dongchen Li (Peking University, China)

Poster session, 12.15-13.20

Parsing natural language is difficult because several sources of ambiguity are mixed together. These sources lie at different levels. Low-level ambiguities include ellipsis, interjections, polysemous words, and so on; high-level ambiguities include coordination ambiguity, bracketing ambiguity, and so on. Ellipsis is quite common in natural language and makes the other ambiguities even more complex. On the one hand, recovering ellipses would make the analysis of the other ambiguities easier; on the other hand, resolving the other ambiguities would make the recovery of ellipses easier. We propose a method to train a grammar with ellipses, and to parse with an integration of ellipsis recovery and the resolution of the other ambiguities. Ellipses are labeled as empty nodes in the Penn Treebank and the Penn Chinese Treebank. We devise a representation that preserves the information of empty nodes in the adjacent nodes, and integrate the recovery of empty nodes with parsing. Experiments on Penn Treebank 2.0 and Penn Chinese Treebank 5.0 show that the recovery of empty nodes improves parsing performance significantly.


Automatic generation of sentiment lexicons in five languages
Isa Maks, Ruben Izquierdo and Piek Vossen (VU University Amsterdam)

Room 1, 9.30-9.50

Many approaches to opinion mining rely on lexicons of words that are used to express negative and positive sentiments. As manual lexicon building is a time-consuming task, most of these lexicons are built automatically.

In this presentation we focus on the automatic creation of general-purpose sentiment lexicons in five languages, i.e. French, Italian, Dutch, English and Spanish. We apply WordNet propagation, a commonly used method to generate these lexicons, as it gives high coverage of general-purpose language. Moreover, the semantically rich WordNet seems well suited to the identification of positive and negative words. Although many variants of the propagation method have been developed for English, little is known about how they perform with WordNets of other languages.

WordNets may differ, however, in various ways, such as the way they are compiled and their number of synsets, synonyms, and semantic relations. We investigated whether this variability translates into differences in performance by applying the propagation algorithm to different WordNets. We implemented a propagation algorithm, created seed lists for each language and evaluated the results of the propagation against gold standards. Both the seed lists and the gold standards were developed following common methods in order to achieve as little variance as possible between the different languages.
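
As a rough illustration of what one propagation variant looks like, the sketch below spreads positive and negative labels from small seed sets over a few adjective relations in NLTK's English WordNet. The choice of seeds, relations and number of iterations is an assumption made for illustration and does not correspond to the exact algorithm evaluated in this work.

    # Minimal sketch of WordNet polarity propagation: start from small positive and
    # negative seed sets and spread labels over a few synset relations.
    # Requires the NLTK WordNet data to be installed.
    from nltk.corpus import wordnet as wn

    def neighbours(synset):
        near = synset.similar_tos() + synset.also_sees()
        opposite = [ant.synset() for lem in synset.lemmas() for ant in lem.antonyms()]
        return near, opposite

    def propagate(pos_seeds, neg_seeds, iterations=2):
        pos = {s for w in pos_seeds for s in wn.synsets(w, pos=wn.ADJ)}
        neg = {s for w in neg_seeds for s in wn.synsets(w, pos=wn.ADJ)}
        for _ in range(iterations):
            new_pos, new_neg = set(pos), set(neg)
            for s in pos:
                near, opposite = neighbours(s)
                new_pos.update(near); new_neg.update(opposite)
            for s in neg:
                near, opposite = neighbours(s)
                new_neg.update(near); new_pos.update(opposite)
            pos, neg = new_pos - new_neg, new_neg - new_pos   # drop conflicting labels
        return pos, neg

    pos, neg = propagate(["good", "happy"], ["bad", "sad"])
    print(len(pos), "positive and", len(neg), "negative synsets")
    print(sorted({l.name() for s in pos for l in s.lemmas()})[:10])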

This study is carried out as part of the EU-funded project OpeNER, which aims at making available open-source, ready-to-use and easy-to-adapt tools, lexicons and methods for opinion mining and named entity recognition.


Computational modeling of second language acquisition: An integrative usage-based account
Yevgen Matusevych, Afra Alishahi and Ad Backus (Tilburg University)

Room 4, 10.10-10.30

Substantial variability in second language learners (e.g., their mother tongue, age, learning history) makes it difficult for empirical methods to generalize any results over the whole population. Corpus methodologies only partially solve the problem, since the number of available corpora is also limited. We apply computational cognitive modeling, which allows us to handle large data and to manipulate certain variables while controlling for others. Our probabilistic computational model is informed by usage-based theories of language, in particular construction grammar, which claims that people learn associations between form and meaning not only at the level of words, but also at the level of abstract constructions. These develop gradually in learners' language, and the exact repertoire is shaped by learners' linguistic input. By carefully designing the input and analyzing the output of a computational learner, we can conclude which processes guide second language acquisition (SLA). Testing the model on a small manually annotated dataset showed that it replicated general SLA developmental patterns and two specific effects found in empirical studies – construction priming and the facilitatory effect of skewed input frequencies. Currently we are extracting larger amounts of data for testing more general hypotheses. We aim to provide an integrative account of SLA/bilingualism that describes both the sources of second language knowledge and its development over time.


Detecting stances in Dutch ideological debates on Twitter, a case study of the Black Pete debate
Lars Meijers (VU University), Isa Maks (VU University), Mark Hoogendoorn (VU University) and Peter Kampstra (RTreporter)

Room 2, 16.50-17.10

Twitter is an ever-growing medium through which millions of people every day voice their thoughts, passions, experiences and opinions. Debates regarding current and often populist topics are therefore very common on Twitter. This work focuses on differentiating and classifying the different stances taken within these sociopolitical debates.

As a case study we used the recent discussion regarding the phenomenon of Black Pete, a black man who helps the Dutch Santa Claus deliver his presents. This debate revolves around the question whether Black Pete is an innocent cultural tradition or a racist and degrading symbol portraying black people as slaves. A challenge in this debate is that a negative opinion on the subject could indicate either a pro stance, i.e. keep the tradition, or a con stance, i.e. Black Pete is racist.

For our analysis, we created a manually annotated corpus of tweets regarding the Black Pete debate. From this corpus a debate lexicon was built. Additionally, an NLP pipeline was used to tokenize and POS-tag the tweets. This resulted in a number of interesting observations and features, with which a supervised system employing sentiment analysis and debate-specific opinion features was built.

We applied this method at RTreporter, a company that specializes in early detection of newsworthy events on Twitter. Classification of the different stances within debates on Twitter can help journalists when gathering data from which polls, viewpoints and important arguments can be extracted.


Data-powered Domain-specific Translation Services on Demand
Bart Mellebeek, Miriam Huijser and Khalil Sima'An (University of Amsterdam (ILLC))

Room 2, 16.30-16.50

We present ongoing research in DatAptor, an STW-funded research project between the ILLC (UvA) and a number of industrial partners (European Commission, Intel, Symantec, TAUS). The main goal of the project is to search for novel solutions to the Domain Adaptation problem for Statistical Machine Translation (SMT).

Over the past years, research in SMT has rapidly advanced to the point of providing users with workable automatic translation solutions. This success is partly due to the ever growing amount of (parallel) training data, but this increase in available data now poses new challenges as improvements stagnate and it has become clear that blindly training systems on concatenated data sets does not always lead to better results. Unlike traditional Domain Adaptation techniques for SMT, we blur the line between in-domain and out-of-domain texts and investigate novel ways to automatically adapt model parameters to incoming texts, depending on their location in a map of (bilingual) training documents.

In this work, we will provide details of ongoing research into two branches of the project, namely:

  1. How can we create a document map in which training data is clustered based on bilingual surface or semantic representations, while maintaining an efficient search capacity?
  2. When considering a single massive translation repository, how can we automatically update model parameters, without costly retraining, depending on the similarity of the input text to the training data?

Using part-of-speech patterns to automatically construct adjectival scales
Emiel van Miltenburg (Universiteit Utrecht)

Poster session, 12.15-13.20

Lobanova (2012) shows how it is possible to automatically extract pairs of opposites from a corpus using part-of-speech patterns. We propose an extension of her work to the domain of adjectival scales: sets of expressions ordered by their expressive strength along a certain semantic dimension. As an example, observe the (simplified) goodness scale: <decent, good, excellent>. Such scales can be constructed from a corpus in three steps:

  1. Find patterns in which scalemates (items that occur on the same scale) are likely to occur.
  2. Use those patterns to find new pairs of scalemates.
  3. Take those pairs and combine them into lexical scales.

The resulting collection of scales is useful for information mining (having this information allows us to compare statements containing scalar terms based on the relative strength of the terms used) and lexicography (thesaurus data might be improved by showing how near-synonyms relate to each other).

Of course there are some challenges in implementing the algorithm sketched above. For example, to implement step (3), we need to automatically deduce from the corpus how the items should be ordered on their respective scales. We propose a method using ordered seed pairs to find out which patterns typically match <low,high> pairs and which match the reverse. We can then compute the likely ordering for each novel pair of scalemates, as sketched below.
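
A minimal sketch of this ordering step: directional patterns are instantiated with a candidate pair in both orders and counted over a corpus, and the order that matches more often is taken as the <low, high> order. The patterns, toy corpus and pairs below are assumptions made for illustration, not the ones used in this work.

    import re

    # Toy sketch of step (3): directional patterns such as "X, or even Y" tend to
    # put the weaker term first; counting which order matches more often suggests
    # the ordering of a pair on its scale.
    DIRECTIONAL = [r"{low}, or even {high}", r"{low}, if not {high}", r"not just {low} but {high}"]

    corpus = [
        "the food was good, or even excellent",
        "a decent, if not good, attempt",
        "not just warm but hot outside",
    ]

    def order_pair(a, b, corpus):
        """Return the pair in <low, high> order according to pattern matches."""
        a_first = sum(bool(re.search(p.format(low=a, high=b), s)) for p in DIRECTIONAL for s in corpus)
        b_first = sum(bool(re.search(p.format(low=b, high=a), s)) for p in DIRECTIONAL for s in corpus)
        return (a, b) if a_first >= b_first else (b, a)

    print(order_pair("excellent", "good", corpus))  # ('good', 'excellent')
    print(order_pair("warm", "hot", corpus))        # ('warm', 'hot')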

We will discuss our implementation of the full algorithm, and its results with the UMBC WebBase corpus (Han et al. 2013).


A Neural Probabilistic Model of Script Knowledge
Ashutosh Modi (Saarland University) and Ivan Titov (University of Amsterdam)

Room 2, 10.10-10.30

Induction of script knowledge (common-sense knowledge about prototypical sequences of events) from text has recently received much attention. Previous approaches (e.g., Chambers and Jurafsky (2008); Regneri et al. (2010)) represent this knowledge in the form of graphs (e.g., chains or more general directed acyclic graphs). These graphs can then be used to inform NLP applications (e.g., question answering) by providing information on whether one event is likely to precede or succeed another, or by predicting an implied event given a partial description of a situation. Instead of constructing a graph and using it to provide information to applications, in this work we advocate constructing a statistical model which is capable of 'answering' at least some of the questions these graphs can be used to answer, but without explicitly representing the knowledge as a graph. In our method, distributed representations of event realizations are computed based on distributed representations of predicates and their arguments, and these representations are then used to predict prototypical event orderings. The parameters of the compositional process and the ranking component of the model are jointly estimated from unlabeled texts. We show that this approach results in a substantial boost in performance on the event ordering task with respect to the previous approaches, both on natural (Gigaword) and crowdsourced texts.


First-order linear logic as a general framework for logic-based computational linguistics
Richard Moot (LaBRI (CNRS) and Bordeaux University)

Room 4, 14.20-14.40

Since its introduction in 1958, the Lambek calculus has served as an inspiration for using logic to give a simple and elegant treatment of natural language. However, the problems with the Lambek calculus are well known: Lambek grammars generate only the context-free languages, and, though the syntax-semantics interface of the Lambek calculus is one of its appealing properties, the string-meaning relations which can be defined using the Lambek calculus are too limited.

A large number of extensions to the Lambek calculus have been proposed to remedy these problems. In this talk, I will focus on the tuple-based extensions - which include the Displacement calculus, lambda-grammars, abstract categorial grammars and hybrid type-logical grammars - and I will show that all of these systems can be seen as fragments of first-order linear logic.

This embedding result has several important consequences: first, it allows us to compare the analyses proposed in different frameworks, to see where they agree but also to clarify their formal limitations. Second, it greatly simplifies the proof theory for each of the systems, allowing us to give simple, alternative proofs for many known results. But most importantly, it also allows us to immediately propose new and improved parsing algorithms for each of these frameworks.


Using distributional thesauri for sorting synonyms
François Morlane-Hondère (CLLE-ERSS, France)

Poster session, 12.15-13.20

Dictionaries of synonyms are valuable resources for helping people to avoid repetition or to find the right word to use in a sentence. A dictionary with a high number of synonyms per word can be useful to find a substitute word in a large variety of contexts, but this profusion may make it difficult to use. In this study, we show that an automatically generated distributional thesaurus can be used to improve the sorting of the synonyms contained in a French online dictionary of synonyms, the Dictionnaire Électronique des Synonymes (DES).

Distributional analysis puts together words that share the same contexts in a given corpus: car and motorcycle are objects of drive, hit, etc. This method has been widely used to automatically extract semantically related words, particularly synonyms. Since distributional thesauri reflect the way words are used in texts, we assume that adding corpus-based distributional information to a static dictionary is a good way to adapt the results to users' needs.

Thus, we compare the DES with two distributional thesauri automatically generated – respectively – from corpora of Wikipedia articles and Le Monde newspaper in French. The results show that the nature of the synonyms extracted for a given word varies between the two thesauri. We assume a correlation between the distributional similarity score of two synonyms and their relevance regarding users’ needs: we predict that a journalist will use the DES more efficiently if the synonyms are ranked according to their distributional similarity score in a thesaurus generated from a journalistic corpus.
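
The proposed re-ranking can be pictured with the toy sketch below: the synonyms the dictionary lists for a headword are sorted by their cosine similarity to that headword in a corpus-derived vector space. The vectors and words are invented for illustration; the actual DES entries and thesaurus scores are not reproduced here.

    import numpy as np

    # Sketch of the re-ranking: dictionary synonyms for a headword are sorted by
    # their cosine similarity to that headword in a distributional space.
    vectors = {
        "voiture": np.array([0.9, 0.1, 0.2]),
        "automobile": np.array([0.85, 0.15, 0.25]),
        "bagnole": np.array([0.7, 0.4, 0.1]),
        "véhicule": np.array([0.5, 0.5, 0.5]),
    }

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def rank_synonyms(headword, dictionary_synonyms):
        scored = [(syn, cosine(vectors[headword], vectors[syn]))
                  for syn in dictionary_synonyms if syn in vectors]
        return sorted(scored, key=lambda x: x[1], reverse=True)

    print(rank_synonyms("voiture", ["véhicule", "automobile", "bagnole"]))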


Authorship of Two Religious Medieval Dutch Texts. Vanden Tempel Onser Sielen and Die Evangelische Peerle
Renske van Nie (Universiteit Antwerpen)

Room 5, 13.20-13.40

In 1543, the Middle Dutch mystical treatise Temple of Our Souls was published. To this day, the author remains unknown, although the frontispiece of the book states that the text was written by the same female author who wrote another religious text, the immensely popular Evangelical Pearl (published 1535 – 1542). Although the editor of the Pearl shares some details about the writer, her name remains unmentioned. In the last century, there have been several attempts to identify the author, but these have not been successful. The Temple and the Pearl show certain similarities regarding subject and content, but because of strong stylistic differences, scholars are still left with the question whether the texts were indeed written by one and the same author or by several authors sharing the same background. This problem is one of the main questions of this research, which we hope to answer with both traditional literary methodology and stylometric analysis. In my presentation I will show and elaborate on the first stylometric experiments carried out with the stylo R package developed by the Computational Stylistics Group, as well as address some of the problems encountered during my analyses. Finally, I will discuss my plans for future research.

Shooting ducks in a pond: hunting for patterns with BlackLab
Jan Niestadt and Jesse de Does (Instituut voor Nederlandse Lexicologie)

Room 5, 13.40-14.00

BlackLab is an open source corpus search engine that gives you full, fast and easy access to large annotated text corpora. You can search for complex patterns, determine word frequencies in a subset of documents, group hits by many different properties, iterate through all tokens in a document, or even retrieve the original XML.

The main design goals for BlackLab were ease of use, flexibility and speed. It supports several input formats and several ways of querying. You can use it through the Java API or through its own webservice, allowing you to use any programming language you like (including R, Perl and Javascript, for example).

BlackLab is available now on GitHub. It is being actively developed and used by INL and others in several high-profile projects (IMPACT, CLARIN-NL, Nederlab, OpenSonar). We welcome your contributions and suggestions for the future direction of the project. One feature we're thinking about is distributed search, so you can search bigger datasets faster.

Of course, the best way to convince you that BlackLab is worth checking out is to show how easy it is to start searching and analyzing your corpus. So, in addition to showing you our search interface (built with BlackLab), we’ll do some live BlackLab scripting. Don't worry, no waterfowl will be harmed.


T-Scan: analyzing text complexity for Dutch
Henk Pander Maat (Universiteit Utrecht), Rogier Kraf (Universiteit Utrecht), Antal van den Bosch (Radboud Universiteit), Maarten van Gompel (Radboud Universiteit), Suzanne Kleijn (Universiteit Utrecht), Ted Sanders (Universiteit Utrecht) and Ko van der Sloot (Tilburg University)

Room 4, 17.10-17.30

The automatic analysis of text complexity is a growing field of interest. A number of applications, both commercial and freely available, have been developed to determine the European Framework ‘language level’ of Dutch texts. It turns out that these applications achieve a low level of agreement (Kraf et al. 2011). This is unsurprising, since the reading comprehension data on which to base readability level predictions are not yet available for Dutch. Hence our T-Scan application, while concerned with linguistic complexity, does not aim to produce language levels. It sticks to the more modest task of providing values for a large number of features that potentially affect text comprehension (Kraf & Pander Maat 2009). Drawing on resources created by two decades of computational linguistic work in the Netherlands, T-Scan offers seven groups of text features:

  • Word complexity (e.g. word frequencies, word lengths, abbreviations)
  • Sentence complexity (e.g. sentence lengths, dependency lengths, NP complexity)
  • Referential coherence (e.g. argument overlap, anaphora, LSA similarities)
  • Information density (e.g. type-token-ratio, lexical density)
  • Concreteness (concrete nouns, concrete adjectives)
  • Personal style (e.g. person references, emotional words)
  • Situation model vocabularies (i.e. causal, spatial and temporal words)

T-Scan offers output on different text levels (word, sentence, paragraph, text), as determined by the user. Various kinds of research are under way using T-Scan. In corpus studies, we seek to set genre-bound ‘benchmark’ levels for different text features. In a readability study, we use T-Scan features to predict cloze comprehension test scores for educational and health information texts.
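
To give an impression of the kind of values T-Scan reports, the sketch below computes three drastically simplified features (average word length, type-token ratio and lexical density) for a single sentence. The function-word list and the feature definitions are illustrative assumptions, not T-Scan's actual implementation.

    # Simplified sketch of a few text features of the kind T-Scan reports.
    # The closed-class word list and the definitions are toy simplifications.
    FUNCTION_WORDS = {"de", "het", "een", "en", "van", "in", "op", "dat", "die", "is"}

    def features(text):
        tokens = [t.lower().strip(".,!?") for t in text.split()]
        tokens = [t for t in tokens if t]
        types = set(tokens)
        content = [t for t in tokens if t not in FUNCTION_WORDS]
        return {
            "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
            "type_token_ratio": len(types) / len(tokens),
            "lexical_density": len(content) / len(tokens),
        }

    print(features("De automatische analyse van tekstcomplexiteit is een groeiend onderzoeksveld."))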


Formalizing Fregean Concepts
Tanja Osswald (Universität Düsseldorf)

Poster session, 12.15-13.20

Theories of concepts explain what concepts are. In order to talk about concepts, each theory of concepts needs a way to describe them. Often, this is done in an informal way. Other theories of concepts develop formal means of concept description. Some theories, such as formal concept analysis (FCA), even put a focus on this part.

My claim is that we can regard theories of concepts and means of concept description independently, and thus use the formal development for theories that do not have their own formal approach to concept description.

In my talk, I exemplify this with regard to Frege's theory of concepts. Frege gives a formal account of what concepts are. But when he gives an example for a concept or discusses the intension of a particular concept, he stays with natural language descriptions.

I discuss two formal theories that include concept description, namely FCA and frame theory. FCA gives a purely set-theoretical analysis of concepts while frame theory was developed as a formalization of Barsalou's approach (Barsalou 1992) to model frames as mental representations. My argument is that although FCA is a mathematical theory that sees concepts as abstract entities, the means to represent concepts with FCA are too restricted to capture concepts understood in a Fregean way. On the other hand, although Frege would not follow Barsalou's theory of concepts, frames can be interpreted in a way to match Frege's own theory of concepts.


Building an NLP pipeline within a digital publishing workflow
Hans Paulussen (K.U.Leuven), Pedro Debevere (UGent), Francisco Bonachela Capdevila (K.U.Leuven), Maribel Montero Perez (K.U.Leuven), Martin Vanbrabant (K.U.Leuven), Wesley De Neve (UGent) & Stefan De Wannemacker (K.U.Leuven)

Room 2, 13.40-14.00

Outside the laboratory environment, NLP tool developers have always been obliged to use robust techniques in order to clean and streamline the ubiquitous formats of authentic texts. In most cases, the cleaned version simply consisted of the bare text, stripped of all typographical information and tokenised in such a way that even the reconstruction of a simple sentence resulted in unacceptable layout. In order to integrate the NLP output within the production workflow of digital publications, it is necessary to keep track of the original layout. In this talk, we present an example of an NLP pipeline developed to meet the requirements of real-world applications of digital publications.

The NLP pipeline was developed within the framework of the iRead+ project, a cooperative research project between several industry and academic partners in Flanders. The pipeline aims at enabling automatic enrichment of texts with word-specific and contextual information in order to create an enhanced reading experience on tablets and to support automatic generation of grammatical exercises. The enriched documents contain both linguistic annotations (part-of-speech and lemmata) and semantic annotations based on the recognition and disambiguation of named entity references (NER). The whole enrichment process, provided via a webservices workflow, can be integrated into an XML-based production flow. The input of the NLP enrichment engine consists of two documents: a well-formed XML source file and a control file containing XPath expressions describing the nodes in the source file to be annotated and enriched. As nodes may contain a pre-defined set of mixed data, reconstruction of the original document (with selected enrichments) is enabled.
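
The two-document input can be pictured roughly as in the sketch below, where XPath expressions from a control file select the nodes of the source document to be enriched. The element names, the control-file format and the dummy attribute that stands in for the linguistic and semantic annotations are assumptions made for illustration only.

    # Sketch of the two-document input: a control file lists XPath expressions that
    # select the nodes of the source document to be enriched.
    from lxml import etree

    source = etree.fromstring(
        "<book><meta>do not touch</meta>"
        "<chapter><p>Jan leest een boek.</p><p>Het regent.</p></chapter></book>")
    control = etree.fromstring(
        "<control><select>//chapter/p</select></control>")

    for expr in control.xpath("//select/text()"):
        for node in source.xpath(expr):
            # The real engine would add linguistic and semantic annotations here
            # while preserving the node's mixed content and layout.
            node.set("annotated", "true")

    print(etree.tostring(source, pretty_print=True).decode())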


The effect of word similarity on n-gram language models in Northern and Southern Dutch
Joris Pelemans (KU Leuven), Bruno De Laet (KU Leuven), Kris Demuynck (UGent), Hugo Van Hamme (KU Leuven) & Patrick Wambacq (KU Leuven)

Poster session, 12.15-13.20

In this paper we examine several combinations of classical n-gram language models with more advanced and well known techniques based on word similarity such as cache models, Latent Semantic Analysis, probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation. We compare the efficiency of these combined models to a model that combines n-grams with the recently proposed, state-of-the-art neural network-based hierarchical softmax skip-gram. We discuss the strengths and weaknesses of each of these models, based on their predictive power of the Dutch language.

In addition, we investigate whether and in what way the effect of Southern Dutch training material on these combined models differs when evaluated on Northern and Southern Dutch material. Experiments on Dutch newspaper and magazine material show that topics extend well across these language varieties: adding topic models trained on Southern Dutch yields a substantial improvement over an n-gram baseline when evaluated on Northern Dutch data. On the other hand, n-gram language models trained on Southern Dutch perform worse on Northern Dutch data than they do on Southern Dutch data. This leads us to conclude that Southern and Northern Dutch differ more in local, syntactic phenomena than in global phenomena such as word usage and topic.
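
One standard way to combine an n-gram model with a word-similarity component such as a cache or topic model is linear interpolation, sketched below. The component probabilities and the interpolation weight are made up for illustration and do not reflect the models compared in this work.

    # Sketch of linear interpolation between an n-gram model and a cache/topic
    # component: P(w|h) = lambda * P_ngram(w|h) + (1 - lambda) * P_topic(w|h).
    def interpolate(p_ngram, p_topic, lam=0.7):
        return {w: lam * p_ngram.get(w, 0.0) + (1 - lam) * p_topic.get(w, 0.0)
                for w in set(p_ngram) | set(p_topic)}

    p_ngram = {"krant": 0.10, "regering": 0.02, "fiets": 0.05}   # made-up P(w | "de")
    p_topic = {"regering": 0.20, "verkiezingen": 0.15, "krant": 0.01}
    combined = interpolate(p_ngram, p_topic)
    print(sorted(combined.items(), key=lambda x: -x[1]))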


Building search-friendly biomedical text databases based on topic models
Rivindu Perera & Udayangi Perera (Informatics Institute of Technology, Sri Lanka)

Poster session, 12.15-13.20

Search-friendliness is undoubtedly an important feature for any text database. For biomedical text, the dimensions that apply to general text databases need additional attention. Approaches to building search-friendly biomedical text databases have therefore been the subject of diverse research efforts, giving rise to various new heuristics. None of these, however, has considered the underlying text patterns in such databases, nor has any effort been made to exploit such patterns in the newly developed heuristics. We therefore propose a novel, soundly developed heuristic to fill this gap. Our approach starts with an initial topic modelling schema, developed on a selected dataset of biomedical texts and verified to represent all major subdomains. Based on these initial topic models, we run a repeated theme allocation process over the biomedical text, resulting in a collection in which all major biomedical topics are identified and labelled, and in which the identified abstract topics are stored separately. With this annotation schema, label-based clustering of the document collection is also performed, as this is advantageous in the search phase. During the search phase, the terms used in a search query are matched against the stored topic models using a semantic distance algorithm. Finally, the search results are presented to the user together with a semantic mapping diagram of topics and search terms.


Predicting end-of-turn in face-to-face interaction based on speakers’ and listeners’ head movements
Bayu Rahayudi, Ronald Poppe and Dirk Heylen (Universiteit Twente)

Room 4, 9.50-10.10

Predicting the end of a speaker’s turn in multiparty face-to-face interaction is challenging. This paper describes an end-of-turn prediction approach based on the head movements of the participants. In multiparty dialog, head movements of the participants can be treated as important features, as they are strongly related to turn-taking. We used several head movement features, including tilt, yaw, roll, shift-x, shift-y, and shift-z of each participant. We analyzed the influence of the speaker’s and the listeners’ head movements alone, as well as the effect of their combination, on the prediction accuracy. We used the Twente Debate Corpus, which contains 5 groups of three-person face-to-face debates. We also varied the size of the window over which the features were computed, and measured the prediction accuracy using a Support Vector Machine (SVM) classifier. We found that predicting the end of turn using the speaker’s head movement features yielded better results than using the head movements of the two listeners. The performance improved further when combining the head movement features of all participants, both speaker and listeners. This indicates that when analyzing human behaviour in conversations, it is useful to take the behaviour of the other participants into account.


Diachronic TICCL or Text-Induced Corpus Clean-up in Nederlab
Martin Reynaert (CLST, Radboud University Nijmegen and TiCC, Tilburg University)

Poster session, 12.15-13.20

The NWO project Nederlab hopes to eventually make available online large source corpora representing Dutch from its earliest days to now. The strategy is to linguistically enrich the corpora, making them fit for researchers to explore and analyze within their personalized Nederlab research environment, equipped with tools tailored to the researchers' requirements.

Insofar as most of the large source corpora today exist digitally only in a degraded state compared to their far less accessible paper versions in archives and libraries, Optical Character Recognition (OCR) post-correction currently offers the best chance of automatically restoring the collections' overall quality.

We present new TICCL results on the 1789 demonstrator book by Martinet which underlies the CLARIN-NL web service TICCLops. TICCL has now been equipped with a new feature-rich correction candidate ranking mechanism. It also now benefits from the availability of the IMPACT historical lexicon and name list developed at Nederlab partner INL. We further experiment with lexicons induced from well-edited digitized books of the Martinet era taken from the Nederlab version of the DBNL, the Digital Library of Dutch Literature.


Evaluating Named Entity Recognition and Disambiguation in News and Tweets
Giuseppe Rizzo (Università degli studi di Torino, Italy), Marieke Van Erp (VU University Amsterdam) and Raphael Troncy (EURECOM, France)

Room 3, 16.50-17.00

Named entity recognition and disambiguation are important for information extraction and populating knowledge bases. Detecting and classifying named entities has traditionally been taken on by the natural language processing community, whilst linking of entities to external resources, such as DBpedia and GeoNames, has been the domain of the Semantic Web community. As these tasks are treated in different communities, it is difficult to assess the performance of these tasks combined.

We conduct a thorough evaluation of the NERD-ML approach [1], an approach that combines named entity recognition (NER) from the NLP community and named entity linking (NEL) from the SW community. We present experiments and results on the CoNLL 2003 English [2] dataset for NER and the AIDA CoNLL-YAGO2 [3] dataset for NEL in the newswire domain, and on the MSM’13 [4] corpus for NER and the Derczynski et al. [5] dataset for NEL in the Twitter domain.

Our results indicate that combining approaches from the NLP and SW communities improves the performance for NER. As the NEL task is more recent, there is as yet no common agreement on the annotation level to adopt, which makes it difficult to assess the performance of different systems.

  1. Marieke van Erp, Giuseppe Rizzo and Raphaël Troncy (2013) Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning. #MSM2013 Concept Extraction Challenge.
  2. www.cnts.ua.ac.be
  3. www.mpi-inf.mpg.de
  4. oak.dcs.shef.ac.uk
  5. Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Niraj Aswani, Marieke van Erp, Raphaël Troncy, and Kalina Bontcheva. Named Entity Recognition and Linking for Microblogs. Currently under review, 2013

Workbench for Data analysis in the Hebrew Bible
Dirk Roorda (DANS), Martijn Naaijer (VU University Amsterdam) and Gino Kalkman (VU University Amsterdam)

Room 1, 10.10-10.30

The Hebrew Bible is an object of linguistic, literary and historical research [1,2]. The text resides in a database with careful linguistic markup [3,4,5,9]. Lately the database has been exported to Linguistic Annotation Framework [6,7]. But what tooling is there for LAF resources?

We discuss a few candidates: eXist [8], POIO [10], but since they do not perform well on the resource in question (2GB of annotations), we present a new workbench [11] that enables programmers to write simple Python scripts to extract qualitative and quantitative data from big LAF resources. The workbench has been designed for performance and the result is that users can perform analysis on this material with ordinary laptop capabilities.

As an example we will show how to extract data from the Hebrew Bible for cluster analysis. We give a concrete example of how this bears on the relationship between the book of Esther and other books in the Hebrew Bible, based on the clustering of common nouns.

The combination of a high-quality, comprehensive data source and an open-source workbench to easily get at the data inside opens fresh opportunities for theologians to undertake data-driven research into literary style and linguistic variation in the Hebrew Bible.

References:

  1. Peursen, W.Th. van, Thoutenhoofd, E.D. and Weel, A.H. van der. 2010. Text Comparison and Digital Creativity. The Production of Presence and Meaning in Digital Text Scholarship. Scholarly Communication 1; Leiden: Brill.
  2. Peursen, W.Th. van, Keulen, P.S.F. van. 2006. Corpus Linguistics and Textual History. Leiden: Brill. ISBN 9789023241942.
  3. BHS. 1968-1997 Biblia Hebraica Stuttgartensia. Deutsche Bibelgesellschaft. bibelwissenschaft.de/biblia-hebraica
  4. Talstra, E. and Sikkel, C.J., Genese und Kategorienentwicklung der WIVU-Datenbank, in C. Hardmeier et al. (eds.). 2000. Ad Fontes! Quellen erfassen – lesen – deuten. Wat ist Computerphilologie? Ansatzpunkte und Methodologie – Instrument und Praxis (Applicatio 15; Amsterdam 2000) 33–68.
  5. Talstra, E., Sikkel, C., Glanz, O., Oosting, R., Dyk, J.W. (2012). Text Database of the Hebrew Bible. Dataset available from Data Archiving and Networked services after permission of the depositor. www.persistent-identifier.nl.
  6. LAF. 2012. Linguistic Annotation Framework. ISO standard 24612:2012. Edition 1, 2012-06-15. Official link: iso.org/?csnumber=37326. Unofficial link to the standard text: cs.vassar.edu/laf.pdf.
  7. Hebrew Text Database in LAF. To be deposited (2013) in the DANS-EASY archive easy.dans.knaw.nl. Persistent identifier will follow.
  8. eXist: XML database. www.exist-db.org.
  9. Talstra, E. and Sikkel, C.J., Genese und Kategorienentwicklung der WIVU-Datenbank, in C. Hardmeier et al. (eds.). 2000. Ad Fontes! Quellen erfassen – lesen – deuten. Wat ist Computerphilologie? Ansatzpunkte und Methodologie – Instrument und Praxis (Applicatio 15; Amsterdam 2000) 33–68.
  10. Bouda, Peter, Ferreira, Vera, and Lopes, António. Poio API - An annotation framework to bridge Language Documentation and Natural Language Processing. Bouda-Ferreira-Lopes.pdf In Proceedings of The Second Workshop on Annotation of Corpora for Research in the Humanities, Lisbon, 2012. ISBN 978-989-689-273-9. ACRH-2_FINAL.pdf
  11. Dirk Roorda. LAF-Fabric. Workbench for analysing LAF resources. demo.datanetworkservice.nl, github.com/dirkroorda.
  12. Naaijer, Martijn. The Common Nouns in the Book of Esther. A New Quantitative Approach to the Linguistic Relationships of Biblical Books. Master Thesis, Radboud University Nijmegen. 2012

Standardization of Phrases in Technical Documentation
Rouven Röhrig (Technische Universität Darmstadt, Germany)

Room 2, 16.10-16.30

Well written documentation is crucial to deal with the growing complexity of technical components. The objective of technical documentation is to create precise and comprehensible documents of the specification and usage of a product. However, content creators commonly do not have the required insights while technical experts lack expertise in writing techniques. Moreover, people from various countries have different spelling styles and it is complicated to maintain and follow style guides.

In Rouven Röhrig's master’s thesis, we focus on the standardization of technical documentation. In particular, we propose an approach to identify standard phrases in order to implement a proposal system which supports editors. The term “standard phrase” refers to a sentence that is recurring in technical documentation.

First, we identify standard phrases using clustering techniques along with similarity measures. Then, we propose a metric to select an adequate representative for every standard phrase. Subsequently, we introduce a system which identifies sentences that deviate from the standard and which creates a list of corrections. In addition, we generalize phrases to exploit them for standardization across multiple domains. To do so, we use a glossary to identify and replace domain-specific proper nouns. For this purpose, we implement and compare different methods for the automatic generation of a technical glossary. Furthermore, we implement a simple stand-alone authoring tool and an Eclipse plug-in to demonstrate the standardization systems. Finally, we show that the system performs well for software products on the basis of Software AG's error messages.
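
The first two steps can be pictured with the toy sketch below: sentences whose surface similarity exceeds a threshold are greedily grouped, and the most central member of each group is chosen as its representative. The similarity measure, threshold and example sentences are assumptions made for illustration, not the clustering techniques and metric used in the thesis.

    import difflib

    # Sketch of standard-phrase identification: greedily cluster near-identical
    # sentences, then pick the most central member of each cluster as its
    # representative.
    def similarity(a, b):
        return difflib.SequenceMatcher(None, a, b).ratio()

    def cluster(sentences, threshold=0.8):
        clusters = []
        for s in sentences:
            for c in clusters:
                if similarity(s, c[0]) >= threshold:
                    c.append(s)
                    break
            else:
                clusters.append([s])
        return clusters

    def representative(members):
        return max(members, key=lambda s: sum(similarity(s, t) for t in members))

    sentences = [
        "Press the OK button to confirm.",
        "Press the OK button to confirm the action.",
        "Remove the battery before cleaning.",
    ]
    for c in cluster(sentences):
        print(representative(c), "<-", len(c), "member(s)")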


An unsupervised model for recognizing user activities
Maya Sappelli (TNO/Radboud University Nijmegen), Suzan Verberne (Institute for Computing and Information Sciences, Radboud University Nijmegen) and Wessel Kraaij (Radboud University Nijmegen)

Room 2, 9.50-10.10

The project SWELL aims to improve wellbeing at work by means of smart ICT tools. One aspect is to give users insight in how they spend their time. For that purpose, we need a method for recognizing what the user is doing, e.g. which project they are working on. Additionally, if we can recognize the topic a user works on, this information can be used to find relevant documents or to prevent distractions that are not relevant to the current activities.

We propose to model these user activities using a neural network approach based on the Interactive Activation Model of Rumelhart and McClelland (1981). Low-level computer events such as keystrokes and clicks are used as input to the network. The network has an 'information object' level, consisting of the documents on the user's PC, and an intermediate 'context' level containing information about location, entities, topics, time elements and anything else that may be of interest. Both the input level and the 'information object' level are connected to this context level. A special level, containing so-called 'nodes of interest', is used to determine which context is active. These nodes correspond to the projects a user is working on. The raw activation of the 'nodes of interest' is used to visualize the user's activities over time. The construction of the network and the selection of 'nodes of interest' can be done unsupervised, for example by using the organization of documents in folders as a basis.


Normalizing Dutch user-generated content
Sarah Schulz, Bart Desmet, Orphée De Clercq & Véronique Hoste (Ghent University)

Poster session, 12.15-13.20

In this presentation, we discuss our ongoing work in the field of normalization of user-generated content (UGC). We currently focus on Dutch data pertaining to three UGC genres (text messages, message board posts and tweets), and plan to experiment with normalizing English text as well.

The core of our approach is phrase-based machine translation. We perform machine translation from noisy data to standard data on different levels: we experiment with token-based, character-unigram based and character-bigram based translation. This provided us with a robust baseline system that achieved 63% word error rate reduction on the SMS test corpus. We further include other normalization methods like spell-checking and grapheme-to-phoneme-to-grapheme conversion in order to cover a wide range of problems that can be observed in UGC.
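
The three levels of translation units can be illustrated as below, where a noisy sentence is rewritten as tokens, character unigrams and character bigrams before being fed to a phrase-based system. The boundary marker and the example sentence are assumptions made for illustration only.

    # Sketch of the three data views used for phrase-based normalization:
    # token-based, character-unigram and character-bigram. The underscore marks
    # token boundaries in the character-level views.
    def tokens(sentence):
        return sentence.split()

    def char_unigrams(sentence):
        return list(sentence.replace(" ", "_"))

    def char_bigrams(sentence):
        s = sentence.replace(" ", "_")
        return [s[i:i + 2] for i in range(len(s) - 1)]

    noisy = "morgen gaan we naar t strand"   # "t" -> "het" in standard Dutch
    print(" ".join(tokens(noisy)))
    print(" ".join(char_unigrams(noisy)))
    print(" ".join(char_bigrams(noisy)))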

We combine these approaches into a modular system. The performance of each module is analyzed both quantitatively and qualitatively, in an attempt to cover all possible errors. We aim to improve performance further by weighting the modules and by allocating different normalization problems to the appropriate modules.


The linguist is always right! Oops ... Our experiences with User Interface Testing
Ineke Schuurman (KU Leuven), Liesbeth Augustinus (CCL, KU Leuven), Vincent Vandeghinste (KU Leuven) & Frank Van Eynde (KU Leuven)

Poster session, 12.15-13.20

GrETEL is a query engine in which linguists can use a natural language example as a starting point for searching a treebank with limited knowledge about tree representations and formal query languages. By allowing traditional linguists to search for constructions which are similar to the example they provide, we hope to bridge the gap between traditional and computational linguistics. We want GrETEL to be easy to use, and it is used in courses in Flanders and the Netherlands, apparently to everyone's satisfaction.

In the follow-up CLARIN project GrETEL 2.0 we had the interface professionally tested. Surprisingly, it turned out that new users, old and young, faced several difficulties when trying to solve some of the linguistic scenarios we designed for these tests. Even some basic actions turned out to be problematic: the users did not have the knowledge and the know-how we (unintentionally) expected them to have. User interface testing was a very valuable experience: when designing a tool you really need to put yourself in the shoes of the user. A disillusioned user ("this tool is far too complicated") may never give it a second try, which is not what you want.

In our presentation we will discuss some of the unforeseen issues we were confronted with and show how we changed our approach.


Automatic Induction of Word Classes in Swedish Sign Language
Johan Sjons (Stockholm University)

Poster session, 12.15-13.20

Research on word classes in Swedish Sign Language (SSL) is not very extensive, and, as pointed out by Ahlgren and Bergman (2006, p. 18), no corpus-based study has been carried out. The present study is a first attempt at addressing this problem using unsupervised learning.

Sign languages typically lack a written form, due to the problems of representing signs orthographically (Johnston and Schembri, 1999), which often constrains quantitative research to the use of written glosses in a form similar to that used for a spoken language (Hopkins, 2008). One source of such glosses is the Swedish Sign Language Corpus (SSLC), a corpus containing videos in which deaf signers (n=42) sign with each other in pairs, and in which all signs have been transcribed using glosses (token frequency=29686).*

The aim of this study was to investigate whether automatic induction of word classes in SSL is a feasible approach. The glosses in the SSLC were mapped to utterances based on Swedish translations in the corpus, and the glosses in these utterances were clustered with the Brown algorithm (Brown et al., 1992). The clusters were then compared to a gold standard tag dictionary created by a native speaker of SSL. The results show that the Brown algorithm performs significantly better in inducing word classes than a random baseline. This indicates that utilizing unsupervised learning is a feasible approach for doing research on word classes in SSL. In future research, it should be investigated whether the method is applicable to other sign languages, such as Dutch sign language. Additionally, other algorithms should be tested.

References

  • Ahlgren, I. and Bergman, B. (2006). “Det svenska teckenspråket.” Teckenspråk och teckenspråkiga. Kunskaps- och forskningsöversikt: Betänkande av utredningen översyn av teckenspråkets ställning, 11-70.
  • Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. (1992). “Class-based n-gram models of natural language.” Computational Linguistics, 18(4), 467-479.
  • Hopkins, J. (2008). “Choosing how to write sign language: a sociolinguistic perspective.” International Journal of the Sociology of Language, 2008(192), 75-89.
  • Johnston, T. and Schembri, A. (1999). “On defining lexeme in a signed language.” Sign Language & Linguistics, 2(2), 115-185.

    * The token frequency refers to when this study was carried out, i.e. in early spring 2013. The SSLC has been extended since.


Gender Recognition on Dutch Tweets
Nander Speerstra and Hans Van Halteren (Radboud University Nijmegen)

Room 1, 16.30-16.50

In the Netherlands we have a rather unique resource in the form of the TwiNL dataset: a daily updated collection that probably contains at least 30% of the Dutch public tweet production since 2011. However, as with any collection that is harvested automatically, its usability is reduced by a lack of reliable metadata. In this case, the Twitter profiles of the authors are available, but these consist of free-form text rather than fixed information fields. And, obviously, it is unknown to what degree the information that is present is true. In this paper, we attempt to derive at least the gender of the authors automatically, using authorship profiling techniques.

When accessing the data through its official web interface (http://www.twiqs.nl), it is possible to get statistics on word use by male and female users. However, these statistics can only be seen as a rough approximation, as the authors are recognized as being male or female by keyword spotting in their profiles. Still, to create an experimental data set for our investigation, we started with the TwiQS gender assignment for all authors who had an average tweet production in 2011 and 2012 between 1460 and 7300 tweets (i.e. 2 to 10 per day). We then manually checked the suggested gender and selected 248 males and 248 females for whom we were practically certain of a correct gender label.

With the tweets of the authors whose gender we were certain of, we experimented with several authorship profiling techniques, in order to determine how well they could distinguish between male and female authors of tweets. For the best performing system, we also compared its judgement to that of TwiQS, at least for those authors who are assigned a gender by TwiQS.


Evaluating (Long Range) Reordering with Permutation-Forests
Milos Stanojevic and Khalil Sima'An (ILLC, UvA)

Room 3, 13.20-13.40

Automatically evaluating the quality of the word order of MT systems is challenging yet crucial for MT evaluation. Existing approaches employ string-based metrics, which are computed over the permutations of word positions in system output relative to a reference translation. We introduce a new metric computed over Permutation Forests (PEFs), tree-based representations that decompose permutations recursively. Relative to string-based metrics, PEFs offer advantages for evaluating long-range reordering. We compare the PEFs metric against five known reordering metrics on WMT13 data for ten language pairs. The PEFs metric shows better correlation with human ranking than the other metrics on almost all language pairs, and none of the other metrics exhibits equally stable behaviour across language pairs.


Soothsayer: word completion trained on idiolects and sociolects
Wessel Stoop (Centre for Language & Speech Technology, Radboud University Nijmegen) and Antal van den Bosch (Centre for Language Studies, Radboud University Nijmegen)

Room 1, 16.10-16.30

We present the word prediction system Soothsayer. Soothsayer predicts what a user is going to write as he/she is keying it in. In its first version, Soothsayer uses idiolects, the language of one individual person, as its source of knowledge. That is, we train on texts the user of the system has written earlier. For data collection and experimentation, we use Twitter; our idiolect models are based on individual Twitter feeds.

Furthermore, we investigate sociolect models. These models are not only based on the tweets of a particular person, but also on the tweets of the people he/she often communicates with. The idea behind this is that people who often communicate start to talk alike; in other words, the language of the friends of person x can be helpful in trying to predict what person x is going to say.

The results show that using an idiolect model leads to considerably more keystrokes saved than a general language model, despite using substantially less material. Using the sociolect model leads to even better results. For a number of users, more than 55% of the keystrokes could have been saved if they had used Soothsayer.
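
The keystroke-savings measure can be illustrated with the minimal sketch below, in which a simple frequency-based completer stands in for Soothsayer's idiolect and sociolect models: a word's remaining characters count as saved from the first prefix at which the intended word is predicted. Everything here is an illustrative assumption, not the actual system.

    # Sketch of the evaluation measure: the share of keystrokes saved when a
    # completion is accepted as soon as the system predicts the intended word.
    from collections import Counter

    def train(history_tokens):
        return Counter(history_tokens)

    def predict(model, prefix):
        candidates = [(w, c) for w, c in model.items() if w.startswith(prefix) and w != prefix]
        return max(candidates, key=lambda x: x[1])[0] if candidates else None

    def keystrokes_saved(model, word):
        for i in range(1, len(word)):
            if predict(model, word[:i]) == word:
                return len(word) - i          # remaining characters are completed for free
        return 0

    model = train("vandaag lekker weer vandaag naar huis vandaag".split())
    word = "vandaag"
    saved = keystrokes_saved(model, word)
    print(f"{saved}/{len(word)} keystrokes saved ({100 * saved / len(word):.0f}%)")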


Dependency-tuned word clusters for Dutch
Simon Suster and Gertjan van Noord (University of Groningen)

Room 4, 14.00-14.20

We present a syntactic extension of the Brown et al. 1992 algorithm, one of the most widely used word clustering algorithms in Natural Language Processing. The standard Brown clustering builds its model representation in the context of a bigram language model, where only adjacent words can be taken into account. We argue that more informative and more precise word clusters can be obtained by considering dependency contexts. We induce the clusters for Dutch on the SoNaR corpus. We evaluate our method in a Cornetto-based similarity task. In experiments adjusting parameters such as frequency threshold, number of clusters and amount of learning data, we show that dependency clustering yields better word representations than the standard Brown clustering. Our method also requires less data for a desired clustering quality.


An Iterative Approach to Language Model Adaptation for Hand-written Text Recognition
Jafar Tanha, Jesse de Does and Katrien Depuydt (INL)

Room 1, 10.30-10.50

The tranScriptorium (http://www.transcriptorium.eu) project aims to develop innovative, efficient and cost-effective solutions for transcription of historical handwritten document images, using modern Hand-written Text Recognition (HTR) technology.

Language modeling is an essential component of the HTR system. In tranScriptorium, we investigate how to obtain good language models from a combination of language resources, consisting of larger out-of-domain data sets and smaller in-domain corpora. In this study, we develop language models for a small subcollection of the available transcripts from the Transcribe Bentham project, using the complete set of transcripts and the public part of the Eighteenth Century Collections Online (ECCO) corpus as background corpora. We approach the resulting domain adaptation problem from the point of view of intelligent sample selection. We propose an iterative algorithm to select, with high confidence, a relevant subset from the ECCO corpus, based on the agreement between two language models Mi (i=1,2): the first constructed from a small set of in-domain Bentham data for a specific topic, and the second constructed from the remaining part of the Bentham collection. For each i, we define a metric di based on perplexity and the number of out-of-vocabulary words w.r.t. Mi. Documents are selected and added to the LM training data when there is sufficient agreement between the values of the two metrics. Our empirical results on the HTR system show that the proposed method samples relevant data and improves the performance of the recognition system.
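
A much-simplified sketch of the selection criterion is given below: each candidate document is scored under an in-domain model and a background model, and it is kept when the two scores agree closely enough. The unigram models, smoothing, example texts and agreement threshold are placeholders for the actual iterative procedure and the metrics di.

    import math

    # Sketch of perplexity-based selection: score a candidate under an in-domain
    # model (M1) and a background model (M2) and keep it when the scores agree.
    def unigram_model(text, alpha=0.1):
        tokens = text.lower().split()
        counts, total = {}, len(tokens)
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
        vocab = len(counts) + 1
        return lambda w: (counts.get(w, 0) + alpha) / (total + alpha * vocab)

    def perplexity(model, text):
        tokens = text.lower().split()
        logp = sum(math.log(model(t)) for t in tokens)
        return math.exp(-logp / len(tokens))

    m1 = unigram_model("of the greatest happiness of the greatest number")
    m2 = unigram_model("observations on the principles of morals and legislation")

    candidate = "an essay on the principles of happiness"
    d1, d2 = perplexity(m1, candidate), perplexity(m2, candidate)
    keep = abs(d1 - d2) / max(d1, d2) < 0.5     # 'sufficient agreement' (placeholder threshold)
    print(round(d1, 1), round(d2, 1), keep)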


BasiLex: presentation of a new corpus of children’s written input, and some first analyses regarding text cohesion throughout elementary school
Agnes Tellings (BSI), Micha Hulsbosch (BSI) and Anne Vermeer (DCiS)

Room 1, 13.40-14.00

We will discuss BasiLex, a recently finalized corpus comprising Dutch written documents aimed at children aged 6 to 12: school and children’s novels and comic books, assessment tests, news texts, websites and subtitles. The corpus documents are analyzed by Frog, a Dutch morpho-syntactic analyzer and dependency parser, and are stored in FoLiA, an XML-based annotation format. The corpus also comprises information on word properties such as family size, neighbourhood size, bigram frequency, et cetera. From the corpus, and in comparison to the SoNaR corpus, a lexicon of 20,000 words is extracted, forming a target list for the vocabulary knowledge of children at the end of elementary school. To part of this lexicon we applied word sense disambiguation, partly automatically and partly by hand; for both we used software that resulted from the DutchSemCor project. We will also discuss the outcomes of a first study we did with BasiLex on relational coherence, which is often expressed by connectives like but, then, until, before, because, and while. We examined the expected increase of cohesion in texts throughout elementary school by measuring the absolute and relative number of connectives present in the different texts, and compared these with the actual use of connectives by children in these grades. We also compared subcorpora (e.g., novels versus school text materials) on these connectives.


Towards a Semantic Model for Textual Entailment Annotation
Assaf Toledo (Utrecht University)

Room 5, 9.30-9.50

The RTE challenges (Dagan et al., 2006) aim to automatically determine whether an entailment relation obtains between naturally occurring premises and composed hypotheses. The RTE corpus, which is currently the only available resource of textual entailments, marks entailment candidates as valid/invalid. For example:

T: The head of the Italian opposition, Romano Prodi, was the last president of the European Commission.
H: Romano Prodi is a former president of the European Commission.
Entailment: valid

This categorization contains no indication of the linguistic processes that underlie entailment. In the absence of a gold standard of inferential phenomena, entailment systems can be compared based on their performance, but the inferential processes they employ are not directly accessible for analysis.

We introduce a new formal semantic model for annotating textual entailments that describes restrictive, intersective and appositive modification. This model contains a formally defined interpreted lexicon, which specifies the inventory of symbols and the supported semantic operators, and an informally defined annotation scheme that instructs annotators how to bind words and constructions from a given pair of premise and hypothesis to the interpreted lexicon.

We explore the applicability of the proposed model to the RTE 1-4 corpora and describe a first-stage annotation scheme based on which manual annotation work was carried out. The constructions we annotated were found to occur in 80.65% of the entailments in RTE 1-4 and were annotated with cross-annotator agreement of 68% on average. The annotated RTE corpora are publicly available to the research community.

* This is joint work with several colleagues in a project.


English to Dutch MT experiments using a hybrid MT build and runtime infrastructure
Joachim Van den Bogaert (CrossLang/CCL Leuven), Kim Scholte (CrossLang), Arda Tezcan (CrossLang), Koen Van Winckel (CrossLang), Sara Szoc (CrossLang) and Joeri Van de Walle (CrossLang)

Room 3, 13.40-14.00

In this paper, we present the results of a continuous improvement effort for English to Dutch Machine Translation. Starting from off-the-shelf Moses SMT, we discuss data remodeling strategies (domain adaptation, compound splitting, improved tokenization, syntax-based pre-ordering) and linguistic strategies (NP and VP chunking for applying reordering constraints) for MT development. A brief overview of the architecture and methodology illustrates how (non-computational) linguists can be involved in the development process and how the system can be adapted to cover new domains and genres.


A Neural Network Approach to Selectional Preference Acquisition
Tim Van de Cruys (CNRS & IRIT)

Room 2, 10.30-10.50

In this talk, we investigate the use of neural networks for the acquisition of selectional preferences. Inspired by recent advances in neural network models for NLP applications, we propose a neural network model that learns to discriminate between felicitous and infelicitous arguments for a particular predicate. The model is entirely unsupervised: preferences are learned from non-annotated corpus data. We propose two neural network architectures: one that handles standard two-way selectional preferences and one that is able to deal with multi-way selectional preferences. The model's performance is evaluated on a pseudo-disambiguation task, on which it achieves state-of-the-art results.
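
A minimal NumPy sketch of the two-way case is given below: a small network scores (predicate, argument) pairs, and its output layer is nudged so that attested arguments outscore randomly drawn pseudo-negatives. The vocabulary, dimensions and training scheme are illustrative assumptions, not the architecture proposed in this work.

    import numpy as np

    # Toy two-way selectional preference model: score(predicate, argument) with a
    # one-hidden-layer network over concatenated embeddings, trained with a
    # max-margin objective against random pseudo-negatives. For brevity only the
    # output layer is updated.
    rng = np.random.default_rng(0)
    vocab = ["eat", "drink", "apple", "beer", "car", "idea"]
    idx = {w: i for i, w in enumerate(vocab)}
    dim, hidden = 8, 16
    E = rng.normal(0, 0.1, (len(vocab), dim))          # word embeddings
    W = rng.normal(0, 0.1, (hidden, 2 * dim))          # hidden layer
    v = rng.normal(0, 0.1, hidden)                     # output layer

    def score(pred, arg):
        x = np.concatenate([E[idx[pred]], E[idx[arg]]])
        return float(v @ np.tanh(W @ x))

    observed = [("eat", "apple"), ("drink", "beer")]   # felicitous pairs from a toy 'corpus'
    lr = 0.1
    for _ in range(200):
        for pred, arg in observed:
            neg = vocab[rng.integers(len(vocab))]      # random pseudo-negative argument
            if 1.0 - score(pred, arg) + score(pred, neg) > 0:   # margin violated
                h_pos = np.tanh(W @ np.concatenate([E[idx[pred]], E[idx[arg]]]))
                h_neg = np.tanh(W @ np.concatenate([E[idx[pred]], E[idx[neg]]]))
                v += lr * (h_pos - h_neg)              # gradient step on the output layer

    print(round(score("eat", "apple"), 2), round(score("eat", "car"), 2))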


Machine Translation between Dutch and Pictographs for People with Communicative Disabilities
Vincent Vandeghinste & Ineke Schuurman (KU Leuven)

Poster session, 12.15-13.20

People with communicative disabilities often have a difficult time online: much communication is text-based, and text is often a major stumbling block. The WAI-NOT online environment is similar to a social network, but closed (no external links) and safe, and it provides communication assistance such as text-to-speech and text-to-pictograph conversion. As a baseline, text was converted into pictographs when words in the text matched the filenames of the pictographs. While this is a logical first step, the coverage of such a system is too low. Last year we presented the improvements made by adding part-of-speech tagging and lemmatization.

This year we present the next improvements. To improve the coverage, we use the Cornetto synsets as a kind of interlingua. We have (manually) linked each pictograph to one or more Cornetto synsets (some pictographs represent complex concepts which require more than one synset). The incoming text is mapped onto these synsets (either directly or through the hyperonym or xpos_synonym relations), and through an A* search algorithm we find the optimal path (least number of pictographs, with the least distance between the synsets of the pictographs and the synsets of the text).
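
The sketch below illustrates this kind of best-path search in miniature, using plain uniform-cost search without a heuristic (the candidate pictographs and distance function are hypothetical stand-ins for the synset-based ones):

  # Illustrative best-path search over candidate pictographs per token
  # (a sketch; the candidates and distance function are hypothetical stand-ins).
  import heapq
  from itertools import count

  def best_pictograph_path(candidates, distance, picto_cost=1.0):
      """candidates[i]: candidate pictographs for token i (None = leave the token unexpressed).
      The cost favours few pictographs and small distance between consecutive ones."""
      tie = count()
      heap = [(0.0, next(tie), 0, None, [])]
      while heap:
          cost, _, i, prev, path = heapq.heappop(heap)
          if i == len(candidates):
              return path, cost
          for picto in candidates[i]:
              if picto is None:
                  heapq.heappush(heap, (cost, next(tie), i + 1, prev, path))
              else:
                  step = picto_cost + (distance(prev, picto) if prev else 0.0)
                  heapq.heappush(heap, (cost + step, next(tie), i + 1, picto, path + [picto]))
      return [], float("inf")

  toy_distance = lambda a, b: 0.0 if a == b else 0.5
  # The second token may be dropped (e.g. a function word), so the cheapest path skips it.
  print(best_pictograph_path([["HOUSE"], ["GO", None], ["SCHOOL"]], toy_distance))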

We also present an evaluation of our system in which we show very large improvements over the baseline approach, mainly in recall.


Fine-grained Analysis of Explicit and Implicit Sentiment in Financial News
Marjan Van de Kauter (LT3, Language and Translation Technology Team, Ghent University), Véronique Hoste (LT3, Language and Translation Technology Team, Ghent University) & Diane Breesch (Faculty of Economic, Political and Social Sciences and Solvay Business School, Vrije Universiteit Brussel)

Poster session, 12.15-13.20

The goal of the SentiFM project (Sentiment mining for the Financial Markets) is to develop a system for the automatic analysis of explicit as well as implicit sentiment in English and Dutch financial news articles. A significant amount of research in the financial domain has been dedicated to the impact of news on the stock markets. A system that automatically labels news items as positive or negative for the market can be of great use for researchers in this field.

In this talk, we present a fine-grained approach to sentiment analysis which seeks to capture both implicit and explicit sentiment related to a given stock of interest. To test the suitability of our approach for the financial domain, we evaluated its performance in comparison to that of a baseline coarse-grained method which makes use of a subjectivity lexicon. For this purpose, we used an evaluation corpus of Dutch news articles about four companies listed in the BEL20 index. In this corpus, each sentence about the companies of interest was labeled as positive, negative or neutral by four annotators. We discuss the performance of both the coarse- and fine-grained approach to sentiment analysis on this labeled data set.
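
The coarse-grained baseline can be pictured roughly as follows (an illustration of ours with a toy lexicon, not the project's actual resources):

  # Coarse-grained lexicon baseline (illustrative; toy lexicon, not the project's resources).
  SUBJECTIVITY_LEXICON = {"winst": 1.0, "groei": 0.5, "verlies": -1.0, "daling": -0.5}

  def classify_sentence(tokens, threshold=0.25):
      """Sum the polarity of the words found in the lexicon and threshold the total."""
      score = sum(SUBJECTIVITY_LEXICON.get(t.lower(), 0.0) for t in tokens)
      if score > threshold:
          return "positive"
      if score < -threshold:
          return "negative"
      return "neutral"

  print(classify_sentence("Het bedrijf boekte een stevige winst".split()))  # positive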


Weakly supervised semantic frame induction: effects of using background knowledge
Janneke van de Loo (CLiPS, University of Antwerp), Jort F. Gemmeke (ESAT, KU Leuven), Guy De Pauw (CLiPS, University of Antwerp), Walter Daelemans (CLiPS, University of Antwerp) and Hugo Van Hamme (ESAT, KU Leuven)

Room 5, 9.50-10.10

In previously reported research, we developed a framework for weakly supervised semantic frame induction, based on hierarchical hidden Markov models. We now present results of a detailed analysis of the inner workings of the framework, and the effect of incorporating background knowledge in the models. The semantic frame induction system was designed to be used in an assistive vocal interface for people with physical impairments, which is specifically trained to adapt itself to each individual user (project ‘ALADIN’). The system’s task is to induce frame-based semantic representations of spoken or written utterances, based on a training set of utterances and associated semantic frames. The weak supervision lies in the fact that there is only supervision at the utterance level; no relations between parts of the utterances and parts of the semantic frames are specified in advance. We show how the incorporation of background knowledge in the system can facilitate and accelerate the process of learning the relations between structures in the commands and in the semantic frames. The experimental results presented here were produced with textual command input.
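
The following toy example shows what supervision at the utterance level amounts to: an utterance paired with a complete frame, but no alignment between words and slots (the slot names are hypothetical, ALADIN-style examples):

  # Illustrative weakly supervised training pair: an utterance and its full semantic frame,
  # with no word-to-slot alignment given (slot names are hypothetical ALADIN-style examples).
  training_pair = {
      "utterance": "zet de televisie wat luider",
      "frame": {
          "action": "change_volume",
          "device": "tv",
          "direction": "up",
      },
  }
  # The learner must discover, from many such pairs, which parts of the utterance
  # (e.g. "luider") correspond to which slot values (e.g. direction=up).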


Modelling Syntax and Morphology for Parsing
Paul Van Eecke (KU Leuven)

Room 2, 13.20-13.40

Natlanco bvba, a Ghent-based speech and language technology company, has developed a very efficient parser for phrase structure grammars [1]. Along with the parser, the company has developed LingBench IDE, an environment in which the language models can be designed and tested.

The language models consist of a morphological component and a syntactic component. Both components are designed in the same user-friendly graphical way and can be enriched with feature structures and probabilistic information.

We present how we designed and implemented language models for French and German syntax and morphology. The models were designed in the same way for both languages, but some crucial differences can be observed. In particular, the variable word order and the morphological case marking of noun phrases in German posed a challenge.

The final models allow a fast parsing process, avoid spurious ambiguity and are relatively easy to expand. Possible applications of the models include automatic text summarization software and different kinds of dialogue systems.

References:

  1. De Brabander, F. A language modelling system and a fast parsing method. Patent. EP 1209560 A1. May 2002.

Linear Average Time Extraction of Phrase-structure Fragments
Andreas van Cranenburgh (UvA)

Room 4, 13.20-13.40

We present a method and implementation for extracting tree fragments from treebanks. Using a tree kernel method the largest overlapping fragments are extracted from each pair of trees (cf. figures 1 & 2), resulting in a set of fragments, each of which occurs at least twice in the treebank. The algorithm presented is able to find these fragments 70 times faster (cf. table 1) than the previously available implementation (Sangati et al., 2010). This substantial speedup is due to the incorporation of the Fast Tree Kernel (Moschitti, 2006), and opens up the possibility of handling much larger corpora. Furthermore, the tool supports trees with discontinuous constituents, used in, inter alia, the German and Dutch treebanks. The resulting fragments can be used for statistical parsing in a tree-substitution grammar, as in Data-Oriented Parsing (Sangati and Zuidema, 2011; van Cranenburgh and Bod, 2013). Another application is in classification problems such as authorship attribution (van Cranenburgh, 2012) and native language detection (Swanson and Charniak, 2012).
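
The core idea can be illustrated with a naive pairwise extraction over trees encoded as nested tuples (our own simplification: it compares only the two root nodes, whereas the actual tool uses a fast tree kernel over all node pairs):

  # Naive illustration of the largest fragment shared by two trees, encoded as nested tuples
  # (label, child1, child2, ...); the actual tool compares all node pairs with a fast tree kernel.
  def label(t):
      return t if isinstance(t, str) else t[0]

  def common_fragment(a, b):
      """Largest fragment shared at the pair (a, b), or None if the root labels differ."""
      if label(a) != label(b):
          return None
      if isinstance(a, str) or isinstance(b, str):
          return label(a)
      if tuple(map(label, a[1:])) != tuple(map(label, b[1:])):
          return label(a)   # same root, different production: keep only the bare node
      # same production: recursively extend into each child pair
      return (label(a),) + tuple(common_fragment(ca, cb) for ca, cb in zip(a[1:], b[1:]))

  t1 = ("S", ("NP", "Mary"), ("VP", ("V", "sees"), ("NP", "John")))
  t2 = ("S", ("NP", "John"), ("VP", ("V", "sees"), ("NP", "Mary")))
  print(common_fragment(t1, t2))   # ('S', 'NP', ('VP', ('V', 'sees'), 'NP'))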


A treebank-grounded investigation of the agreement between predicate nominals and their target
Frank Van Eynde (University of Leuven)

Room 5, 14.00-14.20

Predicate nominals canonically show number agreement with their target, as illustrated by the contrasts in (1-2).

  1. My brother is a plumber/*plumbers.
  2. My brothers are plumbers/*a plumber.

Mismatches, however, are not excluded, as shown in (3-4).

  3. He is friends with Elio di Rupo.
  4. These hooligans are a danger on the road.

The challenge is to model the agreement in a way that excludes the starred combinations in (1-2) but that allows the combinations in (3-4).

A first step is to find out under which conditions the mismatches are allowed. For this purpose I will employ the CGN treebank. Checking the NUMBER values of the predicate nominals and their targets shows that approximately 90% of the pairs show agreement, as in (1-2). The mismatches will be classified into a number of types, and generalizations will be formulated that subsume the relevant types.
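
The counting step can be pictured as follows (a sketch of ours with toy data; the actual extraction from CGN is not shown):

  # Illustrative counting of number (mis)matches between predicate nominals and their targets.
  # Each record is a (target_number, predicate_nominal_number) pair extracted from a treebank;
  # the extraction itself (e.g. from CGN) is not shown here.
  from collections import Counter

  pairs = [("sg", "sg"), ("pl", "pl"), ("sg", "pl"), ("pl", "sg"), ("sg", "sg")]

  patterns = Counter(pairs)
  agreeing = sum(n for (t, p), n in patterns.items() if t == p)
  print(patterns)
  print(f"agreement rate: {agreeing / len(pairs):.0%}")   # 60% on this toy sample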

The second step involves the integration of the results in a formal model of agreement. For that purpose I will employ the distinction between morpho-syntactic agreement and index agreement that is standardly made in Head-driven Phrase Structure Grammar.

The aim of the talk is twofold: on the descriptive level, it aims to contribute to an empirically motivated treatment of agreement in copular constructions; on the methodological level, it aims to show how treebanks can be used for grounding empirical investigations of linguistic phenomena.

References:

  • Kathol (1999), Agreement and the syntax-morphology interface in HPSG. In Levine & Green (eds.), Studies in contemporary phrase structure grammar. Cambridge UP.
  • Pollard & Sag (1994), Head-driven Phrase Structure Grammar. CSLI Publications.
  • Van Eynde (2012), On the agreement between predicative complements and their target. S. Mueller (ed.), Proceedings of the 19th International Conference on HPSG.
  • Wechsler & Zlatic (2003), The many faces of agreement. CSLI Publications.

Harnessing Projection: A Formal Implementation of Projective Discourse Representation Theory
Noortje Venhuizen and Harm Brouwer (University of Groningen)

Room 5, 10.10-10.30

Presuppositions are a fundamental aspect of linguistic meaning since they relate the content of an utterance to the unfolding discourse. A definite description like "the man", for example, introduces a male discourse entity that is salient in the current discourse context. Yet, in many formal semantic accounts, these phenomena have been treated as deviations from standard meaning construction, e.g., by resolving them only in a second step of processing (van der Sandt, 1992; Geurts, 1999). This is unsatisfactory for a number of reasons, mainly because it precludes an incremental compositional treatment of projection phenomena such as presuppositions. Therefore, we have recently proposed Projective DRT (PDRT), which builds upon the widely used Discourse Representation Theory (Kamp, 1981; Kamp & Reyle, 1993). PDRT deals with projected content during discourse construction, without losing any information about where information is introduced or interpreted (Venhuizen et al., 2013).

In order to evaluate its properties and applications, we implemented PDRT as a Haskell library, called "PDRT-sandbox". This library includes machinery for representing PDRSs as well as traditional DRSs, a translation from PDRT to DRT and FOL, and the different types of merge operations for (P)DRSs. Unresolved structures with lambda-abstractions (cf. Muskens, 1996) are implemented as pure Haskell functions, thereby exploiting Haskell's lambda-theoretic foundations. Several (P)DRS representations can be chosen as target output, including the well-known box representations, a set-theoretic notation and flat (i.e., non-recursive) table structures. PDRT-sandbox provides a comprehensive toolkit for use in NLP applications, and paves the way for a more thorough understanding of the behaviour of projection phenomena in language.
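
For readers unfamiliar with PDRT, the sketch below renders the basic idea, every referent and condition carrying a pointer to the context where it is interpreted, as a small Python structure (this is an illustration of ours, not the PDRT-sandbox API, which is a Haskell library):

  # Not the PDRT-sandbox API; just a minimal Python rendering of the idea that every
  # referent and condition carries a projection pointer to the context where it is interpreted.
  from dataclasses import dataclass, field

  @dataclass
  class PDRS:
      label: int
      referents: list = field(default_factory=list)   # (pointer, variable)
      conditions: list = field(default_factory=list)  # (pointer, condition string)

  # "The man smiles": the definite's referent and descriptive content project (pointer 1),
  # while the asserted content stays local (pointer 2).
  p = PDRS(label=2,
           referents=[(1, "x")],
           conditions=[(1, "man(x)"), (2, "smile(x)")])
  print(p)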


CLiPS Stylometry Investigation (CSI) Corpus: A Dutch corpus for the detection of age, gender, personality, sentiment and deception in text
Ben Verhoeven & Walter Daelemans (CLiPS, University of Antwerp)

Poster session, 12.15-13.20

Research in computational stylometry has always been constrained by the limited availability of training data since collecting textual data with the appropriate meta-data requires a large effort. We present the CLiPS Stylometry Investigation (CSI) corpus, a new Dutch corpus containing reviews and essays written by university students. It is designed to serve multiple purposes: detection of age, gender, authorship, personality, sentiment, deception and genre. Another major advantage is its planned yearly expansion with each year’s new students. The corpus currently contains about 305,000 tokens spread over 749 documents. The average review length is 128 tokens; the average essay length is 1126 tokens. The corpus will be made available on the CLiPS website (www.clips.ua.ac.be) and can freely be used for academic research purposes.

An initial deception detection experiment was performed on this data. Deception detection is the task of automatically classifying a text as being either truthful or deceptive, in our case by examining the writing style of the author. This task has never been investigated for Dutch before. We performed a supervised machine learning experiment using the SVM algorithm in a 10-fold cross-validation setup. The only features were the token unigrams present in the training data. Using this simple method, we reached a state-of-the-art F-score of 72.2%.
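
The experimental setup can be sketched as follows with scikit-learn (toy data stands in for the CSI corpus):

  # Sketch of the setup with scikit-learn: token unigrams, a linear SVM, 10-fold cross-validation.
  # The toy data below merely stands in for the CSI corpus.
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.svm import LinearSVC
  from sklearn.pipeline import make_pipeline
  from sklearn.model_selection import cross_val_score

  docs = (["dit hotel was werkelijk fantastisch, echt een aanrader"] * 10
          + ["het eten was koud en de bediening onbeleefd"] * 10)
  labels = ["deceptive"] * 10 + ["truthful"] * 10

  model = make_pipeline(CountVectorizer(), LinearSVC())
  scores = cross_val_score(model, docs, labels, cv=10, scoring="f1_macro")
  print(scores.mean())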


Extracting cause-effect relations in Natural Language Text
Giuseppe Vettigli, Antonio Sorgente and Francesco Mele (Institute of Cybernetics "Eduardo Caianiello" of the National Research Council, Italy)

Room 3, 9.50-10.10

The extraction of causal relations from English sentences is an important step in improving many Natural Language Processing applications, such as question answering and document summarization; in particular, it enables reasoning about the detected events.

The automatic extraction of causal relations is also a very difficult task, because English poses hard problems for the detection of causal relations. There are few explicit lexico-syntactic patterns that correspond exactly to a causal relation, while a huge number of constructions can evoke a causal relation without doing so unambiguously.

Most existing approaches for discovering causal relations are centered on the extraction of pairs of words without discriminating between causes and effects, and are mainly focused on particular application domains.

In this talk we will present a brief overview of existing work on this topic and a novel approach which combines rule-based and machine learning methodologies to identify, within a sentence, the word pairs that stand in a cause-effect relation. The rules are based on the relations in the dependency tree of the sentence and are supported by a set of lexico-syntactic patterns to detect the sentences that contain causal relations. The result of the rules is filtered using a statistical classifier trained with lexical, semantic and dependency features. The performance of our method on a test set from an independent domain will be discussed and compared with that of other existing methods.
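
A toy example of the kind of dependency-based rule involved (the pattern and the verb list are illustrative; the actual system combines a richer rule set with a trained statistical filter):

  # Illustrative dependency-based rule for explicit causal verbs ("X causes Y");
  # the actual system combines a richer rule set with a trained statistical filter.
  CAUSAL_VERBS = {"cause", "causes", "caused", "trigger", "triggers", "triggered"}

  def extract_cause_effect(dependencies):
      """dependencies: list of (head, relation, dependent) triples for one sentence."""
      pairs = []
      for head, rel, dep in dependencies:
          if head in CAUSAL_VERBS and rel == "nsubj":
              cause = dep
              for h2, rel2, dep2 in dependencies:
                  if h2 == head and rel2 in ("dobj", "obj"):
                      pairs.append((cause, dep2))
      return pairs

  deps = [("causes", "nsubj", "smoking"), ("causes", "dobj", "cancer")]
  print(extract_cause_effect(deps))   # [('smoking', 'cancer')]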


Fryske Staveringshifker: spelling correction for the Frisian language
Dennis de Vries (GridLine BV), Anne Dykstra (Fryske Akademy) & Hindrik Sijens (Fryske Akademy)

Poster session, 12.15-13.20

The Fryske Akademy and GridLine developed a spell checker for the Frisian language. The spell checker is part of the Frisian Language Web (FLW), a website that provides Frisian language tools. Apart from the spell checker, the FLW also contains a machine translator and a dictionary portal.

The spell checker is dictionary based and uses the standard Frisian word list for determining spelling corrections. To increase the quality of the spelling suggestions, we incorporated knowledge about common typos and Frisian phonetics. The phonetic rules provide a preference for words that sound like the misspelled word.
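
Suggestion ranking of this kind can be sketched as follows (string similarity plus a bonus for phonetically similar candidates; the word list and phonetic key are toy stand-ins, not the actual Frisian rules):

  # Illustrative suggestion ranking: string similarity, with a preference for candidates
  # that share a (toy) phonetic key with the misspelling. Not the actual Frisian rules.
  from difflib import SequenceMatcher

  WORDLIST = ["wurd", "wurde", "wrâld", "wetter"]

  def phonetic_key(word):
      # hypothetical, heavily simplified "sounds-like" key
      return word.lower().replace("â", "a").replace("aa", "a").rstrip("e")

  def suggestions(misspelled, wordlist=WORDLIST, n=3):
      def score(candidate):
          similarity = SequenceMatcher(None, misspelled, candidate).ratio()
          bonus = 0.2 if phonetic_key(candidate) == phonetic_key(misspelled) else 0.0
          return similarity + bonus
      return sorted(wordlist, key=score, reverse=True)[:n]

  print(suggestions("wraald"))   # 'wrâld' ranked first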

The word list also contains non-standard words, linked to their standard Frisian spelling variants. This allows the spell checker to map such words directly to standard spelling suggestions, encouraging users to use the standard spelling.

The spell checker has a webservice based architecture, which means that it can be easily connected to other interfaces and integrated with other applications. At the moment the spell checker can be used through a web interface in the FLW and with a plugin in Microsoft Word. During the poster presentation, there will be a live demonstration of the Frisian Language Web.


How to aid the dialectal dictionary making progress by means of Linked Open Data
Eveline Wandl-Vogt (ICLTT, Austria) and Thierry Declerck (DFKI, Germany)

Room 2, 14.20-14.40

We present work on the transformation of a traditional, 100-year-old dialectal dictionary project (Dictionary of Bavarian dialects in Austria -- WBÖ) into a machine-readable and processable dictionary, so that it can be compiled and published in the linked open data (LOD) cloud.

One motivation of this work is the possibility to automatically cross-link dictionary data. We take advantage here of a property of dialectal dictionaries concerning the expression of meanings of entries: although conceived as monolingual reference works, dialectal dictionaries share with bilingual dictionaries the fact that they express the meanings of their entries in a different language. In our case, the meaning of Bavarian words is expressed using standard German forms. Those forms can then be used for linking to other language resources available in the LOD. For this purpose we developed a SKOS model to encode the whole dictionary structure. Additionally, we are using the lemon model, which is compatible with SKOS, for encoding linguistic and semantic information that can be associated with dictionary entries.
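
A minimal sketch of such an encoding with rdflib is given below (the namespace, the entry and the links are hypothetical, not the WBÖ data):

  # Minimal SKOS-style encoding sketch with rdflib (URIs and the entry are hypothetical).
  from rdflib import Graph, Literal, Namespace, URIRef
  from rdflib.namespace import RDF, SKOS

  WBOE = Namespace("http://example.org/wboe/")
  g = Graph()

  entry = WBOE["Semmel"]                        # a hypothetical dialectal entry
  g.add((entry, RDF.type, SKOS.Concept))
  g.add((entry, SKOS.prefLabel, Literal("Semmel", lang="bar")))
  g.add((entry, SKOS.definition, Literal("Brötchen", lang="de")))   # meaning in standard German
  g.add((entry, SKOS.closeMatch, URIRef("http://dbpedia.org/resource/Bread_roll")))

  print(g.serialize(format="turtle"))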

In the paper, we will present in detail the SKOS concept scheme as well as the representation of the linguistically analysed meanings of the WBÖ entries, and show how we link those to the DBpedia instantiation of Wiktionary, now available in the LOD.

We will also discuss changes in the lexicographic workflow that are connected to keywords like collaboration, interdisciplinarity, standardisation, data reusability and open data.


Towards a Lexicologically Informed Parameter Evaluation of Distributional Modelling in Lexical Semantics
Thomas Wielfaert, Kris Heylen, Jocelyne Daems, Dirk Speelman and Dirk Geeraerts (QLVL, University of Leuven)

Room 1, 14.00-14.20

Distributional models of semantics have become the mainstay of large-scale modelling of word meaning in statistical NLP (see Turney and Pantel 2010 for an overview). In a Word Sense Disambiguation task, identifying semantic structure is usually seen as a clustering problem where occurrences of a polysemous word have to be assigned to the ‘correct’ sense. As linguists, however, we are not interested solely in performance evaluation against some gold standard; rather, we want to investigate the precise relation between a word's distributional behaviour and its meaning. Given that distributional models are extremely parameter-rich, we want to assess how well and in which way a specific model can capture a lexicological description of semantic structure.

In this presentation, we discuss three tools we are developing for a lexicological assessment of distributional models. Firstly, we are creating our own lexicologically informed 'gold standard' of disambiguated noun occurrences, based on the ANW (Algemeen Nederlands Woordenboek) and a random sample from two large-scale Belgian (1.3G) and Netherlandic (500M) Dutch newspaper corpora. Secondly, we are developing a visualisation tool to analyse the impact of parameter settings on the semantic structure captured by a distributional model. Thirdly, we have adapted a clustering quality measure (McClain & Rao 1975) to assess how well a manual disambiguation is captured by a distributional model, independently from a specific clustering algorithm. Similar to Lapesa and Evert's (2013) parameter sweep for a type-level model on semantic priming data, we are striving towards a large-scale parameter evaluation for token-level models on sense-annotated occurrences.
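
The clustering quality measure can be sketched as follows, using the McClain-Rao index as commonly defined, i.e. the mean within-cluster distance divided by the mean between-cluster distance (our own illustration on toy vectors):

  # McClain-Rao index as commonly defined: mean within-cluster distance divided by
  # mean between-cluster distance (lower = sense labels better separated in the space).
  # Here applied to token vectors with manually assigned sense labels.
  import numpy as np
  from itertools import combinations

  def mcclain_rao(vectors, labels):
      within, between = [], []
      for i, j in combinations(range(len(vectors)), 2):
          d = np.linalg.norm(vectors[i] - vectors[j])
          (within if labels[i] == labels[j] else between).append(d)
      return (np.mean(within) / np.mean(between)) if within and between else float("nan")

  # Toy example: two senses, occurrences of each sense clustered together.
  vecs = np.array([[0, 0], [0.1, 0], [5, 5], [5, 5.1]])
  senses = ["sense1", "sense1", "sense2", "sense2"]
  print(mcclain_rao(vecs, senses))   # well below 1: the senses are well separated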

  • Lapesa, Gabriella and Stefan Evert. September 2013. Thematic Roles and Semantic Space. Insights from Distributional Semantic Models. Oral presentation during Quantitative Investigations in Theoretical Linguistics (QITL-5), Leuven.
  • McClain, John O. and Vithala R. Rao. 1975. Clustisz: A program to test for the quality of clustering of a set of objects. In: Journal of Marketing Research, Vol 12 (4): 456–460.
  • Turney, Peter D. and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. In: Journal of Artificial Intelligence Research, Vol 37: 141–188.

Conversions between D and MCFG_wn: Logical characterizations of the Mildly Context-Sensitive Languages
Gijs Wijnholds (Institute for Logic, Language and Computation)

Room 4, 13.40-14.00

Displacement Calculus is a type of categorial grammar that exploits the use of string separation. We link this concept to Well-Nested Multiple Context-Free Grammars (MCFG_wn), which are grammars that act on tuples of strings. By showing that two variants of Displacement Calculus are weakly equivalent to MCFG_wn, i.e. describe the same string languages, we obtain two logical characterizations of the Mildly Context-Sensitive Languages. Next to the main results, we show a constructive lexicalization result for MCFG_wn, and we find that there is a trade-off between the descriptional complexity of connectives and that of deduction rules.
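
As a standard textbook illustration of a grammar acting on string tuples (not taken from the talk), a well-nested 2-MCFG for the non-context-free language {a^n b^n c^n d^n | n ≥ 1} is:

  % A standard textbook example (not from the talk): a well-nested 2-MCFG for
  % the non-context-free language {a^n b^n c^n d^n | n >= 1}.
  \begin{align*}
  S(x_1 x_2) &\leftarrow A(x_1, x_2)\\
  A(ab,\ cd) &\\
  A(a x_1 b,\ c x_2 d) &\leftarrow A(x_1, x_2)
  \end{align*}

Each predicate here manipulates a pair of strings; S concatenates the two components, yielding e.g. aabbccdd for n = 2.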


RELISH LMF: unlocking the full power of the Lexical Markup Framework
Menzo Windhouwer (The Language Archive DANS)

Poster session, 12.15-13.20

In 2008 ISO TC 37 published the Lexical Markup Framework (LMF) standard (ISO 24613:2008; www.lexicalmarkupframework.org). Based on input from many experts in the field, a core model and a whole series of extensions were specified in the form of UML class diagrams. For a specific lexicon one selects the extensions needed and adorns the resulting model with data categories taken from the ISOcat Data Category Registry (www.isocat.org). The standard supports interoperability of lexica on various levels: it establishes patterns for lexica with specific purposes and also defines the terminology associated with them. However, on the level of data exchange the standard provides less guidance. For example, the XML serialization specified by a DTD in an informative annex does not support the modular setup of LMF and also underspecifies the link with ISOcat. This poster will show RELISH LMF (tla.mpi.nl/relish/lmf), an alternative XML serialization of LMF based on Relax NG, a modern XML schema language, and Schematron, a rule-based XML validator. This serialization allows one to extend the LMF core model with selected extensions and one's own lexicon-specific extensions, to use TEI feature structure declarations and representations to specify and store the actual information, and to embed references to ISOcat for semantic interoperability. With these powerful features the RELISH LMF serialization supports and, on the XML level, unlocks the full power of LMF.


ISOcat and RELcat, two cooperating semantic registries
Menzo Windhouwer (The Language Archive DANS) and Ineke Schuurman (KU Leuven and U.Utrecht)

Room 5, 17.10-17.30

In ISOcat, a data category registry, metadata and linguistic concepts are defined. In order to prevent a proliferation of entries, the components of a complex tag are defined, not the full tags themselves. I.e., instead of one definition for a CGN tag like /VG(neven)/, definitions for its components /conjunction/ (VG) and /coordinate/ (neven) are entered, the clear benefit being that no definitions are needed for more complex tags, like /N(soort,ev,basis,zijd,stan)/, one of the representations of a 'soortnaam' (common noun). After all, many (most?) of these fine-grained tags are unlikely to be reusable in other languages.

But having broken up these tags, we nevertheless want to be able to relate our components /conjunction/ and /coordinate/ to entries like /coordinate conjunction/ or /coordinating conjunction/, or /noun/ and /common/ to /common noun/.

This can be done in RELcat, in which ontological relations, like sameAs, almostSameAs, partOf, superclassof, can be defined. Thus far, this has been done for one-to-one relationships. For CGN, more complex instantiations are needed. First, the relevant CGN-components are to be combined, and next, this combination is to be related to another ISOcat concept. And what about the full tags? Are they entailed in the relation as well?

In this presentation we will answer these questions and show how different granularity levels can be tackled to enable the possibilities for semantic crosswalks.
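
As a toy illustration of the combination step (the tag components, concept names and relations below are illustrative, not actual ISOcat or RELcat entries):

  # Illustrative only: CGN tag components mapped to concept names, and RELcat-style
  # relation triples combining them (identifiers are not real ISOcat PIDs).
  CGN_COMPONENTS = {
      "VG(neven)": ["conjunction", "coordinate"],
  }

  RELATIONS = [
      # (source concepts, relation, target concept)
      (("conjunction", "coordinate"), "almostSameAs", "coordinating conjunction"),
      (("noun", "common"), "sameAs", "common noun"),
  ]

  def related_concepts(tag):
      components = tuple(CGN_COMPONENTS.get(tag, []))
      return [target for source, rel, target in RELATIONS if source == components]

  print(related_concepts("VG(neven)"))   # ['coordinating conjunction']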


Online Personal Exploration and Navigation of SoNaR
Menno van Zaanen (Tilburg University), Matje van de Camp (Taalmonsters), Katrien Depuydt (INL), Jesse de Does (INL), Max Louwerse (Tilburg University), Jan Niestadt (INL), Martin Reynaert (Tilburg University) and Ko van der Sloot (Tilburg University)

Room 5, 16.50-17.10

SoNaR is a reference corpus of over 500 million word tokens of contemporary written Dutch, containing texts from a wide range of genres. Its two million text files, requiring nearly 500 GB of disk space, have been annotated with lemmata, PoS tags, morphology and named entities, and have been further enhanced both syntactically in Lassy and semantically in DutchSemCor. The accompanying metadata amounts to an additional two million CMDI XML files.

In its current state, the corpus does not fit the needs of researchers in areas such as linguistics, cultural sciences, literary studies, or communication and media studies, due to the nature of the dataset. Handling millions of files requires specialized computational skills.

The CLARIN-NL OpenSoNaR project will make the SoNaR corpus available online to various user groups and eventually to the wider public. To this end, several user interfaces will be developed, ranging from simple search interfaces that allow for searching words in context to complex interfaces that allow for searching with specialized query languages, such as CQL. All interfaces can be accessed through a unified system, WhiteLab. The availability of search interfaces at multiple levels of complexity enables both novice and expert users to interact with the data in an easy and efficient way. The back-end for the corpus search tool is INL's BlackLab, a search library built on top of Lucene.

Users from the different user groups drive the development of the OpenSoNaR system by providing use cases that describe typical search scenarios in their respective fields. Based on these use cases, the search interface will be further developed.


Detecting Strikes in Historical Daily Dutch Newspapers
Kalliopi Zervanou (Radboud University Nijmegen), Marten Düring (Radboud University Nijmegen), Iris Hendrickx (Faculty of Arts, Radboud University Nijmegen) and Antal van den Bosch (Radboud University Nijmegen)

Room 1, 9.50-10.10

The increasing amount of digitised sources available for historical research has been gradually transforming historical research methods, by making new computational methods, stemming from dedicated research in language and information technologies, available to historians. An essential part of historical research lies in the detection of information in primary historical sources, such as newspapers and letters. In this work, our aim is the detection of such primary sources, both for associating source evidence with existing research and for supporting researchers working on a specific historical event. We focus on the historical daily Dutch newspaper archive of the National Library of the Netherlands and on strike events in the Netherlands in the 1980s. Using a manually compiled database of strikes in the Netherlands, we first attempt to find reports on those strikes in historical daily newspapers by automatically associating database records with articles in the daily press of the time covering the same strike. Then, we generalise our methodology to detect strike events in the press that are not currently covered by the strikes database, and in this way support the extension of secondary historical resources.


An Unstructured Distributional Semantic Model for Morphologically Rich Languages: The Dutch Case Study
Kalliopi Zervanou (Radboud University Nijmegen), Elias Iosif (Technical University of Crete, Greece) and Alexandros Potamianos (Technical University of Crete, Greece)

Room 4, 16.50-17.10

Semantic similarity is the building block for numerous applications of natural language processing, such as text categorisation, paraphrasing, machine translation and grammar induction. Distributional semantic models (DSMs) are based on the distributional hypothesis of meaning, assuming that semantic similarity between words is a function of the overlap of their linguistic contexts. DSMs can be categorised into unstructured models, which employ a bag-of-words approach, and structured models, which employ syntactic relationships between words. Current data-driven approaches to estimating semantic similarity have mainly focused on the English language (such as the SemEval sentence-level semantic similarity challenges).

In this work, we investigate the performance in Dutch of an unstructured and language-agnostic DSM algorithm proposed by Iosif and Potamianos (2013) [1]. This unsupervised model for estimating semantic similarity uses no linguistic resources other than a corpus created via web queries, and is based on estimates of smoothed co-occurrence and context similarity statistics over semantic networks and their respective semantic neighbourhoods. Our goal in this work is twofold: first, we aim to investigate how this network DSM performs in Dutch, a language characterised by a richer morphology than English; second, we aim to investigate the particular role of morphology in performance and, more specifically, to what extent similarity estimates are affected by morphological richness.
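
The unstructured, bag-of-words starting point of such a model can be sketched as follows (a toy corpus; the paper's model additionally uses web-harvested data and semantic neighbourhoods):

  # Unstructured DSM in miniature: window-based co-occurrence counts and cosine similarity
  # (toy corpus; the paper's model uses web-harvested data and semantic neighbourhoods).
  from collections import Counter, defaultdict
  import math

  corpus = [
      "de kat jaagt op de muis",
      "de hond jaagt op de kat",
      "de muis eet de kaas",
  ]

  window = 2
  vectors = defaultdict(Counter)
  for sentence in corpus:
      tokens = sentence.split()
      for i, w in enumerate(tokens):
          for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
              if i != j:
                  vectors[w][tokens[j]] += 1

  def cosine(a, b):
      dot = sum(a[k] * b[k] for k in a)
      norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
      return dot / norm if norm else 0.0

  print(cosine(vectors["kat"], vectors["hond"]))   # higher: similar contexts
  print(cosine(vectors["kat"], vectors["kaas"]))   # lower: dissimilar contexts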

  1. E. Iosif and A. Potamianos. 2013. Similarity computation using semantic networks created from web-harvested data. Natural Language Engineering (DOI: 10.1017/S1351324913000144).