Joint Dependency Parsing and Multiword Expressions Tokenization

Dataset used for ACL 2015 paper

Authors: Alexis Nasr, Carlos Ramisch, José Deulofeu, André Valli
Last modification: 2015-05-28

Download the data set

DESCRIPTION

The MORPH dataset was built to allow evaluation of complex function word parsing in French. It contains around 100 sentences for each of 11 target constructions: 7 frequent ADV+que complex conjunctions and 4 de-DET complex determiners. Each sentence contains a single instance of the target construction, together with an annotation stating whether it is used as a complex function word (MORPH) or as a regular combination (OTHER).

DATA FORMAT

The sentences of each target construction are stored in a separate file, named after the target construction. For instance, file ainsi_que.txt contains annotations for ainsi que, a complex conjunction. The two construction types are stored in separate folders, "ADV-que" and "de-DET".

Each file is tab-separated text encoded in UTF-8, with the following two columns:

seg-annot: manual annotation of segmentation (MORPH or OTHER)
sentence-tok: tokenized sentence containing the target construction

The first column contains the annotation for the target construction in that sentence. Since there is only one occurrence of the target construction in the sentence, we do not indicate the position in the sentence to which this annotation corresponds. The two possible values are MORPH, for a complex function word use (there is a MORPH dependency link between the words of the construction), and OTHER, for any other use (there is a regular syntactic structure which does not include a MORPH link).

The second column contains the sentence itself, with the target construction. The sentence is tokenized because it was extracted from the POS-tagged frWaC corpus. For the same reason, the sentences may contain spelling errors: the texts in frWaC were crawled automatically from websites.
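
As an illustration of the format described above, the following is a minimal Python sketch for reading one annotation file. The function name read_annotations is hypothetical and not part of the released dataset. The sketch assumes that tokens in the second column are separated by spaces, and it skips any line whose first column is not MORPH or OTHER (for example a header line, if the files have one). The toy input reuses the two "tant que" sentences quoted in this document; their tokenization here is a guess.

```python
import csv
import io
from collections import Counter


def read_annotations(handle):
    """Yield (label, tokens) pairs from a two-column tab-separated file."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2 or row[0] not in ("MORPH", "OTHER"):
            continue  # skip blank lines and a header line, if any
        # Assumption: tokens are separated by whitespace.
        yield row[0], row[1].split()


# Toy input (not taken from the released files).
sample = io.StringIO(
    "OTHER\tje voudrais tant que tu m' aimes\n"
    "OTHER\tmange tant que tu veux\n"
)
rows = list(read_annotations(sample))
counts = Counter(label for label, _ in rows)
print(counts)
```

For a real file, the handle could be opened with open(path, encoding="utf-8", newline="") and passed to the same function.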

DATA COLLECTION AND ANNOTATION

The sentences were extracted from the frWaC corpus, a 1.6B-word corpus of texts crawled from the web. We used the POS-tagged version of the corpus, which is available on request from the WaCky website. We selected sentences based on the following criteria:

The sentence should contain exactly one occurrence of the target construction type (e.g., no more than one de-DET determiner, regardless of its type).
The sentence should contain between 10 and 20 words, in order to provide enough context for annotation without being too long. Overly long sentences may include irrelevant material that only slows annotation down.
For de-DET constructions, we also required that a verb precede the preposition "de". The verb may appear several words earlier; it is not necessarily adjacent to the target construction. This reduces the number of nominal complements, like "président de la république", and favors the occurrence of determiner/prepositional phrase ambiguity.
Some sentences were manually removed during annotation because they contained too much noise (typos, grammar errors) or because there was not enough context to decide on the correct annotation.
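
The automatic criteria above can be sketched as a filter over POS-tagged sentences. This is an illustrative reconstruction, not the script that was actually used. The function name keep_sentence, the representation of a sentence as (word, tag) pairs, the representation of a construction as a pair of adjacent lowercase words, the inclusive 10 to 20 bounds counted over all tokens, and the "VER" tag prefix for verbs are all assumptions. The toy sentence and its tags are invented.

```python
def keep_sentence(tagged, constructions, require_verb_before=False,
                  verb_prefix="VER"):
    """Return True if a tagged sentence passes the automatic criteria.

    tagged: list of (word, pos_tag) pairs.
    constructions: set of (first_word, second_word) pairs covering every
        construction of the target type, e.g. all de-DET determiners.
    """
    # Criterion: between 10 and 20 words (assumed inclusive, all tokens).
    if not 10 <= len(tagged) <= 20:
        return False
    words = [word.lower() for word, _ in tagged]
    hits = [i for i in range(len(words) - 1)
            if (words[i], words[i + 1]) in constructions]
    # Criterion: exactly one occurrence of the construction type.
    if len(hits) != 1:
        return False
    # Criterion (de-DET only): some verb occurs before the construction,
    # not necessarily adjacent to it.
    if require_verb_before:
        return any(tag.startswith(verb_prefix) for _, tag in tagged[:hits[0]])
    return True


# Invented toy sentence; "x" tokens are fillers to reach the minimum length.
toy = [("il", "PRO"), ("mange", "VER"), ("de", "PRP"), ("la", "DET"),
       ("soupe", "NOM")] + [("x", "X")] * 6
de_det = {("de", "la")}  # hypothetical subset of the de-DET constructions
print(keep_sentence(toy, de_det, require_verb_before=True))
```

The last criterion in the list (manual removal of noisy or ambiguous sentences) was applied by the annotators and has no automatic counterpart in this sketch.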

Annotation was performed by two experts in French syntax. They went through each set of sentences independently. Each target construction occurrence was judged as either MORPH or OTHER. The annotation class OTHER may cover several readings. For example, the construction tant que may represent a regular adverb followed by a subordinate clause ("je voudrais tant que tu m'aimes") or a comparative ("mange tant que tu veux"). This distinction was not made in the annotation, since we are only interested in the MORPH/regular distinction. After a first pass, the annotators cross-checked each other's sentences. Divergences were discussed and, if no consensus was reached, the sentence was discarded.

EVALUATION

The released dataset also contains a script called eval-morph.sh, which compares a parser's output in CONLL07 format with the annotated files. An example of parsing output is provided in file ainsi_que.conll07. You can run the evaluation script as follows:

./eval-morph.sh ainsi_que.conll07 ADV-que/ainsi_que.txt ainsi que

This will provide, in addition to precision and recall statistics, an error analysis of the cases missed by the parser (differences with respect to the annotation). To evaluate other parser outputs, provide them in the same CONLL07 format as the example, in which the words of a complex function word are linked by a dependency called MORPH.
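
To make the comparison concrete, here is a rough Python sketch of the kind of check such an evaluation involves. It is not the logic of eval-morph.sh, whose internals are not described here. It assumes the standard CoNLL-2007 column layout (ID, FORM, ..., HEAD in column 7, DEPREL in column 8), treats a sentence as predicted MORPH when the two adjacent words of the construction are linked by a MORPH dependency in either direction, and computes precision and recall for the MORPH class. All function names, the toy parse, and its dependency labels other than MORPH are invented.

```python
def read_conll(lines):
    """Yield sentences as lists of column lists; blank lines separate them."""
    sentence = []
    for line in lines:
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.rstrip("\n").split("\t"))
    if sentence:
        yield sentence


def predicts_morph(sentence, first, second):
    """True if adjacent tokens first+second are joined by a MORPH link."""
    for left, right in zip(sentence, sentence[1:]):
        if len(left) < 8 or len(right) < 8:
            continue  # malformed row
        if left[1].lower() == first and right[1].lower() == second:
            # Assumption: the link direction is not fixed, so accept both.
            if (right[6] == left[0] and right[7] == "MORPH") or \
               (left[6] == right[0] and left[7] == "MORPH"):
                return True
    return False


def precision_recall(gold, predicted, positive="MORPH"):
    """Precision and recall of the positive class over aligned label lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == p == positive)
    n_pred = sum(1 for p in predicted if p == positive)
    n_gold = sum(1 for g in gold if g == positive)
    return (tp / n_pred if n_pred else 0.0,
            tp / n_gold if n_gold else 0.0)


# Invented two-sentence parse; only the MORPH label comes from this document.
toy_parse = (
    "1\tPaul\t_\tN\tN\t_\t2\tSUJ\t_\t_\n"
    "2\tdort\t_\tV\tV\t_\t0\tROOT\t_\t_\n"
    "3\tainsi\t_\tADV\tADV\t_\t2\tMOD\t_\t_\n"
    "4\tque\t_\tC\tC\t_\t3\tMORPH\t_\t_\n"
    "\n"
    "1\tainsi\t_\tADV\tADV\t_\t3\tMOD\t_\t_\n"
    "2\tque\t_\tC\tC\t_\t3\tOBJ\t_\t_\n"
    "3\tdort\t_\tV\tV\t_\t0\tROOT\t_\t_\n"
).splitlines()
predicted = ["MORPH" if predicts_morph(s, "ainsi", "que") else "OTHER"
             for s in read_conll(toy_parse)]
gold = ["MORPH", "MORPH"]  # invented gold labels for the toy parse
print(predicted, precision_recall(gold, predicted))
```

The sketch relies on the parser output and the annotation file listing the sentences in the same order, which is also an assumption.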

