Compositionality of Nominal Compounds - Datasets

Description

This package contains numerical judgements by native speakers about the compositionality of 180 nominal compounds in English (EN), French (FR) and Brazilian Portuguese (PT).

Judgements were obtained using Amazon Mechanical Turk (EN and FR) and a web interface for volunteers (PT). Every compound has 3 scores: compositionality of the head word, compositionality of the modifier word and compositionality of the whole compound. Scores range from 1 (fully idiomatic) to 5 (fully compositional) and are averaged over several annotators (around 10 to 20, depending on the language). All compounds in FR and PT, and 90 compounds in EN, also have synonyms and similar expressions given by annotators.

The datasets are described in detail and used in the experiments of the papers below. Please cite one of them if you use this material in your research.

Our methodology is inspired by Reddy, McCarthy and Manandhar (2011). For the analyses in our papers, we added their set of 90 compounds and judgements to our EN data. However, we do not redistribute their dataset here. Please download their data and cite their paper if you use the full EN dataset.

Quick start

If you only want to use our datasets to evaluate your compositionality prediction models, you are probably interested in the scores in the column named compositionality of the following files:
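
As an illustration of that use case, the sketch below reads the compositionality column from one averaged file and computes Spearman's rank correlation against a model's predictions. Only the column name compositionality comes from this README. The comma delimiter, the `compound` key column, the toy rows and the `predictions` dictionary are assumptions made for the example, so check the header of the actual file before relying on them.

```python
import csv
import io


def load_gold(handle, key_col="compound", score_col="compositionality",
              delimiter=","):
    """Map each compound to its averaged whole-compound score.

    key_col and delimiter are assumptions; adapt them to the real header.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {row[key_col]: float(row[score_col]) for row in reader}


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def evaluate(gold, predictions):
    """Correlate predictions with gold scores on the shared compounds."""
    shared = sorted(set(gold) & set(predictions))
    return spearman([gold[c] for c in shared],
                    [predictions[c] for c in shared])


# Toy stand-in for an averaged annotation file (made-up rows, not real data).
toy_file = io.StringIO(
    "compound,compositionality\n"
    "alpha beta,1.2\n"
    "gamma delta,3.4\n"
    "epsilon zeta,4.8\n"
)
gold = load_gold(toy_file)
predictions = {"alpha beta": 0.10, "gamma delta": 0.55, "epsilon zeta": 0.90}
rho = evaluate(gold, predictions)
```

With a real file, replace `toy_file` by an opened file handle, for instance `open(path, newline="", encoding="utf-8")`. Rank correlation is used here because model outputs rarely share the 1 to 5 scale of the human scores.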

Folders

Files post-processing

The following commands were executed in order to create most files in the annotations folder.

# _Generation of unfiltered, averaged files (e.g. annotations/en.unfiltered.csv) used in ACL long and short papers_

../bin/filter-answers.py --zscore-thresh=10000000 --spearman-thresh=-1 \
    --batch-file en.raw.csv --lang en > en.unfiltered.csv 2> en.unfiltered.log
../bin/filter-answers.py --zscore-thresh=10000000 --spearman-thresh=-1 \
    --batch-file fr.raw.csv --lang fr > fr.unfiltered.csv 2> fr.unfiltered.log
../bin/filter-answers.py --zscore-thresh=10000000 --spearman-thresh=-1 \
    --batch-file pt.raw.csv --lang pt > pt.unfiltered.csv 2> pt.unfiltered.log

# _Generation of filtered, averaged files (e.g. annotations/en.filtered.csv) used in MWE workshop paper_

../bin/filter-answers.py --zscore-thresh=2.2 --spearman-thresh=0.5 \
    --batch-file en.raw.csv --lang en > en.filtered.csv 2> en.filtered.log
../bin/filter-answers.py --zscore-thresh=2.5 --spearman-thresh=0.5 \
    --batch-file fr.raw.csv --lang fr > fr.filtered.csv 2> fr.filtered.log
../bin/filter-answers.py --zscore-thresh=2.2 --spearman-thresh=0.5 \
    --batch-file pt.raw.csv --lang pt > pt.filtered.csv 2> pt.filtered.log

# _Generation of graphics and evaluation of datasets (e.g. annotations/en.filtered.quality)_

../bin/intrinsic-quality-dataset.py --avg-file en.unfiltered.csv 2> en.unfiltered.quality
../bin/intrinsic-quality-dataset.py --avg-file en.filtered.csv 2> en.filtered.quality
../bin/intrinsic-quality-dataset.py --avg-file fr.unfiltered.csv 2> fr.unfiltered.quality
../bin/intrinsic-quality-dataset.py --avg-file fr.filtered.csv 2> fr.filtered.quality
../bin/intrinsic-quality-dataset.py --avg-file pt.unfiltered.csv 2> pt.unfiltered.quality
../bin/intrinsic-quality-dataset.py --avg-file pt.filtered.csv 2> pt.filtered.quality
mkdir -p graphics
mv *.pdf graphics

Note: Data may differ slightly from the figures reported in the papers because we added some new annotations after the papers were written.
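
The two thresholds passed to filter-answers.py are not documented in this README. One plausible reading, sketched below as an assumption and not as the script's actual logic, is that --zscore-thresh drops individual answers lying too many standard deviations from a compound's mean before the scores are averaged. Under that reading, the values used for the unfiltered files (10000000 and -1) would effectively disable filtering, since no z-score reaches the first and no Spearman correlation falls below the second. The function name and the toy answers are hypothetical.

```python
from statistics import mean, pstdev


def filtered_average(scores, zscore_thresh):
    """Average scores after dropping those with |z| above the threshold.

    Assumed behaviour for illustration only; the real script may differ.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return mu  # all annotators agree, nothing to filter
    kept = [s for s in scores if abs(s - mu) / sigma <= zscore_thresh]
    return mean(kept)


# Hypothetical answers from ten annotators for one compound, on the 1-5 scale.
answers = [5, 5, 4, 5, 5, 4, 5, 5, 5, 1]

# A huge threshold keeps every answer, as in the unfiltered commands.
unfiltered = filtered_average(answers, zscore_thresh=10000000)

# A threshold of 2.2 drops the single answer of 1 before averaging.
filtered = filtered_average(answers, zscore_thresh=2.2)
```

The Spearman threshold is not covered by this sketch. It presumably applies to whole annotators, not to single answers, but that is also an assumption.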


LexSubNC - Lexical Substitution of Nominal Compounds in Portuguese

Description

This package extends the original compositionality datasets with more detailed annotation of the Portuguese lexical substitution candidates. It contains the same 180 Portuguese nominal compounds as the compositionality dataset. It additionally contains frequency and PMI values computed on a large Brazilian Portuguese corpus (around 1.2 billion words), as well as lexical substitutes annotated according to the following categories:

The lexical substitutes were provided by volunteer native-speaker annotators, who were asked to suggest substitution candidates for the compounds in context. The suggestions from all annotators were pooled and sorted by frequency. A linguist then manually categorized this pool, assigning a category to each distinct substitution candidate.
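
The pooling step described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the normalization (lowercasing, stripping spaces), the choice to count a suggestion once per annotator, and the toy suggestions are all assumptions.

```python
from collections import Counter


def pool_suggestions(per_annotator):
    """Merge all annotators' suggestions and sort by decreasing frequency."""
    counts = Counter()
    for suggestions in per_annotator:
        # Assumption: count each suggestion once per annotator,
        # ignoring case and surrounding spaces.
        counts.update({s.strip().lower() for s in suggestions})
    return counts.most_common()


# Hypothetical suggestions from three annotators for a single compound.
per_annotator = [
    ["substitute a", "substitute b"],
    ["Substitute A"],
    ["substitute a", "substitute c", "substitute b"],
]
pool = pool_suggestions(per_annotator)
```

The resulting list of (candidate, frequency) pairs is the kind of pool that could then be handed to a linguist for manual categorization.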

The folder contains the following files:

The details about this data can be found in our IWCS 2017 paper:

Note: Data may differ slightly from the figures reported in the papers because we added some new annotations after the papers were written.
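
This README does not say how the PMI values mentioned in the description were computed. For reference, the standard pointwise mutual information of a two-word compound w1 w2 is the logarithm of p(w1 w2) / (p(w1) p(w2)). The sketch below estimates it from raw corpus counts. The base-2 logarithm, the maximum-likelihood probability estimates and the toy counts are assumptions, and the released values may have been computed differently.

```python
import math


def pmi(count_pair, count_w1, count_w2, corpus_size):
    """Pointwise mutual information (in bits) from raw counts."""
    p_pair = count_pair / corpus_size
    p_w1 = count_w1 / corpus_size
    p_w2 = count_w2 / corpus_size
    return math.log2(p_pair / (p_w1 * p_w2))


# Toy counts, not taken from the dataset.
value = pmi(count_pair=32, count_w1=64, count_w2=128, corpus_size=1024)
```

A positive value means the two words co-occur more often than they would if they were independent, and a value of zero corresponds to independence.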