A Broad Evaluation of Techniques for Automatic Acquisition of Multiword Expressions

Dataset used for ACL 2012 Student Research Workshop paper

About this data:

Several approaches have been proposed for the automatic acquisition of multiword expressions from corpora. However, there is no agreement about which of them offers the best cost-benefit ratio, as they have been evaluated on different datasets and/or languages. To address this issue, we compare these techniques along the following dimensions: expression type (compound nouns, phrasal verbs), language (English, French) and corpus size.

This directory contains the evaluation results (mean average precision, MAP) for each tool and each configuration, as well as the execution times of each step of each tool. It also contains the corpora from which the candidates were extracted, the references used for evaluation, and the scripts used to run the experiments.
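
As an illustration of the MAP measure mentioned above, here is a minimal Python sketch. It is not the evaluation code shipped in this directory. It assumes that each tool configuration yields one or more ranked candidate lists, that a candidate counts as correct when it appears in the reference set, and that average precision is normalised by the number of correct candidates found in the list (other normalisations exist). The function names and the toy data are hypothetical.

```python
def average_precision(ranked_candidates, reference):
    """Average of the precision values measured at each correct candidate.

    Assumption: normalised by the number of correct candidates found in
    the ranked list, not by the size of the reference.
    """
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked_candidates, start=1):
        if candidate in reference:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists, reference):
    """Mean of the average precision over several ranked lists."""
    if not ranked_lists:
        return 0.0
    total = sum(average_precision(r, reference) for r in ranked_lists)
    return total / len(ranked_lists)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    reference = {"give up", "look after"}
    ranking = ["give up", "go to", "look after"]
    print(average_precision(ranking, reference))
```

In the toy ranking, the correct candidates sit at ranks 1 and 3, so the average precision is the mean of 1/1 and 2/3.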

More details and result analysis can be found in the paper:

Carlos Ramisch, Vitor De Araujo and Aline Villavicencio. 2012. A Broad Evaluation of Techniques for Automatic Acquisition of Multiword Expressions. In Proceedings of the ACL 2012 Student Research Workshop, Jeju, Republic of Korea, Aug. ACL.

To reproduce the results:

  1. Install all the tools used in the comparison. Make sure that the pathnames in the first lines of the runEval.sh script match the locations where the tools are installed on your system. This might require some time and experience with the tools and with bash scripting. Please use the same versions as we did: unfortunately, we cannot guarantee that more recent versions are backward compatible with our experiments.
  2. Run the runEval.sh script.
    This will generate directories with the results for each tool (localmaxs/, mwetk/, nsp/, ucs/), as well as a times/ directory containing the execution times for each step of each tool. The full run takes around 24 hours, depending on the computer.
  3. Run
    This script collects the timing information generated by runEval.sh, together with the evaluation (MAP) results, and prints both.
  4. Run
    This script generates tables with the number of candidates shared by each pair of tools, one table per corpus size, candidate type (noun compound or verb-particle), and language (en or fr).
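
The pairwise comparison described in step 4 can be sketched in Python as follows. This is only an illustration, not the script distributed here. It assumes that the candidates of each tool have already been loaded into a list of strings for a single corpus size, candidate type and language; the function name and the toy candidates are hypothetical, while the tool names are those of the result directories.

```python
from itertools import combinations


def pairwise_overlap(candidates_by_tool):
    """Count the candidates shared by every pair of tools.

    candidates_by_tool maps a tool name to its list of candidates.
    Returns a dict mapping (tool_a, tool_b) to the size of the
    intersection, with tool names in alphabetical order.
    """
    sets = {tool: set(cands) for tool, cands in candidates_by_tool.items()}
    return {
        (a, b): len(sets[a] & sets[b])
        for a, b in combinations(sorted(sets), 2)
    }


if __name__ == "__main__":
    # Toy candidates, invented for illustration only.
    candidates = {
        "localmaxs": ["give up", "look after", "go to"],
        "mwetk": ["give up", "look after"],
        "nsp": ["give up", "go to"],
        "ucs": ["look after"],
    }
    for (a, b), shared in pairwise_overlap(candidates).items():
        print(f"{a}\t{b}\t{shared}")
```

One such table would be produced for each combination of corpus size, candidate type and language.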