
The 9th edition of the Language Resources and Evaluation Conference, 26-31 May, Reykjavik, Iceland


Current List of LREC 2014 Shared LRs


After the conference, the Shared LRs set at LREC2014 was manually checked and a cleaned version of the list of LRs is now available. This list includes LRs complying with the following criteria:

  • LRs that are accessible (either uploaded by the participants or available through an external download URL they provided)
  • LRs categorized as datasets only. A dataset can be a:
    • Corpus,
    • Grammar/Language Model,
    • Ontology,
    • Terminology,
    • Treebank,
    • Evaluation Data / Package.

Excluded LRs are:

  • LRs whose uploaded content did not correspond to the description
  • LRs with no download URL provided, or whose URL is now a dead link
  • LRs categorized as tools or guidelines
  • LRs associated with rejected papers.

 

We added a new field to the metadata: “Conditions of use”. Its value indicates the specific conditions of use provided by the submitter (such as Attribution, Non-commercial use, Share Alike, etc.).


 

Shared-LRs @ LREC 2014

  • Name
    A Colloquial Corpus of Japanese Sign Language
    Resource type
    Corpus
    Size
    3500 GByte
    Languages
    <Not Specified>
    Production status
    Newly created-in progress
    Resource usage
    Dialogue
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    We began building a corpus of Japanese Sign Language (JSL) in April 2011 with the support of the Japan Society for the Promotion of Science. The purpose of this project was to increase awareness of sign language as a distinctive language in Japan. This corpus is beneficial not only to linguistic research but also to hearing-impaired and deaf individuals, as it helps them to recognize and respect their linguistic differences and communication styles. This is the first JSL corpus developed for academic and public use. During the first stage of this project, from May to July 2012, we filmed 40 deaf subjects in two prefectures, Gunma and Nara, which are located about 50–100 km from Tokyo and Osaka, respectively. Each prefecture has one school for the deaf. We obtained data from an age-balanced sample of individuals 30–70 years of age in each prefecture, and each age group was divided into same-sex pairs. We used three approaches to collect data: interviews (for introductory purposes only), dialogues, and lexical elicitation. Each session, including our explanation of the ethical considerations and subjects’ provision of written consent, lasted 1.5 h.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    A Repository of State of the Art and Competitive Baseline Summaries for DUC 2004
    Resource type
    Corpus
    Size
    225 KByte
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Summarisation
    License
    <Not Specified>
    Conditions of use
    Attribution
    Description
    No description provided, see the related article
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    AcadOnto
    Resource type
    Ontology
    Size
    <Not Specified>
    Languages
    English (eng)
    Production status
    Newly created-in progress
    Resource usage
    Information Extraction, Information Retrieval
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    An academic domain ontology populated using the IIT Bombay organization corpus, the web, and linked open data.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Aix Map Task
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    French (fra)
    Production status
    Existing-used
    Resource usage
    Dialogue
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    This is a corpus of audio and video recordings of task-oriented dialogues. It was modelled after the original HCRC Map Task corpus. Lexical material was designed for the analysis of speech and prosody. The corpus was collected under two communicative conditions, one audio-only condition and one face-to-face condition. The recordings took place in a studio and a sound attenuated booth respectively, with head-set microphones (and in the face-to-face condition with two video cameras). The recordings have been segmented into Inter-Pausal-Units and transcribed using transcription conventions containing actual productions and canonical forms of what was said.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Alignment of Parallel Texts from Cyrillic to Latin
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    Romanian (ron)
    Production status
    Newly created-in progress
    Resource usage
    Transliteration
    License
    GNU
    Conditions of use
    You can redistribute it and/or modify it under the terms of the GNU General Public License, i.e., your modified versions must carry all the freedoms stated in the GPL.
    Description
    The text of the novel Sania (eng. The Sledge) served as a training corpus. It was written in 1955 by Ion Druță and originally printed in Cyrillic script. Using a previously developed recognition technology and specialized lexicons, we obtained an electronic version of the Cyrillic-script variant of the text. We applied the same procedure to the Latin-script variant of the same text, transliterated manually by expert linguists. This allowed us to automatically align the Cyrillic variant of the text with the contemporary Latin variant at the word/expression level. The process was semi-automated, based on heuristics for the transcription of letters and on validation by expert linguists. The corpus is annotated at sentence and word levels, providing morpho-lexical information using the UAIC Romanian Part of Speech Tagger (Simionescu, 2011).
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    An Arabic Twitter Corpus for Subjectivity and Sentiment Analysis
    Resource type
    Corpus
    Size
    26,724 tokens, 7,503 sentences
    Languages
    Arabic
    Production status
    Newly created-in progress
    Resource usage
    Text Mining
    License
    Open Source
    Conditions of use
    Attribution
    Description
    An Arabic Twitter data set of 7,503 tweets. The released data contains manual sentiment analysis annotations as well as automatically extracted features, saved in Comma Separated Values (CSV) and Attribute-Relation File Format (ARFF) files. Due to Twitter privacy restrictions, we replaced each original tweet with its ID.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Annotation of Syntactic Categories in Chinese Word Structures
    Resource type
    Corpus
    Size
    2.3 MByte
    Languages
    Mandarin Chinese (cmn)
    Production status
    Newly created-finished
    Resource usage
    Chinese word segmentation, POS-tagging, parsing
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    An Automatic Annotation of Syntactic Categories of Chinese Word Structures. The file contains annotated Chinese word structures with refined syntactic categories induced by the algorithms in the paper. Each line of the file contains the annotated word structure of a word type, which is a binarized tree in Penn Treebank format. The root of the tree is the POS tag of the word, which can be used for POS tagging and syntactic parsing. Each node has the format "Tag1ZZTag2ZZ..Tagn", in which "ZZ" is a delimiter between the multiple tags of the node. Such combinations of multiple syntactic tags are used as the categories of the nodes that represent word constituents (characters, subwords). The final annotation is publicly available at: http://www.sfs.uni-tuebingen.de/~jma/word_str.txt Our annotation uses two inputs: the POS-tagged sentences in the Penn Chinese Treebank and the branching and head directions in the word structures in http://ir.hit.edu.cn/ mszhang/data.zip.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Arabic Tweets NER test set
    Resource type
    Evaluation Data
    Size
    <Not Specified>
    Languages
    Arabic Colloquial Arabic
    Production status
    Newly created-finished
    Resource usage
    Named Entity Recognition
    License
    Research License
    Conditions of use
    Research only
    Description
    No description provided, see the related article
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    AuCoPro - Splitting
    Resource type
    Lexicon
    Size
    746 KByte
    Languages
    Afrikaans Dutch
    Production status
    Newly created-finished
    Resource usage
    Language Modelling
    License
    Creative Commons Attribution 3.0 Unported
    Conditions of use
    Attribution
    Description
    The AuCoPro-Splitting dataset contains compounds annotated with their compound boundaries and linking morphemes. The dataset consists of two files, one for Afrikaans and one for Dutch. The annotation was performed according to annotation guidelines as described in Verhoeven, van Zaanen, van Huyssteen, & Daelemans (2014).
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    BabelNet 2.0 as Linked Data
    Resource type
    Linked Data
    Size
    1 billion RDF triples
    Languages
    English (eng) French (fra) Italian (ita) German (deu) Spanish (spa) CA, IS, PL, RO, AF, AR, BG, CS, CY, DA, EL, ET, FA, FI, GA, HE, HI, HR, HU, ID, JA, KO, LT, LV, MS, NL, NO, PT, RU, SK, SL, SQ, SR, SV, SW, TL, TR, UK, VI, ZH, MT, EU, EO, GL, LA
    Production status
    Newly created-finished
    Resource usage
    Semantic Web
    License
    Creative Commons Attribution-Noncommercial-Share Alike 3.0 License (http://creativecommons.org/licenses/by-nc-sa/3.0/)
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    This resource corresponds to the publication of BabelNet 2.0 as Linked Data. BabelNet 2.0 is both a multilingual encyclopedic dictionary, with lexicographic and encyclopedic coverage of terms, and an ontology which connects concepts and named entities in a very large network of semantic relations, made up of more than 9 million entries. BabelNet 2.0 covers 50 languages and is obtained from the automatic integration of different lexical-semantic and encyclopedic resources. Its conversion to Linked Data is based on lemon, a Lexicon Model for Ontologies, complemented by SKOS. The RDF version of BabelNet contains approximately 1 billion triples.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Basque Postedition corpus
    Resource type
    Corpus
    Size
    50.204 words
    Languages
    Basque (eus)
    Production status
    Newly created-finished
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    CC-BY-SH
    Conditions of use
    Attribution, ShareAlike
    Description
    Corpus of raw and manual post-edited translations (50.204 words). It was created by manual post-editing of the Basque outputs given by Matxin RBMT system translating 100 entries from the Spanish Wikipedia.
    Download from
    Referring paper
    Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
    Edition
    SaLTMiL 2014
  • Name
    Benchmark Database of Phonetic Alignments
    Resource type
    Evaluation Data
    Size
    750 alignments
    Languages
    Jaqaru, Hǎrbīn, Zutendaal (BeLb), Château-d'Oex, Cerlatez, Zheglica (vid), Sadina (pop), Halden, Gradec (vd), Stambolovo (hask), Obdorsk Khanti, Nordstrand, Dàshí, French, Vasiljovo (tetev), Tihomir (krgr), Gabare (bslat), Gōngxìng, Boudry, Courtepin, Arnex, Xiānggǎng, Koprivshtica (pird), Merdanja (vtarn), Xiāngtàn, American English, Savièse, Zhelen (svog), Sandnessjøen, Garvan (sil), Conthey, Bogdanov dol (pern), Asparuhovo (lom), Jian'ou, Pelimka Mansi, Asserøy, Serbian, Schinnen (Lb), Aduard (Gn), Shipka (kaz), Zhōuchéng, Gjøra, Stange, Occitan, Brouwershaven (Ze), Landeron, Dragoevo (presl), Meerbeek (BeBr), Tiānjìn, Nova lovcha (gd), Grône, Mørkved, Kostenec (iht), Sredec (zlgr), Glozhene (orjah), New Zealand English (Auckland), Dolna beshovica (vrach), Dorkovo (velgr), Savagnier, Ruzhinci (belgr), Jīnxīng, Karaisen (pavl), Měixiàn, Rani lug (tryn), Vaugondry, Panagjurishte (gd), Kalipetrovo (sil), Naas, Eide, Kaldfarnes, Tihomirovo (stzag), Flåte, Smolsko (pird), Starmen (bel), Valthermond (Dr), Zelenigrad (tran), Moorslede (BeWv), Dompierre, Ameide (ZH), Plagne, Varsseveld (Gl), Momina banja (pl), Svetlina (topgr), Belgian Dutch, Zhèngzhōu, Dolna melna (tran), Vakh Khanti, Momkovo (svgr), Dichin (vtarn), Dolna srudena (bel), Kopilovci (mont), Huancané, Bryne, Le Sentier, Wénzhōu, Diva slatina (mont), Italian, Australian English (Perth), Chernogorovo (paz), Dolno levski (pan), Chevroux, Likrisovskoje Khanti, Sushica (blgr), Zanozhene (berk), Ligurian, Schagen (NH), Vartovskoje Khanti, Drabishna (ivgr), Luòběnzhuō, Shiroki dol (sam), Caraz, Dragodanovo (sliv), Javorovo (asgr), Canadian English, Lánzhōu, Dobroselec (topgr), Golema rakovica (elpel), Gabra (elpel), Hohhot, Shuri, Brielle (ZH), Voden (elh), Marikostinovo (petr), Taquile, Tūnxī, Dutch (Limburg), Côte-aux-Fées, Knokke (BeWv), Bjørnevatn, Radovene (vr), Nigerian English (Igbo), Korten (nzag), Dermanci (luk), Bistrica (blgr), Shtipsko (prov), German, Opan (stzag), Dragojchinci (kjust), Chéngdū, 
Nikolovo(lipnik) (rus), Yúnlóng, Central German (Murrhardt), Tena, Collombey, Lozen (sof), Sado, Kyoto, Stanghelle, Shiroka laka (dev), Lillehammer, Nánjīng, Stakevci (blgr), High German (Graubuenden), Kortrijk (BeWv), Wierum (Fr), Chimborazo, Bansko (razl), West Frisian (Grou), Tromsø, Longirod, Oostende (BeWv), Ēnqī, Huancavelica, Orsières, Zheljazkovo (sred), Malomirovo (elh), Courtedoux, Low German (Achterhoek), Rabisha (belgr), English (Lindisfarne), Hachijō, Shāntóu, Huhla (ivgr), Montpreveyres, Bollezeele (FrVl), Prahins, Galata (tetev), Selfors, Verkhne Kalimsk Khanti, Xiángyún, Senokos (blgr), Norwegian (Stavanger), Danish, Krivnja (razgr), Goljama zheljazna (tetev), Sugiez, English (Liverpool, Avry-sur-Matran, Swedish (Stockholm), Tokyo, Gorni varpishta (drjan), Mussel (Gn), Sombeval, Momchilovci (smol), Mandal, Trondheim, West-Terschelling (Fr), Svirkovo (harm), Russian, Zamfirovo (berk), Vladinja (lov), Fagerhaug, Kramolin (sevl), Low German (Bargstedt), Plakovo (vt), Nova nadezhda (hs), Sint-Annen (Gn), Buurmalsen (Gl), English (London, Héfēi, Renkum (Gl), Markovo (shum), Cerneux-Péquignot, Vranilovci (gabr), Central German (Cologne), Gorna rosica (sevl), Sekirovo (plov), Lombard (East), Mouscron (Be), Eggesbøneset, Cajamarca, Lower Lozva Mansi, Sùzhōu, Straldzha (jamb), Midsland (Fr), Buren (Fr), High German (Bodensee), Vullierens, Furen (vrach), Foldvik, Brunlanes, Amami, Vrachesh (botgr), Vermes, Hǎikǒu, Venetian, Scottish, Nieurlet (FrVl), Razboishte (god), Brekke, Zabardo (asgr), Chavín, Nijmegen (Gl), Guǎngzhōu, Stjørdal, Central German (Honigberg), Devenci (luk), Murist, Tjøtta, Chepelare (asgr), Běijīng, Ezerovo (parvom), Arconciel, Icelandic, Leermens (Gn), Sloten (Fr), Inkawasi, Ter Apel (Gn), Sherkali Khanti, Den Hoorn (NH), Stroevo (plov), Collex, Sørkjosen, Brønnøysund, Charnex, Merichleri (chirp), Pelatikovo (kjust), Zabernovo (mt), Yiddish (New York), Pavelsko (asgr), Osenec (razgr), Middle Lozva Mansi, Ivanski (shum), Omarchevo (nzag), 
Wateringen (ZH), Hundvåg, Zdravkovec (gabr), Koedijk (NH), Vallorbe, Petarnica (plev), Tremjugan Khanti, Champéry, North Mansi, Pevec (targ), Oruro, Orvin, Molde, Lourtier, Almkerk (NB), Silvolde (Gl), Kalojanovo (plov), Kantens (Gn), Cerovica (kjust), Dutch, High German (North Alsace), Lomnes, Zhaltusha (ard), Saparevo (dup), Dolna dikanja (radom), Slaveino (smol), Zevenaar (Gl), Kōchi, Govedarci (sam), Yīnchuàn, Nánchàng, Dragizhevo (vtarn), Skobelevo (sliv), Ürümqi, Nánníng, English (North Carolina), Commugny, Asparuhovo (prov), Borre, Balgari (carev), Ayent, Apolobamba, Dutch (Antwerp), Pozharevo (tutr), Lánpíng, Jìnán, Caparevo (sand), Ladino, Hisøy, Holwerd (Fr), Elnesvågen, Smochevo (dupn), Nikolovo (hask), Hvojna (asgr), Ustovo (sm), Vernier, Kerkrade (Lb), Sucre, Smilde (Dr), Bilzen (BeLb), Dobrotino (gd), Poperinge (BeWv), Borisovo (elh), Momina klisura (pz), Hángzhōu, Tiwanaku, Fyresdal, Ognen (karn), High German (Walser), Trancovica (nik), Kozichino (pom), Xi'an, Villars-le-Terroir, Paskalevec (pavl), Dokka, Sætre, Nizjam Khanti, St-Gingolph, Lamboing, Ormont-Dessus, Kazim Khanti, Central German (Luxembourg), Vabel (nik), Dobroslavci (sof), English, Indzhe vojvoda (mtarn), Xīníng, Mihalci (pavl), Trastenik (plev), Kovachevci (sam), Golica (varn), Jiànchuàn, Tynset sentrum, Izvorovo (harm), Upper Demjanka Khanti, Støa, Krogtoft, Borge, Scheveningen (ZH), Callantsoog (NH), Táoyuán, Mugla (dev), Grimentz, Aldomirovci (slivnica), High German (Herrlisheim), Táiběi, Hafrsfjord, Enina (kaz), Bagrenci (kjust), Indian English (Delhi), Undheim, Rakevo (vr), Orkanger, Zheravna (kot), Belica (razl), Martigny, Dinevo (hask), Chernomorec (bs), Koekelare (BeWv), Buchin prohod (god), Stjørdalshalsen, Égà, Develier, Sestrino (petr), Tavda Mansi, Court, Vinishte (mont), Kaspichan (np), Wǔhàn, Beglezh (luk), Nijeholtpade (Fr), Roche, Faroese, Novo selo (trojan), Ingen (Gl), Évolène, Kreta (vrach), Fyllingsdalen, High German (Ortisei), Vardun (targ), Noiraigue, Hèqìng, 
Lies (Fr), Jīnmǎn, Fùzhōu, Banishte (brezn), Proto-Germanic, Tuōluò, Corongo, Garmen (gd), Bachkovo (asgr), Bygstad, Montbovon, High German (Tuebingen), Topolchane (sliv), L'Auberson, Kagoshima, Dobarsko (razl), Corcelles, Swedish (Skane), Chukovec (radom), Huancayo, Brashljan (mtarn), Stoilovo (mt), High German (Biel), South African English (Johannisburg), Shèxiàn, Semsalves, Podvis (karn), Dolni bogrov (sof), Miyako, English (Tyrone), Kolju marinovo (chirp), Lombard (West), Workum (Fr), Gouderak (ZH), Shànghǎi, Vojnjagovo (karl), Malyj Jugan Khanti, Nendaz, Lipnica (botgr), Goljamo shivachevo (sliv), Oki, Kristiansand, Mǎzhělóng, Chaux-du-Milieu, Dovre, Sinja Khanti, Nazareth (BeOv), Czech, Chángshà, Levunovo (sand), Karanovo (ajt), Konda Khanti, Bov (svog), Varbica (presl), Varbovo (blgr), Suhindol (vtarn), Bø, Vaklinovo (gd), Milchina laka (kul), Ganchovec (drjan), Ljubenova mahala (nzag), Cuzco, Raundalen, Fagnastøl, Laconnex, Puno, Dolna riksa (mont), Hermance, Guìyáng, Dutch (Ostend), Divdjadovo (shum), Piershil (ZH), Devesilica (krgr), Liljache (vr), Solishta (dev), Montalchez, Veyrier, Elov dol (pk), Brouckerque (FrVl), Marchaevo (sof), Qingdao, Gega (petr), Varvara (paz), Kùnmíng, Sortland, Øra, Valche pole (svgr), Miège, Belene (svisht), Xiàmén, Vresovo (ajt), Kravenik (sevl), Velkovci (pk), Humbeek (BeBr), Bulgarian, Oudeschoot (Fr), Qīlǐqiáo, English (Singapore), Polish, Walshoutem (BeBr), English (Buckie), Tàiyuán, Russland, Waregem (BeWv), Abbekerk (NH), Ooike (BeOv), Babjak (razl), Zuydcoote (FrVl), Konda Mansi, Konska (brezn), Vinarovo (vid), Noevci (brezn), Fully, Ěryuán, Oosterend (Fr), Vasjugan Khanti, Kawki, Pingyao, Ulft (Gl), Cochabamba, Slavjanovo (plev), Golemo malovo (sliven), Lobosh (rad), Jugan Khanti, Haamstede (Ze), Rakovica (kul)
    Production status
    Newly created-finished
    Resource usage
    testing of phonetic alignment algorithms
    License
    Creative Commons Attribution-NonCommercial 3.0 Unported <http://creativecommons.org/licenses/by-nc/3.0/>
    Conditions of use
    Attribution, Non-Commercial
    Description
    Automatic methods for phonetic alignment play an increasingly important role in quantitative approaches to historical linguistics and dialectology. With the "Benchmark Database for Phonetic Alignments" (BDPA), we present a new data resource which offers collections of cognate words from different language varieties. In contrast to other resources, which concentrate on questions of cognacy and lexical change, the BDPA represents the data in the form of pairwise and multiple alignments. An alignment is a matrix representation of two or more sequences in which corresponding segments in the sequences are placed in the same column, with empty cells resulting from non-matching segments being filled by gap symbols. Currently, the BDPA offers a total of 750 multiple alignments based on 12 different sources of language and dialect varieties.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Bilingual Dictionaries
    Resource type
    Lexicon
    Size
    240.000 entries
    Languages
    English (eng) French (fra) Italian (ita) German (deu)
    Production status
    Existing-updated
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    OpenSource
    Conditions of use
    <Not Specified>
    Description
    This resource contains 3 OpenLogos bilingual dictionaries, namely the English-German, the English-French, and the English-Italian dictionaries. In addition to the usual information on part-of-speech, gender, and number for nouns, offered by most dictionaries currently available, OpenLogos bilingual dictionaries have some distinctive features that make them unique: they contain cross-language morphological information (inflectional and derivational), semantico-syntactic knowledge, indication of the head word in multiword units, information about whether a source word corresponds to a homograph, information about verb auxiliaries, alternate words (i.e., predicate or process nouns), causatives, reflexivity, and verb aspect, among others. The strongest characteristic of the dictionaries is the semantico-syntactic knowledge embedded in each entry.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Burst-Annotated Co-Occurrence Network for the Arab Spring Domain
    Resource type
    Corpus
    Size
    280 MByte
    Languages
    American English (eng)
    Production status
    Newly created-in progress
    Resource usage
    Knowledge Discovery/Representation
    License
    CC-BY
    Conditions of use
    Attribution
    Description
    A burst-annotated co-occurrence network about the Arab Spring topic, built on top of New York Times article snapshots from the years 2010-2013.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    CARTÃO
    Resource type
    Lexicon
    Size
    146 lexemes
    Languages
    Portuguese (por)
    Production status
    Existing-used
    Resource usage
    Word Sense Disambiguation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Lexical-semantic network extracted automatically from three Portuguese dictionaries. Contains ~146k lexical items, connected by ~286k semantic relation instances covering a rich set of types, including synonymy, hypernymy, several types of meronymy, causation and purpose.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Chinese Open Wordnet
    Resource type
    Lexicon
    Size
    <Not Specified>
    Languages
    Mandarin Chinese (cmn)
    Production status
    Newly created-in progress
    Resource usage
    Word Sense Disambiguation
    License
    CreativeCommons Attribution (CC BY)
    Conditions of use
    Attribution
    Description
    We are creating a large scale, freely available, semantic dictionary of Mandarin Chinese: the Chinese Open Wordnet, inspired by the Princeton WordNet and the Global WordNet Grid. All relations (hypernyms, meronyms ...) come from Princeton WordNet 3.0. We have enriched the synsets with Chinese lexical units.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Chinese to Kazakh Bilingual Dictionary
    Resource type
    Lexicon
    Size
    52,478 entries
    Languages
    Mandarin Chinese (cmn) Kazakh (kaz)
    Production status
    Existing-updated
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    This resource is a one-to-many bilingual dictionary for the Chinese-Kazakh language pair, an experimental resource used in the corresponding submission. Resource detail: File type: Microsoft Excel (.xlsx); File size: 3.36MB; Scale: 52,478 entities, 232,589 translation pairs; Script: Mandarin Chinese, Mandarin Chinese Pinyin, Arabic-based Kazakh; Encoding: Unicode; Provider: Ishida&Matsubara Laboratory, Department of Social Informatics, Kyoto University. Format note: Multiple Kazakh translations are separated with "?"
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Chinese to Uyghur Bilingual Dictionary
    Resource type
    Lexicon
    Size
    52,478 entries
    Languages
    Mandarin Chinese (cmn) Uighur (uig)
    Production status
    Existing-updated
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    This resource is a one-to-many bilingual dictionary for the Chinese-Uyghur (Uighur) language pair, an experimental resource used in the corresponding submission. Resource detail: File type: Microsoft Excel (.xlsx); File size: 2.84MB; Scale: 52,478 entities, 118,805 translation pairs; Script: Mandarin Chinese, Mandarin Chinese Pinyin, Arabic-based Uyghur; Encoding: Unicode; Provider: Ishida&Matsubara Laboratory, Department of Social Informatics, Kyoto University. Format note: Multiple Uyghur translations are separated with "?"
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social (CIEMPIESS)
    Resource type
    Corpus
    Size
    17 hours
    Languages
    Spanish (spa)
    Production status
    Newly created-finished
    Resource usage
    Speech Recognition/Understanding
    License
    Creative Commons Attribution-ShareAlike 4.0 International
    Conditions of use
    Attribution, ShareAlike
    Description
    "Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social" (CIEMPIESS) is a new open-source corpus extracted from Spanish-language FM podcasts in the dialect of central Mexico. The CIEMPIESS corpus was designed to be used in the field of automatic speech recognition (ASR). The corpus size is 17 hours, and we provide language models and pronunciation dictionaries for experimentation.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Corpus of Japanese Predicate Phrases for Synonym/Antonym Relations
    Resource type
    Corpus
    Size
    507 KByte
    Languages
    Japanese
    Production status
    Newly created-in progress
    Resource usage
    Semantic Similarities between words
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    A large human annotated set of predicates for synonym-antonym relations in Japanese. Accompanied by a noun phrase and case information, the data consists of 7,278 pairs of predicates such as “receive-permission (ACC)” vs. “obtain-permission (ACC)”; the relations are categorized as synonyms, antonyms, or unrelated. Antonyms are further categorized into three different classes depending on their aspect of oppositeness.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Corpus of Semantic Graphs with associated English strings
    Resource type
    Corpus
    Size
    98'818 Graph/string pairs
    Languages
    American English (eng)
    Production status
    Newly created-finished
    Resource usage
    Natural Language Generation, Natural Language Understanding, Machine Translation
    License
    OpenSource
    Conditions of use
    <Not Specified>
    Description
    Automatically generated corpus of 98'818 graph/string pairs.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    CorpusSVCs
    Resource type
    Corpus
    Size
    100 SVC translations evaluated for 5 languages for 2 MT systems (total SVCs evaluated = 1,000)
    Languages
    English (eng) French (fra) German (deu) Italian (ita) Portuguese (por) Spanish
    Production status
    Newly created-finished
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    This resource contains a corpus of 100 English support verb construction (SVC) translations and their evaluation for 5 languages: French, German, Italian, Portuguese and Spanish. The translations were performed by 2 MT systems, the GoogleTranslate and the OpenLogos systems. The total number of SVC translations evaluated was 1,000.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Criminality Corpus for Text Categorization
    Resource type
    Corpus
    Size
    260 documents
    Languages
    Italian (ita)
    Production status
    Newly created-finished
    Resource usage
    Document Classification, Text categorisation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    A collection of 260 documents classified in nine different categories, to be used for testing text categorisation systems in the field of criminality.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Czech Meteor tables
    Resource type
    Corpus
    Size
    7.8 MByte
    Languages
    Czech (ces)
    Production status
    Existing-used
    Resource usage
    Paraphrasing
    License
    GNU LGPL
    Conditions of use
    <Not Specified>
    Description
    <Not Specified>
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    DBnary
    Resource type
    Lexicon
    Size
    >1,700,000 entries
    Languages
    English (eng) French (fra) German (deu) Russian (rus) Japanese (jpn) Bulgarian, Finnish, Greek, Italian, Portuguese, Spanish, Turkish
    Production status
    Existing-updated
    Resource usage
    Semantic Web
    License
    CreativeCommons-by-sa
    Conditions of use
    Attribution, ShareAlike
    Description
    12 Wiktionary language editions, available as a LEMON-based lexical resource in RDF, plus attachments of 3.3M translation pairs to the appropriate source word sense.
    Download from
    Referring paper
    Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
    Edition
    LDL 2014
  • Name
    Database of Lexical Simplification Errors
    Resource type
    Evaluation Data
    Size
    200 KByte
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Lexical Simplification
    License
    CC-BY-SA
    Conditions of use
    Attribution, ShareAlike
    Description
    The data described in the paper. A categorisation of the errors occurring during the lexical simplification pipeline.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Deep Sequoia
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    French (fra)
    Production status
    Existing-updated
    Resource usage
    Deep parsing
    License
    LGPL-LR (Lesser General Public License For Linguistic Resources)
    Conditions of use
    <Not Specified>
    Description
    The Sequoia treebank contains 3099 sentences in French, annotated with POS, constituency trees and dependency trees. The current submission proposes an extension of the resource: an additional layer of deep syntactic annotations. It was also the occasion to correct the surface dependency trees.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    DeriNet
    Resource type
    Lexicon
    Size
    250000 lexemes
    Languages
    Czech (ces)
    Production status
    Newly created-in progress
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    CreativeCommons
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    The presented resource is a network that attempts to capture word formation processes in Czech. Technically, it is an oriented graph consisting of nodes representing lexemes and edges that represent derivations (e.g. teach->teacher). The network was built using a combination of existing NLP resources for Czech as well as new annotations.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    DerivBase.hr
    Resource type
    Lexicon
    Size
    <Not Specified>
    Languages
    Croatian (hrv)
    Production status
    Newly created-in progress
    Resource usage
    Various semantic tasks (entailment, SRL) as well as IR/IE tasks
    License
    Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    DerivBase.hr is a lexicon of derivationally related Croatian lemmas, induced automatically from the Croatian web corpus hrWaC. The resource has high coverage (100K lemmas) and good quality (81% precision and 77% recall).
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    DiLAF African languages-French dictionaries
    Resource type
    Lexicon
    Size
    1.7 MByte
    Languages
    Hausa (hau) Central Kanuri (knc) Tamajaq (tmh) Songhai-zarma French (fra)
    Production status
    Newly created-in progress
    Resource usage
    Web Services
    License
    CC BY 3.0
    Conditions of use
    Attribution
    Description
    Bilingual dictionaries encoded in XML: Hausa-French dict. for basic cycle, 2008 Soutéba: 7,823 entries; Kanuri-French dict. for basic cycle, 2004 Soutéba: 5,994 entries; Tamajaq-French dict. for basic cycle, 2007 Soutéba: 5,205 entries; Songhai-zarma-French dict. for basic cycle, 2007 Soutéba: 6,916 entries.
    Download from
    Referring paper
    Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
    Edition
    CCURL 2014
  • Name
    Diachronic Ontologies from People's Daily
    Resource type
    Ontology
    Size
    8.83 MByte
    Languages
    Mandarin Chinese (cmn)
    Production status
    Newly created-finished
    Resource usage
    Word Sense Disambiguation
    License
    <Not Specified>
    Conditions of use
    Non-commercial, ShareAlike
    Description
    1. Language Source Description: This diachronic ontology is constructed from fifty years of People's Daily (i.e., from 1947 to 1996). The ontology for each year consists of several concept trees, and we only consider words with frequencies not lower than 100 in each year. Numerals, punctuation, non-morpheme words, quantifiers and function words are excluded. In addition, we have subjectively defined eight eras of consecutive years. The ontology for each era only includes words with frequencies over 300 for each period. This language resource is established and maintained by the research group of Dr. Junfeng Hu in ICL, Peking University. Updates and modifications will be uploaded and announced on the KLCL website (www.klcl.pku.edu.cn). 2. User License: You may use, copy, reproduce, and distribute this ontology for any non-commercial purpose, subject to the restrictions in this license agreement. Some purposes which can be non-commercial are teaching, academic research, public demonstrations and personal experimentation. 3. You may not use or distribute this ontology or any derivative works in any form for commercial purposes. Examples of commercial purposes would be running business operations, licensing, leasing, or selling the ontology, distributing the ontology for use with commercial products, using the ontology in the creation or use of commercial products or any other activity whose purpose is to procure a commercial gain to you or others. If you distribute the Ontology or any derivative works of the ontology, you will distribute them under the same terms and conditions as in this license, and you will not grant other rights to the Corpus or derivative works that are different from those provided by this license agreement. If you have created derivative works of the ontology, and distribute such derivative works, you will cause the modified files to carry prominent notices so that recipients know that they are not receiving the original ontology. Such notices must state: (i) that you have changed the ontology; and (ii) the date of any changes. Copyright (c) Key Laboratory of Computational Linguistics, Peking University. All rights reserved.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Domain-Specific Gold Standard Translation Set
    Resource type
    Evaluation Data
    Size
    200 sentences
    Languages
    English (eng) Brazilian Portuguese (por)
    Production status
    Existing-used
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    A set of 40 English sentences, all automatically extracted from a dermatology corpus, each with 4 gold standard Portuguese translations carefully produced by a specialist. Files are encoded in UTF-8.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Dot type corpus with experts
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    American English (eng)
    Production status
    Newly created-finished
    Resource usage
    Word Sense Disambiguation
    License
    CC
    Conditions of use
    <Not Specified>
    Description
    Two datasets annotated for the container/content and location/organization metonymic alternation by turkers (all items) and experts (items with low agreement by turkers). This is an extension of the sense-annotated corpus published in Martinez Alonso et al., 2013 (ACL).
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Dutch sentiment lexicon
    Resource type
    Lexicon
    Size
    61 KByte
    Languages
    Dutch (nld)
    Production status
    Newly created-in progress
    Resource usage
    Information Extraction, Information Retrieval
    License
    Creative Commons
    Conditions of use
    <Not Specified>
    Description
    Dutch sentiment lexicon: 3013 positive words, 3014 negative words
    Download from
    Referring paper
    Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
    Edition
    ES3LOD 2014
  • Name
    EVOCA
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Emotion Recognition/Generation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    No description provided, see the related article
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    English Web Reviews Multiword Expressions Corpus
    Resource type
    Corpus
    Size
    55000 words
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    <Not Specified>
    License
    Creative Commons Attribution-ShareAlike 3.0 Unported
    Conditions of use
    Attribution, ShareAlike
    Description
    Tokens of sentences from online reviews are grouped together to indicate multiword expressions (MWEs). The annotation proceeded sentence by sentence, and is thus comprehensive: it captures many kinds of MWEs, and is not biased by any predetermined lexicon or syntactic pattern. 3500 MWE instances are marked. Many MWEs are "gappy" (discontinuous in the sentence). Each MWE is marked as "strong" (highly idiomatic) or "weak" (collocational). A token cannot belong to multiple MWEs, except when a weak MWE contains a strong MWE as a constituent. Annotations are specified as token offsets into the English Web Treebank (available from LDC).
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Estonian resource grammar for Grammatical Framework
    Resource type
    Grammar/Language Model
    Size
    1132 rules
    Languages
    Estonian (est)
    Production status
    Newly created-in progress
    Resource usage
    Language Modelling
    License
    LGPL
    Conditions of use
    <Not Specified>
    Description
    A GF resource grammar for Estonian, implementing the language-neutral API of the GF Resource Grammar Library as well as a morphological synthesizer.
    Download from
    Referring paper
    Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
    Edition
    SaLTMiL 2014
  • Name
    Freepal
    Resource type
    Corpus
    Size
    23 GByte
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Information Extraction, Information Retrieval
    License
    CreativeCommons
    Conditions of use
    <Not Specified>
    Description
    Freepal is a resource designed to assist with the creation of relation extractors for more than 5,000 relations defined in the Freebase knowledge base. The resource consists of over 10 million distinct lexico-syntactic patterns extracted from dependency trees, each of which is assigned to one or more Freebase relations with different confidence strengths.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    French Framenet
    Resource type
    Lexicon
    Size
    <Not Specified>
    Languages
    French (fra)
    Production status
    Newly created-in progress
    Resource usage
    FrameNet-based Shallow Semantic Parsing
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Ongoing effort to create a French version of the FrameNet resource. The current version contains a set of approx. 100 frames, slightly modified with respect to the English frames, and a lexicon of French lexemes associated with frames.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    GenitivDB
    Resource type
    Grammar/Language Model
    Size
    >9 million words
    Languages
    German
    Production status
    Newly created-finished
    Resource usage
    Language Modelling
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Corpus-Generated Dataset for German Genitive Classification
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Georgian-Russian-Ukrainian-German Parallel Treebank
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    Georgian (kat) Russian (rus) Ukrainian (ukr) German (deu)
    Production status
    <Not Specified>
    Resource usage
    Language Modelling
    License
    Creative Commons Attribution 3.0 Unported (CC BY 3.0) http://creativecommons.org/licenses/by/3.0/
    Conditions of use
    Attribution
    Description
    This dataset is made of two types of resources: four monolingual Treebanks (German, Georgian, Russian and Ukrainian), and four parallel Treebanks (German-Georgian, German-Russian, German-Ukrainian, Georgian-Ukrainian). The parallel texts used for the outlined experiment comprise German sentences and their translations into Georgian and Russian, compiled for the GREG NLP lexicon project. The GREG lexicon itself contains manually aligned German, Russian, English and Georgian valency data supplied with syntactic subcategorization frames and saturated with semantic role labels. The multilingual verb lexicon is expanded with example sentences in the 4 languages involved. They unfold the meaning of the lexical entries and are considered mutual translation equivalents. The size of the bilingual sublexicons varies between 1200 and 1300 entries depending on the language pair, and the number of example sentences appended to each lexicon differs. For example, the German-Georgian subcorpus used for this study has a size of roughly 2600 sentence pairs that correspond to different syntactic subcategorization frames. For the German-Russian language pair, a more fine-grained subcorpus with about 4000 sentences as translation equivalents was extracted. The German-Ukrainian subcorpus, created to support the GRUG initiative, is relatively small.
    Download from
    Referring paper
    Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
    Edition
    CCURL 2014
  • Name
    Google Books Distributional Thesaurus
    Resource type
    Lexicon
    Size
    100 GByte
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Word Sense Disambiguation
    License
    CC
    Conditions of use
    <Not Specified>
    Description
    Distributional Thesaurus (DT) for various time slices for the whole Google Books syntactic n-grams. A DT contains, for the whole vocabulary, a ranked list of most similar words. The indicated URL contains several such DTs for various corpora. Additionally, we provide sense clusters for each of the DTs.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    HNZ segmented and POS tagged corpus
    Resource type
    Corpus
    Size
    1 MByte
    Languages
    Archaic Chinese
    Production status
    Newly created-in progress
    Resource usage
    Knowledge Discovery/Representation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    The HNZ corpus is an Archaic Chinese corpus consisting of all the articles in the book of Huainanzi with word segmentation and POS tagging annotation. Huainanzi, also known as Huainan Honglie, is a collective work written by Prince Huainan, Liu An (179 BC-122 BC), and a group of his retainers in the Western Han Dynasty (206 BC-9 AD). Huainanzi was first circulated in the Western Han Dynasty, which is near the end of the Archaic Chinese era. The book has 21 chapters, covering a wide range of topics on philosophy, astrology, geography, politics, customs, military affairs, mountains, sociology, etc. It has been described as the "Encyclopedia of the early Han Dynasty". Its abundant language capacity reveals characteristics of lexical usage in the Western Han Dynasty, and demonstrates how the usage had been transformed from the Qin Dynasty to the Han Dynasty. In this regard, Huainanzi contains valuable data for an in-depth analysis of Archaic Chinese. Because of these properties, we selected the book as the raw data for our Archaic Chinese corpus. All the manual annotation and correction was done by a Chinese linguist who is an expert on Archaic Chinese.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Harvard Uncertainty Speech Corpus
    Resource type
    Corpus
    Size
    150 minutes
    Languages
    English
    Production status
    Newly created-finished
    Resource usage
    Emotion Recognition/Generation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    The Harvard Uncertainty Speech Corpus contains speech recordings, level of certainty annotations, and acoustic feature vector data. The speech elicitation materials include items from three domains: vocabulary, public transportation, and handwritten digits. In total, the Uncertainty Corpus has 1700 utterances and 148.79 minutes of speech.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    HiEve
    Resource type
    Corpus
    Size
    8034 KByte
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Information Extraction, Information Retrieval
    License
    Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License
    Conditions of use
    Attribution; Non-Commercial; ShareAlike (CC-BY-NC-SA 3.0)
    Description
    A corpus of manually annotated event hierarchies in news stories.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Hindi-English Code-Switch Corpus
    Resource type
    Corpus
    Size
    43 MByte
    Languages
    English (eng) Hindi (hin)
    Production status
    Newly created-finished
    Resource usage
    Speech Recognition/Understanding
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    This is a small Hindi-English speech corpus. The corpus consists of student interview speech. A total of 9 students answering 12 questions were recorded for this corpus. For details regarding the corpus please refer to the paper "A Hindi-English Code-Switching Corpus" by Anik Dey and Pascale Fung to be presented at LREC 2014. You can also send an email to adey@connect.ust.hk for instructions on how to use this corpus.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    IULA Spanish LSP Treebank
    Resource type
    Corpus
    Size
    3.247 MByte
    Languages
    Spanish (spa)
    Production status
    Newly created-finished
    Resource usage
    <Not Specified>
    License
    Creative Commons Attribution 3.0 Unported License
    Conditions of use
    Attribution
    Description
    This package contains a partition of the IULA Spanish LSP Treebank into train and test sets to perform Machine Learning experiments. This way, the same partitions can be used by different researchers and their results can be directly compared. In this package we also deliver the Tibidabo Treebank (Marimon 2010), which contains a set of sentences extracted from the Ancora corpus and annotated in the same way as the IULA Treebank. The Tibidabo Treebank is a very good test set for models trained with the IULA Spanish LSP Treebank, since its sentences come from a very different domain than those of the IULA Spanish LSP Treebank.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Katakana-English Scientific Terms Lexicon
    Resource type
    Lexicon
    Size
    <Not Specified>
    Languages
    Japanese (jpn) English (eng)
    Production status
    Newly created-finished
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    Free
    Conditions of use
    Attribution
    Description
    A lexicon of 170K Japanese-English scientific terms automatically extracted using a transliteration filtering algorithm.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Khresmoi Query Translation Test Data for the Medical Domain version 1.0
    Resource type
    Corpus
    Size
    1508 sentences
    Languages
    English (eng) Czech (ces) German (deu) French (fra)
    Production status
    Newly created-finished
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    CreativeCommons BY-NC 3.0
    Conditions of use
    Attribution; Non-commercial
    Description
    This package contains data sets for development and testing of machine translation of short medical search queries between Czech, English, French, and German. The queries come from the general public and from medical experts.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    LAST MINUTE
    Resource type
    Corpus
    Size
    56 hours
    Languages
    German (deu)
    Production status
    Existing-used
    Resource usage
    Dialogue
    License
    Own
    Conditions of use
    <Not Specified>
    Description
    <Not Specified>
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    LQVSumm
    Resource type
    Corpus
    Size
    1.1 MByte
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Summarisation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Stand-off annotations of linguistic quality violations found in automatically produced summaries. The summaries are from the TAC 2011 Guided Summarization task (initial summaries) and from the G-Flow summarization system.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Lowlands Twitter data
    Resource type
    Corpus
    Size
    3064 tokens
    Languages
    English
    Production status
    Newly created-finished
    Resource usage
    Part of speech tagging
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    200 tweets collected over the span of one day, POS-annotated by three annotators.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    LuxId
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    Luxembourgish
    Production status
    Newly created-finished
    Resource usage
    Language Identification
    License
    CC BY-SA 3.0
    Conditions of use
    Attribution, ShareAlike
    Description
    Corpus of mixed-language (French, German, Luxembourgish) sentences from Chamber (House of Parliament) debate reports, manually annotated at segment level with 6 labels: Lux, Fre, Ger, Lux + Fre, Lux + Ger, Lux + Fre + Ger.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Mannheim Corpus of Historical Newspapers and Magazines
    Resource type
    Corpus
    Size
    4.1 million tokens
    Languages
    German (deu)
    Production status
    Newly created-in progress
    Resource usage
    Text Mining
    License
    CreativeCommons, NonCommercial, Attribution
    Conditions of use
    Attribution, Non-Commercial
    Description
    The Mannheim Corpus of Historical Newspapers and Magazines consists of 21 German newspapers and magazines from the 18th and 19th centuries. It comprises about 652 individual volumes with over 4.1 million word tokens on 4678 pages overall. The corpus was assembled and digitized from 2009 to 2011 and converted to TEI P5 in 2013.
    Download from
    Referring paper
    Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
    Edition
    LRT4HDA 2014
  • Name
    Metaphors for Economic Inequality in English, Farsi, Spanish, and Russian
    Resource type
    Corpus
    Size
    3.5 MByte
    Languages
    English Russian Farsi Spanish
    Production status
    Newly created-finished
    Resource usage
    Discourse
    License
    Creative Commons
    Conditions of use
    <Not Specified>
    Description
    Excel spreadsheets containing the results of SketchEngine WordSketch searches for metaphors in the target domain of economic inequality. The languages are English, Russian, Farsi, and Spanish. The data is taken from the SketchEngine TenTen corpora, as described in LREC 2014 papers from MacWhinney, B. & Fromm, D., as well as Levin et al. There are four sheets of the general conceptual metaphors: ccm_rus, ccm_eng, cc_far, and cc_spa. There are also four much larger sheets containing the actual sentences with the metaphors, along with pointers to URLs where these occurred.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    MotaMot French-Khmer Pivot Database
    Resource type
    Lexicon
    Size
    <Not Specified>
    Languages
    French (fra) Central Khmer (khm)
    Production status
    Newly created-in progress
    Resource usage
    Web Services
    License
    CC Attribution 3.0 Unported (CC BY 3.0)
    Conditions of use
    Attribution
    Description
    French-Khmer pivot lexical database
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Multilingual corpora with coreferential annotation of person entities
    Resource type
    Corpus
    Size
    1 MByte
    Languages
    Portuguese (por) Galician (glg) Spanish (spa)
    Production status
    Newly created-in progress
    Resource usage
    Person Identification
    License
    GPL
    Conditions of use
    <Not Specified>
    Description
    In-progress corpora with coreferential annotation of person entities. Sources: journals and Wikipedia. Languages: Portuguese (varieties from Portugal, Brazil, Angola and Mozambique, plus Wikipedia); Spanish (varieties from Spain and Argentina, plus Wikipedia); Galician (from Galician journals, plus Wikipedia). Format: SemEval-10 (Recasens, Marta, Lluís Màrquez, Emili Sapena, M Antònia Martí, Mariona Taulé, Véronique Hoste, Massimo Poesio and Yannick Versley, 2010. SemEval-2010 Task 1: Coreference resolution in multiple languages. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval ’10): 1–8. ACL).
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    N3-Collection
    Resource type
    Corpus
    Size
    728 annotated documents
    Languages
    English German
    Production status
    Newly created-finished
    Resource usage
    Semantic Web
    License
    Creative Commons BY-NC-SA
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    We publish three novel datasets called N3. N3 will be published using NIF, ensuring greater interoperability and removing the need for corpus-specific parsers. The data can be downloaded from our project homepage.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    NST acoustic and language models
    Resource type
    Acoustic and Language Models for Speech Recognition
    Size
    2.1 GByte
    Languages
    Swedish (swe)
    Production status
    Newly created-finished
    Resource usage
    Speech Recognition/Understanding
    License
    CreativeCommons
    Conditions of use
    Attribution
    Description
    The package contains resources for large vocabulary continuous speech recognition (LVCSR) in Swedish. We trained acoustic models on the public domain NST Swedish corpus and made them freely available to the community. We also provide scripts to generate language models containing a chosen subset of words from the NST n-grams. Note that the models may be updated before the date of the conference if new results are available.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    NoSta-D: German NER Dataset Train/Dev
    Resource type
    Corpus
    Size
    26200 sentences
    Languages
    German (deu)
    Production status
    Newly created-finished
    Resource usage
    Named Entity Recognition
    License
    CC-BY
    Conditions of use
    Attribution
    Description
    Freely available large dataset, manually annotated for German NER. Includes nested span annotations. Source text from German Wikipedia and news. This data set does not contain the test data, which is used for the GermEval 2014 NER task at KONVENS. Test data will be available from September 2014.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    NomLex-BR
    Resource type
    Lexicon
    Size
    2323 entries
    Languages
    Brazilian Portuguese (por) Portuguese (por)
    Production status
    Newly created-in progress
    Resource usage
    Information Extraction, Information Retrieval
    License
    CC BY SA
    Conditions of use
    Attribution, ShareAlike
    Description
    A computational lexicon for Portuguese that provides mappings between verbs and their nominalizations.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    ODIN database
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    examples for thousands of languages
    Production status
    Newly created-in progress
    Resource usage
    Data can help linguistic studies and bootstrap NLP tools for resource-poor languages
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    <Not Specified>
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    OSS Online Communication Messages
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    American English (eng)
    Production status
    Newly created-finished
    Resource usage
    Document Classification, Text categorisation
    License
    Apache License, Version 2.0
    Conditions of use
    Retain, in the Source form of any Derivative Works that You distribute, all licensing information from the Source form of the Work
    Description
    The corpus contains 1,030 online communication messages, randomly selected from Network News Transfer Protocol (NNTP) newsgroups, the bug tracking system Bugzilla and the bug tracking system GitHub. NNTP articles, Bugzilla and GitHub comments were selected randomly so that the sample exhibits similar characteristics to the population as a whole. Each message was annotated manually as a request or a non-request. The corpus was created as part of the work presented in the current paper and it is described in section 3. We intend to make the corpus available freely.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Onto.PT
    Resource type
    Ontology
    Size
    ~117k synsets
    Languages
    Portuguese (por)
    Production status
    Existing-used
    Resource usage
    Word Sense Disambiguation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Large wordnet for Portuguese, created automatically after integrating the relation instances extracted from three Portuguese dictionaries in the synsets of TeP and OpenWordNet.PT. Its current version, 0.6, contains ~168k lexical items and ~238k word senses, organised in ~117k synsets, connected by ~341k relation instances, that cover the same types as PAPEL. About 40% of the synsets contain glosses, assigned automatically.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    OpenWordNet.PT
    Resource type
    Ontology
    Size
    ~39k synsets
    Languages
    Portuguese (por)
    Production status
    Existing-used
    Resource usage
    Word Sense Disambiguation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Portuguese wordnet that results from the manual translation of a set of base synsets from Princeton WordNet 3.0. Semantic relations were inherited from the latter, given the synset matches. Currently, it contains ~48k lexical items and ~54k word senses, organised in ~39k synsets, connected by ~84k relation instances, that cover the same types as WordNet 3.0.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    PACE Corpus
    Resource type
    Corpus
    Size
    246688 tokens
    Languages
    English (eng) German (deu)
    Production status
    Newly created-finished
    Resource usage
    Sentiment Analysis
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Publicly available multilingual evaluation corpus for phrase-level Sentiment Analysis that can be used to evaluate real world applications in an industrial context. Data from English and German Internet forums (1000 posts each) focusing on the automotive domain.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    PAPEL
    Resource type
    Lexicon
    Size
    ~102k lexemes
    Languages
    Portuguese (por)
    Production status
    Existing-used
    Resource usage
    Word Sense Disambiguation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Lexical-semantic network extracted automatically from a proprietary Portuguese dictionary. PAPEL 3.5 contains ~102k lexical items, connected by ~191k semantic relation instances covering a rich set of types, including synonymy, hypernymy, several types of meronymy, causation and purpose.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    PAROLE-SIMPLE-CLIPS
    Resource type
    Lexicon
    Size
    <Not Specified>
    Languages
    Italian (ita)
    Production status
    Existing-used
    Resource usage
    Lexical classification
    License
    ELRA
    Conditions of use
    Attribution
    Description
    PAROLE-SIMPLE-CLIPS is a four-level, general purpose lexicon that has been elaborated over three different projects. The kernel of the morphological and syntactic lexicons was built in the framework of the LE-PAROLE project. The linguistic model and the core of the semantic lexicon were elaborated in the LE-SIMPLE project, while the phonological level of description and the extension of the lexical coverage were performed in the context of the Italian project Corpora e Lessici dell'Italiano Parlato e Scritto (CLIPS). The full lexicon is available in the ELRA catalogue (see http://catalog.elra.info/product_info.php?products_id=881). Part of the resource is also available as Linked Data at http://datahub.io/dataset/simple
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    PDT-VALLEX 2.0
    Resource type
    Lexicon
    Size
    11656 entries
    Languages
    Czech (ces)
    Production status
    Existing-used
    Resource usage
    linking lexicons
    License
    Creative Commons 3.0 - BY - NC - SA
    Conditions of use
    Attribution, NonCommercial, ShareAlike
    Description
    The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT). It contains over 11000 valency frames for more than 7000 verbs which occurred in the PDT or PCEDT. It is available in an electronically processable format (XML) together with the aforementioned treebanks (to be viewed and edited by TrEd, the PDT/PCEDT main annotation tool), and also in a more human-readable form. The main feature of the lexicon is its linking to the annotated corpora: each occurrence of each verb is linked to the appropriate valency frame with additional (generalized) information about its usage and surface morphosyntactic form alternatives.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    PanLex
    Resource type
    Lexicon
    Size
    20,000,000 lexemes
    Languages
    <Not Specified>
    Production status
    Newly created-in progress
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    CC0
    Conditions of use
    <Not Specified>
    Description
    A panlingual lexical translation database currently documenting 1.1 billion pairwise translations among 20 million lexemes in 9,300 language varieties.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    ParCor 1.0
    Resource type
    Corpus
    Size
    332054 tokens
    Languages
    English German
    Production status
    Newly created-finished
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    None
    Conditions of use
    <Not Specified>
    Description
    ParCor is a parallel corpus of texts in which pronoun coreference – reduced coreference in which pronouns are used as referring expressions – has been annotated. The corpus is intended to be used both as a resource from which to learn systematic differences in pronoun use between languages and ultimately for developing and testing informed Statistical Machine Translation systems aimed at addressing the problem of pronoun coreference in translation. At present, the corpus consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech), and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent. The texts in the corpus have already been translated into many languages, and we plan to expand the corpus into these other languages, as well as other genres, in the future.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    ParTUT
    Resource type
    Corpus
    Size
    3194 sentences
    Languages
    Italian (ita) English (eng) French (fra)
    Production status
    Newly created-in progress
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    Creative Commons
    Conditions of use
    <Not Specified>
    Description
    ParTUT is a project for the development of a multilingual parallel treebank for Italian, English and French. The aim of this work is twofold: building an aligned parallel treebank for Italian, English and French, by extending and applying a single treebank schema to other languages, and studying how the schema can be used to address issues typically related to parallel corpora. The annotation and tools used for the development of this resource are those of the Turin University Treebank (TUT), a collection of Italian sentences annotated at a morpho-syntactic, syntactic and (to a lesser extent) semantic level, with a dependency-oriented representation format.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Parallel sentences for error detection and correction from Wikipedia 2012 and 2013
    Resource type
    Corpus
    Size
    4604 pairs of sentences
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Error Detection and Correction
    License
    <Not Specified>
    Conditions of use
    Attribution, ShareAlike
    Description
    This resource consists of two files containing 4604 parallel sentences extracted automatically from Wikipedia 2012 and Wikipedia 2013. Sentence splitting was performed with jmx mxterminator and alignment was done with Microsoft Aligner. The resource was used for the experiments in the current submission, on the assumption that Wikipedia 2013 provides error corrections for the sentences from Wikipedia 2012.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Paraphrase Fragment Corpus
    Resource type
    Corpus
    Size
    113314 entries
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Textual Entailment
    License
    Various
    Conditions of use
    <Not Specified>
    Description
    The large corpus is constructed automatically by the method described in the abstract; a small corpus of gold-standard annotations is also included.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Plane Crash Dataset
    Resource type
    Evaluation Data
    Size
    193 entries
    Languages
    English
    Production status
    Newly created-finished
    Resource usage
    Information Extraction, Information Retrieval
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    A knowledge base of 193 plane crash events based on Wikipedia infoboxes, and stand-off annotation for automatically generated slot-type labels for 4,093 newswire documents from Tipster-1, Tipster-2, Tipster-3 and Gigaword-5.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Predicate Matrix
    Resource type
    Lexicon
    Size
    <Not Specified>
    Languages
    English
    Production status
    Newly created-in progress
    Resource usage
    Semantic Role Labeling
    License
    CreativeCommons
    Conditions of use
    Attribution
    Description
    The Predicate Matrix is a new lexical resource resulting from the integration of multiple sources of predicate information including FrameNet, VerbNet, PropBank and WordNet. With the Predicate Matrix, we expect to provide a more robust and interoperable predicate lexicon. Moreover, we plan to extend the coverage of current predicate resources, to enrich WordNet with predicate information, to discover and solve inherent inconsistencies among the resources and possibly to extend predicate information to languages other than English (by exploiting the local wordnets aligned to the English WordNet).
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Priberam Compressive Summarization Corpus
    Resource type
    Corpus
    Size
    800 documents
    Languages
    Portuguese (por)
    Production status
    Newly created-finished
    Resource usage
    Summarisation
    License
    Creative Commons 3.0 (NonCommercial, ShareAlike)
    Conditions of use
    Non-Commercial; ShareAlike (CC-NC-SA 3.0)
    Description
    This is a corpus for multi-document summarization for European Portuguese. It contains 80 topics, each of which has 10 documents, for a total of 800 documents. Each topic contains two human summaries. The summaries are compressive: they are the result of a compression of the sentences in the original documents.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Qatari Arabic Corpus
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    Qatari Arabic
    Production status
    Newly created-in progress
    Resource usage
    Speech Recognition/Understanding
    License
    Not yet released
    Conditions of use
    <Not Specified>
    Description
    The Qatari Arabic (QA) corpus was collected from different TV series and talk show programs. Data are selected from programs in which the majority of speech is in QA; segments from each program are selected after audition confirms the quality of the speech signal. The programs are: Tesaneef (popular Qatari series), Sabah El-Doha (talk show program), and some episodes from Al-Jazeerah are selected if guest speakers are speaking Qatari dialect. The corpus is recorded in linear PCM, 16 kHz, and 16 bits.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    SETimes.HR
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    Croatian (hrv)
    Production status
    Newly created-in progress
    Resource usage
    <Not Specified>
    License
    CC BY-SA 3.0
    Conditions of use
    Attribution, ShareAlike
    Description
    <Not Specified>
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Sample annotated text for definiteness annotations
    Resource type
    Corpus
    Size
    12 KByte
    Languages
    English (eng)
    Production status
    Newly created-in progress
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    No description provided, see the related article
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Sample annotated text for definiteness annotations
    Resource type
    Corpus
    Size
    32 KByte
    Languages
    English (eng)
    Production status
    Newly created-in progress
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    No description provided, see the related article
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Sample annotated text for definiteness annotations
    Resource type
    Corpus
    Size
    32 KByte
    Languages
    English (eng)
    Production status
    Newly created-in progress
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    No description provided, see the related article
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Sample annotated text for definiteness annotations
    Resource type
    Corpus
    Size
    2 KByte
    Languages
    English (eng)
    Production status
    Newly created-in progress
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    No description provided, see the related article
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Sample annotated text for definiteness annotations
    Resource type
    Corpus
    Size
    7 KByte
    Languages
    English (eng)
    Production status
    Newly created-in progress
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    No description provided, see the related article
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Sclera2cornetto
    Resource type
    Lexicon
    Size
    5710 entries
    Languages
    Dutch (nld) Sclera
    Production status
    Newly created-in progress
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    Free
    Conditions of use
    <Not Specified>
    Description
    Sclera2Cornetto.1.0.tgz
    ==========================
    Created by Vincent Vandeghinste and Ineke Schuurman. More background information can be found in the LREC2014 paper. This archive contains two resource files:
    1. Sclera2Cornetto.csv
    ====================================
    We have manually linked a subset of 5710 Sclera pictographs to Cornetto synsets. As these pictographs sometimes depict complex concepts, they can be linked to one or more synsets, indicating that their meaning combines the meanings of the synsets. In these cases we have identified one of the synsets as the head synset, indicating that the other linked synsets are in some kind of dependency relation with the head synset. In cases where the pictograph meaning was not reflected by one or more synsets, we have often linked the pictograph to the synset of its hyperonym.
    Sclera2Cornetto consists of a tab-separated database table with the following columns (N stands for NULL):
    -lemma: name of the pictograph (spaces in the original filenames have been replaced with hyphens)
    For simple pictographs:
    -synset: synset identifier matching Sclera pictograph
    -relation: whether the synset is synonym/hyperonym of pictograph
    Other columns are set to N.
    For complex pictographs:
    -head: synset identifier of head
    -headrel: relation of synset to pictograph (synonym/hyperonym)
    -dependent: comma-separated list of synset identifiers for dependents
    -deprel: comma-separated list of relations (synonym/hyperonym) of synsets for each dependent
    Other columns are set to N.
    2. Dutch2Sclera.csv
    ===================
    We also make our Dutch2Sclera dictionary table available, consisting of 372 entries linking Dutch words straight to Sclera pictographs. This table contains token, lemma, tag, and picto columns, allowing underspecification.
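The Sclera2Cornetto column layout described above could be read along the following lines. This is only a minimal sketch: the column order, the absence of a header row, and the sample rows (including the `syn-001` style identifiers) are assumptions made for illustration and are not taken from the actual file.

```python
import csv
import io

# Assumed column order; the real file may order or name its columns differently.
COLUMNS = ["lemma", "synset", "relation", "head", "headrel", "dependent", "deprel"]


def parse_row(fields):
    """Turn one tab-separated row into a dict, mapping the NULL marker 'N' to None."""
    row = {name: (None if value == "N" else value)
           for name, value in zip(COLUMNS, fields)}
    # dependent and deprel hold comma-separated lists for complex pictographs.
    for key in ("dependent", "deprel"):
        row[key] = row[key].split(",") if row[key] else []
    # A pictograph is treated as complex when a head synset is given.
    row["complex"] = row["head"] is not None
    return row


def read_table(text):
    """Parse the whole table from a string of tab-separated lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [parse_row(fields) for fields in reader if fields]


# Hypothetical rows: one simple pictograph and one complex pictograph.
SAMPLE = (
    "apple\tsyn-001\tsynonym\tN\tN\tN\tN\n"
    "red-apple\tN\tN\tsyn-001\tsynonym\tsyn-002\tsynonym\n"
)
rows = read_table(SAMPLE)
```

Mapping `N` to `None` at load time keeps the simple/complex distinction a single check on the `head` field instead of repeated string comparisons.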
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Senso Comune
    Resource type
    Lexicon
    Size
    288013 entries
    Languages
    Italian (ita)
    Production status
    Existing-updated
    Resource usage
    Word Sense Disambiguation
    License
    Creative Commons Attribution-Share Alike 2.5 Italy License
    Conditions of use
    Attribution, ShareAlike
    Description
    Lexical-semantic database for Italian, composed of three modules: a top-level module, which contains basic ontological concepts and relations; a lexical module, which models general linguistic and lexicographic structures; and a frame module, which provides concepts and axioms for modeling the predicative structure of verbs, nouns and adjectives. The word senses of verbs and nouns in the resource have been aligned with the Italian MultiWordNet.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    SexIt
    Resource type
    Terminology
    Size
    <Not Specified>
    Languages
    French (fra)
    Production status
    always in creation
    Resource usage
    <Not Specified>
    License
    CC
    Conditions of use
    <Not Specified>
    Description
    <Not Specified>
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    SwissAdmin
    Resource type
    Corpus
    Size
    20,000,000 words
    Languages
    German (deu) French (fra) Italian (ita) English (eng)
    Production status
    Newly created-finished
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    Open source for annotations; license for source text as stated in the paper
    Conditions of use
    <Not Specified>
    Description
    <Not Specified>
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Syntactic Reference Corpus of Medieval French (SRCMF)
    Resource type
    Corpus
    Size
    280000 words
    Languages
    Old French
    Production status
    Newly created-in progress
    Resource usage
    Diachronic syntax
    License
    CreativeCommons for the annotation of all texts and the words of some texts. More restrictive licenses apply for the words of other texts.
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    The SRCMF contains 15 Old French texts totalling about 280000 words. It has a high-quality manual annotation, based on a linguistically adequate dependency grammar. Annotation data is provided as RDF/XML. Available export formats are CONLL and TigerXML. The final revision of the texts is ongoing and will be finished by the end of 2013. The project was funded by the Agence Nationale de la Recherche (ANR, France) and the Deutsche Forschungsgemeinschaft (DFG, Germany) from 2009 to 2012.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    TED-LIUM
    Resource type
    Corpus
    Size
    207 hours
    Languages
    American English (eng)
    Production status
    Existing-updated-(release this year)
    Resource usage
    Speech Recognition/Understanding
    License
    CreativeCommons
    Conditions of use
    Attribution, Non-Commercial, Non-Derivative
    Description
    Corpus of speech with transcriptions (207 hours) based on TED talks
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    TeP
    Resource type
    Lexicon
    Size
    19,888 synsets
    Languages
    Brazilian Portuguese (por)
    Production status
    Existing-used
    Resource usage
    Word Sense Disambiguation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Electronic thesaurus for Brazilian Portuguese, created manually. TeP 2.0 contains more than 44,000 lexical items, organised in 19,888 synsets, and also 4,276 antonymy relations between synsets.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Terminesp LD
    Resource type
    Lexicon
    Size
    73317 entries
    Languages
    Spanish (spa) English (eng) German (deu) French (fra) Swedish (swe) Latin, Italian
    Production status
    Newly created-in progress
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    to be specified
    Conditions of use
    <Not Specified>
    Description
    Terminesp LD is the Linked Data version of Terminesp, a terminological database in Spanish created by AETER (Asociación Española de Terminología) by extracting terminological data produced by AENOR (Asociación Española de Normalización y Certificación). It contains more than thirty thousand terms with equivalences in other languages whenever they are available. The core data has been modelled using the lemon model and the translations between terms have been modelled using the lemon translation module proposed by the OEG.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    The N2 Corpus
    Resource type
    Corpus
    Size
    42480 words
    Languages
    English
    Production status
    Newly created-finished
    Resource usage
    Narrative understanding and comprehension; dynamics of radicalization
    License
    Creative Commons CC-BY 4.0
    Conditions of use
    Attribution
    Description
    The N2 (Narrative Networks) Corpus is a collection of 100 stories, comprising approximately 42,000 words, most originally in Arabic but all translated into English. The corpus contents are all texts that Islamist extremists have produced, or texts that are often referenced by them. These include: personal narratives gathered from internet forums; press releases describing bombings and attacks by extremist groups in Afghanistan; articles containing stories included in al-Qaeda propaganda materials (the Inspire magazine); and religious stories (Hadith and Sirah) often referenced by extremist groups. Every text in the corpus is a story. Also, every text in the corpus has been annotated for 14 layers of syntax and semantics, including: referring expressions and co-reference; events, time expressions, and temporal relationships; semantic roles; and word senses. In cases where automatic analyzers are not available to do near-perfect annotations, layers were double-annotated and adjudicated by trained annotators.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    The Norwegian Dependency Treebank
    Resource type
    Corpus
    Size
    614000 tokens
    Languages
    Norwegian Bokmål (nob) Norwegian Nynorsk (nno)
    Production status
    Newly created-finished
    Resource usage
    PoS-tagging, parsing
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    The Norwegian Dependency Treebank encompasses treebanks for both written standards of Norwegian (Bokmål and Nynorsk). It is the result of a two-year project conducted at the National Library of Norway (Språkbanken) and was finished at the start of 2014. At present it contains 311000 tokens of Norwegian Bokmål and 303000 tokens of Norwegian Nynorsk. These have been annotated morphologically and syntactically by trained linguists.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Translation errors
    Resource type
    Corpus
    Size
    300 sentences
    Languages
    Portuguese (por)
    Production status
    Newly created-finished
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    CreativeCommons
    Conditions of use
    <Not Specified>
    Description
    We have created a corpus consisting of automatic translations produced by two widely used translation engines (Google Translator and Moses) in three different scenarios representing different challenges in the translation from English to European Portuguese. This corpus was annotated with translation errors according to a taxonomy defined by us.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    UM-Corpus
    Resource type
    Corpus
    Size
    2 million sentences
    Languages
    <Not Specified>
    Production status
    Newly created-finished
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    Creative Commons Non-Commercial 3.0 Licenses
    Conditions of use
    Non-Commercial
    Description
    The corpus contains more than 10 million English-Chinese (E-C) parallel sentences in total, of which more than 2 million training sentences and 5,000 testing sentences are released to the public for free. Unlike previous work, the corpus is designed to embrace eight different domains (News, Novels, Laws, Thesis, Educational Materials, Science, Speech/Subtitles, and Microblog). Some of them are further categorized into different topics. The corpus has been released to the research community under the Creative Commons Non-Commercial 3.0 Licenses.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    USAGE Corpus
    Resource type
    Corpus
    Size
    1200 reviews
    Languages
    English (eng) German (deu)
    Production status
    Newly created-finished
    Resource usage
    Sentiment Analysis
    License
    Open Data Commons Attribution License (ODC-By) v1.0
    Conditions of use
    <Not Specified>
    Description
    Annotations of German and English Amazon reviews, aspects, evaluating subjective phrases, their polarity and their relations. To retrieve the Amazon texts themselves, a crawler is made available. The reviews themselves are not part of this data publication.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    UW Bio-NLP X-ray Event Corpus
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    American English (eng)
    Production status
    Newly created-in progress
    Resource usage
    <Not Specified>
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    A set of x-ray text snippets annotated with change-of-state events
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Uppsala Persian Dependency Treebank
    Resource type
    Treebank
    Size
    151,671 tokens
    Languages
    Iranian Persian (pes)
    Production status
    Existing-updated
    Resource usage
    Syntactic parsing
    License
    Creative Commons
    Conditions of use
    Attribution
    Description
    The Uppsala Persian Dependency Treebank (UPDT) is a dependency-based syntactically annotated corpus for Persian. The treebank consists of 6000 sentences (151,671 tokens) of written text in CoNLL format and has been developed through a bootstrapping procedure. All dependency relations used in the annotation, together with the guidelines for sentence segmentation, tokenization, and morphological annotation, are described in detail in the Uppsala Persian Dependency Treebank Annotation Guidelines. The annotation guidelines are written in English.
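A treebank distributed in CoNLL format can typically be loaded with a few lines of code. The sketch below assumes the ten-column CoNLL-X layout (one token per line, tab-separated fields, blank line between sentences); the exact variant used by a given treebank should be checked against its documentation. The sample tokens and labels are invented placeholders, not treebank data.

```python
# Field names of the CoNLL-X format (assumed variant).
CONLL_X_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                  "feats", "head", "deprel", "phead", "pdeprel"]


def read_conll(text):
    """Split CoNLL-X text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token = dict(zip(CONLL_X_FIELDS, line.split("\t")))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])  # 0 marks the root
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Hypothetical two-sentence sample with placeholder tags and relations.
SAMPLE = (
    "1\tShe\t_\tPRON\tPRON\t_\t2\tnsubj\t_\t_\n"
    "2\treads\t_\tV\tV\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tYes\t_\tADV\tADV\t_\t0\troot\t_\t_\n"
)
sentences = read_conll(SAMPLE)
token_count = sum(len(sentence) for sentence in sentences)
```

Counting sentences and tokens this way gives a quick check against the published figures for a downloaded copy.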
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Uyghur-Kazakh One-to-One Mapping Bilingual Dictionary
    Resource type
    Lexicon
    Size
    50,000 entries
    Languages
    Uighur (uig) Kazakh (kaz)
    Production status
    Newly created-in progress
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    This resource is a one-to-one mapping bilingual dictionary for the Uyghur (Uighur) and Kazakh language pair. It is the experimental result of automatic induction from Chinese-Uyghur and Chinese-Kazakh bilingual dictionaries, using the constraint approach proposed in the corresponding submission.
    = Resource Detail =
    File type: Microsoft Excel (.xlsx)
    File size: 1.78MB
    Scale: 50,000 translation pairs (one-to-one mapping)
    Accuracy: About 83% (human confirmed)
    Script: Arabic-based Uyghur and Kazakh scripts
    Encoding: Unicode
    Provider: Ishida&Matsubara Laboratory, Department of Social Informatics, Kyoto University.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    VOCE Corpus
    Resource type
    Corpus
    Size
    638 minutes
    Languages
    Portuguese (por)
    Production status
    Newly created-in progress
    Resource usage
    Emotion Recognition/Generation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    The VOCE corpus consists of a collection of 38 raw recordings, adding up to 78 min of Baseline, 73.6 min of Experiment and 487 min of Event free speech, with accompanying metadata (demographic and health questionnaires). The recordings are annotated by individual appraisal of stress based on self-reports and physiological measures, whereby the first validate that participants are experiencing stress and the latter provide fine-grained annotation of the speech. Speakers are 38 students from the University of Porto, aged 19 to 49.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Valency Lexicon of Czech Verbs (VALLEX 2.6)
    Resource type
    Lexicon
    Size
    2730 lexemes
    Languages
    Czech (ces)
    Production status
    Existing-used
    Resource usage
    linking lexicons
    License
    Creative Commons 3.0-BY-NC-SA
    Conditions of use
    Attribution, NonCommercial, ShareAlike
    Description
    The Valency Lexicon of Czech Verbs, Version 2 (VALLEX 2.x), is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. VALLEX 2.x has been developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague. VALLEX 2.x is a successor of VALLEX 1.0, extended in both theoretical and quantitative aspects. VALLEX 2.x provides information on the valency structure (combinatorial potential) of verbs in their particular senses. VALLEX is closely related to the Prague Dependency Treebank project: both of them use Functional Generative Description (FGD), developed by Petr Sgall and his collaborators since the 1960s, as the background theory. In VALLEX 2.x, there are roughly 2,730 lexeme entries containing together around 6,460 lexical units ("senses"). Note that VALLEX 2.x - according to FGD, but unlike traditional dictionaries and also unlike VALLEX 1.0 - treats a pair of perfective and imperfective aspectual counterparts as a single lexeme (if perfective and imperfective verbs were counted separately, the size of VALLEX 2.x would virtually grow to 4,250 verb entries). To ensure high quality of the data, all VALLEX entries have been created manually, using several previously existing lexicons as well as corpus evidence from the Czech National Corpus.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Valency Lexicon of Czech Verbs (VALLEX 2.6)
    Resource type
    Lexicon
    Size
    2730 lexemes
    Languages
    Czech
    Production status
    Existing-updated
    Resource usage
    Lexicon extension
    License
    Creative Commons 3.0-BY-NC-SA
    Conditions of use
    Attribution, Non-Commercial, ShareAlike
    Description
    The Valency Lexicon of Czech Verbs, Version 2 (VALLEX 2.x), is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. VALLEX 2.x has been developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague. VALLEX 2.x is a successor of VALLEX 1.0, extended in both theoretical and quantitative aspects. VALLEX 2.x provides information on the valency structure (combinatorial potential) of verbs in their particular senses. VALLEX is closely related to the Prague Dependency Treebank project: both of them use Functional Generative Description (FGD), developed by Petr Sgall and his collaborators since the 1960s, as the background theory. In VALLEX 2.x, there are roughly 2,730 lexeme entries containing together around 6,460 lexical units ("senses"). Note that VALLEX 2.x - according to FGD, but unlike traditional dictionaries and also unlike VALLEX 1.0 - treats a pair of perfective and imperfective aspectual counterparts as a single lexeme (if perfective and imperfective verbs were counted separately, the size of VALLEX 2.x would virtually grow to 4,250 verb entries). To ensure high quality of the data, all VALLEX entries have been created manually, using several previously existing lexicons as well as corpus evidence from the Czech National Corpus.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Vystadial 2013 – Czech data
    Resource type
    Corpus
    Size
    1.5 GByte
    Languages
    Czech (ces)
    Production status
    Newly created-finished
    Resource usage
    Speech Recognition/Understanding
    License
    Creative Commons (CC-BY-SA 3.0)
    Conditions of use
    Attribution, ShareAlike
    Description
    Dataset of telephone conversations (audio and transcriptions) in Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Vystadial 2013 – English data
    Resource type
    Corpus
    Size
    2.6 GByte
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Speech Recognition/Understanding
    License
    Creative Commons (CC-BY-SA 3.0)
    Conditions of use
    Attribution, ShareAlike
    Description
    Dataset of telephone conversations (audio and transcriptions) in English, developed for training acoustic models for automatic speech recognition in spoken dialogue systems.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    WMT12
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    English (eng) German (deu) French (fra)
    Production status
    Existing-used
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    <Not Specified>
    Conditions of use
    <Not Specified>
    Description
    Parallel Corpora, NewsCommentary
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    WMT12 Data
    Resource type
    Corpus
    Size
    35 MByte
    Languages
    English (eng) Czech (ces) Spanish (spa) French (fra) German (deu)
    Production status
    Existing-used
    Resource usage
    Evaluation
    License
    Unspecified
    Conditions of use
    <Not Specified>
    Description
    <Not Specified>
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    WMT13 Data
    Resource type
    Corpus
    Size
    56 MByte
    Languages
    English (eng) Spanish (spa) Russian (rus) Czech (ces) German (deu) French (fra)
    Production status
    Existing-used
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    Unspecified
    Conditions of use
    <Not Specified>
    Description
    <Not Specified>
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    Walenty
    Resource type
    Lexicon
    Size
    8587 entries
    Languages
    Polish (pol)
    Production status
    Newly created, further developed
    Resource usage
    Parsing; also human use
    License
    Creative Commons BY-SA
    Conditions of use
    Attribution, ShareAlike
    Description
    See the submitted paper and http://zil.ipipan.waw.pl/Walenty.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    WordNet RDF
    Resource type
    Lexicon
    Size
    210,772 lexemes
    Languages
    English (eng)
    Production status
    Newly created-finished
    Resource usage
    Semantic Web
    License
    Princeton WordNet License
    Conditions of use
    <Not Specified>
    Description
    Export of WordNet in lemon and RDF
    Download from
    Referring paper
    Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
    Edition
    LDL 2014
  • Name
    caWaC
    Resource type
    Corpus
    Size
    779,086,559 tokens
    Languages
    Catalan (cat)
    Production status
    Newly created-finished
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    CC-BY-SA 3.0
    Conditions of use
    Attribution, ShareAlike
    Description
    <Not Specified>
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    filesRo.zip
    Resource type
    Corpus
    Size
    689 KByte
    Languages
    Romanian (ron) Italian (ita) Spanish (spa) Portuguese (por) French (fra) Turkish (tur)
    Production status
    Newly created-finished
    Resource usage
    cognates
    License
    OpenSource
    Conditions of use
    <Not Specified>
    Description
    We provide an archive containing automatically extracted cognate pairs and word-etymon pairs for Romanian words and five other languages: French, Italian, Spanish, Portuguese and Turkish. We ran our experiments on the Romanian vocabulary provided by the dexonline machine-readable dictionary (http://dexonline.ro). The "cognates" folder contains one file for each language L, listing the pairs of cognates shared between L and Romanian. The "etymons" folder contains one file for each language L, listing word-etymon pairs for Romanian words with an etymon in L.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    par-lvf
    Resource type
    Lexicon
    Size
    <Not Specified>
    Languages
    French (fra)
    Production status
    Newly created-finished
    Resource usage
    Parsing
    License
    LGPL-LR
    Conditions of use
    <Not Specified>
    Description
    This is an extended version of the exLVF lexicon where examples are unfolded, corrected and parsed with a second parser.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    slTwitterCorpus
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    Slovenian (slv)
    Production status
    Newly created-finished
    Resource usage
    <Not Specified>
    License
    CC-BY-SA 3.0
    Conditions of use
    Attribution, ShareAlike
    Description
    <Not Specified>
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    tweet-norm_es
    Resource type
    Corpus
    Size
    <Not Specified>
    Languages
    Spanish (spa)
    Production status
    Newly created-finished
    Resource usage
    Microtext normalization
    License
    CC-BY
    Conditions of use
    Attribution
    Description
    Corpus of annotated tweets for lexical normalization in Spanish. Two collections have been generated, a development corpus and a test corpus, each consisting of 600 tweets. A total of 775 and 724 out-of-vocabulary (OOV) words were manually annotated in the development and test corpora, respectively.
    Download from
    Referring paper
    Edition
    LREC 2014
  • Name
    wiki_zh_ja_corpus
    Resource type
    Corpus
    Size
    126,811 sentences
    Languages
    Mandarin Chinese (cmn) Japanese (jpn)
    Production status
    Newly created-finished
    Resource usage
    Machine Translation, SpeechToSpeech Translation
    License
    CreativeCommons
    Conditions of use
    <Not Specified>
    Description
    This resource is a Chinese-Japanese parallel corpus automatically extracted from Wikipedia.
    Download from
    Referring paper
    Edition
    LREC 2014

Important Dates

  • 24 October 2013: Abstract, Workshop, Tutorial and Panel submission
  • 27 November 2013: Notification of acceptance for Workshops & Tutorials
  • 31 January 2014: Notification of accepted papers
  • 22 March 2014: Final Submission of accepted papers
  • 6 April 2014: Submission of workshop proceedings
  • 28 - 29 - 30 May 2014: Main Conference
  • 26 and 27 May 2014: Pre-conference workshops and tutorials
  • 31 May 2014: Post-conference workshops and tutorials
