After the conference, the set of Shared LRs from LREC 2014 was manually checked, and a cleaned version of the list is now available. The list includes LRs that meet the following criteria:

LRs accessible (either when uploaded by the participants or when they provide an external URL for downloading the data)

LRs categorized as Datasets only. A dataset can be a:

Corpus,
Grammar/Language Model,
Ontology,
Terminology,
Treebank,
Evaluation Data / Package.

Excluded LRs are:

LRs uploaded when the content did not correspond to the description
LRs with no download URL provided or URL now a dead link
LRs categorized as tools or guidelines
LRs associated with rejected papers.

We added a new field to the metadata: “Conditions of use”. Its value indicates specific conditions of use provided by the submitter (such as Attribution, Non-commercial use, Share Alike, etc.).
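
The inclusion and exclusion criteria above can be read as a simple filter over resource records. The following is only a minimal sketch under stated assumptions: the `Resource` class and all of its field names are hypothetical, and the type set merely mirrors the categories named above (the check actually applied by the organizers may have been broader).

```python
from dataclasses import dataclass

# Dataset types named in the criteria above (illustrative; the real
# check may have accepted further types).
DATASET_TYPES = {
    "Corpus",
    "Grammar/Language Model",
    "Ontology",
    "Terminology",
    "Treebank",
    "Evaluation Data / Package",
}


@dataclass
class Resource:
    # All field names are hypothetical, chosen for illustration only.
    name: str
    resource_type: str
    has_working_download: bool         # uploaded file or live external URL
    content_matches_description: bool  # upload corresponds to its description
    paper_accepted: bool               # the associated paper was not rejected


def is_listed(resource: Resource) -> bool:
    """Return True if the resource passes every criterion stated above."""
    return (
        resource.has_working_download
        and resource.content_matches_description
        and resource.paper_accepted
        and resource.resource_type in DATASET_TYPES
    )
```

A resource with a dead download link, or one categorized as a tool, would be rejected by `is_listed`.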

Shared-LRs @ LREC 2014

Name	A Colloquial Corpus of Japanese Sign Language
Resource type	Corpus
Size	3500 GByte
Languages	<Not Specified>
Production status	Newly created-in progress
Resource usage	Dialogue
License	<Not Specified>
Conditions of use	<Not Specified>
Description	We began building a corpus of Japanese Sign Language (JSL) in April 2011 with the support of the Japan Society for the Promotion of Science. The purpose of this project was to increase awareness of sign language as a distinctive language in Japan. This corpus is beneficial not only to linguistic research but also to hearing-impaired and deaf individuals, as it helps them to recognize and respect their linguistic differences and communication styles. This is the first JSL corpus developed for academic and public use. During the first stage of this project, from May to July 2012, we filmed 40 deaf subjects in two prefectures, Gunma and Nara, which are located about 50–100 km from Tokyo and Osaka, respectively. Each prefecture has one school for the deaf. We obtained data from an age-balanced sample of individuals 30–70 years of age in each prefecture, and each age group was divided into same-sex pairs. We used three approaches to collect data: interviews (for introductory purposes only), dialogues, and lexical elicitation. Each session, including our explanation of the ethical considerations and subjects’ provision of written consent, lasted 1.5 h.
Download from	http://research.nii.ac.jp/jsl-corpus/en/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/278.html
Edition	LREC 2014

Name	A Repository of State of the Art and Competitive Baseline Summaries for DUC 2004
Resource type	Corpus
Size	225 KByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Summarisation
License	<Not Specified>
Conditions of use	Attribution
Description	No description provided, see the related article
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_1093_res_1.gz [305.54 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1093.html
Edition	LREC 2014

Name	AcadOnto
Resource type	Ontology
Size	<Not Specified>
Languages	English (eng)
Production status	Newly created-in progress
Resource usage	Information Extraction, Information Retrieval
License	<Not Specified>
Conditions of use	<Not Specified>
Description	An academic domain ontology populated using IIT Bombay organization corpus, web and the linked open data.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_251_res_1.gz [803.74 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/251.html
Edition	LREC 2014

Name	Aix Map Task
Resource type	Corpus
Size	<Not Specified>
Languages	French (fra)
Production status	Existing-used
Resource usage	Dialogue
License	<Not Specified>
Conditions of use	<Not Specified>
Description	This is a corpus of audio and video recordings of task-oriented dialogues. It was modelled after the original HCRC Map Task corpus. Lexical material was designed for the analysis of speech and prosody. The corpus was collected under two communicative conditions, one audio-only condition and one face-to-face condition. The recordings took place in a studio and a sound attenuated booth respectively, with head-set microphones (and in the face-to-face condition with two video cameras). The recordings have been segmented into Inter-Pausal-Units and transcribed using transcription conventions containing actual productions and canonical forms of what was said.
Download from	http://sldr.org/sldr000732
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/719.html
Edition	LREC 2014

Name	Alignment of Parallel Texts from Cyrillic to Latin
Resource type	Corpus
Size	<Not Specified>
Languages	Romanian (ron)
Production status	Newly created-in progress
Resource usage	Transliteration
License	GNU
Conditions of use	You can redistribute it and/or modify it under the terms of the GNU General Public License, i.e., your modified versions must carry all the freedoms stated in the GPL
Description	The text of the novel Sania (eng. The Sledge) served as a training corpus. It was written in 1955 by Ion Druță and printed originally in Cyrillic scripts. We have followed a special previously developed technology of recognition and specialized lexicons. In such a way, we have obtained the electronic version of Cyrillic script variant of the text. On the other hand, we did the same procedure with Latin script variant of the same text, transliterated manually by expert linguists. It permitted us to make an automatic aligning of Cyrillic variant of the text to contemporary Latin variant of the same text at the word/expression level. The process was semi-automated, based on the heuristics for transcription of letters and the expert linguists’ validation. The corpus is annotated at sentence and word levels, providing morpho-lexical information using UAIC Romanian Part of Speech Tagger (Simionescu, 2011).
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_328_res_1.gz [61.84 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/328.html
Edition	LREC 2014

Name	An Arabic Twitter Corpus for Subjectivity and Sentiment Analysis
Resource type	Corpus
Size	26,724 tokens, 7,503 sentences
Languages	Arabic
Production status	Newly created-in progress
Resource usage	Text Mining
License	Open Source
Conditions of use	Attribution
Description	An Arabic Twitter data set of 7,503 tweets. The released data contains manual Sentiment Analysis annotations as well as automatically extracted features, saved in Comma Separated (CSV) and Attribute-Relation File Format (ARFF) file formats. Due to Twitter privacy restrictions, we replaced the original tweet with its ID.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_317_res_1.gz [1.51 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/317.html
Edition	LREC 2014

Name	Annotation of Syntactic Categories in Chinese Word Structures
Resource type	Corpus
Size	2.3 MByte
Languages	Mandarin Chinese (cmn)
Production status	Newly created-finished
Resource usage	Chinese word segmentation, POS-tagging, parsing
License	<Not Specified>
Conditions of use	<Not Specified>
Description	An Automatic Annotation of Syntactic Categories of Chinese Word Structures. The file contains annotated Chinese word structures with refined syntactic categories induced by the algorithms in the paper. Each line of the file contains the annotated word structure of a word type, which is a binarized tree in Penn Treebank format. The root of the tree is the POS tag of the word, which can be used for POS tagging and syntactic parsing. Each node has the format "Tag1ZZTag2ZZ..Tagn", in which "ZZ" is a delimiter between the multiple tags of the node. Such combinations of multiple syntactic tags are used as the category of the nodes that represent word constituents (characters, subwords). The final annotation is publicly available at: http://www.sfs.uni-tuebingen.de/~jma/word_str.txt Our annotation uses two inputs: the POS-tagged sentences in Penn Chinese Treebank and the branching and head directions in the word structure in http://ir.hit.edu.cn/ mszhang/data.zip.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_1158_res_1.gz [393.02 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1158.html
Edition	LREC 2014
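
The node label format described in the entry above ("Tag1ZZTag2ZZ..Tagn") can be split back into its individual tags. This is a minimal sketch, assuming that no tag itself contains the delimiter "ZZ"; the function name `split_node_label` and the sample label are illustrative and not taken from the resource.

```python
from __future__ import annotations


def split_node_label(label: str, delimiter: str = "ZZ") -> list[str]:
    """Split a combined node label such as 'Tag1ZZTag2' into its tags.

    Assumes that no individual tag contains the delimiter itself.
    """
    return [tag for tag in label.split(delimiter) if tag]


# Illustrative label only; not copied from the distributed file.
print(split_node_label("NNZZVV"))  # ['NN', 'VV']
```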

Name	Arabic Tweets NER test set
Resource type	Evaluation Data
Size	<Not Specified>
Languages	Arabic Colloquial Arabic
Production status	Newly created-finished
Resource usage	Named Entity Recognition
License	Research License
Conditions of use	Research only
Description	No description provided, see the related article
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_186_res_1.gz [254.37 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/186.html
Edition	LREC 2014

Name	AuCoPro - Splitting
Resource type	Lexicon
Size	746 KByte
Languages	Afrikaans Dutch
Production status	Newly created-finished
Resource usage	Language Modelling
License	Creative Commons Attribution 3.0 Unported
Conditions of use	Attribution
Description	The AuCoPro-Splitting dataset contains compounds annotated with their compound boundaries and linking morphemes. The dataset consists of two files, one for Afrikaans and one for Dutch. The annotation was performed according to annotation guidelines as described in Verhoeven, van Zaanen, van Huyssteen, & Daelemans (2014).
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_66_res_1.gz [215.42 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/66.html
Edition	LREC 2014

Name	BabelNet 2.0 as Linked Data
Resource type	Linked Data
Size	1 billion RDF triples
Languages	English (eng) French (fra) Italian (ita) German (deu) Spanish (spa) CA, IS, PL, RO, AF, AR, BG, CS, CY, DA, EL, ET, FA, FI, GA, HE, HI, HR, HU, ID, JA, KO, LT, LV, MS, NL, NO, PT, RU, SK, SL, SQ, SR, SV, SW, TL, TR, UK, VI, ZH, MT, EU, EO, GL, LA
Production status	Newly created-finished
Resource usage	Semantic Web
License	Creative Commons Attribution-Noncommercial-Share Alike 3.0 License (http://creativecommons.org/licenses/by-nc-sa/3.0/)
Conditions of use	Attribution, Non-Commercial, ShareAlike
Description	This resource corresponds to the publication of BabelNet 2.0 as Linked Data. BabelNet 2.0 is both a multilingual encyclopedic dictionary, with lexicographic and encyclopedic coverage of terms, and an ontology which connects concepts and named entities in a very large network of semantic relations, made up of more than 9 million entries. BabelNet 2.0 covers 50 languages and is obtained from the automatic integration of different lexical-semantic and encyclopedic resources. Its conversion to Linked Data is based on lemon, a Lexicon Model for Ontologies, complemented by SKOS. The RDF version of BabelNet contains approximately 1 billion triples.
Download from	http://babelnet.org/download.jsp
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/810.html
Edition	LREC 2014

Name	Basque Postedition corpus
Resource type	Corpus
Size	50.204 words
Languages	Basque (eus)
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	CC-BY-SH
Conditions of use	Attribution, ShareAlike
Description	Corpus of raw and manual post-edited translations (50.204 words). It was created by manual post-editing of the Basque outputs given by Matxin RBMT system translating 100 entries from the Spanish Wikipedia.
Download from	http://ixa2.si.ehu.es/glabaka/OmegaT/OpenMT-OmegaT-CS-TM.zip [27.55 MB]
Referring paper	Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
Edition	SaLTMiL 2014

Name	Benchmark Database of Phonetic Alignments
Resource type	Evaluation Data
Size	750 alignments
Languages	Jaqaru, Hǎrbīn, Zutendaal (BeLb), Château-d'Oex, Cerlatez, Zheglica (vid), Sadina (pop), Halden, Gradec (vd), Stambolovo (hask), Obdorsk Khanti, Nordstrand, Dàshí, French, Vasiljovo (tetev), Tihomir (krgr), Gabare (bslat), Gōngxìng, Boudry, Courtepin, Arnex, Xiānggǎng, Koprivshtica (pird), Merdanja (vtarn), Xiāngtàn, American English, Savièse, Zhelen (svog), Sandnessjøen, Garvan (sil), Conthey, Bogdanov dol (pern), Asparuhovo (lom), Jian'ou, Pelimka Mansi, Asserøy, Serbian, Schinnen (Lb), Aduard (Gn), Shipka (kaz), Zhōuchéng, Gjøra, Stange, Occitan, Brouwershaven (Ze), Landeron, Dragoevo (presl), Meerbeek (BeBr), Tiānjìn, Nova lovcha (gd), Grône, Mørkved, Kostenec (iht), Sredec (zlgr), Glozhene (orjah), New Zealand English (Auckland), Dolna beshovica (vrach), Dorkovo (velgr), Savagnier, Ruzhinci (belgr), Jīnxīng, Karaisen (pavl), Měixiàn, Rani lug (tryn), Vaugondry, Panagjurishte (gd), Kalipetrovo (sil), Naas, Eide, Kaldfarnes, Tihomirovo (stzag), Flåte, Smolsko (pird), Starmen (bel), Valthermond (Dr), Zelenigrad (tran), Moorslede (BeWv), Dompierre, Ameide (ZH), Plagne, Varsseveld (Gl), Momina banja (pl), Svetlina (topgr), Belgian Dutch, Zhèngzhōu, Dolna melna (tran), Vakh Khanti, Momkovo (svgr), Dichin (vtarn), Dolna srudena (bel), Kopilovci (mont), Huancané, Bryne, Le Sentier, Wénzhōu, Diva slatina (mont), Italian, Australian English (Perth), Chernogorovo (paz), Dolno levski (pan), Chevroux, Likrisovskoje Khanti, Sushica (blgr), Zanozhene (berk), Ligurian, Schagen (NH), Vartovskoje Khanti, Drabishna (ivgr), Luòběnzhuō, Shiroki dol (sam), Caraz, Dragodanovo (sliv), Javorovo (asgr), Canadian English, Lánzhōu, Dobroselec (topgr), Golema rakovica (elpel), Gabra (elpel), Hohhot, Shuri, Brielle (ZH), Voden (elh), Marikostinovo (petr), Taquile, Tūnxī, Dutch (Limburg), Côte-aux-Fées, Knokke (BeWv), Bjørnevatn, Radovene (vr), Nigerian English (Igbo), Korten (nzag), Dermanci (luk), Bistrica (blgr), Shtipsko (prov), German, Opan (stzag), Dragojchinci (kjust), Chéngdū, Nikolovo(lipnik) (rus), Yúnlóng, Central German (Murrhardt), Tena, Collombey, Lozen (sof), Sado, Kyoto, Stanghelle, Shiroka laka (dev), Lillehammer, Nánjīng, Stakevci (blgr), High German (Graubuenden), Kortrijk (BeWv), Wierum (Fr), Chimborazo, Bansko (razl), West Frisian (Grou), Tromsø, Longirod, Oostende (BeWv), Ēnqī, Huancavelica, Orsières, Zheljazkovo (sred), Malomirovo (elh), Courtedoux, Low German (Achterhoek), Rabisha (belgr), English (Lindisfarne), Hachijō, Shāntóu, Huhla (ivgr), Montpreveyres, Bollezeele (FrVl), Prahins, Galata (tetev), Selfors, Verkhne Kalimsk Khanti, Xiángyún, Senokos (blgr), Norwegian (Stavanger), Danish, Krivnja (razgr), Goljama zheljazna (tetev), Sugiez, English (Liverpool), Avry-sur-Matran, Swedish (Stockholm), Tokyo, Gorni varpishta (drjan), Mussel (Gn), Sombeval, Momchilovci (smol), Mandal, Trondheim, West-Terschelling (Fr), Svirkovo (harm), Russian, Zamfirovo (berk), Vladinja (lov), Fagerhaug, Kramolin (sevl), Low German (Bargstedt), Plakovo (vt), Nova nadezhda (hs), Sint-Annen (Gn), Buurmalsen (Gl), English (London), Héfēi, Renkum (Gl), Markovo (shum), Cerneux-Péquignot, Vranilovci (gabr), Central German (Cologne), Gorna rosica (sevl), Sekirovo (plov), Lombard (East), Mouscron (Be), Eggesbøneset, Cajamarca, Lower Lozva Mansi, Sùzhōu, Straldzha (jamb), Midsland (Fr), Buren (Fr), High German (Bodensee), Vullierens, Furen (vrach), Foldvik, Brunlanes, Amami, Vrachesh (botgr), Vermes, Hǎikǒu, Venetian, Scottish, Nieurlet (FrVl), Razboishte (god), Brekke, Zabardo (asgr), Chavín, Nijmegen (Gl), Guǎngzhōu, Stjørdal, Central German (Honigberg), Devenci (luk), Murist, Tjøtta, Chepelare (asgr), Běijīng, Ezerovo (parvom), Arconciel, Icelandic, Leermens (Gn), Sloten (Fr), Inkawasi, Ter Apel (Gn), Sherkali Khanti, Den Hoorn (NH), Stroevo (plov), Collex, Sørkjosen, Brønnøysund, Charnex, Merichleri (chirp), Pelatikovo (kjust), Zabernovo (mt), Yiddish (New York), Pavelsko (asgr), Osenec (razgr), Middle Lozva Mansi, Ivanski (shum), Omarchevo (nzag), Wateringen (ZH), Hundvåg, Zdravkovec (gabr), Koedijk (NH), Vallorbe, Petarnica (plev), Tremjugan Khanti, Champéry, North Mansi, Pevec (targ), Oruro, Orvin, Molde, Lourtier, Almkerk (NB), Silvolde (Gl), Kalojanovo (plov), Kantens (Gn), Cerovica (kjust), Dutch, High German (North Alsace), Lomnes, Zhaltusha (ard), Saparevo (dup), Dolna dikanja (radom), Slaveino (smol), Zevenaar (Gl), Kōchi, Govedarci (sam), Yīnchuàn, Nánchàng, Dragizhevo (vtarn), Skobelevo (sliv), Ürümqi, Nánníng, English (North Carolina), Commugny, Asparuhovo (prov), Borre, Balgari (carev), Ayent, Apolobamba, Dutch (Antwerp), Pozharevo (tutr), Lánpíng, Jìnán, Caparevo (sand), Ladino, Hisøy, Holwerd (Fr), Elnesvågen, Smochevo (dupn), Nikolovo (hask), Hvojna (asgr), Ustovo (sm), Vernier, Kerkrade (Lb), Sucre, Smilde (Dr), Bilzen (BeLb), Dobrotino (gd), Poperinge (BeWv), Borisovo (elh), Momina klisura (pz), Hángzhōu, Tiwanaku, Fyresdal, Ognen (karn), High German (Walser), Trancovica (nik), Kozichino (pom), Xi'an, Villars-le-Terroir, Paskalevec (pavl), Dokka, Sætre, Nizjam Khanti, St-Gingolph, Lamboing, Ormont-Dessus, Kazim Khanti, Central German (Luxembourg), Vabel (nik), Dobroslavci (sof), English, Indzhe vojvoda (mtarn), Xīníng, Mihalci (pavl), Trastenik (plev), Kovachevci (sam), Golica (varn), Jiànchuàn, Tynset sentrum, Izvorovo (harm), Upper Demjanka Khanti, Støa, Krogtoft, Borge, Scheveningen (ZH), Callantsoog (NH), Táoyuán, Mugla (dev), Grimentz, Aldomirovci (slivnica), High German (Herrlisheim), Táiběi, Hafrsfjord, Enina (kaz), Bagrenci (kjust), Indian English (Delhi), Undheim, Rakevo (vr), Orkanger, Zheravna (kot), Belica (razl), Martigny, Dinevo (hask), Chernomorec (bs), Koekelare (BeWv), Buchin prohod (god), Stjørdalshalsen, Égà, Develier, Sestrino (petr), Tavda Mansi, Court, Vinishte (mont), Kaspichan (np), Wǔhàn, Beglezh (luk), Nijeholtpade (Fr), Roche, Faroese, Novo selo (trojan), Ingen (Gl), Évolène, Kreta (vrach), Fyllingsdalen, High German (Ortisei), Vardun (targ), Noiraigue, Hèqìng, Lies (Fr), Jīnmǎn, Fùzhōu, Banishte (brezn), Proto-Germanic, Tuōluò, Corongo, Garmen (gd), Bachkovo (asgr), Bygstad, Montbovon, High German (Tuebingen), Topolchane (sliv), L'Auberson, Kagoshima, Dobarsko (razl), Corcelles, Swedish (Skane), Chukovec (radom), Huancayo, Brashljan (mtarn), Stoilovo (mt), High German (Biel), South African English (Johannisburg), Shèxiàn, Semsalves, Podvis (karn), Dolni bogrov (sof), Miyako, English (Tyrone), Kolju marinovo (chirp), Lombard (West), Workum (Fr), Gouderak (ZH), Shànghǎi, Vojnjagovo (karl), Malyj Jugan Khanti, Nendaz, Lipnica (botgr), Goljamo shivachevo (sliv), Oki, Kristiansand, Mǎzhělóng, Chaux-du-Milieu, Dovre, Sinja Khanti, Nazareth (BeOv), Czech, Chángshà, Levunovo (sand), Karanovo (ajt), Konda Khanti, Bov (svog), Varbica (presl), Varbovo (blgr), Suhindol (vtarn), Bø, Vaklinovo (gd), Milchina laka (kul), Ganchovec (drjan), Ljubenova mahala (nzag), Cuzco, Raundalen, Fagnastøl, Laconnex, Puno, Dolna riksa (mont), Hermance, Guìyáng, Dutch (Ostend), Divdjadovo (shum), Piershil (ZH), Devesilica (krgr), Liljache (vr), Solishta (dev), Montalchez, Veyrier, Elov dol (pk), Brouckerque (FrVl), Marchaevo (sof), Qingdao, Gega (petr), Varvara (paz), Kùnmíng, Sortland, Øra, Valche pole (svgr), Miège, Belene (svisht), Xiàmén, Vresovo (ajt), Kravenik (sevl), Velkovci (pk), Humbeek (BeBr), Bulgarian, Oudeschoot (Fr), Qīlǐqiáo, English (Singapore), Polish, Walshoutem (BeBr), English (Buckie), Tàiyuán, Russland, Waregem (BeWv), Abbekerk (NH), Ooike (BeOv), Babjak (razl), Zuydcoote (FrVl), Konda Mansi, Konska (brezn), Vinarovo (vid), Noevci (brezn), Fully, Ěryuán, Oosterend (Fr), Vasjugan Khanti, Kawki, Pingyao, Ulft (Gl), Cochabamba, Slavjanovo (plev), Golemo malovo (sliven), Lobosh (rad), Jugan Khanti, Haamstede (Ze), Rakovica (kul)
Production status	Newly created-finished
Resource usage	testing of phonetic alignment algorithms
License	Creative Commons Attribution-NonCommercial 3.0 Unported <http://creativecommons.org/licenses/by-nc/3.0/>
Conditions of use	Attribution, Non-Commercial
Description	Automatic methods for phonetic alignment play an increasingly important role in quantitative approaches to historical linguistics and dialectology. With the "Benchmark Database for Phonetic Alignments" (BDPA), we present a new data resource which offers collections of cognate words from different language varieties. In contrast to other resources which concentrate on questions of cognacy and lexical change, the BDPA represents the data in form of pairwise and multiple alignments. An alignment is a matrix representation of two or more sequences in which corresponding segments in the sequences are placed in the same column, with empty cells resulting from non-matching segments being filled by gap symbols. Currently, the BDPA offers a total of 750 multiple alignments based on 12 different sources of language and dialect varieties.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_299_res_1.gz [668.74 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/299.html
Edition	LREC 2014
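
The definition of an alignment given in the entry above (a matrix whose columns hold corresponding segments, with gap symbols in empty cells) can be illustrated with a toy example. This is a minimal sketch: the segment sequences are made up, and the gap symbol "-" is an assumption, not necessarily the symbol used in the BDPA files.

```python
GAP = "-"  # assumed gap symbol; the BDPA files may use a different one


def alignment_columns(rows):
    """Return the columns of an alignment given as equal-length rows."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows of an alignment must have the same length")
    return [tuple(row[i] for row in rows) for i in range(width)]


# Made-up toy sequences: the second one lacks a match for the middle segment.
toy_alignment = [
    ["k", "a", "t"],
    ["k", GAP, "t"],
]

for column in alignment_columns(toy_alignment):
    print(column)
```

Each printed tuple is one column of the matrix; a column containing the gap symbol marks a segment with no counterpart in the other sequence.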

Name	Bilingual Dictionaries
Resource type	Lexicon
Size	240.000 entries
Languages	English (eng) French (fra) Italian (ita) German (deu)
Production status	Existing-updated
Resource usage	Machine Translation, SpeechToSpeech Translation
License	OpenSource
Conditions of use	<Not Specified>
Description	This resource contains 3 OpenLogos bilingual dictionaries, namely the English-German, the English-French, and the English-Italian dictionaries. In addition to the usual information on part-of-speech, gender, and number for nouns, offered by most dictionaries currently available, OpenLogos bilingual dictionaries have some distinctive features that make them unique: they contain cross-language morphological information (inflectional and derivational), semantico-syntactic knowledge, indication of the head word in multiword units, information about whether a source word corresponds to a homograph, information about verb auxiliaries, alternate words (i.e., predicate or process nouns), causatives, reflexivity, and verb aspect, among others. The strongest characteristic of the dictionaries is the semantico-syntactic knowledge embedded in each entry.
Download from	http://www.l2f.inesc-id.pt/~abarreiro/wiki/index.php?n=Resources.Resources
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1155.html
Edition	LREC 2014

Name	Burst-Annotated Co-Occurrence Network for the Arab Spring Domain
Resource type	Corpus
Size	280 MByte
Languages	American English (eng)
Production status	Newly created-in progress
Resource usage	Knowledge Discovery/Representation
License	CC-BY
Conditions of use	Attribution
Description	A burst-annotated co-occurrence network on the Arab Spring topic, built on top of New York Times article snapshots from the years 2010-2013.
Download from	http://goo.gl/e1OO8S
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1170.html
Edition	LREC 2014

Name	CARTÃO
Resource type	Lexicon
Size	~146k lexemes
Languages	Portuguese (por)
Production status	Existing-used
Resource usage	Word Sense Disambiguation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Lexical-semantic network extracted automatically from three Portuguese dictionaries. Contains ~146k lexical items, connected by ~286k semantic relation instances covering a rich set of types, including synonymy, hypernymy, several types of meronymy, causation and purpose.
Download from	http://ontopt.dei.uc.pt/index.php?sec=downloads
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/46.html
Edition	LREC 2014

Name	Chinese Open Wordnet
Resource type	Lexicon
Size	<Not Specified>
Languages	Mandarin Chinese (cmn)
Production status	Newly created-in progress
Resource usage	Word Sense Disambiguation
License	CreativeCommons Attribution (CC BY)
Conditions of use	Attribution
Description	We are creating a large scale, freely available, semantic dictionary of Mandarin Chinese: the Chinese Open Wordnet, inspired by the Princeton WordNet and the Global WordNet Grid. All relations (hypernyms, meronyms ...) come from Princeton WordNet 3.0. We have enriched the synsets with Chinese lexical units.
Download from	http://compling.hss.ntu.edu.sg/cow/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/916.html
Edition	LREC 2014

Name	Chinese to Kazakh Bilingual Dictionary
Resource type	Lexicon
Size	52,478 entries
Languages	Mandarin Chinese (cmn) Kazakh (kaz)
Production status	Existing-updated
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	This resource is a one-to-many bilingual dictionary for the Chinese-Kazakh language pair, an experimental resource used in the corresponding submission. Resource detail: File type: Microsoft Excel (.xlsx); File size: 3.36MB; Scale: 52,478 entities, 232,589 translation pairs; Script: Mandarin Chinese, Mandarin Chinese Pinyin, Arabic-based Kazakh; Encode: Unicode; Provider: Ishida&Matsubara Laboratory, Department of Social Informatics, Kyoto University. Format note: Multiple Kazakh translations are separated with "?"
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_417_res_3.gz [3.05 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/417.html
Edition	LREC 2014

Name	Chinese to Uyghur Bilingual Dictionary
Resource type	Lexicon
Size	52,478 entries
Languages	Mandarin Chinese (cmn) Uighur (uig)
Production status	Existing-updated
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	This resource is a one-to-many bilingual dictionary for the Chinese-Uyghur (Uighur) language pair, an experimental resource used in the corresponding submission. Resource detail: File type: Microsoft Excel (.xlsx); File size: 2.84MB; Scale: 52,478 entities, 118,805 translation pairs; Script: Mandarin Chinese, Mandarin Chinese Pinyin, Arabic-based Uyghur; Encode: Unicode; Provider: Ishida&Matsubara Laboratory, Department of Social Informatics, Kyoto University. Format note: Multiple Uyghur translations are separated with "?"
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_417_res_2.gz [2.57 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/417.html
Edition	LREC 2014

Name	Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social (CIEMPIESS)
Resource type	Corpus
Size	17 hours
Languages	Spanish (spa)
Production status	Newly created-finished
Resource usage	Speech Recognition/Understanding
License	Creative Commons Attribution-ShareAlike 4.0 International
Conditions of use	Attribution, ShareAlike
Description	"Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social" (CIEMPIESS) is a new open-source corpus extracted from Spanish-language FM podcasts in the dialect of central Mexico. The CIEMPIESS corpus was designed to be used in the field of automatic speech recognition (ASR). The corpus size is 17 hours, and we provide language models and pronunciation dictionaries for experimentation.
Download from	http://odin.fi-b.unam.mx/CIEMPIESS-UNAM/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/182.html
Edition	LREC 2014

Name	Corpus of Japanese Predicate Phrases for Synonym/Antonym Relations
Resource type	Corpus
Size	507 KByte
Languages	Japanese
Production status	Newly created-in progress
Resource usage	Semantic Similarities between words
License	<Not Specified>
Conditions of use	<Not Specified>
Description	A large human annotated set of predicates for synonym-antonym relations in Japanese. Accompanied by a noun phrase and case information, the data consists of 7,278 pairs of predicates such as “receive-permission (ACC)” vs. “obtain-permission (ACC)”; the relations are categorized as synonyms, antonyms, or unrelated. Antonyms are further categorized into three different classes depending on their aspect of oppositeness.
Download from	http://nlp.ist.i.kyoto-u.ac.jp/index.php?PredicateEvalSet
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/267.html
Edition	LREC 2014

Name	Corpus of Semantic Graphs with associated English strings
Resource type	Corpus
Size	98'818 Graph/string pairs
Languages	American English (eng)
Production status	Newly created-finished
Resource usage	Natural Language Generation, Natural Language Understanding, Machine Translation
License	OpenSource
Conditions of use	<Not Specified>
Description	Automatically generated corpus of 98'818 graph/string pairs.
Download from	http://amr.isi.edu/download/boygirl.tgz [394.71 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1080.html
Edition	LREC 2014

Name	CorpusSVCs
Resource type	Corpus
Size	100 SVC translations evaluated for 5 languages for 2 MT systems (total SVCs evaluated = 1,000)
Languages	English (eng) French (fra) German (deu) Italian (ita) Portuguese (por) Spanish
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	This resource contains a corpus of 100 English support verb construction (SVC) translations and their evaluation for 5 languages: French, German, Italian, Portuguese and Spanish. The translations were performed by 2 MT systems, the GoogleTranslate and the OpenLogos systems. The total number of SVC translations evaluated was 1,000.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_16_res_1.gz [20.21 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/16.html
Edition	LREC 2014

Name	Criminality Corpus for Text Categorization
Resource type	Corpus
Size	260 documents
Languages	Italian (ita)
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	A collection of 260 documents classified in nine different categories, to be used for testing text categorisation systems in the field of criminality.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_941_res_1.gz [359.28 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/941.html
Edition	LREC 2014

Name	Czech Meteor tables
Resource type	Corpus
Size	7.8 MByte
Languages	Czech (ces)
Production status	Existing-used
Resource usage	Paraphrasing
License	GNU LGPL
Conditions of use	<Not Specified>
Description	<Not Specified>
Download from	http://www.cs.cmu.edu/~alavie/METEOR/download/meteor-1.4.tgz [242.55 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/935.html
Edition	LREC 2014

Name	DBnary
Resource type	Lexicon
Size	>1,700,000 entries
Languages	English (eng) French (fra) German (deu) Russian (rus) Japanese (jpn) Bulgarian, Finnish, Greek, Italian, Portuguese, Spanish, Turkish
Production status	Existing-updated
Resource usage	Semantic Web
License	CreativeCommons-by-sa
Conditions of use	Attribution, ShareAlike
Description	12 Wiktionary language editions, available as a LEMON-based lexical resource in RDF, plus attachments of 3.3M translation pairs to the appropriate source word sense.
Download from	http://kaiko.getalp.org/about-dbnary
Referring paper	Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
Edition	LDL 2014

Name	Database of Lexical Simplification Errors
Resource type	Evaluation Data
Size	200 KByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Lexical Simplification
License	CC-BY-SA
Conditions of use	Attribution, ShareAlike
Description	The data described in the paper. A categorisation of the errors occurring during the lexical simplification pipeline.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_479_res_1.gz [119.59 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/479.html
Edition	LREC 2014

Name	Deep Sequoia
Resource type	Corpus
Size	<Not Specified>
Languages	French (fra)
Production status	Existing-updated
Resource usage	Deep parsing
License	LGPL-LR (Lesser General Public License For Linguistic Resources)
Conditions of use	<Not Specified>
Description	The Sequoia treebank contains 3099 sentences in French, annotated with POS, constituency trees and dependency trees. The current submission proposes an extension of the resource: an additional layer of deep syntactic annotations. It was also an occasion to correct the surface dependency trees.
Download from	http://deep-sequoia.inria.fr
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/494.html
Edition	LREC 2014

Name	DeriNet
Resource type	Lexicon
Size	250000 lexemes
Languages	Czech (ces)
Production status	Newly created-in progress
Resource usage	Machine Translation, SpeechToSpeech Translation
License	CreativeCommons
Conditions of use	Attribution, Non-Commercial, ShareAlike
Description	The presented resource is a network that attempts to capture word formation processes in Czech. Technically, it is an oriented graph consisting of nodes representing lexemes and edges that represent derivations (e.g. teach->teacher). The network was built using a combination of existing NLP resources for Czech as well as new annotations.
Download from	http://ufal.mff.cuni.cz/derinet
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/501.html
Edition	LREC 2014

Name	DerivBase.hr
Resource type	Lexicon
Size	<Not Specified>
Languages	Croatian (hrv)
Production status	Newly created-in progress
Resource usage	Various semantic tasks (entailment, SRL) as well as IR/IE tasks
License	Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License
Conditions of use	Attribution, Non-Commercial, ShareAlike
Description	DerivBase.hr is a lexicon of derivationally related Croatian lemmas, induced automatically from the Croatian web corpus hrWaC. The resource has high coverage (100K lemmas) and good quality (81% precision and 77% recall).
Download from	http://takelab.fer.hr/data/derivbasehr
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1090.html
Edition	LREC 2014

Name	DiLAF African languages-French dictionaries
Resource type	Lexicon
Size	1.7 MByte
Languages	Hausa (hau) Central Kanuri (knc) Tamajaq (tmh) Songhai-zarma French (fra)
Production status	Newly created-in progress
Resource usage	Web Services
License	CC BY 3.0
Conditions of use	Attribution
Description	Bilingual dictionaries encoded in XML - Hausa-French dict. for basic cycle, 2008 Soutéba: 7,823 entries; - Kanuri-French dict. for basic cycle, 2004 Soutéba: 5,994 entries; - Tamajaq-French dict. for basic cycle, 2007 Soutéba: 5,205 entries; - Songhai-zarma-French dict. for basic cycle, 2007 Soutéba: 6,916 entries
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/CCURL_9_res_1.gz [1.65 MB]
Referring paper	Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
Edition	CCURL 2014

Name	Diachronic Ontologies from People's Daily
Resource type	Ontology
Size	8.83 MByte
Languages	Mandarin Chinese (cmn)
Production status	Newly created-finished
Resource usage	Word Sense Disambiguation
License	<Not Specified>
Conditions of use	Non-commercial, ShareAlike
Description	1. Language Source Description This diachronic ontology is constructed from fifty years of People's Daily (i.e., from 1947 to 1996). The ontology for each year consists of several concept trees and we only consider words with frequencies not lower than 100 in each year. Numerals, punctuation, non-morpheme words, quantifiers and function words are excluded. In addition, we have subjectively defined eight eras of consecutive years. The ontology for each era only includes words with frequencies over 300 for each period. This language resource is established and maintained by the research group of Dr. Junfeng Hu in ICL Peking University. Updates and modifications will be uploaded and announced on the KLCL website (www.klcl.pku.edu.cn). 2. User License You may use, copy, reproduce, and distribute this ontology for any non-commercial purpose, subject to the restrictions in this license agreement. Some purposes which can be non-commercial are teaching, academic research, public demonstrations and personal experimentation. 3. You may not use or distribute this ontology or any derivative works in any form for commercial purposes. Examples of commercial purposes would be running business operations, licensing, leasing, or selling the ontology, distributing the ontology for use with commercial products, using the ontology in the creation or use of commercial products or any other activity which purpose is to procure a commercial gain to you or others. If you distribute the Ontology or any derivative works of the ontology, you will distribute them under the same terms and conditions as in this license, and you will not grant other rights to the Corpus or derivative works that are different from those provided by this license agreement.
If you have created derivative works of the ontology, and distribute such derivative works, you will cause the modified files to carry prominent notices so that recipients know that they are not receiving the original ontology. Such notices must state: (i) that you have changed the ontology; and (ii) the date of any changes. Copyright (c) Key Laboratory of Computational Linguistics, Peking University. All rights reserved.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_337_res_1.gz [8.45 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/337.html
Edition	LREC 2014

Name	Domain-Specific Gold Standard Translation Set
Resource type	Evaluation Data
Size	200 sentences
Languages	English (eng) Brazilian Portuguese (por)
Production status	Existing-used
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	A set of 40 sentences, in English, all automatically extracted from a dermatology corpus and 4 gold standard translations in Portuguese, carefully translated by a specialist, for each of them. Files are encoded in UTF-8.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_1095_res_2.gz [9.03 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1095.html
Edition	LREC 2014

Name	Dot type corpus with experts
Resource type	Corpus
Size	<Not Specified>
Languages	American English (eng)
Production status	Newly created-finished
Resource usage	Word Sense Disambiguation
License	CC
Conditions of use	<Not Specified>
Description	Two datasets annotated for the container/content and location/organization metonymic alternation by turkers (all items) and experts (items with low agreement by turkers). This is an extension of the sense-annotated corpus published in Martinez Alonso et al., 2013 (ACL).
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_471_res_1.gz [102.42 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/471.html
Edition	LREC 2014

Name	Dutch sentiment lexicon
Resource type	Lexicon
Size	61 KByte
Languages	Dutch (nld)
Production status	Newly created-in progress
Resource usage	Information Extraction, Information Retrieval
License	Creative Commons
Conditions of use	<Not Specified>
Description	Dutch sentiment lexicon: 3013 positive words, 3014 negative words
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/ES3LOD_18_res_1.gz [24.21 KB]
Referring paper	Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
Edition	ES3LOD 2014

Name	EVOCA
Resource type	Corpus
Size	<Not Specified>
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Emotion Recognition/Generation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	No description provided, see the related article
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_136_res_2.gz [743.63 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/136.html
Edition	LREC 2014

Name	English Web Reviews Multiword Expressions Corpus
Resource type	Corpus
Size	55000 words
Languages	English (eng)
Production status	Newly created-finished
Resource usage	<Not Specified>
License	Creative Commons Attribution-ShareAlike 3.0 Unported
Conditions of use	Attribution, ShareAlike
Description	Tokens of sentences from online reviews are grouped together to indicate multiword expressions (MWEs). The annotation proceeded sentence by sentence, and is thus comprehensive: it captures many kinds of MWEs, and is not biased by any predetermined lexicon or syntactic pattern. 3500 MWE instances are marked. Many MWEs are "gappy" (discontinuous in the sentence). Each MWE is marked as "strong" (highly idiomatic) or "weak" (collocational). A token cannot belong to multiple MWEs, except when a weak MWE contains a strong MWE as a constituent. Annotations are specified as token offsets into the English Web Treebank (available from LDC).
Download from	http://www.ark.cs.cmu.edu/LexSem
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/521.html
Edition	LREC 2014

Name	Estonian resource grammar for Grammatical Framework
Resource type	Grammar/Language Model
Size	1132 rules
Languages	Estonian (est)
Production status	Newly created-in progress
Resource usage	Language Modelling
License	LGPL
Conditions of use	<Not Specified>
Description	A GF resource grammar for Estonian, implementing the language-neutral API of the GF Resource Grammar Library as well as a morphological synthesizer.
Download from	https://github.com/GF-Estonian/GF-Estonian
Referring paper	Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
Edition	SaLTMiL 2014

Name	Freepal
Resource type	Corpus
Size	23 GByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	CreativeCommons
Conditions of use	<Not Specified>
Description	Freepal is a resource designed to assist with the creation of relation extractors for more than 5,000 relations defined in the Freebase knowledge base. The resource consists of over 10 million distinct lexico-syntactic patterns extracted from dependency trees, each of which is assigned to one or more Freebase relations with different confidence strengths.
Download from	http://free-pal.appspot.com
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/764.html
Edition	LREC 2014

Name	French Framenet
Resource type	Lexicon
Size	<Not Specified>
Languages	French (fra)
Production status	Newly created-in progress
Resource usage	FrameNet-based Shallow Semantic Parsing
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Ongoing effort to create a French version of the FrameNet resource. The current version contains a set of approx. 100 frames, slightly modified with respect to the English frames, and a lexicon of French lexemes associated with frames.
Download from	https://sites.google.com/site/anrasfalda/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/496.html
Edition	LREC 2014

Name	GenitivDB
Resource type	Grammar/Language Model
Size	>9 million words
Languages	German
Production status	Newly created-finished
Resource usage	Language Modelling
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Corpus-Generated Dataset for German Genitive Classification
Download from	http://hypermedia.ids-mannheim.de/call/public/korpus.genitivdb
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/346.html
Edition	LREC 2014

Name	Georgian-Russian-Ukrainian-German Parallel Treebank
Resource type	Corpus
Size	<Not Specified>
Languages	Georgian (kat) Russian (rus) Ukrainian (ukr) German (deu)
Production status	<Not Specified>
Resource usage	Language Modelling
License	Creative Commons Attribution 3.0 Unported (CC BY 3.0) http://creativecommons.org/licenses/by/3.0/
Conditions of use	Attribution
Description	This dataset is made of two types of resources: four monolingual Treebanks (German, Georgian, Russian and Ukrainian), and four parallel Treebanks (German-Georgian, German-Russian, German-Ukrainian, Georgian-Ukrainian). The parallel texts used for the outlined experiment comprise German sentences and their translations into Georgian and Russian, compiled for the GREG NLP lexicon project. The GREG lexicon itself contains manually aligned German, Russian, English and Georgian valency data supplied with syntactic subcategorization frames and enriched with semantic role labels. The multilingual verb lexicon is expanded with example sentences in the 4 languages involved. They illustrate the meaning of the lexical entries and are considered mutual translation equivalents. The size of the bilingual sublexicons varies between 1200 and 1300 entries depending on the language pair, and the number of example sentences appended to the lexicons differs. For example, the German-Georgian subcorpus used for this study has a size of roughly 2600 sentence pairs that correspond to different syntactic subcategorization frames. For the German-Russian language pair, a more fine-grained subcorpus with about 4000 sentences as translation equivalents was extracted. The German-Ukrainian subcorpus, created to support the GRUG initiative, is relatively small.
Download from	http://fedora.clarin-d.uni-saarland.de/grug/downloads.html
Referring paper	Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
Edition	CCURL 2014

Name	Google Books Distributional Thesaurus
Resource type	Lexicon
Size	100 GByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Word Sense Disambiguation
License	CC
Conditions of use	<Not Specified>
Description	Distributional Thesaurus (DT) for various time slices for the whole Google Books syntactic n-grams. A DT contains, for the whole vocabulary, a ranked list of most similar words. The indicated URL contains several such DTs for various corpora. Additionally, we provide sense clusters for each of the DTs.
Download from	https://sourceforge.net/p/jobimtext/wiki/LREC2014_Google_DT/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/274.html
Edition	LREC 2014

Name	HNZ segmented and POS tagged corpus
Resource type	Corpus
Size	1 MByte
Languages	Archaic Chinese
Production status	Newly created-in progress
Resource usage	Knowledge Discovery/Representation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	The HNZ corpus is an Archaic Chinese corpus consisting of all the articles in the book of Huainanzi with word segmentation and POS tagging annotation. Huainanzi, also known as Huainan Honglie, is a collective work written by Prince Huainan, Liu An (179 BC-122 BC), and a group of his retainers in the Western Han Dynasty (206 BC-9 AD). Huainanzi was first circulated in the Western Han Dynasty, which is near the end of the Archaic Chinese era. The book has 21 chapters, covering a wide range of topics on philosophy, astrology, geography, politics, customs, military affairs, mountains, sociology, etc. It has been described as the "Encyclopedia of the early Han Dynasty". Its abundant language capacity reveals characteristics of lexical usage in the Western Han Dynasty, and demonstrates how the usage had been transformed from the Qin Dynasty to the Han Dynasty. In this regard, Huainanzi contains valuable data for an in-depth analysis of Archaic Chinese. Because of these properties, we selected the book as the raw data for our Archaic Chinese corpus. All the manual annotation and correction was done by a Chinese linguist who is an expert on Archaic Chinese.
Download from	http://faculty.washington.edu/fxia/hnz/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/138.html
Edition	LREC 2014

Name	Harvard Uncertainty Speech Corpus
Resource type	Corpus
Size	150 minutes
Languages	English
Production status	Newly created-finished
Resource usage	Emotion Recognition/Generation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	The Harvard Uncertainty Speech Corpus contains speech recordings, level of certainty annotations, and acoustic feature vector data. The speech elicitation materials include items from three domains: vocabulary, public transportation, and handwritten digits. In total, the Uncertainty Corpus has 1700 utterances and 148.79 minutes of speech.
Download from	http://dvn.iq.harvard.edu/dvn/dv/ponbarry
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1167.html
Edition	LREC 2014

Name	HiEve
Resource type	Corpus
Size	8034 KByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License
Conditions of use	Attribution; Non-Commercial; ShareAlike (CC-BY-NC-SA 3.0)
Description	A corpus of manually annotated event hierarchies in news stories.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_1023_res_1.gz [7.52 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1023.html
Edition	LREC 2014

Name	Hindi-English Code-Switch Corpus
Resource type	Corpus
Size	43 MByte
Languages	English (eng) Hindi (hin)
Production status	Newly created-finished
Resource usage	Speech Recognition/Understanding
License	<Not Specified>
Conditions of use	<Not Specified>
Description	This is a small Hindi-English speech corpus. The corpus consists of student interview speech. A total of 9 students answering 12 questions were recorded for this corpus. For details regarding the corpus please refer to the paper "A Hindi-English Code-Switching Corpus" by Anik Dey and Pascale Fung to be presented at LREC 2014. You can also send an email to adey@connect.ust.hk for instructions on how to use this corpus.
Download from	https://drive.google.com/?authuser=0#folders/0B1hloe5qNnNTR1o0QkdvQUFYQm8
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/922.html
Edition	LREC 2014

Name	IULA Spanish LSP Treebank
Resource type	Corpus
Size	3.247 MByte
Languages	Spanish (spa)
Production status	Newly created-finished
Resource usage	<Not Specified>
License	Creative Commons Attribution 3.0 Unported License
Conditions of use	Attribution
Description	This package contains a partition of the Iula Spanish LSP Treebank into train and test sets to perform Machine Learning experiments. In that way the same partitions can be used by different researchers and their results can be directly compared. In this package we also deliver the Tibidabo Treebank (Marimon 2010), which contains a set of sentences extracted from the Ancora corpus and annotated in the same way as the Iula Treebank. The Tibidabo Treebank is a very good test set for models trained with the Iula Spanish LSP Treebank, since its sentences come from a very different domain than those of the Iula Spanish LSP Treebank.
Download from	http://repositori.upf.edu/handle/10230/20408
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/382.html
Edition	LREC 2014

Name	Katakana-English Scientific Terms Lexicon
Resource type	Lexicon
Size	<Not Specified>
Languages	Japanese (jpn) English (eng)
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	Free
Conditions of use	Attribution
Description	A lexicon of 170K Japanese-English scientific terms automatically extracted using a transliteration filtering algorithm.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_102_res_1.gz [1.61 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/102.html
Edition	LREC 2014

Name	Khresmoi Query Translation Test Data for the Medical Domain version 1.0
Resource type	Corpus
Size	1508 sentences
Languages	English (eng) Czech (ces) German (deu) French (fra)
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	CreativeCommons BY-NC 3.0
Conditions of use	Attribution; Non-commercial
Description	This package contains data sets for development and testing of machine translation of medical search short queries between Czech, English, French, and German. The queries come from the general public and from medical experts.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_99_res_1.gz [54.46 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/99.html
Edition	LREC 2014

Name	LAST MINUTE
Resource type	Corpus
Size	56 hours
Languages	German (deu)
Production status	Existing-used
Resource usage	Dialogue
License	Own
Conditions of use	<Not Specified>
Description	<Not Specified>
Download from	https://companion.et.uni-magdeburg.de/dokuwiki/downloads:last-minute:start
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/321.html
Edition	LREC 2014

Name	LQVSumm
Resource type	Corpus
Size	1.1 MByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Summarisation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Stand-off annotations of linguistic quality violations found in automatically-produced summaries. The summaries are from the TAC 2011 Guided Summarization task (initial summaries) and from the G-Flow summarization system.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_578_res_1.gz [705.97 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/578.html
Edition	LREC 2014

Name	Lowlands Twitter data
Resource type	Corpus
Size	3064 tokens
Languages	English
Production status	Newly created-finished
Resource usage	Part of speech tagging
License	<Not Specified>
Conditions of use	<Not Specified>
Description	200 tweets collected over the span of one day, POS-annotated by three annotators.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_476_res_1.gz [9.20 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/476.html
Edition	LREC 2014

Name	LuxId
Resource type	Corpus
Size	<Not Specified>
Languages	Luxembourgish
Production status	Newly created-finished
Resource usage	Language Identification
License	CC BY-SA 3.0
Conditions of use	Attribution, ShareAlike
Description	Corpus of mixed-language (French, German, Luxembourgish) sentences from Chamber (House of Parliament) debate reports, manually annotated at segment level with 6 labels: Lux, Fre, Ger, Lux + Fre, Lux + Ger, Lux + Fre + Ger
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_732_res_1.gz [22.33 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/732.html
Edition	LREC 2014

Name	Mannheim Corpus of Historical Newspapers and Magazines
Resource type	Corpus
Size	4.1 Mio tokens
Languages	German (deu)
Production status	Newly created-in progress
Resource usage	Text Mining
License	CreativeCommons, NonCommercial, Attribution
Conditions of use	Attribution, Non-Commercial
Description	The Mannheim Corpus of Historical Newspapers and Magazines consists of 21 German newspapers and magazines from the 18th and 19th centuries. It comprises about 652 individual volumes with over 4.1 million word tokens on 4678 pages overall. The corpus was assembled and digitized from 2009 to 2011 and converted to TEI P5 in 2013.
Download from	http://hdl.handle.net/10932/00-01B8-AE41-41A4-DC01-5
Referring paper	Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
Edition	LRT4HDA 2014

Name	Metaphors for Economic Inequality in English, Farsi, Spanish, and Russian
Resource type	Corpus
Size	3.5 MByte
Languages	English Russian Farsi Spanish
Production status	Newly created-finished
Resource usage	Discourse
License	Creative Commons
Conditions of use	<Not Specified>
Description	Excel spreadsheets containing the results of SketchEngine WordSketch searches for metaphors in the target domain of economic inequality. The languages are English, Russian, Farsi, and Spanish. The data is taken from the SketchEngine TenTen corpora, as described in LREC 2014 papers from MacWhinney, B. & Fromm, D., as well as Levin et al. There are four sheets of the general conceptual metaphors: ccm_rus, ccm_eng, cc_far, and cc_spa. There are also four much larger sheets containing the actual sentences with the metaphors, along with pointers to URLs where these occurred.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_419_res_1.gz
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/419.html
Edition	LREC 2014

Name	MotaMot French-Khmer Pivot Database
Resource type	Lexicon
Size	<Not Specified>
Languages	French (fra) Central Khmer (khm)
Production status	Newly created-in progress
Resource usage	Web Services
License	CC Attribution 3.0 Unported (CC BY 3.0)
Conditions of use	Attribution
Description	French-Khmer pivot lexical database
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_128_res_1.gz [2.96 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/128.html
Edition	LREC 2014

Name	Multilingual corpora with coreferential annotation of person entities
Resource type	Corpus
Size	1 MByte
Languages	Portuguese (por) Galician (glg) Spanish (spa)
Production status	Newly created-in progress
Resource usage	Person Identification
License	GPL
Conditions of use	<Not Specified>
Description	In-progress corpora with coreferential annotation of person entities. Sources: journals and Wikipedia. Languages: Portuguese (varieties from Portugal, Brazil, Angola, Mozambique, and Wikipedia); Spanish (varieties from Spain and Argentina, and Wikipedia); Galician (from Galician journals and Wikipedia). Format: SemEval-10 (Recasens, Marta, Lluís Màrquez, Emili Sapena, M Antònia Martí, Mariona Taulé, Véronique Hoste, Massimo Poesio and Yannick Versley, 2010. SemEval-2010 Task 1: Coreference resolution in multiple languages. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval ’10): 1–8. ACL).
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_918_res_1.gz [1.05 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/918.html
Edition	LREC 2014

Name	N3-Collection
Resource type	Corpus
Size	728 annotated documents
Languages	English German
Production status	Newly created-finished
Resource usage	Semantic Web
License	Creative Commons BY-NC-SA
Conditions of use	Attribution, Non-Commercial, ShareAlike
Description	We publish three novel datasets called N3. N3 will be published using NIF ensuring a greater interoperability to overcome the need for corpus-specific parsers. The data can be downloaded from our project homepage.
Download from	http://aksw.org/Projects/N3nernednif
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/856.html
Edition	LREC 2014

Name	NST acoustic and language models
Resource type	Acoustic and Language Models for Speech Recognition
Size	2.1 GByte
Languages	Swedish (swe)
Production status	Newly created-finished
Resource usage	Speech Recognition/Understanding
License	CreativeCommons
Conditions of use	Attribution
Description	The package contains resources for large vocabulary continuous speech recognition (LVCSR) in Swedish. We trained acoustic models on the public domain NST Swedish corpus and made them freely available to the community. We also provide scripts to generate language models containing a chosen subset of words from the NST n-grams. Note that the models may be updated before the date of the conference if new results are available.
Download from	http://www.speech.kth.se/asr/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/312.html
Edition	LREC 2014

Name	NoSta-D: German NER Dataset Train/Dev
Resource type	Corpus
Size	26200 sentences
Languages	German (deu)
Production status	Newly created-finished
Resource usage	Named Entity Recognition
License	CC-BY
Conditions of use	Attribution
Description	Freely available large dataset, manually annotated for German NER. Includes nested span annotations. Source text from German Wikipedia and news. This data set does not contain the test data, which is used for the GermEval 2014 NER task at KONVENS. Test data will be available from September 2014.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_276_res_1.gz [2.78 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/276.html
Edition	LREC 2014

Name	NomLex-BR
Resource type	Lexicon
Size	2323 entries
Languages	Brazilian Portuguese (por) Portuguese (por)
Production status	Newly created-in progress
Resource usage	Information Extraction, Information Retrieval
License	CC BY SA
Conditions of use	Attribution, ShareAlike
Description	A computational lexicon for Portuguese that provides mappings between verbs and their nominalizations.
Download from	https://github.com/arademaker/nomlex-br
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1031.html
Edition	LREC 2014

Name	ODIN database
Resource type	Corpus
Size	<Not Specified>
Languages	examples for thousands of languages
Production status	Newly created-in progress
Resource usage	Data can help linguistic studies and bootstrap NLP tools for resource-poor languages
License	<Not Specified>
Conditions of use	<Not Specified>
Description	<Not Specified>
Download from	http://faculty.washington.edu/fxia/odin/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1072.html
Edition	LREC 2014

Name	OSS Online Communication Messages
Resource type	Corpus
Size	<Not Specified>
Languages	American English (eng)
Production status	Newly created-finished
Resource usage	Document Classification, Text categorisation
License	Apache License, Version 2.0
Conditions of use	Retain, in the Source form of any Derivative Works that You distribute, all licensing information from the Source form of the Work
Description	The corpus contains 1,030 online communication messages, randomly selected from Network News Transfer Protocol (NNTP) newsgroups, the bug tracking system Bugzilla and the bug tracking system GitHub. NNTP articles, Bugzilla and GitHub comments were selected randomly so that the sample exhibits similar characteristics to the population as a whole. Each message was annotated manually as a request or a non-request. The corpus was created as part of the work presented in the current paper and it is described in section 3. We intend to make the corpus available freely.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_498_res_1.gz [493.76 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/498.html
Edition	LREC 2014

Name	Onto.PT
Resource type	Ontology
Size	~117k synsets
Languages	Portuguese (por)
Production status	Existing-used
Resource usage	Word Sense Disambiguation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Large wordnet for Portuguese, created automatically after integrating the relation instances extracted from three Portuguese dictionaries in the synsets of TeP and OpenWordNet.PT. Its current version, 0.6, contains ~168k lexical items and ~238k word senses, organised in ~117k synsets, connected by ~341k relation instances, that cover the same types as PAPEL. About 40% of the synsets contain glosses, assigned automatically.
Download from	http://ontopt.dei.uc.pt/index.php?sec=downloads
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/46.html
Edition	LREC 2014

Name	OpenWordNet.PT
Resource type	Ontology
Size	~39k synsets
Languages	Portuguese (por)
Production status	Existing-used
Resource usage	Word Sense Disambiguation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Portuguese wordnet that results from the manual translation of a set of base synsets from Princeton WordNet 3.0. Semantic relations were inherited from the latter, given the synset matches. Currently, it contains ~48k lexical items and ~54k word senses, organised in ~39k synsets, connected by ~84k relation instances, that cover the same types as WordNet 3.0.
Download from	https://github.com/arademaker/openWordnet-PT
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/46.html
Edition	LREC 2014

Name	PACE Corpus
Resource type	Corpus
Size	246688 tokens
Languages	English (eng) German (deu)
Production status	Newly created-finished
Resource usage	Sentiment Analysis
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Publicly available multilingual evaluation corpus for phrase-level Sentiment Analysis that can be used to evaluate real world applications in an industrial context. Data from English and German Internet forums (1000 posts each) focusing on the automotive domain.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_258_res_1.gz [874.19 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/258.html
Edition	LREC 2014

Name	PAPEL
Resource type	Lexicon
Size	~102k lexemes
Languages	Portuguese (por)
Production status	Existing-used
Resource usage	Word Sense Disambiguation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Lexical-semantic network extracted automatically from a proprietary Portuguese dictionary. PAPEL 3.5 contains ~102k lexical items, connected by ~191k semantic relation instances covering a rich set of types, including synonymy, hypernymy, several types of meronymy, causation and purpose.
Download from	http://www.linguateca.pt/PAPEL/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/46.html
Edition	LREC 2014

Name	PAROLE-SIMPLE-CLIPS
Resource type	Lexicon
Size	<Not Specified>
Languages	Italian (ita)
Production status	Existing-used
Resource usage	Lexical classification
License	ELRA
Conditions of use	Attribution
Description	PAROLE-SIMPLE-CLIPS is a four-level, general purpose lexicon that has been elaborated over three different projects. The kernel of the morphological and syntactic lexicons was built in the framework of the LE-PAROLE project. The linguistic model and the core of the semantic lexicon were elaborated in the LE-SIMPLE project, while the phonological level of description and the extension of the lexical coverage were performed in the context of the Italian project Corpora e Lessici dell'Italiano Parlato e Scritto (CLIPS). The full lexicon is available in the ELRA catalogue (see http://catalog.elra.info/product_info.php?products_id=881). Part of the resource is also available as Linked Data at http://datahub.io/dataset/simple
Download from	http://datahub.io/dataset/simple
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/556.html
Edition	LREC 2014

Name	PDT-VALLEX 2.0
Resource type	Lexicon
Size	11656 entries
Languages	Czech (ces)
Production status	Existing-used
Resource usage	linking lexicons
License	Creative Commons 3.0 - BY - NC - SA
Conditions of use	Attribution, NonCommercial, ShareAlike
Description	The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT). It contains over 11000 valency frames for more than 7000 verbs which occurred in the PDT or PCEDT. It is available in an electronically processable format (XML) together with the aforementioned treebanks (to be viewed and edited with TrEd, the main PDT/PCEDT annotation tool), and also in a more human-readable form. The main feature of the lexicon is its linking to the annotated corpora: each occurrence of each verb is linked to the appropriate valency frame, with additional (generalized) information about its usage and surface morphosyntactic form alternatives.
Download from	http://ufal.mff.cuni.cz/pcedt2.0/en/documentation.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/704.html
Edition	LREC 2014

Name	PanLex
Resource type	Lexicon
Size	20,000,000 lexemes
Languages	<Not Specified>
Production status	Newly created-in progress
Resource usage	Machine Translation, SpeechToSpeech Translation
License	CC0
Conditions of use	<Not Specified>
Description	A panlingual lexical translation database currently documenting 1.1 billion pairwise translations among 20 million lexemes in 9,300 language varieties.
Download from	http://dev.panlex.org/db/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1029.html
Edition	LREC 2014

Name	ParCor 1.0
Resource type	Corpus
Size	332054 tokens
Languages	English German
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	None
Conditions of use	<Not Specified>
Description	ParCor is a parallel corpus of texts in which pronoun coreference – reduced coreference in which pronouns are used as referring expressions – has been annotated. The corpus is intended to be used both as a resource from which to learn systematic differences in pronoun use between languages and ultimately for developing and testing informed Statistical Machine Translation systems aimed at addressing the problem of pronoun coreference in translation. At present, the corpus consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech), and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent. The texts in the corpus have already been translated into many languages, and we plan to expand the corpus into these other languages, as well as other genres, in the future.
Download from	http://opus.lingfil.uu.se/ParCor
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/298.html
Edition	LREC 2014

Name	ParTUT
Resource type	Corpus
Size	3194 sentences
Languages	Italian (ita) English (eng) French (fra)
Production status	Newly created-in progress
Resource usage	Machine Translation, SpeechToSpeech Translation
License	Creative Commons
Conditions of use	<Not Specified>
Description	ParTUT is a project for the development of a multilingual parallel treebank for Italian, English and French. The aim of this work is twofold: building an aligned parallel treebank for Italian, English and French by extending and applying a single treebank schema to other languages, and studying how the schema can be used to address issues typically related to parallel corpora. The annotation and tools used for the development of this resource are those of the Turin University Treebank (TUT), a collection of Italian sentences annotated at a morpho-syntactic, syntactic and (to a lesser extent) semantic level, with a dependency-oriented representation format.
Download from	http://www.di.unito.it/~tutreeb/partut.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/674.html
Edition	LREC 2014

Name	Parallel sentences for error detection and correction from Wikipedia 2012 and 2013
Resource type	Corpus
Size	4604 pairs of sentences
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Error Detection and Correction
License	<Not Specified>
Conditions of use	Attribution, ShareAlike
Description	This resource consists of two files containing 4604 parallel sentences extracted automatically from Wikipedia 2012 and Wikipedia 2013. Sentence splitting was performed with jmx mxterminator and alignment was done with Microsoft Aligner. The resource was used for the experiments in the current submission, on the assumption that Wikipedia 2013 provides error corrections for the sentences from Wikipedia 2012.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_453_res_1.gz [541.06 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/453.html
Edition	LREC 2014

Name	Paraphrase Fragment Corpus
Resource type	Corpus
Size	113314 entries
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Textual Entailment
License	Various
Conditions of use	<Not Specified>
Description	The big corpus is automatically constructed by the method described in the abstract; and there is also a small corpus of gold-standard annotations.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_1195_res_1.gz [6.97 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1195.html
Edition	LREC 2014

Name	Plane Crash Dataset
Resource type	Evaluation Data
Size	193 entries
Languages	English
Production status	Newly created-finished
Resource usage	Information Extraction, Information Retrieval
License	<Not Specified>
Conditions of use	<Not Specified>
Description	A knowledge base of 193 plane crash events based on Wikipedia infoboxes, and stand-off annotation for automatically generated slot-type labels for 4,093 newswire documents from Tipster-1, Tipster-2, Tipster-3 and Gigaword-5.
Download from	http://nlp.stanford.edu/projects/dist-sup-event-extraction.shtml
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1127.html
Edition	LREC 2014

Name	Predicate Matrix
Resource type	Lexicon
Size	<Not Specified>
Languages	English
Production status	Newly created-in progress
Resource usage	Semantic Role Labeling
License	CreativeCommons
Conditions of use	Attribution
Description	The Predicate Matrix is a new lexical resource resulting from the integration of multiple sources of predicate information, including FrameNet, VerbNet, PropBank and WordNet. With the Predicate Matrix, we expect to provide a more robust and interoperable predicate lexicon. Moreover, we plan to extend the coverage of current predicate resources, to enrich WordNet with predicate information, to discover and resolve inherent inconsistencies among the resources, and possibly to extend predicate information to languages other than English (by exploiting the local wordnets aligned to the English WordNet).
Download from	http://adimen.si.ehu.es/web/PredicateMatrix
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/589.html
Edition	LREC 2014

Name	Priberam Compressive Summarization Corpus
Resource type	Corpus
Size	800 documents
Languages	Portuguese (por)
Production status	Newly created-finished
Resource usage	Summarisation
License	Creative Commons 3.0 (NonCommercial, ShareAlike)
Conditions of use	Non-Commercial; ShareAlike (CC-NC-SA 3.0)
Description	This is a corpus for multi-document summarization for European Portuguese. It contains 80 topics, each of which has 10 documents, for a total of 800 documents. Each topic contains two human summaries. The summaries are compressive: they are the result of a compression of the sentences in the original documents.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_187_res_1.gz [977.04 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/187.html
Edition	LREC 2014

Name	Qatari Arabic Corpus
Resource type	Corpus
Size	<Not Specified>
Languages	Qatari Arabic
Production status	Newly created-in progress
Resource usage	Speech Recognition/Understanding
License	Not yet released
Conditions of use	<Not Specified>
Description	The Qatari Arabic (QA) corpus was collected from different TV series and talk show programs. Data are selected from programs in which the majority of speech is in QA; segments from each program are selected after audition confirms the quality of the speech signal. The programs are: Tesaneef (popular Qatari series), Sabah El-Doha (talk show program), and some episodes from Al-Jazeerah are selected if guest speakers are speaking Qatari dialect. The corpus is recorded in linear PCM, 16 kHz, and 16 bits.
Download from	http://sprosig.isle.illinois.edu/corpora/1
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/430.html
Edition	LREC 2014

Name	SETimes.HR
Resource type	Corpus
Size	<Not Specified>
Languages	Croatian (hrv)
Production status	Newly created-in progress
Resource usage	<Not Specified>
License	CC BY-SA 3.0
Conditions of use	Attribution, ShareAlike
Description	<Not Specified>
Download from	http://nlp.ffzg.hr/data/corpora/setimes.hr.v1.conllx.gz [1.27 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/690.html
Edition	LREC 2014

Name	Sample annotated text for definiteness annotations
Resource type	Corpus
Size	12 KByte
Languages	English (eng)
Production status	Newly created-in progress
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	No description provided, see the related article
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_1194_res_2.gz [5.02 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1194.html
Edition	LREC 2014

Name	Sample annotated text for definiteness annotations
Resource type	Corpus
Size	32 KByte
Languages	English (eng)
Production status	Newly created-in progress
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	No description provided, see the related article
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_1194_res_3.gz [10.13 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1194.html
Edition	LREC 2014

Name	Sample annotated text for definiteness annotations
Resource type	Corpus
Size	32 KByte
Languages	English (eng)
Production status	Newly created-in progress
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	No description provided, see the related article
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_1194_res_4.gz [10.16 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1194.html
Edition	LREC 2014

Name	Sample annotated text for definiteness annotations
Resource type	Corpus
Size	2 KByte
Languages	English (eng)
Production status	Newly created-in progress
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	No description provided, see the related article
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_1194_res_5.gz [1.16 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1194.html
Edition	LREC 2014

Name	Sample annotated text for definiteness annotations
Resource type	Corpus
Size	7 KByte
Languages	English (eng)
Production status	Newly created-in progress
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	No description provided, see the related article
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_1194_res_6.gz [2.28 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1194.html
Edition	LREC 2014

Name	Sclera2cornetto
Resource type	Lexicon
Size	5710 entries
Languages	Dutch (nld) Sclera
Production status	Newly created-in progress
Resource usage	Machine Translation, SpeechToSpeech Translation
License	Free
Conditions of use	<Not Specified>
Description	Sclera2Cornetto.1.0.tgz. Created by Vincent Vandeghinste and Ineke Schuurman. More background information can be found in the LREC 2014 paper. This archive contains two resource files. 1. Sclera2Cornetto.csv: We have manually linked a subset of 5710 Sclera pictographs to Cornetto synsets. As these pictographs sometimes depict complex concepts, they can be linked to one or more synsets, indicating that their meaning combines the meanings of the synsets. In these cases we have identified one of the synsets as the head synset, indicating that the other linked synsets are in some kind of dependency relation with the head synset. In cases where the pictograph meaning was not reflected by one or more synsets, we have often linked the pictograph to the synset of its hyperonym. Sclera2Cornetto consists of a tab-separated database table with the following columns (N stands for NULL). lemma: name of the pictograph (spaces in the original filenames have been replaced with hyphens). For simple pictographs: synset (synset identifier matching the Sclera pictograph) and relation (whether the synset is a synonym/hyperonym of the pictograph); other columns are set to N. For complex pictographs: head (synset identifier of the head), headrel (relation of the synset to the pictograph: synonym/hyperonym), dependent (comma-separated list of synset identifiers for dependents) and deprel (comma-separated list of relations (synonym/hyperonym) of synsets for each dependent); other columns are set to N. 2. Dutch2Sclera.csv: We also make our Dutch2Sclera dictionary table available, consisting of 372 entries linking Dutch words straight to Sclera pictographs. This table contains token, lemma, tag, and picto columns, allowing underspecification.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_189_res_1.gz [68.40 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/189.html
Edition	LREC 2014
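
The column layout described for Sclera2Cornetto.csv can be read with a short Python sketch such as the one below. It is only an illustration under stated assumptions: the column order (lemma, synset, relation, head, headrel, dependent, deprel) and the absence of a header row are guesses, since the description names the columns but does not give their order, and the two sample rows are invented placeholders rather than data from the archive.

```python
import csv
import io

# ASSUMPTION: column order; the description names these columns but not their order.
COLUMNS = ["lemma", "synset", "relation", "head", "headrel", "dependent", "deprel"]


def parse_row(fields):
    """Map one tab-separated row to a dict, turning the NULL marker 'N' into None."""
    padded = list(fields) + ["N"] * (len(COLUMNS) - len(fields))
    row = {key: (None if value == "N" else value) for key, value in zip(COLUMNS, padded)}
    # dependent and deprel hold comma-separated lists for complex pictographs.
    for key in ("dependent", "deprel"):
        row[key] = row[key].split(",") if row[key] else []
    # A pictograph is treated as complex when a head synset is given.
    row["is_complex"] = row["head"] is not None
    return row


def read_table(handle):
    """Read all non-empty rows from a tab-separated file-like object."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [parse_row(fields) for fields in reader if fields]


# Invented placeholder rows (hypothetical lemmas and synset identifiers).
SAMPLE = (
    "huis\tsyn-1\tsynonym\tN\tN\tN\tN\n"
    "rode-auto\tN\tN\tsyn-2\tsynonym\tsyn-3,syn-4\tsynonym,hyperonym\n"
)

rows = read_table(io.StringIO(SAMPLE))
for row in rows:
    kind = "complex" if row["is_complex"] else "simple"
    print(row["lemma"], kind, row["synset"] or row["head"], row["dependent"])
```

To use it on the real table, pass a file opened in text mode to read_table, after checking the actual column order against the file.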

Name	Senso Comune
Resource type	Lexicon
Size	288013 entries
Languages	Italian (ita)
Production status	Existing-updated
Resource usage	Word Sense Disambiguation
License	Creative Commons Attribution-Share Alike 2.5 Italy License
Conditions of use	Attribution, ShareAlike
Description	Lexical-semantic database for Italian, composed of three modules: a top-level module, which contains basic ontological concepts and relations; a lexical module, which models general linguistic and lexicographic structures; and a frame module, which provides concepts and axioms for modeling the predicative structure of verbs, nouns and adjectives. The word senses of verbs and nouns in the resource have been aligned with the Italian MultiWordNet.
Download from	http://www.sensocomune.it/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/563.html
Edition	LREC 2014

Name	SexIt
Resource type	Terminology
Size	<Not Specified>
Languages	French (fra)
Production status	always in creation
Resource usage	<Not Specified>
License	CC
Conditions of use	<Not Specified>
Description	<Not Specified>
Download from	http://www.jeuxdemots.org/sexit.php?action=list
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/817.html
Edition	LREC 2014

Name	SwissAdmin
Resource type	Corpus
Size	20,000,000 words
Languages	German (deu) French (fra) Italian (ita) English (eng)
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	Open source for annotations; license for source text as stated in the paper
Conditions of use	<Not Specified>
Description	<Not Specified>
Download from	http://www.latl.unige.ch/swissadmin
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/772.html
Edition	LREC 2014

Name	Syntactic Reference Corpus of Medieval French (SRCMF)
Resource type	Corpus
Size	280000 words
Languages	Old French
Production status	Newly created-in progress
Resource usage	Diachronic syntax
License	CreativeCommons for the annotation of all texts and the words of some texts. More restrictive licenses apply for the words of other texts.
Conditions of use	Attribution, Non-Commercial, ShareAlike
Description	The SRCMF contains 15 Old French texts with about 280000 words. It has a high-quality manual annotation, based on a linguistically adequate dependency grammar. Annotation data is provided as RDF/XML. Available export formats are CONLL and TigerXML. The final revision of the texts is ongoing and will be finished by the end of 2013. The project was funded by the Agence Nationale de la Recherche (ANR, France) and the Deutsche Forschungsgemeinschaft (DFG, Germany) from 2009 to 2012.
Download from	http://srcmf.org
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/239.html
Edition	LREC 2014

Name	TED-LIUM
Resource type	Corpus
Size	207 hours
Languages	American English (eng)
Production status	Existing-updated-(release this year)
Resource usage	Speech Recognition/Understanding
License	CreativeCommons
Conditions of use	Attribution, Non-Commercial, Non-Derivative
Description	Corpus of speech with transcriptions (207 hours) based on TED talks
Download from	http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/1104.html
Edition	LREC 2014

Name	TeP
Resource type	Lexicon
Size	19,888 synsets
Languages	Brazilian Portuguese (por)
Production status	Existing-used
Resource usage	Word Sense Disambiguation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Electronic thesaurus for Brazilian Portuguese, created manually. TeP 2.0 contains more than 44,000 lexical items, organised in 19,888 synsets, and also 4,276 antonymy relations between synsets.
Download from	http://www.nilc.icmc.usp.br/tep2/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/46.html
Edition	LREC 2014

Name	Terminesp LD
Resource type	Lexicon
Size	73317 entries
Languages	Spanish (spa) English (eng) German (deu) French (fra) Swedish (swe) Latin, Italian
Production status	Newly created-in progress
Resource usage	Machine Translation, SpeechToSpeech Translation
License	to be specified
Conditions of use	<Not Specified>
Description	Terminesp LD is the Linked Data version of Terminesp, a terminological database in Spanish created by AETER (Asociación Española de Terminología) by extracting terminological data produced by AENOR (Asociación Española de Normalización y Certificación). It contains more than thirty thousand terms with equivalences in other languages whenever they are available. The core data has been modelled using the lemon model and the translations between terms have been modelled using the lemon translation module proposed by the OEG.
Download from	http://linguistic.linkeddata.es/terminesp
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/863.html
Edition	LREC 2014

Name	The N2 Corpus
Resource type	Corpus
Size	42480 words
Languages	English
Production status	Newly created-finished
Resource usage	Narrative understanding and comprehension; dynamics of radicalization
License	Creative Commons CC-BY 4.0
Conditions of use	Attribution
Description	The N2 (Narrative Networks) Corpus is a collection of 100 stories, comprising approximately 42,000 words, most originally in Arabic but all translated into English. The corpus contents are all texts that Islamist extremists have produced, or texts that are often referenced by them. These include: personal narratives gathered from internet forums; press releases describing bombings and attacks by extremist groups in Afghanistan; articles containing stories included in al-Qaeda propaganda materials (the Inspire magazine); and religious stories (Hadith and Sirah) often referenced by extremist groups. Every text in the corpus is a story. Also, every text in the corpus has been annotated for 14 layers of syntax and semantics, including: referring expressions and co-reference; events, time expressions, and temporal relationships; semantic roles; and word senses. In cases where automatic analyzers are not available to do near-perfect annotations, layers were double-annotated and adjudicated by trained annotators.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_48_res_1.gz [17.99 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/48.html
Edition	LREC 2014

Name	The Norwegian Dependency Treebank
Resource type	Corpus
Size	614000 tokens
Languages	Norwegian Bokmål (nob) Norwegian Nynorsk (nno)
Production status	Newly created-finished
Resource usage	PoS-tagging, parsing
License	<Not Specified>
Conditions of use	<Not Specified>
Description	The Norwegian Dependency Treebank encompasses treebanks for both written standards of Norwegian (Bokmål and Nynorsk). It is the result of a two-year project conducted at the National Library of Norway (Språkbanken) and was finished at the start of 2014. At present it contains 311000 tokens of Norwegian Bokmål and 303000 tokens of Norwegian Nynorsk. These have been annotated morphologically and syntactically by trained linguists.
Download from	http://www.nb.no/sbfil/tekst/20140103_NDT_1-0.tar.gz [29.89 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/303.html
Edition	LREC 2014

Name	Translation errors
Resource type	Corpus
Size	300 sentences
Languages	Portuguese (por)
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	CreativeCommons
Conditions of use	<Not Specified>
Description	We have created a corpus consisting of automatic translations produced by two widely used translation engines (Google Translator and Moses) in three different scenarios representing different challenges in the translation from English to European Portuguese. This corpus was annotated with translation errors according to a taxonomy defined by us.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_199_res_1.gz [39.68 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/199.html
Edition	LREC 2014

Name	UM-Corpus
Resource type	Corpus
Size	2 million sentences
Languages	<Not Specified>
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	Creative Commons Non-Commercial 3.0 Licenses
Conditions of use	Non-Commercial
Description	The corpus totals more than 10 million English-Chinese (E-C) parallel sentences; more than 2 million training sentences and 5,000 testing sentences are released to the public for free. Unlike previous work, the corpus is designed to embrace eight different domains (News, Novels, Laws, Thesis, Educational Materials, Science, Speech/Subtitles, and Microblog). Some of them are further categorized into different topics. The corpus has been released to the research community under the Creative Commons Non-Commercial 3.0 Licenses.
Download from	http://nlp2ct.cis.umac.mo/um-corpus/index.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/774.html
Edition	LREC 2014

Name	USAGE Corpus
Resource type	Corpus
Size	1200 reviews
Languages	English (eng) German (deu)
Production status	Newly created-finished
Resource usage	Sentiment Analysis
License	Open Data Commons Attribution License (ODC-By) v1.0
Conditions of use	<Not Specified>
Description	Annotations of German and English Amazon reviews: aspects, evaluating subjective phrases, their polarity and their relations. To retrieve the Amazon texts themselves, a crawler is made available. The reviews themselves are not part of this data publication.
Download from	http://dx.doi.org/10.4119/unibi/citec.2014.14
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/85.html
Edition	LREC 2014

Name	UW Bio-NLP X-ray Event Corpus
Resource type	Corpus
Size	<Not Specified>
Languages	American English (eng)
Production status	Newly created-in progress
Resource usage	<Not Specified>
License	<Not Specified>
Conditions of use	<Not Specified>
Description	A set of x-ray text snippets annotated with change-of-state events
Download from	http://depts.washington.edu/bionlp/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/386.html
Edition	LREC 2014

Name	Uppsala Persian Dependency Treebank
Resource type	Treebank
Size	151,671 tokens
Languages	Iranian Persian (pes)
Production status	Existing-updated
Resource usage	Syntactic parsing
License	Creative Commons
Conditions of use	Attribution
Description	The Uppsala Persian Dependency Treebank (UPDT) is a dependency-based syntactically annotated corpus for Persian. The treebank consists of 6000 sentences (151,671 tokens) of written text in CoNLL-format and has been developed through a bootstrapping procedure. All dependency relations used in the annotation, as well as the guidelines for sentence segmentation, tokenization, and morphological annotation, are described in detail in the Uppsala Persian Dependency Treebank Annotation Guidelines. The annotation guidelines are written in English.
Download from	http://stp.lingfil.uu.se/~mojgan/UPDT.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/378.html
Edition	LREC 2014
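
Since the UPDT description mentions CoNLL-format, a minimal Python reader is sketched below. It assumes the ten-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with blank lines between sentences; the listing does not say which CoNLL variant the treebank uses, so check the files first. The two-token English sample and its tags are invented for illustration and are not taken from the treebank.

```python
import io

# The ten CoNLL-X columns, in order (ASSUMED to match the treebank files).
FIELDS = ["id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel"]


def read_conllx(handle):
    """Yield sentences as lists of token dicts from a CoNLL-X style stream."""
    sentence = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        token = dict(zip(FIELDS, line.split("\t")))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])  # 0 marks the root in CoNLL-X
        sentence.append(token)
    if sentence:
        yield sentence


# Invented sample, not data from the treebank.
SAMPLE = (
    "1\tShe\tshe\tPRON\tPRON\t_\t2\tnsubj\t_\t_\n"
    "2\tsleeps\tsleep\tV\tV\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = list(read_conllx(io.StringIO(SAMPLE)))
roots = [tok["form"] for tok in sentences[0] if tok["head"] == 0]
print(len(sentences), roots)
```

For a real file, open it with encoding="utf-8" and pass the handle to read_conllx; counting the yielded sentences and tokens gives a quick check against the figures in the description.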

Name	Uyghur-Kazakh One-to-One Mapping Bilingual Dictionary
Resource type	Lexicon
Size	50,000 entries
Languages	Uighur (uig) Kazakh (kaz)
Production status	Newly created-in progress
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	This resource is a one-to-one mapping bilingual dictionary for the Uyghur (Uighur) and Kazakh language pair. It is the experimental result of automatic induction from Chinese-Uyghur and Chinese-Kazakh bilingual dictionaries, using the constraint approach proposed in the corresponding submission. Resource details: File type: Microsoft Excel (.xlsx). File size: 1.78MB. Scale: 50,000 translation pairs (one-to-one mapping). Accuracy: about 83% (human confirmed). Script: Arabic-based Uyghur and Kazakh scripts. Encoding: Unicode. Provider: Ishida&Matsubara Laboratory, Department of Social Informatics, Kyoto University.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_417_res_1.gz [1.49 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/417.html
Edition	LREC 2014

Name	VOCE Corpus
Resource type	Corpus
Size	638 minutes
Languages	Portuguese (por)
Production status	Newly created-in progress
Resource usage	Emotion Recognition/Generation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	The VOCE corpus consists of a collection of 38 raw recordings, adding up to 78 min of Baseline, 73.6 min of Experiment and 487 min of Event free speech, with accompanying metadata (demographic and health questionnaires). The recordings are annotated by individual appraisal of stress based on self-reports and physiological measures, whereby the former validate that participants are experiencing stress and the latter provide fine-grained annotation of the speech. Speakers are 38 students from the University of Porto, aged 19 to 49.
Download from	http://paginas.fe.up.pt/~voce/articles.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/647.html
Edition	LREC 2014

Name	Valency Lexicon of Czech Verbs (VALLEX 2.6)
Resource type	Lexicon
Size	2730 lexemes
Languages	Czech (ces)
Production status	Existing-used
Resource usage	linking lexicons
License	Creative Commons 3.0-BY-NC-SA
Conditions of use	Attribution, NonCommercial, ShareAlike
Description	The Valency Lexicon of Czech Verbs, Version 2 (VALLEX 2.x), is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. VALLEX 2.x has been developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague. VALLEX 2.x is a successor of VALLEX 1.0, extended in both theoretical and quantitative aspects. VALLEX 2.x provides information on the valency structure (combinatorial potential) of verbs in their particular senses. VALLEX is closely related to the Prague Dependency Treebank project: both of them use Functional Generative Description (FGD), developed by Petr Sgall and his collaborators since the 1960s, as the background theory. In VALLEX 2.x, there are roughly 2,730 lexeme entries containing together around 6,460 lexical units ("senses"). Note that VALLEX 2.x - according to FGD, but unlike traditional dictionaries and also unlike VALLEX 1.0 - treats a pair of perfective and imperfective aspectual counterparts as a single lexeme (if perfective and imperfective verbs were counted separately, the size of VALLEX 2.x would virtually grow to 4,250 verb entries). To ensure high quality of the data, all VALLEX entries have been created manually, using several previously existing lexicons as well as corpus evidence from the Czech National Corpus.
Download from	http://ufal.mff.cuni.cz/vallex/2.6/doc/data.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/704.html
Edition	LREC 2014

Name	Valency Lexicon of Czech Verbs (VALLEX 2.6)
Resource type	Lexicon
Size	2730 lexemes
Languages	Czech (ces)
Production status	Existing-updated
Resource usage	Lexicon extension
License	Creative Commons 3.0-BY-NC-SA
Conditions of use	Attribution, Non-Commercial, ShareAlike
Description	The Valency Lexicon of Czech Verbs, Version 2 (VALLEX 2.x), is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. VALLEX 2.x has been developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague. VALLEX 2.x is a successor of VALLEX 1.0, extended in both theoretical and quantitative aspects. VALLEX 2.x provides information on the valency structure (combinatorial potential) of verbs in their particular senses. VALLEX is closely related to the Prague Dependency Treebank project: both of them use Functional Generative Description (FGD), developed by Petr Sgall and his collaborators since the 1960s, as the background theory. In VALLEX 2.x, there are roughly 2,730 lexeme entries containing together around 6,460 lexical units ("senses"). Note that VALLEX 2.x - according to FGD, but unlike traditional dictionaries and also unlike VALLEX 1.0 - treats a pair of perfective and imperfective aspectual counterparts as a single lexeme (if perfective and imperfective verbs were counted separately, the size of VALLEX 2.x would virtually grow to 4,250 verb entries). To ensure high quality of the data, all VALLEX entries have been created manually, using several previously existing lexicons as well as corpus evidence from the Czech National Corpus.
Download from	http://ufal.mff.cuni.cz/vallex/2.6/doc/data.html
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/773.html
Edition	LREC 2014

Name	Vystadial 2013 – Czech data
Resource type	Corpus
Size	1.5 GByte
Languages	Czech (ces)
Production status	Newly created-finished
Resource usage	Speech Recognition/Understanding
License	Creative Commons (CC-BY-SA 3.0)
Conditions of use	Attribution, ShareAlike
Description	Dataset of telephone conversations (audio and transcriptions) in Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems.
Download from	http://hdl.handle.net/11858/00-097C-0000-0023-4670-6
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/535.html
Edition	LREC 2014

Name	Vystadial 2013 – English data
Resource type	Corpus
Size	2.6 GByte
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Speech Recognition/Understanding
License	Creative Commons (CC-BY-SA 3.0)
Conditions of use	Attribution, ShareAlike
Description	Dataset of telephone conversations (audio and transcriptions) in English, developed for training acoustic models for automatic speech recognition in spoken dialogue systems.
Download from	http://hdl.handle.net/11858/00-097C-0000-0023-4671-4
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/535.html
Edition	LREC 2014

Name	WMT12
Resource type	Corpus
Size	<Not Specified>
Languages	English (eng) German (deu) French (fra)
Production status	Existing-used
Resource usage	Machine Translation, SpeechToSpeech Translation
License	<Not Specified>
Conditions of use	<Not Specified>
Description	Parallel Corpora, NewsCommentary
Download from	http://www.statmt.org/wmt12/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/735.html
Edition	LREC 2014

Name	WMT12 Data
Resource type	Corpus
Size	35 MByte
Languages	English (eng) Czech (ces) Spanish (spa) French (fra) German (deu)
Production status	Existing-used
Resource usage	Evaluation
License	Unspecified
Conditions of use	<Not Specified>
Description	<Not Specified>
Download from	http://www.statmt.org/wmt12/wmt12-data.tar.gz [34.89 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/935.html
Edition	LREC 2014

Name	WMT13 Data
Resource type	Corpus
Size	56 MByte
Languages	English (eng) Spanish (spa) Russian (rus) Czech (ces) German (deu) French (fra)
Production status	Existing-used
Resource usage	Machine Translation, SpeechToSpeech Translation
License	Unspecified
Conditions of use	<Not Specified>
Description	<Not Specified>
Download from	http://www.statmt.org/wmt13/wmt13-data.tar.gz [55.13 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/935.html
Edition	LREC 2014

Name	Walenty
Resource type	Lexicon
Size	8587 entries
Languages	Polish (pol)
Production status	Newly created, further developed
Resource usage	Parsing; also human use
License	Creative Commons BY-SA
Conditions of use	Attribution, ShareAlike
Description	See the submitted paper and http://zil.ipipan.waw.pl/Walenty.
Download from	http://zil.ipipan.waw.pl/Walenty
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/279.html
Edition	LREC 2014

Name	WordNet RDF
Resource type	Lexicon
Size	210,772 lexemes
Languages	English (eng)
Production status	Newly created-finished
Resource usage	Semantic Web
License	Princeton WordNet License
Conditions of use	<Not Specified>
Description	Export of WordNet in lemon and RDF
Download from	http://wordnet-rdf.princeton.edu/wn31.nt.gz [74.27 MB]
Referring paper	Please check the relevant workshop and paper at: http://www.lrec-conf.org/proceedings/lrec2014/workshops.html
Edition	LDL 2014

Name	caWaC
Resource type	Corpus
Size	779,086,559 tokens
Languages	Catalan (cat)
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	CC-BY-SA 3.0
Conditions of use	Attribution, ShareAlike
Description	<Not Specified>
Download from	http://nlp.ffzg.hr/resources/corpora/cawac/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/841.html
Edition	LREC 2014

Name	filesRo.zip
Resource type	Corpus
Size	689 KByte
Languages	Romanian (ron) Italian (ita) Spanish (spa) Portuguese (por) French (fra) Turkish (tur)
Production status	Newly created-finished
Resource usage	cognates
License	OpenSource
Conditions of use	<Not Specified>
Description	We provide an archive containing automatically extracted cognate pairs and word-etymon pairs for Romanian words and five related languages: French, Italian, Spanish, Portuguese and Turkish. We ran our experiments on the Romanian vocabulary provided by the dexonline machine-readable dictionary (http://dexonline.ro). In the "cognates" folder there is one file for each language L containing pairs of cognates shared between L and Romanian. In the "etymons" folder there is one file for each language L containing word-etymon pairs for Romanian words with an etymon in L.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_175_res_1.gz [686.00 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/175.html
Edition	LREC 2014

Name	par-lvf
Resource type	Lexicon
Size	<Not Specified>
Languages	French (fra)
Production status	Newly created-finished
Resource usage	Parsing
License	LGPL-LR
Conditions of use	<Not Specified>
Description	This is an extended version of the exLVF lexicon where examples are unfolded, corrected and parsed with a second parser.
Download from	http://wikilligramme.loria.fr/doku.php?id=lvf
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/602.html
Edition	LREC 2014

Name	slTwitterCorpus
Resource type	Corpus
Size	<Not Specified>
Languages	Slovenian (slv)
Production status	Newly created-finished
Resource usage	<Not Specified>
License	CC-BY-SA 3.0
Conditions of use	Attribution, ShareAlike
Description	<Not Specified>
Download from	http://nlp.ffzg.hr/resources/corpora/twitter/
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/834.html
Edition	LREC 2014

Name	tweet-norm_es
Resource type	Corpus
Size	<Not Specified>
Languages	Spanish (spa)
Production status	Newly created-finished
Resource usage	Microtext normalization
License	CC-BY
Conditions of use	Attribution
Description	Corpus of annotated tweets for lexical normalization in Spanish. Two collections have been generated: the development corpus and the test corpus, which consist of 600 tweets each. A total of 775 and 724 OOV words were manually annotated in the development and test corpora, respectively.
Download from	http://komunitatea.elhuyar.org/tweet-norm/files/2013/11/tweet-norm_es.zip [68.20 KB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/442.html
Edition	LREC 2014

Name	wiki_zh_ja_corpus
Resource type	Corpus
Size	126,811 sentences
Languages	Mandarin Chinese (cmn) Japanese (jpn)
Production status	Newly created-finished
Resource usage	Machine Translation, SpeechToSpeech Translation
License	CreativeCommons
Conditions of use	<Not Specified>
Description	This resource is a Chinese-Japanese parallel corpus automatically extracted from Wikipedia.
Download from	http://lrec2014.lrec-conf.org/sharedlrs2014/LREC_21_res_1.gz [10.94 MB]
Referring paper	http://www.lrec-conf.org/proceedings/lrec2014/summaries/21.html
Edition	LREC 2014
