Language resources and tools.

Web services

  • Itzultzailea.eus

    Elhuyar intelligent machine translation.

  • Aditu.eus

    Speech recognition service for Basque and Spanish.

  • Xuxen

    Spelling and grammar checker for Basque.

  • TermKate

    Online platform for the creation of specialized dictionaries.

  • Elhuyar Dictionaries

    Online dictionaries: Basque<>Spanish, Basque<>French, Basque<>English

  • Automatic dictionaries

    Website for consulting bilingual dictionaries built automatically using pivot techniques.

  • Elhuyar web corpusak

    Website for querying two large corpora compiled automatically from the web: one Basque and one parallel Spanish-Basque.

  • CorpEus

    Searches the web for words or terms in Basque and shows the results in context, as in a corpus query.

  • Elebila

    Search engine for Basque, the only one that allows you to limit the results to Basque.

Downloads

Opinion Mining - Sentiment Analysis

ElhPolar_es

Spanish polarity lexicon.

ElhPolar_eu

Basque polarity lexicon.

Basque Opinion Dataset

Polarity-annotated Basque sentences.

BEC2016 opinion dataset

Basque regional election campaign 2016 opinion dataset (BEC2016). 25,000 tweets with entity-level polarity annotations (pos|neg).

Behagunea Opinion dataset

Tweet collection about the DSS2016 Cultural Capital project. Tweets are annotated with message-level polarity (pos|neg|neu) in Basque (3,000) and Spanish (4,754).

EliXa polarity classification models (EliXa 1.0.x)

Models for polarity classification, trained on cultural-domain (Behagunea) tweets.
Previous versions: v 0.9.x

EliXa resources (EliXa 1.0.x <=)

Language-specific resources: polarity lexicons and other resources for text normalization. We currently provide such resources for four languages: Basque (eu), Spanish (es), English (en) and French (fr). Also includes POS tagging models for the ixa-pipe-pos tool.
Previous versions: v 0.9.x (Ixa-Pipes POS models not included)

Ixa-Pipes models for EliXa 0.9.x

Ixa-Pipes models for lemmatization and POS tagging (1.5.0), used by EliXa 0.9.x as its default models.

Corpora

Basque-English Parallel corpus

Basque-English parallel corpus automatically gathered using the PaCo2 tool.

Basque-Spanish Parallel corpus

Basque-Spanish parallel corpus automatically gathered using the PaCo2 tool. It contains 640K segments.

Elhuyar web corpus

Basque corpus of 186M tokens, automatically crawled from the web and cleaned.
Ref: Leturia, I. 2014. The Web as a Corpus of Basque. PhD Thesis. Faculty of Informatics, UPV/EHU. Donostia.

ChatBots

FMTODelh dataset

Basque version of the Facebook Multilingual Task Oriented Dataset (López de Lacalle et al., 2020). The train and dev sets were translated using NMT; the test set was translated manually.

SNIPSeu dataset

Test set of the SNIPS dataset (Coucke A. et al., 2018), manually translated into Basque (López de Lacalle et al., 2021).

Document classification

BHTC dataset

Basque Headlines Topic Classification (BHTC) dataset. A collection of 12,403 headlines extracted from the weekly newspaper Argia, annotated with topics. Used for the document classification task (Agerri et al., 2020).

Grammatical Error Correction

GEC-elh-eu dataset

Grammatical Error Correction (GEC) dataset for Basque. The training set contains 9 million synthetic sentence pairs (incorrect - correct). For evaluation, 6,000 synthetic examples and 672 manually revised examples are provided. If you use it, please cite Beloki et al. (2020).

Software