Corpus

WTO public trilingual corpus

In light of the growing importance of research in natural language processing, in particular in machine translation, and also to help translators concerned with international trade subjects, the WTO is willing to allow access to its public trilingual corpus.

The corpus made available contains most of the WTO public documents produced from the creation of the WTO in 1995 to December 2018. All documents have been translated by humans. The corpus is sentence-aligned automatically and contains metadata as explained in the table below. Although the quality of the alignment is good, it has not been reviewed, and thus the WTO cannot warrant the accuracy or completeness of the alignments.

ID

> Record ID

IdxID

> Record of the file

SegID

> Segment ID

BTYear

> BiText Year

BTPath

> BiText Path (corresponds to Collection & Series)

BTName

> BiText Name

SegSrc

> Source text

SegTgt

> Target text

Match

> Number of sentence(s) from Source/Target combined to make this segment

IdxDomain

> Domain name
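As an illustration of how the fields in the table above might be handled in practice, here is a minimal Python sketch that maps them onto a record type. The field names come from the table; everything else is an assumption made for the example, in particular the tab-delimited layout with a header row, the `Segment` and `read_segments` names, and the sample values. The actual layout of the downloaded files should be checked before adapting it.

```python
import csv
import io
from dataclasses import dataclass

# Field names as listed in the metadata table.
FIELDS = ["ID", "IdxID", "SegID", "BTYear", "BTPath", "BTName",
          "SegSrc", "SegTgt", "Match", "IdxDomain"]


@dataclass
class Segment:
    record_id: str    # ID
    file_record: str  # IdxID
    segment_id: str   # SegID
    year: int         # BTYear
    path: str         # BTPath
    name: str         # BTName
    source: str       # SegSrc
    target: str       # SegTgt
    match: str        # Match, kept as raw text because its notation is not specified
    domain: str       # IdxDomain


def read_segments(stream):
    """Yield Segment objects from a text stream.

    ASSUMPTION: the stream is tab-delimited with a header row naming the
    fields. This layout is hypothetical, not documented by the source.
    """
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield Segment(
            record_id=row["ID"],
            file_record=row["IdxID"],
            segment_id=row["SegID"],
            year=int(row["BTYear"]),
            path=row["BTPath"],
            name=row["BTName"],
            source=row["SegSrc"],
            target=row["SegTgt"],
            match=row["Match"],
            domain=row["IdxDomain"],
        )


# Made-up sample row, for demonstration only (not real corpus data).
sample_row = ["1", "10", "3", "2018", "EXAMPLE/PATH", "example_name",
              "The meeting was adjourned.", "La séance est levée.",
              "1-1", "example domain"]
sample = "\t".join(FIELDS) + "\n" + "\t".join(sample_row) + "\n"

segments = list(read_segments(io.StringIO(sample)))
print(segments[0].year, segments[0].source, "|", segments[0].target)
```

Keeping Match as raw text avoids guessing at its notation; it can be parsed into numbers once the real files have been inspected.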

The WTO Public Corpus is available in the three bidirectional combinations of the WTO working languages: English-Spanish, English-French and French-Spanish. It is split into 22 files (of 1 000 000 lines each) and can be downloaded as a .zip file for each language pair.
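After downloading, one simple sanity check is to count the lines in each file of an archive without unpacking it to disk. The following Python sketch does this with the standard zipfile module. The function name `count_lines_per_member` and the member names in the demonstration are hypothetical, and the demonstration uses a small in-memory archive rather than the real download.

```python
import io
import zipfile


def count_lines_per_member(archive):
    """Return {member name: line count} for every file in a zip archive.

    `archive` may be a path to a .zip file or a binary file-like object.
    """
    counts = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            # Stream the member line by line instead of loading it whole.
            with zf.open(info) as member:
                counts[info.filename] = sum(1 for _ in member)
    return counts


# Demonstration on a tiny in-memory archive with made-up member names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("part_01.txt", "line 1\nline 2\nline 3\n")
    zf.writestr("part_02.txt", "line 1\nline 2\n")
buffer.seek(0)

demo_counts = count_lines_per_member(buffer)
print(demo_counts)
```

For a real download, the path of the .zip file would be passed in place of the in-memory buffer.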

Download the corpus

Disclaimer and terms of use of the WTO Public Trilingual Corpus

Use by anyone and in any context and purpose of the World Trade Organization (WTO) Public Trilingual Corpus (English-Spanish-French) shall be subject to the following disclaimer:

  • The WTO Public Trilingual Corpus is made available without warranty of any kind, explicit or implied. More particularly, the WTO specifically makes no warranties or representations as to the accuracy or completeness of the information contained in the WTO Public Trilingual Corpus in any of its three languages, including the original language of the document, or regarding any functionalities, metadata or software embedded in the WTO Public Trilingual Corpus.
  • Under no circumstances shall the WTO be held liable for any loss, liability, injury or damage (including in criminal proceedings) incurred or suffered that is claimed to have resulted from the use of the WTO Public Trilingual Corpus. The use of the WTO Public Trilingual Corpus is at the user's sole risk. The user specifically acknowledges and agrees that the WTO is not liable for the conduct of any user. If a user is dissatisfied with any of the material provided in the WTO Public Trilingual Corpus, the user's sole and exclusive remedy is to discontinue using the WTO Public Trilingual Corpus.
  • When using the WTO Public Trilingual Corpus, the user must acknowledge the WTO as the source of the information.
  • Nothing herein shall constitute or be considered to be a limitation upon or waiver, express or implied, of the privileges and immunities of the WTO, which are specifically reserved.

 

 
