OPUS BITEXT AND MONOLINGUAL DATA DOWNLOAD
Another improvement of recent versions of OPUS is the availability of various download formats for all sub-corpora. We now provide all data in their native XML format (using the XCES Align DTD for sentence alignment), in Translation Memory eXchange format (TMX) and in plain text format (for Moses/GIZA++). Moreover, the website and OPUS-related data are now stored on a dedicated server to reduce interference with other processes and users. Furthermore, we started a Wiki with further information about the corpus and integrated a dedicated interface for searching the entire collection for specific language resources.
We applied heavy filtering using probability thresholds, frequency thresholds and string patterns. In particular, we extracted one-to-one word alignments of words containing at least three, exclusively alphabetic characters that occurred at least twice in the corpus, and obtained conditional phrase translation probabilities φ(f|e) and φ(e|f) of at least 0.1. We did not spend much time optimizing these parameters, but for most language pairs this procedure gave us a decent amount of reliable word translations that could be used to find lexical matches in subtitle pairs. From the phrase translation tables we can thus extract highly reliable lexical translations even though they are based on the alignment of partially noisy corpora. These word alignments are also freely available from OPUS. The dictionary-based synchronization techniques presented in (Tiedemann, 2008) were then used to re-align all subtitle pairs. These improved sentence alignments can now be found together with the bilingual dictionaries used for synchronization. A final alternative provided for the OpenSubtitles2011 corpus is an alignment based on hunalign (Varga et al., 2005).
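The filtering criteria described above (one-to-one word pairs of at least three exclusively alphabetic characters, corpus frequency of at least two, and translation probabilities φ(f|e) and φ(e|f) of at least 0.1) can be sketched as follows. The tuple layout and function names are illustrative, not the actual OPUS scripts:

```python
import re

# One-to-one entries of >= 3 exclusively alphabetic (Unicode) characters.
WORD = re.compile(r"^[^\W\d_]{3,}$")

def reliable_translations(entries, min_freq=2, min_prob=0.1):
    """Keep highly reliable word translations from phrase-table entries.

    `entries` is an iterable of (src, tgt, freq, p_f_given_e, p_e_given_f)
    tuples -- a simplified stand-in for a real Moses phrase table.
    """
    kept = []
    for src, tgt, freq, p_fe, p_ef in entries:
        if " " in src or " " in tgt:            # one-to-one words only
            continue
        if not (WORD.match(src) and WORD.match(tgt)):
            continue                            # string-pattern filter
        if freq < min_freq:                     # frequency threshold
            continue
        if p_fe < min_prob or p_ef < min_prob:  # probability thresholds
            continue
        kept.append((src, tgt))
    return kept

entries = [
    ("house", "huis", 57, 0.62, 0.58),        # passes all filters
    ("the house", "het huis", 40, 0.5, 0.5),  # multi-word: dropped
    ("ok", "oké", 90, 0.9, 0.9),              # source too short: dropped
    ("rare", "zeldzaam", 1, 0.5, 0.5),        # frequency 1: dropped
    ("run", "lopen", 30, 0.05, 0.4),          # phi(f|e) too low: dropped
]
print(reliable_translations(entries))  # -> [('house', 'huis')]
```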
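Since the data are released both in TMX and in the line-aligned plain text expected by Moses/GIZA++, a minimal conversion between the two can be sketched with the standard library. This assumes a plain TMX body of `tu`/`tuv`/`seg` elements and is not the actual OPUS tooling:

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_to_moses(tmx_string, src_lang, tgt_lang):
    """Convert a TMX document to two parallel lists of sentences,
    the line-aligned plain-text form used by Moses/GIZA++."""
    root = ET.fromstring(tmx_string)
    src_lines, tgt_lines = [], []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")  # TMX 1.4 vs. older
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:  # keep complete pairs only
            src_lines.append(segs[src_lang])
            tgt_lines.append(segs[tgt_lang])
    return src_lines, tgt_lines

tmx = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Hello!</seg></tuv>
    <tuv xml:lang="nl"><seg>Hallo!</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Good night.</seg></tuv>
    <tuv xml:lang="nl"><seg>Goedenacht.</seg></tuv></tu>
</body></tmx>"""
src, tgt = tmx_to_moses(tmx, "en", "nl")
print(src)  # -> ['Hello!', 'Good night.']
print(tgt)  # -> ['Hallo!', 'Goedenacht.']
```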
However, the actual alignment based on time information is still not perfect due to synchronization differences. Therefore, we ran a second alignment for all subtitle pairs identified in the first run, using lexical synchronization as proposed in (Tiedemann, 2008). For this it was necessary to create bilingual dictionaries for all language pairs involved. This was done by running automatic word alignment on the entire parallel data set created in the previous step. We used GIZA++ (Och and Ney, 2003) and the symmetrization heuristics (grow-diag-final-and) implemented in Moses (Koehn et al., 2007) to extract the probabilistic phrase tables used in statistical machine translation.
OPUS BITEXT AND MONOLINGUAL DATA VERIFICATION
To further clean the data, we also applied automatic detection of language-dependent character encodings using chared (Pomikálek and Suchomel, 2011) and automatic language verification using textcat (van Noord, 2010). For the latter we trained appropriate language models for Unicode UTF-8 texts for all languages involved in the corpus. These models are also released on our website. Based on these criteria we could largely filter out non-matching documents.
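The grow-diag-final-and heuristic mentioned above combines the two directed word alignments into one symmetric alignment. A simplified re-implementation of the standard algorithm (not the Moses code itself) can be sketched as:

```python
def grow_diag_final_and(e_len, f_len, e2f, f2e):
    """Symmetrize two directed word alignments, given as sets of
    (e, f) index pairs, with the grow-diag-final-and heuristic."""
    inter = e2f & f2e  # high-precision starting point
    union = e2f | f2e  # candidate points for growing
    aligned = set(inter)
    neighbors = [(-1, 0), (0, -1), (1, 0), (0, 1),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]  # incl. diagonal

    def aligned_e(e): return any(pe == e for pe, _ in aligned)
    def aligned_f(f): return any(pf == f for _, pf in aligned)

    # grow-diag: extend the intersection with neighboring union points
    # as long as one of the two words is still unaligned
    added = True
    while added:
        added = False
        for e, f in sorted(aligned):
            for de, df in neighbors:
                ne, nf = e + de, f + df
                if not (0 <= ne < e_len and 0 <= nf < f_len):
                    continue
                if ((ne, nf) in union and (ne, nf) not in aligned
                        and (not aligned_e(ne) or not aligned_f(nf))):
                    aligned.add((ne, nf))
                    added = True
    # final-and: add remaining union points whose words are BOTH unaligned
    for e, f in sorted(union):
        if not aligned_e(e) and not aligned_f(f):
            aligned.add((e, f))
    return aligned

e2f = {(0, 0), (1, 1), (2, 1)}  # source->target Viterbi alignment
f2e = {(0, 0), (1, 1), (1, 2)}  # target->source, also as (e, f) pairs
print(sorted(grow_diag_final_and(3, 3, e2f, f2e)))
# -> [(0, 0), (1, 1), (1, 2), (2, 1)]
```

Starting from the intersection keeps precision high; growing toward union points (including diagonal neighbors) recovers recall that either single direction would miss.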
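textcat follows the character n-gram profiling approach of Cavnar and Trenkle: rank the most frequent n-grams of a text and compare that ranking against per-language profiles with an "out-of-place" distance. A toy version can be sketched as follows; the training sentences and parameters are illustrative, and real models are trained on far larger corpora:

```python
from collections import Counter

def profile(text, max_n=3, top=300):
    """Ranked character n-gram profile (textcat / Cavnar-Trenkle style)."""
    text = " " + text.lower() + " "
    counts = Counter(text[i:i + n]
                     for n in range(1, max_n + 1)
                     for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top))}

def out_of_place(doc, lang):
    """Sum of rank differences; n-grams missing from the language
    profile get the maximum penalty."""
    penalty = len(lang)
    return sum(abs(rank - lang.get(g, penalty)) for g, rank in doc.items())

def identify(text, lang_profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = profile(text)
    return min(lang_profiles, key=lambda name: out_of_place(doc, lang_profiles[name]))

# Toy profiles trained on two made-up sample sentences.
profiles = {
    "en": profile("the quick brown fox jumps over the lazy dog"),
    "nl": profile("de snelle bruine vos springt over de luie hond"),
}
print(identify("the dog jumps over the fox", profiles))   # -> en
print(identify("de hond springt over de vos", profiles))  # -> nl
```

Because the distance operates on character n-grams rather than words, the same mechanism works for any language with a trained profile, which is why retraining the models for all OPUS languages in UTF-8 was sufficient.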