Using parallel corpora in translation

Author: Raphael Salkie


Parallel corpora are large collections of texts in two languages. They can be used for teaching and research in translation, bilingual lexicography, and linguistics.

What is a parallel corpus

A corpus is a large collection of texts, stored on a computer. A parallel corpus contains texts in two languages. We can distinguish two main types of parallel corpus:

Comparable corpus: the texts are of the same kind and cover the same content. Examples would be a corpus of articles about football from English and Danish newspapers, or a corpus of legal contracts in Spanish and Greek.

Translation corpus: the texts in one language (L1) are translations of texts in the other language (L2).

(Not everyone uses exactly this terminology for the different types of corpus.) Here I will concentrate on translation corpora. Many researchers have built translation corpora in the past decade, though unfortunately most of them are not easily available. For a useful survey of parallel corpora round the world, see Michael Barlow's parallel corpora web page (Barlow n.d.).

To use a translation corpus you need a special piece of software called a parallel concordancer. With this software you can ask the computer to find all the examples of a word or phrase in L1, along with the corresponding sentences in L2. Two widely used parallel concordancers are ParaConc (see Michael Barlow's ParaConc web page (Barlow n.d2) for details) and Multiconcord (information at the Multiconcord web page (Johns 1998)).
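The core lookup that a parallel concordancer performs can be sketched in a few lines of Python. This is only a minimal illustration, not a description of how ParaConc or Multiconcord work internally. It assumes the corpus has already been aligned sentence by sentence, and the names parallel_concordance and pairs, as well as the search pattern, are hypothetical. The first two sentence pairs are taken from the INTERSECT examples quoted in the next section; the third is invented filler.

```python
import re
from typing import List, Tuple

# Assumption: the corpus is already aligned, one (source, translation) pair per sentence.
SentencePair = Tuple[str, str]


def parallel_concordance(pairs: List[SentencePair], pattern: str) -> List[SentencePair]:
    """Return every aligned pair whose source sentence matches the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if regex.search(src)]


pairs: List[SentencePair] = [
    (
        "Le comité prie le gouvernement de préciser si les trois travailleurs de "
        "l'entreprise Vianini Entrecanales mentionnés par les plaignants ont été licenciés.",
        "The Committee requests the Government to indicate whether the three workers of "
        "the Vianini Entrecanales undertaking mentioned by the complainant were dismissed.",
    ),
    (
        "La combinaison no 32 peut être utilisée dans certaines séquences de signalisation "
        "de commutation; ces usages sont précisés dans les Recommandations U.11, U.20, U.22 et S.4.",
        "Combination No. 32 can be used in certain sequences of switching signals; these uses "
        "are set out in Recommendations U.11, U.20, U.22 and S.4.",
    ),
    # Invented filler pair that should not match the search.
    ("Le chat dort.", "The cat is sleeping."),
]

# Match any word beginning with "précis" (préciser, précisés, ...).
hits = parallel_concordance(pairs, r"\bprécis\w*")

for source, translation in hits:
    print(source)
    print(translation)
    print()
```

Searching with a pattern rather than a single word form means that inflected forms of the verb are retrieved together, which is what makes the range of translations visible.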

Parallel corpora and translation

Here are some examples of the French word PRÉCISER from the INTERSECT French-English corpus (see Salkie 2000 for information about this corpus):

  1. Les biens des organisations visées sont transférés dans le domaine de l'État, et le ministère de l'Intérieur doit par décret suprême [[préciser]] quels sont les biens en question ...
  2. The assets of the organisations concerned are transferred into the State domain and the Ministry of Interior is called upon to specify by supreme decree the assets in question ... [INTORGS\ILO]
  3. Cette dame, rencontrée dans le centre de Moscou, déclare, dans un premier temps, gagner 8,000 roubles par mois. Elle [[précisera]], dans un deuxième temps, que son mari gagne aussi à peu près la même somme ...
  4. A woman I met in central Moscow at first said her income was 8,000 roubles a month. Subsequently, she added that her husband earned about the same amount... [NEWS\LM-GW93]
  5. Le comité prie le gouvernement de [[préciser]] si les trois travailleurs de l'entreprise Vianini Entrecanales mentionnés par les plaignants ont été licenciés.
  6. The Committee requests the Government to indicate whether the three workers of the Vianini Entrecanales undertaking mentioned by the complainant were dismissed. [INTORGS\ILO]
  7. La combinaison no 32 peut être utilisée dans certaines séquences de signalisation de commutation; ces usages sont [[précisés]] dans les Recommandations U.11, U.20, U.22 et S.4.
  8. Combination No. 32 can be used in certain sequences of switching signals; these uses are set out in Recommendations U.11, U.20, U.22 and S.4. [SCI-TECH\TELECOM]
  9. Un autre acte pourtant, à vos yeux ridicule peut-être, mais que je redirai, car il [[précise]] en sa puérilité le besoin qui me tourmentait ...
  10. I will tell you, however, about one other action of mine, though perhaps you will consider it ridiculous, for its very childishness marks the need that then tormented me ... [FICTION\GIDE]

Bilingual dictionaries tend to concentrate on the most obvious translations, in this case specify. In the corpus, specify is actually quite rare as a translation of préciser: well over twenty different translations appear, not just the five shown here. The corpus thus gives us a wider and richer set of possible translations, reflecting the skill and creativity of the translators who produced its texts. We can use this information in teaching translation, where it encourages students to consider a wider range of possible equivalents. We can also use it to remedy the limitations of bilingual dictionaries (see Teubert 2001 for discussion).

Parallel corpora and linguistics: translating 'should' into German

We can also use data from a parallel corpus to compare the grammar and vocabulary of two languages. This can enable us to ask questions which we could not investigate if we just looked at one language.

Here is an example, this time from English and German. One of the most controversial problems in linguistics is the analysis of modal verbs like can, must, may and should. Linguists disagree about how many different meanings each modal has, and about whether each modal has a single underlying meaning (for a recent contribution, see Papafragou 2000).

We took 100 random examples of should from the INTERSECT German-English corpus, and noted how they were translated into German. We found eight different types of translation, listed as (a) to (h), with the number of instances of each type in brackets:

a. Past tense form of sollen (31)

(11) While such action [[should]] only be taken when all peaceful means have failed, the option of taking it is essential to the credibility of the United Nations as a guarantor of international security.

(12) Obwohl diese Maßnahmen erst durchgeführt werden sollten, wenn alle friedlichen Mittel versagt haben, so ist die Möglichkeit ihrer Inanspruchnahme doch unabdingbar für die Glaubwürdigkeit der Vereinten Nationen als Garant der internationalen Sicherheit. [INTORGS\UN]

b. Present tense form of sollen (25)

(13) State Premier Erwin Teufel announced today in Stuttgart that the coalition agreement between CDU and FDP in Baden-Württemberg [[should]] be finalized by next Thursday.

(14) Die Koalitionsvereinbarung von CDU und FDP in Baden-Württemberg soll am kommenden Donnerstag stehen. Das kündigte Ministerpräsident Teufel am Mittag in Stuttgart an. [NEWS\NEWAP96]

c. Conditional construction (12)

(15) [[Should]] any of them be before the courts for money matters, only residents of the region may be called as witnesses.

(16) Wenn einer von ihnen jemanden wegen einer Geldsache gerichtlich belangen will, soll er vor dem Richter nur solche Personen als Zeugen benennen können, die in ihrem Gebiet ansässig sind. [MISC\SIEBS]

d. müssen (9)

(17) The association's vice president, Mr. Lau, said in an interview that the government had spent too much and [[should]] have started saving much earlier.

(18) Vizepräsident Lau sagte in einem Interview, der Staat habe zuviel ausgegeben und hätte viel früher sparen müssen. [NEWS\NEWAP96]

e. dürfen + NEG (2)

(19) Mr Beck, State Premier of Rhineland-Palatinate, said when interviewed by the magazine FOCUS that the ability to form a coalition between SPD and FDP [[should]] not be lost.

(20) Der rheinland-pfälzische Ministerpräsident Beck sagte in einem Interview mit dem Nachrichtenmagazin FOCUS, die Koalitionsfähigkeit zwischen der Sozialdemokratie und den Liberalen dürfe nicht verloren gehen. [NEWS\NEWAP96G]

f. Subjunctive (6)

(21) In requiring the proletariat to carry out such a system, and thereby to march straightaway into the social New Jerusalem, it but requires in reality that the proletariat [[should]] remain within the bounds of existing society, ...

(22) Wenn er das Proletariat auffordert, seine Systeme zu verwirklichen und in das neue Jerusalem einzugehen, so verlangt er im Grunde nur, daß es in der jetzigen Gesellschaft stehenbleibe ... [POLITICS\MANIFESTO]

g. Indicative (5)

(23) I was vexed that the boor [[should]] have waked me, and I started up and cried, "Hold your tongue!

(24) Mich ärgerte es nur, daß mich der Grobian aufgeweckt hatte. Ich sprang ganz erbost auf und versetzte geschwind: "Was, Er will mich hier ausschimpfen? [FICTION\TAUGEN]

h. Other (10)

(25) The specific problems of countries with economies in transition with respect to their twofold transition to democracy and a market economy [[should]] also be recognized.

(26) Ebenso gilt es, die spezifischen Probleme der Umbruchländer in bezug auf ihren zweifachen Übergang zur Demokratie und zur Marktwirtschaft anzuerkennen. [INTORGS\UN]

The data showed that only 56 of the 100 examples of should in the sample, those of types (a) and (b), were translated by a form of sollen. This makes it very hard to extend any of the recent analyses of English modals to their German counterparts. (For further discussion see Salkie 2002a.)
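The arithmetic behind this figure can be checked with a short Python tally. The counts are those given in brackets for types (a) to (h) above; the variable names are purely illustrative.

```python
# Number of instances of each translation type, as listed in (a) to (h).
counts = {
    "past tense form of sollen": 31,
    "present tense form of sollen": 25,
    "conditional construction": 12,
    "müssen": 9,
    "dürfen + NEG": 2,
    "subjunctive": 6,
    "indicative": 5,
    "other": 10,
}

total = sum(counts.values())

# Types (a) and (b) are the only ones that use a form of sollen.
sollen = sum(n for label, n in counts.items() if "sollen" in label)
share = sollen / total

print(f"{sollen} of {total} examples use sollen ({share:.0%})")
# prints: 56 of 100 examples use sollen (56%)
```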


Parallel corpora are a recent development, so translators and linguists have barely begun to exploit their full potential. For a good survey of research and outstanding issues in this area, see Lawson (2001). Botley et al. (2000) contains a number of useful papers. Salkie (2002b) discusses some implications for translation theory. For some recent research using parallel corpora, see Ebeling (1998), Kenning (1998), Salkie and Oates (1999), and Aijmer and Altenberg (2002).


