Regular Sound Changes for Cross-Language Information Retrieval
                                       Michael P Oakes and Souvik Banerjee

Abstract
The aim of this project is the automatic conversion of query terms in one language into their equivalents in a
second, historically related, language, so that documents in the second language can be retrieved. The method is
to compile lists of regular sound changes which occur between related words of a language pair, and substitute
these in the source language words to generate target language words. For example, if we know b in Italian often
corresponds with a v in Spanish, an unaccented o in Italian with ó in Spanish, and a terminal e in Italian is
replaced with a null in Spanish, we can construct the Spanish word autómovil (car) from the Italian automobile.

A bilingual word list or dictionary is needed at first to first discover the set of regular sound changes, but once
this is known, there is no further need for a dictionary to look up individual query words. The method is
language pair independent, as long as the two languages belong to the same language family, such as the
Romance languages. Buckley et al. (2000) proposed a related method based on knowledge of regular
orthographic changes between languages, when corresponding words in two languages are pronounced alike, but
the method proposed here is novel in that it also incorporates regular, linguistically attested, sound changes
between historically related languages.

1. First experimental run: cognates sought by edit distance
Buckley et al. (2000) wished to determine how effective cross-lingual information retrieval (CLIR) could be
with a minimal amount of linguistic information. They made no use of dictionaries, but instead treated the
English query words as “potentially mis-spelled French words”. In order to use a set of English query words to
retrieve French documents, they first stored all the distinct vocabulary (word types) found at least five times in
the entire set of French documents in a trie. They considered two equivalence classes of characters, one being
vowels, where any combination of vowels could be substituted for any other, the other being k sounds, where
any combination of c-k-qu could substitute for any other. Thus music and musique would be considered
equivalent. If it was possible to transform any of the English query words into any of the words in the trie of
French words using either of the allowable substitutions, or by adding or deleting one letter, those French words
were added to the query.

For our basic method, we assume that cognate words are vocabulary items which occur in two or more
languages, such that they have both similar meanings and similar orthography. The degree to which two words,
one Italian and its Spanish equivalent, are orthographically similar can be estimated using edit distance, which
has for some time been used for automatic spelling correction (Wagner and Fischer, 1974). Character level
alignment may be performed by the technique of dynamic programming. The difference between two word
forms (called the edit distance) is taken to be the smallest number of operations required to transform one word
into the other. The allowable operations include substitution (a single character in one word is replaced by a
single character in the other), deletion of a single character from the first word, and insertion of a single character
into the second word. For example, we can align the words automobile and automóvil as shown in figure 1.

Figure 1. Character-level alignment of automobile and automóvil
e   -> 0 deletion :
l   -> l match :
i   -> i match :
b   -> v substitution :
o   -> ó substitution :
m   -> m match :
o   -> o match :
t   -> t match :
u   -> u match :
a   -> a match :
edit distance = 3
In order to transform the Italian automobile into the Spanish automóvil, three operations (two substitutions and a
deletion) are required, and thus the edit distance is 3. The technique of dynamic programming finds the
alignment which minimises the edit distance. To convert the edit distance into a measure of the closeness
between two words, we use the following formula (McEnery & Oakes, 1996). This matching coefficient will be
1 for two identically spelled words, and 0 for two words with no characters at all in common.

                                        edit _ dis tan ce
matching _ coefficient = 1 −
                                  length _ of _ longer _ word

A vocabulary list of over 56,000 words was produced consisting of all the Spanish words occurring twice or
more in a subset of the Spanish document set. Program clef5.c was written to align each word in the Italian query
sets with each Spanish word in the lexicon in turn, using dynamic programming. A subjectively assigned
threshold of 0.8 was chosen, such that any word in the Spanish lexicon with a matching coefficient of 0.8 or
more with respect to the Italian query word was assumed to be a possible translation of that query word. This
method is similar to that used by the Kantrowitz et al. stemmer (2000), an edit distance stemmer with complex
edits corresponding to the most common suffixes. The first five Italian queries are shown in figure 2.

Figure 2. Five original Italian queries
*c141 lettera bomba per kiesbauer
*c142 christo impacchetta parlamento tedesco
*c143 conferenza pechino sulle donne
*c144 ribellioni sierra leone diamanti
*c145 importazioni giapponesi riso

Figure 3. Translation of Italian queries into Spanish queries
*c141
[bomba] [bomba] match = 1.000000
[bomba] [bombas] match = 0.833333
[bomba] [bombay] match = 0.833333
*c142
[christo] [christa] match = 0.857143
[christo] [christi] match = 0.857143
[christo] [cristo] match = 0.857143
[christo] [hristo] match = 0.857143
[parlamento] [apartamento] match = 0.818182
[parlamento] [palamento] match = 0.900000
[parlamento] [parlament] match = 0.900000
[parlamento] [parlamento] match = 1.000000
[parlamento] [parlamentos] match = 0.909091
*c143
[conferenza] [conferencia] match = 0.818182
[donne] [donner] match = 0.833333
*c144
[sierra] [cierra] match = 0.833333
[sierra] [pierra] match = 0.833333
[sierra] [serra] match = 0.833333
[sierra] [sierra] match = 1.000000
[sierra] [tierra] match = 0.833333
[leone] [leone] match = 1.000000
[leone] [leonel] match = 0.833333
[leone] [leones] match = 0.833333
[diamanti] [diamant] match = 0.875000
[diamanti] [diamante] match = 0.875000
*c145
[importazioni] [importacion] match = 0.833333
The Italian query sets were “translated” into Spanish using program clef5.c, as shown in figure 3. Query words
of fewer than 4 characters were stoplisted. Query c141 shows three types of Spanish terms picked up in response
to the Italian query word bomba (bomb). Firstly, we find an exact match with bomba, the correct Spanish
equivalent. We also get an above threshold match with bombas, a grammatical variant (plural) of bomba. Finally
we pick up the incorrect translation Bombay, a source of noise in our system. The overall Spanish query to be
presented to the search engine is bomba, bombas, Bombay. No Spanish query terms were found corresponding to
the Italian words lettera or Kiesbauer.

2. Second experimental run: cognates sought by regular sound changes
For the second experimental run, we used a more linguistically accurate definition of cognates, namely that
cognate words are vocabulary items which occur in two or more historically related languages, such that they
have similar meanings, and one can be transformed into the other by a predictable series of phonological
changes. For example, Nothofer (1975) has manually produced tables showing the sound equivalences which
occur in four languages spoken in or near the Indonesian island of Java, namely Javanese, Madurese, Malay and
Sundanese, all of which originate from the common ancestor language Proto-Malayo-Javanic (PMJ). One
example of such an equivalence is Javanese d = Madurese jh = Malay j = Sundanese j, as in the words for road,
dalan, jhalan, jalan and jalan respectively. Such a system of sound correspondences was first described for
Indo-European languages where it is referred to as Grimm’s Law, and it was later shown that such systems are
found in all language families (Bloomfield, 1925). The task of identifying regular sound changes in bilingual
word lists has been described by Guy (1994) as “Given a sample word list from two related languages, extract
the probable rules for predicting any word of one language from that of the other”.

To find the regular sound changes found in a given language pair, the starting point is a bilingual word list where
each word in one language is aligned with its translation. Single character substitutions can be identified using
the Wagner & Fischer edit distance algorithm described in section 1. However, in their work on bilingual
sentence alignment, Gale & Church (1993) introduced additional operations into the dynamic programming
algorithm. While the original algorithm allows only for single character insertions, deletions and substitutions,
Gale and Church also considered for example 2:1 correspondence, denoting that two sentences of one language
correspond with just one of the other. They also allowed for the fact that some operations are more commonly
encountered in real data that others, by assigning higher edit distances to less frequently encountered operations.
In program Jakarta.c (Oakes, 2000) the allowed operations correspond to the list of the types of sound change
which typically occur between related languages throughout the world given by Crowley (1992, Chapter 2).
These include single character operations, operations with higher cardinality, operations which can only involve
certain characters such as vowels, and operations which can only take place at certain positions (such as initial)
in the two words. Program Jakarta.c was used to collate the sound changes discovered in a list of 63 word pairs
taken from the introductions to Collins Gem Italian and Spanish dictionaries, examples of which are shown in
figure 4 below:

Figure 4. Sample word pairs examined for regular sound changes
Abbreviazione               abreviatura
aggettivo                   adjetivo
amministrazione             administración
avverbio                    adverbio
agricoltura                 agricultura
anatomia                    anatomía
architectura                arquitectura
automobile                  automóvil
biologia                    biología
botanica                    botánica

The sound changes found in the word pair aggetivo and adjetivo are shown in figure 5. These changes were a
fusion of the double t in Italian to a single t in Spanish, and the dissimilation of the gg (producing a single sound)
into dj (producing two separate sounds) in Spanish. All word pairs in the bilingual list were compared in this
way, and the changes were collated to produce the list shown in figure 6, which includes the number of instances
where a character remained as itself. 0 represents a null character. Pairs of words which require an above
threshold edit distance of three to transform one into the other are deemed not to be cognate (such as
abbreviazione and abbreviatura), and do not contribute to the tally of discovered sound changes.
Figure 5. Alignment of aggettivo and adjetivo
o   -> o match :
v   -> v match :
i   -> i match :
tt -> t fusion :
e   -> e match :
gg -> dj dissimilation :
a   -> a match :
cost = 2

Figure 6. Sound substitutions found in the bilingual word list
a     -> a       : 50
a     -> o       : 1
az    -> ac      : 1
b     -> b       : 6
c     -> c       : 18
d     -> d       : 1
e     -> é       : 1
e     -> 0       : 9
e     -> e       : 29
f     -> f       : 10
g     -> g       : 19
gg    -> dj      : 1
gg    -> uj      : 1
i     -> í       : 12
i     -> i       : 47
l     -> l       : 23
m     -> m       : 13
mm    -> m       : 1
n     -> n       : 23
o     -> ó       : 3
o     -> 0       : 1
o     -> o       : 35
o     -> u       : 4
p     -> p       : 6
q     -> q       : 1
r     -> r       : 26
s     -> s       : 8
ss    -> s       : 1
st    -> s       : 1
t     -> d       : 1
t     -> t       : 26
tt    -> ct      : 1
tt    -> t       : 1
u     -> ú       : 1
u     -> o       : 1
u     -> u       : 7
v     -> v       : 9
vv    -> dv      : 1
z     -> z       : 1
za    -> te      : 1
0     -> 0       : trivial

Based on the sound changes seen in figure 6, the basic edit distance program used for the first experimental run
(clef5.c) was amended to form program sounds5.c which implements the following metric: a cost of 1 is
assigned for each insertion, deletion or single character substitution not recognised as being regular, and a cost of
0 for each exact character match, deletion or single character substitution listed as being regular. The
transformations regarded as being regular are shown in figure 7. As for the first experimental run, each Italian
query term was matched against the Spanish lexicon, and all Spanish terms matching the Italian terms with an
above threshold coefficient were included as Spanish query terms. A slightly higher threshold of 0.85 was used
for the second experimental run, as the incorporation of the regular sound changes meant that true cognates
matched with slightly higher coefficients, and raising the threshold would reduce the noise caused by false
cognates being selected. The first five Spanish query sets produced for the second experimental run are shown in
figure 8.

Figure 7. Sound changes used in the second experimental run
z -> c not initial
o -> u not initial
t -> c not initial
v -> d not initial
i -> í
o -> ó
u -> ú
g -> j not first or second character
delete terminal o
delete terminal e

Figure 8. Translation of Italian queries into Spanish queries using regular sound
changes
*c141
[lettera] [lectura] match = 0.857143
[lettera] [lecturas] match = 0.875000
[bomba] [bomba] match = 1.000000
[bomba] [bombas] match = 1.000000
*c142
[christo] [chris] match = 0.857143
[christo] [christa] match = 0.857143
[christo] [christi] match = 0.857143
[christo] [cristo] match = 0.857143
[christo] [hristo] match = 0.857143
[parlamento] [palamento] match = 0.900000
[parlamento] [parlament] match = 1.000000
[parlamento] [parlamento] match = 1.000000
[parlamento] [parlamentos] match = 1.000000
*c143
[conferenza] [conferencia] match = 0.909091
[conferenza] [conferencias] match = 0.916667
[donne] [dunn] match = 1.000000
*c144
[sierra] [sierra] match = 1.000000
[sierra] [tierras] match = 0.857143
[leone] [leon] match = 1.000000
[leone] [león] match = 1.000000
[leone] [leone] match = 1.000000
[leone] [leones] match = 1.000000
[diamanti] [diamant] match = 0.875000
[diamanti] [diamante] match = 0.875000
[diamanti] [diamantes] match = 0.888889
*c145
[importazioni] [importacion] match = 0.916667
[importazioni] [importación] match = 0.916667
[importazioni] [importaciones] match = 0.923077
3. The search engine
For both experimental runs, the task was to translate the Italian query sets to Spanish query sets, then match the
Spanish query sets against the Spanish document set using a search engine. Our search engine uses a very simple
algorithm. The Spanish query sets are submitted in turn, and for each document in the Spanish document set, a
score of 1 is given for each word in the document which matches a word in the query set. The overall score for
each document is normalised by dividing it by the number of words in the document, and the documents are
ranked, so that those with the best normalised matching score are presented first.

4. Conclusions
We believe that our results could be improved in future in a number of ways. Firstly a larger Spanish vocabulary
could be produced using the entire Spanish document set. Secondly, we need to determine optimal matching
coefficient thresholds which best discriminate between true and false word translations. Thirdly, we need a much
larger bilingual word list to better determine the set of sound changes found in Italian and Spanish cognate
words. Finally, program sounds5.c could be enhanced to allow regular multiple character substitutions.

We have demonstrated a method of generating target language query words using source language keywords and
a list of regular sound changes, if the source and target languages are historically related, as they are in the case
of Italian and Spanish, which share many cognate words. The differences with respect to Buckley et al.’s
approach are firstly that linguistically motivated sound substitutions are used, and secondly that regular sound
substitutions are used rather than just orthographic substitutions for homophones.

References
L Bloomfield, On the Sound System of Central Algonquian. Language 1, pp 130-156, 1925.

C Buckley, J Walz, M Mitra & C Cardi, “Using Clustering and Super Concepts Within SMART”, NIST Special
Publication 500-240: The Sixth Text Retrieval Conference (TREC6), 2000.
http://trec.nist.gov/pubs/trec6/t6_proceedings.html

T Crowley. An Introduction to Historical Linguistics. Oxford University Press, 1992.

W Gale & K Church. A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, 19(1),
pp 75-102, 1993.

J B M Guy. An Algorithm for Identifying Cognates in Bilingual Word Lists and its Applicability to Machine
Translation. Journal of Quantitative Linguistics 1(1), pp 34-42, 1994.

M Kantrowicz, M Behrang & V Mittal, “Stemming and its Effects on TFIDF Ranking”, Proceedings of the 23rd
ACM SIGIR Conference, Athens, Greece, 2000.

A M McEnery & M P Oakes ,“Sentence and Word Alignment in the CRATER Project”, in “Using Corpora for
Language Research”, ed. J Thomas and M Short, Longman, pp 211-231, 1996.

B Nothofer. The Reconstruction of Proto-Malayo-Javanic. ‘s-Gravenhage:Martinus Nijhoff, 1975.

M P Oakes, Computer Estimation of Vocabulary in a Protolanguage from Word Lists in Four Daughter
Languages, Journal of Quantitative Linguistics 7(3) 2000, pp 233-244.

R A Wagner & M J Fischer, The String to String Correction Problem. Journal of the ACM, 21, p 168, 1974.