Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)

Measuring the Relatedness between Documents in Comparable Corpora

Hernani Costa(a), Gloria Corpas Pastor(a) and Ruslan Mitkov(b)
(a) LEXYTRAD, University of Malaga, Spain
(b) RIILP, University of Wolverhampton, UK
{hercos,gcorpas}@uma.es, r.mitkov@wlv.ac.uk

Abstract

This paper investigates the use of textual distributional similarity measures in the context of comparable corpora. We address the issue of measuring the relatedness between documents by extracting, measuring and ranking their common content. For this purpose, we designed and applied a methodology that combines available natural language processing technology with statistical methods. Our findings show that a list of common entities together with a simple, yet robust set of distributional similarity measures is enough to describe and assess the degree of relatedness between documents. Moreover, our method demonstrated high performance in the task of filtering out documents with a low level of relatedness: by way of example, one of the measures achieved 100%, 100%, 95% and 90% precision when 5%, 10%, 15% and 20% of noise was injected, respectively.

1 Introduction

Comparable corpora[1] can be considered an important resource for several research areas such as Natural Language Processing (NLP), terminology, language teaching, and automatic and assisted translation, amongst other related areas. Nevertheless, an inherent problem for those who deal with comparable corpora on a daily basis is the uncertainty about the data they are dealing with. Indeed, little work has been done on semi- or fully automatically characterising such linguistic resources, and attempting a meaningful description of their content is often a perilous task (Corpas Pastor and Seghiri, 2009). Usually, a corpus is given a short description such as "casual speech transcripts" or "tourism specialised comparable corpus". Yet, such tags are of little use to users seeking a representative and/or high-quality domain-specific corpus. Apart from the usual description that comes along with a corpus – number of documents, tokens, types, source(s), creation date, policies of usage, etc. – nothing is said about how similar the documents are or how to retrieve the most related ones. As a result, most of the resources at our disposal are built and shared without a deep analysis of their content, and those who use them blindly trust the name of the person or research group behind their compilation, without knowing anything about the relatedness of the documents. Although some tasks require documents with a high degree of relatedness to each other, the literature is scarce on this matter.

Accordingly, this work explores this niche by taking advantage of several textual Distributional Similarity Measures (DSMs) presented in the literature. Firstly, we selected a specialised corpus for the tourism and beauty domain that was manually compiled by researchers in the area of translation and interpreting studies. Then, we designed and applied a methodology that combines available NLP technology with statistical methods to assess how the documents in the corpus correlate with each other. Our assumption is that the amount of information contained in a document can be evaluated by summing the amount of information contained in its member words.

[1] I.e. corpora that include similar types of original texts in one or more languages, built using the same design criteria (cf. EAGLES, 1996; Corpas Pastor, 2001).
For this purpose, a list of common entities is used as a unit of measurement capable of identifying the amount of information shared between documents. Our hypothesis is that this approach will allow us to compute the relatedness between documents; to describe and characterise the corpus itself; and to rank the documents by their degree of relatedness. In order to evaluate how the DSMs perform the task of ranking documents based on their similarity and filtering out the unrelated ones, we introduced noisy documents, i.e. out-of-domain documents, into the corpus at hand.

The remainder of the paper is structured as follows. Section 2 introduces some fundamental concepts related to DSMs, i.e. it explains the theoretical foundations, related work and the DSMs exploited in this experiment. Then, Section 3 presents the corpora used in this work. After describing the methodology in Section 4, Section 5 presents and discusses the obtained results in detail. Finally, Section 6 presents the final remarks and highlights our future work.

2 Distributional Similarity Measures

Information Retrieval (IR) (Singhal, 2001) is the task of locating specific information within a collection of documents or other natural language resources according to some request. This field is rich in statistical methods that use words and their (co-)occurrence to retrieve documents or sentences from large data sets. In simple words, these IR methods aim to find the most frequently used words and treat the rate of usage of each word in a given text as a quantitative attribute. These words then serve as features for a given statistical method. Following Harris' distributional hypothesis (Harris, 1970), which assumes that similar words tend to occur in similar contexts, these statistical methods are suitable, for instance, for finding similar sentences based on the words they contain (Costa et al., 2015) and for automatically extracting or validating semantic entities from corpora (Costa et al., 2010; Costa, 2010; Costa et al., 2011). To this end, it is assumed that the amount of information contained in a document can be evaluated by summing the amount of information contained in the document's words, and that the amount of information conveyed by a word can be represented by means of the weight assigned to it (Salton and Buckley, 1988).

With this in mind, we took advantage of two IR measures commonly used in the literature, Spearman's Rank Correlation Coefficient (SCC) and Chi-Square (χ²), to compute the similarity between documents written in the same language (see Sections 2.1 and 2.2). Both measures are particularly useful for this task because they are independent of text size (mostly because both use a list of the common entities) and because they are language-independent.

The SCC distributional measure has been shown to be effective in determining similarity between sentences, documents and even corpora of varying sizes (Kilgarriff, 2001; Costa et al., 2015; Costa, 2015). It is particularly useful, for instance, for measuring the textual similarity between documents because it is easy to compute and independent of text size, as it can directly compare ranked lists for large and small texts. The χ² similarity measure has also shown robustness and high performance. By way of example, χ² has been used to analyse the conversation component of the British National Corpus (Rayson et al., 1997), to compare both documents and corpora (Kilgarriff, 2001; Costa, 2015), and to identify topic-related clusters in imperfectly transcribed documents (Ibrahimov et al., 2002). It is a simple statistical measure that makes it possible to assess whether the relationship between two variables in a sample is due to chance or is systematic.

Bearing this in mind, distributional similarity measures in general, and SCC and χ² in particular, have a wide range of applications (Kilgarriff, 2001; Costa et al., 2015; Costa, 2015). Indeed, this work aims to show that these simple, yet robust and high-performance measures make it possible to describe the relatedness between documents in specialised corpora and to rank them according to their similarity.

2.1 Spearman's Rank Correlation Coefficient (SCC)

In this work, the SCC is adopted and calculated as in Kilgarriff (2001). Firstly, a list of the common entities[2] L between two documents d_l and d_m is compiled, where L(d_l,d_m) ⊆ (d_l ∩ d_m). It is possible to use the top n most common entities or all common entities between two documents, where n corresponds to the total number of common entities considered |L|, i.e. {n | n ∈ ℕ0, n ≤ |L|} – in this work we use all the common entities for each document pair, i.e. n = |L|. Then, for each document the list of common entities (e.g. L_dl and L_dm) is ranked by frequency in ascending order (RL_dl and RL_dm), where the entity with the lowest frequency receives ranking position 1 and the entity with the highest frequency receives ranking position n. Finally, for each common entity {e_1, ..., e_n} ∈ L, the difference s_i in the rank orders of the entity in the two documents is computed, and the sum of the squares of these differences (Σ_{i=1..n} s_i²) is used in the final SCC equation, presented in Equation 1, where {SCC | SCC ∈ ℝ, −1 ≤ SCC ≤ 1}:

    SCC(d_l, d_m) = 1 − (6 · Σ_{i=1..n} s_i²) / (n³ − n)    (1)

[2] In this work, the term 'entity' refers to "single words", which can be a token, a lemma or a stem.

2.2 Chi-Square (χ²)

The Chi-Square (χ²) measure also uses a list of common entities (L). As with the SCC, it is possible to use the top n most common entities or all common entities between two documents; again, we use all the common entities for each document pair, i.e. n = |L|. The number of occurrences of a common entity in L that would be expected in each document is calculated from the frequency lists. If the sizes of documents d_l and d_m are N_l and N_m, and the entity e_i has observed frequencies O(e_i, d_l) and O(e_i, d_m), then the expected values are:

    E(e_i, d_l) = N_l · (O(e_i, d_l) + O(e_i, d_m)) / (N_l + N_m)
    E(e_i, d_m) = N_m · (O(e_i, d_l) + O(e_i, d_m)) / (N_l + N_m)

Equation 2 presents the χ² formula, where O is the observed frequency and E the expected frequency:

    χ²(d_l, d_m) = Σ (O − E)² / E    (2)

The resulting χ² score should be interpreted as the inter-document distance between two documents. It is also important to mention that {χ² | χ² ∈ ℝ, 0 ≤ χ² < ∞}, and that the more unrelated the common entities in L are, the lower the χ² score will be.

3 Corpora

INTELITERM[3] is a specialised comparable corpus composed of documents collected from the Internet. It was manually compiled by researchers with the purpose of building a representative corpus (Biber, 1988, p. 246) for the tourism and beauty domain. It contains documents in four different languages (English, Spanish, Italian and German). Some of the texts are translations of each other (parallel), yet the majority consists of original texts. The corpus comprises several subcorpora, divided by language, and each language is further divided into translated and original texts. For the purpose of this work, only the original documents in English, Spanish and Italian were used, which from now on will be referred to as int_en, int_es and int_it, respectively.

In order to analyse how the DSMs perform the task of ranking documents based on their similarity and filtering out the unrelated ones, it is necessary to introduce noisy documents, i.e. out-of-domain documents, into the various subcorpora. To do that, we chose the well-known Europarl[4] corpus (Koehn, 2005), a parallel corpus composed of proceedings of the European Parliament. As mentioned further in Section 5.2, we added different amounts of noise to the various subcorpora, more precisely 5%, 10%, 15% and 20%. These noisy documents were randomly selected from the "one per day" Europarl v.7 for the three working languages: English, Spanish and Italian (eur_en, eur_es and eur_it, respectively).

    SubC     nDocs   types   tokens   types/tokens
    int_en   151     11.6k   496.2k   0.023
    eur_en   30      3.4k    29.8k    0.116
    int_es   224     13.2k   207.3k   0.063
    eur_es   44      5.6k    43.5k    0.129
    int_it   150     19.9k   386.2k   0.052
    eur_it   30      4.7k    29.6k    0.159

Table 1: Statistical information per subcorpus.

All the statistical information about the INTELITERM subcorpora and the sets of 20% noisy documents, randomly selected for each working language, is presented in Table 1. In detail, this table shows: the number of documents (nDocs); the number of types (types); the number of tokens (tokens); and the ratio of types per tokens (types/tokens) per subcorpus. These values were obtained using the AntConc 3.4.3 software (Anthony, 2014), a corpus analysis toolkit for concordancing and text analysis.

[3] http://www.lexytrad.es/proyectos.html
[4] http://www.statmt.org/europarl/
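Before turning to the methodology, the two measures defined in Section 2 can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: documents are assumed to be represented as entity-to-frequency dictionaries, function names are ours, and ties in frequency are broken arbitrarily (the paper does not specify tie handling).

```python
def common_entities(d1, d2):
    """L: the entities occurring in both documents."""
    return [e for e in d1 if e in d2]

def scc(d1, d2):
    """Spearman's Rank Correlation Coefficient over the common-entity
    frequency ranks (Equation 1): rank 1 = lowest frequency, rank n = highest."""
    L = common_entities(d1, d2)
    n = len(L)
    if n < 2:  # the formula is undefined for fewer than two common entities
        return 0.0
    r1 = {e: i + 1 for i, e in enumerate(sorted(L, key=lambda e: d1[e]))}
    r2 = {e: i + 1 for i, e in enumerate(sorted(L, key=lambda e: d2[e]))}
    s2 = sum((r1[e] - r2[e]) ** 2 for e in L)
    return 1 - (6 * s2) / (n ** 3 - n)

def chi_square(d1, d2):
    """Chi-square over the common entities (Equation 2): observed vs.
    expected frequencies if both documents drew from one distribution."""
    L = common_entities(d1, d2)
    n1, n2 = sum(d1.values()), sum(d2.values())
    score = 0.0
    for e in L:
        joint = d1[e] + d2[e]
        e1 = n1 * joint / (n1 + n2)  # expected frequency in d1
        e2 = n2 * joint / (n1 + n2)  # expected frequency in d2
        score += (d1[e] - e1) ** 2 / e1 + (d2[e] - e2) ** 2 / e2
    return score
```

Two identical frequency lists give SCC = 1 and χ² = 0, while reversing the frequency ranking of the shared entities drives the SCC towards −1.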
4 Methodology

This section describes the methodology employed to calculate and rank documents based on their similarity using Distributional Similarity Measures (DSMs). All the tools, libraries and frameworks used for the purpose at hand are also pointed out.

1) Data preprocessing: firstly, all the INTELITERM documents were processed with the OpenNLP[5] Sentence Detector and Tokeniser. Then, the annotation process was carried out with the TT4J[6] library, a Java wrapper around the popular TreeTagger (Schmid, 1995) – a tool specifically designed to annotate text with part-of-speech and lemma information. Regarding stemming, we used the Porter stemmer algorithm provided by the Snowball[7] library. A method to remove punctuation and special characters within the words was also implemented. Finally, in order to get rid of noise, a stopword list[8] was compiled to filter out the most frequent words in the corpus. Once a document has been processed and its sentences tokenised, lemmatised and stemmed, our system creates a new output file with all this new information, i.e. a new document containing the original, the tokenised, the lemmatised and the stemmed text. Using the stopword list mentioned above, a Boolean vector indicating whether each entity is a stopword is also added to the document. This way, the system can use only the tokens, lemmas and stems that are not stopwords.

2) Identifying the list of common entities between documents: in order to identify a list of common entities (from now on we will use the acronym NCE), a co-occurrence matrix was built for each pair of documents. Only entities that occur at least once in both documents are considered. As required by the DSMs (see Section 2), their frequency in both documents is also stored within this matrix: L(d_l,d_m) = {e_i, (f(e_i, d_l), f(e_i, d_m)); e_j, (f(e_j, d_l), f(e_j, d_m)); ...; e_n, (f(e_n, d_l), f(e_n, d_m))}, where f represents the frequency of an entity in a document. With the purpose of analysing and comparing the performance of different DSMs, three different lists were created to be used as input features: the first using the Number of Common Tokens (NCT), another using the Number of Common Lemmas (NCL) and the third using the Number of Common Stems (NCS).

3) Computing the similarity between documents: the similarity between documents was calculated by applying three different DSMs (DSMs = {DSM_NCE, DSM_SCC, DSM_χ²}, where NCE, SCC and χ² refer to Number of Common Entities, Spearman's Rank Correlation Coefficient and Chi-Square, respectively), each one calculated using the three different input features (NCT, NCL and NCS).

4) Computing the document final score: the document final score DSM(d_l) is the mean of the similarity scores of the document with all the other documents in the collection, i.e. DSM(d_l) = (Σ_{i=1..n−1} DSM_i(d_l, d_i)) / (n − 1), where n corresponds to the total number of documents in the collection and DSM_i(d_l, d_i) is the resulting similarity score between document d_l and each other document in the collection.

5) Ranking documents: finally, the documents were ranked in descending order according to their DSM scores (i.e. NCE, SCC or χ²).

[5] https://opennlp.apache.org
[6] http://reckart.github.io/tt4j/
[7] http://snowball.tartarus.org
[8] Freely available at https://github.com/hpcosta/stopwords.
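Steps 4 and 5 above can be sketched as follows. This is an illustrative outline rather than the authors' system: documents are assumed to be entity-to-frequency dictionaries, the Number of Common Entities is used as the similarity function for brevity (any of the three DSMs could be plugged in), and all names are ours.

```python
def nce(d1, d2):
    """Number of Common Entities between two frequency dictionaries."""
    return len(set(d1) & set(d2))

def rank_documents(docs, sim=nce):
    """Step 4: each document's final score is the mean of its pairwise
    similarity with every other document in the collection.
    Step 5: rank the documents by that score in descending order."""
    scores = {}
    for name, doc in docs.items():
        others = [sim(doc, other) for n, other in docs.items() if n != name]
        scores[name] = sum(others) / len(others)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because an out-of-domain document shares few entities with the rest of the collection, its mean score is low and it sinks to the bottom of the ranking, which is what the filtering task in Section 5.2 exploits.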
5 Results and Analysis

This experiment is divided into two parts. In the first part (Section 5.1), we describe the corpus at hand by applying three different Distributional Similarity Measures (DSMs): the Number of Common Entities (NCE), the Spearman's Rank Correlation Coefficient (SCC) and the Chi-Square (χ²). As input features to the DSMs, three different lists of entities were used: the Number of Common Tokens (NCT), the Number of Common Lemmas (NCL) and the Number of Common Stems (NCS). By way of example, Table 2 shows the NCT between documents and the SCC and χ² scores as averages (av) along with the associated standard deviations (σ) per measure and subcorpus. Figure 1 presents the resulting average scores per document in box plot format for all DSM vs. feature combinations. Each box plot displays the full range of variation (from min to max), the likely range of variation (the interquartile range, or IQR), the median, and the high maximums and low minimums (also known as outliers). It is important to mention that for the first part of this experiment (Section 5.1) we did not use a sample, but instead the entire INTELITERM subcorpora in their original size and form, which means that all obtained results and observations come from the entire population, in this case the English (int_en), Spanish (int_es) and Italian (int_it) subcorpora (for more details about the subcorpora see Section 3). For the second part of the experiment, we used the same subcorpora, but an additional percentage of documents was added to them in order to test how the DSMs perform the task of filtering out these noisy documents, i.e. out-of-domain documents (see Section 5.2). In detail, Figure 2 shows how the average scores decrease when noisy documents are injected, and Table 3 presents how the DSMs performed when that noise was injected.

5.1 Describing the Corpus

The first observation we can make from Figure 1 is that the distributions of the different features are quite similar (see for instance Figures 1a, 1d and 1g). This means that it is possible to achieve acceptable results using only raw words (i.e. tokens). Stems and lemmas require more processing power and time to be used as features – especially lemmas, due to the part-of-speech tagger dependency and the time-consuming process implied. In general, we can say that the scores for each subcorpus are symmetric (roughly the same on each side when cut down the middle), which means that the data is normally distributed. There are some exceptions that we will discuss throughout this section. Another interesting observation concerns the high Number of Common Tokens (NCT) in English (int_en) when compared with Italian and Spanish (int_it and int_es, respectively); see Table 2 and Figure 1a. Later in this section, we will try to explain this phenomenon.

    SubC     Stats   NCT      SCC    χ²
    int_en   av      163.70   0.42   279.39
             σ       83.87    0.05   177.45
    int_es   av      31.97    0.41   40.92
             σ       23.48    0.07   38.21
    int_it   av      101.08   0.39   201.97
             σ       55.71    0.05   144.68

Table 2: Average and standard deviation of the scores between documents per subcorpus.

Although the NCT per document is on average higher for the int_en subcorpus, its interquartile range (IQR) is larger than for the other subcorpora (see Table 2 and Figure 1a), which means that the middle 50% of the data is more spread out and thus the average NCT per document is more variable. Moreover, the long whiskers (the lines extending vertically from the box) in Figure 1a also indicate variability outside the upper and lower quartiles. Therefore, we can say that int_en contains a wide variety of documents, and consequently some of them are only roughly correlated with the rest of the subcorpus. Nevertheless, the data is skewed left, and the longer whisker outside the upper quartile indicates that the majority of the data is strongly similar, i.e. the documents have a high degree of relatedness to each other. This idea is supported not only by the positive average SCC scores, but also by the set of outliers above the upper whisker in Figure 1b. The average SCC score of 0.42 with σ = 0.05 also implies a strong correlation between the documents in the int_en subcorpus (Table 2). Likewise, the long whisker and the set of outliers outside the upper quartile in the χ² scores also indicate a high relatedness between the documents.

Regarding the int_it subcorpus, the SCC and χ² scores (Figures 1b and 1c) and the average of 101.08 common tokens per document with σ = 55.71 (Figure 1a and Table 2) suggest that the data is normally distributed (Figure 1b) and highly correlated. Although this subcorpus obtained lower average scores for all the DSMs than the English subcorpus, Table 2 and Figures 1a, 1b and 1c show that the average scores and the range of variation are quite similar to those of the English subcorpus. Therefore, we can conclude that the documents inside the Italian subcorpus are highly related to each other.

[Figure 1: INTELITERM: average scores between documents per subcorpus. Box plots (a)–(c) show common tokens, SCC (tokens) and χ² (tokens); (d)–(f) the same measures over lemmas; (g)–(i) over stems, for int_en, int_es and int_it.]

Of the three subcorpora, int_es is the biggest, with 224 documents (Table 1). Nevertheless, its average scores per document differ noticeably from the other box plots (see Figures 1a, 1b and 1c). The fact that the χ² standard deviation is practically equal to its average (38.21 and 40.92, respectively), together with the SCC variability inside and outside the IQR, indicates some inconsistency in the data. Moreover, Table 2 and Figure 1a reveal a lower NCT compared with the int_en and int_it subcorpora.

The int_en subcorpus has 163 common tokens per document on average with σ = 83, while the int_it and int_es subcorpora only have 101 and 31 common tokens per document on average with σ = 55 and σ = 23, respectively (Table 2, NCT column). This means that the int_it and int_es subcorpora are composed of documents with a lower level of relatedness than the English one. One possible reason is that Italian and Spanish have a richer morphology than English: due to the larger number of inflected forms per lemma, there is a larger number of distinct tokens and consequently fewer common tokens per document in Spanish. Another explanation could be that tourism and beauty services are more developed in Italy and Spain than in the UK, and therefore there is more variety in the vocabulary used as well as in the services offered. Indeed, Table 1 offers some evidence about the vocabulary employed: the English subcorpus has a lower number of types and a higher number of tokens (11.6k and 496.2k, respectively) when compared with the Italian (19.9k types and 386.2k tokens) and Spanish subcorpora (13.2k types and 207.3k tokens). The large difference in the average number of common tokens per document between Spanish and the other two languages may also be related to the marketing strategies used to advertise tourism and beauty services, although this is hard to confirm. Moreover, while our method captures the lexical level of similarity between documents, the semantic level is not taken into account, i.e. it does not, for example, treat synonyms as similar words, which would result in slightly different similarity scores (again, an explanation that is difficult to confirm).

To conclude, we can state from the statistical and theoretical evidence that the int_en and int_it subcorpora appear to assemble highly correlated documents. We cannot say the same for the int_es subcorpus: due to the scarcity of evidence, we can only refrain from rejecting the idea that this subcorpus is composed of similar documents. Nevertheless, as we will see in the next section, the fact that int_es is composed of documents with a low level of relatedness (according to our findings) will affect the ranking task.

5.2 Measuring DSM Performance

The second part of this experiment aims at assessing how the DSMs perform the task of filtering out documents with a low level of relatedness. To do that, we injected different sets of out-of-domain documents, randomly selected from the Europarl corpus, into the original INTELITERM subcorpora. More precisely, we injected 5%, 10%, 15% and 20%[9] of noise into the various subcorpora. As we can see in Figure 2, the more noisy documents are injected, the lower the NCT. Then, the methodology described in Section 4 was applied to these twelve "new" subcorpora (int_en05, int_en10, ..., int_it15 and int_it20; see Figure 2). As a result, at this point we have the documents ranked in descending order according to their DSM scores.

[Figure 2: Average scores between documents when injecting 5%, 10%, 15% and 20% of noise into the various subcorpora (int_en05, ..., int_it20).]

In order to evaluate the DSMs' precision, we analysed the first n positions in the ranking lists produced by each of the three DSMs individually, where n is the number of original documents in the given INTELITERM subcorpus. Table 3 presents the precision values obtained by the DSMs when injecting different amounts of noise into the various original subcorpora.

    SubC     Noise   NCT    SCC    χ²
    int_en   5%      0.89   0.22   1.00
             10%     0.73   0.33   1.00
             15%     0.73   0.36   0.95
             20%     0.80   0.37   0.90
    int_es   5%      0.00   0.00   0.38
             10%     0.07   0.07   0.20
             15%     0.09   0.09   0.17
             20%     0.14   0.18   0.23
    int_it   5%      0.88   0.13   0.88
             10%     0.82   0.06   0.82
             15%     0.74   0.09   0.83
             20%     0.73   0.13   0.87

Table 3: DSMs' precision when injecting different amounts of noise into the various subcorpora.

As expected, none of the DSMs achieved acceptable results for Spanish, all being incapable of correctly identifying the noisy documents. However, we need to be aware that this happened due to the pre-existing low level of relatedness between the original documents in the int_es subcorpus (see Section 5.1 for more details). On the other hand, the DSMs show promising results for English and Italian. By way of example, χ² was capable of reaching 100% precision when 5% and 10% of noise was injected into the int_en subcorpus, and still 90% when 20% was injected.

[9] The number of documents corresponding to these percentages can be inferred from Table 1.
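The precision figure behind Table 3 can be reproduced in outline: after ranking a noisy subcorpus by final score, take the top n positions (n = number of original documents) and count how many of them are in-domain. A minimal sketch, with names of our own choosing:

```python
def precision_at_n(ranking, originals):
    """Fraction of the top-n ranked documents that belong to the original
    (in-domain) set, where n is the number of original documents.
    `ranking` is a list of (name, score) pairs in descending score order."""
    n = len(originals)
    top = [name for name, _ in ranking[:n]]
    return sum(1 for name in top if name in originals) / n
```

A perfect filter pushes every injected Europarl document below position n, yielding a precision of 1.0; each noisy document that climbs into the top n costs 1/n.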
Although the NCT obtained lower precision in general when compared with χ², it still reached 80% and 73% when 20% of noise was injected into the English and Italian subcorpora, respectively. From the evidence shown in Table 3, we can say that the NCT and χ² are suitable for the task of filtering out documents with a low level of relatedness with a high degree of precision. The same cannot be said of the SCC measure, at least for this specific task.

6 Conclusions and Future Work

In this paper we presented a simple methodology and studied various Distributional Similarity Measures (DSMs) for the purpose of measuring the relatedness between documents in specialised comparable corpora. As input for these DSMs, we used three different input features (lists of common tokens, lemmas and stems). We conclude that, for the data at hand, these features had similar performance. In fact, our findings show that instead of using common lemmas or stems, which require external libraries, processing power and time, a simple list of common tokens was enough to describe our data. Moreover, we showed that it is possible to assess and describe comparable corpora through statistical methods. The number of entities shared by the documents and the average scores obtained with the SCC and χ² measures proved to be an important surgical toolbox with which to dissect and microscopically analyse comparable corpora. Furthermore, these DSMs can be seen as a suitable tool for ranking documents by their similarity – a handy feature for those who manually or semi-automatically compile corpora mined from the Internet and want to retrieve the most similar documents and filter out those with a low level of relatedness. Our findings show promising results when filtering out noisy documents: indeed, two of the measures obtained very high precision, even when dealing with 20% of noise.

In the future, we intend not only to perform more experiments with these DSMs on other corpora and languages, but also to test other DSMs, such as Jaccard or Cosine, and compare their performance.

Acknowledgements

Hernani Costa is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no 317471. The research reported in this work has also been partially carried out in the framework of the Educational Innovation Project TRADICOR (PIE 13-054, 2014-2015); the R&D project INTELITERM (ref. no FFI2012-38881, 2012-2015); the R&D Project for Excellence TERMITUR (ref. no HUM2754, 2014-2017); and the LATEST project (ref. 327197-FP7-PEOPLE-2012-IEF).

References

Laurence Anthony. 2014. AntConc (Version 3.4.3) [Macintosh OS X]. Waseda University, Tokyo, Japan. Available from http://www.laurenceanthony.net.

Douglas Biber. 1988. Variation across Speech and Writing. Cambridge University Press, Cambridge, UK.

Gloria Corpas Pastor and Míriam Seghiri. 2009. Virtual Corpora as Documentation Resources: Translating Travel Insurance Documents (English-Spanish). In A. Beeby, P. R. Inés, and P. Sánchez-Gijón, editors, Corpus Use and Translating: Corpus Use for Learning to Translate and Learning Corpus Use to Translate, Benjamins Translation Library, chapter 5, pages 75–107. John Benjamins Publishing Company.

Gloria Corpas Pastor. 2001. Compilación de un corpus ad hoc para la enseñanza de la traducción inversa especializada. TRANS, Revista de Traductología, 5(1):155–184.

Hernani Costa, Hugo Gonçalo Oliveira, and Paulo Gomes. 2010. The Impact of Distributional Metrics in the Quality of Relational Triples. In 19th European Conf. on Artificial Intelligence, Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, ECAI'10, pages 23–29, Lisbon, Portugal, August.

Hernani Costa, Hugo Gonçalo Oliveira, and Paulo Gomes. 2011. Using the Web to Validate Lexico-Semantic Relations. In 15th Portuguese Conf. on Artificial Intelligence, volume 7026 of EPIA'11, pages 597–609, Lisbon, Portugal, October. Springer.

Hernani Costa, Hanna Béchara, Shiva Taslimipoor, Rohit Gupta, Constantin Orasan, Gloria Corpas Pastor, and Ruslan Mitkov. 2015. MiniExperts: An SVM Approach for Measuring Semantic Textual Similarity. In 9th Int. Workshop on Semantic Evaluation, SemEval'15, pages 96–101, Denver, Colorado, June. ACL.

Hernani Costa. 2010. Automatic Extraction and Validation of Lexical Ontologies from Text. Master's thesis, University of Coimbra, Faculty of Sciences and Technology, Department of Informatics Engineering, Coimbra, Portugal, September.

Hernani Costa. 2015. Assessing Comparable Corpora through Distributional Similarity Measures. In EXPERT Scientific and Technological Workshop, pages 23–32, Malaga, Spain, June.

EAGLES. 1996. Preliminary Recommendations on Corpus Typology. Technical report, EAGLES Document EAG-TCWG-CTYP/P, May. http://www.ilc.cnr.it/EAGLES96/corpustyp/corpustyp.html.

Zelig Harris. 1970. Distributional Structure. In Papers in Structural and Transformational Linguistics, pages 775–794. D. Reidel Publishing Company, Dordrecht, Holland.

Oktay Ibrahimov, Ishwar Sethi, and Nevenka Dimitrova. 2002. The Performance Analysis of a Chi-square Similarity Measure for Topic Related Clustering of Noisy Transcripts. In 16th Int. Conf. on Pattern Recognition, volume 4, pages 285–288. IEEE Computer Society.

Adam Kilgarriff. 2001. Comparing Corpora. Int. Journal of Corpus Linguistics, 6(1):97–133.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In MT Summit.

Paul Rayson, Geoffrey Leech, and Mary Hodges. 1997. Social Differentiation in the Use of English Vocabulary: Some Analyses of the Conversational Component of the British National Corpus. Int. Journal of Corpus Linguistics, 2(1):133–152.

Gerard Salton and Christopher Buckley. 1988. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, 24(5):513–523.

Helmut Schmid. 1995. Improvements in Part-of-Speech Tagging with an Application to German. In ACL SIGDAT-Workshop, pages 47–50, Dublin, Ireland.

Amit Singhal. 2001. Modern Information Retrieval: A Brief Overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35–42.