          A Topic-Based Approach to Multiple Corpus
                         Comparison

                     Jinghui Lu, Maeve Henchion, Brian Mac Namee

             Insight Centre for Data Analytics, University College Dublin, Ireland
           Teagasc, the Agriculture and Food Development Authority, Dublin, Ireland
         Jinghui.Lu@ucdconnect.ie, Maeve.Henchion@teagasc.ie,
                                Brian.MacNamee@ucd.ie



        Abstract. Corpus comparison techniques are often used to compare different
        types of online media, for example social media posts and news articles. Most
        corpus comparison algorithms operate at a word level and present their results
        as lists of individual discriminating words, which makes identifying larger un-
        derlying differences between corpora challenging. Most corpus comparison tech-
        niques also work on pairs of corpora and do not easily extend to multiple corpora.
        To counter these issues, we introduce Multi-corpus Topic-based Corpus Compar-
        ison (MTCC), a corpus comparison approach that works at a topic level and that
        can compare multiple corpora at once. Experiments on multiple real-world
        datasets are carried out to demonstrate the effectiveness of MTCC and to compare
        the usefulness of different statistical discrimination metrics: the χ2 and Jensen-
        Shannon divergence metrics are shown to work well. Finally, we demonstrate the
        usefulness of reporting corpus comparison results via topics rather than individual
        words. Overall we show that the topic-level MTCC approach can capture the
        differences between multiple corpora, and present the results in a more meaningful
        and interpretable way than approaches that operate at a word level.

        Keywords: Corpus Comparison, Topic Modelling, Jensen-Shannon Divergence


1    Introduction

Many corpus comparison techniques have been proposed in the literature to reveal the
divergence between corpora [8, 16], especially corpora of web-based content such as online
news, social media posts, and blog posts [5, 10]. Although these approaches have been
shown to be effective, almost all of them are limited to comparing corpora at a word
level. Consequently, the results are communicated to a user as a list of unrelated words
that are divergent across two corpora, which makes identifying larger underlying differ-
ences between the corpora challenging. Additionally, these studies focus on comparing
pairs of corpora instead of multiple corpora.
    In recent years, topic modelling [2] has become a widely used method for revealing
the thematic information in a collection of documents, where each theme is described
by a series of highly related words called topic descriptors. We believe the combination
of topic modelling techniques and corpus comparison methods has the potential to elim-
inate the problems described above that arise with word-based corpus comparison ap-
proaches.




In other words, an approach that uses topic modelling as a basis for corpus comparison
can present users with divergence that is explained at a topic level rather than at an
individual word level.
    This paper describes the Multi-corpus Topic-based Corpus Comparison (MTCC)
approach to corpus comparison, which leverages topic modelling and statistical discrim-
ination metrics to conduct a topic-based comparison. We describe the approach and
demonstrate it on 8 real-world datasets. In a series of experiments we demonstrate that
the topics extracted by our models contain divergence information, and we compare the
effectiveness of different statistical discrimination metrics applied in the algorithm. We
also compare the output of MTCC with a word-based corpus comparison method
to show that the results output by MTCC are more meaningful and more interpretable
than those produced by the word-based method. The contributions of this paper are:

    – Multi-corpus Topic-based Corpus Comparison (MTCC), a new corpus comparison
      technique that leverages topic modelling and statistical discrimination metrics.
    – An experiment to show that the topics selected by the statistical discrimination
      metrics capture divergence.
    – An experiment that investigates the effectiveness of different statistical discrimina-
      tion metrics applied in MTCC.
    – A demonstration of the usefulness of topic-based divergence explanations over
      multiple real-world datasets.

     The rest of this paper is organized as follows: Section 2 presents related work; Sec-
tion 3 describes the MTCC approach; Section 4 describes experiments to measure the
divergent information contained in topics found and the effectiveness of different sta-
tistical divergence metrics; Section 5 compares the output of the MTCC approach with
a word-based method; and, finally, Section 6 summarizes the work and suggests future
directions.


2     Related Work

Corpus comparison approaches extract the distinct content from two corpora [16]. Typ-
ically, researchers compute the contribution of individual words to the divergence by
applying statistical discrimination metrics over word frequencies calculated from differ-
ent corpora, and the words that contribute most to the difference are selected and pre-
sented as the divergence [5, 8, 16]. It is therefore very intuitive to show the comparison
results using words.
     There are a variety of statistical discrimination metrics in the literature. For instance,
Leech and Fallon [11] use χ2 to measure the discriminative power of a word, while log-
likelihood [16] and relevance-frequency (RF) [9] have been utilized to extract divergent
information over corpora of different domains. In addition, Information Gain (IG), Gain
Ratio (GR) [17], Kullback-Leibler divergence [4], and Jensen-Shannon divergence (JSD)
[5, 13] have been widely employed in corpus comparison.
     Latent Dirichlet Allocation (LDA) [2], a widely used strategy for extracting topic in-
formation from texts, can automatically infer the distribution of membership to a set of
topics in a large collection of documents. It has been shown to be highly effective at
finding latent topics and clustering documents [1, 6].
     As far as we know, Zhao et al. [19] were the first to use topic modelling, specifically
LDA, in conjunction with statistical discrimination to find topics specific to a corpus in
a pair of corpora being compared, and to use these to explain the differences between
the corpora. Zhao et al. first performed independent topic modelling on the corpora
being compared, and then applied Jensen-Shannon divergence (JSD) [12] over the topic
descriptors to measure the pair-wise similarities between the sets of topics found in
the two corpora. If the similarity of the nearest match to a topic in one corpus to a
topic in the other corpus was below a specific threshold then that topic was said to be
discriminatory. The set of discriminatory topics was then used to explain the differences
between the two corpora. Zhao et al. demonstrated this approach by comparing corpora
from Twitter and the New York Times. Similarly, Murdock and Allen [14] and Sievert
and Shirley [18] used JSD applied to the distributions of word frequencies in topics to
measure the distance between topics, but this was not done in a corpus comparison
scenario.
     Zhao et al.’s method trains an independent LDA model for each corpus, which in-
creases the instability of the whole system due to the non-deterministic nature of LDA.
Furthermore, the output relies heavily on the setting of thresholds: small thresholds tend
to return too many similar topics as divergent, while large thresholds lose some discrim-
inating topics. The MTCC approach proposed in this paper differs from the work of
Zhao et al. in two key ways. First, in MTCC a global topic model is trained across
a combined corpus that contains all documents from all corpora being compared.
Second, because a single topic model is built, JSD can be applied directly to topic
membership vectors rather than to word distributions between topics. Compared to
matching similar topics, MTCC skips the empirical setting of thresholds, further auto-
mating the comparison process, and, since one global topic model is trained, the topic
proportion of each corpus can be inferred, on which multiple corpus comparison can
be based.


[Figure 1: corpora P1, P2, ..., Pn are combined into a single corpus; candidate LDA
models are trained and one is chosen by model selection; a discrimination metric (JSD,
χ2, etc.) is applied to the document-topic matrix θ and aggregated into a divergence
score div(t) per topic, and the highest-scoring topics are selected.]
Fig. 1. An overview of the MTCC approach. θ is a document-topic matrix where each row denotes
a topical representation of a document. The vectors shaded in different colors denote the topical
representations of documents from different corpora. β is a topic-term matrix where each row
represents the word distribution of the corresponding topic. The rows shaded in grey in β indicate
the discriminating topics selected to be presented.
3     The Multi-corpus Topic-based Corpus Comparison (MTCC)
      Approach
In this section, we will first provide a brief description of the use of LDA for topic
modelling, then describe the Multi-corpus Topic-based Corpus Comparison (MTCC)
approach in detail.

3.1   Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) [2] was proposed to infer the topic distribution in a
collection of documents. The model generates a document-topic matrix θ and a topic-
term matrix β (see Figure 1). Specifically, each row of the document-topic matrix is a
topic-based representation of a document, where the ith entry determines the degree of
association between the ith topic and the document. Each row of the topic-term matrix
represents the word distribution of the corresponding topic. Usually, the most probable
words in a topic are chosen to be presented as its topic descriptors. We use the LDA
implementation in the Python Gensim package 1 .
     Properly setting the number of latent topics to be found plays a vital role in the
performance of LDA, as well as of other topic modelling algorithms. There are many
approaches in the literature for choosing an appropriate number of latent topics, k.
O’Callaghan et al. [15] proposed a topic coherence metric based on word embeddings
that reflects the semantic relatedness of topic descriptors, and it has previously been
used in [1] to determine the number of topics to be found. We adopt this method; in
the experiments, we use word embeddings built across our own test corpora using the
FastText algorithm [3].2
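     To make this model-selection step concrete, below is a minimal sketch of how it
might be implemented with Gensim, assuming docs holds the pre-processed token lists
of the combined corpus. The helper names, the candidate values of k, and the embedding
dimensionality are illustrative rather than taken from the paper; the random seed and
α = “auto” follow the configuration in Section 4.3.

    from itertools import combinations
    from gensim.corpora import Dictionary
    from gensim.models import FastText, LdaModel

    def tc_w2v(lda, keyed_vectors, top_n=10):
        # Mean pairwise embedding similarity of each topic's descriptors,
        # averaged over all topics (the coherence metric of [15]).
        topic_scores = []
        for t in range(lda.num_topics):
            words = [w for w, _ in lda.show_topic(t, topn=top_n)
                     if w in keyed_vectors]
            pairs = list(combinations(words, 2))
            if pairs:
                topic_scores.append(sum(keyed_vectors.similarity(a, b)
                                        for a, b in pairs) / len(pairs))
        return sum(topic_scores) / len(topic_scores)

    def select_topic_model(docs, candidate_ks=(100, 150, 200, 250, 300, 350)):
        dictionary = Dictionary(docs)
        bow = [dictionary.doc2bow(d) for d in docs]
        # Embeddings built across the test corpora themselves, as in Section 3.1.
        wv = FastText(sentences=docs, vector_size=100, min_count=1).wv
        # Train one LDA model per candidate k and keep the most coherent one.
        models = [LdaModel(bow, num_topics=k, id2word=dictionary,
                           alpha="auto", random_state=1984)
                  for k in candidate_ks]
        return max(models, key=lambda lda: tc_w2v(lda, wv)), dictionary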
3.2 Inferring Topic Proportions for Corpora
Figure 1 shows an overview of applying the MTCC approach to compare multiple
corpora. All corpora are first pre-processed, including tokenisation, conversion to lower
case, and lemmatisation using the Python NLTK package 3 .
    At the outset, all corpora are combined into a single corpus. A set of topic models
is trained over this combined corpus with different values of k, the number of latent
topics to be found. The topic coherence score for each topic model is calculated follow-
ing [15], and the topic model with the highest average topic coherence score is selected
as the input for the next stage. The chosen topic model produces two matrices,
θ ∈ R+^(d×k) and β ∈ R+^(k×m), where m is the number of unique terms in the
vocabulary, k is the number of topics and d is the number of documents in the combined
corpus.
    Assume that the first p documents in θ come from the text set Pi, one of the original
corpora composing the combined corpus. The membership of the tth topic to the indi-
vidual corpus Pi, mem_t(Pi), is given by:

                    mem_t(Pi) = (1/p) Σ_{j=1..p} θ[j, t]                        (1)
 1
   https://radimrehurek.com/gensim/models/ldamodel.html
 2
   https://radimrehurek.com/gensim/models/fasttext.html
 3
   https://www.nltk.org/api/nltk.html
where θ[j, t] denotes the proportion of the tth topic in the jth document in the combined
corpus and p is the number of documents from corpus Pi .
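
    As a small illustrative sketch (the variable names are ours, not the paper's), Equa-
tion 1 amounts to averaging, per topic, the rows of θ that belong to a given corpus:

    import numpy as np

    # theta: (d x k) document-topic matrix from the chosen topic model;
    # corpus_rows: maps each corpus name to the row indices of its documents.
    def topic_memberships(theta: np.ndarray, corpus_rows: dict) -> dict:
        # Equation 1: mem_t(P_i) is the mean proportion of topic t
        # over the documents of corpus P_i; each value has shape (k,).
        return {name: theta[rows].mean(axis=0)
                for name, rows in corpus_rows.items()}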


3.3     Strategies for Comparing Multiple Corpora

Since most of the corpus comparison approaches listed in Section 2 are focused on com-
paring pairs of corpora, we adopt a one-versus-all strategy in order to conduct a multiple
corpus comparison.
    The algorithm is quite simple: a specific corpus Pi and a mixture corpus, formed by
concatenating all corpora except corpus Pi , are considered the two corpora being com-
pared. Hence, the proportion of any topic for corpus Pi and for the mixture corpus can
be given by Equation 1, and the statistical metrics are applied to these proportions. We
can then derive the divergence score of the tth topic in terms of corpus Pi following
[5, 9, 11, 17], which is denoted by d_t(Pi ) in this paper.
    However, in order to rank the discrimination power of each topic, t, across the full
set of corpora being compared, a single divergence score per topic is required, necessi-
tating an aggregation strategy. We define three aggregation strategies as follows:

 – sum: div_t(P1 || P2 || ... || Pn) = Σ_{i=1..n} d_t(Pi)
 – weighted sum: div_t(P1 || P2 || ... || Pn) = Σ_{i=1..n} π(Pi) d_t(Pi)
 – maximum: div_t(P1 || P2 || ... || Pn) = max_{i=1..n} d_t(Pi)

where n is the number of corpora being compared, π(Pi) is the proportion of corpus
Pi in the combined corpus, and div_t(P1 || P2 || ... || Pn) denotes the global divergence
score of topic t.
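
    The following sketch implements the three strategies, assuming d is an (n × k) array
with d[i, t] = d_t(Pi) from the one-versus-all comparisons and pi holds the corpus
proportions π(Pi); the names are illustrative:

    import numpy as np

    def aggregate(d: np.ndarray, pi: np.ndarray,
                  strategy: str = "maximum") -> np.ndarray:
        # d: (n x k) per-corpus divergence scores; returns one global
        # divergence score per topic, shape (k,).
        if strategy == "sum":
            return d.sum(axis=0)
        if strategy == "weighted_sum":
            return (pi[:, None] * d).sum(axis=0)  # weight by corpus proportion
        if strategy == "maximum":
            return d.max(axis=0)
        raise ValueError(f"unknown strategy: {strategy}")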
    There is a modification of the Jensen-Shannon divergence metric, extended JSD [13],
that can be used directly across multiple corpora at a word level without the need for
the one-versus-all approach. In extended JSD, the divergence score for the tth word over
multiple corpora P1 , P2 , ..., Pn is defined as:

        div_JS,t(P1 || P2 || ... || Pn) = −m_t log m_t + (1/n) Σ_{i=1..n} p_it log p_it        (2)

where p_it is the probability of seeing word t in corpus Pi , and m_t is the probability of
seeing word t in M . Here, M is a mixture distribution of the n corpora, where
M = (1/n) Σ_{i=1..n} Pi . In extended JSD, the global divergence score of the tth topic
can be derived from Equation 2 by replacing p_it with mem_t(Pi ) calculated by
Equation 1.
    Subsequently, the topics can be ranked by their global divergence score in descend-
ing order, and the top-ranked topics, along with their topic descriptors, can be selected
to represent the differences between the multiple corpora.
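    As a sketch, Equation 2 applied to topic memberships, followed by the ranking step,
might look as follows, where mem is an (n × k) array whose ith row is the topic mem-
bership vector of corpus Pi (e.g. stacked from the Section 3.2 sketch); the small epsilon
guarding log(0) is our own addition:

    import numpy as np

    def extended_jsd(mem: np.ndarray, eps: float = 1e-12) -> np.ndarray:
        # Equation 2 with p_it replaced by mem_t(P_i); m is the mixture M.
        m = mem.mean(axis=0)
        return -m * np.log(m + eps) + (mem * np.log(mem + eps)).mean(axis=0)

    # Rank topics by global divergence score, most discriminative first:
    # ranking = np.argsort(extended_jsd(mem))[::-1]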
    A GitHub repository containing the code to implement the MTCC approach and all
experiments described in the following sections is publicly available.4
 4
     https://github.com/GeorgeLuImmortal/topic-based corpus comparison
4     Comparing Topic-based Discrimination Metrics

In this set of experiments we compare the ability of different statistical discrimination
metrics to identify discriminative topics across corpora. Corpus comparison is an unsu-
pervised procedure, which makes this evaluation somewhat challenging. To overcome
this, following [17], we reframe the corpus comparison evaluation as a document classi-
fication task. This section describes the experimental approach and the datasets used,
and discusses the results of these experiments.


4.1   Datasets

Our experiments are carried out on 8 real-world datasets described in [6]: 6 news article
datasets (bbc, bbc-sport, guardian-2013, irishtimes-2013, nytimes-1999, nytimes-2003)
and 2 Wikipedia datasets (wikipedia-high and wikipedia-low). Each dataset is divided
into sections; for example, bbc includes the sections business, politics, entertainment,
sports and tech. Table 1 shows the total number of documents, the vocabulary size, the
total number of terms, the number of sections, and the chosen value of k for each dataset.
We divide each dataset into corpora following these sections.
Table 1. Summary statistics of the datasets used in the experiments. The rightmost column is the
number of latent topics to be found for each dataset.

          Dataset           No. Docs   Vocab. Size   No. Terms No. Sections Best k
          bbc                  2,225        3,125      484,600              5    150
          bbc-sport              737          969      138,699              5    150
          guardian-2013        5,414       10,801    2,349,911              6    300
          irishtimes-2013      3,093        4,832      916,592              6    150
          nytimes-1999         9,551       12,987    3,595,075              4    150
          nytimes-2003         5,283       15,001    4,426,313              5    100
          wikipedia-high       5,738       17,311    6,638,780              6    350
          wikipedia-low        4,986       15,441    5,934,766             10    150


4.2   Measuring the Discriminativeness of Topics Found

Following the approach used in [17], we base our evaluation on the assumption that dis-
criminative topics should be useful features for classifying documents as belonging to
different corpora. Thus, we reframe the evaluation of corpus comparison as a document
classification task.
    The procedure for estimating the discriminativeness of a set of topics found by
MTCC is as follows:

 1. run MTCC and extract the top n most discriminative topics.
 2. construct vector representations of documents based on the selected n topics.
 3. build a classification model using the vector representations and measure its perfor-
    mance:
    (a) search for a set of optimal hyper-parameters for the classifier using 10-fold
        cross-validation.
    (b) run another 10-fold cross-validation, shuffled by a different random seed, using
        the obtained optimal hyper-parameters, and report the result of this second
        cross-validation based on the micro-averaged F1 score [7].
The performance on the second 10-fold cross-validation is used to assess the discrimina-
tiveness of the n topics.
    We repeat this procedure for values of n ∈ [k/10, k], where k is the number of topics
used within MTCC and n increases by k/10 each time. In other words, given a statistical
discrimination metric and a dataset, we run the above procedure 10 times. We should
also note that we rank the topics according to their divergence scores from the MTCC
model in descending order, which means the first n topics are the most distinctive topics
whereas the remaining topics are less informative. We also use a baseline in which
n topics are selected randomly.
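    A minimal sketch of one run of this procedure with scikit-learn, assuming theta, the
document labels, and the topic ranking are available, is given below; the hyper-parameter
grid is illustrative, while the classifier, the micro-averaged F1 scoring and the two random
seeds (2018 and 0) follow the configuration in Section 4.3:

    from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                         cross_val_score)
    from sklearn.svm import LinearSVC

    def score_top_n_topics(theta, labels, ranking, n):
        X = theta[:, ranking[:n]]              # step 2: top-n topics as features
        # Step 3(a): hyper-parameter search with 10-fold cross-validation.
        search = GridSearchCV(
            LinearSVC(), {"C": [0.01, 0.1, 1, 10]},
            cv=StratifiedKFold(10, shuffle=True, random_state=2018),
            scoring="f1_micro").fit(X, labels)
        # Step 3(b): a second 10-fold cross-validation with a different seed.
        second_cv = StratifiedKFold(10, shuffle=True, random_state=0)
        return cross_val_score(LinearSVC(**search.best_params_), X, labels,
                               cv=second_cv, scoring="f1_micro").mean()

    # scores = [score_top_n_topics(theta, labels, ranking, n)
    #           for n in range(k // 10, k + 1, k // 10)]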

4.3     Experimental Configuration
We compare the performance of MTCC models using the different statistical discrim-
ination metrics listed in Section 2, i.e. IG, GR, χ2 , RF, JSD and extended JSD (ext-
JSD), across the 8 real-world datasets. For each dataset, we treat each section as an
individual corpus (for instance, the bbc dataset has 5 corpora), making this a multiple
corpus comparison task. After extensive preliminary experiments, we choose to use a
Linear-SVM classifier [7] when measuring performance.5 Also, the maximum aggrega-
tion strategy (see Section 3.3) is adopted, as it was shown to perform better than the
other strategies in preliminary experiments. For RF, we set the threshold to 0.01. The
baseline classifiers are trained with document representations based on n arbitrarily
chosen topics. To reduce the effect of randomness, for a given n, we run the baseline
method 10 times with different random seeds and report the average micro-averaged
F1 score.
    We set the random seed for the LDA model to 1984; α = “auto”, which automati-
cally tunes the hyperparameter α;6 and the random seeds for the two rounds of cross-
validation to 2018 and 0 respectively, to make sure the results are consistent and inde-
pendent of random initial states. The best choice of k found for each dataset using the
approach described in Section 3.1 is also reported in Table 1.

4.4     Results
We present the document classification results of the MTCC models with respect to the
different statistical discrimination metrics on each dataset in Figure 2. The x-axis denotes
n, the percentage of total topics used for training the classifier. The y-axis represents
performance, measured here as the micro-averaged F1 score.
    We can observe that, in almost all situations, JSD, extended JSD or χ2 achieves
high micro-averaged F1 scores that outperform the baseline (random selection) by a
large margin.
 5
     http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
 6
     https://rare-technologies.com/python-lda-in-gensim-christmas-edition/
Fig. 2. Micro-averaged F1 scores of the different statistical discrimination metrics for the 8
datasets; the horizontal axis denotes the percentage of total topics used for training the classifier.
[Panels: (a) bbc, (b) bbc-sport, (c) guardian-2013, (d) irishtimes-2013, (e) nytimes-1999,
(f) nytimes-2003, (g) wikipedia-high, (h) wikipedia-low. Legend: EXT_JSD, JSD, CHI,
GR, RF, IG, BASELINE.]



This demonstrates that the discrimination metrics can identify the topics that contain
divergence information. We can also see from Figure 2 that the best results are achieved
by JSD, extended JSD or χ2 across all 8 datasets at very low values of n, the percentage
of the total topics used for training (usually ≤ 30). Moreover, at these low numbers of
topics the performance of the other metrics and of the random baseline is very low. For
example, in the guardian-2013 (Figure 2 (c)) and wikipedia-high (Figure 2 (g)) datasets,
classifiers nearly reach their best performance (above 0.8) using only the top 10% of
topics, while the performance of the baseline is only a little higher than 0.4. This implies
that the top 10% of topics selected by JSD, extended JSD and χ2 carry almost all of the
divergence information in these two datasets. Similarly, as we can see for bbc-sport
(Figure 2 (b)), the classifiers based on topics selected using JSD, extended JSD and χ2
achieve the best results at n = 20, and these results even surpass the result where all
topics are used (n = 100). This indicates that the top 20% of topics are very good at
distinguishing between corpora, while the remaining topics not only contain no diver-
gence information but even add noise to the classification.

    It is also interesting to note that JSD, extended JSD and χ2 result in almost the same
performance across all datasets. These three metrics are reasonably similar to each other
so this is not too surprising. It is interesting to note, however, that although all three will
usually select the same topics at the very highest ranks, they do select different topics
further down the ordering.

    To conclude, JSD, extended JSD and χ2 can effectively select the discriminative
topics and outperform the random baseline and other metrics by a large margin.
5     Demonstrating Topic-based Corpus Comparison

Since we are interested in whether the results of corpus comparison at a topic level are
more meaningful and interpretable, in this section we compare the output of a corpus
comparison based on the topic-level MTCC approach to that of a word-level approach
over 4 real-world datasets. We describe the datasets used and analyse the results pro-
duced by each approach.


5.1   Setup

In this demonstration we perform a corpus comparison across the bbc, guardian-2013,
irishtimes-2013 and nytimes-2003 datasets described in [1]. Each dataset is treated as
one corpus, which differs from the setting in Section 4, and extended JSD is selected as
the discrimination metric given its effectiveness shown in Section 4. We therefore con-
duct a comparison over 4 corpora at one time. These four datasets all consist of news
articles, and there is both overlap and difference in the sections present in them (sum-
marised in Table 2). This makes these four corpora an interesting test case for corpus
comparison approaches, as there are differences that we can expect a corpus comparison
approach to discover, for example sections in one corpus that are not in the others (e.g.
the music section in guardian-2013).
    Extended JSD, which is commonly used for extracting keywords from different cor-
pora and has proven effective in [5, 13], is adopted as the word-level corpus comparison
method. The contributions to divergence of individual words across all corpora are com-
puted by Equation 2.
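    As a sketch, the word-level scores can be obtained by normalising per-corpus word
counts into distributions and reusing the extended_jsd function from the Section 3.3
sketch; counts is an illustrative name, and how the scored words are then attributed to
individual corpora (as in Table 4) is not detailed here:

    import numpy as np

    def word_level_scores(counts: np.ndarray) -> np.ndarray:
        # counts: (n x m) word-frequency matrix, one row per corpus.
        # Normalise each row into a word distribution p_it, then apply
        # extended JSD (Equation 2) via the Section 3.3 sketch.
        dists = counts / counts.sum(axis=1, keepdims=True)
        return extended_jsd(dists)

    # e.g. the 30 highest-scoring words:
    # top_words = np.argsort(word_level_scores(counts))[::-1][:30]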
    We set the random seed for the LDA model to 1984 and α = “auto”, following the
previous experiments; candidate values of k, varying from 100 to 300 in steps of 10,
were tested in preliminary experiments. Based on these, k = 300 is the optimal value
according to the topic coherence measure described in Section 3.1.


Table 2. The number of documents in each section of the four corpora. Potentially comparable
sections are aligned and the potentially distinct sections are highlighted. Entertainment is abbre-
viated to entmt.

             Sections bbc      Sections guardian    Sections irishtimes   Sections nytimes
                                books       1,107
             business   510     business    1,292    economy        364    business     1,024
                                                     crime-law      699
                                                                           education      942
                                fashion       816
                                                     health         273    health       1,046
                                                                           movies       1,247
             entmt      386     music       1,403
             politics   417     politics      844    politics       620
             sport      511     football    1,059    soccer         655    sports       1,024
             tech       401
                                                     rugby          482
5.2   Results

Table 3 shows the top 3 discriminative topics for each dataset found by MTCC (using
the extended JSD metric, given its high performance reported in Section 4). The number
in front of each entry denotes the index of the topic, followed by our guess at the mean-
ing of the topic in square brackets and its 10 topic descriptors.


Table 3. The three most discriminative topics extracted from each corpus. The number in front
of the text denotes the index of a topic, followed by the 10 most common words describing the
topic. Guesses of the meaning of each topic are shown in square brackets.

                          bbc                                                 guardian
(289) [tech] people new technology also mobile         (42) [music] music song band street night
      would could make say many                             album sound love art young

(114) [sport] cup england world think like             (25) [fashion] fashion woman designer wear dress
      dont want know play football                          collection style clothes brand men

(31) [entmt] award best oscar nomination prize         (236) [books] book novel story read author
      actor actress category academy ceremony                writer writing reader fiction reading


                      irishtimes                                              nytimes
(109) [crime-law] court garda judge case justice       (299) [health] health study plan bill government
      man criminal charge law prison                         issue change official problem system

(146) [politics] european minister government exit euro (13) [education] school student education parent high
      bailout decision country finance programme             district teacher child city program

(275) [rugby] rugby australia zealand black leinster   (102) [movies] film movie director directed minute
      coach schmidt test saturday odriscoll                  character life picture story actor



    The topics extracted using MTCC can be compared to the most discriminative words
for each dataset found by the word-based JSD approach which are shown in Table 4.
    The first thing that is apparent from comparing Tables 3 and 4 is that the topic-
level approach has the advantage of presenting the user with coherent sets of terms in
the topic descriptors rather than a long list of disconnected words. This makes it much
easier for the user to understand the differences between different corpora.
    Looking more deeply at Table 3, we can see that the MTCC approach has success-
fully identified the divergent information in each dataset, as evidenced by the presence
of distinct topics. For example, the topic descriptors of topic 109 suggest that this is a
topic relating to articles describing court proceedings. We can see from this that MTCC
has identified the fact that the irishtimes-2013 corpus has a crime-law section not present
in the other corpora. Similarly, the presence of other discriminative topics (e.g. topics
289, 31 and 42) demonstrates the effectiveness of the algorithm in detecting the differ-
ences between multiple corpora.
    When we look at Table 4, we find that although the word-level corpus comparison
properly identifies the divergent information in some datasets, such as guardian-2013 and
Table 4. 30 most distinct words from a comparison for each corpus using word-level extended
JSD.

                  Rank bbc      Rank guardian     Rank irishtimes   Rank nytimes
                  1    would    1    game         1    cent         1    school
                  2    also     2    team         2    garda        2    film
                  3    people   3    book         3    per          3    student
                  4    new      4    fashion      4    think        4    center
                  5    could    5    business     5    player       5    movie

                  6    say      6    little       6    point        6    program
                  7    make     7    novel        7    court        7    theater
                  8    world    8    long         8    end          8    yesterday
                  9    get      9    side         9    million      9    life
                  10   take     10   today        10   minister     10   college

                  11   made     11   album        11   even         11   medicare
                  12   month    12   song         12   home         12   yankee
                  13   way      13   cameron      13   know         13   tonight
                  14   like     14   story        14   hse          14   university
                  15   back     15   thatcher     15   thing        15   manhattan

                  16   week     16   club         16   another      16   patient
                  17   next     17   bst          17   dont         17   never
                  18   well     18   miliband     18   state        18   including
                  19   three    19   premier      19   play         19   might
                  20   many     20   osborne      20   win          20   drug

                  21   good     21   band         21   hospital     21   street
                  22   day      22   love         22   health       22   tomorrow
                  23   may      23   bank         23   need         23   without
                  24   right    24   guardian     24   leinster     24   district
                  25   come     25   york         25   four         25   medical

                  26   work     26   manchester   26   seanad       26   teacher
                  27   want     27   collection   27   taoiseach    27   ticket
                  28   going    28   dress        28   rugby        28   doctor
                  29   still    29   writer       29   put          29   night
                  30   since    30   form         30   kenny        30   study


nytimes-2003, it fails to depict the distinctive content of the bbc corpus, where a set of
unrelated words is presented.
6   Conclusions
In this paper, we introduced the Multi-corpus Topic-based Corpus Comparison (MTCC)
approach to discover distinctive topics across multiple corpora for corpus comparison
tasks. We compared the performance of different discrimination metrics on 8 real-world
datasets. The results show that JSD, extended JSD, or χ2 can extract the topics that
contain the most divergence information. We also demonstrated MTCC and compared
its output to that of a word-level approach. Overall we believe that the example presented
demonstrates the advantages of using topics for corpus comparison rather than word-
level approaches.
    However, because we apply topic modelling to multiple corpora at once, we risk
losing or blurring topics compared to applying topic modelling to a single corpus (as
done by Zhao et al. [19]). Hence, a further exploration of whether we lose or blur topics
together in multi-corpus topic-based corpus comparison is planned as future work.
Acknowledgement. This research was kindly supported by a Teagasc Walsh Fellow-
ship award (2016053) and Science Foundation Ireland (12/RC/2289 P2).

References
 1. Belford, M., Mac Namee, B., Greene, D.: Stability of topic modeling via matrix factorization.
    Expert Systems with Applications 91, 159–169 (2018)
 2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning
    Research 3(Jan), 993–1022 (2003)
 3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword
    information. arXiv preprint arXiv:1607.04606 (2016)
 4. Degaetano-Ortlieb, S., Kermes, H., Khamis, A., Teich, E.: An information-theoretic ap-
    proach to modeling diachronic change in scientific English. Selected papers from Varieng–
    From data to evidence (d2e) (2016)
 5. Gallagher, R.J., Reagan, A.J., Danforth, C.M., Dodds, P.S.: Divergent discourse between
    protests and counter-protests: #BlackLivesMatter and #AllLivesMatter. PLoS ONE 13(4),
    e0195644 (2018)
 6. Greene, D., O’Callaghan, D., Cunningham, P.: How many topics? stability analysis for topic
    models. In: Joint European Conference on Machine Learning and Knowledge Discovery in
    Databases. pp. 498–513. Springer (2014)
 7. Kelleher, J.D., Mac Namee, B., D’Arcy, A.: Fundamentals of machine learning for predictive
    data analytics: algorithms, worked examples, and case studies. MIT Press (2015)
 8. Kilgarriff, A.: Comparing word frequencies across corpora: Why chi-square doesn’t work,
    and an improved lob-brown comparison. In: ALLC-ACH Conference (1996)
 9. Lan, M., Tan, C.L., Low, H.B.: Proposing a new term weighting scheme for text categoriza-
    tion. In: AAAI. vol. 6, pp. 763–768 (2006)
10. Lawrence, E., Sides, J., Farrell, H.: Self-segregation or deliberation? blog readership, partic-
    ipation, and polarization in american politics. Perspectives on Politics 8(1), 141–157 (2010)
11. Leech, G., Fallon, R.: Computer corpora–what do they tell us about culture. ICAME journal
    16 (1992)
12. Lin, J.: Divergence measures based on the Shannon entropy. IEEE Transactions on Infor-
    mation Theory 37(1), 145–151 (1991)
13. Lu, J., Henchion, M., Mac Namee, B.: Extending Jensen-Shannon divergence to compare
    multiple corpora. In: 25th Irish Conference on Artificial Intelligence and Cognitive Science,
    Dublin, Ireland, 7-8 December 2017. CEUR-WS.org (2017)
14. Murdock, J., Allen, C.: Visualization techniques for topic model checking. In: Twenty-Ninth
    AAAI Conference on Artificial Intelligence (2015)
15. O’Callaghan, D., Greene, D., Carthy, J., Cunningham, P.: An analysis of the coherence of
    descriptors in topic modeling. Expert Systems with Applications 42(13), 5645–5657 (2015)
16. Rayson, P., Garside, R.: Comparing corpora using frequency profiling. In: Proceedings of
    the workshop on Comparing corpora-Volume 9. pp. 1–6. Association for Computational Lin-
    guistics (2000)
17. Sajgalik, M., Barla, M., Bielikova, M.: Searching for discriminative words in multidimen-
    sional continuous feature space. Computer Speech & Language 53, 276–301 (2019)
18. Sievert, C., Shirley, K.: LDAvis: A method for visualizing and interpreting topics. In: Pro-
    ceedings of the workshop on interactive language learning, visualization, and interfaces. pp.
    63–70 (2014)
19. Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.P., Yan, H., Li, X.: Comparing twitter and
    traditional media using topic models. In: European conference on information retrieval. pp.
    338–349. Springer (2011)