<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Discovering Coherent Topics from Urdu Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mubashar Mustafa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Feng Zeng</string-name>
          <email>fengzeng@csu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hussain Ghulam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenjia Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Central South University</institution>
          ,
          <addr-line>Changsha</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>New York Institute of Technology</institution>
          ,
          <addr-line>New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Topic modeling (TM), the detection of themes or aspects in documents, is an important text processing method in natural language processing (NLP) that helps users gain insights from large collections of documents. In recent years, many unsupervised models have been used for TM, and these models often produce aspects that are not interpretable. To address this issue, a few semi-supervised methods have been developed that allow users to input prior domain knowledge to produce coherent aspects. Most of them are well adapted to English corpora, but there is very little work on Urdu. TM is challenging for the Urdu language, which has its own morphological structure, semantics, and syntax. In this paper, we first propose an effective semi-supervised topic model, "Seeded-Urdu Latent Dirichlet Allocation (Seeded-ULDA)", for the Urdu language. The model is designed to produce coherent topics while dealing with the morphological structure of the Urdu language. The proposed Seeded-ULDA topic model combines preprocessing, Seeded-LDA, and Gibbs sampling. Second, we introduce word2vec word embedding for Urdu and discover topics through clustering of the semantic space. This work aims to evaluate and compare various topic modeling frameworks on an Urdu news dataset. After comprehensive experiments and evaluation, the results show that word embedding is unable to extract coherent topics in the Urdu language, while the proposed Seeded-ULDA model is more than 39% more effective than the existing ULDA model based on the coherence measure.</p>
      </abstract>
      <kwd-group>
        <kwd>Topic Modeling</kwd>
        <kwd>Coherent topics</kwd>
        <kwd>Word embedding</kwd>
        <kwd>Seeded-LDA</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this era, the explosive growth of electronic file archives has attracted a lot of attention. One
report predicts that data storage capacity will grow to 40 trillion gigabytes by 2020, 50 times
more than in early 2010 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The most important concern now is to find effective tools
and methods that automatically organize, index, search, and browse this unstructured electronic
text data. TM is one of the most widely used technologies for organizing such data. TM
is a well-known advanced machine learning technique. Using it, we can discover
patterns that usually reflect underlying themes. Given a set of documents D composed of a set
of terms W, and a set of latent topics T, TM finds T by statistical inference on
the terms W. Thus, a document is a mixture of topics, where each topic is a statistical
distribution over words. A graphical representation of topic modeling is shown in Figure 1.
      </p>
      <p>
        In machine learning and NLP, a topic model is a statistical model used to discover the
themes, or "topics", that occur in a collection of documents. TM is a widely used text mining tool
for finding hidden semantic structures in text documents. A popular algorithm for modeling
text data is LDA, and different extensions have been proposed so far: online variational
inference for LDA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the Correlated Topic Model (CorrLDA) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the Hierarchical Topic Model
(hLDA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], etc. Recently, Word2Vec [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] word embedding has been used for theme extraction
and has achieved promising results [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ][
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The most commonly used topic model remains LDA,
which provides a powerful framework to extract hidden topics from text documents. However,
researchers have found that the topics extracted by unsupervised models can be uninterpretable or meaningless
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This is not a problem with LDA alone: it is potentially a problem with any of its extensions.
Several knowledge-based models have been proposed to address this problem, such as
Seeded-LDA, in which seed words are used as input to guide the model [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Such a model can produce
coherent topics of particular interest to users.
      </p>
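      <p>As a minimal illustration of this kind of inference, and not of any particular model in this paper, the following sketch fits a small LDA model on a toy corpus, assuming the gensim library; the documents and the topic count are illustrative placeholders.</p>
      <preformat>
# A toy LDA run: infer T latent topics from documents over terms W.
from gensim import corpora, models

texts = [
    ["cricket", "team", "match", "score"],
    ["cancer", "blood", "health", "treatment"],
    ["match", "team", "player", "win"],
    ["hospital", "health", "doctor", "blood"],
]

dictionary = corpora.Dictionary(texts)               # term set W
bow_corpus = [dictionary.doc2bow(t) for t in texts]  # documents D as bags of words
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)

# Each topic is a distribution over words; each document mixes topics.
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
      </preformat>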
      <p>Most topic models are designed, developed, and implemented for English text corpora, and
these techniques are therefore very effective for English. But Urdu is distinct in nature
from widely studied languages (such as Chinese, English, and Arabic): it has distinct grammatical
forms, synonyms and antonyms of various words, morphological structure, semantics, and syntax.
TM therefore becomes a challenging task in Urdu, and only a limited amount of NLP work has been devoted to
Urdu. There are many research communities for English, and most software, application
programming interfaces (APIs), and tools are developed specifically for English and do not work
effectively for the Urdu language. Using these tools and software for Urdu requires further
work to achieve good performance.</p>
      <p>
        According to the literature review, there is little work on topic modeling in the Urdu language
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ][
        <xref ref-type="bibr" rid="ref16">16</xref>
        ][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, there is no prior work on extracting coherent topics in the Urdu language, and this is
the first work on the extraction of coherent topics from Urdu documents. In this paper, first, we
apply our proposed semi-supervised topic model, Seeded-ULDA. Second, we introduce word
embedding for the Urdu language to discover topics by clustering the semantic space generated
through word2vec word embedding. After intensive examination, the results show that word
embedding is unable to extract coherent topics in the Urdu language and that the semi-supervised
Seeded-ULDA model outperforms ULDA based on the coherence measure.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>In this section, we present the techniques used in this study for topic modeling. We
first focus on the Word2Vec word embedding model and then discuss the Seeded-ULDA
approach. All of these techniques are implemented and compared in our
experiments.</p>
      <sec id="sec-2-1">
        <title>Seeded-Urdu Latent Dirichlet Allocation (Seeded-ULDA)</title>
        <p>
          We start with a short explanation of the effectiveness of Seeded-ULDA. Developing an
efficient semi-supervised Urdu topic model, "Seeded-ULDA",
that combines preprocessing, Seeded-LDA, and Gibbs sampling [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] is considered a challenging task. Figure 2 gives a
complete overview of the proposed Seeded-ULDA model. We introduce the technologies involved in
Seeded-ULDA in the following subsections.
        </p>
        <sec id="sec-2-1-1">
          <title>Text PreProcessing</title>
          <p>Text preprocessing aims to standardize the representation of texts to improve the accuracy of
topic detection models. In this step, UTF-8 encoding, diacritics removal, tokenization, and
stopword removal are performed to standardize the input dataset.</p>
          <p>
            <bold>Encoding UTF-8</bold> Computer programs face the problem of character recognition in Urdu
text. We use Unicode Transformation Format 8 (UTF-8) encoding for Urdu character
recognition. Unicode is one of the most widely used encodings in the computer industry. In UTF-8,
every character is mapped to a unique variable-width numeric code.
          </p>
          <p>
            <bold>Removal of Diacritics</bold> A diacritic is a sign added to a letter to change its
pronunciation. Urdu diacritics are a subset of Arabic diacritics. The most widely used diacritics
in Urdu are Zabar, Pesh, and Zer, which are called Aerab [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ], while other diacritics are seldom
used. When a diacritic is attached to a word, the sound and meaning of the word change [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]. Urdu
is usually written with words only, and diacritics are left to personal choice, so we discard
diacritics to standardize the dataset. Figure 3 shows some examples of Urdu diacritics.
          </p>
          <p>
            <bold>Tokenization</bold> Tokenization plays an important role in text analysis tasks. Tokenization is
the process of splitting text documents into tokens, where a token is an individual instance of a
sequence of characters in natural language. Tokenization is a component of our methodology,
which we use to teach machines to understand words. Many tokenization techniques have
been proposed, but in this work we use a count vectorizer to split the text into tokens.
          </p>
          <p>
            <bold>Stopwords Removal</bold> In NLP, stopword removal is the part of preprocessing that removes
uninformative words, which are regarded as stopwords. They add no meaning or information and occur
frequently in sentences, so we can safely ignore them without losing the information of a sentence.
In order to get meaningful data, we exclude stopwords from our corpus. A few of the most
frequently used Urdu stopwords are shown in Figure 4.
          </p>
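          <p>To make these steps concrete, the following is a minimal Python sketch of the preprocessing pipeline described above, assuming scikit-learn's CountVectorizer as the count vectorizer; the short stopword list is illustrative only, not the full list used in this study.</p>
          <preformat>
# -*- coding: utf-8 -*-
import re
from sklearn.feature_extraction.text import CountVectorizer

# Urdu diacritics (Aerab, e.g. Zabar, Pesh, Zer) lie in the Arabic
# diacritic block U+064B..U+0652; removing them standardizes words.
DIACRITICS = re.compile(u"[\u064B-\u0652]")

def remove_diacritics(text):
    return DIACRITICS.sub("", text)

# A few common Urdu stopwords, for illustration only.
URDU_STOPWORDS = [u"کا", u"کی", u"کے", u"اور", u"سے", u"ہے"]

def preprocess(raw_documents):
    # Documents are read as UTF-8 strings, then diacritics are stripped.
    docs = [remove_diacritics(d) for d in raw_documents]
    # CountVectorizer tokenizes and drops stopwords in one pass;
    # min_df discards terms appearing in fewer than min_df documents.
    vectorizer = CountVectorizer(stop_words=URDU_STOPWORDS, min_df=2)
    term_matrix = vectorizer.fit_transform(docs)
    return vectorizer, term_matrix
          </preformat>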
        </sec>
        <sec id="sec-2-1-2">
          <title>Seeded-LDA</title>
          <p>
            This approach allows a user to guide the topic discovery process by providing seed
words that are representative of the corpus [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. The model can use the seed words in two
ways: to improve the topic-word distributions or to improve the document-topic distributions. To improve
topic-word distributions, the model is set up so that each topic prefers to generate terms that
are similar to the terms in its seed set. To improve document-topic distributions, the model is
encouraged to select document-level topics based on the presence of input seed words in a
document. Since our aim is to produce coherent topics, we use the first way and improve the
topic-word probability distributions. In traditional topic models, a multinomial distribution φ_k over words
expresses each topic k. This notion is extended by defining each topic as a
mixture of two different distributions: a regular topic distribution and a seed topic distribution.
In the seed topic distribution, words are generated from the given seed set; in the regular topic
distribution, any word can be generated, including seed words. We emphasize that, like
ordinary topics, the words of seed topics have a non-uniform probability distribution. The model
takes a set of seed words as input and outputs the probability distributions of these words.
          </p>
          <p>For simplicity, the model is explained by assuming a one-to-one correspondence between
regular and seed topics. When there are more regular topics than seed topics, this assumption can simply be
relaxed by making copies of the seed topics. Figure 5 shows that documents
are a mixture of topics T, where these topics are a mixture of seed topics s and regular topics
r. The probability of picking a term from the regular topic distribution versus the seed topic
distribution is controlled by the parameter π_k. The graphical notation is shown in Figure 6, and
the generative process of Seeded-LDA is as follows:</p>
          <p>For each topic k = 1, ..., T:
draw a regular topic distribution φ_k^r ~ Dir(β_r);
draw a seed topic distribution φ_k^s ~ Dir(β_s);
draw π_k ~ Beta(1, 1).</p>
          <p>For each document d, choose θ_d ~ Dir(α). Then, for each word position i:
select a topic z_i ~ Mult(θ_d);
select an indicator x_i ~ Bern(π_{z_i});
if x_i = 0, select a word w_i ~ Mult(φ_{z_i}^r) (choose from the regular topic);
if x_i = 1, select a word w_i ~ Mult(φ_{z_i}^s) (choose from the seed topic).</p>
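          <p>To make the generative story concrete, the following is a small NumPy sketch that samples a toy corpus from this process; the vocabulary size, seed sets, and hyperparameter values are illustrative assumptions, not those used in our experiments.</p>
          <preformat>
import numpy as np

rng = np.random.default_rng(0)
V, T = 50, 3                      # vocabulary size, number of topics
alpha, beta_r, beta_s = 0.5, 0.01, 0.1
seed_sets = [rng.choice(V, size=5, replace=False) for _ in range(T)]

phi_r = rng.dirichlet([beta_r] * V, size=T)   # regular topic distributions
phi_s = np.zeros((T, V))                      # seed topic distributions
for k, seeds in enumerate(seed_sets):
    phi_s[k, seeds] = rng.dirichlet([beta_s] * len(seeds))
pi = rng.beta(1.0, 1.0, size=T)               # pi_k ~ Beta(1, 1)

def sample_document(length=20):
    theta = rng.dirichlet([alpha] * T)        # theta_d ~ Dir(alpha)
    words = []
    for _ in range(length):
        z = rng.choice(T, p=theta)            # z_i ~ Mult(theta_d)
        x = rng.binomial(1, pi[z])            # x_i ~ Bern(pi_z)
        dist = phi_s[z] if x == 1 else phi_r[z]
        words.append(rng.choice(V, p=dist))   # w_i ~ Mult(phi)
    return words

corpus = [sample_document() for _ in range(5)]
          </preformat>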
          <p>
            When direct sampling is hard, Gibbs sampling is used to obtain a sequence of observations
via Markov chain Monte Carlo (MCMC), which is a methodology for sampling from a statistical
distribution [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ][
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]. It is a randomized algorithm, in the sense that it samples
randomly, and it is widely employed for statistical inference. In this paper, we apply Gibbs
sampling to the probabilistic generative model to discover coherent topics.
          </p>
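          <p>As a toy illustration of Gibbs sampling, and assuming a bivariate normal target for brevity rather than the Seeded-ULDA posterior itself, each variable is repeatedly drawn from its conditional distribution given the others:</p>
          <preformat>
import numpy as np

# Gibbs sampling for a standard bivariate normal with correlation rho:
# alternately draw each coordinate from its conditional given the other.
rng = np.random.default_rng(1)
rho, n_iter = 0.8, 10000
x = y = 0.0
samples = []
for _ in range(n_iter):
    x = rng.normal(rho * y, np.sqrt(1.0 - rho ** 2))  # p(x | y)
    y = rng.normal(rho * x, np.sqrt(1.0 - rho ** 2))  # p(y | x)
    samples.append((x, y))

# After burn-in, the chain's empirical correlation approaches rho.
print(np.corrcoef(np.array(samples[1000:]).T)[0, 1])
          </preformat>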
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Word2vec</title>
        <p>
          The Word2vec model is a two-layer neural network that can be trained to reconstruct the linguistic
context of words [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. It captures the context and the semantic and syntactic similarity of words in
a document. Word2Vec is one of the most widely used methods for learning word embeddings
using shallow neural networks. It takes a large text corpus as input, generates
a vector space that can have hundreds of dimensions, and assigns a corresponding vector in the
space to each unique word in the lexicon. This technique differs from other topic models,
which use documents as context: Word2Vec learns the distributed representation of each target
word by taking the surrounding terms as its context.
        </p>
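      <p>As an illustration, a Word2Vec model can be trained with the gensim library as in the following sketch; the tokenized sentences and parameter values are placeholders (sg=0 selects the CBOW architecture used later in Experiment 1).</p>
      <preformat>
# -*- coding: utf-8 -*-
from gensim.models import Word2Vec

# Tokenized Urdu sentences from the preprocessing step
# (placeholder data; the real corpus is far larger).
sentences = [
    [u"صحت", u"علاج", u"ہسپتال"],
    [u"کرکٹ", u"میچ", u"ٹیم"],
]

# sg=0 selects CBOW: the model predicts a target word from the
# surrounding context words inside the given window.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

vector = model.wv[u"صحت"]                      # 100-dimensional embedding
similar = model.wv.most_similar(u"صحت", topn=5)
      </preformat>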
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental studies</title>
      <p>We present several experiments to demonstrate the effectiveness of the topic
modeling techniques defined above. These experiments were performed using two corpora, discussed in the
following section.</p>
      <sec id="sec-3-1">
        <title>Dataset</title>
        <p>The Urdu language does not have a benchmark dataset for NLP tasks. Therefore, we created
our own corpus, which contains Urdu text articles from the widely known news website
https://www.express.pk. It is now publicly available at
https://github.com/Mubashar331/Urducorpus for research purposes. The collected dataset has five categories of documents:
Health, Sports, Science, Entertainment, and Business. We also collected an English dataset
with four categories. After completing the preprocessing steps discussed in the
subsections above, we applied the topic modeling techniques defined above to these corpora and
evaluated their performance. Table 1 briefly describes the corpora.</p>
        <p>(Table 1, which summarizes the two corpora, could not be recovered from the source; Corpus 1 is the Urdu news corpus with five categories and Corpus 2 is the English corpus with four categories.)</p>
        <p>We present three experiments to evaluate the performance of the topic modeling techniques.
These experiments are performed on the corpora discussed in the subsection above.</p>
        <sec id="sec-3-1-4">
          <title>Experiment 1: Topic Modeling by word2vec</title>
          <p>The purpose of this experiment is to evaluate the accuracy of word2vec on Urdu text documents.
Word2vec has two architectures for obtaining feature vectors: the Skip-Gram model and Continuous Bag of Words
(CBOW). In this study, we use the CBOW model to build vectors
for the given preprocessed dataset. Then, we cluster the resulting feature vectors using the K-means
method, as sketched below. The process of topic modeling by word2vec is shown in Figure 7.</p>
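          <p>A minimal sketch of this clustering step, assuming the trained CBOW model from the sketch in Section 2.2 and scikit-learn's KMeans; the cluster count and word counts are illustrative, and each "topic" is read off as the words nearest a cluster centroid.</p>
          <preformat>
import numpy as np
from sklearn.cluster import KMeans

# model is the trained CBOW Word2Vec model from Section 2.2.
words = list(model.wv.index_to_key)
vectors = np.array([model.wv[w] for w in words])

# One cluster per expected topic (five categories in the Urdu corpus).
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(vectors)

# Read off each "topic" as the words closest to its cluster centroid.
for k, center in enumerate(kmeans.cluster_centers_):
    dists = np.linalg.norm(vectors - center, axis=1)
    top_words = [words[i] for i in np.argsort(dists)[:10]]
    print(k, top_words)
          </preformat>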
        </sec>
        <sec id="sec-3-1-4b">
          <title>Experiment 2: Topic Modeling by Seeded-ULDA</title>
          <p>Seeded-ULDA is a semi-supervised technique that integrates prior domain knowledge into
topic models to help produce coherent topics. In order to extract coherent topics,
it allows a user to guide the model by giving as input a set of seed words that are representative
of the given corpus. In this experiment, we use the Urdu dataset, which contains five classes, so we
manually incorporate five seeded topics into the model to help generate coherent
topics. The first ten words of the seeded topic for each class are shown in Figure 8.</p>
        </sec>
        <sec id="sec-3-1-5">
          <title>Experiment 3: Topic Modeling by ULDA</title>
          <p>
            In this experiment, we compare the accuracy of our proposed model with the ULDA model
proposed in [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. That model was designed to discover topics from a corpus of Urdu
news articles. We apply the ULDA model to our own corpus and run it several
times to evaluate the topics discovered from the Urdu text documents.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Result Discussion</title>
      <p>
        In the NLP research community, the evaluation of topics discovered by a topic model is
considered an open challenge [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Some researchers use internal evaluation methods such as perplexity
or likelihood for evaluating topic models, but these methods cannot measure the
coherence of the discovered topics. Through large-scale user studies, the authors of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] argued that
topic models that perform well on perplexity or likelihood can fail to produce coherent
topics. Therefore, we do not use these evaluation methods to evaluate the topic models defined
above.
      </p>
      <p>
        We evaluate the topic models using a manual evaluation technique called the Coherence Measure
(CM), the ratio of relevant words to the total candidate words of a topic [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For this manual
evaluation, we take the 20 highest-weighted words of every topic and ask 5
students to examine and label them. First, they examine the words to label a topic as
interpretable or irrelevant. When a topic is interpretable, they then judge which of its words are
relevant. CM is calculated using Equation (1), where x is the number of relevant
words and n is the total number of candidate words:
      </p>
      <p>CM = x / n    (1)</p>
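      <p>A small helper for this computation might look as follows; the label counts here are a hypothetical stand-in for the students' annotations.</p>
      <preformat>
def coherence_measure(relevant_counts, n_candidates=20):
    """Average CM over topics: CM = x / n per topic (Equation 1),
    where x is the number of words the annotators marked relevant
    and n is the number of candidate words shown (20 per topic)."""
    per_topic = [x / n_candidates for x in relevant_counts]
    return sum(per_topic) / len(per_topic)

# Hypothetical annotation outcome for five topics:
print(coherence_measure([14, 9, 17, 11, 12]))  # average CM = 0.63
      </preformat>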
      <p>We present experiments to demonstrate the effectiveness of the topic modeling techniques
based on CM. First, we find topics by word2vec, and most of the extracted topics are labeled
irrelevant. Second, we extract five topics each by LDA and ULDA from the Urdu corpus and find that
all topics extracted by LDA are labeled irrelevant. Figure 9 shows seven words of one topic
extracted by each of LDA and ULDA: the topic extracted by LDA is irrelevant and does not
belong to any class of our dataset, while the topic extracted by ULDA is relevant and belongs
to the Health class of our dataset.</p>
      <p>Third, we extract five topics each by Seeded-LDA and our proposed Seeded-ULDA model from the
Urdu corpus and find that all topics extracted by Seeded-LDA are labeled irrelevant. As
shown in Figure 10, the topic extracted by Seeded-LDA is irrelevant and does not belong to any
class of our dataset, while the topic extracted by Seeded-ULDA is relevant and belongs to
the Health class of our dataset.
Finally, we apply the LDA and Seeded-LDA topic modeling techniques to the English corpus
and find that all extracted topics are labeled irrelevant. Then, we combine both models
with Gibbs sampling (GS) and extract topics from the English corpus. The results demonstrate
that the topics extracted by Seeded-LDA(GS) are more coherent than those of LDA(GS). As shown
in Table 2, the topics extracted by LDA and Seeded-LDA do not belong to any class of the given
corpus, but the topics extracted by LDA(GS) and Seeded-LDA(GS) belong to the Health class
of the English corpus.</p>
      <p>(Table 2, showing example topic words per model, could not be fully recovered; the one surviving column, headed LDA, lists: Study, Cancer, year, People, Mice, Week, Blood.)</p>
      <p>Now, we calculate the CM of Seeded-ULDA and ULDA. As can be seen in Table 3, the average
CM of Seeded-ULDA surpasses that of the ULDA model. We also calculate the CM of
Seeded-LDA(GS) and LDA(GS) on the English corpus, with the results shown in Table 4. The results
demonstrate that Seeded-LDA(GS) gives better results than LDA(GS) based on CM. Finally, we
examine the influence of the minimum document frequency (min df) parameter on both
Seeded-ULDA and ULDA using the Urdu corpus. We set the value of min df to 1 and then 2, and
examine the influence based on CM. As shown in Figure 11, both models produce more coherent
topics with min df = 2, and our proposed Seeded-ULDA model is better than ULDA.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Regular unsupervised topic models may fail to produce coherent topics due to their purely
unsupervised nature. Several knowledge-based topic models have been proposed to address this
problem, but most of them target English. NLP research involving the Urdu language
is comparatively hard, compared to other popular languages, because of the specific
syntax, semantics, and morphological structure of Urdu. The main motivation behind
this research is the lack of NLP resources and tools for the Urdu language. There are no
significant studies in the literature on extracting coherent topics from Urdu texts. To meet the
challenges of Urdu, we have proposed the Seeded-ULDA topic model for the Urdu language to
produce coherent topics. In order to evaluate the performance and effectiveness of our proposed
Seeded-ULDA model, we conducted three experiments using the Urdu dataset generated by
ourselves. The results demonstrate that unsupervised models produce less coherent or meaningless
topics compared to the semi-supervised framework. First, we applied word2vec word embedding, and
the results show that it is unable to extract coherent topics. In the second and third experiments,
we applied Seeded-ULDA and ULDA, respectively, and the results show that the semi-supervised
Seeded-ULDA model produces more than 39% more coherent topics than the unsupervised ULDA
model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>David</given-names> <surname>Blei</surname></string-name>,
          <string-name><given-names>Andrew</given-names> <surname>Ng</surname></string-name>, and
          <string-name><given-names>Michael</given-names> <surname>Jordan</surname></string-name>
          (<year>2003</year>).
          <article-title>Latent Dirichlet Allocation</article-title>.
          <source>Journal of Machine Learning Research</source>,
          <volume>3</volume>,
          <fpage>993</fpage>-<lpage>1022</lpage>. doi: 10.1162/jmlr.2003.3.4-5.993.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>J.</given-names> <surname>Gantz</surname></string-name> and
          <string-name><given-names>D.</given-names> <surname>Reinsel</surname></string-name>,
          <article-title>The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East</article-title>,
          <source>Technical Report 1</source>, IDC, Framingham, Dec.
          <year>2012</year>, pp.
          <fpage>1</fpage>-<lpage>16</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>D.</given-names> <surname>Blei</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Griffiths</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Jordan</surname></string-name>, and
          <string-name><given-names>J.</given-names> <surname>Tenenbaum</surname></string-name>
          (<year>2004</year>).
          <article-title>Hierarchical topic models and the nested Chinese restaurant process</article-title>.
          <source>Advances in Neural Information Processing Systems</source>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>D.</given-names> <surname>Blei</surname></string-name> and
          <string-name><given-names>J.</given-names> <surname>Lafferty</surname></string-name>
          (<year>2005</year>).
          <article-title>Correlated topic models</article-title>.
          <source>Proceedings of the 18th International Conference on Neural Information Processing Systems</source>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>C.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Paisley</surname></string-name>, and
          <string-name><given-names>D.</given-names> <surname>Blei</surname></string-name>
          (<year>2011</year>).
          <article-title>Online variational inference for the hierarchical Dirichlet process</article-title>.
          <source>Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics</source>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>J.</given-names> <surname>Jagarlamudi</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Daumé III</surname></string-name>, and
          <string-name><given-names>R.</given-names> <surname>Udupa</surname></string-name>,
          <article-title>Incorporating lexical priors into topic models</article-title>,
          in
          <source>Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics</source>.
          Avignon, France: Association for Computational Linguistics, Apr.
          <year>2012</year>, pp.
          <fpage>204</fpage>-<lpage>213</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>A.</given-names> <surname>Ur Rehman</surname></string-name>,
          <string-name><given-names>A. H.</given-names> <surname>Khan</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Aftab</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Rehman</surname></string-name>, and
          <string-name><given-names>M. A.</given-names> <surname>Shah</surname></string-name>,
          <article-title>Hierarchical Topic Modeling for Urdu Text Articles</article-title>,
          <source>2019 25th International Conference on Automation and Computing (ICAC)</source>,
          Lancaster, United Kingdom,
          <year>2019</year>, pp.
          <fpage>1</fpage>-<lpage>6</lpage>. doi: 10.23919/IConAC.2019.8895047.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>Sabitra Sankalp</given-names> <surname>Panigrahi</surname></string-name>,
          <article-title>Modelling of Topic from Hindi Corpus using Word2Vec</article-title>,
          <source>2018 Second International Conference on Advances in Computing, Control and Communication Technology (IAC3T)</source>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Fabrizio</given-names> <surname>Esposito</surname></string-name> et al.
          <article-title>Topic Modelling with Word Embeddings</article-title>.
          <source>CLiC-it/EVALITA</source> (<year>2016</year>).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>T.</given-names> <surname>Mikolov</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Corrado</surname></string-name>, and
          <string-name><given-names>J.</given-names> <surname>Dean</surname></string-name>.
          <year>2013</year>.
          <article-title>Distributed Representations of Words and Phrases and their Compositionality</article-title>.
          <source>Advances in Neural Information Processing Systems</source>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>Marco</given-names> <surname>Baroni</surname></string-name>,
          <string-name><given-names>Georgiana</given-names> <surname>Dinu</surname></string-name>, and
          <string-name><given-names>German</given-names> <surname>Kruszewski</surname></string-name>.
          <year>2014</year>.
          <article-title>Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors</article-title>.
          In <source>ACL (1)</source>, pages
          <fpage>238</fpage>-<lpage>247</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Mimno</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wallach</surname>
            ,
            <given-names>H.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talley</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leenders</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Optimizing semantic coherence in topic models</article-title>
          .
          <source>EMNLP</source>
          ,
          <fpage>262</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>Jonathan</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>Jordan</given-names> <surname>Boyd-Graber</surname></string-name>,
          <string-name><given-names>Sean</given-names> <surname>Gerrish</surname></string-name>,
          <string-name><given-names>Chong</given-names> <surname>Wang</surname></string-name>, and
          <string-name><given-names>David</given-names> <surname>Blei</surname></string-name>.
          <article-title>Reading tea leaves: how humans interpret topic models</article-title>.
          In <source>Proceedings of the 23rd Annual Conference on Neural Information Processing Systems</source>,
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>Pengtao</given-names> <surname>Xie</surname></string-name> and
          <string-name><given-names>Eric</given-names> <surname>Xing</surname></string-name>
          (<year>2013</year>).
          <article-title>Integrating Document Clustering and Topic Modeling</article-title>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>Khadija</given-names> <surname>Shakeel</surname></string-name>,
          <string-name><given-names>Ghulam Rasool</given-names> <surname>Tahir</surname></string-name>,
          <string-name><given-names>Irsha</given-names> <surname>Tehseen</surname></string-name>, and
          <string-name><given-names>Mubashir</given-names> <surname>Ali</surname></string-name>
          (<year>2018</year>).
          <article-title>A framework of Urdu topic modeling using latent dirichlet allocation (LDA)</article-title>,
          pp. <fpage>117</fpage>-<lpage>123</lpage>. doi: 10.1109/CCWC.2018.8301655.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>Anwar Ur</given-names> <surname>Rehman</surname></string-name>,
          <string-name><given-names>Zobia</given-names> <surname>Rehman</surname></string-name>,
          <string-name><given-names>Junaid</given-names> <surname>Akram</surname></string-name>,
          <string-name><given-names>Waqar</given-names> <surname>Ali</surname></string-name>,
          <string-name><given-names>Munam</given-names> <surname>Shah</surname></string-name>, and
          <string-name><given-names>Muhammad</given-names> <surname>Salman</surname></string-name>
          (<year>2018</year>).
          <article-title>Statistical Topic Modeling for Urdu Text Articles</article-title>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>Walter R.</given-names> <surname>Gilks</surname></string-name>.
          <article-title>Markov chain Monte Carlo</article-title>.
          <source>Wiley Online Library</source>,
          <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>Robert P.</given-names> <surname>Dobrow</surname></string-name>.
          <article-title>Markov chain Monte Carlo</article-title>.
          <source>Introduction to Stochastic Processes with R</source>, pages
          <fpage>181</fpage>-<lpage>222</lpage>,
          <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Wells</surname>
            ,
            <given-names>J. C.</given-names>
          </string-name>
          <article-title>Orthographic Diacritics and Multilingual Computing</article-title>
          .
          <source>In Proceedings of Language Problems and Language Planning</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Daud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Khan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Che</surname>
          </string-name>
          ,
          <article-title>Urdu language processing: a survey</article-title>
          ,
          <source>Artif. Intell. Rev.</source>
          , vol.
          <volume>47</volume>
          , no.
          <issue>3</issue>
          ,pp.
          <fpage>279</fpage>
          -
          <lpage>311</lpage>
          , Mar.
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name><surname>Mustafa</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Zeng</surname>, <given-names>F.</given-names></string-name>;
          <string-name><surname>Ghulam</surname>, <given-names>H.</given-names></string-name>;
          <string-name><surname>Arslan</surname>, <given-names>H. M.</given-names></string-name>.
          <article-title>Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling</article-title>.
          <source>Information</source>
          <year>2020</year>,
          <volume>11</volume>,
          <fpage>518</fpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>