<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Discovering Coherent Topics from Urdu Text</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mubashar</forename><surname>Mustafa</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Central South University</orgName>
								<address>
									<settlement>Changsha</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Feng</forename><surname>Zeng</surname></persName>
							<email>fengzeng@csu.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="institution">Central South University</orgName>
								<address>
									<settlement>Changsha</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hussain</forename><surname>Ghulam</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Central South University</orgName>
								<address>
									<settlement>Changsha</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Wenjia</forename><surname>Li</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">New York Institute of Technology</orgName>
								<address>
									<settlement>New York</settlement>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">School of Computer Science and Engineering</orgName>
								<orgName type="institution">Central South University</orgName>
								<address>
									<settlement>Changsha</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">New York Institute of Technology</orgName>
								<address>
									<settlement>New York</settlement>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Discovering Coherent Topics from Urdu Text</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">19DB40E22AE6E14DA6C6ED1D2D79228B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T09:08+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Topic Modeling</term>
					<term>Coherent topics</term>
					<term>Word embedding</term>
					<term>Seeded-LDA</term>
					<term>Natural Language Processing</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Topic modeling (TM), the detection of themes or aspects in documents, is an important text processing method in natural language processing (NLP) that helps users gain insight from large collections of documents. In recent years, many unsupervised models have been used for TM, and these models often produce aspects that are not interpretable. To address this issue, a few semi-supervised methods have been developed that allow users to supply prior domain knowledge so that coherent aspects are produced. Most of them are well adapted to English corpora, but there is very little work on Urdu. TM is challenging for the Urdu language, which has its own morphological structure, semantics, and syntax. In this paper, we first propose an effective semi-supervised topic model, "Seeded-Urdu Latent Dirichlet Allocation (Seeded-ULDA)", for the Urdu language. The model is designed to produce coherent topics while handling the morphological structure of Urdu. The proposed Urdu topic model Seeded-ULDA combines preprocessing, seeded-LDA, and Gibbs sampling. Second, we introduce word2vec word embedding for Urdu and discover topics through clustering of the semantic space. This work aims to evaluate and compare various topic modeling frameworks on an Urdu news dataset. Comprehensive experiments and evaluation show that word embedding is unable to extract coherent topics in the Urdu language, while the proposed Seeded-ULDA model is more than 39% better than the existing ULDA model in terms of the coherence measure.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In this era, the explosive growth of electronic file archives has attracted a lot of attention. One report predicts that data storage will grow to 40 trillion gigabytes by 2020, 50 times more than in early 2010 <ref type="bibr" target="#b1">[2]</ref>. The most important concern now is to find effective tools or methods that automatically organize, index, search, and browse this unstructured electronic text data. TM is one of the most widely used technologies for organizing such data. TM is a well-known machine learning technique: using it, we can discover patterns that usually reflect the underlying themes of a collection. Given a set of documents D composed of a set of terms W, and a set of latent topics T, TM finds T by statistical inference over the terms W. Thus, a document is a mixture of topics, where each topic is a statistical distribution over words. A graphical representation of topic modeling is shown in Figure <ref type="figure" target="#fig_0">1</ref>. In machine learning and NLP, a topic model is a statistical model used to discover the themes, or "topics", that occur in a collection of documents. TM is a widely used text mining tool for finding hidden semantic structures in text documents. A popular algorithm for modeling text data is LDA, and different extensions have been proposed so far: online variational inference for LDA <ref type="bibr" target="#b4">[5]</ref>, the Correlated Topic Model (CorrLDA) <ref type="bibr" target="#b3">[4]</ref>, the Hierarchical Topic Model (hLDA) <ref type="bibr" target="#b2">[3]</ref>, etc. Recently, Word2Vec <ref type="bibr" target="#b9">[10]</ref> word embedding has been used for theme extraction and has achieved promising results <ref type="bibr" target="#b7">[8]</ref> <ref type="bibr" target="#b8">[9]</ref>. 
However, the most commonly used topic model is LDA, which provides a powerful framework for extracting hidden topics from text documents. But researchers have found that the topics extracted by unsupervised models can be uninterpretable or meaningless <ref type="bibr" target="#b11">[12]</ref>. This is not a problem with LDA only: it is potentially a problem with any of its extensions. Several knowledge-based models have been proposed to address this problem, such as seeded-LDA, in which seed words are used as input to guide the model <ref type="bibr" target="#b5">[6]</ref>. Such a model can produce coherent topics of particular interest to users.</p><p>Most topic models are designed, developed, and implemented for English text corpora, and these techniques are therefore very effective on English. But Urdu differs in nature from widely studied languages (such as Chinese, English, and Arabic): it has distinct grammatical forms, synonyms and antonyms of various words, morphological structure, semantics, and syntax. TM therefore becomes a challenging task in Urdu, and only limited NLP work has been devoted to the language. There are many research communities for English, and most software, application programming interfaces (APIs), and tools are developed specifically for English and do not work effectively for Urdu; further work is required before these tools can perform well on Urdu.</p><p>According to the literature review, there is little work on topic modeling for the Urdu language <ref type="bibr" target="#b14">[15]</ref>[16] <ref type="bibr" target="#b6">[7]</ref>, and none on extracting coherent topics; to our knowledge, this is the first work on extracting coherent topics from Urdu documents. In this paper, first, we apply our proposed semi-supervised topic model Seeded-ULDA. Second, we introduce word embedding for the Urdu language to discover topics by clustering the semantic space generated through word2vec word embedding. 
After an intensive examination, the results show that word embedding is unable to extract coherent topics in the Urdu language, and that the semi-supervised model Seeded-ULDA outperforms ULDA based on the coherence measure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Methodology</head><p>In this section, we present the techniques used in this study for topic modeling. We first discuss the Seeded-ULDA approach and then the Word2Vec word embedding model. All of these techniques are implemented and compared in our experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Seeded-Urdu Latent Dirichlet Allocation (Seeded-ULDA)</head><p>We start with a short explanation of the effectiveness of Seeded-ULDA. Developing an efficient semi-supervised Urdu topic model that combines preprocessing, seeded-LDA, and Gibbs sampling <ref type="bibr" target="#b20">[21]</ref> is considered a challenging task. Figure <ref type="figure" target="#fig_1">2</ref> gives a complete overview of the proposed Seeded-ULDA model. We introduce the technologies involved in Seeded-ULDA in the following subsections. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.1">Text PreProcessing</head><p>Text preprocessing aims to standardize the representation of texts to improve the accuracy of topic detection models. In this step, UTF-8 encoding, diacritics removal, tokenization, and stop-word removal are performed to standardize the input dataset.</p><p>Encoding UTF-8 Computer programs face the problem of character recognition in Urdu text. We use Unicode Transformation Format 8 (UTF-8) encoding for Urdu character recognition. Unicode is one of the most widely used encodings in the computer industry; UTF-8 maps every character to a unique variable-width numeric code.</p><p>Removal of Diacritics A diacritic is a sign added to a letter to change its pronunciation. Urdu diacritics are a subset of Arabic diacritics. The most widely used diacritics in Urdu are Zabar, Pesh, and Zer, which are called Aerab <ref type="bibr" target="#b19">[20]</ref>; other diacritics are seldom used. When a diacritic is attached to a word, the sound and meaning of the word change <ref type="bibr" target="#b18">[19]</ref>. Urdu is usually written with words only, diacritics being left to personal choice, so we discard diacritics to standardize the dataset. Figure <ref type="figure" target="#fig_2">3</ref> shows some examples of Urdu diacritics. Stopwords Removal In NLP, stop-word removal is the preprocessing step that removes words carrying little information; such words are regarded as stopwords. They add no meaning or information and occur frequently in sentences, so we can safely ignore them without losing the information of the sentence. In order to obtain meaningful data, we exclude stopwords from our corpus. A few frequently used Urdu stopwords are shown in figure <ref type="figure" target="#fig_3">4</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.2">Seeded-LDA</head><p>This approach allows a user to guide the topic discovery process by providing seed words that are representative of the corpus <ref type="bibr" target="#b5">[6]</ref>. 
This model can use the seed words in two ways: to improve the topic-word and the document-topic probability distributions. To improve topic-word distributions, the model is set up so that each topic prefers to generate terms similar to the terms in a seed set. To improve document-topic distributions, the model is encouraged to select document-level topics based on the presence of the input seed words in that document. Our aim is to produce coherent topics, so we use the first way, improving the topic-word probability distribution. In traditional topic models, each topic k is expressed as a multinomial distribution φ k over words. This notion is extended here, and a topic is defined as a mixture of two different distributions: a regular topic distribution and a seed topic distribution. In the seed topic distribution, words are generated from the given seed set; in the regular topic distribution, any word can be generated, including seed words. We emphasize that, like regular topics, the words of seed topics have a non-uniform probability distribution. The model takes a set of seed words as input and outputs the probability distribution of these words. For simplicity, the model is explained by assuming a one-to-one correspondence between regular and seed topics; when there are more regular topics, this assumption can easily be relaxed by making copies of the corresponding seed topics. Figure <ref type="figure" target="#fig_4">5</ref> shows that documents are a mixture of topics T, where these topics are a mixture of seed topics φ s and regular topics φ r . The probability of picking a term from the regular topic distribution versus the seed topic distribution is controlled by the parameter π k . 
The graphical notation is shown in Figure <ref type="figure" target="#fig_5">6</ref> and the generative process of seeded-LDA is as follows:</p><formula xml:id="formula_0">• For each topic k = 1...T -Draw regular topic φ r k ∼Dir(β r ).</formula><p>-Draw seed topic φ s k ∼Dir(β s ).</p><p>-Draw π k ∼ Beta(1, 1).</p><p>• For each document d, choose θ d ∼Dir(α)</p><formula xml:id="formula_1">-Select a topic z i ∼Mult(θ d ).</formula><p>-Select an indicator x i ∼Bern(π zi ).</p><p>if x i = 0, select a word w i ∼Mult(φ r zi ). // choose from regular topic. if x i = 1, select a word w i ∼Mult(φ s zi ). // choose from seed topic. </p></div>
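The generative process above can be sketched as a toy simulation. All dimensions, hyperparameters, and seed sets below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

# Toy simulation of the seeded-LDA generative process described above.
rng = np.random.default_rng(0)

T, V, D, N = 3, 8, 2, 10              # topics, vocabulary, documents, words/doc
alpha, beta_r, beta_s = 1.0, 0.1, 0.5
seed_words = [[0, 1], [2, 3], [4, 5]]  # hypothetical seed sets (vocabulary ids)

phi_r = rng.dirichlet([beta_r] * V, size=T)   # regular topics over the full vocab
phi_s = np.zeros((T, V))                      # seed topics over seed sets only
for k in range(T):
    phi_s[k, seed_words[k]] = rng.dirichlet([beta_s] * len(seed_words[k]))
pi = rng.beta(1.0, 1.0, size=T)               # pi_k ~ Beta(1, 1)

docs = []
for d in range(D):
    theta = rng.dirichlet([alpha] * T)        # document-topic distribution
    words = []
    for _ in range(N):
        z = rng.choice(T, p=theta)            # topic indicator z_i
        x = rng.binomial(1, pi[z])            # x_i ~ Bern(pi_z): seed vs regular
        words.append(int(rng.choice(V, p=phi_s[z] if x == 1 else phi_r[z])))
    docs.append(words)
```

Note that each seed topic places all of its (non-uniform) probability mass on its seed set, while regular topics range over the whole vocabulary, exactly as in the process listing.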
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.3">Gibbs sampling</head><p>When direct sampling is difficult, Gibbs sampling is used to obtain a sequence of observations via Markov Chain Monte Carlo (MCMC), a methodology for sampling from a statistical distribution <ref type="bibr">[17][18]</ref>. It is a randomized algorithm, meaning that it samples randomly, and it is widely employed for statistical inference. In this paper, we apply Gibbs sampling with the probabilistic generative model to discover coherent topics.</p></div>
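As an illustration of the inference idea, here is a minimal collapsed Gibbs sampler for plain LDA. This is a generic sketch, not the paper's Seeded-ULDA sampler; the function name and toy data are hypothetical:

```python
import numpy as np

def lda_gibbs(docs, T, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Minimal collapsed Gibbs sampler for plain LDA (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, T))   # document-topic counts
    nkw = np.zeros((T, V))   # topic-word counts
    nk = np.zeros(T)         # per-topic totals
    z = [rng.integers(T, size=len(doc)) for doc in docs]  # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Full conditional p(z_i = k | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(T, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    phi = (nkw + beta) / (nk[:, None] + V * beta)              # topic-word dists
    theta = (ndk + alpha) / (ndk.sum(1)[:, None] + T * alpha)  # doc-topic dists
    return phi, theta

docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4]] * 3  # six toy documents, two themes
phi, theta = lda_gibbs(docs, T=2, V=6)
```

Each sweep resamples every word's topic assignment from its full conditional given all other assignments, which is what makes the chain a Gibbs sampler.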
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Word2vec</head><p>The Word2vec model is a two-layer neural network that can be trained to reconstruct the linguistic context of words <ref type="bibr" target="#b9">[10]</ref>. It captures the contextual, semantic, and syntactic similarity of words in a document. Word2Vec is one of the most widely used methods for learning word embeddings with shallow neural networks. It takes a large text corpus as input, generates a vector space that can have hundreds of dimensions, and assigns to each unique word in the lexicon a corresponding vector in that space. This technique differs from topic models, which use documents as context: Word2Vec learns the distributed representation of each target word by treating the surrounding terms as its context.</p></div>
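In practice the embedding itself would be trained with a library; the sketch below only illustrates word2vec's notion of context, i.e. how CBOW-style (context, target) pairs are formed from surrounding terms (the helper name is hypothetical):

```python
def cbow_pairs(tokens, window=2):
    """Build CBOW-style (context, target) pairs: each target word is
    predicted from the terms surrounding it, not from the whole document."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` tokens on each side of the target.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

pairs = cbow_pairs(["topic", "models", "discover", "hidden", "themes"])
```

The same pair construction applies unchanged to Urdu tokens, since it operates on token positions rather than on any language-specific property.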
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experimental studies</head><p>We present experiments to demonstrate the effectiveness of the topic modeling techniques defined above. These experiments were performed using the two corpora discussed in the following section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Dataset</head><p>The Urdu language does not have a benchmark dataset for NLP tasks. Therefore, we created our own corpus, which contains Urdu text articles from the widely known news website https://www.express.pk. It is now publicly available at https://github.com/Mubashar331/Urducorpus for research purposes. The collected dataset has five categories of documents: Health, Sports, Science, Entertainment, and Business. We also collected an English dataset with four categories. After completing the preprocessing steps discussed in the above subsections, we applied the topic modeling techniques defined above to these corpora and evaluated their performance. Table <ref type="table" target="#tab_0">1</ref> briefly describes the corpora. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Experiments</head><p>We present three experiments to evaluate the performance of the topic modeling techniques. These experiments are performed on the corpora discussed in the above subsection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1">Experiment 1: Topic Modeling by word2vec</head><p>The purpose of this experiment is to evaluate the accuracy of word2vec on Urdu text documents. Word2vec has two types of architecture, the Skip-Gram model and Continuous Bag of Words (CBOW), for obtaining feature vectors. In this study, we use the CBOW model to build vectors for the given pre-processed dataset. Then, we cluster the obtained feature vectors using the K-means method. The process of topic modeling by word2vec is shown in figure <ref type="figure" target="#fig_6">7</ref>. In the final experiment, we compare the accuracy of our proposed model with the ULDA model proposed in <ref type="bibr" target="#b14">[15]</ref>. That model was proposed to discover topics from a corpus of Urdu news articles. We employ the ULDA model on our own corpus and ran it several times to evaluate the topics discovered from Urdu text documents.</p></div>
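The clustering step of Experiment 1 can be sketched as follows. This is a minimal K-means (Lloyd's algorithm) over toy 2-D vectors standing in for real word2vec embeddings, which would have hundreds of dimensions; the `kmeans` helper is illustrative, not the paper's implementation:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal K-means (Lloyd's algorithm) for clustering word vectors.
    For determinism, the first k points serve as initial centers."""
    centers = X[:k].astype(float)
    for _ in range(iters):
        # Assign every vector to the nearest center (squared Euclidean distance).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of the vectors assigned to it.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated toy groups stand in for two topics.
X = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.2], [5.2, 4.9],
              [0.2, 0.1], [4.9, 5.1], [0.0, 0.3], [5.1, 5.2]])
labels, centers = kmeans(X, k=2)
```

Each resulting cluster of word vectors is then read as one candidate topic, whose member words are inspected for coherence.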
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Evaluation and Result Discussion</head><p>In the NLP research community, the evaluation of topics discovered by a topic model is considered an open challenge <ref type="bibr" target="#b12">[13]</ref>. Some researchers use internal evaluation methods such as perplexity or likelihood to evaluate topic models, but these methods cannot measure the consistency of the discovered topics. Through large-scale user studies, the author of <ref type="bibr" target="#b12">[13]</ref> argued that topic models that perform well on perplexity or likelihood can fail to produce coherent topics. Therefore, we do not use these evaluation methods for the topic models defined above.</p><p>We evaluate the topic models using a manual evaluation technique named the Coherence Measure (CM), the ratio of relevant words to total candidate words of a topic <ref type="bibr" target="#b13">[14]</ref>. For this manual evaluation, we take the 20 highest-valued words of every topic and ask 5 students to examine and label them. First, they examine the words to label a topic as interpretable or irrelevant. When a topic is interpretable, they then identify which of its words are relevant. CM is calculated using equation 1, where x is the number of relevant words and n the total number of candidate words.</p><formula xml:id="formula_2">CM = x/n<label>(1)</label></formula><p>We present experiments to demonstrate the effectiveness of the topic modeling techniques based on CM. First, we find topics by word2vec, and most extracted topics are labeled irrelevant. Second, we extract five topics by LDA and ULDA from the Urdu corpus and find that all topics extracted by LDA are labeled irrelevant. We show seven words of one topic extracted by LDA and ULDA in figure <ref type="figure" target="#fig_8">9</ref>. The topic extracted by LDA is irrelevant and does not belong to any class of our given dataset. 
But the topic extracted by ULDA is relevant and belongs to the Health class of our given dataset. Finally, we apply the LDA and seeded-LDA topic modeling techniques to the English corpus and find that all extracted topics are labeled irrelevant. Then, we combine both models with Gibbs sampling (GS) and extract topics from the English corpus. The results demonstrate that topics extracted by seeded-LDA(GS) are more coherent compared to LDA(GS). As shown in table 2, the topics extracted by LDA and seeded-LDA do not belong to any class of the given corpus, but the topics extracted by LDA(GS) and seeded-LDA(GS) belong to the Health class of the English corpus.</p></div>
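Equation 1 and the reported gain can be checked with a short sketch using the per-annotator CM scores from Table 3:

```python
# Coherence Measure from Equation 1: CM = x / n, where x is the number of
# relevant words among the n top candidate words of a topic.
def coherence(relevant, candidates=20):
    return relevant / candidates

# Per-annotator CM scores for the Urdu corpus, as reported in Table 3.
ulda_cm   = [0.34, 0.28, 0.42, 0.39, 0.40]
seeded_cm = [0.49, 0.42, 0.57, 0.53, 0.55]

avg_ulda   = sum(ulda_cm) / len(ulda_cm)         # 0.366
avg_seeded = sum(seeded_cm) / len(seeded_cm)     # 0.512
improvement = (avg_seeded - avg_ulda) / avg_ulda  # relative gain over ULDA
```

The relative gain evaluates to roughly 0.40, consistent with the "more than 39%" improvement claimed in the paper.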
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>Regular unsupervised topic models might not produce coherent topics due to their purely unsupervised nature. Several knowledge-based topic models have been proposed to address this problem, but most of them target English. NLP research involving the Urdu language is comparatively hard due to particularities of Urdu such as its syntax, semantics, and morphological structure, and the main motivation behind this research is the lack of NLP resources and tools for Urdu. There are no significant studies in the literature on extracting coherent topics from Urdu texts. To meet the challenges of Urdu, we have proposed the topic model Seeded-ULDA to produce coherent topics for the Urdu language. To evaluate the performance and effectiveness of the proposed Seeded-ULDA model, we conducted three experiments using an Urdu dataset that we generated ourselves. The results demonstrate that unsupervised models produce less coherent or meaningless topics compared to the semi-supervised framework. First, we applied word2vec word embedding, and the results show that it is unable to extract coherent topics. In the second and third experiments, we applied Seeded-ULDA and ULDA, respectively; the results show that the semi-supervised Seeded-ULDA model produces topics that are more than 39% more coherent than those of the unsupervised ULDA model.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The graphical representation of topic modeling</figDesc><graphic coords="2,182.69,106.42,246.62,177.44" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Proposed methodology Seeded-ULDA</figDesc><graphic coords="3,203.24,311.14,205.51,337.99" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Urdu diacritics Example</figDesc><graphic coords="4,172.42,340.67,267.17,135.63" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Stopwords of Urdu</figDesc><graphic coords="5,203.24,106.42,205.51,195.34" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Tree representation of a document in seeded-LDA model</figDesc><graphic coords="5,203.24,546.92,205.51,93.56" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure6: The graphical model of seeded-LDA<ref type="bibr" target="#b5">[6]</ref> </figDesc><graphic coords="6,141.59,394.20,328.84,159.03" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: The process of topic modeling by word2vec</figDesc><graphic coords="8,162.14,197.42,287.72,196.26" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: First ten words of seeded topic of each class</figDesc><graphic coords="8,203.24,548.22,205.51,103.06" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 9 :</head><label>9</label><figDesc>Figure 9: Topic extracted by LDA and ULDA from Urdu corpus. Third, we extract five topics by Seeded-LDA and our proposed model Seeded-ULDA from the Urdu corpus and find that all topics extracted by Seeded-LDA are labeled irrelevant. As</figDesc><graphic coords="9,203.24,505.23,205.52,125.79" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 10 :</head><label>10</label><figDesc>Figure 10: Topic extracted by Seeded-LDA and Seeded-ULDA from Urdu corpus</figDesc><graphic coords="10,182.69,152.71,246.62,149.67" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Figure 11 :</head><label>11</label><figDesc>Figure 11: Influence of minimum documents frequency parameter</figDesc><graphic coords="11,182.69,397.15,246.61,148.17" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Detail of Dataset</figDesc><table><row><cell>Sr.</cell><cell>Corpora</cell><cell>Description</cell><cell>No. of Classes</cell><cell>Total words</cell></row><row><cell>1</cell><cell>Corpus 1</cell><cell>Urdu news articles from Urdu Express Newspaper</cell><cell>5</cell><cell>20289</cell></row><row><cell>2</cell><cell>Corpus 2</cell><cell>English news articles from different newspapers such as Dawn news, Express</cell><cell>4</cell><cell>11771</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Topic words extracted from English corpus. Now, we calculate the CM of Seeded-ULDA and ULDA. As can be seen in Table 3, the average CM of Seeded-ULDA surpasses that of the ULDA model. We also calculate the CM of Seeded-LDA(GS) and LDA(GS) on the English corpus; the results, shown in Table 4, demonstrate that Seeded-LDA(GS) gives better results than LDA(GS) based on CM. Finally, we examine the influence of the minimum document frequency (min-df) parameter on both models, Seeded-ULDA and ULDA, on the Urdu corpus. We set the value of min-df to 1 and 2 and examine the influence based on CM. As shown in Figure 11, both models produce more coherent topics with min-df = 2, and our proposed model Seeded-ULDA is better than ULDA.</figDesc><table><row><cell>LDA</cell><cell>LDA(GS)</cell><cell>seeded-LDA</cell><cell>seeded-LDA (GS)</cell></row><row><cell>Study</cell><cell>Brain</cell><cell>Pakistan</cell><cell>Cancer</cell></row><row><cell>Cancer</cell><cell>Health</cell><cell>Apple</cell><cell>Health</cell></row><row><cell>year</cell><cell>Risk</cell><cell>Tax</cell><cell>Blood</cell></row><row><cell>People</cell><cell>Says</cell><cell>Coronavirus</cell><cell>Found</cell></row><row><cell>Mice</cell><cell>People</cell><cell>Google</cell><cell>Disease</cell></row><row><cell>Week</cell><cell>Blood</cell><cell>Million</cell><cell>Studies</cell></row><row><cell>Blood</cell><cell>Research</cell><cell>People</cell><cell>Glucose</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Results comparison of ULDA and seeded-ULDA with Urdu corpus</figDesc><table><row><cell>Annotator</cell><cell>ULDA</cell><cell>seeded-ULDA</cell></row><row><cell>annotator 1</cell><cell>0.34</cell><cell>0.49</cell></row><row><cell>annotator 2</cell><cell>0.28</cell><cell>0.42</cell></row><row><cell>annotator 3</cell><cell>0.42</cell><cell>0.57</cell></row><row><cell>annotator 4</cell><cell>0.39</cell><cell>0.53</cell></row><row><cell>annotator 5</cell><cell>0.40</cell><cell>0.55</cell></row><row><cell>average</cell><cell>0.366</cell><cell>0.512</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 :</head><label>4</label><figDesc>Results comparison of LDA(GS) and seeded-LDA(GS) with English corpus</figDesc><table><row><cell>Annotator</cell><cell>LDA(GS)</cell><cell>seeded-LDA(GS)</cell></row><row><cell>annotator 1</cell><cell>0.39</cell><cell>0.55</cell></row><row><cell>annotator 2</cell><cell>0.44</cell><cell>0.47</cell></row><row><cell>annotator 3</cell><cell>0.41</cell><cell>0.52</cell></row><row><cell>annotator 4</cell><cell>0.37</cell><cell>0.58</cell></row><row><cell>annotator 5</cell><cell>0.43</cell><cell>0.50</cell></row><row><cell>average</cell><cell>0.408</cell><cell>0.524</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Latent Dirichlet Allocation</title>
		<author>
			<persName><forename type="first">David</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><surname>Jordan</surname></persName>
		</author>
		<idno type="DOI">10.1162/jmlr.2003.3.4-5.993</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="993" to="1022" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Ganz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Reinsel</surname></persName>
		</author>
		<title level="m">THE DIGITAL UNIVERSE IN 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the far East</title>
				<meeting><address><addrLine>Framingham</addrLine></address></meeting>
		<imprint>
			<publisher>IDC</publisher>
			<date type="published" when="2012-12">Dec. 2012</date>
			<biblScope unit="page" from="1" to="16" />
		</imprint>
	</monogr>
	<note type="report_type">Technical Report 1</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Griffiths</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jordan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename></persName>
		</author>
		<author>
			<persName><forename type="first">-A</forename></persName>
		</author>
		<ptr target="nips.cc" />
		<title level="m">Hierarchical topic models and the nested chinese restaurant process</title>
				<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lafferty</surname></persName>
		</author>
		<ptr target="papers.nips.cc" />
		<title level="m">Correlated topic models</title>
				<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Paisley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Blei</surname></persName>
		</author>
		<ptr target="jmlr.org" />
		<title level="m">Online variational inference for the hierarchical Dirichlet process</title>
				<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Incorporating lexical priors into topic models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Jagarlamudi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Daumé III</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Udupa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics</title>
				<meeting>the 13th Conference of the European Chapter of the Association for Computational Linguistics<address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012-04">Apr. 2012</date>
			<biblScope unit="page" from="204" to="213" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Hierarchical Topic Modeling for Urdu Text Articles</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ur Rehman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">H</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Aftab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Rehman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Shah</surname></persName>
		</author>
		<idno type="DOI">10.23919/IConAC.2019.8895047</idno>
	</analytic>
	<monogr>
		<title level="m">25th International Conference on Automation and Computing (ICAC)</title>
				<meeting><address><addrLine>Lancaster, United Kingdom</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Modelling of Topic from Hindi Corpus using Word2Vec</title>
		<author>
			<persName><forename type="first">Sabitra</forename><forename type="middle">Sankalp</forename><surname>Panigrahi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Second International Conference on Advances in Computing, Control and Communication Technology (IAC3T)</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Topic Modelling with Word Embeddings</title>
		<author>
			<persName><forename type="first">Fabrizio</forename><surname>Esposito</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>CLiC-it/EVALITA</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Distributed Representations of Words and Phrases and their Compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors</title>
		<author>
			<persName><forename type="first">Marco</forename><surname>Baroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Georgiana</forename><surname>Dinu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Germán</forename><surname>Kruszewski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACL (1)</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="238" to="247" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Optimizing semantic coherence in topic models</title>
		<author>
			<persName><forename type="first">D</forename><surname>Mimno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">M</forename><surname>Wallach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Talley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Leenders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mccallum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EMNLP</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="262" to="272" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Reading tea leaves: how humans interpret topic models</title>
		<author>
			<persName><forename type="first">Jonathan</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jordan</forename><surname>Boyd-Graber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sean</forename><surname>Gerrish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chong</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Blei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd Annual Conference on Neural Information Processing Systems</title>
				<meeting>the 23rd Annual Conference on Neural Information Processing Systems</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Integrating Document Clustering and Topic Modeling</title>
		<author>
			<persName><forename type="first">Pengtao</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eric</forename><surname>Xing</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A framework of Urdu topic modeling using latent dirichlet allocation (LDA)</title>
		<author>
			<persName><forename type="first">Khadija</forename><surname>Shakeel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ghulam</forename><forename type="middle">Rasool</forename><surname>Tahir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Irsha</forename><surname>Tehseen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mubashir</forename><surname>Ali</surname></persName>
		</author>
		<idno type="DOI">10.1109/CCWC.2018.8301655</idno>
	</analytic>
	<monogr>
		<title level="m">2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC)</title>
		<imprint>
			<biblScope unit="page" from="117" to="123" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Statistical Topic Modeling for Urdu Text Articles</title>
		<author>
			<persName><forename type="first">Anwar</forename><surname>Ur Rehman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zobia</forename><surname>Rehman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Junaid</forename><surname>Akram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Waqar</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Munam</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Muhammad</forename><surname>Salman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Markov chain monte carlo</title>
		<author>
			<persName><forename type="first">Walter</forename><forename type="middle">R</forename><surname>Gilks</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2005">2005</date>
			<publisher>Wiley Online Library</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Markov chain monte carlo</title>
		<author>
			<persName><forename type="first">Robert</forename><forename type="middle">P</forename><surname>Dobrow</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Introduction to Stochastic Processes With R</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="181" to="222" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Orthographic Diacritics and Multilingual Computing</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Wells</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Language Problems and Language Planning</title>
				<meeting>Language Problems and Language Planning</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Urdu language processing: a survey</title>
		<author>
			<persName><forename type="first">A</forename><surname>Daud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Che</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artif. Intell. Rev</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="279" to="311" />
			<date type="published" when="2017-03">Mar. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling</title>
		<author>
			<persName><forename type="first">M</forename><surname>Mustafa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ghulam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Muhammad Arslan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page">518</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
