<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analyzing the Punjabi Language Stemmers: A Critical Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Harjit Singh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>APS Neighbourhood Campus, Punjabi University</institution>
          ,
          <addr-line>Patiala</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>250</fpage>
      <lpage>258</lpage>
      <abstract>
        <p>Stemming is a procedure that reduces inflected words to their stems by removing affixes. It is widely used in Natural Language Processing systems developed for various languages such as English and Hindi. Punjabi is a low-resource language spoken in the northern regions of India and in Pakistan. In Pakistan, the Shahmukhi script is used to write Punjabi, while in India the Gurmukhi script is used. Various approaches are used for stemming natural language words, such as the brute force approach, the rule based approach and the statistical approach. For Gurmukhi-scripted Punjabi, only the brute force approach, the rule based approach, or a combination of the two has been used so far. This paper presents a comparative analysis of the stemmers developed so far for Gurmukhi-scripted Punjabi words, based on their methodology and accuracy. It is intended to motivate further research into more efficient stemmers for Punjabi and to help researchers compare and understand the approaches used and the problems faced in adopting each approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Stemming</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Suffix Stripping</kwd>
        <kwd>Punjabi Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Stemming is the procedure of reducing inflected
words to their stems by removing any affixes attached to
them. It is a very common and crucial task in Natural
Language Processing (NLP) applications.
Many morphologically similar words can be
reduced to the same stem word by
removing their suffixes or prefixes.
Stemming improves the efficiency of the subsequent processes
used for information retrieval. It also improves
the effectiveness of information retrieval by
helping to identify redundant information, to
remove irrelevant information from the results
and to optimize retrieval by placing highly
ranked information at the top [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Various approaches are used for stemming
natural language words, such as the brute force
approach, the rule based approach and the statistical
approach. In the brute force approach, a table
of inflected words along with their stem words
is maintained. To stem a word, the word is
searched for in the table. If a match
is found, the related stem word
is picked from the table. The approach is easy
to implement, and its accuracy depends on
the table of words available in the database. In
the rule based approach, affixes are removed
using fixed rules. Each input word is
checked for a known group of characters at its
end or beginning. That set of
characters (the affix) is removed to get the stem
word. For some words, an alternative set of
characters is then attached to complete the
word.</p>
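      <p>The brute force and rule based approaches described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any surveyed stemmer: the table entries and suffix rules are hypothetical romanized placeholders standing in for Gurmukhi strings, and the names STEM_TABLE, SUFFIX_RULES, brute_force_stem and rule_based_stem are introduced here for illustration only.</p>
      <preformat>
```python
# Toy data only: romanized placeholders stand in for Gurmukhi strings.
STEM_TABLE = {"munde": "munda", "kurian": "kuri"}   # inflected word: stem word
SUFFIX_RULES = [("ian", "i"), ("e", "a")]           # (suffix, substitute)


def brute_force_stem(word):
    """Table lookup: return the stored stem, or None when the word is absent."""
    return STEM_TABLE.get(word)


def rule_based_stem(word):
    """Strip the first matching suffix and attach its substitute, if any."""
    for suffix, substitute in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)] + substitute
    # No rule applies: the word is returned unchanged.
    return word
```
      </preformat>
      <p>The lookup can only be as good as its table, whereas the rules generalize to unseen words at the risk of stripping an ending that is not really a suffix.</p>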
      <p>
        In the statistical approach, a distance
measure is calculated between two words.
Morphologically similar words
have a low distance measure, whereas
morphologically unrelated words have a high
one. Using this measure, the
words of a document are grouped into
clusters, so that morphologically similar
words fall into the same cluster. The
central word of each cluster is then taken as the
stem word [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
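      <p>A rough sketch of the statistical idea follows. The distance used here (one minus the shared-prefix ratio), the greedy clustering and the threshold of 0.5 are all assumptions chosen for brevity; they are not the measures used in the cited work, and the English toy words are placeholders.</p>
      <preformat>
```python
def prefix_distance(a, b):
    """Distance in [0, 1]: 0 for identical words, 1 when no prefix is shared."""
    common = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        common += 1
    return 1.0 - common / max(len(a), len(b))


def cluster_words(words, threshold=0.5):
    """Greedy clustering: join the first cluster whose seed word is close enough."""
    clusters = []
    for word in words:
        for cluster in clusters:
            if threshold >= prefix_distance(word, cluster[0]):
                cluster.append(word)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([word])
    return clusters


def central_word(cluster):
    """Pick the member with the smallest total distance to the other members."""
    return min(cluster, key=lambda w: sum(prefix_distance(w, o) for o in cluster))
```
      </preformat>
      <p>Calling cluster_words on a word list and then central_word on each cluster yields one candidate stem per cluster, without any language-specific suffix list.</p>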
      <p>For Gurmukhi-scripted Punjabi,
only the brute force approach, the
rule based approach or a combination of the
two has been used so far.
This paper analyses these approaches and
compares them with each other based on their
methodology and accuracy. It is intended to motivate
further research into more efficient
stemmers for Punjabi and to help
researchers compare and understand the
approaches used and the problems faced in
adopting each approach.
      </p>
    </sec>
    <sec id="sec-2">
      <title>1.1. Punjabi Language</title>
      <p>Punjabi belongs to the Indo-Aryan
group of languages. Its sentence word order
is Subject-Object-Verb, and it is spoken by
almost 100 million people residing in India,
Pakistan and many other countries, who may be
called the Punjabi community. Table 1 shows the
number of Punjabi speakers in some countries.</p>
      <p>Punjabi is written in two different
scripts: Perso-Arabic and Gurmukhi. The
Perso-Arabic script, which is related to the Persian script, is also
called Shahmukhi, and Gurmukhi is derived
from Landa, an old script of the Indian
subcontinent.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Literature Review</title>
      <p>
        Jain et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] used an NLP technique to find
root words in the Hindi language. Each inflected Hindi
word was stemmed by removing the affix
from the end of the word. They used
prefix, suffix and root word lists to remove the
affixes and validate the root word. The stemmer was
tested using 800 inflected words and
achieved a hit ratio of 0.93. Yadav
et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposed a method for automatic
summarization of text documents. They
generated a network of related sentences based
on semantic and lexical relations. Two
sentences are related to each other with a
certain strength that represents the number of
relationships between the two. Gros et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
analysed the effect of anaphora resolution on the
performance of information retrieval.
They reported a positive
effect on the results, suggesting that the quality of
information retrieval results can be
significantly improved in this way.
      </p>
      <p>
        Agrawal et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposed a translation
tool to translate Sanskrit text to
Hindi. They used stemming to preprocess
the Sanskrit words before converting them to
Hindi, as a major phase in the
development of the translation tool. Any affixes on the
input Sanskrit words were removed before
the words were used to generate the equivalent Hindi
text. A Sanskrit corpus was used to
validate the stemmed words by searching for them
in the corpus.
      </p>
      <p>
        Bhadwal et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] developed a
Hindi-Sanskrit bilingual translation system. The
system was able to translate Hindi text to
Sanskrit and vice versa using a number of
modules. In the Sanskrit to Hindi translation
flow, they used stemming as a major step
before syntax analysis. The system was
developed in Java. Bhadwal et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] proposed
a Hindi to Sanskrit machine translation
system. The system includes a number of NLP
modules such as transliteration, POS-tagging,
root verb extraction etc. In root verb
extraction, the algorithm finds the root verb
before mapping to equivalent Sanskrit word.
      </p>
      <p>
        Although stemming has positive effects
in various NLP applications,
Wahbeh et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] claimed negative effects on
text classification in Arabic. They used
stemming as a preprocessing module in Arabic
text classification and found that results were
negatively affected when stemming was used.
      </p>
      <p>The review of literature shows that stemming
is used in a variety of NLP applications. This
motivated the analysis of stemmers developed
for the Punjabi language. Only a handful of
research papers were found for this low-resource
Asian language.</p>
    </sec>
    <sec id="sec-5">
      <title>3. Stemming in Gurmukhi Punjabi Linguistics</title>
      <p>In linguistics, an affix is either a suffix
or a prefix. In Punjabi, both types of
affixes are used to form the words used in
sentences. In most cases, however, a prefix changes
the meaning of the word, so in NLP
applications only suffixes are considered for
removal. Table 2 shows some inflected words
with their stem words. The table shows how
Punjabi words are handled linguistically:
some characters are removed and, if required,
an alternative set of characters is substituted to
complete the word.</p>
    </sec>
    <sec id="sec-7">
      <title>4. Analysis of Gurmukhi Punjabi Stemmers</title>
      <p>Regardless of the language, stemming is an
NLP process for finding root words. This paper
can therefore be followed even without
knowledge of Punjabi or the Gurmukhi
script, because the function of a stemmer is simply to
remove any affixes attached to a word, and
only a brief introduction to Punjabi
and the Gurmukhi script is needed. This paper
discusses the stemmers developed for Punjabi
written in the Gurmukhi script and
the methodologies used by
researchers to stem inflected words. In the rest of this
paper, Gurmukhi-scripted Punjabi
is referred to simply as Punjabi.</p>
      <p>The significance of this analysis is that it
discloses the gap in the NLP research for
Punjabi language. It will motivate further
research in developing more efficient
stemmers for Punjabi using new ideas or by
extending existing approaches.
      </p>
    </sec>
    <sec id="sec-8">
      <title>4.1. Earlier Stemmer for Punjabi</title>
      <p>
        Kumar et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed a Punjabi
stemmer based on a combination of the brute force
approach and the suffix stripping technique. The
researchers divided the system into three units:
the Input unit, the Output
unit and the Processing unit.
      </p>
      <p>The Input unit takes a Punjabi word and
sends it to the Processing unit, which
searches for the word in a database of Punjabi
inflected words and their stem words. If the
word is found in the database, the
corresponding stem word is taken as the
output. If no match is found in the
database, the suffix removal technique is
used to stem the word. The word ending is
searched for in a list of possible Punjabi suffixes.
If the word ending matches any of
the suffixes in the list, the
ending (suffix) is removed to get the stem
word. The steps are shown in Algorithm 1 and
represented graphically in Figure 1.</p>
      <p>Algorithm 1: Earlier Stemmer for Punjabi
1: Input Punjabi Word for stemming.
2: Search the word in database.
3: If Found; goto (6).
4: Search the suffix of word in suffix list.
5: If Found; remove the suffix.
6: Display stem word.
7: End</p>
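      <p>Algorithm 1 can be read as the following Python sketch. The database entry and suffix list are hypothetical romanized placeholders, not data from the original system, and the function name stem_algorithm_1 is introduced here for illustration.</p>
      <preformat>
```python
# Hypothetical stand-ins for the stemmer's database and suffix list.
DATABASE = {"munde": "munda"}     # inflected word: stem word
SUFFIX_LIST = ["an", "e"]         # suffixes are tried in this order


def stem_algorithm_1(word):
    # Steps 2-3: brute force search in the database.
    if word in DATABASE:
        return DATABASE[word]
    # Steps 4-5: fall back to suffix removal.
    for suffix in SUFFIX_LIST:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)]
    # No match anywhere: the same word is given as output.
    return word
```
      </preformat>
      <p>Because the suffix step only removes characters and never validates the result, a word missing from the database can be over-stemmed or under-stemmed at that step.</p>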
      <p>The proposed stemmer was evaluated using
three parameters: correctness,
effectiveness and performance. Its accuracy
depends on the number of words stored in
the database. The researchers stored 28000 words
with 7100 root words in the database so that the brute
force search could find the matching root word in the
first step, which reduced the chances of needing
the second step of suffix removal. The larger the
number of words in the database, the greater
the correctness of the system. For the database,
they took words from a Punjabi-English
dictionary and from some online blogging and news
websites.</p>
      <p>The effectiveness of the system is determined
by its behavior in abnormal
conditions, such as entering a word that does
not exist. In that case this stemmer simply displays the
same word as the output. The problems of
over-stemming and under-stemming can be
reduced further by providing a larger
database of words. However, if a word is not
available in the database, these problems may
arise in the second step, i.e. removing the suffix
from the inflected word.</p>
      <p>Performance of the system is determined
by its output accuracy. This system is totally
dependent on its database of 28000 words with
7100 root words. The accuracy was calculated
by dividing the number of correct outputs by
the total number of inputs. The calculated
average accuracy of the system was 80.73%.</p>
    </sec>
    <sec id="sec-10">
      <title>4.2. Improved Version of Earlier Stemmer</title>
      <p>
        Kumar et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed an improved
version of the Punjabi stemmer based on brute
force search with the suffix stripping technique. In
this version, a database is prepared
containing inflected words along with their
stem words. The stemmer tries to match the
input Punjabi word with the words stored in the
database. On finding a match, the stem word
retrieved from the database is given as output.
Algorithm 2 shows the working of the
stemmer. A graphical representation is shown
in Figure 2.
      </p>
      <p>Algorithm 2: Improved Version of Earlier
Stemmer
1: Input Punjabi word for stemming.
2: Search the Word in Exception List.
3: If Found; goto (9).
4: Search the word in database.
5: If Found; goto (8).
6: Search the suffix of word in suffix list.
7: If Found; remove the suffix.
8: Display stem word.
9: End</p>
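      <p>A minimal Python sketch of Algorithm 2 is given below. All list contents are hypothetical romanized placeholders, and returning an exception word unchanged is an interpretation of the jump to step 9; the original system's behavior may differ in detail.</p>
      <preformat>
```python
# Hypothetical stand-ins for the exception list, database and suffix list.
EXCEPTION_LIST = {"ne"}           # words that suffix removal would damage
DATABASE = {"munde": "munda"}     # inflected word: stem word
SUFFIX_LIST = ["an", "e"]


def stem_algorithm_2(word):
    # Steps 2-3: exception words skip stemming altogether.
    if word in EXCEPTION_LIST:
        return word
    # Steps 4-5: brute force search in the database.
    if word in DATABASE:
        return DATABASE[word]
    # Steps 6-7: suffix removal for words missing from the database.
    for suffix in SUFFIX_LIST:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)]
    return word
```
      </preformat>
      <p>Without the exception check, the placeholder word "ne" would lose its final character at the suffix step, which is the kind of over-stemming the exception list is meant to prevent.</p>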
      <p>[Figure 2: Flowchart of the improved stemmer. Node labels: Punjabi Word; Exception List Matching (Found / Not Found); Database Matching (Found / Not Found); Suffix Matching (Matched / No Match); Suffix Stripping; Stem Word; Exit.]</p>
      <sec id="sec-10-10">
        <p>The proposed stemmer was evaluated for
correctness, effectiveness and performance.
It used a database of 52000 words
from various fields, containing 14400
root words, which reduced the
chances of having to execute the suffix removal
procedure on an inflected word. Increasing the
number of words in the database
improves the correctness of the output. Words
were added to the database from various
sources such as
www.punjabionlinedictionary.com, Jag Bani
Newspaper, Pardeep Punjabi to English
Dictionary and some online blogging and news
websites.</p>
        <p>To measure the effectiveness of the
stemmer, its behavior was verified under unusual
conditions. For an invalid input word, the
stemmer returns the same word as output. A
word that does not exist in the database may be
over-stemmed or under-stemmed during suffix
removal. To overcome this problem, an
exception list of approximately 200 words was
created, containing the words that cause
over-stemming and under-stemming.</p>
        <p>The output accuracy is assured if the output
word is present as root word in the database.
The accuracy was calculated by dividing the
number of correct outputs by the total number
of inputs. The calculated average output
accuracy of the stemmer was 81.27%.</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>4.3. Stemmer for Only Nouns &amp; Proper Names</title>
      <p>
        Gupta and Lehal [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] proposed a technique
to stem nouns and proper names. Proper names are words that
are not available in dictionaries. A total of
17598 words were identified as proper names
from a Punjabi corpus, covering 13.84% of
the entire corpus.
      </p>
      <p>The original stemming algorithm consisted
of 19 steps. In the first 18 steps, suffix removal
and/or suffix substitution was performed based on
the suffix attached to the word. Some suffixes
were simply removed from the end of the word,
but in other cases the suffix was removed and
another suffix was substituted. In the 19th
step of the algorithm, the stemmed word was
searched for in the noun morph/names list. If matched, it was taken as
a Punjabi noun or proper name. Algorithm
3 shows the reconstructed steps of the
stemming process, and Figure 3 shows these steps
graphically.</p>
      <p>Algorithm 3: Reconstructed Algorithm of
Stemmer for Only Nouns &amp; Proper Names
1: Input Punjabi word for Stemming.
2: Match end of the word with expected suffix.
3: If Not Matched and more expected suffixes exist;
goto (2) with next expected suffix.
4: Remove matched suffix from word.
5: If required, substitute a suffix.
6: Find word in noun morph/names list.
7: If Not Found; goto (9).
8: Display stem word.
9: End.</p>
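      <p>The reconstructed Algorithm 3 can be sketched as follows. The two suffix rules and the noun list are hypothetical romanized placeholders and do not reproduce the 18 suffix steps of the original algorithm; returning None when validation fails is an interpretation of the jump to step 9.</p>
      <preformat>
```python
# Hypothetical (suffix, substitute) rules and a stand-in noun morph/names list.
SUFFIX_RULES = [("ian", "i"), ("e", "a")]
NOUN_LIST = {"kuri", "munda"}


def stem_noun_or_name(word):
    candidate = word
    # Steps 2-5: find the first matching suffix, remove it, substitute if needed.
    for suffix, substitute in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = word[: len(word) - len(suffix)] + substitute
            break
    # Steps 6-8: accept the candidate only if the list confirms it.
    return candidate if candidate in NOUN_LIST else None
```
      </preformat>
      <p>The final lookup is what separates this design from plain suffix stripping: a stripped form that is not a known noun or name is rejected rather than reported as a stem.</p>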
      <p>The stemmer was evaluated on 50 Punjabi
documents obtained from various Punjabi
news articles and other resources. It was found
to have an accuracy of 87.37%. The accuracy
was calculated by dividing the number of
correct outputs by the total number of inputs.
The error percentage of the stemmer was also
analyzed and
calculated separately for three different
reasons.</p>
    </sec>
    <sec id="sec-14">
      <title>4.4. Automatic Stemmer for All Category of Words</title>
      <p>
        Gupta [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed a complete stemmer
covering every category of Punjabi words,
including proper names, nouns, adjectives,
verbs, pronouns and adverbs. The suffixes in
each category were identified manually by
analysing a Punjabi news corpus.
      </p>
      <p>
        For each category of suffixes, a separate rule
based algorithm was written. For stemming
proper names and nouns, however, the researcher
reused the stemming algorithm from his
previously published paper [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The algorithm took a Punjabi
word and
searched for it in a Punjabi dictionary. If matched, the
same word was returned as the stem word.
Otherwise, four separate sub-stemmers were
applied one by one, starting with the “Proper
Names and Nouns Stemmer”. The
other three sub-stemmers worked in the same way as that
stemmer, except that each handled a different
category of words. After each sub-stemmer
was applied, the output word was searched for in a
dictionary. If the word was found in the
dictionary, it was taken as the output stem
word; otherwise an error message was
displayed because the actual
stem word could not be found. The overall stemming process
follows the steps shown in Algorithm 4. Figure 4
represents the steps graphically.
      </p>
      <p>Algorithm 4: Automatic Stemmer for All
Category of Words
1: Input Punjabi Word for stemming.
2: Find Word in noun morph OR Names
dictionary OR Dictionary.
3: If Found; goto (17).
4: Call (Proper Names and Nouns Sub-Stemmer).
5: Find Stemmed Word in noun morph OR Names dictionary.
6: If Found; goto (17).
7: Call (Verb Sub-Stemmer).
8: Find Stemmed Word in dictionary.
9: If Found; goto (17).
10: Call (Adjective/Adverb Sub-Stemmer).
11: Find Stemmed Word in dictionary.
12: If Found; goto (17).
13: Call (Pronoun Sub-Stemmer).
14: Find Stemmed Word in dictionary.
15: If Found; goto (17).
16: Show Error.
17: Display Word.
18: End.</p>
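      <p>The cascade in Algorithm 4 can be sketched as a list of sub-stemmers, each paired with the lexicon that validates its output. Everything below is an illustrative assumption: the rules and lexicon entries are romanized placeholders, each sub-stemmer is applied to the original input word (the source does not say whether the output of one feeds the next), and the error of step 16 is modelled as a LookupError.</p>
      <preformat>
```python
def make_sub_stemmer(rules):
    """Build a sub-stemmer from (suffix, substitute) pairs."""
    def sub_stemmer(word):
        for suffix, substitute in rules:
            if word.endswith(suffix) and len(word) > len(suffix):
                return word[: len(word) - len(suffix)] + substitute
        return word
    return sub_stemmer


# Hypothetical placeholders, not entries from the original resources.
NOUN_NAMES = {"munda", "kuri"}          # noun morph / names dictionary
DICTIONARY = {"kar", "changa", "us"}    # general dictionary

# (sub-stemmer, lexicon that validates its output), in the order of Algorithm 4.
PIPELINE = [
    (make_sub_stemmer([("ian", "i"), ("e", "a")]), NOUN_NAMES),  # names and nouns
    (make_sub_stemmer([("da", "")]), DICTIONARY),                # verbs
    (make_sub_stemmer([("e", "a")]), DICTIONARY),                # adjectives/adverbs
    (make_sub_stemmer([("nu", "")]), DICTIONARY),                # pronouns
]


def stem_all_categories(word):
    # Steps 2-3: a word already present in a lexicon is returned unchanged.
    if word in NOUN_NAMES or word in DICTIONARY:
        return word
    # Steps 4-15: try each sub-stemmer and validate its output.
    for sub_stemmer, lexicon in PIPELINE:
        candidate = sub_stemmer(word)
        if candidate in lexicon:
            return candidate
    # Step 16: no sub-stemmer produced a known stem.
    raise LookupError("no stem found for " + repr(word))
```
      </preformat>
      <p>Ordering matters in such a cascade: the same ending may be handled by several sub-stemmers, and the first one whose output is confirmed by its lexicon wins.</p>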
      <p>[Figure 4: Flowchart of the automatic stemmer for all categories of words. Node labels: Punjabi Word; Matching in Noun Morph / Names Dictionary / Dictionary (Matched / No Match); Proper Names and Nouns Sub-Stemmer; Match in Noun Morph / Names Dictionary; Verb Sub-Stemmer; Adjective/Adverb Sub-Stemmer; Pronoun Sub-Stemmer; Match in Dictionary; Error Message; Stem Word.]</p>
      <sec id="sec-14-13">
        <p>The overall accuracy of the
stemmer was 87.2% and the error percentage was
12.8%. The accuracy was calculated by
dividing the number of correct outputs by the
total number of inputs.</p>
      </sec>
    </sec>
    <sec id="sec-17">
      <title>4.5. Stemmer using WordNet Database</title>
      <p>
        Puri et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposed a modified version
of Punjabi stemmer that was developed by
Gupta et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. They utilized the Punjabi
Wordnet database. A Punjabi dictionary
database was prepared from Punjabi Wordnet
taken from Indian Language Technology, IIT
Bombay and Punjabi news articles. The
Wordnet provided 53902 unique words of
Punjabi language and 147942 distinct Punjabi
words were obtained from news articles. The
purpose of Punjabi dictionary database was to
match the stemmed word to ensure the
correctness of stemming.
      </p>
      <p>A Named Entity database was prepared
with 26992 distinct names used in the Punjabi
language. Most of the names were taken from
news articles and university rolls. This list was
prepared so that these words are
skipped during stemming, because stemming
Named Entities would change them
altogether. A suffix list consisting of 75
Punjabi suffixes was created
manually by analyzing news articles. It holds
the suffixes to be removed and the suffixes
to be substituted (if required) for stemming.
The substitution suffix was NULL if
no suffix substitution was needed.</p>
      <p>Some of the suffix stripping rules produced
false positives for a few words, yet
those rules were important for stemming a large
number of words. Those rules were analyzed
and an Exception list of the affected words was
generated.</p>
      <p>The algorithm begins by checking the
word length. If the length is less than a constant
minimum length, the word need not be
stemmed. Otherwise, the word is searched for in the
Named Entity database. If a match occurs, the
word need not be stemmed. If the word is not
found in the Named Entity database, it is
searched for in the Exception list. If a match occurs,
the word is ignored and not stemmed. If the
word is not found in the Exception list, its
suffix is removed by searching for the suffix in the
suffix list. The substitution suffix (if available
in the suffix list) is then added to the end of the
word. Finally, the stemmed word is searched for in the
dictionary and is taken as the stem word if
it is found there. Algorithm 5 shows the
stemming process and Figure 5 represents it
graphically.</p>
      <p>[Figure 5: Flowchart of the stemmer using the WordNet database. Node labels: length &gt;= Min.; Find in Named Entity List (Exist / Not Exist); Find in Exception List (Exist / Not Exist); Find in Suffix List (Exist / Not Exist); Replace with Substitution Suffix; Stemmed Word; Check in Dictionary.]</p>
      <sec id="sec-17-6">
        <p>Algorithm 5: Stemmer using WordNet
Database
1: Input Punjabi Word for stemming.
2: If Length (Word) &lt; Min-Length; goto(9)
3: If Word Found in Named Entity List;
goto(9)
4: If Word Found in Exception List; goto(9)
5: If Word Suffix Not Exist in Suffix List;
goto(9)
6: Replace Suffix with Substitution Suffix
7: If Stemmed Word Not exist in dictionary;
goto(9)
8: Output Stem Word
9: End.</p>
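        <p>A possible Python reading of Algorithm 5 is shown below. The minimum length of 3, the list contents and the suffix rules are hypothetical placeholders, and returning the input word unchanged whenever the algorithm jumps to step 9 is an interpretation, since the listing itself produces no output in those cases.</p>
        <preformat>
```python
MIN_LENGTH = 3                                # assumed value of the constant
NAMED_ENTITIES = {"harjit"}                   # stand-in Named Entity list
EXCEPTIONS = {"bare"}                         # stand-in Exception list
SUFFIX_RULES = [("ian", "i"), ("e", "a")]     # (suffix, substitution suffix or "")
DICTIONARY = {"kuri", "munda", "bara"}        # stand-in dictionary


def stem_algorithm_5(word):
    # Step 2: words shorter than the minimum length are not stemmed.
    if MIN_LENGTH > len(word):
        return word
    # Steps 3-4: named entities and exception words are skipped.
    if word in NAMED_ENTITIES or word in EXCEPTIONS:
        return word
    # Steps 5-6: replace the first matching suffix with its substitution suffix.
    for suffix, substitution in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + substitution
            break
    else:
        return word
    # Steps 7-8: accept the candidate only if the dictionary contains it.
    return candidate if candidate in DICTIONARY else word
```
        </preformat>
        <p>In this toy data the exception entry matters: without it, "bare" would be rewritten to "bara", which the dictionary check alone would accept.</p>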
        <p>The stemmer was tested with a sample of
1000 news articles. It was evaluated by computing
the F-Measure from the following
word counts:</p>
        <p>C = number of correctly stemmed words</p>
        <p>U = number of under-stemmed words</p>
        <p>O = number of over-stemmed words</p>
        <p>Recall = C / ( C + U ) (1)</p>
        <p>Precision = C / ( C + O ) (2)</p>
        <p>The calculated Recall was 94.83% and the Precision was 90.84%.
From (1) and (2):</p>
        <p>F-Measure = 92.79%.</p>
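        <p>The reported F-Measure can be checked with a short calculation. The sketch assumes the standard balanced F-Measure (the harmonic mean of Recall and Precision); the source does not spell the formula out, but this choice reproduces the reported figure.</p>
        <preformat>
```python
def f_measure(recall, precision):
    """Balanced F-Measure: the harmonic mean of recall and precision."""
    return 2 * precision * recall / (precision + recall)


# Recall and Precision as reported for the WordNet-based stemmer, in percent.
print(round(f_measure(94.83, 90.84), 2))  # 92.79
```
        </preformat>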
      </sec>
    </sec>
    <sec id="sec-18">
      <title>5. Discussion and Analysis</title>
      <p>
        From the analysis of these stemmers, it is
clear that each stemmer developed for the Punjabi
language improves accuracy in
its own way. The stemmer by Puri et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
is claimed to be the most accurate according to
the F-Measure calculated by its
authors. However, [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] does not
report an accuracy percentage as the other
researchers do. The other researchers calculated accuracy
by dividing the number of
correct outputs by the total number of inputs.
      </p>
      <p>
        For a better comparison, the accuracy of the stemmer
developed by Puri et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] can be calculated using the same
formula. Let the
word counts be as follows:
      </p>
      <p>C = correctly stemmed</p>
      <p>U = under-stemmed</p>
      <p>O = over-stemmed</p>
      <p>So, Input Word Count = C + U + O</p>
      <p>
        Using Calculations by Puri et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], from
equations (1) and (2) we
have:
      </p>
      <p>Recall = C / ( C + U ) × 100 = 94.83</p>
      <p>Precision = C / ( C + O ) × 100 = 90.84</p>
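      <p>Since Recall = C / ( C + U ) and Precision = C / ( C + O ), the accuracy C / ( C + U + O ) can be obtained from the two reported values alone, because 1 / Accuracy = 1 / Recall + 1 / Precision - 1. The sketch below applies this identity to the reported Recall of 94.83% and Precision of 90.84%; the resulting value is computed here from those figures and is not quoted from [14].</p>
      <preformat>
```python
def accuracy_from_recall_precision(recall, precision):
    """Accuracy C / (C + U + O), given Recall = C / (C + U) and
    Precision = C / (C + O), both expressed as fractions."""
    return 1.0 / (1.0 / recall + 1.0 / precision - 1.0)


# Reported Recall and Precision of the WordNet-based stemmer, as fractions.
accuracy = accuracy_from_recall_precision(0.9483, 0.9084)
print(round(100 * accuracy, 2))
```
      </preformat>
      <p>Under these assumptions the accuracy comes out at roughly 86.55%, which puts the stemmer on the same scale as the accuracy percentages reported for the other stemmers.</p>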
      <p>
        From the graph it is clear that the overall
accuracy of the stemmers has decreased continuously
since the stemmer [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, the stemmer [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
has its own limitations. It deals only with
nouns and proper names and does not stem
other categories of words, so its accuracy is not
comparable to that of stemmers that cover all
categories of words. The same stemmer is
reused by Gupta [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to develop a stemmer for
all categories of words. However, the original
paper by Gupta [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] mostly discusses
the “stemmer for nouns and proper
names” and explores the new research to a lesser
extent. In the abstract, the researcher
mentions only the accuracy of the reused
stemmer for nouns and proper names.
Although the algorithms for verbs,
adjectives/adverbs and pronouns are explained in
separate sections, they are not mentioned
in the results and discussion section apart from
the accuracy figures shown in a table. The
whole ‘results and discussion’ section deals
with the same reused stemmer, and the
conclusion section of the paper also refers
to the “stemmer for nouns and
proper names”. This indicates that this field of research
needs more efficient approaches that
can cope with the limitations of the existing
stemmers.
      </p>
    </sec>
    <sec id="sec-19">
      <title>6. Conclusion</title>
      <p>Stemming is a building block of NLP.
Research on stemming Punjabi words is limited to a
handful of research papers. The initiatives taken by
researchers are praiseworthy, but
this research field requires more effort
compared with the research on other languages
such as English and Hindi. Each stemmer
developed for the Punjabi language improves
accuracy in its own way. The
table lookup approach is used in almost all
Punjabi stemmers to check whether the
word is already a stem word or whether the word has been
correctly stemmed by the stemming process. For
these stemmers, table lookup is the only way
to check the correctness of the stemming
process. The rule based approach is also used
along with table lookup to obtain a stem word
by stripping suffixes from inflected words.
Suffix substitution is used to complete the
word, if required. This paper explored these
approaches analytically and compared their
accuracy. The significance of this
analysis is that it discloses the gap in NLP
research for the Punjabi language. There is scope
for research into more efficient stemmers
for Punjabi, using new ideas or
extending the existing approaches.</p>
    </sec>
    <sec id="sec-20">
      <title>7. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Braschler</surname>
          </string-name>
          , Barbel Ripplinger, “
          <article-title>How Effective is Stemming and Decompounding for German Text Retrieval?”</article-title>
          , Information Retrieval,
          <volume>7</volume>
          ,
          <year>2004</year>
          , pp.
          <fpage>291</fpage>
          -
          <lpage>316</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Majumder</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datta</surname>
            <given-names>K.</given-names>
          </string-name>
          , “
          <article-title>Statistical vs. Rule-Based Stemming for Monolingual French Retrieval”</article-title>
          , In: Peters C. et al. (eds) Evaluation of Multilingual and Multi-modal Information Retrieval.
          <source>CLEF 2006. Lecture Notes in Computer Science</source>
          , vol
          <volume>4730</volume>
          . Springer, Berlin, Heidelberg,
          <year>2007</year>
          . doi: 10.1007/978-3-540-74999-8_14
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Leena</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Prateek</given-names>
            <surname>Agrawal</surname>
          </string-name>
          , “
          <article-title>Text independent root word identification in Hindi language using natural language processing</article-title>
          ”,
          <source>International Journal of Advanced Intelligence Paradigms (IJAIP)</source>
          , Vol.
          <volume>7</volume>
          , No.
          <issue>3/4</issue>
          ,
          <year>2015</year>
          . doi:
          <pub-id pub-id-type="doi">10.1504/IJAIP.2015.073705</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Chandra Shakhar</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aditi</given-names>
            <surname>Sharan</surname>
          </string-name>
          , “
          <article-title>Automatic Text Document Summarization Using Graph Based Centrality Measures on Lexical Network</article-title>
          ”,
          <source>International Journal of Information Retrieval Research</source>
          ,
          <volume>8</volume>
          ,
          <issue>3</issue>
          (
          <month>July</month>
          <year>2018</year>
          ),
          <fpage>14</fpage>
          -
          <lpage>32</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.4018/IJIRR.2018070102</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Gros</surname>
          </string-name>
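          ,
          <string-name>
            <given-names>Christine</given-names>
            <surname>Meschede</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tim</given-names>
            <surname>Habermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Giulia</given-names>
            <surname>Kirstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Denise</given-names>
            <surname>Ruhrberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Adrian</given-names>
            <surname>Schmidt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tobias</given-names>
            <surname>Siebenlist</surname>
          </string-name>
          , “
          <article-title>Anaphora Resolution: Analysing the Impact on Mean Average Precision and Detecting Limitations of Automated Approaches</article-title>
          ”,
          <source>International Journal of Information Retrieval Research</source>
          ,
          <volume>8</volume>
          ,
          <issue>3</issue>
          (
          <month>July</month>
          <year>2018</year>
          ),
          <fpage>33</fpage>
          -
          <lpage>45</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.4018/IJIRR.2018070103</pub-id>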
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Prateek</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Leena</given-names>
            <surname>Jain</surname>
          </string-name>
          , “
          <article-title>Anuvaadika: Implementation of Sanskrit to Hindi Translation Tool Using Rule-Based Approach</article-title>
          ”,
          <source>Recent Advances in Computer Science and Communications</source>
          (
          <year>2020</year>
          )
          <volume>13</volume>
          :
          <fpage>1136</fpage>
          . doi:
          <pub-id pub-id-type="doi">10.2174/2213275912666181226155829</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Bhadwal</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madaan</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          “
          <article-title>Bilingual Machine Translation System Between Hindi and Sanskrit Languages</article-title>
          ”, In:
          <string-name>
            <surname>Luhach</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jat</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hawari</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            <given-names>XZ.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lingras</surname>
            <given-names>P.</given-names>
          </string-name>
          (eds)
          <source>Advanced Informatics for Computing Research. ICAICR 2019</source>
          .
          <series>Communications in Computer and Information Science</series>
          , vol
          <volume>1075</volume>
          . Springer, Singapore,
          <year>2019</year>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-981-15-0108-1_29</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Bhadwal</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madaan</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          “
          <article-title>A Machine Translation System from Hindi to Sanskrit Language using Rule based Approach</article-title>
          ”,
          <source>Scalable Computing: Practice and Experience</source>
          , Vol.
          <volume>21</volume>
          , pp.
          <fpage>543</fpage>
          -
          <lpage>554</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.12694/scpe.v21i3.1783</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Abdullah</given-names>
            <surname>Wahbeh</surname>
          </string-name>
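          ,
          <string-name>
            <given-names>Mohammed</given-names>
            <surname>Al-Kabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qasem</given-names>
            <surname>Al-Radaideh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Emad</given-names>
            <surname>Al-Shawakfa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Izzat</given-names>
            <surname>Alsmadi</surname>
          </string-name>
          , “
          <article-title>The Effect of Stemming on Arabic Text Classification: An Empirical Study</article-title>
          ”,
          <source>International Journal of Information Retrieval Research</source>
          , Vol.
          <volume>1</volume>
          , No.
          <issue>3</issue>
          ,
          <year>2011</year>
          . doi:
          <pub-id pub-id-type="doi">10.4018/IJIRR.2011070104</pub-id>
          .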
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Dinesh</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Prince</given-names>
            <surname>Rana</surname>
          </string-name>
          , “
          <article-title>Design and Development of a Stemmer for Punjabi</article-title>
          ”,
          <source>International Journal of Computer Applications</source>
          (ISSN:
          <issn>0975-8887</issn>
          ), Vol.
          <volume>11</volume>
          , No.
          <issue>12</issue>
          ,
          <month>December</month>
          <year>2010</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Dinesh</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Prince</given-names>
            <surname>Rana</surname>
          </string-name>
          , “
          <article-title>Stemming of Punjabi Words by using Brute Force Technique</article-title>
          ”,
          <source>International Journal of Engineering Science and Technology (IJEST)</source>
          (ISSN:
          <issn>0975-5462</issn>
          ), Vol.
          <volume>3</volume>
          , No.
          <issue>2</issue>
          ,
          <month>Feb</month>
          <year>2011</year>
          , pp.
          <fpage>1351</fpage>
          -
          <lpage>1358</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Vishal</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gurpreet Singh</given-names>
            <surname>Lehal</surname>
          </string-name>
          , “
          <article-title>Punjabi Language Stemmer for nouns and proper names</article-title>
          ”,
          <source>Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP)</source>
          , IJCNLP 2011, Chiang Mai, Thailand, November 8,
          <year>2011</year>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Vishal</given-names>
            <surname>Gupta</surname>
          </string-name>
          , “
          <article-title>Automatic Stemming of Words for Punjabi Language</article-title>
          ”,
          <source>Advances in Signal Processing and Intelligent Recognition Systems</source>
          , Vol.
          <volume>264</volume>
          , Springer International Publishing Switzerland,
          <year>2014</year>
          , pp.
          <fpage>73</fpage>
          -
          <lpage>84</lpage>
          , doi:
          <pub-id pub-id-type="doi">10.1007/978-3-319-04960-1_7</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Rajeev</given-names>
            <surname>Puri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P. S.</given-names>
            <surname>Bedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Vishal</given-names>
            <surname>Goyal</surname>
          </string-name>
          , “
          <article-title>Punjabi Stemmer Using Punjabi WordNet Database</article-title>
          ”,
          <source>Indian Journal of Science and Technology</source>
          , Vol.
          <volume>8</volume>
          (
          <issue>27</issue>
          ),
          <month>October</month>
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          , doi:
          <pub-id pub-id-type="doi">10.17485/ijst/2015/v8i27/82943</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>