=Paper=
{{Paper
|id=Vol-1353/paper_32
|storemode=property
|title=Towards Extracting Domains from Research Publications
|pdfUrl=https://ceur-ws.org/Vol-1353/paper_32.pdf
|volume=Vol-1353
|dblpUrl=https://dblp.org/rec/conf/maics/LakhanpalGA15
}}
==Towards Extracting Domains from Research Publications==
<pdf width="1500px">https://ceur-ws.org/Vol-1353/paper_32.pdf</pdf>
<pre>
                Towards Extracting Domains from Research Publications
                                               Shilpa Lakhanpal and Ajay Gupta
                                                       Department of Computer Science
                                             Western Michigan University, Kalamazoo, MI 49008
                                            shilpa.lakhanpal@wmich.edu; ajay.gupta@wmich.edu

                                                            Rajeev Agrawal
                                                 Department of Computer Systems Technology
                                          North Carolina A&T State University, Greensboro, NC 27411
                                                             ragrawal@ncat.edu


                               Abstract                                      Any given research paper is basically just a collection of
   Every research paper falls within specific subject areas,              words. When we read the paper, we might be able to deci-
   called domains, of a larger scientific field. In this paper,           pher what domains and subdomains it caters to. But the
   we present a technique for effectively mining scientific re-           ability to comprehend these topics could be based on our
   search papers for key domain areas. We combine techniques
                                                                          prior knowledge of what constitutes domain or subdomain
   from natural language processing and machine learning to
   create a unique method for extracting such domains. Using              areas. We will certainly be in error if we presumptuously
   preposition disambiguation helps us infer the meaning of               assume that any and every reader will be pre-equipped
   words or phrases based on their placement within text.                 with the correct understanding of whether a word or a
   Combining this knowledge with supervised learning such as              phrase is a domain, problem-area or technique.
   using a Naïve Bayes classifier helps us to classify phrases as
                                                                             In this work-in-progress paper, we propose an efficient
   domain areas within a scientific field. Thus in essence, our
   technique derives meaning from text and contributes effec-             technique for extracting domains from research papers.
   tively to the field of text analytics.                                 Our technique uses preposition disambiguation to provide
                                                                          insight into the meaning of text. We validate this meaning
                           Introduction                                   by using a supervised learning method. Promising domain
                                                                          extracting techniques can then be easily extended to dis-
Text analytics is the process of analyzing unstructured text
                                                                          cover trending domains in a scientific field.
data with a goal of deriving meaningful information. We
narrow our focus into a specific type of information that
we may seek from text and pay particular attention to data                                     Related Work
found in the research sphere in the form of scientific re-                A bootstrapping learning technique has been proposed by
search papers.                                                            (Gupta and Manning 2011) to extract items such as domain
   A domain refers to a particular branch of scientific                   areas, focus of research and techniques from research pa-
knowledge or scientific field. For example, the scientific                pers. Although the work provides key insights, their results
field of Computer Science has several domains such as                     are not that encouraging as they themselves claim that their
data mining, networking, operating systems, etc. The data                 system failed to correctly address patterns which it found
mining domain in turn has the subdomains such as pattern                  to be outside these three pre-defined categories. Analysis
recognition, machine learning, statistics, etc.                           of the results indicates that their technique for domain ex-
   The problem-area addressed in a paper is the focus of                  traction has high recall but suffers from low precision. Our
research described in that paper. Each research paper or a                proposed approach obtains good results of high precision
journal is written to demonstrate the work done by the au-                and high recall for correctly labelling domains.
thors to solve a particular problem, or to achieve a goal.                   Supervised learning for text classification has been
   For solving a problem, the researchers may apply known                 widely used in applications of Natural Language Pro-
techniques, or may even devise their own techniques.                      cessing (NLP). Hidden Markov Models (HMMs) are wide-
                                                                          ly used statistical tools for modeling generative sequences.
Copyright held by the author(s).                                          HMM has been used for sentence classification (Rong et
al. 2006), where the preferred sequential ordering of sen-     user’s need
tences in the abstracts of “Randomized Clinical Trial” pa-     Sentence
pers, facilitated its use. The sentences in the abstract are   A sequence of words that is complete in itself, containing a
supposed to be ordered in sequence of “background,” “ob-       subject and predicate, conveying a statement, or question,
jective,” “method,” “result” and “conclusion” and model        etc. and consisting of a main clause and, optionally, one or
states are aligned to these sentence types. Our approach       more subordinate clauses
does not depend on a generative process as the “domain”,
                                                               Clause
“problem-area” and technique can occur in any random
                                                               A unit of grammatical organization said to consist of a
order in a title. Hence our proposed approach targets more
                                                               subject and predicate
generic solutions.
   In our previous work (Lakhanpal, Gupta and Agrawal          Phrase
2014), we extracted the prevalent trends of research using a   A small group of words standing together as a conceptual
phrase-based approach. We take our work much further by        unit, typically forming a component of a clause
incorporating intelligent machine learning techniques to       m-gram:
extract meaningful domain areas from research papers.          A contiguous sequence of m words
                                                               Preposition:
                    Our Approach                               A word governing, and usually preceding, a noun or pro-
We describe a technique to extract meaning from the titles,    noun and expressing a relation to another word or element
keywords and abstracts of a collection of research papers.     in the clause
We extract this inherent meaning, which has been con-          Preposition with Intention Sense:
veyed by the respective authors themselves by making use       The preposition that indicates that the phrase following it
of the results of an NLP technique of preposition disam-       specifies the purpose (i.e., a result that is desired, intention
biguation. Thereafter our unique methodology succeeds in       or reason for existence) of an event or action
achieving good results using machine learning techniques.      Phrase of Interest (Interesting Phrase):
We effectively derive meaning from text without explicitly     A phrase that follows a preposition with intention sense
using the constructs of NLP.                                   and ends before the next preposition in the sentence or
   As described above, a problem-area is a current focus of    ends with the end of the sentence
any research, while the domain is the larger subject area
                                                               Derivative:
into which that and other related research work fall.
                                                               Keyword or keyword phrase which has one or more words
   But the distinction between a domain and a problem-
                                                               in common with the interesting phrases
area is not always well-defined. Sometimes a problem-area
that was initially a focus of small amount of research, over   Domain Word:
time, gains a lot of attention. Researchers begin to zoom in   A word that is or has a potential for naming a well-
on the minutiae and start generating new problem-areas.        accepted domain area, or is a part of a phrase denoting a
Thus what started as a problem-area has now become a           well-accepted domain area
domain in its own right.
   For the scope of this paper, however, we make no dis-       Using Preposition Sense Disambiguation
tinction between a domain and problem-area as our goal is      Semantics is a branch of linguistics that deals with the
to segregate the words / phrases depicting these two from      meaning of words and phrases in a particular context. For
the words depicting techniques or methods.                     the computer to understand language as humans do, one of
   Although our approach is extendible to any scientific       the steps is to elicit this semantic content. And towards
field, we conduct our preliminary experiments in the field     achieving this purpose, we need to understand how and in
of Computer Science.                                           what context, the prepositions are used.
                                                                  Various prepositions convey various meanings based on
Definitions                                                    the context they are used in. It is the placement and context
Word                                                           of prepositions that can provide valuable information to-
A single and distinct element of language which has a          wards the meaning of text. The “sense” (Boonthum, Toida,
meaning and is used with other words to form a sentence,       and Levinstein 2006) or the “relation” (Srikumar and Roth
clause or phrase                                               2013) communicated by the presence of various preposi-
                                                               tions within different groups of words has been investigat-
Stopword
                                                               ed. We wish to draw attention to the “intention” sense. For
Word in the language, such as “and”, “the”, which is very      example the intention sense is conveyed by the preposition
common, but of little value in selecting text meeting a        “for” in the phrase “mining for information”. (Boonthum,
Toida, and Levinstein 2006) refer to the “complement” of         above repository, we label it as a “Domain.” The non-
the preposition as conveying the “intention” or “purpose”.       appearance of any word from the domain list in a deriva-
In English the complement of a preposition refers to a noun      tive makes it a “Not Domain.”
phrase, pronoun, a verb, or adverb phrase following the             Next we delineate the features of the derivatives that
preposition. Technical paper titles generally focus on con-      help determine their likelihood of being the domains.
veying the gist of the paper, which will be achieved more           The various sessions of a conference group together the
likely by using technical terminology with less stress on        papers that deal with similar goals or topics. The session
nuances of English language such as adverbs or pronouns.         name or identifier captures each such topic for each group
Hence, for simplicity we pick the complement delimited at        in a synoptic form, logically making it a representative of
the other end by the next preposition or end of the title and    the domain of its group. While examining each derivative,
define it as an “interesting phrase”.                            if it has any word in common with a session identifier, we
   We hypothesize that the interesting phrases reflect the       record its feature as “Found in Session: True” and if not,
“purpose” or the goal of their respective papers as is vali-     we record its feature as “Found in Session: False”.
dated by their very definition and hence in most cases hint         Each derivative is a phrase of one or more words. The
upon the larger domains. This hypothesis is supported by         potential of any word of the derivative to be a domain
the important observation that the authors would probably        word can be heightened by its frequent occurrence across
want to highlight the goal of their research in their titles     different abstracts. We use abstracts because they are writ-
(Hertzmann 2010).                                                ten so as to contain an intelligent gist of a paper (Koopman
   We would like to emphasize that we want to retrieve the       1997), and hence are likely sections to look for domains.
generic part of the interesting phrase. Hence we fetch its       Different abstracts containing the same word can validate
part that is common with the keyword section of that pa-         the importance of a word, hence count of abstracts be-
per. The keyword section of a research paper is a section        comes a relevant feature. The count of abstracts containing
where the authors will enumerate the key phrases or key          at least one of the words in the derivative phrase is calcu-
words of their documents (Sherman 1996). Since titles tend       lated for each derivative.
to be unique, their constituents may not by themselves be
good representatives of general domain areas. The key-           Training the classifier
words on the other hand are commonly and widely used,            We extract the feature sets for the derivative data, and di-
well accepted set of general terms that authors use to label     vide them into a training set and a test set in the ratio of
their work. Hence they serve as generic terms that authors       70%-30% respectively. The training set is used to train a
might use to mention their domains, problem-areas and            new "naive Bayes" classifier.
techniques.
   Grammatically, the title of a paper could be a sentence,      Our Technique Exemplified
clause or phrase. We scan each title to find the prepositions
with intention sense. Next, we extract the interesting           We describe our process through an example. Figures 1 (a)
phrases that follow a preposition that conveys the intention     and (b) depict Use Case Diagrams showing the Steps in-
sense. The next step involves finding an intersection be-        volved in Extracting Derivatives. We use a title of a paper
tween the interesting phrases of each paper and its key-         from the ACM SIGKDD 2012 conference.
word section. In this step, we retain those keyword or key-
word phrases which have one or more words in common
with the interesting phrases. This resultant set or the deriv-
ative becomes the main element of our analysis.                                                Title

Supervised Classification
                                                                         A system for extracting top-K lists from the web
We classify each derivative as a “Domain” or “Not Do-
main”.
   We create a repository of domain areas in Computer                                    Interesting Phrase
Science from research and analysis of hot and trending
topics across various scientific conferences and journals.                          extracting top-K lists
This repository consists of a list of unigrams (1-grams).
These unigrams either as stand-alone or together with other
such members of this list signify well accepted domain                             Interesting phrase stemmed
areas and serve as domain words.                                                         extract top k list
   In analyzing each derivative, if it has any word from the


                                                                  Figure 1(a): Use Case Diagram showing the Steps involved in
                                                                                     Extracting Derivatives
                                                                   Although the final dataset of 272 is small, our results are
                                                                very encouraging. For 100 iterations, we get an average
                          Keywords
                                                                accuracy of 86.72 % for the classifier. Our point of conten-
                                                                tion was never the size of the dataset, rather the intelli-
        ["web information extraction", "top-k lists",
                                                                gence we derive from it, based on our technique. Our tech-
        "list extraction", "web mining"]                        nique has high precision and high recall as is demonstrated
                                                                by the values of precision = 0.90 and recall = 0.91 from one
                                                                such iteration.
                       Keywords Stemmed

           ["web inform extract", "top k list",                             Conclusions and Future Work
           "list extract", "web mine"]                          We have obtained encouraging results from our technique,
                                                                even though the experiments are limited to Computer Sci-
                                                                ence papers following a fixed format. Using preposition
                                                                disambiguation has helped us in extracting keywords (de-
                                                                rivatives) that depict domains.
                                                                   As future work, we wish to test our technique on a much
                          Derivatives                           more diverse dataset and evaluate technical robustness when the
                                                                papers do not have a fixed format. We further wish to ex-
        ["web inform extract", "list extract", "top k           tend this fusion of NLP with supervised classification and
        list"]                                                  develop methods for extracting techniques from scientific
                                                                papers. The keywords which were not recognized as deriv-
                                                                atives need to be evaluated as potential words denoting
                                                                techniques.
  Figure 1(b): Use Case Diagram showing the Steps involved in
                      Extracting Derivatives                                              References
                                                                Boonthum, C., Toida, S., and Levinstein, I. 2006. Preposition
                                                                  Senses: Generalized Disambiguation Model. In Proceedings of
                           Results                                the Seventh International Conference on Computational
We have programmed our technique in Python and also               Linguistics and Intelligent Text Processing (CICLing), Lecture
have employed some ready-made data mining packages.               Notes in Computer Science, Berlin: Springer, pp. 196-207.
Careful study of the preposition senses narrowed down by        Gupta, S., and Manning, C. D. 2011. Analyzing the Dynamics of
(Boonthum, Toida, and Levinstein 2006) has allowed us to          Research by Extracting Key Aspects of Scientific Papers. In
                                                                  Proceedings of the International Joint Conference on Natural
create our set of prepositions with intention sense, namely       Language Processing (IJCNLP), pp. 1-9.
[“for”, “to”, “towards”, “toward”].
                                                                Hertzmann, A. 2010. Writing Research Papers.
   In order to find well-accepted domain areas, we have           http://www.dgp.toronto.edu/~hertzman/courses/gradSkills/201
collected the topics from the Calls for Papers sections from      0/writing.pdf.
the IEEE International Conference on Data Mining series         Koopman, P. (CMU) 1997. How to Write an Abstract.
(ICDM), the IEEE International Conference on Data Engi-           http://users.ece.cmu.edu/~koopman/essays/abstract.html.
neering (ICDE), and the ACM SIGKDD International Con-           Lakhanpal, S., Gupta, A., and Agrawal, R. 2014. On Discovering
ference on Knowledge Discovery and Data Mining (KDD)              Most Frequent Research Trends in a Scientific Discipline using
from 2010-2014. Calls for papers for any conference               a Text Mining Technique. In Proceedings of the 52nd Annual
                                                                  ACM Southeast Conference, Kennesaw, GA: ACM, pp. 52:1-
contain topics under which papers are sought. Hence they are
                                                                  52:4.
one of the definitive sources of domains well-accepted by
                                                                Rong, X., Supekar, K., Huang, Y., Das, A., and Garber, A. 2006.
experts in the scientific field.                                  Combining Text Classification and Hidden Markov Modeling
   In a set of experiments, we collected data from the ACM        Techniques for Structuring Randomized Clinical Trial
SIGKDD International Conference on Knowledge Discov-              Abstracts. In Proceedings of the American Medical Informatics
ery and Data Mining (KDD) from years 2010-2014. This              Association (AMIA) Annual Symposium, pp. 824 - 828.
data includes 939 papers from all sessions including key-       Sherman, A. 1996. Some Advice on Writing a Technical Report.
note, panel, demonstration, poster, industrial and govern-        http://www.csee.umbc.edu/~sherman/Courses/documents/TR_
                                                                  how_to.html.
ment track apart from the regular research track sessions.
Out of the 939 paper titles, 367 have prepositions with in-     Srikumar, V., and Roth, D. 2013. Modeling Semantic Relations
                                                                   Expressed by Prepositions. Transactions of the Association for
tention sense. From the 367, we get 272 non empty deriva-          Computational Linguistics, 1: 231-242.
tives.

</pre>
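
The derivative-extraction steps of Figure 1 and the labelling and feature rules from the Supervised Classification section can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation. The set of intention-sense prepositions is taken from the Results section. The other names are assumptions made for this sketch: the OTHER_PREPOSITIONS set, the function names, and the sample domain words, session name and abstracts in the usage lines are all hypothetical. Stemming is not implemented here; the stemmed strings are copied from Figure 1.

```python
import re

# Prepositions with intention sense, as listed in the Results section.
INTENTION_PREPOSITIONS = {"for", "to", "towards", "toward"}

# Assumption: an illustrative set of other prepositions that end a phrase.
# The paper does not list the full set it uses.
OTHER_PREPOSITIONS = {"from", "in", "on", "of", "with", "by", "at", "via", "into"}
ALL_PREPOSITIONS = INTENTION_PREPOSITIONS | OTHER_PREPOSITIONS


def interesting_phrases(title):
    """Return phrases that follow an intention-sense preposition.

    Each phrase ends before the next preposition or at the end of the title.
    """
    phrases, current = [], None
    for token in title.split():
        word = token.lower().strip(".,:;?!")
        if word in ALL_PREPOSITIONS:
            if current:
                phrases.append(" ".join(current))
            # Start collecting only after an intention-sense preposition.
            current = [] if word in INTENTION_PREPOSITIONS else None
        elif current is not None:
            current.append(token.strip(".,:;?!"))
    if current:
        phrases.append(" ".join(current))
    return phrases


def words(phrase):
    """Lowercase a phrase and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", phrase.lower()))


def derivatives(stemmed_phrases, stemmed_keywords):
    """Keep keywords that share at least one word with an interesting phrase."""
    phrase_words = set()
    for phrase in stemmed_phrases:
        phrase_words |= words(phrase)
    return [kw for kw in stemmed_keywords if words(kw) & phrase_words]


def label(derivative, domain_words):
    """Label a derivative "Domain" if it contains any domain word."""
    return "Domain" if words(derivative) & domain_words else "Not Domain"


def features(derivative, session_name, abstracts):
    """Compute the two features described in the text.

    Assumption: all inputs have already been stemmed in the same way.
    """
    d = words(derivative)
    return {
        "found_in_session": bool(d & words(session_name)),
        "abstract_count": sum(1 for a in abstracts if d & words(a)),
    }


# Worked example using the title and stemmed strings from Figure 1.
title = "A system for extracting top-K lists from the web"
print(interesting_phrases(title))  # ['extracting top-K lists']

stemmed_keywords = ["web inform extract", "top k list", "list extract", "web mine"]
print(derivatives(["extract top k list"], stemmed_keywords))
# ['web inform extract', 'top k list', 'list extract']

# Hypothetical domain words, session name and abstracts, for illustration only.
print(label("list extract", {"extract", "mine"}))  # Domain
print(features("list extract", "web mine", ["we extract a list", "graph cluster"]))
```

The derivatives are returned in keyword order, so the result has the same members as Figure 1(b) but a different ordering. A classifier such as naive Bayes would then be trained on the feature dictionaries and labels; that step is left out of this sketch.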