=Paper=
{{Paper
|id=Vol-1353/paper_32
|storemode=property
|title=Towards Extracting Domains from Research Publications
|pdfUrl=https://ceur-ws.org/Vol-1353/paper_32.pdf
|volume=Vol-1353
|dblpUrl=https://dblp.org/rec/conf/maics/LakhanpalGA15
}}
==Towards Extracting Domains from Research Publications==
Shilpa Lakhanpal and Ajay Gupta, Department of Computer Science, Western Michigan University, Kalamazoo, MI 49008 (shilpa.lakhanpal@wmich.edu; ajay.gupta@wmich.edu)

Rajeev Agrawal, Department of Computer Systems Technology, North Carolina A&T State University, Greensboro, NC 27411 (ragrawal@ncat.edu)

===Abstract===
Every research paper falls within some specific subject areas, called domains, of a larger scientific field. In this paper, we present a technique for effectively mining scientific research papers for key domain areas. We combine techniques from natural language processing and machine learning to create a unique method for extracting such domains. Using preposition disambiguation helps us infer the meaning of words or phrases based on their placement within text. Combining this knowledge with supervised learning, such as a Naïve Bayes classifier, helps us classify phrases as domain areas within a scientific field. Thus, in essence, our technique derives meaning from text and contributes effectively to the field of text analytics.

===Introduction===
Text analytics is the process of analyzing unstructured text data with the goal of deriving meaningful information. We narrow our focus to a specific type of information that we may seek from text and pay particular attention to data found in the research sphere in the form of scientific research papers.

A domain refers to a particular branch of scientific knowledge or scientific field. For example, the scientific field of Computer Science has several domains such as data mining, networking, operating systems, etc. The data mining domain in turn has subdomains such as pattern recognition, machine learning, statistics, etc.

The problem-area addressed in a paper is the focus of the research described in that paper. Each research paper or journal article is written to demonstrate the work done by the authors to solve a particular problem or to achieve a goal. In solving a problem, the researchers may apply known techniques, or may even devise their own.

Any given research paper is basically just a collection of words. When we read the paper, we might be able to decipher what domains and subdomains it caters to, but the ability to comprehend these topics depends on our prior knowledge of what constitutes domain or subdomain areas. We will certainly be in error if we presumptuously assume that any and every reader is pre-equipped with the correct understanding of whether a word or a phrase is a domain, problem-area or technique.

In this work-in-progress paper, we propose an efficient technique for extracting domains from research papers. Our technique uses preposition disambiguation to provide insight into the meaning of text, and we validate this meaning using a supervised learning method. Promising domain extraction techniques can then be easily extended to discover trending domains in a scientific field.

===Related Work===
A bootstrapping learning technique has been proposed by (Gupta and Manning 2011) to extract items such as domain areas, focus of research and techniques from research papers. Although the work provides key insights, the results are not that encouraging: the authors themselves report that their system failed to correctly handle patterns falling outside these three pre-defined categories. Analysis of the results indicates that their technique for domain extraction has high recall but suffers from low precision. Our proposed approach obtains both high precision and high recall for correctly labelling domains.
Supervised learning for text classification has been widely used in applications of Natural Language Processing (NLP). Hidden Markov Models (HMMs) are widely used statistical tools for modeling generative sequences. An HMM has been used for sentence classification (Rong et al. 2006), where the preferred sequential ordering of sentences in the abstracts of "Randomized Clinical Trial" papers facilitated its use: the sentences in such abstracts are expected to follow the sequence "background," "objective," "method," "result" and "conclusion," and the model states are aligned to these sentence types. Our approach does not depend on a generative process, as the "domain," "problem-area" and technique can occur in any order in a title. Hence our proposed approach targets more generic solutions.

In our previous work (Lakhanpal, Gupta and Agrawal 2014), we extracted the prevalent trends of research using a phrase-based approach. Here we take that work much further by incorporating machine learning techniques to extract meaningful domain areas from research papers.

===Our Approach===
We describe a technique to extract meaning from the titles, keywords and abstracts of a collection of research papers. We extract this inherent meaning, which has been conveyed by the respective authors themselves, by making use of the results of the NLP technique of preposition disambiguation. Thereafter our methodology achieves good results using machine learning techniques. We effectively derive meaning from text without explicitly using the constructs of NLP.

As described above, a problem-area is a current focus of some research, while the domain is the larger subject area into which that and other related research work fall. But the distinction between a domain and a problem-area is not always well defined. Sometimes a problem-area that was initially the focus of a small amount of research gains, over time, a great deal of attention. Researchers begin to zoom in on the minutiae and start generating new problem-areas. Thus what started as a problem-area becomes a domain in its own right.

For the scope of this paper, however, we make no distinction between a domain and a problem-area, as our goal is to segregate the words and phrases depicting these two from the words depicting techniques or methods. Although our approach is extendible to any scientific field, we conduct our preliminary experiments in the field of Computer Science.
===Definitions===
;Word: A single and distinct element of language which has a meaning and is used with other words to form a sentence, clause or phrase.
;Stopword: A word, such as "and" or "the", which is very common in the language but of little value in selecting text meeting a user's need.
;Sentence: A sequence of words that is complete in itself, containing a subject and predicate, conveying a statement, question, etc., and consisting of a main clause and, optionally, one or more subordinate clauses.
;Clause: A unit of grammatical organization said to consist of a subject and predicate.
;Phrase: A small group of words standing together as a conceptual unit, typically forming a component of a clause.
;m-gram: A contiguous sequence of m words.
;Preposition: A word governing, and usually preceding, a noun or pronoun and expressing a relation to another word or element in the clause.
;Preposition with intention sense: A preposition indicating that the phrase following it specifies the purpose (i.e., a result that is desired, intention or reason for existence) of an event or action.
;Phrase of interest (interesting phrase): A phrase that follows a preposition with intention sense and ends before the next preposition in the sentence, or with the end of the sentence.
;Derivative: A keyword or keyword phrase which has one or more words in common with the interesting phrases.
;Domain word: A word that names, or has the potential for naming, a well-accepted domain area, or is part of a phrase denoting a well-accepted domain area.

===Using Preposition Sense Disambiguation===
Semantics is the branch of linguistics that deals with the meaning of words and phrases in a particular context. For a computer to understand language as humans do, one of the steps is to elicit this semantic content, and towards this purpose we need to understand how, and in what context, prepositions are used.

Various prepositions convey various meanings based on the context in which they are used, and it is the placement and context of prepositions that can provide valuable information towards the meaning of text. The "sense" (Boonthum, Toida, and Levinstein 2006) or the "relation" (Srikumar and Roth 2013) communicated by the presence of various prepositions within different groups of words has been investigated. We wish to draw attention to the "intention" sense. For example, the intention sense is conveyed by the preposition "for" in the phrase "mining for information". (Boonthum, Toida, and Levinstein 2006) refer to the "complement" of the preposition as conveying the "intention" or "purpose". In English, the complement of a preposition is a noun phrase, pronoun, verb, or adverb phrase following the preposition. Technical paper titles generally focus on conveying the gist of the paper, which is more likely achieved with technical terminology than with nuances of the English language such as adverbs or pronouns. Hence, for simplicity, we pick the complement, delimited at the other end by the next preposition or the end of the title, and define it as an "interesting phrase".

We hypothesize that the interesting phrases reflect the "purpose" or goal of their respective papers, as follows from their very definition, and hence in most cases hint at the larger domains. This hypothesis is supported by the observation that authors generally want to highlight the goal of their research in their titles (Hertzmann 2010).

We would like to emphasize that we want to retrieve the generic part of the interesting phrase. Hence we fetch the part that is common with the keyword section of the paper. The keyword section of a research paper is the section where the authors enumerate the key phrases or keywords of their document (Sherman 1996). Since titles tend to be unique, their constituents may not by themselves be good representatives of general domain areas. The keywords, on the other hand, are a commonly and widely used, well-accepted set of general terms that authors use to label their work. Hence they serve as generic terms that authors might use to mention their domains, problem-areas and techniques.

Grammatically, the title of a paper could be a sentence, clause or phrase. We scan each title to find the prepositions with intention sense. Next, we extract the interesting phrases that follow a preposition conveying the intention sense. The following step finds an intersection between the interesting phrases of each paper and its keyword section: we retain those keywords or keyword phrases which have one or more words in common with the interesting phrases. This resultant set, the derivative, becomes the main element of our analysis.
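To make these extraction steps concrete, the following is a minimal Python sketch, not the authors' implementation: the whitespace tokenization, the illustrative delimiter-preposition list, the use of NLTK's Porter stemmer, and the function names extract_interesting_phrases and extract_derivatives are all assumptions introduced here; only the intention-sense set ["for", "to", "towards", "toward"] is taken from the Results section.

<pre>
# Minimal sketch (assumptions noted above) of interesting-phrase and
# derivative extraction; not the authors' original code.
from nltk.stem import PorterStemmer

# Intention-sense prepositions reported in the Results section of the paper.
INTENTION_PREPOSITIONS = {"for", "to", "towards", "toward"}
# Illustrative, non-exhaustive list used only to delimit the end of an
# interesting phrase; the paper does not publish its delimiter list.
OTHER_PREPOSITIONS = {"from", "of", "in", "on", "with", "by", "at",
                      "via", "into", "through", "over", "under"}

stemmer = PorterStemmer()


def extract_interesting_phrases(title):
    """Return phrases that follow an intention-sense preposition and end
    at the next preposition or at the end of the title."""
    tokens = [t.strip(".,:;").lower() for t in title.split()]
    phrases, current = [], None
    for token in tokens:
        if token in INTENTION_PREPOSITIONS:
            if current:
                phrases.append(" ".join(current))
            current = []                      # start collecting a new phrase
        elif token in OTHER_PREPOSITIONS:
            if current:
                phrases.append(" ".join(current))
            current = None                    # phrase ends at the next preposition
        elif current is not None:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases


def stem_phrase(phrase):
    """Stem every word of a phrase, treating hyphens as word separators."""
    return " ".join(stemmer.stem(w) for w in phrase.replace("-", " ").split())


def extract_derivatives(title, keywords):
    """Keep the (stemmed) keywords that share at least one stemmed word with
    an interesting phrase of the title; this set is the derivative."""
    phrase_words = {w for p in extract_interesting_phrases(title)
                    for w in stem_phrase(p).split()}
    return [stem_phrase(k) for k in keywords
            if phrase_words & set(stem_phrase(k).split())]
</pre>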
===Supervised Classification===
We classify each derivative as a "Domain" or "Not Domain". We create a repository of domain areas in Computer Science from research and analysis of hot and trending topics across various scientific conferences and journals. This repository consists of a list of unigrams (1-grams). These unigrams, either stand-alone or together with other members of the list, signify well-accepted domain areas and serve as domain words. In analyzing each derivative, if it contains any word from this repository, we label it as a "Domain"; the absence of any word from the domain list makes it a "Not Domain".

Next we delineate the features of the derivatives that help determine their likelihood of being domains. The various sessions of a conference group together papers that deal with similar goals or topics. The session name or identifier captures each such topic for each group in synoptic form, logically making it a representative of the domain of its group. While examining each derivative, if it has any word in common with a session identifier, we record the feature "Found in Session: True", and otherwise "Found in Session: False".

Each derivative is a phrase of one or more words. The potential of any word of the derivative to be a domain word is heightened by its frequent occurrence across different abstracts. We use abstracts because they are written so as to contain an intelligent gist of a paper (Koopman 1997), and hence are likely sections in which to look for domains. Different abstracts containing the same word can validate the importance of that word, so the count of abstracts becomes a relevant feature. The count of abstracts containing at least one of the words in the derivative phrase is calculated for each derivative.

===Training the classifier===
We extract the feature sets for the derivative data and divide them into a training set and a test set in the ratio of 70% to 30%, respectively. The training set is used to train a naïve Bayes classifier.
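A minimal sketch of the labelling, feature extraction and training steps described above follows. It assumes NLTK's NaiveBayesClassifier as the ready-made package (the paper does not name the packages it used); domain_words, session_names and abstracts are placeholder inputs standing in for the domain-word repository, the session identifiers and the abstract collection, and are assumed to be preprocessed (stemmed) in the same way as the derivatives.

<pre>
# Sketch only; NLTK as the classifier package is an assumption, and the
# inputs below are placeholders for the data sources described in the text.
import random
import nltk


def derivative_features(derivative, session_names, abstracts):
    """Feature set for one stemmed derivative: whether any of its words occurs
    in a session identifier, and in how many abstracts its words appear."""
    words = set(derivative.split())
    in_session = any(words & set(name.lower().split()) for name in session_names)
    abstract_count = sum(1 for text in abstracts
                         if words & set(text.lower().split()))
    return {"found_in_session": in_session, "abstract_count": abstract_count}


def label(derivative, domain_words):
    """'Domain' if any word of the derivative occurs in the repository of
    domain words collected from calls for papers, else 'Not Domain'."""
    return "Domain" if set(derivative.split()) & domain_words else "Not Domain"


def train_classifier(derivatives, domain_words, session_names, abstracts):
    data = [(derivative_features(d, session_names, abstracts),
             label(d, domain_words)) for d in derivatives]
    random.shuffle(data)
    split = int(0.7 * len(data))                  # 70% training / 30% test
    train_set, test_set = data[:split], data[split:]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print("accuracy:", nltk.classify.accuracy(classifier, test_set))
    return classifier
</pre>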
===Our Technique Exemplified===
We describe our process through an example, using the title of a paper from the ACM SIGKDD 2012 conference. Figures 1(a) and 1(b) depict use case diagrams showing the steps involved in extracting derivatives.

Title: "A system for extracting top-K lists from the web"

Interesting phrase: "extracting top-K lists"

Interesting phrase stemmed: "extract top k list"

Figure 1(a): Use case diagram showing the steps involved in extracting derivatives.

Keywords: ["web information extraction", "top-k lists", "list extraction", "web mining"]

Keywords stemmed: ["web inform extract", "top k list", "list extract", "web mine"]

Derivatives: ["web inform extract", "list extract", "top k list"]

Figure 1(b): Use case diagram showing the steps involved in extracting derivatives.
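Running the hypothetical sketch from the Using Preposition Sense Disambiguation section on this example should reproduce the derivative set of Figure 1(b), up to ordering and the details of the stemmer:

<pre>
# Assumes extract_interesting_phrases and extract_derivatives from the
# earlier sketch are in scope.
title = "A system for extracting top-K lists from the web"
keywords = ["web information extraction", "top-k lists",
            "list extraction", "web mining"]

print(extract_interesting_phrases(title))
# expected: ['extracting top-k lists']
print(extract_derivatives(title, keywords))
# expected: ['web inform extract', 'top k list', 'list extract']
# (same set as Figure 1(b); ordering here follows the keyword list)
</pre>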
===Results===
We have implemented our technique in Python and have also employed some ready-made data mining packages. A careful study of the preposition senses narrowed down by (Boonthum, Toida, and Levinstein 2006) allowed us to create our set of prepositions with intention sense, namely ["for", "to", "towards", "toward"].

In order to find well-accepted domain areas, we collected the topics from the Calls for Papers of the IEEE International Conference on Data Mining (ICDM) series, the IEEE International Conference on Data Engineering (ICDE), and the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) from 2010-2014. The call for papers of a conference lists the topics under which papers are sought; hence calls for papers are one of the definitive sources of domains well accepted by experts in the scientific field.

In a set of experiments, we collected data from the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) for the years 2010-2014. This data includes 939 papers from all sessions, including the keynote, panel, demonstration, poster, industrial and government tracks, apart from the regular research track sessions. Out of the 939 paper titles, 367 have prepositions with intention sense, and from these 367 we obtain 272 non-empty derivatives.

Although the final dataset of 272 derivatives is small, our results are very encouraging. Over 100 iterations, we obtain an average classifier accuracy of 86.72%. Our point of contention was never the size of the dataset, but rather the intelligence we derive from it using our technique. Our technique has high precision and high recall, as demonstrated by the values of precision = 0.90 and recall = 0.91 from one such iteration.

===Conclusions and Future Work===
We have obtained encouraging results from our technique, even though the experiments are limited to Computer Science papers following a fixed format. Using preposition disambiguation has helped us extract keywords (derivatives) that depict domains.

As future work, we wish to test our technique on a much more diverse dataset and evaluate its technical robustness when papers do not follow a fixed format. We further wish to extend this fusion of NLP with supervised classification and develop methods for extracting techniques from scientific papers. The keywords which were not recognized as derivatives need to be evaluated as potential words denoting techniques.

===References===
Boonthum, C., Toida, S., and Levinstein, I. 2006. Preposition Senses: Generalized Disambiguation Model. In Proceedings of the Seventh International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), Lecture Notes in Computer Science, Berlin: Springer, pp. 196-207.

Gupta, S., and Manning, C. D. 2011. Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), pp. 1-9.

Hertzmann, A. 2010. Writing Research Papers. http://www.dgp.toronto.edu/~hertzman/courses/gradSkills/2010/writing.pdf.

Koopman, P. 1997. How to Write an Abstract. http://users.ece.cmu.edu/~koopman/essays/abstract.html.

Lakhanpal, S., Gupta, A., and Agrawal, R. 2014. On Discovering Most Frequent Research Trends in a Scientific Discipline using a Text Mining Technique. In Proceedings of the 52nd Annual ACM Southeast Conference, Kennesaw, GA: ACM, pp. 52:1-52:4.

Rong, X., Supekar, K., Huang, Y., Das, A., and Garber, A. 2006. Combining Text Classification and Hidden Markov Modeling Techniques for Structuring Randomized Clinical Trial Abstracts. In Proceedings of the American Medical Informatics Association (AMIA) Annual Symposium, pp. 824-828.

Sherman, A. 1996. Some Advice on Writing a Technical Report. http://www.csee.umbc.edu/~sherman/Courses/documents/TR_how_to.html.

Srikumar, V., and Roth, D. 2013. Modeling Semantic Relations Expressed by Prepositions. Transactions of the Association for Computational Linguistics, 1: 231-242.