A Comparative Review of Text Mining & Related Technologies

Roland Vasili
Dept. of Mathematics, Informatics & Physics, Faculty of Natural Sciences
University of Gjirokastra, 6001 Gjirokastra, Albania
rvasili@uogj.edu.al

Endri Xhina
Dept. of Informatics, Faculty of Natural Sciences
University of Tirana, 1001 Tirana, Albania
endri.xhina@fshn.edu.al

Ilia Ninka
Dept. of Informatics, Faculty of Natural Sciences
University of Tirana, 1001 Tirana, Albania
ilia.ninka@fshn.edu.al

Thomas Souliotis
Dept. of Informatics, University of Edinburgh
s1778881@sms.ed.ac.uk

Abstract

Text mining has become an established discipline in both research and business intelligence. It commonly refers to the method of extracting interesting information and knowledge from unstructured text. Society's future will be closely connected to the handling of large amounts of data. Information may be available in various ways, either freely on the Web or on social networks. Text mining is a multi-disciplinary field that draws on Data Mining, Computational Linguistics, Artificial Intelligence and Machine Learning, Statistics, Databases, Library and Information Sciences, and currently the new field of Big Data. Some of these disciplines will be compared based on the goals, data, algorithms, techniques and tools they use, as well as their outcomes. The similarity of all these subjects rests on two fundamental facts: (1) all of them develop methods and procedures to process data, and (2) any data processing algorithm or procedure may belong to some or even all of them. The differences are in their perspectives. This difference in perspectives does not affect the procedures, but it does affect the choice of them and, even more so, the interpretation of concepts and results.

1. Introduction

The Text Mining field covers a wide research area, and its methods can be applied in different contexts and for several purposes, depending on the needs of the specific task and the availability of data and expertise. Rather than giving an exhaustive list of techniques and research directions, Figure 1 shows that text mining is a composite discipline that overlaps several branches of science. To Fig. 1 of [Tal16] we have added the Knowledge Data Discovery (KDD) field from Fig. 4.1 in [Dea14], which helps us understand that the goal of all of these disciplines is knowledge discovery, and on this basis they will be compared.

[Figure 1: Multidisciplinary Nature of Text Mining (Composition of Fig. 1 in [Tal16] and Fig. 4.1 in [Dea14]). Fields shown: Statistics, AI & Machine Learning, Data Mining, Databases, Computational Linguistics, Library & Information Science, KDD. Text Mining sits at the center and covers Document Classification, Document Clustering, Information Extraction, Natural Language Processing, Concept Extraction, Information Retrieval and Web Mining.]

So, the goal of this paper is to outline the Text Mining landscape in contrast to the technologies it encompasses, such as Data Mining, Natural Language Processing, Information Retrieval, Information Extraction, Artificial Intelligence and Machine Learning. It also tries to depict the scale of, and potential for, scientific interaction with classic scientific areas such as Statistics.

1.1 Definition of Text Mining

Text mining (TM), also called Intelligent Text Analysis, Text Data Mining or Knowledge-Discovery in Text (KDT), is mainly used to denote the procedure of extracting interesting and non-trivial data and knowledge from unstructured text ([Gök15]). There are many more definitions of text mining, such as that of the Oxford English Dictionary: "the process or practice of examining large collections of written resources in order to generate new information, typically using specialized computer software". It covers a large set of related topics and algorithms for analyzing text, spanning various communities, including information retrieval, natural language processing, data mining, machine learning, and many application domains such as the web and biomedical sciences.

1.2 Text Mining Process

The text mining (TM) process can basically be summarized in the three steps below ([Kar05]):
• Document Collection
• Document Preprocessing
• Text Mining Operations

The steps of the TM process, with regard to the output of results, are shown in Figure 2.

[Figure 2: Text Mining Steps (According to the process of Knowledge Discovery). Stages shown: Text, Text Preprocessing, Text Transformation (Attribute Generation), Attribute Selection, Text Mining Techniques / Pattern Discovery, Interpretation / Evaluation.]

A TM system receives a collection of documents as input and then pre-processes each document by checking its format and character set. Next, these pre-processed documents go through the text analysis phase, in which the techniques are repeated until the required information is extracted. Figure 3 shows three techniques of text analysis, but other techniques may also be used, depending on the goal and the organization. Information derived from the extraction can be accessed by an information management system, producing valuable knowledge for the user of this system. Figure 4 shows the processing steps that a typical TM system follows.

[Figure 3: Text Mining Steps ([Lia12]). Stages shown: Documents Collection, Retrieve & preprocess documents, Analyze Text (Information Extraction, Clustering, Summarization), Information Management System, Knowledge / Wisdom.]

[Figure 4: Text Mining System ([Lia12]). Stages shown: Documents Collection, Information Retrieval (Retrieve & Preprocess Documents), Information Extraction (Feature Generation, Feature Selection), TM Techniques (Text Summarization, Topic Discovery), Information Management System (Knowledge).]

1.2.1 Document-Text Collection

The basic element of TM is the collection of documents of any text form. The number of texts in such collections may range from thousands to several millions. A text collection can be static or dynamic. In the static case, the original set of documents remains unchanged, while in the dynamic case documents are added or updated over time. Extremely large collections and rapidly changing text collections are considered challenges and constitute the main object of Text Mining systems. A notable example of a large dynamic collection of texts, used by millions around the world, is PubMed (US National Library of Medicine 2018)¹. It is an internet resource that includes literature references related to the biomedical and health sciences. It is worth pointing out that it includes over 25 million research reports in the biomedical field, to which roughly 35,000 to 40,000 new items are added each month. In addition, unstructured data and free text make up most of the data we encounter: this includes over 40 million articles in Wikipedia, 4.5 billion Web pages, about 500 million tweets a day, and over 1.5 trillion queries on Google in a year. Therefore, to initiate the TM process, the user has to choose the collection of texts on which the procedure will be based, and the variety of texts that will constitute the source of the data.

The subsequent process involves the TM system, which has the ability (with the help of knowledge-discovery algorithms) to quickly and efficiently identify patterns among a large number of natural-language texts. Realizing this, however, requires well-prepared text collections. For this reason, the most important part of the TM process is the pre-processing phase of the texts under examination, followed by the successful implementation of the knowledge-discovery algorithms.

¹ https://www.ncbi.nlm.nih.gov/pubmed/: Accessed 1-4-2018

1.2.2 Text Preprocessing

Though this is considered a preliminary step, conducted before Text Mining algorithms/methods are actually applied, it is a very important process. The routine itself is divided into a number of sub-methods, which in turn offer alternative algorithms with their own sets of advantages and disadvantages.

Most TM approaches are based on the idea that a text document can be described by the set of words contained in it, i.e. the bag-of-words representation. The preprocessing itself is made up of a sequence of steps ([Gup09]) (Figure 5). The steps are:
• Text Structure Removal.
• Tokenization.
• Stopwords Removal.
• Filtering (Removing terms based on their length).
• Filtering (Removing terms based on their frequency).
• Part Of Speech Tagging (Syntactical and Semantical Analysis).
• Stemming.
• N-grams.
• Term weighting.

[Figure 5: Text Preprocessing Steps. Stages shown: Files (HTML, PDF), Structure Identification, Structure Removal, Tokenization, Stopwords Removal, POS Tagging, Stemming.]

1.2.3 Text Representation

In order to apply TM techniques, the texts should be presented in a structured form. The most familiar method of text representation is the vector model. There are two main ways of representing text as vectors:
• Boolean Model.
• Term-Weight Model.

2 Text Mining and Data Mining

Data Mining (DM) is a subfield of computer science which combines many techniques from statistics, data science, database theory and machine learning.

DM is simply the process of gathering information from huge databases that was previously incomprehensible and unknown, and then using that information to make relevant business decisions. Put differently, data mining is a set of methods used in the process of knowledge discovery for distinguishing relationships and patterns that were previously unknown. The final goal is the description of existing database data as well as the forecasting and clarification of new data. We can therefore define data mining as a combination of various other fields such as artificial intelligence, database management, pattern recognition, data visualization, machine learning, statistical studies and so on. The primary goal of data mining is to extract information from various sets of data in an attempt to transform it into proper and meaningful structures for eventual use. It mainly includes procedures and tools for extracting patterns from a data set and relates exclusively to structured data. In recent years, however, interest has also shifted to unstructured data (e.g. texts, images, paperwork, web pages, etc.), resulting in knowledge discovery from text (Text Mining). This shift is very important since most data nowadays are in unstructured textual form ([Gri08]). For example, a text file contains a few structured elements such as author, title, date of creation etc., but it also contains large segments of unstructured text such as its summary and its contents. This requires both sophisticated linguistic and statistical techniques able to analyze unstructured text formats, and techniques that combine each document with actionable metadata.

TM is an intense cognitive process through which the user interacts with a collection of texts using a set of analysis tools ([Seh04]). Like DM, TM aims at extracting useful information from data sources through the recognition (identification) and examination of interesting patterns. In the case of TM, however, the data sources are text collections, and the interesting patterns are searched for in unstructured textual data ([Nah02]).

Given the above definition, it is argued that TM has its roots in the area of Knowledge Discovery (KD). Moreover, the same area also serves as the reference for the definition of DM. Consequently, TM is similar to DM, mainly because in both cases knowledge detection is based on processes of data preprocessing and on pattern searching algorithms. However, this similarity may lead to their differences being overlooked. Thus, the goal of the majority of the studies in those two areas is to identify and analyze these differences.

3 Text Mining vs. Data Mining

The method of Knowledge Discovery from Data, or Data Mining, namely finding useful patterns in the data, is a very good solution for collecting and storing a huge volume of data. Though the scope of its implementation is extensive, it is not a developing technology.

Instead, knowledge discovery from textual data, or Text Mining, is a new method in the field of Knowledge Discovery, which is feasible because the information to be extracted refers to text.

Knowledge discovery from text closely resembles the classical method of knowledge discovery from data, since both are based on knowledge management. However, [Fra92] and [Raj97] concluded that the difference between these two domains is the type of data they use for Knowledge Discovery (KD). Thus, while DM applies data extraction techniques to structured data, TM does the same for unstructured or semi-structured data, which is often referred to as textual data ([Gup09]).

Knowledge discovery from data is implemented in databases where the data is structured and described by a unique structure in which each instance of a problem is determined by a specific and fixed set of features ([Kan09]).

In the case of knowledge discovery from text, by contrast, the data is semi-structured or unstructured and cannot be described by any fixed set of features ([Liu11]). For this reason, the method tries to bring the text into a form to which its computational techniques can be applied directly.

In the case of knowledge discovery from texts, there are two approaches to the representation of the text. In the first approach, only the presence of a feature (word) in a text is taken into consideration. Thus, when a new instance of the problem occurs, what is checked is the presence of its features (words) in the different classes of the problem. The class in which most of the words are present is the desired class.

In the second approach, for each feature we keep the frequency of its appearance in a text. Thus, the class of a new instance is derived from the frequency with which the words of the text appear in the different classes of the problem. The class in which the words of the text appear most often and most frequently is the desired class.

In addition to the data type, [Dör99] distinguished these fields by the complexity of the steps followed for knowledge discovery. The general steps followed by DM are:
(1) identifying the data collection,
(2) preparation and feature selection, and
(3) distribution analysis.

Even though TM does not deviate from these steps, the selection of features is different, since it is not practical for a human to examine the features and decide which of them should be used.

The other point where they differ is distribution analysis, where high-dimensional vectors have to be treated. This implies that there must be special versions and implementations of DM algorithms. However, these differences do not prevent [Hea99] from declaring that TM is an extension of DM. It is not clear to what extent this statement is true, as there are no studies that agree or disagree with it. Some basic differences between TM and DM are also presented in the work of [Ber09] and are shown in Table 1. In Table 2 we show some additional features.

Table 1: Differences between TM and DM [Ber09]

Text Mining | Data Mining
Relies on unstructured or semi-structured data | Relies on fielded (structured) data
Term extraction takes place based on a semantic-based algorithm | Involves numerically based statistical analysis
Documents containing overlapping concepts can be organized together | Allows for temporal analysis
Documents containing overlapping concepts can be placed together partially | Clustering based on different coding
Involves co-occurrence matrices and histograms | -

Table 2: TM vs. DM

Basis for Comparison | Data Mining | Text Mining
Concept | Data mining is a spectrum of different approaches, which searches for patterns and relationships in data. | Text mining is a process required to turn unstructured text documents into valuable structured information.
Retrieval of data | Standard data mining techniques reveal business patterns in numerical data. | Standard text mining methods discover lexical & syntactic features in the text.
Type of Data | Discovery of knowledge from structured data, which are homogeneous and easy to access. | Discovery of knowledge from unstructured data, which are heterogeneous and more diverse.

However, [Fan06] considers Text Mining an interdisciplinary field based on other disciplines, such as Data Mining, Information Retrieval, Computational Statistics, Computer Science and Linguistics.

4 Text Mining vs. NLP

Natural language processing (NLP) is a subfield of computer science (CS), artificial intelligence (AI), and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding ([Nav18]), that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.

TM refers to a subset of data mining concerned with discovering knowledge from various sources, especially unstructured texts, which are still considered the largest easily accessible source of knowledge. In TM, the main problem arises when trying to extract explicit and implicit ideas, and the semantic links among different ideas, using NLP methods. The objective is to obtain a full understanding of vast amounts of text data. Many text mining algorithms make extensive use of NLP techniques, such as part of speech tagging (POS tagging), syntactic parsing and other types of linguistic analysis ([Kao07]). TM is closely connected to NLP, but it is also related to processes in statistics, machine learning, information extraction, information management etc. In its search for hidden information, TM plays a very important part in upcoming applications of the NLP field, like Text Understanding ([Sal18]).

Text Mining deals with the text itself, while NLP deals with the underlying/latent metadata. Answering questions about the frequency counts of words, the length of a sentence, the presence/absence of certain words etc. is actually text mining.

NLP, on the other hand, allows you to answer questions like:
- What is the sentiment?
- What are the keywords? (using POS tagging & parsers)
- What category of content does it fall under?
- Which are the entities in the sentence?

Text mining is the process of mining text in the context of data mining, when the only data we consider is text. Mining is about extracting useful information from the available data. The information could be patterns in text or matching structures, but the semantics of the text is not considered. The goal is not to make the system understand what the text conveys, but rather to provide information to the user based on a certain step-by-step process.

Natural language is what humans use for communication. Processing such data is NLP, where the data could be speech or text. Thus, the main goal is understanding the semantic meaning conveyed in it. This is why NLP cares about grammatical parts of speech and the lexical relations among them.

Speech recognition systems could be a part of NLP, but they have nothing to do with TM. It may seem that NLP is the more general, significant concept because it uses TM; however, it is actually the other way round. TM uses NLP, because it makes sense to mine the data when we understand the data semantically.

Table 3 shows the top 5 points of comparison between Text Mining and Natural Language Processing:

Table 3: TM vs. NLP Comparison²

Basis of Comparison | Text Mining | NLP
Goal | Extract high-quality information from unstructured & structured text. Information could be patterns in text or matching structures, but the semantics of the text is not considered. | Trying to understand what is conveyed in natural language by humans, whether text or speech. Semantic and grammatical structures are analyzed.
Tools | Text processing languages like Perl; Statistical models; ML models | Advanced ML models; Deep Neural Networks; Toolkits like NLTK in Python
Scope | Data sources are documented collections; Extracting representative features for natural language documents; Input for corpus-based computational linguistics | Data source can be any form of natural human communication, like text, speech, signboards etc.; Extracting semantic meaning and grammatical structure from the input; Making all levels of interaction with machines more natural for humans
Outcome | Explanation of text using statistical indicators like: Frequency of words; Patterns of words; Correlation within words | Understanding what is conveyed through text or speech, like: Conveyed sentiment; The semantic meaning of the text so that it can be translated to other languages; Grammatical structure
System Accuracy | Performance measurement is direct and relatively simple. Here we have clearly measurable mathematical concepts. Measures can be automated. | Highly difficult to measure system accuracy for machines. Human intervention is needed most of the time. For example, consider an NLP system which translates from English to Hindi. Automating the measurement of how accurately the system translates is difficult.

To conclude², both TM and NLP try to extract information from unstructured data. TM concentrates on text documents and mostly depends on statistical and probabilistic models to derive a representation of documents. NLP tries to get semantic meaning from all means of natural human communication, like text, speech or even an image. NLP has the potential to revolutionize the way humans interact with machines, e.g. Amazon Echo and Google Home.

5 Text Mining vs. Web Search

Text Mining is different from the concept referred to as Web Search. In addition to the differences between TM and DM explained above, [Gup09] tries to establish boundaries between TM and web search.

The main part of web search is the search engine. A search engine has three main parts: (1) Crawler: gathers the contents of all web pages (using a program called a crawler or spider), (2) Indexer: organizes the contents of the pages in a way that allows efficient retrieval (indexing), and (3) Ranker: takes in a query, determines which pages match, and shows the results (ranking and display of results).
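The three parts of a search engine described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how production engines work: the PAGES dictionary stands in for the web, and the function names crawl, build_index and rank, as well as the simple term-count scoring, are hypothetical choices made for this example only.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory "web": URL -> page text.
# A real crawler would fetch pages over HTTP and follow links.
PAGES = {
    "a.example/text-mining": "text mining finds unknown patterns in text",
    "b.example/data-mining": "data mining finds patterns in structured data",
    "c.example/cooking": "a recipe for bean soup",
}


def crawl(pages):
    """Crawler: gather (url, content) pairs. Here it only walks the dict."""
    return list(pages.items())


def build_index(documents):
    """Indexer: map each term to the pages containing it, with term counts."""
    index = defaultdict(Counter)
    for url, content in documents:
        for term in content.lower().split():
            index[term][url] += 1
    return index


def rank(index, query):
    """Ranker: score pages by the summed counts of the query terms."""
    scores = Counter()
    for term in query.lower().split():
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by URL.
    return [url for url, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


index = build_index(crawl(PAGES))
results = rank(index, "text mining")
print(results)
```

For the query "text mining", the sketch lists the text-mining page first and the data-mining page second, and leaves out the cooking page. Every result is a page that already exists in the collection, which is exactly the property of web search that the comparison with TM relies on.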
The difference from TM lies in the fact that internet users are searching for something that already exists, which has been found and was previously written by a person, while TM is aimed at detecting previously unknown information ([Gök15]). So, the problem in web search is to set aside the material that is not related to your needs and keep the essentials, in order to find the information you need.

² https://www.educba.com/important-text-mining-vs-natural-language-processing/: Accessed 5-9-2018

6 Text Mining vs. Information Retrieval

Information retrieval ([Rij79]) is used to search for documents or for information in documents. Generally, it is a subject of information science and computer science. Its main uses are access to books and journals from universities and public libraries, and its most notable application is web search engines. With the great growth of the web, a huge amount of information is available online for the daily user. A user will try to retrieve relevant information from web search engines with a question or a query. Information retrieval helps to return a set of documents that meets the requirements of the user's query.

The concept of information retrieval is really old. The first time that someone mentioned in a paper the ability of a computer to retrieve relevant pieces of information was in 1945, in the article "As We May Think" by Vannevar Bush [Sin01]. Since then, many other techniques have been proposed, up to the last two decades, in which web search engines have boosted the need for large-scale information retrieval systems.

There are different mathematical models for information retrieval. Common models are set-theoretic, algebraic and probabilistic models. Set-theoretic models represent documents as sets of words or phrases. Algebraic models convert documents and words into vectors, matrices and tuples. Probabilistic models treat information retrieval as probabilistic inference.

It is important to differentiate between TM and Information Retrieval (IR). We can say that TM represents a subsequent evolution (transformation) of IR.

In retrieving information, the search is conducted only for texts that already contain the answers to questions, rather than for new knowledge ([Hea99] & [Seh04]). In general, IR's goal is to extract all documents that are closest to the answer to a question. Thus, it is the activity of obtaining information resources (usually documents) relevant to an information need from a collection of information resources ([Fal95], [Man08]). Searches can be based either on metadata or on full-text indexing. Therefore, IR mostly focuses on facilitating information access rather than on analyzing information and finding hidden patterns, which is the main purpose of text mining. IR does not care a lot about processing or transforming text, whereas text mining can be considered as going beyond information access to further help users analyze and understand information and to ease decision making. Table 4 illustrates some of the differences between TM and IR:

Table 4: Differences between TM and IR

Basis of Comparison | TM | IR
Goal | Extract high-quality information from unstructured & structured text. Information could be patterns in text or matching structures, but the semantics of the text is not considered. | Finding answers and information that already exist in a system. Creating answers and new information by analysis and inference, based on a query.
Scope | Data sources are documented collections; Extracting representative features for natural language documents; Input for corpus-based computational linguistics | Unstructured information (text, images, sound, speech, video, email, Web, multimedia, ...); Structured information (DBMS, Data analysis systems, Expert systems)
Outcome | Explanation of text using statistical indicators like: Frequency of words; Patterns of words; Correlation within words | Text retrieval deals with computerized retrieval of machine-readable text; Speech retrieval deals with speech; Cross-language retrieval uses a query in one language & finds documents in other languages; Q-A IR systems retrieve answers from a body of text; Image retrieval finds images on a theme; IR can deal with any other kind of entity or object.

The most important distinction between TM and IR is the output of each process. In the IR process the result consists of documents, some of which may be clustered, ordered or scored, but in the end we have to read the documents to get the information. In contrast, the results of the TM process can be features, patterns, connections, profiles or trends, and to find the information we need, we do not necessarily have to read the documents.

7 Text Mining and Statistics

7.1 What is Statistics?

Statistics consists of a set of mathematical methods related to the collection, organization and analysis of data. These techniques (and more) are used to extract useful outcomes depending on our needs, and they fall into two main categories, the descriptive and the inferential.

In descriptive statistics the initial data are used only for processing and for producing some useful conclusions based on them. However, no forecasts are made from these data, and nothing is really inferred beyond some simple outcomes for the current data. Predictions are part of the second big category, inferential statistics, where useful estimations are made for future events based on the current data.

7.2 Statistics: The Science of Learning from Data

Statistics is another broad subject which deals with the study of data, is widely applied, and plays a very important role in all areas of science. Statistics provides the methodology for drawing conclusions from data. It offers different methods to gather data, analyze them and interpret results, and it is widely used by scientists, researchers, and mathematicians in solving problems.

Besides providing the methods for data collection and analysis, statistics helps to obtain information from numerical and categorical data. Categorical data are data that fall into distinct categories, e.g. a person's blood group or marital status.

Statistics is highly significant in data-related studies because it helps in:
• Deciding the type of data required to address a given problem
• Organizing and summarizing data
• Deciding the analysis to be done to draw conclusions from data
• Assessing the effectiveness of results and evaluating uncertainties

The methods provided by statistics include:
• Design, for planning and conducting research
• Description, which implies exploring and summarizing data
• Making predictions and inferences about the phenomena represented by data.

So, Statistics is essentially a part of the process of TM. It is the science of learning from data. It also provides tools and techniques for dealing with large amounts of data. Statistics includes a number of processes, like:
• The planning behind data collections
• Data management
• Drawing inferences from numerical data facts

7.3 Text Mining vs. Statistics

The scientific literature suffers from a lack of articles on comparisons such as TM and Statistics, and even on DM and Statistics. So, since TM is a subfield of DM, we base our comparison on DM, check whether it is also valid for TM, and then present it in the comparative Table 5³ below.

In practice, comparing Statistics means comparing what is defined in terms of a set of tools, namely those being taught in graduate programs, i.e. Probability theory, Real analysis, Measure theory, Asymptotics, Decision theory, Markov chains, Martingales, Ergodic theory, etc. The field of Statistics seems to be defined as the set of problems that can be successfully addressed with these and related topics ([Fri98]).

For this reason, our comparison will not be a thorough one based on multiple literature resources, but a somewhat simplified one. Yet this analysis will still be based on some scientific criteria. Our sources will be multiple web sources, but mainly three scientific articles: [Fri98] by Jerome H. Friedman of Stanford University, which explains the connection between Statistics and DM; [Sap00] by G. Saporta, which focuses on how DM could be used in official statistics; and [Has14] by Hassani, Saporta, and Silva, which presents a thorough review of published work to date on the application of data mining in official statistics, and on the identification of the techniques that have been explored.

³ https://www.educba.com/data-mining-vs-statistics/: Accessed 5-9-2018

Table 5: TM vs. Statistics Comparison Table

Text Mining | Statistics
Explorative: It digs out data first, then builds a model to discover novel patterns & make theories. | Confirmative: It provides theory first and then tests it using various statistical tools.
Data used is numeric or non-numeric. | Data used is numeric.
Inductive process (generation of new theory from data) | Deductive process (does not involve making any predictions)
Data collection is less important. | Data collection is more important. ([Sap00])
Involves data cleaning. | Clean data is used to apply the statistical method.
Needs less user interaction to validate the model, hence easy to automate. | Needs user interaction to validate the model, hence difficult to automate.
Suitable for large data sets | Suitable for smaller data sets
It is an algorithm which learns from data without using any programming rule. | Formalization of relationships in data in the form of mathematical equations.
Uses heuristic thinking (rules used to form judgments and make decisions) | Does not have scope for heuristic thinking.
Classification, Clustering, Summarization, Estimation, Association Rules, Topic Modeling, Visualization | Descriptive Statistics, Inferential Statistics
Financial Data Analysis, Retail Industry, Telecommunication Industry, Biological Data Analysis, Certain Scientific Applications etc. | Demography, Actuarial Science, Operations Research, Biostatistics, Quality Control etc. ([Has14])

8 Conclusion

In this article we attempted to briefly describe the differences between Text Mining and other related disciplines, while keeping the presentation concise.

In summary, TM and all these sciences (even statistics) may seem indistinguishable due to their close connection. It is clear, however, that statistics is actually a tool or method for all these sciences, while most of them spread over a wide domain in which a statistical method is an essential component. Text Mining has developed recently with big data and will continue to grow in the following years, as data growth seems to be never-ending. This also applies to the other disciplines, which means that the data driving the algorithms, methods and decisions need to be of high quality. Nonetheless, all the disciplinary fields described briefly in this review cover the major areas of working with data and the problems in various areas related to these data. The emerging picture reveals a blend of theory and practice that reflects each discipline rather than a unified system. Hopefully, a productive merging of TM approaches through increased cross-disciplinary research can develop and advance not only TM but all these fields. The rate of change in the text mining field is so rapid that the information is likely to be measurably different in the following years.

References

[Ber09] C. Berkouwer. Master Thesis: The Reflection of Foresight in Defense Policy Making: A Comparative Study of the United Kingdom and the United States, March 2009
[Dea14] J. Dean. Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners: pp 56. John Wiley and Sons, Inc., 2014
[Dör99] J. Dörre, P. Gerstl, R. Seiffert. Text Mining: Finding Nuggets in Mountains of Textual Data. KDD '99 Proceedings of the Fifth ACM SIGKDD Intern. Conference on K. Discovery and Data Mining: 398–401, August 1999
[Fan06] W. Fan, L. Wallace, S. Rich, & Z. Zhang. Tapping the Power of Text Mining. Communications of the ACM, 49 (9): 76–82, September 2006
[Fal95] C. Faloutsos, D. W. Oard. A survey of information retrieval and filtering methods. Technical Report. University of Maryland at College Park, MD, USA 1995
[Fra92] W. J. Frawley, G. Piatetsky-Shapiro, C. J. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, 13 (3): 57–70, September 1992
[Fri98] J. H. Friedman. Data Mining and Statistics: What's the Connection? Computing Science and Statistics Vol. 29 (1): 3-9, Ed. D. Scott 1998
[Gri08] S. Grimes. Unstructured data and the 80 percent rule. Clarabridge Bridgepoints newsletter 23, column, "Experts Corner: Seth Grimes.", August 2008
[Gök15] A. Gök, A. Waterworth, P. Shapira. Use of web mining in studying innovation. Scientometrics 102 (1): 653–671, Jan. 2015
[Gup09] V. Gupta, G. Lehal. A survey of text mining techniques and applications. Journal of Emerging Technologies in Web Intelligence, 1(1): 60–76, Academy Publisher, August 2009
[Man08] C. D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York 2008
[Nah02] U. Y. Nahm, J. R. Mooney. Text Mining with Information Extraction. Technical Report SS-02-06, Department of Computer Sciences, University of Texas, March 2002
[Nav18] Roberto Navigli. Natural Language Understanding: Instructions for (Present and Future) Use. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence: 5697-5702, Early Career, July 2018
[Raj97] M. Rajman, R. Besançon. Text mining: Natural language techniques and text mining
[Has14] H. Hassani, G. Saporta, E. S.
Silva. Data applications. Data Mining and Reverse Mining and Official Statistics: The Past, the Engineering: Searching for Semantics: IFIP Present and the Future. Big Data Vol. 2 (1): TC2 WG2.6 IFIP 7th Conference on Database 34-43, March 2014 Semantics (DS-7): 50-66, January 1997 [Hea99] M. A. Hearst. Untangling Text Data Mining. [Rij79] C. J. Van Rijsbergen. Information Retrieval, Proceedings of ACL ’99: the 37th Annual London: Butterworths, 2nd edition, November Meeting of the Association for computational 1979 Linguistics, University of Maryland, June [Sal18] S.A. Salloum, A.Q. AlHamad, M. Al-Emran, 1999 (invited paper) K. Shaalan. A Survey of Arabic Text Mining. [Kan09] Y. Kano, W. A. Baumgartner, L. McCrohon, Studies in Computational Intelligence, vol S. Ananiadou, K. B. Cohen, L. Hunter, T. 740: 417-431, Springer, Cham, January 2018 Tsujii. Data mining: concept and techniques. [Sap00] G. Saporta. Data Mining and Official Oxford Journal of Bioinformatics, Volume 25, Statistics. Quinta Conferenza Nationale di Issue 15: 1997-1998, August 2009 Statistica, ISTAT, Roma, November 2000 [Kao07] A. Kao, S. R. Poteet. Natural language [Seh04] A.K. Sehgal. Text Mining: The Search for processing and text mining. Springer, 2007 Novelty in Text. Ph.D. Comprehensive [Kar05] H. Karanikas, Th. Mavroudakis. Text Mining Examination Report, Dept. of Computer Software Survey. RANLP Text Mining Science, The University of Iowa, April 2004 Workshop No 1: 39-48, September 2005 [Sin01] A. Singhal. Modern Information Retrieval: A [Lia12] S. H. Liao, P. H. Chu, P. Y. Hsiao. Data Brief Overview. Bulletin of the IEEE Mining Techniques & Applications - A Decade Computer Society Technical Committee on Review from 2000 to 2011. Expert Systems Data Engineering 24 (4): 35–43, December with Applications, Vol. 39 (12): 11303–11311, 2001 Elsevier Ltd., September 2012 [Tal16] R. Talib, M. Kashif, Sh. Ayesha, F. Fatima. [Liu11] F. Liu, X. Lu. 
Survey on text clustering Text Mining: Techniques, Applications and algorithm. Proceedings of 2nd International Issues. International Journal of Advanced IEEE Conference on Software Engineering Computer Science & Applications Vol. 7(11): and Services Science (ICSESS), China, 901- 414-418, November 2016 904, 2011