Trend Mining with Semantic-Based Learning

                                  Olga Streibel

              Networked Information Systems, Free University Berlin,
                 Königin-Luise-Str. 24-26, 14195 Berlin, Germany
                           streibel@inf.fu-berlin.de
                              http://www.ag-nbi.de


      Abstract. Mining trends by analyzing text streams could enhance the
      standard trend analysis based on numeric data. The use of qualitative
      information in the process of trend recognition, in addition to that of
      quantitative data, requires new analysis techniques. Since Semantic Web
      enables the appropriate and advantageous formalization of knowledge, we
      propose to include formalized expert knowledge in the process of trend
      recognition. In this preliminary work, we introduce our approach based
      on Semantic Web technologies combined with Data Mining methods for
      mining trends in a given domain.

      Key words: trend mining, trend recognition, semantic technologies,
      pattern recognition, trend patterns, learning methods, trend pattern on-
      tology


1   Introduction

”Stock market news has gone from hard to find (in the 1970s and early 1980s),
then easy to find (in the late 1980s), then hard to get away from.”1
A huge amount of textual information, such as business news, is freely available
on the Internet2. This abundance makes access to new information, and to
previously hidden knowledge, far easier. On the other hand, in order to retrieve
the required information and discover the potential knowledge, we need to use
appropriate search and analysis techniques. Regarding business news and the
stock market, a ”human” specialist can deduce the information and knowledge
she needs for the prediction of market movements. However, this recognition
and comprehension process is very complex and requires experience as well as
initial context knowledge.
In our work, we concentrate on the trend mining process based on numeric data
and on textual information. Research projects like GIDA and TREMA have
shown that there is a huge demand for the research on and development of
1
  Peter Lynch, 2000, ”One Up On Wall Street: How To Use What You Already Know
  To Make Money In The Market”
2
  e.g. http://news.bbc.co.uk/2/hi/business,
  http://www.tagesschau.de/wirtschaft/index.html, http://faz.net

useful trend mining methods that are able to include analyses of textual
information in the process of trend recognition. In our work, we define
repositories consisting of quantitative data and qualitative data as simple
hybrid information systems. In specific application fields, e.g. financial
markets, the qualitative data is represented by financial news, whereas the
quantitative data consists of the numeric values of different trading
instruments. Consequently, we aim to use a text corpus consisting of financial
news in German3 and to correlate this corpus with the trading values of a chosen
financial instrument. In particular, we concentrate on the analysis of business
news filtered over a period of 12 months according to the trend segments deduced
from the market values of a trading instrument. The focus of our research is on
developing a solution to the trend mining problem in simple hybrid information
systems that combines a Data Mining approach with adequate Semantic Web
technologies. There are many other examples of simple hybrid information systems
in application areas like weather forecasting, traffic analysis, customer
opinion mining, etc. We will work on a solution that is applicable to those
different systems.
In the following, we briefly outline the idea of our novel approach to trend
mining. Section 2 gives an insight into research relevant to our work. In
section 3, we specify our definition of a ”trend” and outline the issues of our
research. After briefly describing different methods from Computational
Linguistics that can be partially applied to the trend mining problem, we
introduce the Extreme Tagging System (ETS) in 3.2. We close the section with a
short paragraph about learning methods that we aim to apply in the future.


2   Related Work

The research project GIDA4 [6][1] and its successor, TREMA5, concentrated on the
fusion of multimodal market data in order to mine trends on financial markets
(GIDA, TREMA) and in market research (TREMA). These projects provide us
with our research direction. Since we aim to focus only on a fraction of the
whole trend mining process, in particular on the search for trend-indicating
language patterns in news, we are not going to concern ourselves with the design
of a complex trend mining framework, as the TREMA project does. Similar to
TREMA, we use Semantic Web technologies in order to support textual trend
recognition. The difference lies in our idea of applying an ETS, as described
in section 3.2, instead of applying classic and hierarchical ontologies.
In [3], the concept of velocity density estimation is discussed for trend mining
in supermarket customers’ data. This work “provides the user generic tool to
understand, visualize and diagnose the summary changes in data characteristics”.
The aspects of dynamics and evolving data included in this research could also
3
  The corpus is available thanks to a cooperation with the German company
  neofonie GmbH
4
  Description online: www.computing.surrey.ac.uk/ai/gida
5
  Project website: www.trema-projekt.de

be important for our work. The authors of [16] introduce a simple and
interesting knowledge-based approach to kidney function monitoring in medical
diagnosis systems. In particular, the trends appear in the form of trend reports
which are computed from the numeric data and explained using a knowledge base.
The use of a semantic knowledge base will also be a part of our work. We are
going to use the knowledge base not only to explain the emerging trends but
also to learn from them. Trends based on keyword search statistics are well
visualized by the Google Trends [24] feature. Here, the trend mining of searches
actually shows anomalies appearing in the historic patterns of Google searches
on the Web. Searching for certain text patterns in the text corpus is also a
part of our work. The difference is that we aim to search for trend-indicating
keywords that have been learned from historic data using semantic, not only
statistical, methods.
Another interesting tool is BlogPulse [25], which identifies topics and subjects
that people are talking about in their blogs. BlogPulse illustrates a complex
trend concept: a trend is a phenomenon that consists of trend setters (blog
authors), detected topics, “buzz” words, etc. In our work, we assume a
simplified, data- and text-oriented trend definition that can be treated as a
fraction of the complex trend mining process.
Finally, the work described in [10] could be very useful for us. In particular,
the definitions of theme, theme life cycle, and theme snapshot could be
important for our approach.


3   Mining Trends

In order to analyze trends, we have to define what a trend is. Since we aim
first to ground our trend recognition process in the numeric data, we will treat
the given text stream in a similar way as we would a data stream. Trend analysis
based on time series distinguishes four major components or movements for
characterizing time-series data (trend or long-term, cyclic, seasonal, and
irregular movements) [8]. We refer to the long-term movements that can be
visualized by a trend curve. Based on the trend curve generated over
quantitative data, we identify time segments for those long-term movements that
can have positive or negative trend values (”ups” and ”downs” on the market).
Correlating these segments with the news stream, we identify a priori three
trend classes (positive, negative and neutral) and divide the news stream into
a three-category text corpus. Analyzing the text corpus, we will search for
specific, so-called trend-indicating keywords and statements. Trend-indicating
keywords from the financial market domain are, e.g., cut, concern, recession,
etc. These simple keywords are part of what we call trend-indicating language
patterns.
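The segmentation step can be illustrated with a minimal sketch. It is not the
method fixed by this work: we assume here that the trend curve is a trailing
moving average, that a day is labeled by the relative change of that curve, and
that the tolerance of 0.5% separating the neutral class is chosen freely. All
function names and the toy data are hypothetical. Consecutive days with the
same label would form one trend segment.

```python
from datetime import date


def moving_average(values, window):
    """Trailing moving average; early points use a shorter window."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def label_days(dates, prices, window=3, tolerance=0.005):
    """Label each day (except the first) by the relative change of the curve."""
    curve = moving_average(prices, window)
    labels = {}
    for i in range(1, len(curve)):
        change = (curve[i] - curve[i - 1]) / curve[i - 1]
        if change > tolerance:
            labels[dates[i]] = "positive"
        elif change < -tolerance:
            labels[dates[i]] = "negative"
        else:
            labels[dates[i]] = "neutral"
    return labels


def split_corpus(news, labels):
    """Divide (date, text) news items into the three trend classes."""
    corpus = {"positive": [], "negative": [], "neutral": []}
    for day, text in news:
        if day in labels:  # news without a labeled day is skipped
            corpus[labels[day]].append(text)
    return corpus


# Hypothetical toy data: six daily closing values and three news items.
dates = [date(2008, 1, d) for d in range(1, 7)]
prices = [100.0, 102.0, 105.0, 105.0, 101.0, 96.0]
news = [
    (date(2008, 1, 2), "Profits rise"),
    (date(2008, 1, 5), "Markets pause"),
    (date(2008, 1, 6), "Recession concern grows"),
]
labels = label_days(dates, prices)
corpus = split_corpus(news, labels)
```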
When analyzing the text corpus, we concentrate on the trend-indicating language
structure and on the characterization of this structure. First, we propose to
divide the identification of trend-indicating language patterns (in the
following also called simple trend patterns) into non-semantic feature
extraction and semantic feature annotation (more in sections 3.1 and 3.2).

In the following, we briefly describe the stages of our proposed approach to
trend recognition.


3.1    Non-semantic trend patterns

Since we analyze a given text corpus that is divided into trend classes, the
”simplest” method for identifying trend patterns is counting the most frequent
keywords or applying the TF-IDF method [15]. Different methods from text mining
can be successfully applied in order to identify keywords or simple statements
from the text corpus. However, we assume that not every keyword or statement
extracted from a given trend class in the text corpus is trend-indicating. The
interesting point is how to recognize whether given keywords or statements are
trend-indicating or not.
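As an illustration of the keyword counting mentioned above, the following
minimal sketch ranks the terms of each trend class by a TF-IDF weight. It rests
on assumptions that are not part of our approach as stated: each trend class is
treated as a single document, the weight is the plain relative term frequency
times log(N/df), and the tokenizer and the toy corpus are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer that keeps German umlauts."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def tfidf_per_class(corpus):
    """corpus maps a trend class to its texts; each class counts as one document."""
    counts = {
        label: Counter(tok for text in texts for tok in tokenize(text))
        for label, texts in corpus.items()
    }
    n_classes = len(counts)
    doc_freq = Counter()
    for class_counts in counts.values():
        doc_freq.update(class_counts.keys())  # in how many classes a term occurs
    scores = {}
    for label, class_counts in counts.items():
        total = sum(class_counts.values()) or 1
        scores[label] = {
            term: (freq / total) * math.log(n_classes / doc_freq[term])
            for term, freq in class_counts.items()
        }
    return scores


def top_keywords(scores, label, k=3):
    """Return the k highest-weighted terms of a class (ties broken alphabetically)."""
    ranked = sorted(scores[label].items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus, already divided into the three trend classes.
toy_corpus = {
    "positive": ["profits rise as market gains", "strong growth lifts market"],
    "negative": ["recession concern as banks cut jobs",
                 "market falls on recession fears"],
    "neutral": ["market closed for holiday"],
}
toy_scores = tfidf_per_class(toy_corpus)
```

A term such as "market", which occurs in every class, receives the weight zero
under this scheme, whereas a term concentrated in one class is ranked high.
Whether a highly ranked term is actually trend-indicating remains the open
question raised above.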
In particular, we rely on the observation that there are characteristic words
used in different domains to describe the customer’s opinion and/or her
sentiment [2][9][19]. Since most sentiment-indicating words are adjectives,
whereas nouns build the sentiment concepts, a possible and very simple trend
pattern in the text could consist of an adjective-noun word pair. Using WordNet6
or a Part-Of-Speech (POS) analysis, we could identify these pairs in the text
corpus. Nevertheless, we assume that the search for trend patterns requires more
complex text analysis than POS analysis. We assume that we should investigate
taxonomic and non-taxonomic relations between identified keywords or simple
statements. Additionally, we should consider the semantic orientation as
described in [7] and [19].
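The adjective-noun pattern can be sketched as follows. We assume that the text
has already been POS-tagged by some external tagger and that the tags "ADJ" and
"NOUN" are used; the tagger, the tag names and the example sentence are
assumptions made for illustration only.

```python
def adjective_noun_pairs(tagged_tokens):
    """Return (adjective, noun) pairs for every ADJ directly followed by a NOUN.

    tagged_tokens is a list of (word, tag) tuples produced by an external,
    unspecified POS tagger.
    """
    pairs = []
    for (word, tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag == "ADJ" and next_tag == "NOUN":
            pairs.append((word.lower(), next_word.lower()))
    return pairs


# Hypothetical tagged headline.
tagged = [
    ("Weak", "ADJ"), ("demand", "NOUN"), ("hits", "VERB"),
    ("German", "ADJ"), ("exporters", "NOUN"),
]
pairs = adjective_noun_pairs(tagged)
```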


3.2    Trend pattern ontology

The non-semantic trend feature extraction provides a basis for a trend pattern
structure. This can be useful both for analyzing trend patterns on the
non-semantic level and for creating a trend knowledge base that provides insight
into the general characteristics of the trend patterns. A knowledge base can be
realized as a classic ontology. We propose the application of an adapted Extreme
Tagging System (ETS) as a knowledge base for trend recognition. An ETS, as
introduced in [18], is an extension of collaborative tagging systems that allow
for the collaborative construction of knowledge bases. An ETS offers a superset
of the possibilities of collaborative tagging systems in that it allows us to
collaboratively tag the tags themselves, as well as the relations between tags.
ETS are not designed to produce exclusively hierarchical ontologies but strive
to allow the expression and retrieval of multiple nuances of meaning, or
semantic associations. Our purpose in this research is to use these novel
knowledge acquisition techniques, which are based on lightweight annotations in
social environments, in order to generate a semantic description for the
analyzed application field.
We will apply an adapted ETS in order to gain expert knowledge of trend
recognition in the business field. We expect that the use of an ETS will enable
an easy
6
    http://wordnet.princeton.edu/

retrieval and extraction of the expert knowledge in the form of an RDF triple
set. An initial set of tags (which should be tagged by experts in a given
domain) will be generated from the selected trend features that are extracted in
a non-semantic way from the text corpus (described in 3.1). Experts using the
ETS will play the ”association game” on the initial tag set. The created
association sets will be automatically converted to RDF data. The resulting RDF
triple set will then be used to generate a trend scheme. Furthermore, we will
use the data from the ETS as the input for another feature extraction from the
texts.
Combining the non-semantic search for trend patterns with the association sets
based on expert knowledge, we aim to create an appropriate semantic trend
pattern scheme, a trend pattern ontology, that will serve as input to a learning
algorithm.
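The conversion of association sets into RDF data could look like the following
minimal sketch. The export format of the ETS is not specified here, so we
assume that each expert association is available as a (tag, relation tag, tag)
tuple and serialize it as one N-Triples line. The namespace
http://example.org/trend# and the example associations are hypothetical.

```python
from urllib.parse import quote

BASE = "http://example.org/trend#"  # hypothetical namespace


def iri(name):
    """Turn a tag into an IRI reference in the hypothetical namespace."""
    local = quote(name.strip().replace(" ", "_"), safe="")
    return "<" + BASE + local + ">"


def to_ntriples(associations):
    """Serialize (subject tag, relation tag, object tag) tuples as N-Triples.

    Duplicate associations entered by several experts collapse into one
    triple; the lines are sorted to make the output deterministic.
    """
    lines = {f"{iri(s)} {iri(p)} {iri(o)} ." for s, p, o in associations}
    return "\n".join(sorted(lines))


# Hypothetical associations created by experts on the initial tag set.
associations = [
    ("recession", "indicates", "negative trend"),
    ("cut", "relatedTo", "recession"),
    ("recession", "indicates", "negative trend"),
]
ntriples = to_ntriples(associations)
```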


3.3   Learning Trends

Regarding the different learning methods available in machine learning
[11][14][21], we first propose to use the supervised learning approach. Since we
work with strictly separable text classes (texts with positive trend-indicating
patterns cannot belong to the neutral or negative trend category at the same
time), standard classification seems to be an appropriate form of learning for
the trend recognition problem, particularly where the trend classes’ ranges are
well separable.
In order to evaluate the advantages gained by adding semantics to the learning
process, we propose to use first decision trees (e.g. C4.5) or decision rules
[21], both of which allow the visualization of the learned model. Learning
trends with decision trees here means learning trend-indicating language
patterns that are expressed in RDF triples.
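C4.5 chooses the feature to split on by its gain ratio, i.e. the information
gain of the split divided by the entropy of the split itself. The following
minimal sketch computes this criterion for binary features. We assume, for
illustration only, that each news text is represented by the set of triples
(written as plain strings) that match it; this representation and the toy
examples are hypothetical, and the sketch covers the split criterion, not a
full decision tree learner.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(examples, feature):
    """Gain ratio of a binary feature over (feature set, label) examples."""
    labels = [label for _, label in examples]
    present = [label for feats, label in examples if feature in feats]
    absent = [label for feats, label in examples if feature not in feats]
    if not present or not absent:
        return 0.0  # the feature does not split the examples
    n = len(examples)
    remainder = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    gain = entropy(labels) - remainder
    split_info = entropy(["present"] * len(present) + ["absent"] * len(absent))
    return gain / split_info


# Hypothetical examples: matched triples per text and the text's trend class.
RECESSION = "recession indicates negative_trend"
GROWTH = "growth indicates positive_trend"
CUT = "cut relatedTo recession"
examples = [
    ({RECESSION}, "negative"),
    ({RECESSION, CUT}, "negative"),
    ({GROWTH}, "positive"),
    ({GROWTH, CUT}, "positive"),
]
```

In this toy setting, the triple stored in RECESSION separates the two classes
perfectly, whereas the triple stored in CUT carries no information about the
class.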
However, once the feature space has been created from the text corpus (as
described in 3.1 and 3.2), we can use the features to validate the assumptions
about the positive, negative and neutral trend-indicating patterns. Therefore,
we can use clustering as an alternative learning method for automatically
assigning the trend classes’ ranges. In our research, we are also considering
alternative learning algorithms like rough sets, fuzzy case reasoning, neural
networks or inductive learning approaches [14][21][13][8] in order to find the
most appropriate one for semantic-based trend recognition.


4     Future work

Given the directions for research outlined in section 3, we will continue our
work on the theoretical and practical solutions in order to create a prototype
of the semantic-based learning method for trend recognition in simple hybrid
information systems described here.

5    Acknowledgments

This work has been partially supported by the ”InnoProfile-Corporate Semantic
Web” project funded by the German Federal Ministry of Education and
Research (BMBF) and the BMBF Innovation Initiative for the New German
Länder - Entrepreneurial Regions. The author would like to thank their
supervisor, Prof. Robert Tolksdorf, and the TREMA project partners for
supporting this work.


References
1. Ahmad, K.: Events and the Causes of Events, In Conference on Terminology and
   Knowledge Engineering 2002, online: http://www.computing.surrey.ac.uk/ai/TKE
2. Archak, K., Ghose, A., Ipeirotis, P. G.: Show me the Money! Deriving the Pricing
   Power of Product Features by Mining Consumer Reviews
3. Aggarwal, C. C.: A framework for diagnosing changes in evolving data streams,
   SIGMOD 2003: Proceedings of the 2003 ACM SIGMOD International Conference
   on Management of Data, 575-586 (2003)
4. Hevner, A. R., March, S.T., Park, J., Ram, S.: Design Science in Information
   Systems Research, MIS Quarterly, 2004
5. Esuli, A. and Sebastiani, F.: SentiWordNet: Publicly Available Lexical Resource for
   Opinion Mining
6. Gillam, L., Ahmad, K., Ahmad, S., Casey, M., Cheng, D., Taskaya, T., Oliveira,
   P.C.F. and Manomaisupat, P.: Economic News and Stock Market Correlation: A
   Study of the UK Market. In Conference on Terminology and Knowledge Engineering
   2002, online: http://www.computing.surrey.ac.uk/ai/TKE
7. Hatzivassiloglou, V. and McKeown, K. R.: Predicting the semantic orientation of
   adjectives. In Proceedings of the 35th Annual Meeting of the ACL
8. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2. Ed., Morgan
   Kaufmann, 2006
9. Hu, M., and Liu, B.: Mining and summarizing customers reviews. In Proceedings
   of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and
   Data Mining (KDD-2004) (2004), pp. 168-177
10. Mei, Q., Liu, C., Su, H., and Zhai, C.: A probabilistic approach to spatiotemporal
   theme pattern mining on weblogs. In Proceedings of the 15th International Confer-
   ence on World Wide Web (Edinburgh, Scotland) WWW’06 ACM Press, New York,
   NY, 533-542.
11. Mitchell, T.M.: Machine Learning, McGraw-Hill, 1997
12. Morinaga, S., Yamanishi, K.: Tracking Dynamics of Topic Trends, KDD’04:
   Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge
   Discovery and Data Mining, 811-816, ACM NY
13. Pal, S.K. and Mitra, P.: Pattern Recognition Algorithms for Data Mining, CRC
   Press LLC, 2004
14. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, Prentice Hall,
   2. Ed., 2003
15. Salton, G., Buckley, Ch.: Term Weighting Approaches in Automatic Text Retrieval,
   Information Processing and Management: an International Journal,
   Volume 24, Issue 5 (1988), pp. 513-523
16. Schleutermann, S. and Heidl, B. and Finsterer, U.: Trenderkennung beim Nieren-
   funktionsmonitoring auf der Intensivstation, GMDS 139-142, 1996
17. Simon, H.A.: The Sciences of the Artificial, Ch. 4: Remembering and Learning,
   MIT Press, Third Edition (1996)
18. Tanasescu, V., Streibel, O.: Extreme Tagging: Emergent Semantics Through the
   Tagging of Tags. In International Workshop on Emergent Semantics and Ontology
   Evolution, ISWC2007
19. Turney, P.D., and Littman, M.L.: Measuring praise and criticism: Inference of
   semantic orientation from association. ACM Transactions on Information Systems
   21, 4 (2003), 315-346
20. Vejlgaard, H.: Anatomy of a Trend, McGraw-Hill, 1. Ed., 2007
21. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and
   Techniques, 2. Ed., Morgan Kaufmann, 2005
22. Witten, I.H., Gori, M., Numerico, T.: Web Dragons: Inside The Myths of Search
   Engine Technology, Morgan Kaufmann, 2007
23. Wong, W.-K., Moore, A., Cooper, G., Wagner, M.: What is Strange About Recent
   Events (WSARE), Journal of Machine Learning Research, 2005
24. www.google.com/trends
25. www.blogpulse.com
26. www.projekt-trema.de