=Paper=
{{Paper
|id=Vol-2412/paper7
|storemode=property
|title=PaloAnalytics Project Concept, Scope and Outcomes: An Opportunity for Culture
|pdfUrl=https://ceur-ws.org/Vol-2412/paper7.pdf
|volume=Vol-2412
|authors=Vassilis Poulopoulos,Manolis Wallace,Costas Vassilakis,George Lepouras
|dblpUrl=https://dblp.org/rec/conf/smap/PoulopoulosWVL19
}}
==PaloAnalytics Project Concept, Scope and Outcomes: An Opportunity for Culture==
<pdf width="1500px">https://ceur-ws.org/Vol-2412/paper7.pdf</pdf>
<pre>
        PaloAnalytics project concept, scope and
          outcomes: an opportunity for culture

    Vassilis Poulopoulos1[0000-0003-1707-3153], Manolis Wallace1[0000-0002-4629-5946],
    Costas Vassilakis2[0000-0001-9940-1821], and George Lepouras2[0000-0001-6094-3308] *

              1 Knowledge and Uncertainty Research Laboratory
                University of the Peloponnese, Tripolis, Greece 221 31
                {wallace, vacilos}@uop.gr
                http://gav.uop.gr
              2 University of the Peloponnese
                Tripolis, Greece 221 31
                {costas, gl}@uop.gr


        Abstract. This paper describes the nationally funded project
        PaloAnalytics, which develops an innovative platform that allows
        companies and organizations operating in several countries to monitor
        and analyze, in depth, the markets’ interest in their products and to
        plan their marketing and communication strategy successfully, with
        data and insights collected from all the local media. The paper
        focuses on the application of the platform to cultural spaces and
        museums, and examines the effect that the project can have on cultural
        spaces and on companies related to arts and culture. The PaloAnalytics
        platform allows organizations to investigate the impact of their
        products on consumers across different countries by analyzing content
        from sites, blogs, social networks and open data. Cultural
        organizations can therefore benefit from adopting the implemented
        services: they can recognize and analyze their audience and their
        online marketing campaigns, and examine the impact and the spread of
        their messages on the Internet. We briefly describe the project and
        discuss its impact on culture-related organizations.

        Keywords: big data, data monitoring, trending topics, influencers,
        infographics, data visualization, deep learning


1     Introduction
The data that is generated daily in the world of the internet is vast. The amount
of information is such that it is impossible for companies and organizations to
fetch, analyze and learn from all the data produced. In this scope, PaloAnalytics
is a project that aims to collect, analyze and extract useful information from
different sources on the internet: web pages, news portals, open sources, and
social media. Collecting and analyzing information from diverse sources is not
new and has attracted research during the last 20 years [10]. It belongs to the
area of Data Mining [12], [11] and focuses on Big Data analysis, which is based
on multiple custom Data Warehouses [13]. In this context, we present a project
that intends to employ resources from all these sectors in order to produce its
final result: in-depth analysis of social media and web data in order to support
organizations and companies.

* Cultural Informatics 2019, June 9, 2019, Larnaca, Cyprus. Copyright held by the
  authors.

    Market research has shown that companies and organizations have a strong
need for a holistic market monitoring and analysis service covering several
countries and not solely their country of origin; or at least they are convinced
that they can perform much better with such a tool. Besides, the competition of
such companies and organizations is usually international. Furthermore, it is
clear that, when analyzing data in an international environment, local
information can easily affect the whole organization, yet it rarely becomes part
of the organization’s international policy. In general, data can be collected
and analyzed to some extent locally, but it is usually not transferred as
knowledge to the international level. For such organizations it would be
extremely useful to use a single language for all the analyzed data, and the
English language seems acceptable and consistent. A number of tools, including
Mention (footnote 3) and Brandwatch (footnote 4), have been developed to collect
and analyze data internationally, but they have some major disadvantages. They
focus mainly on social media and target experienced users, and they do not
translate reports from local languages into a universal language. Furthermore,
they do not offer a homogeneous overall picture for all the countries that are
of interest to a business.
    On this basis, we introduce the PaloAnalytics project, which focuses on the
basic challenges that organizations face and offers a universal monitoring tool
with links and interconnections between data collected and analyzed from a
number of different sources and languages. In this way the project aims to be an
ideal solution for international companies (or companies willing to become
international) and for companies whose local business is affected by
international competition: a solution through which the organization will be
able to extract information from large sets of data.
    The proposed design and implementation introduces a series of software
modules that will
    – analyze multilingual content posted on news sites, social networks and open
      data
    – extract knowledge and information about products and companies, including
      product characteristics
    – analyze sources, their influence and trends
    – help in assessing the image of the business and its products as well as its
      competitors
    – visualize the knowledge so that the analyzed information is easy to
      understand

3 https://mention.com - Mention: Scour the web, social media, and more for powerful
  market insights
4 https://www.brandwatch.com/ - Brandwatch: Know what your customers think

    These procedures show how the project can be used by any type of company.
Research has shown that cultural spaces and organizations have started to take
the Internet and social media seriously. They find it attractive to spread their
messages through these media, as they can expect to reach a larger, global
audience, hold serious debates and conversations, and, generally, play an
alternative active role in order to challenge mass culture. On this basis, the
project can help these organizations build a holistic presence in social media
and on the internet; a presence that can be expected to be international.
    The rest of the paper is structured as follows: Section 2 presents the
methodology of the project, while Section 3 discusses the system architecture.
In section ?? a detailed description of each component is presented, with more
emphasis on the Trending Topics software module and its results. Section 4
defines the expected outcomes of the proposed system and the final section
presents a discussion on the project.


2    Methodology of the project

Due to the large number of different modules, the high complexity of their
implementation and the importance of precision in their algorithmic procedures,
an advanced methodology is employed: the Rational Unified Process. It is a
software engineering process that aims at producing high quality software that
meets end user needs within a specific timetable and cost. Two cycles of project
evolution are followed: a longer one that leads to the basic implementation, and
a shorter one that performs refinements. Both cycles go through the same steps
of development. The first cycle ensures the implementation, while the second
cycle focuses on the quality of the outcomes. The cycle phases include:

 – Inception Phase
 – Elaboration Phase
 – Construction Phase
 – Transition Phase

    Figure 1 depicts the cycle of system design, implementation and integration.

                         Fig. 1: Rational Unified Process

    During the inception phase a general description of the key requirements
of the project is produced; key points and basic constraints are defined and
the system use cases are described in brief. An initial business case, including
the business framework, the success criteria and financial forecasting, leads
to the project plan and to a draft business model. While analyzing the
information, during the processing phase, the use case model was completed and
the final requirements were recorded. The architecture reached its final form and
the project’s development plan was finalized. Currently the project is in the
first construction phase, where modules are implemented and are starting to be
integrated into the PaloAnalytics platform. Upon completion of the first phase of
implementation, an overall test of system functionality, performance and
usability will be carried out.
    The development of the platform follows a bottom-up approach based on the
proposed architecture (figure 2). It starts from data collection, which feeds
data aggregation services that can also be used individually. Multilingual
content analysis services are applied to the produced data, while, in parallel,
business intelligence extraction solutions are applied at the same stage. The
proposed services will be available as both Web and Mobile applications,
enabling increased penetration into the business community. Each service is built
with endpoint integration so that it can be used as an individual component,
even by third-party systems external to the PaloAnalytics platform.
    The result will be a complete development stack based on multilingual
content from news sites, open data sources and social media. The services of this
stack are expected to attract third-party businesses, companies, public bodies
and researchers, who will develop new ways of managing business data from
the sources incorporated by the PaloAnalytics platform and will set up new
business models on them, multiplying the benefits for the companies and
organizations while maximizing the influence of the proposed solutions on the
scientific and business community.

Fig. 2: Proposed architecture

3      Architecture
According to the architecture presented in figure 2 the proposed system is divided
into several components and modules enabling in this way individual design and
integration. The system, though, can be separated into four major components:
    – data entrance point / data storage
    – deep data analysis
    – semantics and metadata analysis
    – point of presentation
    Each of the major components consists of a number of modules in order to
successfully achieve its scope. Furthermore, each component will offer services
for direct data extraction and usage by third party systems.

3.1     Entry point
The entry point of the system is the component that is responsible for collecting
and storing data from the several different sources (social media, news and open
data). The data storage is built enabling several interfaces to be connected in
order to fetch and store data. In general, it follows a hybrid scheme including
both an SQL and a noSQL database.
    The system acts as a data warehouse, including modules for data extraction,
data transformation as well as data loading. The extraction of data is done
from several different sources including news websites, blogging platforms, social
media - focusing on text based ones - and open data sources. The data collected is
transformed in order to formulate similar objects with specific unified structure.
The unified structure of each unique object includes a unique identifier, title,
body, source, timestamp and author.
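    As an illustration of this transformation step, the following minimal
Python sketch maps two differently shaped raw records onto one object type.
Only the six field names come from the text; the field types, the helper names
(make_id, from_article, from_post) and the keys of the raw payloads are
assumptions made for the example, not the project's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib


@dataclass(frozen=True)
class UnifiedObject:
    # Field names follow the text; the types are assumptions.
    identifier: str
    title: str
    body: str
    source: str
    timestamp: datetime
    author: str


def make_id(source: str, native_id: str) -> str:
    # Hypothetical scheme: hash the source together with its native id.
    return hashlib.sha1(f"{source}:{native_id}".encode("utf-8")).hexdigest()


def from_article(raw: dict) -> UnifiedObject:
    # Assumed payload of a news crawler (keys are illustrative).
    return UnifiedObject(
        identifier=make_id(raw["site"], raw["url"]),
        title=raw["headline"].strip(),
        body=raw["text"].strip(),
        source=raw["site"],
        timestamp=datetime.fromisoformat(raw["published"]),
        author=raw.get("byline", "unknown"),
    )


def from_post(raw: dict) -> UnifiedObject:
    # Assumed payload of a social media collector. Posts have no title,
    # so the first 60 characters of the text stand in for it.
    text = raw["message"].strip()
    return UnifiedObject(
        identifier=make_id(raw["network"], str(raw["post_id"])),
        title=text[:60],
        body=text,
        source=raw["network"],
        timestamp=datetime.fromtimestamp(raw["epoch"], tz=timezone.utc),
        author=raw["user"],
    )
```

    Once every source has such a mapping function, the downstream modules only
need to handle one record shape, whatever the origin of the data.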
    This unified structure is the main object of the system and describes the
main form of the data collected. Additional metadata and auxiliary objects
describe each object in depth, including detailed information about the source,
the author, accompanying multimedia and more. Figure 3 presents a generic schema
of the database infrastructure that supports the storage needs of the system.
    The data collected are stored in both an SQL-like storage environment and
a noSQL environment. In this hybrid scheme the noSQL nodes hold elements that
need fast access, while the SQL based structure holds all the collected data,
providing better interconnection between records and permanent storage of data
with historic metadata [7]. Furthermore, a time-series database is used to keep
track of the records that are stored in the database, including information
about the source or the author. The latter is extremely useful when determining
the rate of update of each source or the posting frequency of authors and their
relation to period and time.
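    The rate of update of a source can be derived from record timestamps alone.
The sketch below is a plain Python illustration under stated assumptions: it
works on an in-memory list of (source, timestamp) pairs instead of a real
time-series database, and the function name mean_update_interval is
hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mean_update_interval(records):
    """Return {source: mean time between consecutive records}.

    records is an iterable of (source, timestamp) pairs. Sources with fewer
    than two records are left out, since no interval can be measured.
    """
    by_source = defaultdict(list)
    for source, stamp in records:
        by_source[source].append(stamp)

    intervals = {}
    for source, stamps in by_source.items():
        if len(stamps) < 2:
            continue
        stamps.sort()
        # The mean of consecutive gaps equals the total span divided by
        # the number of gaps.
        intervals[source] = (stamps[-1] - stamps[0]) / (len(stamps) - 1)
    return intervals
```

    The same computation applies to authors by passing (author, timestamp)
pairs, which gives their posting frequency.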


                          Fig. 3: Generic database scheme


3.2    System Core

The system core contains all the key elements and services of the system. It
constitutes the basis upon which the complete system is designed and implemented.
Each of the modules formulating the system core can act as an autonomous
system providing endpoints for independent usage. These endpoints can also be
used by the system internally in order to perform the physical interconnection
between the different services.
   The system core includes the following elements:

 – Named Entity Recognition (NER), which is a module for recognizing
   entities in bodies of text. It uses a machine learning mechanism based on
   OpenNLP (footnote 5), a set of language features and a set of annotated
   documents for finding candidate named entities, enhanced by the use of
   dictionaries [5].
 – Breaking news detection, which is a component for recognizing important
   news topics. Detection is usually based on the number of similar articles
   produced in a period of time, but not all news topics grow in number in the
   same manner. As such, machine learning algorithms are employed that are
   able to recognize breaking news based on the growth rate over time [6].
 – Clustering, which is responsible for finding interconnections between the
   different entities. The objects collected derive from several different
   sources, and the scope of this module is to create physical
   interconnections between objects having identical meaning. Based on the
   definition of the object (without any attached metadata), the main scope of
   the clustering procedure is to interconnect two objects conceptually.
   Furthermore, as the system intends to operate regardless of the language of
   origin, the interconnection of objects should be language agnostic.
 – Classification, which is a module for automatic categorization of objects
   into predefined categories. As the categories of the system are predefined,
   due to the fact that Palo is used as a news aggregation service, the
   categorization is done in several primary categories. The current mechanism
   will be enhanced in order to enable multilevel categorization including two
   different levels [4].
 – Sentiment Analysis, which is responsible for extracting the polarity of the
   objects. A machine learning algorithm will be employed in order to replace
   the currently used algorithm, which is based on the bag of words method [1].
 – Summarization, which is responsible for extracting summaries out of the
   clusters of objects. As the clustering procedure evolves in time, the
   summarization procedure must adapt to changes in the size of the cluster
   over time.
 – Trending topics detection and enrichment, which is responsible for
   analyzing social media and open sources in order to detect topics that are
   trending and enrich them accordingly in order to detect their trends in
   other countries and languages.

5 OpenNLP: a machine learning based toolkit for the processing of natural language
  text. https://opennlp.apache.org/
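
    To make the growth-rate idea behind breaking news detection concrete, the
following Python sketch flags a topic when the number of similar articles in
the latest time window grows sharply relative to the previous one. It is a
simplification under stated assumptions: a fixed threshold stands in for the
machine learning algorithms mentioned in the list, and the names growth_rates
and is_breaking as well as the default thresholds are hypothetical.

```python
def growth_rates(counts):
    """Relative growth between consecutive time windows.

    counts is a list of article counts for one topic, oldest window first.
    A previous count of zero is treated as one to avoid division by zero.
    """
    return [(cur - prev) / max(prev, 1) for prev, cur in zip(counts, counts[1:])]


def is_breaking(counts, min_rate=1.0, min_articles=5):
    """Flag a topic whose latest window grew by at least min_rate.

    min_rate=1.0 means the count at least doubled. min_articles filters out
    topics that are too small for the growth to be meaningful.
    """
    if len(counts) < 2 or counts[-1] < min_articles:
        return False
    return growth_rates(counts)[-1] >= min_rate
```

    With these defaults, a small topic that jumps from 4 to 12 articles is
flagged, while a large topic that moves steadily from 120 to 130 articles is
not, which is why raw article counts alone are a poor signal.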


3.3     High level analysis

The high level data analysis of the system includes a number of components that
combine the outcomes of the deep data analysis and they include:

    – Discovering social media influencers [2], [8]
    – Applying cross-border analytics [3]
    – Performing network analysis
    – Exploring semantic means of the web [9]
    – Simulating web and social media campaigns and measuring their impact
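
    The text does not state how influencers are discovered. As one simple
possibility, and purely as an assumption for illustration, the Python sketch
below ranks authors by the number of distinct other accounts that reshared or
mentioned them. The function name rank_influencers and the (actor, target)
input format are hypothetical.

```python
from collections import defaultdict


def rank_influencers(interactions, top=3):
    """Rank authors by how many distinct accounts reacted to them.

    interactions is an iterable of (actor, target) pairs, meaning that actor
    reshared or mentioned target. Ties are broken alphabetically.
    """
    audience = defaultdict(set)
    for actor, target in interactions:
        if actor != target:  # ignore self-mentions
            audience[target].add(actor)
    ranked = sorted(audience.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(name, len(fans)) for name, fans in ranked[:top]]
```

    Counting distinct accounts instead of raw interactions keeps one very
active follower from inflating the score of an author.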


3.4     Frontend

The system frontend consists of both web and mobile applications that utilize
the data collected and analyzed in order to present reports, visualize data and
make it easy to explore the combined information.
    The web and mobile applications will have a public part that makes parts
of the collected data available to the public. This is a news aggregation service
that includes rich media formats as well as interconnection of information
and multilingual content. The same holds for the mobile application, which can be
adapted in order to enhance the portability and usability of the presented content.


4      Expected outcomes and opportunities for cultural
       organizations

The design and development of the proposed system constitutes a new and
innovative product for the international market, which is expected to attract
many companies and organizations, primarily those that operate internationally.
The absence of specialized competitive products in this field offers a
significant advantage and allows the product to be a leading player in the Greek
market, Greece being the country of origin, and to penetrate the emerging and
demanding international market of high-volume data analysis technology by
providing innovative services and products.
    All the aforementioned is expected to provide a new dynamic in the field
of application development in these emerging sectors. This is achieved by using
state-of-the-art technologies and methodologies together with the partnership’s
extensive knowledge of the field. At the same time, within the framework of the
proposed project, the know-how acquired in the area of large volume analysis is
fully exploited, thereby strengthening the company’s policy towards the
increased use of cutting-edge technologies, as well as the partnership’s
research background. Finally, we should consider the valuable know-how that all
the participating bodies will acquire during the implementation of the project
through the research and development carried out by the two research
organizations in order to achieve the desired objectives. The know-how to be
transferred will improve the scientific potential of all the organizations, and
especially of the company, by increasing its knowledge and expertise and,
consequently, its capabilities for future support as well as for developing new
applications and undertaking new research projects in the context of its
activities.
    Focusing on culture-related organizations, it is possible to find
opportunities that these venues never had. Cultural spaces have recognized the
important role of technology and of online synchronous and asynchronous
communication, and are willing to use modern, cutting-edge technological
features in order both to enhance the experience of visitors and to attract a
broader audience. However, it is extremely difficult for people related to arts
and culture to take the further step of analyzing the impact of their presence
and marketing procedures on the internet. The PaloAnalytics project can play the
role of a companion when it comes to their online presence. The project can help
recognize supporters and fans, measure the impact of the online marketing
strategy, keep a record of other spaces’ impact or connection, and help improve
the online presence.


5   Discussion

We presented the PaloAnalytics project, which is reaching the end of its first
year. During this period the first crucial steps have been made, including the
definition of the system use cases, the formulation of the system architecture,
the set-up of the system infrastructure, as well as the design and initiation
of the first system components. Furthermore, the business case is completed
and the implementation of the first sub-systems is almost finalized. The
infrastructure of the system is set up and the means of integration are defined.
An interesting feature of the project is the participation of two research
laboratories from two different institutions in Greece, which will join their
research teams to produce the results of the project. In order to achieve the
objectives of the project, cutting-edge technology and algorithms are used,
which means that the participants will join forces in the research.
    Although the actual outcome of the project is minimal compared to the
algorithmic procedures that lead to it, a number of related research fields
will be explored during the design and implementation of the components. First
of all, data mining algorithms will be researched in order to produce the optimal
solution for fetching data. Furthermore, the infrastructure that stores the data
is the basis of the system and, as such, its design and integration is part of a
research and development procedure. In addition, a number of algorithms
and techniques, including deep machine learning, will be investigated in order
to implement the following procedures: clustering of data (including text objects
deriving from social media), summarization of clusters, named entity recognition,
sentiment analysis, aspect mining and breaking news detection. Furthermore, apart
from the core algorithms, a number of “high level” procedures are required in
order to achieve the complete set of project goals. These include influencer
mining, semantic web, network analysis, campaign impact, SWOT analysis and
more, which are based on the metadata that accompany the information collected
and processed.
    It should be noted that all the aforementioned are not just part of a
research procedure; the research should not stop at the feasibility and
soundness of the results. The system is a production environment targeting
large businesses and organizations, which can even test and shape the
procedures and the use-case scenarios. It lies in the field of applied research
and all the implemented solutions are expected to be able to handle large
volumes of data, many users and demanding procedures.
    As for the role that the system can play for cultural organizations, it is
an important one. Specifically, we described the system as a valuable companion
that can substantially change the procedures of online marketing strategies and
social media interactions. The system can be used to examine the behavior of
users towards exhibitions and presentations as well as towards individual
cultural objects. The project can be the beginning of a new era in cultural
informatics, acting as a pioneering procedure that brings cutting-edge
technologies directly into the relation between organizations and visitors,
introducing a new form of mass culture.


Acknowledgment

This research has been co-financed by the European Union and Greek national
funds through the Operational Program Competitiveness, Entrepreneurship and
Innovation, under the call RESEARCH CREATE INNOVATE (project code:
T1EDK-03470).

References
1. Castellano, G., Kessous, L., Caridakis, G. (2008). Emotion recognition through
  multiple modalities: face, body gesture, speech. In Affect and emotion in human-
  computer interaction (pp. 92-103). Springer, Berlin, Heidelberg.
2. Caridakis, G., Karpouzis, K., Wallace, M., Kessous, L., Amir, N. (2010). Multimodal
  users affective state analysis in naturalistic interaction. Journal on Multimodal User
  Interfaces, 3(1-2), 49-66.
3. Vlachostergiou, A., Caridakis, G., Kollias, S. (2014). Investigating context awareness
  of affective computing systems: a critical approach. Procedia Computer Science, 39,
  91-98.
4. Varlamis, I., Tsirakis, N., Poulopoulos, V., Tsantilas, P. (2014, October). An au-
  tomatic wrapper generation process for large scale crawling of news websites. In
  Proceedings of the 18th Panhellenic Conference on Informatics (pp. 1-6). ACM.
5. Makrynioti, N., Grivas, A., Sardianos, C., Tsirakis, N., Varlamis, I., Vassalos, V.,
  Poulopoulos, V., Tsantilas, P. (2017). PaloPro: a platform for knowledge extraction
  from big social data and the news. International Journal of Big Data Intelligence,
  4(1), 3-22.
6. Varlamis, I., Hilliard, D. F. (2017). Finding influential sources and breaking news
  in news media using graph analysis techniques. International Journal of Web Engi-
  neering and Technology, 12(2), 143-164.
7. Tsirakis, N., Poulopoulos, V., Tsantilas, P., Varlamis, I. (2017). Large scale opinion
  mining for social, news and blog data. Journal of Systems and Software, 127, 237-248.
8. Margaris, D., Vassilakis, C., Georgiadis, P. (2018). Query personalization using so-
  cial network information and collaborative filtering techniques. Future Generation
  Computer Systems, 78, 440-450.
9. Bampatzia, S., Bravo-Quezada, O. G., Antoniou, A., Nores, M. L., Wallace, M.,
  Lepouras, G., Vassilakis, C. (2016, September). The use of semantics in the CrossCult
  H2020 project. In Semantic Keyword-based Search on Structured Data Sources (pp.
  190-195). Springer, Cham.
10. Levy, A., Rajaraman, A., Ordille, J. (1996). Querying heterogeneous information
  sources using source descriptions. Stanford InfoLab.
11. Hand, D. J. (2006). Data Mining. Encyclopedia of Environmetrics, 2.
12. Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (1996). Advances
  in knowledge discovery and data mining.
13. McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. J., Barton, D. (2012).
  Big data: the management revolution. Harvard business review, 90(10), 60-68.
14. Ghoshal, A., Swietojanski, P., & Renals, S. (2013, May). Multilingual training of
  deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech
  and Signal Processing (pp. 7319-7323). IEEE.
15. Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A. R., Jaitly, N., ... and Sainath,
  T. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE
  Signal processing magazine, 29.
16. Violos, I., Tserpes, K., Varlamis, I., Varvarigou, T. (2018). Text classification using
  the n-gram graph representation model over high frequency data stream. Frontiers in
  Applied Mathematics and Statistics, section Mathematics of Computation and Data
  Science Journal. doi: 10.3389/fams.2018.00041

</pre>