<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Full-text Search in Intermediate Data Storage of FCART</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexey Neznanov</string-name>
          <email>ANeznanov@hse.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrey Parinov</string-name>
          <email>AParinov@hse.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>20 Myasnitskaya Ulitsa, Moscow, 101000</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>71</fpage>
      <lpage>77</lpage>
      <abstract>
        <p>The speed of full-text search directly affects the process of text analysis. A search engine builds a text index, which it then uses for fast full-text search. Solr and ElasticSearch are two popular search engines. A text analysis system needs searching and indexing to run fast at the same time. This paper describes the preprocessing workflow of the analysis system Formal Concept Analysis Research Toolbox (FCART) and an experiment in which social networking service data is searched and indexed simultaneously. The results of the experiment show which search engine is better suited as the core of the FCART search subsystem.</p>
      </abstract>
      <kwd-group>
        <kwd>Formal Concept Analysis</kwd>
        <kwd>Knowledge Extraction</kwd>
        <kwd>Data Mining</kwd>
        <kwd>Software</kwd>
        <kwd>Big Data</kwd>
        <kwd>Social Network Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Formal Concept Analysis Research Toolbox (FCART) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a data analysis system that supports knowledge discovery techniques for texts, including those based on
Formal Concept Analysis [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], clustering [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], multimodal clustering [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] and pattern
structures [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The system supports an iterative methodology of knowledge discovery.
The goal of developing FCART is to create a system for convenient analysis of texts from
external data sources, e.g. SQL databases, NoSQL databases and social network
services. In previous papers, we described the system architecture, the main
workflow and the stages of data extraction from various external sources.
      </p>
      <p>
        Fast search for relevant documents in a big dataset is an important function of a text
analysis system. Developing a search engine from scratch is time-consuming and
unnecessary, since many search engines already exist; Solr [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and ElasticSearch
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] are among the most popular ones [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        In this paper we describe the key features of the search engines mentioned above, the
preprocessing workflow of FCART and the requirements for a full-text search engine as
part of an analytic system. We then describe an experiment in which data gathered from
the social network service LiveJournal
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is searched and indexed at the same time. Based on the results of the experiment, we choose the search engine that
better serves as the core of the FCART full-text search subsystem.
      </p>
    </sec>
    <sec id="sec-2">
      <title>FCART architecture and preprocessing workflow</title>
      <p>
        Formal Concept Analysis (FCA) has many applications in different fields [
        <xref ref-type="bibr" rid="ref11 ref12">11,12</xref>
        ].
“Formal Concept Analysis Research Toolbox” (FCART) is an integrated environment
for knowledge and data engineers with a set of research tools based on Formal
Concept Analysis. FCART is built as a distributed web-based system with a thick
client. In its new distributed version, FCART consists of four parts:
1. FCART Intermediate Data Storage (IDS) for storage and preprocessing (initial
conversion, data preprocessing, etc.) of big datasets;
2. FCART Web-based solvers (Web-Solvers) for independent resource-intensive
computations;
3. FCART Auth Server for authentication and authorization;
4. FCART Thick Client for interactive data processing and visualization in the
integrated environment.
      </p>
      <p>From an analyst's point of view, the basic FCA workflow in FCART has five stages (see
Fig. 1):
1. Filling the Intermediate Data Storage of FCART from various external SQL, XML or
JSON-like data sources (the query to an external source is described by an External Data
Query Description, EDQD);
2. Data indexing and preprocessing;
3. Loading a data snapshot from local storage into the current analytic session
(the snapshot is described by a Snapshot Profile). A data snapshot is a data table with
annotated structures and text attributes, loaded into the system by accessing external
data sources;
4. Transforming the snapshot into a binary context (the transformation is described by a
Scaling Query);
5. Building and visualizing concept lattices and other artifacts arising from formal
contexts (binary data tables) within an analytic session.</p>
      <p>[Fig. 1. FCA workflow in FCART. Diagram labels: External Data Source; External Data Set; FCART IDS Import/Export Tools; FCART Client Import/Export Tools; FCART Intermediate Data Storage; Local Session Data Base; JSON Collection; Data Snapshot (Multivalued Context); Binary Context; Concept Lattice; Analytic Artifacts; Pattern Structures; Clusters; Other artifacts; External Data Query Description; Snapshot Profile; Scaling Query + Graph Generators.]</p>
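      <p>The transformation of a data snapshot (a multivalued context) into a binary context can be illustrated with a small sketch. It assumes plain nominal scaling, where every attribute-value pair becomes one binary attribute; the function name <monospace>nominal_scale</monospace> and the sample data are hypothetical and are not part of FCART or of its Scaling Query language.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List, Set, Tuple

# A multivalued context: object id -> {attribute name -> value}.
Snapshot = Dict[str, Dict[str, object]]


def nominal_scale(snapshot: Snapshot) -> Tuple[List[str], List[str], Set[Tuple[str, str]]]:
    """Turn each (attribute, value) pair into one binary attribute."""
    objects = sorted(snapshot)
    incidence: Set[Tuple[str, str]] = set()
    for obj, row in snapshot.items():
        for attr, value in row.items():
            # List-valued fields (e.g. tags) yield one binary attribute per element.
            values = value if isinstance(value, (list, tuple, set)) else [value]
            for v in values:
                incidence.add((obj, f"{attr}={v}"))
    attributes = sorted({a for _, a in incidence})
    return objects, attributes, incidence


# Hypothetical sample snapshot with two documents.
snapshot = {
    "doc1": {"lang": "ru", "tags": ["fca", "search"]},
    "doc2": {"lang": "en", "tags": ["search"]},
}
objects, attributes, incidence = nominal_scale(snapshot)
print(attributes)
for obj in objects:
    # One row of the binary context: "X" where the object has the attribute.
    print(obj, ["X" if (obj, a) in incidence else "." for a in attributes])
```
]]></preformat>
      <p>The resulting set of object-attribute pairs is the incidence relation of a formal context, which is the input needed for building a concept lattice.</p>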
      <p>All steps of data analysis are performed in the Client program. In the
background, the Client interacts with the Server through the Server's REST interface. We
described the Server REST interface in detail in previous articles.</p>
      <p>In the current version of FCART, the preprocessing workflow consists of two steps:
1. Normalizing incoming documents;
2. Creating an index.
At the normalization step, FCART creates a document in JSON format, selects one text
field as the identifier and calculates document metrics (size, word count, etc.).
At the index creation step, FCART creates an index on one or more text fields. This
index is used for fast data search.</p>
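      <p>The normalization step can be sketched as follows. This is only an illustration under assumptions: the output layout (<monospace>_id</monospace>, <monospace>body</monospace>, <monospace>metrics</monospace>) and the function name <monospace>normalize</monospace> are hypothetical, and the word count is taken over top-level string fields only.</p>
      <preformat><![CDATA[
```python
import json
import re
from typing import Any, Dict


def normalize(raw: Dict[str, Any], id_field: str) -> Dict[str, Any]:
    """Wrap a raw record into a normalized document with simple metrics."""
    if not isinstance(raw.get(id_field), str):
        raise ValueError(f"identifier field {id_field!r} must be a text field")
    # Concatenate all top-level string fields to compute text metrics.
    text = " ".join(v for v in raw.values() if isinstance(v, str))
    body = json.dumps(raw, ensure_ascii=False)
    return {
        "_id": raw[id_field],          # the text field selected as identifier
        "body": raw,
        "metrics": {
            "size_bytes": len(body.encode("utf-8")),
            "word_count": len(re.findall(r"\w+", text)),
        },
    }


# Hypothetical input record.
doc = normalize({"nick": "alice", "bio": "likes formal concept analysis"}, id_field="nick")
print(json.dumps(doc, ensure_ascii=False, indent=2))
```
]]></preformat>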
    </sec>
    <sec id="sec-3">
      <title>Comparison of popular full-text libraries and systems</title>
      <p>A full-text indexing library adds content to a full-text index. It then allows queries
to be run against this index, returning results either ranked by relevance to the query or
sorted by an arbitrary field such as a document's last-modified date. The library returns
search results quickly because, instead of scanning the text directly, it searches the
index. The library cannot be used on its own; it is used as a database plugin
or as a part of another software system.</p>
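      <p>Why searching an index is faster than scanning the text can be shown with a minimal inverted index. The sketch below is a toy stand-in and does not reflect the internals of Lucene, Solr or ElasticSearch; the class <monospace>InvertedIndex</monospace> and its methods are hypothetical names.</p>
      <preformat><![CDATA[
```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class InvertedIndex:
    """Maps each term to the set of document ids that contain it."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    @staticmethod
    def tokenize(text: str) -> List[str]:
        return re.findall(r"\w+", text.lower())

    def add(self, doc_id: str, fields: Iterable[str]) -> None:
        """Index one document over one or more text fields."""
        for text in fields:
            for term in self.tokenize(text):
                self.postings[term].add(doc_id)

    def search(self, query: str) -> Set[str]:
        """Return ids of documents containing every query term (AND semantics)."""
        terms = self.tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = InvertedIndex()
index.add("d1", ["Formal Concept Analysis toolbox"])
index.add("d2", ["Full-text search toolbox"])
print(sorted(index.search("toolbox")))
print(sorted(index.search("concept toolbox")))
```
]]></preformat>
      <p>A query only touches the posting sets of its own terms, so its cost depends on the number of matching documents rather than on the total size of the stored text.</p>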
      <p>A full-text search library has several key features:
1. Search, initial indexing and reindexing speed;
2. Supported natural languages and dialects;
3. Supported programming languages;
4. Supported document formats;
5. Stemming algorithms;
6. Metrics and ranking methods.</p>
      <p>
        A full-text indexing system is a standalone, independent product. It is built on a
full-text indexing library and provides more features. For example, Solr and ElasticSearch
are both based on the same library, Lucene [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Besides the features mentioned above,
a full-text system provides the following:
1. Support for multiple indexes based on different fields;
2. Scalability;
3. Support for complex search expressions;
4. Ranking and grouping of search results;
5. Ease of use and ease of integration with data storage (e.g. MongoDB).
How well a particular software system delivers each feature depends on the CPU,
the number of cores, the total amount of memory, etc. Solr and ElasticSearch provide very
similar sets of features [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and these sets are sufficient for our current
needs. We deployed ElasticSearch and Solr on our server and ran the same
experiment on each. In the next section we describe an experiment on indexing data gathered
from the social network service LiveJournal.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Indexing data description</title>
      <p>LiveJournal is a social blogging system that is popular in Russia. Using the LiveJournal Web
API, we extracted 1000, 100 000 and 1 000 000 user profiles. We then carried out
some classical social network analysis (SNA) experiments and tested the improved subsystems of FCART.
Initially, IDS takes five seed profiles and starts downloading the
neighbourhood of these users by the relation “has a friend”. Each profile consists of
several blocks of fields:
1. Nickname (“nick”, a unique identifier).
2. List of friends (“friends”, an array of nicknames).
3. List of users who marked this profile as a friend (“friendsOf”, an array of
nicknames).
4. List of interests (“interest”, an array of tags).
5. List of watched communities (“watching”, an array of community names).
6. List of community memberships (“memberOf”, an array of community
names).
7. List of posts (not used in the next stages).
8. Other personal information.</p>
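      <p>The profile structure and the neighbourhood download described above can be sketched as a breadth-first traversal of the “has a friend” relation. The field names follow the list of profile blocks; everything else is an assumption made for illustration: the <monospace>crawl</monospace> function, the injected <monospace>fetch</monospace> callback and the in-memory network stand in for the real downloader and the LiveJournal Web API, which are not shown here.</p>
      <preformat><![CDATA[
```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Profile:
    nick: str                                          # unique identifier
    friends: List[str] = field(default_factory=list)
    friendsOf: List[str] = field(default_factory=list)
    interest: List[str] = field(default_factory=list)
    watching: List[str] = field(default_factory=list)
    memberOf: List[str] = field(default_factory=list)


def crawl(seeds: Iterable[str],
          fetch: Callable[[str], Optional[Profile]],
          limit: int) -> Dict[str, Profile]:
    """Breadth-first download of the "has a friend" neighbourhood of the seeds."""
    queue = deque(seeds)
    seen = set(queue)
    profiles: Dict[str, Profile] = {}
    while queue and len(profiles) < limit:
        nick = queue.popleft()
        profile = fetch(nick)      # hypothetical downloader, injected by the caller
        if profile is None:
            continue
        profiles[nick] = profile
        for friend in profile.friends:
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return profiles


# Toy in-memory "network" standing in for the remote service.
network = {
    "a": Profile("a", friends=["b", "c"]),
    "b": Profile("b", friends=["a", "d"]),
    "c": Profile("c", friends=[]),
    "d": Profile("d", friends=["a"]),
}
downloaded = crawl(["a"], network.get, limit=3)
print(sorted(downloaded))
```
]]></preformat>
      <p>Passing the downloader in as a function keeps the traversal independent of the network layer, so the same loop could be driven by several worker threads or by a test double.</p>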
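      <table-wrap>
        <table>
          <thead>
            <tr>
              <th>Threads</th>
              <th>Profiles</th>
              <th>Profiles</th>
              <th>Profiles</th>
              <th>Profiles</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>1</td>
              <td>18</td>
              <td>195</td>
              <td>3671</td>
              <td>42710</td>
            </tr>
            <tr>
              <td>10</td>
              <td>2</td>
              <td>21</td>
              <td>291</td>
              <td>4483</td>
            </tr>
            <tr>
              <td>100</td>
              <td>0.2</td>
              <td>3</td>
              <td>32</td>
              <td>471</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>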
    </sec>
    <sec id="sec-5">
      <title>Experiment of searching and indexing data</title>
      <p>
        Solr 5.4 and ElasticSearch 2.0 were deployed on a computer with
an Intel Xeon E5-2670 processor (2.60 GHz, 4 cores) and 8 GB of RAM
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. We compared the search time of the two systems while they were indexing data.
Table 2 contains the results of the experiment, and Fig. 2 illustrates
them.
      </p>
      <p>The experiment shows an advantage of ElasticSearch over Solr by a factor of 3 to 7.</p>
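      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr>
              <th rowspan="2">Indexing system</th>
              <th colspan="4">Time for indexing, milliseconds</th>
            </tr>
            <tr>
              <th>1000 profiles</th>
              <th>10000 profiles</th>
              <th>100000 profiles</th>
              <th>1000000 profiles</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Solr</td>
              <td>3000</td>
              <td>5700</td>
              <td>6200</td>
              <td>6900</td>
            </tr>
            <tr>
              <td>ElasticSearch</td>
              <td>5</td>
              <td>16</td>
              <td>29</td>
              <td>42</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>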
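      <p>The idea of measuring search time while indexing is in progress can be sketched with a small harness. This is not the benchmark used in the experiment: <monospace>TinyIndex</monospace> is a hypothetical in-memory stand-in for a search engine, and the function name, the synthetic documents and the single-term query are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import threading
import time
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Set


class TinyIndex:
    """Thread-safe toy inverted index; a stand-in for a real search engine."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[int]] = defaultdict(set)
        self._lock = threading.Lock()

    def add(self, doc_id: int, text: str) -> None:
        with self._lock:
            for term in text.lower().split():
                self._postings[term].add(doc_id)

    def search(self, term: str) -> Set[int]:
        with self._lock:
            return set(self._postings.get(term.lower(), set()))


def search_latency_while_indexing(docs: List[str], query: str) -> float:
    """Index docs in a background thread while timing searches; mean latency in ms."""
    index = TinyIndex()
    done = threading.Event()

    def indexer() -> None:
        for doc_id, text in enumerate(docs):
            index.add(doc_id, text)
        done.set()

    latencies: List[float] = []
    worker = threading.Thread(target=indexer)
    worker.start()
    while True:
        finished = done.is_set()   # sampled first, so at least one search is timed
        start = time.perf_counter()
        index.search(query)
        latencies.append((time.perf_counter() - start) * 1000.0)
        if finished:
            break
    worker.join()
    return mean(latencies)


# Synthetic documents; real profiles would be loaded from storage instead.
docs = [f"profile {i} likes search and lattices" for i in range(10_000)]
latency_ms = search_latency_while_indexing(docs, "lattices")
print(f"mean search latency while indexing: {latency_ms:.3f} ms")
```
]]></preformat>
      <p>For a real comparison, the two calls on <monospace>TinyIndex</monospace> would be replaced by requests to the deployed engines, and the same document set and queries would be used for both systems.</p>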
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        In this paper, we have described the key features of two popular full-text search engines and
the architecture and preprocessing workflow of the FCART software. We considered the case of searching
and indexing data at the same time. Our experiment has shown a
significant advantage of ElasticSearch over Solr. In future work we will use
ElasticSearch as the core of the search subsystem. The next FCART release will be
integrated with ConceptCloud [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and extended with text indexing
based on a parse thicket index [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work was carried out by the authors within the project “Data mining based on
applied ontologies and lattices of closed descriptions” supported by the Basic
Research Program of the National Research University Higher School of Economics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Neznanov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilvovsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parinov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Advancing FCA Workflow in FCART System for Knowledge Discovery in Quantitative Data</article-title>
          //
          <source>2nd International Conference on Information Technology and Quantitative Management (ITQM-2014), Procedia Computer Science</source>
          ,
          <volume>31</volume>
          ,
          <year>2014</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <source>Formal Concept Analysis: Mathematical Foundations</source>
          , Springer,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Neznanov</surname>
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilvovsky</surname>
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            <given-names>S.O.</given-names>
          </string-name>
          <article-title>FCART: A New FCA-based System for Data Analysis and Knowledge Discovery</article-title>
          ,
          <source>Contributions to the 11th International Conference on Formal Concept Analysis</source>
          ,
          <year>2013</year>
          . pp.
          <fpage>31</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mirkin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <source>Mathematical Classification and Clustering</source>
          , Springer,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>D.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magizov</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhukov</surname>
            ,
            <given-names>L.E.</given-names>
          </string-name>
          <article-title>From Triconcepts to Triclusters</article-title>
          .
          <source>Proc. of 13th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC-2011)</source>
          , LNCS/LNAI Volume 6743/2011
          , Springer (
          <year>2011</year>
          ), pp.
          <fpage>257</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          <article-title>Pattern Structures for Analyzing Complex Data //</article-title>
          <source>Proc. of 12th International conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC-2009)</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Apache Solr (http://lucene.apache.org/solr/)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. ElasticSearch (https://www.elastic.co/)</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Ranking of Search Engines (http://db-engines.com/en/ranking/search+engine)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. LiveJournal (http://livejournal.com)</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Poelmans</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>D.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dedene</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Formal Concept Analysis in knowledge processing: A survey on models and techniques</article-title>
          .
          <source>Expert Systems with Applications</source>
          , Vol.
          <volume>40</volume>
          , No.
          <issue>16</issue>
          ,
          <year>2013</year>
          , pp.
          <fpage>6601</fpage>
          -
          <lpage>6623</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Poelmans</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>D.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dedene</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Formal concept analysis in knowledge processing: A survey on applications</article-title>
          .
          <source>Expert Systems with Applications</source>
          ,
          <volume>40</volume>
          ,
          <year>2013</year>
          , pp.
          <fpage>6538</fpage>
          -
          <lpage>6560</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Apache Lucene (https://lucene.apache.org/)</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Apache Solr vs Elasticsearch (http://solr-vs-elasticsearch.com/)</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Zeus (http://zeus.hse.ru:8080)</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Greene</surname>
            ,
            <given-names>G.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>ConceptCloud: a tagcloud browser for software archives</article-title>
          .
          <source>Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014)</source>
          , ACM, New York, NY, USA,
          <year>2014</year>
          , pp.
          <fpage>759</fpage>
          -
          <lpage>762</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Galitsky</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilvovsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Text integrity assessment: Sentiment profile vs rhetoric structure</article-title>
          .
          <source>Computational Linguistics and Intelligent Text Processing. 16th International Conference, CICLing 2015, Cairo, Egypt, April 14-20, 2015, Proceedings, Part II</source>
          , Vol.
          <volume>9042</volume>
          , Springer International Publishing,
          <year>2015</year>
          , pp.
          <fpage>126</fpage>
          -
          <lpage>139</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>