<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mining Structures from Massive Text Data: A Data-Driven Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiawei Han</string-name>
          <email>hanj@illinois.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Abel Bliss Professor, Department of Computer Science, University of Illinois at Urbana-Champaign</institution>
          ,
          <addr-line>Urbana, IL 61801</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>16</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>Real-world big data are largely unstructured, interconnected, and in the form of natural language text. One of the grand challenges is to mine structures from such massive unstructured data, and transform such big data into structured networks and actionable knowledge. We propose a text mining approach that requires only distant supervision or minimal supervision but relies on massive data. We show that quality phrases can be mined from such massive text data, types can be extracted from massive text data with distant supervision, and entity-attribute-value triples can be extracted from meta-patterns discovered from such data. Finally, we propose a data-to-network-to-knowledge paradigm, that is, first turn data into relatively structured information networks, and then mine such text-rich and structure-rich networks to generate useful knowledge. We show that such a paradigm represents a promising direction for turning massive text data into structured networks and useful knowledge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The success of data mining technology is largely
attributed to the efficient and effective analysis
of structured data. The construction of a
well-structured, machine-actionable database from raw
(unstructured or loosely structured) data sources is
often the premise of subsequent applications.
Although the majority of existing data generated in
our society is unstructured, big data leads to big
opportunities to uncover structures of real-world
entities (e.g., person, company, product),
attributes (e.g., age, weight), and relations (e.g.,
employee of, manufacture) from massive
text corpora. By integrating these
semantic-rich structures with other inter-related structured
data (e.g., product specification, user
transaction log), one can construct a
powerful StructDB as a conceptual abstraction of the
original text corpora. The uncovered StructDBs
will facilitate browsing information and inferring
knowledge that is otherwise locked in the text
corpora. Computers can effectively perform
algorithmic analysis at a large scale over these
StructDBs and apply the new insights and
knowledge to improve human productivity in various
downstream tasks. Our phrase mining tool,
SegPhrase
        <xref ref-type="bibr" rid="ref4">(Jialu Liu, et al., 2015)</xref>
        , won the grand
prize of the Yelp Dataset Challenge<sup>1</sup> and was used by
TripAdvisor in production<sup>2</sup>. Our entity
recognition and typing system, ClusType
        <xref ref-type="bibr" rid="ref4 ref6">(Xiang Ren, et
al., 2015)</xref>
        , was shipped as part of products at
Microsoft Bing and the U.S. Army Research Lab.
      </p>
      <p>The remainder of the paper is organized as
follows. Section 2 introduces our recent work on
automated mining of quality phrases from massive
corpora. Section 3 introduces our recent studies
on automated recognition and typing of entities
and relations with distant supervision. Section 4
presents our initial study on meta-pattern
discovery and its application to information extraction.
We conclude our study in Section 5 by pointing
out some future research topics on turning massive
unstructured data into structured knowledge.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Automated Quality Phrase Mining</title>
      <p>Concepts are words and phrases that represent
terms or ideas that people are interested in. Many
concepts, especially scientific concepts, are in
the form of phrases and are not restricted to noun
phrases (e.g., “NP Complete” and “Learning to
Rank”). Concepts are also often arranged in
hierarchies where each node is a topic represented
by a ranked list of concepts (e.g., {‘social network
analysis’, ‘mining information networks’, . . .} is
a child node of a general topic node {‘knowledge
discovery’, ‘data mining’, . . .}). Such
hierarchical organization of concepts allows exploration of
a corpus at varied granularity, and has applications
such as visualization, search, and summarization.
<fn id="fn1"><label>1</label><p>http://www.yelp.com/dataset_challenge</p></fn>
<fn id="fn2"><label>2</label><p>http://engineering.tripadvisor.com/mining-text-review-snippets/</p></fn></p>
      <p>The NLP community has conducted
extensive studies on automatic extraction of quality
phrases, but these mostly rely on many kinds of
linguistic processing (e.g., chunking, dependency
parsing), domain-dependent language rules, and a
large amount of labeled data (e.g., treebanks).</p>
      <p>
        In our recent research, we have developed
several automated phrase mining methods.
The general philosophy is that instead of relying
on explicit training, we explore statistical
redundancy in document collections by frequent-pattern
mining and semi-supervised learning. Such
data-driven approaches leverage statistical or heuristic
measures derived from the corpus and achieve
impressive results. Our newly developed phrase mining
approach consists of three methods: (1) an
unsupervised approach (i.e., requiring neither
expert-labeled training data nor a knowledge base),
represented by ToPMine
        <xref ref-type="bibr" rid="ref1">(Ahmed El-Kishky, et al.,
2014)</xref>
        , (2) a weakly supervised approach (i.e.,
requiring a small set of human-labeled training data
on the quality of phrases), represented by
SegPhrase
        <xref ref-type="bibr" rid="ref4">(Jialu Liu, et al., 2015)</xref>
        , and (3) a distantly
supervised approach (i.e., requiring only distant
supervision from existing knowledge bases, such as Wikipedia),
represented by AutoPhrase
        <xref ref-type="bibr" rid="ref10 ref2 ref3 ref3 ref5 ref9 ref9">(Jialu Liu, et al., 2017;
Jingbo Shang, et al., 2017)</xref>
        .
      </p>
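      <p>As a rough illustration of the data-driven idea above, the following minimal sketch counts contiguous word pairs in a toy corpus, keeps those that meet a minimum support, and ranks them by pointwise mutual information (PMI). PMI is used here only as a stand-in statistical measure, and the function name <monospace>mine_phrase_candidates</monospace>, the toy documents, and the support threshold are assumptions of this sketch rather than part of ToPMine, SegPhrase, or AutoPhrase.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def mine_phrase_candidates(docs, min_support=2):
    """Return frequent bigrams ranked by PMI (hypothetical helper).

    This is a generic frequency-plus-significance heuristic, not the
    algorithm of any specific system named in the text.
    """
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total = sum(unigrams.values())
    scored = {}
    for (a, b), count in bigrams.items():
        if count < min_support:
            continue  # drop infrequent candidates
        # Probabilities are estimated against the unigram total (a simplification).
        p_ab = count / total
        p_a = unigrams[a] / total
        p_b = unigrams[b] / total
        scored[f"{a} {b}"] = math.log(p_ab / (p_a * p_b))
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


docs = [
    "data mining finds patterns in data",
    "text mining and data mining are related",
    "we study data mining on text",
]
candidates = mine_phrase_candidates(docs)
print(candidates)
```
]]></preformat>
      <p>On this toy corpus only "data mining" meets the support threshold; a real corpus would need longer n-grams and a more careful significance test.</p>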
      <p>Our experiments on large text corpora show that
ToPMine and SegPhrase, with minor adaptation,
generate quality phrases in large corpora of
multiple languages (e.g., English, Arabic, Chinese, and
Spanish), since both methods rely mainly on
statistical analysis instead of language parsing and
linguistic features. AutoPhrase
demonstrates additional power over SegPhrase in four
aspects: (i) minimized human effort, using a
robust positive-only distant training method which
estimates phrase quality by leveraging
existing general knowledge bases; (ii) support for
multiple languages, including English, Spanish, and
Chinese, where the language of the input is
automatically detected; (iii) high accuracy, using
a POS-guided phrasal segmentation model that
incorporates POS tags when a POS tagger is available,
together with the ability to
extract single-word quality phrases; and (iv) high
efficiency, due to a better indexing method and an
almost lock-free parallelization, which lead to both
running time speedup and memory saving.</p>
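      <p>The positive-only distant training idea can be sketched, under simplifying assumptions, as splitting phrase candidates by whether they appear in a knowledge base. The helper <monospace>split_by_knowledge_base</monospace>, the toy candidate list, and the choice to treat all remaining candidates as an unlabeled (possibly noisy) pool are assumptions of this sketch, not a description of how AutoPhrase is implemented.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def split_by_knowledge_base(candidates, kb_phrases):
    """Split candidates into KB-backed positives and an unlabeled pool.

    Only positives come from the knowledge base; nothing is asserted to be
    a true negative, which is the point of positive-only supervision.
    """
    kb = {phrase.lower() for phrase in kb_phrases}
    positives = [c for c in candidates if c.lower() in kb]
    unlabeled = [c for c in candidates if c.lower() not in kb]
    return positives, unlabeled


# Toy inputs (hypothetical): candidate phrases and knowledge-base entry titles.
candidates = ["data mining", "mining of", "support vector machine", "in the"]
kb_titles = ["Data mining", "Support vector machine", "Urbana"]

positives, unlabeled = split_by_knowledge_base(candidates, kb_titles)
print(positives)
print(unlabeled)
```
]]></preformat>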
    </sec>
    <sec id="sec-4">
      <title>3 Distantly Supervised Entity/Relation Recognition and Typing</title>
      <p>Extracting entities and relations for types of
interest from text is important for understanding
massive text corpora. Traditionally, entity and
relation extraction systems have relied on
human-annotated corpora for training and adopted an
incremental pipeline. Such systems require
additional human expertise to be ported to a new
domain and are vulnerable to errors cascading down
the pipeline.</p>
      <p>
        Recently, we have investigated a distantly
supervised approach for extraction and typing of
entities and relations and developed several
interesting methods to reduce human effort and enhance
the performance. These include (1) ClusType
        <xref ref-type="bibr" rid="ref4 ref6">(Xiang Ren, et al., 2015)</xref>
        , which explores an
integrated entity typing and relation-phrase
clustering approach, (2) PLE
        <xref ref-type="bibr" rid="ref7">(Xiang Ren, et al., 2016)</xref>
        for refined entity typing, and (3) CoType
        <xref ref-type="bibr" rid="ref2 ref5 ref8 ref9">(Xiang
Ren, et al., 2017)</xref>
        for jointly embedding and
typing entities and relations in a mutually enhanced
framework.
      </p>
      <p>
        ClusType
        <xref ref-type="bibr" rid="ref4 ref6">(Xiang Ren, et al., 2015)</xref>
        explores
data-driven phrase mining to generate entity
mention candidates and relation phrases, and enforces
the principle that relation phrases should be softly
clustered when propagating type information
between their argument entities. Then the method
predicts the type of each entity mention based
on the type signatures of its co-occurring relation
phrases and the type indicators of its surface name,
as computed over the corpus. The two tasks, type
propagation with relation phrases and multi-view
relation phrase clustering, are put in a joint
optimization framework and achieve high
performance.
      </p>
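      <p>The type-propagation step can be illustrated with a much simplified sketch: each relation phrase receives a type signature from its seed-typed argument mentions, and each untyped mention is assigned the type with the most votes from its co-occurring relation phrases. The function <monospace>propagate_types</monospace> and the toy data are assumptions of this sketch; it omits the soft clustering of relation phrases, the surface-name indicators, and the joint optimization that ClusType uses.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter, defaultdict


def propagate_types(cooccurrences, seed_types):
    """Predict types of untyped mentions by relation-phrase voting.

    cooccurrences: list of (entity_mention, relation_phrase) pairs.
    seed_types: mentions with a known type, e.g. linked to a knowledge base.
    """
    # Step 1: type signature of a relation phrase = type counts of its typed arguments.
    signatures = defaultdict(Counter)
    for mention, phrase in cooccurrences:
        if mention in seed_types:
            signatures[phrase][seed_types[mention]] += 1

    # Step 2: each untyped mention sums the signatures of its relation phrases.
    votes = defaultdict(Counter)
    for mention, phrase in cooccurrences:
        if mention not in seed_types:
            votes[mention].update(signatures[phrase])

    # Mentions with no evidence at all are left untyped.
    return {m: c.most_common(1)[0][0] for m, c in votes.items() if c}


cooccurrences = [
    ("Barack Obama", "president of"),
    ("Angela Merkel", "chancellor of"),
    ("Angela Merkel", "met with"),
    ("Blaise Compaore", "president of"),
    ("Blaise Compaore", "met with"),
    ("Reuters", "reported by"),
]
seed_types = {"Barack Obama": "person", "Angela Merkel": "person"}

predicted = propagate_types(cooccurrences, seed_types)
print(predicted)
```
]]></preformat>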
      <p>
        For extraction and typing of fine-grained
entity types in conjunction with existing knowledge
bases, a major difficulty is that the type labels
obtained from knowledge bases are often noisy
(i.e., incorrect for the entity mentions’ local
context). We proposed a framework, called PLE
        <xref ref-type="bibr" rid="ref7">(Xiang Ren, et al., 2016)</xref>
        , which conducts Label
Noise Reduction in Entity Typing (LNR), to
automatically identify correct type labels (type-paths)
for training examples, given the set of candidate
type labels obtained by distant supervision with a
given type hierarchy. PLE jointly embeds entity
mentions, text features and entity types into the
same low-dimensional space where objects whose
types are semantically close have similar
representations. Then we estimate the type-path for each
training example in a top-down manner using the
learned embeddings. We formulate a global
objective for learning the embeddings from text
corpora and knowledge bases, which adopts a novel
margin-based loss that is robust to noisy labels and
faithfully models type correlation derived from
knowledge bases.
      </p>
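      <p>The top-down estimation of a type-path can be sketched as a greedy walk over the type hierarchy: at each level, pick the candidate child type whose embedding is most similar to the mention embedding. The embeddings below are hard-coded toy values standing in for learned ones, and the names <monospace>infer_type_path</monospace>, <monospace>children</monospace>, and <monospace>ROOT</monospace> are assumptions of this sketch, which may differ from PLE's actual inference procedure.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def dot(u, v):
    """Inner product used as the similarity between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def infer_type_path(mention_vec, type_vecs, children, candidates, root="ROOT"):
    """Greedy top-down search restricted to the mention's candidate types."""
    path, node = [], root
    while True:
        options = [t for t in children.get(node, []) if t in candidates]
        if not options:
            return path  # reached a leaf or ran out of candidate types
        node = max(options, key=lambda t: dot(mention_vec, type_vecs[t]))
        path.append(node)


# Toy type hierarchy and embeddings (hypothetical values, not learned).
children = {
    "ROOT": ["person", "organization"],
    "person": ["politician", "athlete"],
}
type_vecs = {
    "person": [0.9, 0.1],
    "organization": [-0.5, 0.8],
    "politician": [0.8, 0.3],
    "athlete": [0.1, -0.9],
}
candidates = {"person", "organization", "politician", "athlete"}
mention_vec = [1.0, 0.2]

path = infer_type_path(mention_vec, type_vecs, children, candidates)
print(path)
```
]]></preformat>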
      <p>
        To further enhance the overall performance
of entity and relation extraction and typing, we
propose a novel domain-independent framework,
called CoType
        <xref ref-type="bibr" rid="ref2 ref5 ref8 ref9">(Xiang Ren, et al., 2017)</xref>
        , that runs
a data-driven text segmentation algorithm to
extract entity mentions, and jointly embeds entity
mentions, relation mentions, text features, and type
labels into two low-dimensional spaces (for
entity and relation mentions, respectively), where, in
each space, objects whose types are close also
have similar representations. CoType then
uses these learned embeddings to estimate the types
of test (unlinkable) mentions. We formulate a
joint optimization problem to learn embeddings
from text corpora and knowledge bases, adopting
a novel partial-label loss function for noisy labeled
data and introducing an object “translation”
function to capture the cross-constraints of entities and
relations on each other. This approach achieves higher
performance than existing embedding-based methods.
      </p>
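      <p>To make the idea of a partial-label loss concrete, the sketch below uses one common hinge-style form: the best-scoring candidate label should beat the best-scoring non-candidate label by a margin, so no single noisy candidate is forced to be correct. This particular formula, the function name <monospace>partial_label_loss</monospace>, and the toy scores are assumptions of this sketch; the exact objective used in CoType may differ.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def partial_label_loss(scores, candidate_labels, margin=1.0):
    """Hinge-style partial-label loss for one mention.

    scores: dict mapping every type label to a model score.
    candidate_labels: noisy label set from distant supervision.
    Assumes both the candidate and the non-candidate sets are non-empty.
    """
    best_candidate = max(s for label, s in scores.items() if label in candidate_labels)
    best_other = max(s for label, s in scores.items() if label not in candidate_labels)
    return max(0.0, margin - (best_candidate - best_other))


candidate_labels = {"person", "politician"}

# A confident mention: some candidate label clearly outscores all others.
confident = partial_label_loss(
    {"person": 2.0, "politician": 0.5, "location": 0.8}, candidate_labels
)
# An uncertain mention: the margin is violated, so the loss is positive.
uncertain = partial_label_loss(
    {"person": 0.6, "politician": 0.1, "location": 0.4}, candidate_labels
)
print(confident, uncertain)
```
]]></preformat>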
    </sec>
    <sec id="sec-6">
      <title>4 Meta-Pattern Guided Information Extraction</title>
      <p>
        Mining textual patterns in news, tweets, papers,
and many other kinds of text corpora may
facilitate effective information extraction from massive
text corpora. Previous studies adopt a dependency
parsing-based pattern discovery approach.
However, the parsing results lose rich context around
entities in the patterns, and the process is costly
for a corpus of large scale. Recently, we have
proposed a typed textual pattern structure, called
meta pattern, to represent a general form of
frequent, informative, and precise subsequence
patterns in certain context. We propose an efficient
framework, called MetaPAD
        <xref ref-type="bibr" rid="ref2 ref8 ref9">(Meng Jiang, et al.,
2017)</xref>
        , which discovers meta patterns from
massive corpora with three techniques: (1) it
develops a context-aware segmentation method to
carefully determine the boundaries of patterns with a
learned pattern quality assessment function, which
avoids costly dependency parsing and generates
high-quality patterns; (2) it identifies and groups
synonymous meta patterns from multiple facets—
their types, contexts, and extractions; and (3) it
examines type distributions of entities in the
instances extracted by each group of patterns, and
looks for appropriate type levels to make
discovered patterns precise.
      </p>
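      <p>A heavily simplified sketch of typed textual patterns follows: entity mentions in each sentence are replaced with type tokens, identical abstracted sentences are counted, and frequent ones are kept together with the tuples they extract. The helpers <monospace>abstract_sentence</monospace> and <monospace>mine_meta_patterns</monospace>, the toy sentences, and the assumption that mentions do not overlap are all hypothetical; the sketch treats a whole sentence as one pattern and omits MetaPAD's context-aware segmentation, synonymous-pattern grouping, and type-level adjustment.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter, defaultdict


def abstract_sentence(sentence, typed_mentions):
    """Return (pattern, mentions_in_order) with mentions replaced by $TYPE tokens."""
    hits = []
    for mention, etype in typed_mentions.items():
        start = sentence.find(mention)
        if start != -1:
            hits.append((start, mention, etype))
    hits.sort()  # keep mentions in the order they appear in the sentence
    pattern = sentence
    for _, mention, etype in hits:
        pattern = pattern.replace(mention, "$" + etype.upper(), 1)
    return pattern, tuple(mention for _, mention, _ in hits)


def mine_meta_patterns(sentences, typed_mentions, min_support=2):
    """Keep patterns seen at least min_support times, with their extractions."""
    counts, extractions = Counter(), defaultdict(list)
    for sentence in sentences:
        pattern, mentions = abstract_sentence(sentence, typed_mentions)
        if mentions:
            counts[pattern] += 1
            extractions[pattern].append(mentions)
    return {p: extractions[p] for p, c in counts.items() if c >= min_support}


typed_mentions = {
    "United States": "country",
    "Burkina Faso": "country",
    "France": "country",
    "Barack Obama": "politician",
    "Blaise Compaore": "politician",
}
sentences = [
    "United States president Barack Obama gave a speech",
    "Burkina Faso president Blaise Compaore gave a speech",
    "France is in Europe",
]
patterns = mine_meta_patterns(sentences, typed_mentions)
print(patterns)
```
]]></preformat>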
      <p>Our extensive experiments demonstrate that our
proposed framework discovers high-quality typed
textual patterns efficiently from different genres
of massive corpora and facilitates information
extraction. For example, from an Associated Press
and Reuters dataset (APR 2015), one can
discover meta-patterns for country and president and
extract country-president pairs even for rarely
mentioned pairs, like Burkina Faso-Blaise
Compaoré, and find which bacteria are resistant
to which antibiotics from the PubMed abstracts.</p>
    </sec>
    <sec id="sec-7">
      <title>5 Conclusions and Future Work</title>
      <p>Mining structures from massive text corpora is an
important task for turning big text data into big
structured knowledge. Traditional approaches
relying on extensive human labeling or annotation
of a nontrivial sample set of documents in a specific
application domain are not scalable. A new
direction is to develop effective weakly or distantly
supervised methods to explore existing
domain-agnostic labels and massive existing text corpora
to achieve high performance on phrase mining,
entity and relation extraction and typing, and
information extraction.</p>
      <p>Our recently developed phrase mining
methods, such as ToPMine, SegPhrase, and AutoPhrase;
entity/relation recognition and typing methods,
such as ClusType, PLE, and CoType; and
pattern-based discovery methods for massive text corpora,
such as MetaPAD, contribute to this direction.</p>
      <p>
        Many research problems remain open
along this direction. Besides further
consolidating these distantly supervised methods, an
important direction is to study automated multi-faceted
taxonomy construction from massive text, in order to turn
extracted concepts (e.g., phrases) into organized
structures, to identify trusted claims and
comparative and succinct summaries, and to build
structured, multi-dimensional text cubes and
information networks from massive data. We have
been working along these lines and developing
some new methods, such as SetExpan
        <xref ref-type="bibr" rid="ref10">(Jiaming
Shen, et al., 2017)</xref>
        , REHession
        <xref ref-type="bibr" rid="ref3 ref5 ref9">(Liyuan Liu, et al.,
2017)</xref>
        and indirect supervision for relation
extraction using question-answer pairs
        <xref ref-type="bibr" rid="ref11">(Zeqiu Wu, et
al., 2018)</xref>
        . Still, this is a huge and promising area,
with vast territory waiting to be
explored.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>Research was sponsored in part by the U.S. Army
Research Lab. under Cooperative Agreement No.
W911NF-09-2-0053 (NSCTA), National Science
Foundation IIS 16-18481, IIS 17-04532, and
IIS 17-41317, and grant 1U54GM114838 awarded
by NIGMS through funds provided by the
trans-NIH Big Data to Knowledge (BD2K) initiative
(www.bd2k.nih.gov). The views and conclusions
contained in this document are those of the
author(s) and should not be interpreted as
representing the official policies of the U.S. Army Research
Laboratory or the U.S. Government. The U.S.
Government is authorized to reproduce and
distribute reprints for Government purposes
notwithstanding any copyright notation hereon.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Ahmed</given-names>
            <surname>El-Kishky</surname>
          </string-name>
          , Yanglei Song, Chi Wang,
          <string-name>
            <surname>Clare R. Voss</surname>
          </string-name>
          , and Jiawei Han.
          <article-title>Scalable topical phrase mining from text corpora</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          ):
          <fpage>305</fpage>
          -
          <lpage>316</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Meng</given-names>
            <surname>Jiang</surname>
          </string-name>
          , Jingbo Shang, Taylor Cassidy, Xiang Ren, Lance Kaplan, Timothy Hanratty, and Jiawei Han.
          <article-title>MetaPAD: Meta pattern discovery from massive text corpora</article-title>
          .
          <source>In Proc. 2017 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'17)</source>
          , Halifax, Nova Scotia, Canada, Aug.
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Jialu</given-names>
            <surname>Liu</surname>
          </string-name>
          , Jingbo Shang, and Jiawei Han.
          <article-title>Phrase Mining from Massive Text and Its Applications</article-title>
          . Morgan &amp; Claypool Publishers,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Jialu</given-names>
            <surname>Liu</surname>
          </string-name>
          , Jingbo Shang, Chi Wang,
          <string-name>
            <surname>Xiang Ren</surname>
          </string-name>
          , and Jiawei Han.
          <article-title>Mining quality phrases from massive text corpora</article-title>
          .
          <source>In Proc. 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15)</source>
          , Melbourne, Australia, May
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Liyuan</given-names>
            <surname>Liu</surname>
          </string-name>
          , Xiang Ren, Qi Zhu, Shi Zhi, Huan Gui, Heng Ji, and Jiawei Han.
          <article-title>Heterogeneous supervision for relation extraction: A representation learning approach</article-title>
          .
          <source>In Proc. of 2017 Conf. on Empirical Methods in Natural Language Processing (EMNLP'17)</source>
          , pages
          <fpage>46</fpage>
          -
          <lpage>56</lpage>
          , Copenhagen, Denmark, Sept.
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Xiang</given-names>
            <surname>Ren</surname>
          </string-name>
          , Ahmed El-Kishky,
          <string-name>
            <given-names>Chi</given-names>
            <surname>Wang</surname>
          </string-name>
          , Fangbo Tao,
          <string-name>
            <surname>Clare R. Voss</surname>
          </string-name>
          , Heng Ji, and Jiawei Han.
          <article-title>ClusType: Effective entity recognition and typing by relation phrase-based clustering</article-title>
          .
          <source>In Proc. 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15)</source>
          , Sydney, Australia, Aug.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Xiang</given-names>
            <surname>Ren</surname>
          </string-name>
          , Wenqi He, Meng Qu,
          <string-name>
            <surname>Clare R. Voss</surname>
          </string-name>
          , Heng Ji, and Jiawei Han.
          <article-title>Label noise reduction in entity typing by heterogeneous partial-label embedding</article-title>
          .
          <source>In Proc. of 2016 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>1825</fpage>
          -
          <lpage>1834</lpage>
          , San Francisco, CA, USA, Aug.
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Xiang</given-names>
            <surname>Ren</surname>
          </string-name>
          , Zeqiu Wu, Wenqi He, Meng Qu, Clare Voss, Heng Ji, Tarek Abdelzaher, and Jiawei Han.
          <article-title>CoType: Joint extraction of typed entities and relations with knowledge bases</article-title>
          .
          <source>In Proc. 2017 World Wide Web Conf. (WWW'17)</source>
          , Perth, Australia, Apr.
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Jingbo</given-names>
            <surname>Shang</surname>
          </string-name>
          , Jialu Liu, Meng Jiang, Xiang Ren,
          <string-name>
            <surname>Clare R. Voss</surname>
          </string-name>
          , and Jiawei Han.
          <article-title>Automated phrase mining from massive text corpora</article-title>
          .
          <source>CoRR, abs/1702.04457</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Jiaming</given-names>
            <surname>Shen</surname>
          </string-name>
          , Zeqiu Wu, Dongming Lei, Jingbo Shang, Xiang Ren, and Jiawei Han.
          <article-title>SetExpan: Corpus-based set expansion via context feature selection and rank ensemble</article-title>
          .
          <source>In Proc. 2017 European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD'17)</source>
          , Skopje, Macedonia, Sept.
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Zeqiu</given-names>
            <surname>Wu</surname>
          </string-name>
          , Xiang Ren, Frank F. Xu,
          <string-name>
            <given-names>Ji</given-names>
            <surname>Li</surname>
          </string-name>
          , and Jiawei Han.
          <article-title>Indirect supervision for relation extraction using question-answer pairs</article-title>
          .
          <source>In Proc. of 2018 ACM Int. Conf. on Web Search and Data Mining (WSDM'18)</source>
          , Los Angeles, CA, Feb.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>