<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Big-data-augmented approach to emerging technologies identification: case of agriculture and food sector</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>Moscow, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>The paper presents a new approach to identifying emerging technologies that relies heavily on big data analysis, namely text mining augmented by syntactic analysis techniques. The capabilities of the new big-data-augmented methodology are shown in comparison with existing results, both globally and in Russia. An integrated ontology of currently emerging technologies in the agriculture and food (A&amp;F) sector is introduced. Directions and possible criteria for further enhancement and refinement of the proposed methodology are considered.</p>
      </abstract>
      <kwd-group>
        <kwd>Text Mining</kwd>
        <kwd>Emerging Technology</kwd>
        <kwd>Agriculture and Food Sector</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Technology identification and mapping exercises that inform science and technology
(S&amp;T) and innovation policy are becoming less feasible without modern data
science techniques. The reasons are the explosive growth in the diversity and
quantity of available S&amp;T information, the drawbacks of human-performed analytics,
the overextended duration of foresight studies, and budget limitations. Attempts
to solve the problem include tech mining [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1–3</xref>
        ] and the creation and regular updating
of ontologies specific to foresight studies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The main disadvantages of these
approaches are their insufficient scalability and their strong reliance on extensive expert
validation, manual filtering, and cleaning of data outputs. The results are highly prone to
subjectivity, human error, and obsolescence.
      </p>
      <p>For the purpose of identifying emerging technologies (new technologies that might have a
significant impact on economic activity), we see text mining and semantic
analysis tools as the most appropriate, because identifying new man-made phenomena
of a known nature (technologies in this case) can be reduced to identifying the new
syntactic constructions that signify them. The fact that man-made artifacts tend to be
explicitly named, described and discussed in written language makes the
problem well-posed.</p>
      <p>
        To demonstrate text-mining-augmented techniques applied to technology
identification and mapping, we consider the case of the agriculture and food (A&amp;F) sector. Our
choice is driven by the fact that a large proportion of global challenges are directly related
to the A&amp;F sector [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and seemingly cannot be solved without radical technological
innovation across the globe [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Methodology</p>
      <p>The main hypothesis of this paper is that "emerging technology" as a signifying
syntactic construction has not lost its semantic utility despite the hype around the concept.
The analysis is based on the ample material of a two-year A&amp;F sector foresight study
and relies on the capabilities of the Text Mining System of the National Research
University Higher School of Economics (NRU HSE). The system's data sources
include a stratified random sample of summaries and metadata of top-cited
research papers and international patents, news items from leading global
news portals focused on science and technology, analytical and forecast reports,
declarations, proceedings and other documents in PDF format (all acquired through open
access sources). At the time of the study, the system held more than 12 million
documents (several hundred million sentences), of which up to 3 million documents
were at least partially relevant to the A&amp;F sector and adjoining sectors such as
biotechnology and bioenergy. It also held more than 150 million terms, that is, object
signifiers (among which technologies are present).</p>
      <p>In this paper, we present one possible approach to technology identification,
namely the cascade identification of words that act as governors within terms. The method
identifies unigrams that are universal signifiers of the semantic field of "technologicality",
i.e. words that radically increase the probability that an n-gram containing them is the
name of a certain technology. Examples of such words are technology, method, system,
platform, model, tool, layer, enzyme and others. Extracting all terms that contain such
words yields hundreds of thousands of terms that are candidate names of
technologies (for instance, DNA sequencing technology or recirculating aquaculture
system). These lists are filtered using author-built machine learning
algorithms that deal with the "information-richness" of terms, their monopolism,
specificity and other attributes.</p>
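      <p>The anchor-word idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: instead of syntactic analysis it assumes that the governor is the last word of an English noun phrase, and the stop list, the simple plural handling and the names <monospace>ANCHOR_WORDS</monospace>, <monospace>STOP_WORDS</monospace>, <monospace>anchor_form</monospace> and <monospace>candidate_terms</monospace> are hypothetical.</p>
      <preformatted><![CDATA[
```python
import re

# Anchor ("technologicality") words taken from the examples in the text.
ANCHOR_WORDS = {"technology", "method", "system", "platform",
                "model", "tool", "layer", "enzyme"}

# Hypothetical stop list that ends the leftward scan for modifiers.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "for",
              "to", "with", "is", "are", "we", "use", "this", "that"}


def anchor_form(token):
    """Lower-case a token and undo a simple plural ('technologies' -> 'technology')."""
    token = token.lower()
    if token.endswith("ies"):
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def candidate_terms(sentence, max_modifiers=3):
    """Return n-grams whose last word is an anchor word preceded by 1..max_modifiers modifiers."""
    tokens = re.findall(r"[A-Za-z][A-Za-z-]*", sentence)
    found = []
    for i, token in enumerate(tokens):
        head = anchor_form(token)
        if head not in ANCHOR_WORDS:
            continue
        start = i
        # Walk left over modifier words until a stop word or the length limit.
        while (start > 0 and i - start < max_modifiers
               and tokens[start - 1].lower() not in STOP_WORDS):
            start -= 1
        if start < i:  # keep only terms with at least one modifier
            found.append(" ".join(tokens[start:i] + [head]))
    return found


sample = "We use DNA sequencing technologies and a recirculating aquaculture system."
print(candidate_terms(sample))
```
]]></preformatted>
      <p>In this sketch a bare anchor word without modifiers is discarded, since a word such as "system" on its own does not name a technology. A production pipeline would presumably replace the stop-list heuristic with a dependency parser.</p>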
      <p>
        Then, for the candidate technology-signifying terms, we analyze the dynamics of the
intensity of their presence in the discourse over recent years. The relative
frequency of a term is calculated using the following formula [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]:
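        <disp-formula>
          <label>(1)</label>
          <tex-math><![CDATA[F = \frac{\sum_{i} n_{i}}{N}]]></tex-math>
        </disp-formula>
        where <italic>i</italic> is the document’s number, <italic>n</italic><sub><italic>i</italic></sub> is the number of sentences in the i-th document
in which the term has occurred, and <italic>N</italic> is the number of all sentences in the corpus.
      </p>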
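      <p>The relative frequency described above (sentences containing a term, summed over documents and divided by the number of all sentences in the corpus) can be sketched in Python as follows. The sentence splitter and the substring match are simplifying assumptions, and the function names are hypothetical.</p>
      <preformatted><![CDATA[
```python
import re


def split_sentences(text):
    """Rough sentence splitter; the splitter used by the actual system is not described."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def relative_frequency(term, documents):
    """Sentences that mention the term, summed over documents, divided by all sentences."""
    term = term.lower()
    sentences_with_term = 0  # sum over documents of per-document counts
    total_sentences = 0      # all sentences in the corpus
    for doc in documents:
        sentences = split_sentences(doc)
        total_sentences += len(sentences)
        sentences_with_term += sum(term in s.lower() for s in sentences)
    return sentences_with_term / total_sentences if total_sentences else 0.0


corpus = [
    "Smart irrigation saves water. Farmers adopt smart irrigation quickly.",
    "Cultured meat is new. It is costly.",
]
print(relative_frequency("smart irrigation", corpus))
```
]]></preformatted>
      <p>Counting sentences rather than raw occurrences means that a term repeated several times within one sentence is counted once, which matches the definition given in the text.</p>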
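      <p>In order to calculate the dynamics of the candidate technology-signifying terms, we
adapted the formula of the average annual growth rate (AGR), formula (2), which takes into account
the number of years for which the collection of documents is available.</p>
      <p>The A&amp;F technology identification results comprise a list of 181 items. A random sample
is provided below:</p>
      <list list-type="simple">
        <list-item><p>aeration technologies</p></list-item>
        <list-item><p>cultured meat technologies</p></list-item>
        <list-item><p>horticultural technologies</p></list-item>
        <list-item><p>agricultural conservation technologies</p></list-item>
        <list-item><p>dairy technologies</p></list-item>
        <list-item><p>DNA micro array technologies</p></list-item>
        <list-item><p>integrated soil fertility management technologies</p></list-item>
        <list-item><p>LEISA technologies</p></list-item>
        <list-item><p>algal biofuel technologies</p></list-item>
        <list-item><p>feed probiotics</p></list-item>
        <list-item><p>meat processing technologies</p></list-item>
        <list-item><p>bioconversion technologies</p></list-item>
        <list-item><p>fertilisation technologies</p></list-item>
        <list-item><p>smart irrigation technologies</p></list-item>
      </list>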
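      <p>The growth-rate step mentioned above can be illustrated with a short Python sketch. The adapted AGR formula is not reproduced in this text, so the version below is an assumption: it takes the mean of the year-on-year relative changes in a term's relative frequency. The function name <monospace>average_growth_rate</monospace> and the input numbers are hypothetical.</p>
      <preformatted><![CDATA[
```python
def average_growth_rate(yearly_frequency):
    """Mean year-on-year relative change of a term's relative frequency.

    yearly_frequency maps a year to the relative frequency in that year.
    Steps that start from a zero frequency are skipped to avoid division
    by zero (an assumption of this sketch).
    """
    years = sorted(yearly_frequency)
    changes = []
    for prev, cur in zip(years, years[1:]):
        base = yearly_frequency[prev]
        if base > 0:
            changes.append((yearly_frequency[cur] - base) / base)
    return sum(changes) / len(changes) if changes else 0.0


# Made-up frequencies for illustration only; they are not results of the study.
frequencies = {2014: 0.125, 2015: 0.25, 2016: 0.375}
print(average_growth_rate(frequencies))
```
]]></preformatted>
      <p>A positive value indicates a term whose presence in the discourse is growing, and a negative value indicates a term that is losing significance.</p>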
      <p>Technologies were distinguished by the dynamics of the intensity of their presence in the
discourse over recent years. This can be visualized as trend maps: two-dimensional plots
with one axis representing the popularity of a term and the other showing the
year-by-year dynamics of the normalized popularity (relative frequency of use). For the trend map
of technologies in agriculture based on media resources, see Fig. 1. The upper-right quadrant
consists of the strongest topics shaping the future agenda of the sector; they are popular
and gaining traction. In the media they are exemplified by CRISPR technologies,
agroforestry and aquaponic technologies, precision agriculture, microalgae technologies
and others. The lower-right quadrant contains the so-called "weak signals": they are highly
trending but still underrepresented in the discourse, and they may contain the emerging
technologies. This group is represented by smart irrigation technologies, molecular breeding,
zinc-finger nucleases technologies and others. Among the popular topics losing their
significance are fertilisation, pruning, antifouling technologies and many more.</p>
      <p>The method applied in this study yields the most results on data from research
papers and especially patent abstracts, which contain fewer low-information terms than
general reports of international organizations and discuss technologies in more
concrete terms. Within this approach, any term that does not contain an
anchor term is filtered out. One consequence is that branded, trademarked and other
proprietary technologies are almost absent from the output (with some exceptions, such
as the Round-Up pesticide, whose name has become almost generic in the GMO
application discourse, so that "roundup technology" was mentioned in the texts analyzed).</p>
      <p>The next steps in filtering the obtained lists of technologies may include building a
semantic map that demonstrates dynamic classification, trend maps based on other
sources of data, and hype maps that show the difference in the normalized popularity of
topics across data sources (e.g. media vs patents).</p>
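      <p>The quadrant reading of a trend map described above can be sketched in Python as follows. The choice of boundaries is an assumption, since the text does not state how the quadrants are delimited: the sketch uses the median popularity as the horizontal boundary and zero growth as the vertical one. The function name and all numbers are hypothetical.</p>
      <preformatted><![CDATA[
```python
from statistics import median


def trend_map_quadrants(terms, growth_cut=0.0):
    """Assign each term to a quadrant of a (growth, popularity) trend map.

    terms maps a term to a (popularity, growth) pair. Popular terms go to
    the upper half, growing terms go to the right half.
    """
    pop_cut = median(popularity for popularity, _ in terms.values())
    quadrants = {}
    for term, (popularity, growth) in terms.items():
        vertical = "upper" if popularity >= pop_cut else "lower"
        horizontal = "right" if growth > growth_cut else "left"
        quadrants[term] = vertical + "-" + horizontal
    return quadrants


# Made-up numbers for illustration only; they are not results of the study.
example = {
    "CRISPR technologies": (0.90, 0.40),
    "smart irrigation technologies": (0.10, 0.60),
    "zinc-finger nucleases technologies": (0.05, 0.50),
    "pruning technologies": (0.80, -0.30),
}
quadrants = trend_map_quadrants(example)
for term, quadrant in quadrants.items():
    print(term, quadrant)
```
]]></preformatted>
      <p>With these illustrative inputs, a popular and growing term falls into the upper-right quadrant, while a growing but rarely mentioned term falls into the lower-right quadrant of "weak signals".</p>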
      <p>The main limitation of this approach is its high dependence on the marker terms. Some
technologies may never co-occur with “technologicality” terms, meaning the algorithm
will miss them. To overcome this obstacle, future studies will concentrate on
two main points: first, searching for terms that are relatively more specific to patent literature
than to other sources of data (such as scientific publications, media news and
analytical reports) as potentially technical terms; second, using the identified technology terms
as a training sample for machine learning based on word embeddings. In other words, the main
hypothesis for future studies is that terms that are semantically highly similar to
technology terms (based on word2vec, GloVe or other approaches) are also likely to be
candidate names of technologies.</p>
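      <p>The embedding-based hypothesis can be illustrated with a small Python sketch. It assumes that term vectors are already available (for example from word2vec or GloVe) and ranks unseen terms by cosine similarity to the centroid of known technology terms. The three-dimensional vectors are made up for illustration, and the use of a centroid is an assumption of this sketch, not a method stated in the text.</p>
      <preformatted><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def rank_candidates(embeddings, known_technologies, top_n=5):
    """Rank terms outside the seed set by similarity to the seed centroid."""
    seed = centroid([embeddings[term] for term in known_technologies])
    scored = [(term, cosine(vector, seed))
              for term, vector in embeddings.items()
              if term not in known_technologies]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
toy_embeddings = {
    "smart irrigation technologies": [0.90, 0.10, 0.00],
    "DNA sequencing technology": [0.80, 0.20, 0.10],
    "aquaponics": [0.85, 0.15, 0.05],
    "annual report": [0.00, 0.10, 0.90],
}
seeds = {"smart irrigation technologies", "DNA sequencing technology"}
ranked = rank_candidates(toy_embeddings, seeds)
for term, score in ranked:
    print(term, round(score, 3))
```
]]></preformatted>
      <p>Terms that score close to the seed centroid would be reviewed as candidate technology names even when they contain no marker word.</p>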
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>S.W.</given-names>
          </string-name>
          <article-title>Tech mining: exploiting new technologies for competitive advantage</article-title>
          . Vol.
          <volume>29</volume>
          . John Wiley &amp; Sons (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Madani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>'Technology Mining' bibliometrics analysis: applying network analysis and cluster analysis</article-title>
          .
          <source>Scientometrics</source>
          <volume>105</volume>
          (
          <issue>1</issue>
          ),
          <fpage>323</fpage>
          -
          <lpage>335</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bakhtin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Saritas</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <article-title>Tech Mining for Emerging STI Trends Through Dynamic Term Clustering and Semantic Analysis: The Case of Photonics</article-title>
          . In:
          <source>Anticipating Future Innovation Pathways Through Large Data Analysis</source>
          ,
          <fpage>341</fpage>
          -
          <lpage>360</lpage>
          . Springer International Publishing (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Popper</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          iKNOW project, http://www.foresight-platform.eu/wp-content/uploads/2010/06/5.4-Popper_iKNOW_EFP_final.pdf, last accessed
          <year>2017</year>
          /03/15.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Godfray</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beddington</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crute</surname>
            ,
            <given-names>I.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haddad</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
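          <string-name>
            <surname>Lawrence</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muir</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pretty</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,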
          <string-name>
            <surname>Robinson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toulmin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>Food security: the challenge of feeding 9 billion people</article-title>
          ,
          <source>Science Express</source>
          ,
          <volume>327</volume>
          (
          <issue>5967</issue>
          ),
          <fpage>812</fpage>
          -
          <lpage>818</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <collab>The Royal Society</collab>
          .
          <article-title>Reaping the benefits: Science and the sustainable intensification of global agriculture</article-title>
          ,
          <source>RS Policy Document 11/09</source>
          , The Royal Society, London (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Bakhtin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saritas</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chulok</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuzminov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Timofeev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Trend monitoring for linking science and strategy</article-title>
          .
          <source>Scientometrics</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>