<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How to Put Algorithms into Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
Recently, neural networks have achieved remarkable success in many fields [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Many
practical systems for fundamental tasks are built with neural networks. Examples include
image classification, object detection, and image segmentation in computer vision;
language modeling and machine translation in natural language processing; and speech
recognition and synthesis in audio processing. Many of these approaches have become
industry standards, and companies around the world are building products based on this
technology.
      </p>
      <p>
        Successful algorithms for different tasks differ greatly from one another and
required years of research to reach their current level of performance. Constructing
a good algorithm for a new task is often a non-trivial challenge. It also turns out that
networks cannot simply learn from data without exploiting some domain knowledge.
This knowledge is usually encoded, at the very least, in the architecture itself. For example,
convolutional neural networks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] exploit the intuition that translating an object does not
change the object itself, i.e., a cat does not stop being a cat if it is moved to the left.
      </p>
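      <p>The translation intuition can be checked on a toy case. The sketch below is only an illustration under stated assumptions: it uses a 1-D signal with circular boundaries, and the helper names <monospace>correlate_circular</monospace> and <monospace>shift</monospace> are hypothetical, not taken from the talk. Shifting the input shifts the convolutional feature map by the same amount, so a global maximum over the map does not change.</p>
      <preformat>
```python
def correlate_circular(signal, kernel):
    """1-D cross-correlation with wrap-around borders (an assumption of this sketch)."""
    n = len(signal)
    return [sum(kernel[i] * signal[(p + i) % n] for i in range(len(kernel)))
            for p in range(n)]


def shift(signal, s):
    """Move every element of the signal s positions to the right, circularly."""
    n = len(signal)
    return [signal[(p - s) % n] for p in range(n)]


signal = [0, 0, 1, 3, 1, 0, 0, 0]   # a toy "object" in the middle of the signal
kernel = [1, 2, 1]                  # a toy filter

features = correlate_circular(signal, kernel)
features_of_shifted = correlate_circular(shift(signal, 2), kernel)

# Equivariance: filtering the shifted input equals shifting the filtered input.
print(features_of_shifted == shift(features, 2))
# Invariance after global max pooling: the strongest response is unchanged.
print(max(features_of_shifted) == max(features))
```
      </preformat>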
      <p>
        At the same time, in many domains we already have powerful algorithms that do a
decent job. It is a very natural idea to exploit them to construct better networks. We can
look at this from two sides. On the one hand, this means constructing new layers or blocks
of layers for networks. On the other hand, this means making algorithms trainable. In
either case, the aim is to take the best of both worlds. This direction has been around since
the 1990s [
        <xref ref-type="bibr" rid="ref1 ref3 ref5">1,3,5</xref>
        ], but for a long time it did not receive significant attention (and neither did neural
networks).
      </p>
      <p>In this talk, we will review three ways to combine algorithms and networks (see
Fig. 1):
        <list list-type="order">
          <list-item><p>structured pooling: an algorithm is used to select the active features (similarly to max
pooling);</p></list-item>
          <list-item><p>unrolling iterations into layers: an algorithm simply becomes a part of the network;</p></list-item>
          <list-item><p>analytical derivative w.r.t. the algorithm input, i.e., building a layer with a special
backward operator.</p></list-item>
        </list>
      </p>
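      <p>As a rough illustration of the first approach, the sketch below treats the Viterbi algorithm as a pooling layer over all labelings of a chain. This is an assumption made for illustration, not necessarily the construction used in the talk, and the names <monospace>viterbi</monospace>, <monospace>structured_pool_forward</monospace>, and <monospace>structured_pool_backward</monospace> are hypothetical. The forward pass saves the best labeling; the backward pass needs only that saved result and routes the incoming gradient to the selected scores, in the same way max pooling routes it to the maximal element.</p>
      <preformat>
```python
def viterbi(unary, pair):
    """Return (best_score, best_path) of a chain model.

    unary[t][k]: score of label k at position t (e.g. produced by a network);
    pair[j][k]:  score of label j being followed by label k.
    """
    K = len(unary[0])
    score = list(unary[0])
    back = []                                  # back-pointers for positions 1..T-1
    for row in unary[1:]:
        prev, score, ptr = score, [], []
        for k in range(K):
            j = max(range(K), key=lambda j: prev[j] + pair[j][k])
            ptr.append(j)
            score.append(prev[j] + pair[j][k] + row[k])
        back.append(ptr)
    last = max(range(K), key=lambda k: score[k])
    path = [last]
    for ptr in reversed(back):                 # follow back-pointers to the start
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path


def structured_pool_forward(unary, pair):
    """Forward pass: the output is the best score; the path is the saved result."""
    return viterbi(unary, pair)


def structured_pool_backward(path, num_labels, grad_out=1.0):
    """Backward pass: uses only the saved path, not the scores themselves."""
    g_unary = [[0.0] * num_labels for _ in path]
    g_pair = [[0.0] * num_labels for _ in range(num_labels)]
    for t, k in enumerate(path):
        g_unary[t][k] = grad_out               # gradient reaches selected entries only
    for a, b in zip(path, path[1:]):
        g_pair[a][b] += grad_out               # a transition may be used several times
    return g_unary, g_pair


# Toy input: three positions, two labels (values are made up for illustration).
unary = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.0]]
pair = [[0.0, -1.0], [-1.0, 0.0]]
best, path = structured_pool_forward(unary, pair)
g_unary, g_pair = structured_pool_backward(path, num_labels=2)
print(best, path, g_unary, g_pair)
```
      </preformat>
      <p>Because the maximum is piecewise linear in the scores, the gradient computed this way is a subgradient at points where two labelings tie.</p>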
      <p>To illustrate all the approaches, we will use a running example, a simplified
handwriting recognition task: recognize a word given a sequence of images where each
image shows exactly one letter.</p>
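      <p>For a task of this shape, a layer with an analytical derivative w.r.t. the algorithm input might look like the sketch below. It rests on assumptions that are not part of the original text: each letter image is already mapped to per-letter scores <monospace>unary</monospace>, neighbouring letters interact through scores <monospace>pair</monospace>, and the layer outputs the log-partition function computed by the forward algorithm. Under these assumptions the derivative of the output w.r.t. <monospace>unary</monospace> is known in closed form (the marginal probabilities), so the backward operator can be written by hand as a forward-backward pass. All function names are hypothetical.</p>
      <preformat>
```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_messages(unary, pair):
    """alphas[t][k]: log-sum of exp(score) over all prefixes ending in label k at t."""
    K = len(unary[0])
    alphas = [list(unary[0])]
    for row in unary[1:]:
        prev = alphas[-1]
        alphas.append([row[k] + logsumexp([prev[j] + pair[j][k] for j in range(K)])
                       for k in range(K)])
    return alphas


def log_partition(unary, pair):
    """Layer output: log of the sum of exp(score) over all labelings."""
    return logsumexp(forward_messages(unary, pair)[-1])


def log_partition_grad(unary, pair):
    """Hand-written backward operator: d log_partition / d unary[t][k]."""
    T, K = len(unary), len(unary[0])
    alphas = forward_messages(unary, pair)
    betas = [[0.0] * K]                        # message at the last position
    for t in range(T - 1, 0, -1):
        nxt = betas[0]
        betas.insert(0, [logsumexp([pair[j][k] + unary[t][k] + nxt[k]
                                    for k in range(K)])
                         for j in range(K)])
    log_z = logsumexp(alphas[-1])
    # The derivative equals the marginal probability of label k at position t.
    return [[math.exp(alphas[t][k] + betas[t][k] - log_z) for k in range(K)]
            for t in range(T)]


# Toy input: three letter positions, two possible letters (made-up scores).
unary = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.0]]
pair = [[0.0, -1.0], [-1.0, 0.0]]
print(log_partition(unary, pair))
print(log_partition_grad(unary, pair))
```
      </preformat>
      <p>A quick sanity check for such a hand-written backward operator is to compare it with finite differences of the forward pass; each row of the returned gradient should also sum to one.</p>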
      <p>Acknowledgments. Supported by RSF project 19-71-30020.</p>
      <p>[Fig. 1, surviving labels. Panel 1, structured pooling: forward pass (algorithm, saved result); backward pass, which needs only the result of the algorithm (loss, gradient). A further panel: algorithm, analytical/algorithmic differentiation.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Cun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Global training of document processing systems using graph transformer networks</article-title>
          .
          <source>In: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep Learning</article-title>
          . MIT Press (
          <year>2016</year>
          ), http://www.deeplearningbook.org
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Haffner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Gradient-based learning applied to document recognition</article-title>
          .
          <source>Proceedings of the IEEE</source>
          <volume>86</volume>
          (
          <issue>11</issue>
          ),
          <fpage>2278</fpage>
          -
          <lpage>2324</lpage>
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Deep learning</article-title>
          .
          <source>Nature</source>
          <volume>521</volume>
          (
          <issue>7553</issue>
          ),
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chopra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A tutorial on energy-based learning</article-title>
          .
          <source>Predicting structured data 1 (0)</source>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>