<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Inlined Code Generation for Smalltalk</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Franklin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dave Mason</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Toronto Metropolitan University</institution>
          ,
          <addr-line>Toronto</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper we present our early work on improving Smalltalk performance by inlining message sends during compilation. Smalltalk developers typically write small method bodies with one or two statements; this limits a compiler's ability to perform many optimizations, e.g. common sub-expression elimination. Inlining messages into a method body produces methods with fewer message sends and more statements, allowing the compiler to optimize and generate efficient executable code. There are several challenges to inlining messages in Smalltalk that need to be resolved, such as detecting cycles in the call graph, resolving method arguments to their equivalents in the calling method, and handling non-local returns in block arguments. In this paper, we describe the inlining approach taken for the Zag Smalltalk compiler, which resolves these issues and improves performance.</p>
      </abstract>
      <kwd-group>
        <kwd>inlining</kwd>
        <kwd>Smalltalk</kwd>
        <kwd>method dispatch</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
<p>Inlining is not a new compiler optimization, but it is rare for dynamic languages like Smalltalk where
the receiver class of a message send is not statically known. Dynamic programming languages provide
incredible flexibility to the programmer. However, this flexibility comes at the cost of message sends that
require significantly more overhead than calls in static languages, both to look up the method to be invoked
and to set up the stack for the call.</p>
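      <p>To make this overhead concrete, the lookup half of a dynamic send can be sketched as follows (an illustrative Python sketch with hypothetical classes and selectors, not any particular Smalltalk VM's implementation): every send must search the receiver's class chain for the selector before the call can begin.</p>
      <p>
```python
# Minimal sketch of dynamic method lookup (hypothetical classes and
# selectors): each send must search the receiver's class chain at runtime,
# work that a statically bound call does not pay for.

class SmallClass:
    def __init__(self, name, superclass=None):
        self.name = name
        self.superclass = superclass
        self.methods = {}          # selector -> callable

def lookup(cls, selector):
    """Walk the superclass chain until the selector is found."""
    while cls is not None:
        if selector in cls.methods:
            return cls.methods[selector]
        cls = cls.superclass
    raise AttributeError(f"doesNotUnderstand: {selector}")

object_cls = SmallClass("Object")
integer_cls = SmallClass("Integer", superclass=object_cls)
object_cls.methods["isNil"] = lambda receiver: False
integer_cls.methods["negated"] = lambda receiver: -receiver

# Every send pays for this lookup before the method body runs:
print(lookup(integer_cls, "negated")(5))   # found directly on Integer
print(lookup(integer_cls, "isNil")(5))     # found by walking up to Object
```
      </p>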
      <p>
        The dynamic programming language Self [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] implements a type inference mechanism to find all
possible types for the receiver of a message send. The compiler uses the derived types to compile a method
multiple times, once for each of the different types derived for the receiver. At runtime, knowing the type
of the receiver, Self can then call the appropriate type-constrained compiled method.
      </p>
      <p>Most Smalltalk compilers inline a fixed set of messages, like ifTrue:ifFalse:, to:do: and
whileTrue: and their variants. These special selectors are hot spots in the code with known
implementers that can be inlined safely for significant performance improvements. While this is fairly
effective, numerous opportunities for inlining are missed by this heuristic: there are 43
enumerating methods on collections, and often only a few are inlined.</p>
      <p>We will use the fibonacci code in Listing 1 for examples in this paper.</p>
      <p>fibonacci
    self &lt;= 2 ifTrue: [ ^ 1 ].
    ^ (self - 1) fibonacci + (self - 2) fibonacci</p>
      <p>Listing 1: Smalltalk Source for fibonacci</p>
      <sec id="sec-2-1">
        <title>2.1. Stack</title>
        <p>
          The Zag runtime makes use of a separate per-process stack, instead of the hardware stack, to manage
access to the variables of a program. The stack is where the method arguments, locals, and temporaries
are stored, along with a reference to self, which facilitates message sends and assignments. To prepare
for the execution of a message send, the target and arguments are found on the stack and pushed onto
the top of the stack. A dynamic dispatch is then performed, which consumes the newly added elements
on the stack and adds a return value to the stack. Programmers often make use of the double dispatch
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] pattern, for example when visiting tree structures. If double dispatch is detected, a simple stack
optimization can be performed.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>
        Zag Smalltalk [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] uses a stack-based runtime with threaded instructions or native methods generated
Just In Time with LLVM[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>3.1. Zag Code Generation</title>
        <p>Zag Smalltalk stores all methods as Abstract Syntax Trees (ASTs). When a method is first required to
be executed, it must be converted from an AST to either a threaded method or a JITed method. It is
beyond the scope of this paper to discuss the decisions about when a JITed version is required.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Abstract Syntax Tree</title>
          <p>The AST has Method and Block objects, which both have bodies composed of a sequence of expressions,
optionally terminated with a return. Expressions include: variable Assign, Array, Send/Cascade, Literal,
variable Reference, Self/Super, and ThisContext. This is a simplified version of the Pharo AST, and can
easily be generated from Smalltalk source, or by walking a Pharo AST.</p>
          <p>Regardless of whether generating for a threaded or native target, the AST compiler works in two
stages: conversion to basic blocks, followed by target code generation.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. AST to Basic Blocks</title>
          <p>The first stage converts the AST into a series of basic blocks, made up of a sequence of stack operations
(sends, pushes, pops, returns, etc.). Each basic block ends with a send, return, branch, or class switch.
Part of this stage is an optional target-independent inlining of methods and blocks. As part of inlining,
many of the sends may be replaced with switches based on the class of the receiver.</p>
          <p>The first basic block is a MethodBlock. A MethodBlock sets up a compile-time stack that represents the
self, parameters, and locals of the method (or BlockClosure). The basic conversion outputs a sequence
of pushes and pops, mimicking the runtime stack as well as adding the corresponding operations to the
current basic block. When a send or class switch is required it forces the creation of a new basic block.</p>
          <p>If this was a send, then the new basic block will be a ReturnBlock. Because the send will execute
code we know nothing about, any and all values must be represented on the stack or the heap. These are
exactly the semantics you would expect from the stack operations, but as we will see below, this requires
particular attention when generating LLVM code. A ReturnBlock will be initialized with a stack as it
would be on return from the message send, that is, with the stack as it was just before the send, but
with the receiver and parameters of the send replaced by a result value. Then we resume the basic
conversion.</p>
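          <p>As a sketch of the stack bookkeeping just described (illustrative Python, not Zag's actual data structures): a send with n arguments consumes the receiver plus n argument slots, and the ReturnBlock's initial stack is the same stack with those slots replaced by a single result slot.</p>
          <p>
```python
# Illustrative model of the compile-time stack around a send (not Zag's
# actual representation). A send with n arguments consumes n+1 slots
# (receiver plus arguments); the ReturnBlock models one result slot instead.

def stack_after_send(stack, num_args, result_slot="result"):
    """Return the stack as a ReturnBlock would model it after the send."""
    consumed = num_args + 1                 # receiver plus arguments
    assert len(stack) >= consumed, "send needs receiver and arguments"
    return stack[:-consumed] + [result_slot]

# Stack just before the unary send `(self - 1) fibonacci` in Listing 1:
# locals below, the receiver on top.
before = ["self", "temp", "(self - 1)"]
after = stack_after_send(before, num_args=0)   # unary send: receiver only
print(after)  # ['self', 'temp', 'result']
```
          </p>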
          <p>Figure 1 shows a version of the fibonacci method on Integer with no inlining. Note that there are
many different ReturnBlocks and an expensive send that returns to each.</p>
          <p>Figure 2 shows a version of the fibonacci method on Integer with extensive inlining. Note that
there are now only two sends, and hence only two ReturnBlocks; most sends have been recognized as the
primitives they correspond to when inlined, and the result of the comparison is known to be
boolean, so we can do a class switch on True and False.</p>
          <p>If this was a class switch, the new basic blocks will be InlineBlocks. Each of the new basic blocks
will have a stack initialized from the stack at the point of the creation of the class switch. Because of the
structural nature of the replacement of the send by a class switch, all of the various paths will eventually
come to a common InlineBlock, which will have a stack the same as if the send were still in place, that
is, with the receiver and parameters replaced by a result. A path that is a sequence of InlineBlocks is
highly advantageous for a target like LLVM, because they will all be part of a single function and values
do not need to be actually saved to memory, as we will see below. If, along any of the paths, an actual send
is required because there is no safe inlining that can be done, a ReturnBlock may be needed, in which
case the common block will be converted to a ReturnBlock, at some performance cost for a target such
as LLVM.</p>
          <p>After all the basic blocks have been generated, the data-flow graph is updated for the basic blocks, if it
is required by the target code generator. This integrates the requires/provides values of the various blocks.
We iterate over all the blocks, taking the values that are required by a given basic block and telling
all of the blocks that transfer to it that it needs those values. This may make those basic blocks now
require the values, so we continue the iteration until we reach a fix-point, where nothing is changing.
This analysis can reduce the memory traffic for native targets such as LLVM.</p>
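          <p>The fix-point iteration described above can be sketched as a standard backward data-flow pass (illustrative Python; the block names and the requires/provides sets are hypothetical):</p>
          <p>
```python
# Sketch of the requires/provides fix-point over basic blocks (hypothetical
# graph, not Zag's implementation). A predecessor must supply any value a
# block requires, unless the predecessor itself provides it.

def propagate_requires(blocks, preds):
    """blocks: name -> {'requires': set, 'provides': set};
    preds: name -> list of predecessor block names."""
    changed = True
    while changed:                      # iterate until nothing changes
        changed = False
        for name, info in blocks.items():
            for p in preds.get(name, []):
                # Values this block needs that the predecessor neither
                # provides itself nor already passes along.
                missing = (info["requires"]
                           - blocks[p]["provides"] - blocks[p]["requires"])
                if missing:
                    blocks[p]["requires"] |= missing
                    changed = True
    return blocks

blocks = {
    "entry": {"requires": set(),    "provides": {"self"}},
    "loop":  {"requires": {"i"},    "provides": set()},
    "exit":  {"requires": {"self"}, "provides": set()},
}
preds = {"loop": ["entry"], "exit": ["loop"]}
propagate_requires(blocks, preds)
print(blocks["entry"]["requires"])   # {'i'} - entry must now supply i
```
          </p>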
          <p>
            Method dispatch uses both the class and the selector to find a compiled method in a dispatch hash
table. If a compiled method is not found for the given selector in the class's dispatch table, the method
is compiled from the AST and installed into the dispatch table for the receiver class [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. An advantage of
dynamically dispatched languages like Smalltalk is that the compiler knows up front what class a method is
being compiled for, which means that all sends to self and super are identifiable and can potentially be
inlined. Also, during compilation, any literals found or inferred in the code can be safely typed and the
message send inlined. See the trivial example in Listing 2.
          </p>
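          <p>The miss path of this dispatch can be sketched as follows (illustrative Python; Zag's real per-class hash tables [5] are more involved):</p>
          <p>
```python
# Sketch of (class, selector) dispatch with lazy compile-and-install
# (illustrative only; the selectors and the AST table are hypothetical).

dispatch = {}            # (class name, selector) -> compiled method
asts = {                 # source ASTs, keyed the same way
    ("Integer", "double"): lambda receiver: receiver * 2,
}

def compile_from_ast(cls, selector):
    """Stand-in for AST -> threaded/JITed code generation."""
    return asts[(cls, selector)]

def send(cls, selector, receiver):
    method = dispatch.get((cls, selector))
    if method is None:                       # miss: compile and install
        method = compile_from_ast(cls, selector)
        dispatch[(cls, selector)] = method
    return method(receiver)

print(send("Integer", "double", 21))   # first send compiles and installs
print(send("Integer", "double", 5))    # second send hits the table
```
          </p>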
          <p>Our method of inlining keeps track of the types of variables from assignments and method dispatch
through the stack elements. When a send is encountered, if the class of the target is known, as in the
cases of literals, self, and super, we inline the corresponding method. Alternatively, if the number of
implementers is small enough, the message send is replaced by a class case instruction based on the
possible dynamic classes of the receiver. At runtime the case instruction attempts to match the class
of the receiver; when a match is found, a jump to the correct inlined method is taken. Since the case
was derived from the current image's known implementers, a match should be found at runtime. If the
runtime receiver class does not match any of the classes of the class switch, we default to performing
the message send, which will likely result in a Does Not Understand (DNU) being thrown.</p>
          <p>abc
    ^ self def isNil
def
    ^ 42</p>
          <p>After inlining:
abc
    ^ false</p>
          <p>Listing 2: Inlining self</p>
          <p>To include the statements of the called method in the calling method, references to locals,
parameters, instance variables, class variables, self, or super must be refactored with the following rules:
1. all names are accessed by going up the stack, so names of locals will be found before other names
of enclosing inlining;
2. self and parameters are automatically set up as named versions of the values that were passed
as arguments to the send;
3. the return point for a method needs to be accessible to blocks defined in that method so that
"non-local" returns can simply jump to that exit point;
4. tail-recursive sends are recognized as loops, so writing whileTrue: and related functions as tail
recursive makes them zero cost.</p>
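          <p>The class case instruction described above can be sketched like this (illustrative Python; the selector, class names, and inlined bodies are hypothetical), including the fall-back to a real send when no case matches:</p>
          <p>
```python
# Sketch of the class case instruction that replaces a send (illustrative).
# Each known implementer's inlined body is selected by receiver class; an
# unmatched class falls back to a genuine dynamic send (possibly DNU).

def make_class_case(cases, fallback_send):
    """cases: class name -> inlined body; fallback_send: full dynamic send."""
    def dispatch(receiver):
        body = cases.get(type(receiver).__name__)
        if body is not None:
            return body(receiver)       # jump to the inlined method
        return fallback_send(receiver)  # not in the image's implementers
    return dispatch

def fallback(receiver):
    raise TypeError(f"doesNotUnderstand: #describe ({type(receiver).__name__})")

# Hypothetical implementers of a #describe send:
case = make_class_case(
    {"int": lambda r: f"integer {r}", "str": lambda r: f"string {r}"},
    fallback,
)
print(case(42))    # matches the int case
print(case("hi"))  # matches the str case
```
          </p>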
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and Future Work</title>
      <p>We are currently implementing the inlining described in this paper for the dynamic programming language
Smalltalk. Once completed, we will run benchmarks to confirm that inlining produces higher-performing
code.</p>
      <sec id="sec-4-1">
        <title>Future Work</title>
        <p>We are currently doing nothing to handle code explosion from inlining. We do not anticipate this being
a problem, as (non-tail) recursive calls stop inlining, and inlining seems to create smaller code as often
as it creates larger code. There are many other optimizations we expect to perform, including local
type inference, recognizing inlined methods whose results are discarded, and recognizing more primitive
special cases. We are currently inlining depth-first, which may not be optimal.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mason</surname>
          </string-name>
          ,
          <article-title>Design principles for a high-performance Smalltalk</article-title>
          , in: International Workshop on Smalltalk Technologies,
          <year>2022</year>
          . URL: https://api.semanticscholar.org/CorpusID:259124052.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ungar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chambers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Hölzle</surname>
          </string-name>
          ,
          <article-title>Object, message, and performance: how they coexist in Self</article-title>
          ,
          <source>Computer</source>
          <volume>25</volume>
          (
          <year>1992</year>
          )
          <fpage>53</fpage>
          -
          <lpage>64</lpage>
          . doi:10.1109/2.161280.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bettini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Capecchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Venneri</surname>
          </string-name>
          ,
          <article-title>Translating double dispatch into single dispatch</article-title>
          ,
          <source>Electronic Notes in Theoretical Computer Science</source>
          <volume>138</volume>
          (
          <year>2005</year>
          )
          <fpage>59</fpage>
          -
          <lpage>78</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1571066105051352. doi:10.1016/j.entcs.2005.09.011,
          <source>Proceedings of the Second Workshop on Object Oriented Developments (WOOD 2004)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Baig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mason</surname>
          </string-name>
          ,
          <article-title>Smalltalk JIT compilation: LLVM experimentation</article-title>
          , in: International Workshop on Smalltalk Technologies,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mason</surname>
          </string-name>
          ,
          <article-title>Revisiting dynamic dispatch for modern architectures</article-title>
          ,
          in:
          <source>Proceedings of the 15th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages (VMIL 2023)</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>17</lpage>
          . URL: https://doi.org/10.1145/3623507.3623551. doi:10.1145/3623507.3623551.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>