
Problem

Slow Clojure boot time is frequently one of the highest ranked issues in the annual State of Clojure survey. A separate poll showed that many people are experiencing slow boot times in a variety of circumstances - not just when trying to use Clojure for "scripting", but also when starting a REPL or running a Clojure program in a variety of ways.

Goals

In January 2016, an informal poll collected startup time data from ~100 users. This data was incomplete and anecdotal, but the following summarizes the essence of what was reported:

  • Starting a REPL with lein repl (most frequently reported use case)
    • 50% - reported 2-10 second start times (expected: 1-2 s)
    • 30% - reported 20-30 second start times (expected: 2-5 s)
    • 20% - reported 60+ second start times (expected: 2-15 s)
  • Starting Clojure program with lein run
    • 50% - reported 2-6 seconds (expected: <1 s)
    • rest - reported 10-60 seconds
  • Starting Clojure program with java command
    • 50% reported 2-5 seconds (expected: <1 s)
    • rest - reported 10-60 seconds (expected: 3-5 s)

There are a variety of different use cases where we would like to see progress:

  • Reduce clojure.core startup overhead - this helps everyone, but would be most felt in starting the REPL or running small, short-lived programs
    • Given the 94 ms of Java startup time, we will never compete directly with interpreted langs
    • Currently the Clojure boot time is about 640 ms - a good goal would be 400 ms (in addition to the Java startup time) 
  • Reduce tooling startup time (lein/boot startup)
    • While this has nothing to do with Clojure core, it is nevertheless the biggest component of start time for small projects
    • lein repl should start in 1 second - if we can get JVM/Clojure overhead to < 0.5 s, then this means lein repl has to happen in 0.5 s
    • Starting nrepl is the biggest factor in the startup time
    • Improving startup time helps twice, as lein repl launches a second JVM
  • Reduce time per-ns and per-var to improve large program start time
    • While there is fixed overhead, the majority of the costs scale with the number of namespaces and number of vars
    • The per-var cost is a key factor contributing to startup time for larger apps which load a significant number of vars
      • Lazy vars have the biggest potential here, but also optimizing the process of creating and binding vars
      • We need to cut this cost to the realm of 0.1 ms/var

Raw startups for other dynamic langs (20-30 ms):

Command      Time (s)
ruby -e 0    0.031
python -c 0  0.019
node -e 0    0.027

Timings

The startup time that people experience can be broken into several pieces:

  • JVM startup
  • Clojure runtime startup, biggest pieces are:
    • clojure.lang.RT
    • clojure.core namespace
  • Tooling startup - tools like Leiningen or Boot add significant overhead
    • Starting nrepl server
    • Connecting to nrepl with client (like reply)
    • Initializing middleware, most critically clojure-complete, which scans jars
  • Program load time
    • Each namespace you load in starting your program has a cost:
      • Compilation (if source)
      • Var loading (per var)
      • Class loading (per function)
    • Whatever actual logic you run at startup time

How does this break down for an example program?

Timings done with:

  • Java 1.8 - Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)
  • Clojure 1.8.0
  • Leiningen 2.6.1
  • Boot 2.5.5
  • Macbook Pro - Quad core 2.7 GHz Intel Core i7, 16 GB 1600 MHz DDR3, SSD
  • Times reported are best of 5 runs

 

ID       Command                                                                            Program                            Time (seconds)  Classes loaded
J-NIL    java -cp java Hello                                                                Java empty main                    0.094           423
C-NIL    java -cp $CLOJURE18 clojure.main -e nil                                            Clojure eval nil                   0.73            1996
C-REPL   echo | time java -cp $CLOJURE18 clojure.main                                       Clojure REPL start/stop            0.89            2389
C-RUN    java -cp $CLOJURE18:src clojure.main -e "(require 'hello.main) (hello.main/hi)"    Run function in a ns               0.80            2078
C-RUN-2  java -cp $CLOJURE18:src clojure.main -e "(require 'hello.main2) (hello.main2/hi)"  main2 loads 2nd ns with 100 defns  0.87            2184
C-AOT    java -cp $CLOJURE18:target/classes hello.core                                      Run empty fn with AOT              0.74            1997

By diffing the times above, we can estimate these costs:

  • JVM startup = 90 ms (423 classes) - we can take this as a minimum bar
  • Clojure runtime startup = 640 ms (1573 classes) - C-NIL vs J-NIL
  • Clojure REPL startup = 160 ms (393 classes) - C-REPL vs C-NIL
  • Loading 1 namespace with 1 defn = 70 ms (82 classes) - C-RUN vs C-NIL
  • Requiring namespace containing 100 defns = 70 ms (106 classes) - C-RUN-2 vs C-RUN

If we AOT compile a Clojure namespace and invoke the compiled form directly, we see a reduction of 60 ms and 81 classes. These programs are too simple to evaluate the impact of JIT vs AOT.
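The differencing used for the estimates above can be reproduced with a few lines of Java. This is only an illustrative sketch: the class and method names (StartupDiff, deltaMs, deltaClasses) are made up for this page, and the numbers are copied from the table. Note that C-NIL minus J-NIL comes out at 636 ms, which the text rounds to 640 ms.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: names are illustrative and not part of any Clojure tooling.
class StartupDiff {
    // Difference between two wall-clock times (seconds), rounded to whole ms.
    static long deltaMs(double laterSec, double earlierSec) {
        return Math.round((laterSec - earlierSec) * 1000.0);
    }

    // Difference in the number of classes loaded.
    static int deltaClasses(int later, int earlier) {
        return later - earlier;
    }

    // Print the incremental cost of one run relative to a baseline run.
    static void report(String label, Map<String, double[]> runs,
                       String later, String earlier) {
        double[] a = runs.get(later);
        double[] b = runs.get(earlier);
        System.out.printf("%s: %d ms, %d classes (%s vs %s)%n",
                label, deltaMs(a[0], b[0]),
                deltaClasses((int) a[1], (int) b[1]), later, earlier);
    }

    public static void main(String[] args) {
        // Measurements copied from the table: {seconds, classes loaded}.
        Map<String, double[]> runs = new LinkedHashMap<>();
        runs.put("J-NIL", new double[] {0.094, 423});
        runs.put("C-NIL", new double[] {0.73, 1996});
        runs.put("C-REPL", new double[] {0.89, 2389});
        runs.put("C-RUN", new double[] {0.80, 2078});
        runs.put("C-RUN-2", new double[] {0.87, 2184});
        runs.put("C-AOT", new double[] {0.74, 1997});

        report("Clojure runtime startup", runs, "C-NIL", "J-NIL");
        report("Clojure REPL startup", runs, "C-REPL", "C-NIL");
        report("1 namespace with 1 defn", runs, "C-RUN", "C-NIL");
        report("Namespace with 100 defns", runs, "C-RUN-2", "C-RUN");
        report("Saved by AOT", runs, "C-RUN", "C-AOT");
    }
}
```

Since each input is a best-of-5 wall-clock time, differences of a few tens of ms should be read as rough estimates rather than precise costs.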

Leiningen overhead

We can also look at the lein repl overhead for the commands above.

ID       Command                     Program                            Time (seconds)  Classes loaded  Vs C- (sec)  Vs C- (classes)
L-REPL   echo | time lein repl       Clojure REPL start/stop            4.54            3383            0.89         2389
L-RUN    lein run -m hello.main/hi   Run function in a ns               2.50            2104            0.80         2078
L-RUN-2  lein run -m hello.main2/hi  main2 loads 2nd ns with 100 defns  2.54            2306            0.87         2184
L-AOT    lein run -m hello.core      run AOT fn                         2.51            2185            0.74         1997

When running Leiningen repl, we see 3.65 sec of additional time and almost 1000 additional classes loaded.

When running Leiningen run, we see about 1.7 to 1.8 s of additional time added.

Both of these commands are using nrepl. Looking closer at the nrepl startup, we see that it takes:

  • 1.6 s to start the nrepl server (this involves launching a second JVM)
  • 0.2 s to start the nrepl client

These cannot currently be parallelized due to race conditions in nrepl-ack (according to the code).

There are other options for starting lein repl faster with fast trampoline. You can use this as follows (all times are after initial run which caches classpath etc):

ID         Command                                                            Time (seconds)  Vs L- (sec)
LFT-REPL   LEIN_FAST_TRAMPOLINE=y echo | lein trampoline run -m clojure.main  2.42            4.54
LFT-RUN    LEIN_FAST_TRAMPOLINE=y lein trampoline run -m hello.main/hi        0.81            2.50
LFT-RUN-2  LEIN_FAST_TRAMPOLINE=y lein trampoline run -m hello.main2/hi       0.83            2.54
LFT-AOT    LEIN_FAST_TRAMPOLINE=y lein trampoline run -m hello.core           0.80            2.51

Note: LFT-REPL uses the Clojure REPL, not lein repl.

These run times are essentially as fast as not using lein at all after the first invocation. However, most people don't do this because this is more complicated.

clojure.core loading

clojure.core defines all of the core vars in the Clojure language. clojure.core is actually split up across a number of files, many of which are loaded from core.clj. We can put some numbers on the time to load each part:

part                      ms
core, pt 1                216
core_proxy                8
core_print                23
genclass                  12
core_deftype              15
core/protocols            24
gvec                      10
instant                   21
uuid                      6
core, pt 2                79
core, pt 3 - datareaders  8

gvec is initialized during load but is not actually used by anything during load - this startup could potentially happen in parallel or lazily. This would save at most 10 ms.

instant and uuid are fairly independent chunks of code that don't need to be loaded until the data readers are read at the very end of core - they could potentially be loaded in parallel as well. This would save at most 21 ms though.

Clearly though the majority of these times correspond to loading each var in core - there is a per-var cost on the order of a few tenths of a ms.

RT loading

clojure.lang.RT is the other major component of Clojure runtime load time. How does it break down?

TODO

Per-var load time

It is common for people to report server load times in the order of 10s of seconds with a total load time over a minute. While there is likely little to do with respect to an application's logic, there is a per-var cost (due to definition, initialization, and classloading). Could dramatic improvements to per-var loading make significant changes to load time? Lazy var loading is one approach to this, but there are potentially other improvements that could help as well (optimizing Var.bindRoot or even using something lighter weight than vars).

Examining an example program

This section analyzes a larger "real" program, the Luminus Guestbook example app (forked here to maintain a stable build with some mods). This is a fairly typical Clojure web app using many of the most popular libraries for its implementation. The app was modified to System/exit as soon as the main starts so we are primarily measuring boot time.

1) What are some timings for various ways to run the guestbook app?

ID           Command                                                Description                    Time (s)  Classes loaded
GB-C-JIT     time java -cp `cat cp` clojure.main -m guestbook.core  no lein, no aot                9.57      10494
GB-C-AOT     time java -jar target/guestbook.jar                    no lein, aot                   4.67      7344
GB-C-AOT-DL  time java -jar target/guestbook.jar                    no lein, aot + direct linking  4.59
GB-L-JIT     time lein run                                          lein, no aot                   13.43     10550
GB-L-AOT     time lein run                                          lein, aot                      6.98      7482
GB-L-REPL    echo | time lein repl                                  repl                           13.99     10508

2) For non-AOT, is reading/compilation a dominant factor?

Yes - about half the time is spent in compilation (based on comparison to AOT).

3) Do lazy vars improve AOT times? 

As tested, this affects both Clojure jar itself and the guestbook uberjar.

ID             Command                              Description         Time (s)             Classes loaded
GB-C-AOT-LAZY  time java -jar target/guestbook.jar  no lein, aot, lazy  3.88 (17% improved)  4440
GB-L-AOT-LAZY  time lein run                        lein, aot, lazy     6.51 (7% improved)   4611

Yes - lazy vars improve startup time by about 7-17% in these measurements.

4) What is the Lein overhead?

The lein run overhead with AOT was 2.31 s, slightly bigger than we saw in L-AOT.

The lein run overhead with no AOT was 3.86 s, bigger than we saw in L-RUN and L-RUN-2 (overhead was about the same there). 

5) For Lein, is some kind of classpath calculation caching worth doing?

Lein fast trampoline will cache the command startup and avoid the first lein execution.

Lein run AOT + LEIN_FAST_TRAMPOLINE: 7.00 s (about the same) - surprised this wasn't a couple seconds faster

Lein repl + LEIN_FAST_TRAMPOLINE:  8.58 s (39% improved) - why is this so much better than the lein run?

6) Why are lein repl times so much bigger than lein repl on a simple project and for lein run?

TODO

7) How big an impact is AOT compiling lein dependencies (like tools.nrepl)?

TODO

8) How big an impact is AOT compiling project dependencies?

TODO

Techniques for faster loads

This section discusses techniques available now to reduce load times.

JVM Args

The following JVM startup args may improve startup performance (note that some of these have important tradeoffs though):

  • -client -XX:+TieredCompilation -XX:TieredStopAtLevel=1 
    • These settings use the client compiler and only the first-tier compiler, favoring faster start time.
    • These settings prevent the JIT compiler from reaching the highest level of performance over time, so you should only use them when running a REPL or short-running apps. Long-running server programs should not use these settings.
  • -Xverify:none
    • Suppresses the bytecode verifier (which speeds classloading). 
    • Enabling this option introduces a security risk from malicious bytecode, so should be carefully considered.
  • -XX:+AggressiveOpts
    • This option enables optimizations that are not yet enabled by default but will be in the next version of the JDK. Generally this option improves performance for any program and is safe to use.

You should also consider your heap settings. If you can anticipate the max heap you expect to use then setting the min and max heap to that value will allow the JVM to do a single memory allocation rather than growing the heap during startup. Some experimentation may be required to determine the optimal values for this. For example:  -Xms256m -Xmx256m
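When experimenting with these flags, it can help to confirm from inside the JVM which arguments were actually received and what heap sizes resulted. The following is a minimal sketch using only standard JDK management APIs; the class name JvmStartupInfo and its helper methods are made up for this page.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;
import java.util.List;

// Sketch only: prints the JVM flags and heap sizes seen by the running process.
class JvmStartupInfo {
    // Arguments passed to the JVM itself (e.g. -Xms256m), not to main().
    static List<String> jvmArgs() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    // Approximate wall-clock time elapsed since the JVM was started.
    static long msSinceJvmStart() {
        RuntimeMXBean rt = ManagementFactory.getRuntimeMXBean();
        return System.currentTimeMillis() - rt.getStartTime();
    }

    public static void main(String[] args) {
        Runtime r = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        System.out.println("JVM args: " + jvmArgs());
        System.out.println("Heap currently committed (MB): " + r.totalMemory() / mb);
        System.out.println("Heap max (MB): " + r.maxMemory() / mb);
        System.out.println("ms since JVM start: " + msSinceJvmStart());
    }
}
```

Running it as, for example, java -Xms256m -Xmx256m JvmStartupInfo should show both flags in the argument list. The reported heap sizes may differ slightly from the requested values depending on the collector in use.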

AOT compilation

When starting a Clojure program, any Clojure code that is loaded from a source file (.clj or .cljc) must be compiled. The Clojure compiler is fast, but it's a significant source of load time. Consider AOT compilation any time you expect to start the same instance of a program many times (without changing the source).

Most build tools (lein, boot, etc) provide the ability to compile your Clojure source ahead-of-time (AOT) to .class files. AOT is transitive (all source namespaces needed will be compiled). However, distributing AOT compiled libraries that contain compiled versions of dependencies is a bad idea and will create version problems for downstream consumers. For this reason, most libraries are distributed as source, and this is recommended. It is typically best to AOT compile when you build a final application (when creating an uberjar or war for deployment, for example).

The default Clojure jar is distributed in AOT compiled form.

Eliding metadata

When compiling, each function is compiled into a class. Metadata (like the docstring) is included into that compiled class file and stored as a constant in the constant pool. Metadata like the docstring is typically not used except when developing at the REPL. The Clojure compiler can be instructed to elide that meta, removing it from the compiled class. This reduces class size and improves classloading.

Using a Clojure jar compiled with metadata elided reduces load time 5-15 ms. For large AOT'ed Clojure apps with a lot of docstrings, there could be some impact as well, but it's estimated that the impact is small.

Direct linking

Var invocation involves looking up a var, retrieving its function, and invoking it. Direct linking shortcuts this by compiling into a direct static invocation of the function class. When direct linking is enabled, the Clojure compiler is able to avoid initializing many of the vars, so the class size is reduced and classloading is improved. Since Clojure 1.8, the default clojure jar is compiled with direct linking.


Small improvements

This section describes smaller fixes or enhancements that can be done to reduce load time. Most of these are linked to tickets with patches.

Delay socket server loading

In 1.8, new code was added to runtime startup to check for socket server Java system properties and start a socket server for each one found. However, this code is causing several namespaces to be loaded (clojure.core.server, clojure.edn) even when no socket servers are defined (the common case).

A patch to address this issue is available at CLJ-1891. Applying the patch reduces Clojure core start time by about 20 ms.

Reduce `refer` overhead

Each time a new namespace is loaded, it will "refer" external vars into the namespace. This happens for clojure.core automatically and will also happen due to the use of `use` or `require` with `refer`. The current implementation of refer is not very efficient - it builds large intermediate maps, traverses all vars even when only a few are refer'ed, and cas'es each new var into the namespace individually. The patch at CLJ-1730 addresses the worst of these issues while still taking a relatively conservative approach in the changes. This reduces the cost of refer-clojure (done on the load of every new ns) by about 50% and the time for :refer :only by as much as 90%. However, this is only a small percentage of typical load times, which are dominated by var initialization and classloading.

Check elide-meta in add-doc-and-meta

The add-doc-and-meta macro is used to attach docs later in clojure.core load, but it does not currently check whether the doc meta will be elided. A check in the macro could turn these into no-ops, saving some time.

Optimize Var.bindRoot()

This is called close to 1000 times on clojure.core start - it could potentially be optimized wrt watches, validation, and meta (optimize alterMeta wrt clearing macro flag).

Reduce RT and clojure.core load time

There are likely a number of changes that could be made in RT (and clojure.core) initialization to reduce load times:

  • When creating symbols and keywords, call the two arg fn with nil namespace, rather than the single arg form (which must analyze the string)
    • Symbol.intern("foo") should be Symbol.intern(null, "foo")
  • Intern common strings like: clojure.core, column, arglists, x, tag, &, coll, etc
  • When loading auto-imported java.lang classes, don't load the classes, just get the unloaded class instance
  • Remove {:static true} meta - no longer used
  • Remove {:added "1.0"} and make that the default "added" assumption.
  • Parallelize loading of gvec, instant, and uuid - these are independent of each other and don't interact much with the rest of core.

These need further testing to determine whether they are worth doing.
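To illustrate the first item in the list above, here is a simplified stand-in for a namespaced symbol. It is not clojure.lang.Symbol; the class SymbolSketch is hypothetical and only shows why a single-arg intern has to scan the string for a namespace separator while a two-arg intern can skip that work.

```java
// Simplified stand-in for a namespaced symbol; NOT clojure.lang.Symbol.
final class SymbolSketch {
    final String ns;   // null when the symbol has no namespace
    final String name;

    private SymbolSketch(String ns, String name) {
        this.ns = ns;
        this.name = name;
    }

    // Single-arg form: must scan the string for a '/' separator first.
    static SymbolSketch intern(String nsname) {
        int i = nsname.indexOf('/');
        if (i == -1 || nsname.equals("/")) {
            return new SymbolSketch(null, nsname);
        }
        return new SymbolSketch(nsname.substring(0, i), nsname.substring(i + 1));
    }

    // Two-arg form: the caller already knows the parts, so no scan is needed.
    static SymbolSketch intern(String ns, String name) {
        return new SymbolSketch(ns, name);
    }

    public static void main(String[] args) {
        SymbolSketch a = intern("foo");                // scans "foo" for '/'
        SymbolSketch b = intern(null, "foo");          // no scan
        SymbolSketch c = intern("clojure.core/map");   // scan, then two substrings
        System.out.println(a.ns + " " + a.name);
        System.out.println(b.ns + " " + b.name);
        System.out.println(c.ns + " " + c.name);
    }
}
```

The saving per call is tiny, so whether this matters depends on how many symbols and keywords are created during startup, which is part of what needs testing.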

 

Big Improvements

This section discusses larger changes that could have a more significant impact on start time.

Reduce reader time

For non-AOT (common during dev/repl), reading is a significant time factor.

Reduce compile time

For non-AOT (common during dev/repl), compiling is a significant time factor.

Lazy vars

Most of this time is taken in loading and initializing vars, largely functions. The time is spent in classloading the function under the var, loading metadata from the constant pool, and initializing the vars.

Many of these vars are not actually needed to start a program - clojure.core in Clojure 1.8 has 725 interned vars but a large number of them are unused for most programs at startup time. Delaying the loading of these vars until they are needed would yield significant performance gains. Attached to this page is a patch (lazyvars.diff) adapted from Rich's prior work on the fastload branch of Clojure.

 

ID       Before patch (s)  Before classes  After patch (s)  After classes
C-NIL    0.73              1996            0.53 (-27%)      1175 (-41%)
C-REPL   0.89              2389            0.64 (-28%)      1350 (-43%)
C-RUN    0.80              2078            0.60 (-25%)      1271 (-39%)
C-RUN-2  0.87              2184            0.68 (-22%)      1378 (-37%)
C-AOT    0.74              1997            0.53 (-28%)      1163 (-42%)
L-REPL   4.54              3383            4.52 (-0.4%)     2454 (-27%)
L-RUN    2.50              2104            2.45 (-2.0%)     1307 (-38%)
L-RUN-2  2.54              2306            2.50 (-1.6%)     1414 (-39%)
L-AOT    2.51              2185            2.44 (-2.8%)     1375 (-37%)

With lazily loaded vars we are seeing a significant reduction in both time and classes loaded for the Clojure runtimes but only slight improvements in the Leiningen runtimes (even though we see similar class reduction). This implies that while lazy vars do make a significant difference in Clojure start time, those gains are dwarfed by other parts of Leiningen start time. 

One downside of lazy vars is that the JIT is not as good at inlining through the lazy var check which makes var indirection slower after the lazy var has been forced. One open question is whether invokedynamic changes this, possibly allowing this to be the default.
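The general shape of a lazily initialized var, including the extra check on every access that the JIT has trouble inlining through, can be sketched in plain Java. This is a hypothetical illustration and not the lazyvars.diff patch: the class LazyVarSketch is made up for this page, and it assumes the initializer never returns null.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a lazily initialized var; not the lazyvars.diff patch.
// Assumes the initializer never returns null.
final class LazyVarSketch<T> {
    private final Supplier<T> init;  // deferred, possibly expensive, initialization
    private volatile T root;         // stays null until the first deref

    LazyVarSketch(Supplier<T> init) {
        this.init = init;
    }

    boolean isRealized() {
        return root != null;
    }

    T deref() {
        T v = root;
        if (v == null) {             // the lazy check paid on every access
            synchronized (this) {
                v = root;
                if (v == null) {     // re-check under the lock so init runs once
                    v = init.get();
                    root = v;
                }
            }
        }
        return v;
    }

    public static void main(String[] args) {
        LazyVarSketch<String> v = new LazyVarSketch<>(() -> {
            System.out.println("initializing (runs once)");
            return "value";
        });
        System.out.println("realized before deref? " + v.isRealized());
        System.out.println(v.deref());
        System.out.println(v.deref());
        System.out.println("realized after deref? " + v.isRealized());
    }
}
```

In this sketch, a var that is never dereferenced never runs its initializer, which is where the startup saving comes from. The cost is the null check that remains on the access path after the value has been forced.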

John Rose on using indy for startup

 

Parallel namespace loading

 


  1. Feb 17, 2016

    Are there any benchmarks of how fast the jvm loads classes? I think it would be interesting to see how long it takes to load a namespace with lots of metadata and vars but not loading any other classes vs. a namespace with all the metadata and vars and functions that cause other classes to be loaded.

    1. Feb 17, 2016

      This is kind of a loaded question, because I wonder if the jvm would load a big class with a bunch of static methods faster than many classes. If that is the case, many clojure functions could be compiled into a static method on the enclosing namespace when aot compiling, with semantics similar to direct linking. This would get rid of the lazy load check that the jvm has trouble inlining. The tradeoff is, if you do need those functions as values invoking is likely to be slower likely through reflection in some kind of wrapper over the static methods. Maybe invokedynamic would solve that (just like lazy loading)

    2. Feb 17, 2016

      This may be common knowledge, but I did a little benchmark and loading one class that causes 1000 classes (from a jar) seems to almost always be faster than loading 1 class with 1000 static methods (with various openjdk versions on linux). I don't know why.

      So lifting fn bodies into static methods doesn't seem like it would improve start up time.

  2. Feb 24, 2016

    It seems like parallel code loading is off the table without a huge amount of work. Loading clojure code is just so free form and permitting of so many effects (some even baked in like defmethod)