Clojure

Improve writeClassFile performance

Details

  • Type: Enhancement Enhancement
  • Status: Closed Closed
  • Priority: Critical Critical
  • Resolution: Completed
  • Affects Version/s: Release 1.3
  • Fix Version/s: Release 1.8
  • Component/s: None
  • Patch:
    Code
  • Approval:
    Ok

Description

When writing class files to disk, Clojure currently calls flush on the FileOutputStream used, and sync on the underlying FileDescriptor for each class file it writes. This ticket proposes to remove the sync call as it provides little value, but has a detrimental effect on performance on some operating systems.

Background

The call to sync in the current implementation of Compiler.writeClassFile was added in SVN-1252 to address an issue where compile allegedly returned but changes to class files weren't visible[1]. The original report lacks detail and it's hard to discern what the actual issue was.

Calling FileDescriptor.sync tells the operating system to flush any buffered data to disk (to its best capability). This means that in the event of a system failure, the kernel buffers will not hold any data that hasn't been persisted to disk, and thus no data would be lost.

However, sync does not provide any special guarantees around visbility. Any data written using to a file descriptor that is subsequently closed is already guaranteed to be visible to all other processes, as soon as the close call returns. When not using sync, the data returned may reside only in kernel buffers, but the reading process will see no difference.

Syncing is an expensive operation and doing so after each class file is written can make compilation significantly slower.

Laurent Petit (who made the original request for this) reports that it would not make any difference for him now in CCW (https://groups.google.com/d/msg/clojure-dev/Vz9h8hqAvFk/cQvip5j_zoQJ).

Proposed solution

Remove the call to FileDescriptor.sync when writing class files to disk.

Latest patch: http://dev.clojure.org/jira/secure/attachment/14214/clj-703-remove-sync-in-write-class-file.patch

Performance implications

The performance gains of removing the sync call will vary depending on OS and file system. Some operating systems provide strong guarantees that your data will actually end up on the disk, others less so. See for example differences between fsync in Linux[2] and OSX[3].

Preliminary tests on Linux have shown compile times going from:

  • aleph.http: 13s -> 4s
  • incanter.core: 14s -> 4s

I've created a test for this and have asked the community for help testing and gathering data: http://goo.gl/forms/0b8SVt2pyN

Will update this ticket once more data is available.

Tradeoffs

In short, this change trades some degree of on-disk consistency for compilation speed. If a fault (process, OS, hardware) occurs whilst compiling, this patch may result in class files not being properly written to disk. This however, is still the case today, depending on the type of fault, the OS you're using and the fsync guarantees provided. If a compilation fails with a fatal error, it is unreasonable to expect that the output is in a consistent state.

[1]: https://groups.google.com/forum/#!topic/clojure/C-DEWzpPPUY
[2]: "Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.": https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/fsync.2.html
[3]: "[...] includes writing through or flushing a disk cache if present." http://man7.org/linux/man-pages/man2/fsync.2.html[2]:

  1. 0001-use-File.mkdirs-instead-of-mkdir-every-single-direct.patch
    04/Jan/11 12:58 PM
    1 kB
    Jürgen Hötzel
  2. 0002-Ensure-atomic-creation-of-class-files-by-renaming-a-.patch
    04/Jan/11 12:58 PM
    1 kB
    Jürgen Hötzel
  3. clj-703-remove-sync-in-write-class-file.patch
    30/May/15 6:24 PM
    0.7 kB
    Ragnar Dahlen
  4. improve-writeclassfile-perf.patch
    25/May/12 4:22 AM
    3 kB
    John Szakmeister
  5. remove-flush-and-sync-only.patch
    06/Oct/14 12:30 PM
    1 kB
    Ragnar Dahlen
  6. remove-sync-only.patch
    08/Oct/14 5:59 AM
    0.7 kB
    Ragnar Dahlen

Activity

Hide
David Powell added a comment -

Removing sync makes clojure build much faster. I wonder why it was added in the first place? I guess only Rich knows? I assume that it is not necessary.

If we are removing sync though, I wouldn't bother with the atomic rename stuff. Doing that sort of thing can cause problems on some platforms, eg with search indexers and virus checkers momentarily locking files after they are created.

The patch seems to be assuming that sync is there for some reason, but my initial assumption would be that sync isn't necessary - perhaps it was working around some issue that no longer exists?

Show
David Powell added a comment - Removing sync makes clojure build much faster. I wonder why it was added in the first place? I guess only Rich knows? I assume that it is not necessary. If we are removing sync though, I wouldn't bother with the atomic rename stuff. Doing that sort of thing can cause problems on some platforms, eg with search indexers and virus checkers momentarily locking files after they are created. The patch seems to be assuming that sync is there for some reason, but my initial assumption would be that sync isn't necessary - perhaps it was working around some issue that no longer exists?
Hide
Jürgen Hötzel added a comment -

Although its unlikely: there is a possible race condition "loading a paritally written classfile"?:

https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/RT.java#L393

Show
Jürgen Hötzel added a comment - Although its unlikely: there is a possible race condition "loading a paritally written classfile"?: https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/RT.java#L393
Hide
John Szakmeister added a comment -

The new improve-writeclassfile-perf version of the patch combines the two previous patches into a single patch file and brings them up-to-date with master. I can split the two changes back out into separate patch files if desired, but I figured out current tooling is more geared towards a single patch being applied.

Show
John Szakmeister added a comment - The new improve-writeclassfile-perf version of the patch combines the two previous patches into a single patch file and brings them up-to-date with master. I can split the two changes back out into separate patch files if desired, but I figured out current tooling is more geared towards a single patch being applied.
Hide
John Szakmeister added a comment -

FWIW, both fixes look sane. The first one is a nice cleanup. The second one is a little more interesting in that uses a rename operation to put the final file into place. It removes the sync call, which does make things faster. In general, if we're concerned about on-disk consistency, we should really have a combination of the two: write the full contents to a tmp file, sync it, and atomically rename to the destination file name.

Neither the current master, nor the current patch will guarantee on-disk consistency across a machine wide crash. The current master could crash before the sync() occurs, leaving the file in an inconsistent state. With the patch, the OS may not get the file from file cache to disk before an OS level crash occurs, and leave the file in an inconsistent state as well. The benefit of the patch version is that the whole file does atomically come into view at once. It does have a nasty side effect of leaving around a temp file if the compiler crashes just before the rename though.

Perhaps a little more work to catch an exception and clean up is in order? In general, I like the patched version better.

Show
John Szakmeister added a comment - FWIW, both fixes look sane. The first one is a nice cleanup. The second one is a little more interesting in that uses a rename operation to put the final file into place. It removes the sync call, which does make things faster. In general, if we're concerned about on-disk consistency, we should really have a combination of the two: write the full contents to a tmp file, sync it, and atomically rename to the destination file name. Neither the current master, nor the current patch will guarantee on-disk consistency across a machine wide crash. The current master could crash before the sync() occurs, leaving the file in an inconsistent state. With the patch, the OS may not get the file from file cache to disk before an OS level crash occurs, and leave the file in an inconsistent state as well. The benefit of the patch version is that the whole file does atomically come into view at once. It does have a nasty side effect of leaving around a temp file if the compiler crashes just before the rename though. Perhaps a little more work to catch an exception and clean up is in order? In general, I like the patched version better.
Hide
Ivan Kozik added a comment - - edited

File.renameTo returns false on (most?) errors, but the patch doesn't check for failure. Docs say "The return value should always be checked to make sure that the rename operation was successful." Failure might be especially likely on Windows, where files are opened by others without FILE_SHARE_DELETE.

Show
Ivan Kozik added a comment - - edited File.renameTo returns false on (most?) errors, but the patch doesn't check for failure. Docs say "The return value should always be checked to make sure that the rename operation was successful." Failure might be especially likely on Windows, where files are opened by others without FILE_SHARE_DELETE.
Hide
Dave Della Costa added a comment -

We've been wondering why our compilation times on linux were so slow. It became the last straw when we walked away from one project and came back after 15 minutes and it was not done yet.

After some fruitless investigation into our linux configuration and lein java args, we stumbled upon this issue via the associated Clojure group thread. Upon commenting out the flush() and sync() lines (https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/Compiler.java#L7171-L7172) and compiling Clojure 1.6 ourselves, our projects all started compiling in under a minute.

Point being, can we at least provide some flag to allow for "unsafe compilation" or something? As it is, this is bad enough that we've manually modified all our local versions of Clojure to work around the issue.

Show
Dave Della Costa added a comment - We've been wondering why our compilation times on linux were so slow. It became the last straw when we walked away from one project and came back after 15 minutes and it was not done yet. After some fruitless investigation into our linux configuration and lein java args, we stumbled upon this issue via the associated Clojure group thread. Upon commenting out the flush() and sync() lines (https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/Compiler.java#L7171-L7172) and compiling Clojure 1.6 ourselves, our projects all started compiling in under a minute. Point being, can we at least provide some flag to allow for "unsafe compilation" or something? As it is, this is bad enough that we've manually modified all our local versions of Clojure to work around the issue.
Hide
Tim McCormack added a comment -

Additional motivation: This becomes really unpleasant on an encrypted filesystem, since write and read latency become higher.

As a partial workaround, I've been using this script to mount a ramdisk over top of target, which speeds up compilation 2-4x: https://gist.github.com/timmc/6c397644668bcf41041f (but removing flush() and sync() entirely would probably speed things up even more, if safe)

Show
Tim McCormack added a comment - Additional motivation: This becomes really unpleasant on an encrypted filesystem, since write and read latency become higher. As a partial workaround, I've been using this script to mount a ramdisk over top of target, which speeds up compilation 2-4x: https://gist.github.com/timmc/6c397644668bcf41041f (but removing flush() and sync() entirely would probably speed things up even more, if safe)
Hide
Ragnar Dahlen added a comment -

I'd like to explore this issue further as I also don't think the flush and sync calls add any value, but do have a severe impact on performance.

To resurrect the discussion, I've attached a new patch with the following approach:

  • create a temporary file
  • write class bytecode to temporary file, no flush or sync
  • close temporary file
  • atomic rename of temporary file to class file name

It is different to previous patches in that:

  • it applies cleanly to master
  • it checks return value from File.renameTo
  • it omits proposed File.mkdirs change as the current implementation is actually converting from an "internal name", where forward slashes are assumed (splits on "/"), to a platform specific path using File.separator. I'm not convinced that the previous patch is safe on all versions of Windows, and I think it's separate from the main issue here.

I opted for the atomic rename of a temp file to avoid leaving empty class files with a correct expected class file name in case of failure.

It is my understanding that this patch will guarantee that:

  • when writeClassFile returns successfully, a class file with the expected name will exists, and subsequent reads from that file name will return the expected bytecode (possibly from kernel buffers).
  • when writeClassFile fails, a class file with the expected name will not have been created (or updated if it previously existed).

Anything preventing the operating system from flushing its buffers to disk (power failure etc) might leave a corrupt (empty, partially written) class file. It's my opinion that that is acceptable behaviour, and worth the performance improvement (I'm seeing AOT compilation reduced from 1m20s -> 22s on one of our codebases, would be happy to provide more benchmarks if requested).

Would be grateful for feedback/testing of this patch.

Show
Ragnar Dahlen added a comment - I'd like to explore this issue further as I also don't think the flush and sync calls add any value, but do have a severe impact on performance. To resurrect the discussion, I've attached a new patch with the following approach:
  • create a temporary file
  • write class bytecode to temporary file, no flush or sync
  • close temporary file
  • atomic rename of temporary file to class file name
It is different to previous patches in that:
  • it applies cleanly to master
  • it checks return value from File.renameTo
  • it omits proposed File.mkdirs change as the current implementation is actually converting from an "internal name", where forward slashes are assumed (splits on "/"), to a platform specific path using File.separator. I'm not convinced that the previous patch is safe on all versions of Windows, and I think it's separate from the main issue here.
I opted for the atomic rename of a temp file to avoid leaving empty class files with a correct expected class file name in case of failure. It is my understanding that this patch will guarantee that:
  • when writeClassFile returns successfully, a class file with the expected name will exists, and subsequent reads from that file name will return the expected bytecode (possibly from kernel buffers).
  • when writeClassFile fails, a class file with the expected name will not have been created (or updated if it previously existed).
Anything preventing the operating system from flushing its buffers to disk (power failure etc) might leave a corrupt (empty, partially written) class file. It's my opinion that that is acceptable behaviour, and worth the performance improvement (I'm seeing AOT compilation reduced from 1m20s -> 22s on one of our codebases, would be happy to provide more benchmarks if requested). Would be grateful for feedback/testing of this patch.
Hide
Ragnar Dahlen added a comment -

We're testing this patch on various projects/platforms at my company. So far we've seen:

  • Significantly reduced compilation times on Linux (two typical examples: 30s to 15s, 1m30s to 30s)
  • No significant change in compilation times on Mac OSX.
  • File.renameTo consistently failing on a Windows machine.

My understanding is that the performance difference between Linux and OSX is due to differences in how these platforms implement fsync. OSX by default does not actually tell the drive to flush its buffers (requires fcntl F_FULLSYNC for this, not used by JVMs) [1], whereas Linux does [2].

Our very limited test shows (as was previously pointed out) that File.renameTo is problematic on Windows. I've attached a new patch that doesn't use rename, and only has the the sync call removed (flush is a no-op for FileOutputStreams). We're currently testing this patch.

The drawback of this patch is that it may leave correctly named, but empty class files if the write fails. One option would be to try and delete the file in the catch block. Personally, I wouldn't expect a compilation that failed because of OS/IO reasons to leave my classfiles in a consistent state.

[1]: "Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.": https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/fsync.2.html

[2]: "[...] includes writing through or flushing a disk cache if present."
http://man7.org/linux/man-pages/man2/fsync.2.html

Show
Ragnar Dahlen added a comment - We're testing this patch on various projects/platforms at my company. So far we've seen:
  • Significantly reduced compilation times on Linux (two typical examples: 30s to 15s, 1m30s to 30s)
  • No significant change in compilation times on Mac OSX.
  • File.renameTo consistently failing on a Windows machine.
My understanding is that the performance difference between Linux and OSX is due to differences in how these platforms implement fsync. OSX by default does not actually tell the drive to flush its buffers (requires fcntl F_FULLSYNC for this, not used by JVMs) [1], whereas Linux does [2]. Our very limited test shows (as was previously pointed out) that File.renameTo is problematic on Windows. I've attached a new patch that doesn't use rename, and only has the the sync call removed (flush is a no-op for FileOutputStreams). We're currently testing this patch. The drawback of this patch is that it may leave correctly named, but empty class files if the write fails. One option would be to try and delete the file in the catch block. Personally, I wouldn't expect a compilation that failed because of OS/IO reasons to leave my classfiles in a consistent state. [1]: "Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.": https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/fsync.2.html [2]: "[...] includes writing through or flushing a disk cache if present." http://man7.org/linux/man-pages/man2/fsync.2.html
Hide
Alex Miller added a comment -

Currently this ticket is not a state where it can be evaluated, and I would like it to get there before we consider tickets for 1.8.

The description for this ticket needs to be something that a screener read to fully understand the problem, proposed solution, performance implications, and tradeoffs. Right now it does not seem up to date with information discussed in comments, which patch should be considered, performance data, and what we might lose by making the change. It would be great if someone could help get this in shape.

Show
Alex Miller added a comment - Currently this ticket is not a state where it can be evaluated, and I would like it to get there before we consider tickets for 1.8. The description for this ticket needs to be something that a screener read to fully understand the problem, proposed solution, performance implications, and tradeoffs. Right now it does not seem up to date with information discussed in comments, which patch should be considered, performance data, and what we might lose by making the change. It would be great if someone could help get this in shape.
Hide
Andy Fingerhut added a comment -

Alex, you will probably get a wider audience of contributors willing to help if you sent an email to clojure-dev every once in a while with a list of tickets you want help updating. Right now I think it is this one and CLJ-1449, but it would be nice if there were a single report link contributors could click on to find out what you are looking for help with.

Normally there is the "Needs Patch" state for the next release, but for CLJ-1449 you are looking for a patch for the next-next release, so "Needs Patch" won't do it.

Perhaps if there were 'standard' labels created for 'Alex requests ticket updates' for such tickets, and filter for them?

Show
Andy Fingerhut added a comment - Alex, you will probably get a wider audience of contributors willing to help if you sent an email to clojure-dev every once in a while with a list of tickets you want help updating. Right now I think it is this one and CLJ-1449, but it would be nice if there were a single report link contributors could click on to find out what you are looking for help with. Normally there is the "Needs Patch" state for the next release, but for CLJ-1449 you are looking for a patch for the next-next release, so "Needs Patch" won't do it. Perhaps if there were 'standard' labels created for 'Alex requests ticket updates' for such tickets, and filter for them?
Hide
Ragnar Dahlen added a comment -

I'd be happy to help improve the state of this ticket. I'm not the original reporter, is it still OK for me to change description etc. to address your concerns?

Show
Ragnar Dahlen added a comment - I'd be happy to help improve the state of this ticket. I'm not the original reporter, is it still OK for me to change description etc. to address your concerns?
Hide
Alex Miller added a comment -

Andy - I will probably do so soon but I thought previous commenters would see this now.

Ragnar - absolutely!

Show
Alex Miller added a comment - Andy - I will probably do so soon but I thought previous commenters would see this now. Ragnar - absolutely!
Hide
Ragnar Dahlen added a comment -

Updated for RC1

Show
Ragnar Dahlen added a comment - Updated for RC1
Hide
Ragnar Dahlen added a comment -

Alex - Updated description, please let me know if I can improve it further.

Show
Ragnar Dahlen added a comment - Alex - Updated description, please let me know if I can improve it further.
Hide
Alex Miller added a comment -

Ragnar - much better - thank you! Could you also add a pointer to the patch for consideration?

Show
Alex Miller added a comment - Ragnar - much better - thank you! Could you also add a pointer to the patch for consideration?
Hide
Ragnar Dahlen added a comment -

Alex - Done

Show
Ragnar Dahlen added a comment - Alex - Done

People

Vote (27)
Watch (11)

Dates

  • Created:
    Updated:
    Resolved: