[CLJS-367] doseq & for macros should support chunked seqs Created: 31/Aug/12  Updated: 27/Jul/13  Resolved: 26/Apr/13

Status: Closed
Project: ClojureScript
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: David Nolen Assignee: Michał Marczyk
Resolution: Completed Votes: 0
Labels: None

Attachments: Text File 0001-CLJS-367-chunked-seq-support-in-for-and-doseq-compre.patch    

Comment by Michał Marczyk [ 06/Apr/13 2:34 PM ]

Commit message:

    CLJS-367: chunked seq support in for, comprehension benchmarks
    Copy & paste from Clojure with small tweaks (-nth instead of .nth on
    chunks, ClojureScript-specific type hints).
    Also fixes a minor bug around argument format verification in for.

for performance on the attached benchmark is dramatically improved by the patch. doseq is unchanged for now, because a similar copy & paste approach makes it slower by more than a factor of 2. I'll try to investigate the reasons for that, but in the meantime the unchunked doseq is still faster than even the chunked for (the doseq benchmark takes around 4x as much time, but runs over 10x as many iterations), and for can already benefit from the present patch.

Comment by Michał Marczyk [ 07/Apr/13 2:05 AM ]

Two extra benchmarks, so that both for and doseq are benchmarked with both a chunked and a non-chunked seq.

I should add that the copy & pasted doseq is substantially faster than the current doseq on chunked seqs; it's on non-chunked seqs that it performs pretty terribly.

Comment by Michał Marczyk [ 07/Apr/13 2:10 AM ]

Hm, I now notice that, rather importantly, for on non-chunked seqs is significantly slower with the patch (although by a smaller factor than doseq). Guess this needs more baking.

Comment by Michał Marczyk [ 07/Apr/13 2:26 AM ]

Here's the patch with the chunk-aware doseq.

By the way: what sort of slowdown for the non-chunked case of chunk-aware comprehensions would be acceptable?

Comment by David Nolen [ 08/Apr/13 4:52 PM ]

What are the differences between the two patches? It's preferable that there not be much difference in performance - how much slowdown are you seeing?

I can give the patches and generated code a look to see if I see any optimization opportunities.

Comment by Michał Marczyk [ 08/Apr/13 9:55 PM ]

I'd love it if you reviewed the code!

As for the patches, one makes for chunk-aware and introduces the new benchmarks; the other does all that and also makes doseq chunk-aware. I don't think either is particularly suitable for inclusion as it stands, so I'll just replace them both with a new one, chunkifying both macros and free of truth_ calls (it turns out I missed one earlier; unfortunately fixing it doesn't make a difference where it really matters, it only makes the fast case slightly faster still). I'll be happy to split them apart in any way which is useful.

The slowdown is large for for and massive for doseq. Here are some benchmark results:

;;; current master
;;; comprehensions
[xs (range 512)], (last (for [x xs y xs] (+ x y))), 10 runs, 4783 msecs
[xs (vec (range 512))], (last (for [x xs y xs] (+ x y))), 10 runs, 3900 msecs
[a (Box. 0) xs (range 512)], (doseq [x xs y xs] (set! a -val (+ (.-val a) x))), 100 runs, 4062 msecs
[a (Box. 0) xs (vec (range 512))], (doseq [x xs y xs] (set! a -val (+ (.-val a) x))), 100 runs, 4486 msecs

;;; latest patch
;;; comprehensions
[xs (range 512)], (last (for [x xs y xs] (+ x y))), 10 runs, 6350 msecs
[xs (vec (range 512))], (last (for [x xs y xs] (+ x y))), 10 runs, 813 msecs
[a (Box. 0) xs (range 512)], (doseq [x xs y xs] (set! a -val (+ (.-val a) x))), 100 runs, 13801 msecs
[a (Box. 0) xs (vec (range 512))], (doseq [x xs y xs] (set! a -val (+ (.-val a) x))), 100 runs, 856 msecs

Comment by David Nolen [ 24/Apr/13 9:38 PM ]

First off, I don't see the very dramatic slowdown, but I do see a slowdown, and this is not desirable. The issue is that the seq and chunked seq cases share the same recursive function, so we actually test whether something is a chunked seq at every step. The problem is that chunked-seq? calls into satisfies?, which is somewhat slow in the false case - we have to check the bitmask and then call into type_satisfies if the bitmask test fails.

So really we should only check once to see whether we have a chunked seq; if we do, we can assume we always have one. Obviously Clojure doesn't have this performance problem with chunked-seq? because this stuff happens via interfaces, not protocols.
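The check-once idea could be sketched roughly like this (a hypothetical helper, not the actual patch or the real doseq expansion; it dispatches on the head of the seq once and then commits to one loop):

```clojure
;; Hypothetical sketch: decide chunkedness up front, then stay in a
;; specialized loop instead of calling chunked-seq? at every step.
;; NB: the chunked path assumes the seq *stays* chunked throughout,
;; which is not guaranteed in general.
(defn sum-seq [coll]
  (let [s (seq coll)]
    (if (chunked-seq? s)
      ;; chunked path: walk whole chunks at a time
      (loop [s s acc 0]
        (if s
          (let [c (chunk-first s)
                n (count c)]
            (recur (chunk-next s)
                   (loop [i 0 acc acc]
                     (if (< i n)
                       (recur (inc i) (+ acc (nth c i)))
                       acc))))
          acc))
      ;; plain path: ordinary first/next loop
      (loop [s s acc 0]
        (if s
          (recur (next s) (+ acc (first s)))
          acc)))))
```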

Comment by David Nolen [ 24/Apr/13 9:43 PM ]

Oh and I do see a very big speedup for chunked seqs, on my machine it's faster than map now!

Comment by Michał Marczyk [ 25/Apr/13 5:45 AM ]

Good to know your measurements are more encouraging. As to the slowdown:

I do realize that checking the seq type at every step of the way is problematic, but I don't know that we can simply not do it. Sequences routinely turn out to be part chunked, part not (e.g. cons a single element onto a vector seq, or concat a non-chunked seq with a chunked seq); Clojure switches processing mode midway in such cases and I thought it would be good to do so in ClojureScript as well. In fact, in the chunked case we simply must do it, because the chunked logic makes no sense for non-chunked seqs; but it's clearly no problem there, given the speedup.
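A small illustration of such a mixed seq (not from the ticket; the indicated results assume vector seqs are chunked in ClojureScript, as on the JVM):

```clojure
;; A vector's seq is chunked, but consing onto it yields a plain Cons:
(chunked-seq? (seq [1 2 3]))          ; true
(chunked-seq? (cons 0 (seq [1 2 3]))) ; false
;; ...yet the rest of that mixed seq is chunked again:
(chunked-seq? (next (cons 0 (seq [1 2 3])))) ; true
```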

I thought that maybe the check could be performed, say, every 8 steps or so (that is, if we start off in the non-chunked case), hoping that this would give most of the benefit for mostly chunked seqs, while only introducing a minor hit for mostly non-chunked seqs. I've played with two patches along these lines with not quite satisfactory results; I was certainly planning to give it another go – do you think this sounds worthwhile?

Of course we could start with a version of the patch which does the straightforward thing (permanently switching to current behaviour at the first sign of non-chunkedness, but still boosting seqs which start off chunked) and then try to figure out if switching midway is possible without too much of a perf hit.

On a side note, I need to upgrade my benchmarking setup, I think it may have been too long since I've upgraded V8... This could explain the larger slowdown that I am seeing.

Comment by David Nolen [ 25/Apr/13 7:04 AM ]

Hmm. You are right. I don't think adding more smarts is the way to go. Until we can figure out some better way to make satisfies? less expensive, I think a simpler approach is best: just make chunked-seq? do instance? checks for the two IChunkedSeq types that we currently have.
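That suggestion would look roughly like this (a sketch, assuming ChunkedCons and ChunkedSeq are the two IChunkedSeq implementors in core.cljs at the time):

```clojure
;; Sketch: replace the satisfies?-based check with direct type checks,
;; which compile down to cheap JS instanceof tests.
(defn ^boolean chunked-seq? [s]
  (or (instance? ChunkedCons s)
      (instance? ChunkedSeq s)))
```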

Comment by Michał Marczyk [ 25/Apr/13 7:20 AM ]

Ah, good idea, I'll give it a try.

Incidentally, about making satisfies? faster: I had a look at the jsPerf benchmark linked to from CLJS-246 – here's a direct link: http://jsperf.com/direct-vs-chain/8 – and it seems like JS engine perf profiles may be shifting. In Chrome 24 "property" and "property many interfaces" are about equally fast and somewhat faster than "bit" & "bit m.i.". Chrome 21 saw no difference, recent versions of Firefox seem to be similar to Chrome w.r.t. relative results, Opera seems to prefer bit, but only slightly, Safari has a slightly stronger preference for bit than that.

Comment by David Nolen [ 25/Apr/13 7:37 AM ]

Interesting! I'd be happy to drop the bitmask stuff as it complicates the implementation, though looking at those benchmarks it doesn't appear that it would make much of a difference. It does seem that Safari 6 greatly prefers the bitmask approach. We'd have to do quite a bit more testing before deciding, of course.

Comment by David Nolen [ 25/Apr/13 7:46 AM ]

Reviewing the patch some more, it might be worth inlining chunk-append and chunk via macros. Also, in the chunked seq branch, since we've hinted the result of chunk-first as ^not-native, we should make sure to call the protocol fns directly where possible.
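A hedged sketch of what that inlining might look like (assuming, as in core.cljs, that chunk-append is just an .add call on a ChunkBuffer and chunk calls the buffer's .chunk method; the macro versions simply splice that code in at the call site):

```clojure
;; Hypothetical macro counterparts that expand to the underlying
;; method calls, avoiding a function call per element:
(defmacro chunk-append [b x]
  `(.add ~b ~x))

(defmacro chunk [b]
  `(.chunk ~b))
```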

Comment by David Nolen [ 26/Apr/13 4:45 PM ]

Fixed, http://github.com/clojure/clojurescript/commit/44389e2f233ccc26ed5a1c75b1e043054e61b586

Fixing chunked-seq? made the perf difference quite small for regular seqs. We can take further enhancements in other tickets/patches.

Comment by Michał Marczyk [ 26/Apr/13 4:55 PM ]

This is fantastic news, thanks!

Generated at Tue Mar 26 15:31:10 CDT 2019 using JIRA 4.4#649-r158309.