Clojure

range should return something that implements IReduce

Details

  • Type: Enhancement
  • Status: Closed
  • Priority: Major
  • Resolution: Completed
  • Affects Version/s: Release 1.7
  • Fix Version/s: Release 1.7
  • Component/s: None
  • Labels: None
  • Patch: Code and Test
  • Approval: Ok

Description

range should implement IReduce for fast reduction.

Patch: clj-1515-14.patch - Java range with long optimizations

Approach: The clj-1515-14 patch revives the unused clojure.lang.Range class. This class is similar in implementation to ChunkedCons. It differs by lazily constructing the first chunk (this is done in the existing impl by using a LazySeq wrapper) and by caching the next reference like other seq impls. The latter is done because range is frequently used in both chunked and unchunked traversal.

Additionally, 1515-14 contains an optimized version of range (LongRange) for the extremely common case where start, end, and step are all longs. This version uses primitive longs and primitive math and a customized version of ArrayChunk for greater performance.
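For context, a quick REPL sketch of what implementing IReduce buys: reduce goes through the collection's own reduce implementation and honors early termination via reduced. This is illustrative only, not patch internals:

(reduce (fn [acc x]
          (if (>= acc 100)
            (reduced acc)  ; short-circuit the reduction
            (+ acc x)))
        0
        (range 1000000))
;;=> 105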

The special case of (range) is just handled with (iterate inc' 0) (which was further optimized for reduce in CLJ-1603).
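A quick REPL check of that equivalence, using only the forms named above:

(= (take 5 (range)) (take 5 (iterate inc' 0)))
;;=> true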

Alternatives:

  • CLJ-1515-deftype3.patch took the approach of using deftype to create a seqable, reducible entity that was not actually a seq. Based on work for CLJ-1603, it was established that due to historical requirements, this is not viable and the entity returned for range should implement the ISeq and Collection interfaces directly.
  • clj-1515-11.patch uses the approach of a "split" implementation - a LazySeq that has a fast reduce path (see the sketch after this list). This is a minimal patch that provides no perf difference for sequence usage but a moderate improvement for reduce paths. One important difference vs clj-1515-12 is that the fast reduce path is only available on the initial range; once you walk off the head, performance is the same as before (seq perf).
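To make the "split" idea concrete, here is a rough sketch (not the actual patch code; it assumes an ascending range with a positive step) of a seqable value whose head also knows how to reduce itself without realizing a lazy seq:

(defn reducible-range [start end step]
  (reify
    clojure.lang.Seqable
    (seq [_] (seq (range start end step)))   ; seq path: unchanged lazy behavior
    clojure.lang.IReduceInit
    (reduce [_ f init]                       ; fast path: a plain loop, no seq allocation
      (loop [acc init, i start]
        (if (< i end)
          (let [acc (f acc i)]
            (if (reduced? acc)
              @acc
              (recur acc (+ i step))))
          acc)))))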

Performance

Timings: criterium quick-bench with Java 1.8.0-b132.

| code | 1.7.0-master | Java clj-1515-14 | split clj-1515-11 |
| (count (range (* 1024 1024))) | 63 ms | 0 ms | 58 ms |
| (reduce + (map inc (range (* 1024 1024)))) | 50 ms | 33 ms | 50 ms |
| (reduce + (map inc (map inc (range (* 1024 1024))))) | 68 ms | 52 ms | 67 ms |
| (count (keep odd? (range (* 1024 1024)))) | 73 ms | 52 ms | 67 ms |
| (transduce (comp (map inc) (map inc)) + (range (* 1024 1024))) | 46 ms | 26 ms | 34 ms |
| (reduce + 0 (range (* 2048 1024))) | 67 ms | 19 ms | 40 ms |
| (reduce + 0 (rest (range (* 2048 1024)))) | 67 ms | 19 ms | 57 ms |
| (doall (range 0 31)) | 1.04 µs | 0.75 µs | 1.15 µs |
| (doall (range 0 32)) | 1.08 µs | 0.76 µs | 1.18 µs |
| (doall (range 0 4096)) | 135 µs | 96 µs | 145 µs |
| (into [] (map inc (range 31))) | 1.81 µs | 1.30 µs | 1.88 µs |
| (into [] (map inc) (range 31)) | 1.71 µs | 0.75 µs | 1.02 µs |
| (into [] (range 128)) | 4.82 µs | 2.20 µs | 3.27 µs |
| (doall (range 1/2 1000 1/3)) | 1.24 ms | 1.24 ms | 1.28 µs |
| (doall (range 0.5 1000 0.33)) | 126 µs | 147 µs | 151 µs |
| (into [] (range 1/2 1000 1/3)) | 1.26 ms | 1.20 ms | 1.26 ms |
| (into [] (range 0.5 1000 0.33)) | 148 µs | 91 µs | 130 µs |
| (count (filter odd? (take (* 1024 1024) (range)))) | 185 ms | 167 ms | 176 ms |
| (transduce (take (* 1024 1024)) + (range)) | 67 ms | 35 ms | 68 ms |

Performance notes:

  • 1515-14 (Java) - significant improvements in virtually all use cases, both seq and reduce. The final two cases with (range) leverage optimizations from CLJ-1603.
  • 1515-11 (split) - a much smaller patch that gives no seq benefits and smaller reduction benefits. Only the head of the range will receive the perf benefits (see (reduce + 0 (rest (range (* 2048 1024))))).
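Any row in the table can be reproduced at a REPL with criterium; a minimal sketch, assuming criterium is on the classpath:

(require '[criterium.core :refer [quick-bench]])
;; reproduces the (reduce + 0 (range (* 2048 1024))) row
(quick-bench (reduce + 0 (range (* 2048 1024))))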

Questions

(range) now supports auto-promotion towards infinity in this patch, which seems to be implied by the doc string but was not actually implemented or tested correctly before, afaict.

Attachments

  1. clj-1515.patch (23 kB, Alex Miller, 09/Dec/14 2:40 AM)
  2. clj-1515-10.patch (21 kB, Alex Miller, 02/Jan/15 11:24 AM)
  3. clj-1515-11.patch (6 kB, Alex Miller, 25/Mar/15 12:44 PM)
  4. clj-1515-12.patch (18 kB, Alex Miller, 25/Mar/15 12:44 PM)
  5. clj-1515-13.patch (18 kB, Alex Miller, 27/Mar/15 3:44 PM)
  6. clj-1515-14.patch (18 kB, Alex Miller, 03/Apr/15 10:28 AM)
  7. clj-1515-2.patch (23 kB, Alex Miller, 10/Dec/14 5:12 PM)
  8. clj-1515-3.patch (22 kB, Alex Miller, 10/Dec/14 11:34 PM)
  9. clj-1515-4.patch (22 kB, Alex Miller, 11/Dec/14 12:11 AM)
  10. clj-1515-5.patch (22 kB, Alex Miller, 11/Dec/14 10:20 AM)
  11. clj-1515-6.patch (22 kB, Alex Miller, 11/Dec/14 11:31 AM)
  12. clj-1515-7.patch (22 kB, Alex Miller, 11/Dec/14 11:49 AM)
  13. clj-1515-8.patch (23 kB, Alex Miller, 11/Dec/14 3:49 PM)
  14. clj-1515-9.patch (22 kB, Alex Miller, 12/Dec/14 2:11 PM)
  15. CLJ-1515-deftype.patch (13 kB, Ghadi Shayban, 18/Jan/15 8:27 PM)
  16. CLJ-1515-deftype2.patch (12 kB, Ghadi Shayban, 22/Jan/15 9:35 AM)
  17. CLJ-1515-deftype3.patch (12 kB, Ghadi Shayban, 23/Jan/15 3:20 PM)
  18. CLJ-1515-deftype-nostructural-dup.patch (12 kB, Ghadi Shayban, 18/Jan/15 11:05 PM)
  19. patch.diff (9 kB, Timothy Baldridge, 29/Aug/14 3:34 PM)
  20. range-patch3.diff (11 kB, Timothy Baldridge, 19/Sep/14 10:02 AM)
  21. reified-range4.diff (16 kB, Timothy Baldridge, 31/Oct/14 11:30 AM)

Activity

Alex Miller added a comment -

1) Not sure about losing chunked seqs - that would make older usage slower, which seems undesirable.
2) RangeIterator.next() needs to throw NoSuchElementException when walking off the end
3) I think Range should implement IReduce instead of relying on support for CollReduce via Iterable.
4) Should let _hash and _hasheq auto-initialize to 0 not set to -1. As is, I think _hasheq always would be -1?
5) _hash and _hasheq should be transient.
6) count could be cached (like hash and hasheq). Not sure if it's worth doing that but seems like a win any time it's called more than once.
7) Why the change in test/clojure/test_clojure/serialization.clj ?
8) Can you squash into a single commit?

Timothy Baldridge added a comment -

1) I agree, adding chunked seqs to this will dramatically increase complexity, are we sure we want this?
2) exception added
3) I can add IReduce, but it'll pretty much just duplicate the code in protocols.clj. If we're sure we want that I'll add it too.
4) fixed hash init values, defaults to -1 like ASeq
5) hash fields are now transient
6) at the cost of about 4 bytes we can cache the cost of a multiplication and an addition, doesn't seem worth it?
7) the tests in serialization.clj assert that the type of the collection roundtrips. This is no longer the case for range which starts as Range and ends as a list. The change I made converts range into a list so that it properly roundtrips. My assumption is that we shouldn't rely on all implementations of ISeq to properly roundtrip through EDN.
8) squashed.

Alex Miller added a comment -

6) might be useful if you're walking through it with nth, which hits count every time, but I doubt that's common
7) yep, reasonable

Andy Fingerhut added a comment - - edited

I have already pointed out to Edipo in personal email the guidelines on what labels to use for Clojure JIRA tickets here: http://dev.clojure.org/display/community/Creating+Tickets

Timothy Baldridge added a comment -

New patch with IReduce directly on Range instead of relying on iterators

Alex Miller added a comment - - edited

The new patch looks good. Could you do a test to determine the perf difference from walking the old chunked seq vs the new version? If the perf diff is negligible, I think we can leave as is.

Another idea: would it make sense to have a specialized RangeLong for the (very common) case where start, end, and step could all be primitive longs? Seems like this could help noticeably.
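For flavor, the kind of primitive long loop such a specialization enables internally; a sketch of the idea, not the patch itself:

;; primitive locals and unchecked long math; boxing deferred until the result
(loop [i 0, acc 0]
  (if (< i 1000000)
    (recur (unchecked-inc i) (unchecked-add acc i))
    acc))
;;=> 499999500000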

Timothy Baldridge added a comment -

Looks like chunked seqs do make lazy seq code about 5x faster in these tests.

Ghadi Shayban added a comment -

I think penalizing existing code possibly 5x is a hard cost to stomach. Is there another approach where a protocolized range can live outside of core? CLJ-993 has a patch that makes it a reducible source in clojure.core.reducers, but it's coll-reduce not IReduce, and doesn't contain an Iterator. Otherwise we might have to take the chunked seq challenge.

Alex: Re long/float. The old reified Range.java in clojure.lang blindly assumes ints; it would be nice to have a long vs. float version, though I believe the contract of reduce boxes numbers. (Unboxed math can be implemented very nicely as in Prismatic's Hiphip array manipulation library, which takes the long vs. float specialization to the extreme with different namespaces.)

Timothy Baldridge added a comment -

I don't think anyone is suggesting we push unboxed math all the way down through transducers. Instead, this patch contains a lot of calls to Numbers.*; if we were to assume that the start, end, and step params of range are all Longs, then we could remove all of these calls and only box when returning an Object (in .first) or when calling IFn.invoke (inside .reduce).

Alex Miller added a comment -

I agree that 5x slowdown is too much - I don't think we can give up chunked seqs if that's the penalty.

On the long case, I was suggesting what Tim is talking about, in the case of all longs, create a Range that stores long prims and does prim math, but still return boxed objects as necessary. I think the only case worth optimizing is all longs - the permutation of other options gets out of hand quickly.

Ghadi Shayban added a comment -

Tim, I'm not suggesting unboxed math, but the singular fast-path of all-Longs that you and Alex describe. I mistakenly lower-cased Long/Float.

Timothy Baldridge added a comment -

Here's the latest work on this, a few tests fail. If someone wants to take a look at this patch feel free, otherwise I'll continue to work on it as I have time/energy.

Nicola Mometto added a comment -

As discussed with Tim in #clojure, the current patch should not change ArrayChunk's reduce impl, that's an error.

Alex Miller added a comment -

Still a work in progress...

Nicola Mometto added a comment - - edited

Alex, while this is still a work in progress, I see that the change on ArrayChunk#reduce from previous WIP patches not only has not been reverted but has been extended. I don't think the current approach makes sense as ArrayChunk#reduce is not part of the IReduce/IReduceInit contract but of the IChunk contract and changing the behaviour to be IReduce-like in its handling of reduced introduces the burden of having to use preserve-reduced on the reducing function to no apparent benefit.

Given that the preserve-reduced is done on the Clojure side, it seems to me like directly invoking .reduce rather than routing through internal-reduce would be broken, but I haven't tested it.

Alex Miller added a comment -

That's the work-in-progress part - I haven't looked at it yet. I have not extended or done any work re ArrayChunk, just carried through what was on the prior patch. I'll be working on it again tomorrow.

Ghadi Shayban added a comment -

I am impressed and have learned a ton through this exercise.

quick review of clj-1515-2
1) withMeta gives the newly formed object the wrong meta.
2) LongRange/create() is the new 0-arity constructor for range, which sets the 'end' to Double/POSITIVE_INFINITY cast as a long. Current core uses Double/POSITIVE_INFINITY directly. Not sure how many programs rely upon iterating that far, or how they would break.
3) Relatedly, depending on the previous point: Because only all-long arguments receive chunking, the very common case of (range) with no args would be unchunked. Doesn't seem like too much of a stretch to add chunking to the other impl.
4) Though the commented invariants say that Range is never empty, the implementation uses a magic value of _count == 0 to mean not cached, which is surprising to me. hashcodes have the magic value of -1
5) s/instanceof Reduced/RT.isReduced
6) is the overflow behavior of "int count()" correct?

Alex Miller added a comment -

1) agreed!
2) Good point. I am definitely changing behavior on this (max of 9223372036854775807). I will look at whether this can be handled without affecting perf. Really, handling an infinite end point is not compatible with several things in LongRange.
3) I actually did implement chunking for the general Range and found it was slower (the original Clojure chunking is faster). LongRange is making up for that difference with improved primitive numerics.
4) Since empty is invalid, 0 and -1 are equally invalid. But I agree -1 conveys the intent better.
5) agreed
6) probably not. ties into 2/3.

Thanks for this, will address.
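(For point 2: Java's narrowing cast clamps a positive-infinity double to Long/MAX_VALUE, which is where that number comes from. A REPL check using the unchecked cast:

(unchecked-long Double/POSITIVE_INFINITY)
;;=> 9223372036854775807
(= (unchecked-long Double/POSITIVE_INFINITY) Long/MAX_VALUE)
;;=> true
)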

Alex Miller added a comment -

Added -4 patch that addresses 1,4,5 but not the (range) stuff.

Alex Miller added a comment -

Latest -7 patch addresses all feedback and perf #s updated.

Ghadi Shayban added a comment -

See CLJ-1572, I believe CollReduce needs to be extended directly to clojure.lang.{LongRange,Range} inside protocols.clj

Ghadi Shayban added a comment -

Seems like a missing benchmark is (range 0 31) without the doall. Current patch (-9) allocates the chunk buffer at range's construction time. Maybe this can be delayed?

Alex Miller added a comment -

Ghadi - you are right. I reworked the patch (new -10 version) so that the chunk is only created on demand. Basically, the chunking is only used when traversing via chunked seqs and in normal seq iteration or reduce, no explicit chunks are created. That improved several timings. I also added the bench for (range 0 31) which was greatly improved by lazily creating the chunk, so good point on that.

Michael Blume added a comment -

I'm looking at the implementation of equiv() and wondering if it's worth checking whether the other object is also a reified range and comparing the private parameters rather than iterating through the sequences.

Ghadi Shayban added a comment -

Attaching a simplified implementation as a deftype. I benchmarked a billion variants and dumped bytecode. I'm attaching the best performing combination, and it beats out the Java one in most cases.

Approach:
Two deftypes, one specialized to primitive longs, the other generic math.
Implement: Seqable, Counted, IReduce, IReduceInit, Iterable.
Also, the marker interfaces Sequential and Serializable.

The implementations are very straightforward, but Seqable deserves mention.
'seq' delegates its implementation to the existing lazy-seq based range, which has been stripped of the strange corner cases and now only handles strictly ascending/descending ranges. It has been renamed range*, rather than range1 because all the arguments to range are already numbers. core references to range have been accordingly renamed.

The new range constructor is loaded after reduce & protocols load. It checks types, and delegates to either long-range or generic-range, which handle the wacky argument cases that are not strictly ascending or descending. I'm annoyed with the structural duplication of those conditionals, not sure how to solve it.
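In rough outline, the dispatching constructor described here looks something like the following sketch (the name new-range is illustrative; long-range and generic-range are the delegates named above):

(defn new-range [start end step]
  (if (and (instance? Long start) (instance? Long end) (instance? Long step))
    (long-range start end step)      ; primitive-long specialized deftype
    (generic-range start end step))) ; generic-math deftype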

Inside the LongRange implementation of IReduce/IReduceInit, boxing is carefully controlled, and was verified through bytecode dumping.

I elaborated the benchmarks for comparison, and also included benchmarking without type specialization.

You discover strange things working on this stuff. Turns out having the comparator close over the 'end' field is beneficial: #(< % end).

Alex, I figured out why the Java versions had an almost exactly 2x regression on (doall (range 0 31)), and I attached a change to 'dorun'. Instead of calling (seq coll) then (next coll), it effectively calls (next (seq coll)). I think the original implementation assumes that calling 'seq' is what evaluates the thunk inside a LazySeq. This should also help other Seqable impls like Cycle/Iterate/Repeat (CLJ-1603).
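The dorun change described above amounts to something like this sketch (not the exact patch):

(defn dorun-sketch [coll]
  ;; realize the seq once, then walk with next on the realized seq,
  ;; so a Seqable source only builds its seq representation one time
  (loop [s (seq coll)]
    (when s
      (recur (next s)))))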

Results (criterium full runs):

| code | 1.7.0-alpha5 | clj-1515-10.patch (Java) | deftype specialized (attached) | deftype unspecialized |
| (count (filter odd? (take (* 1024 1024) (range)))) | 187.30 ms | 194.92 ms | 182.53 ms | 184.13 ms |
| (transduce (take (* 1024 1024)) + (range)) | 58.27 ms | 89.37 ms | 84.69 ms | 84.72 ms |
| (count (range (* 1024 1024))) | 62.34 ms | 27.11 ns | not run | not run |
| (reduce + (map inc (range (* 1024 1024)))) | 57.63 ms | 46.14 ms | 46.17 ms | 52.94 ms |
| (reduce + (map inc (map inc (range (* 1024 1024))))) | 82.83 ms | 68.66 ms | 64.36 ms | 71.76 ms |
| (count (keep odd? (range (* 1024 1024)))) | 76.09 ms | 57.59 ms | not run | not run |
| (transduce (comp (map inc) (map inc)) + (range (* 1024 1024))) | 52.17 ms | 39.71 ms | 28.83 ms | 42.23 ms |
| (reduce + 0 (range (* 2048 1024))) | 85.93 ms | 38.03 ms | 26.49 ms | 42.43 ms |
| (doall (range 0 31)) | 1.33 µs | 2.89 µs | 1.03 µs | 1.08 µs |
| (doall (range 0 32)) | 1.35 µs | 2.97 µs | 1.07 µs | 1.10 µs |
| (doall (range 0 128)) | 5.27 µs | 11.93 µs | 4.19 µs | 4.29 µs |
| (doall (range 0 512)) | 21.66 µs | 47.33 µs | 16.95 µs | 17.66 µs |
| (doall (range 0 4096)) | 171.30 µs | 378.52 µs | 135.45 µs | 140.27 µs |
| (into [] (map inc (range 31))) | 1.97 µs | 1.57 µs | 1.87 µs | 1.95 µs |
| (into [] (map inc) (range 31)) | 1.66 µs | 824.67 ns | 891.90 ns | 1.11 µs |
| (into [] (range 128)) | 5.11 µs | 2.21 µs | 2.43 µs | 3.36 µs |
| (doall (range 1/2 1000 1/3)) | 1.53 ms | 1.67 ms | 1.59 ms | 1.52 ms |
| (doall (range 0.5 1000 0.33)) | 164.83 µs | 382.38 µs | 149.38 µs | 141.21 µs |
| (into [] (range 1/2 1000 1/3)) | 1.53 ms | 1.40 ms | 1.43 ms | 1.50 ms |
| (into [] (range 0.5 1000 0.33)) | 157.44 µs | 108.27 µs | 104.18 µs | 127.20 µs |

Open Questions for screeners of the deftype patch:

1) What to do about the autopromotion at the Long/MAX_VALUE boundary? I've preserved the current behavior of Clojure 1.6.
2) Alex, I did not pull forward the filter chunk tweak you discovered
3) Is the structural duplication of the conditionals in generic-range long-range awful?
4) Are there any other missing interfaces? IMeta comes to mind. Not sure about IHashEq either.

Ghadi Shayban added a comment -

Note: with the deftype patch, the transduce over infinite (range) case is the only one where 'master' currently performs best. This is because the implementation was changed to (iterate inc' 0). When CLJ-1603 is applied, that situation should improve.

Ghadi Shayban added a comment -

fixup patch CLJ-1515-deftype

Ghadi Shayban added a comment - - edited

New patch CLJ-1515-deftype-nostructural-dup.patch with the silly conditional duplication removed. Updated benchmarks including 'count' impls. These benchmarks were run on a different machine with the same harness. Blue ribbon.

| code | 1.7.0-alpha5 | 1.7.0-rangejava | 1.7.0-rangespecial | rangespecial / alpha5 |
| (count (filter odd? (take (* 1024 1024) (range)))) | 297.14 ms | 333.93 ms | 328.62 ms | 1.10 |
| (transduce (take (* 1024 1024)) + (range)) | 105.55 ms | 145.44 ms | 164.70 ms | 1.56 |
| (count (range (* 1024 1024))) | 108.92 ms | 61.09 ns | 26.61 ns | 0 |
| (reduce + (map inc (range (* 1024 1024)))) | 97.67 ms | 95.41 ms | 84.62 ms | 0.86 |
| (reduce + (map inc (map inc (range (* 1024 1024))))) | 140.21 ms | 135.59 ms | 116.38 ms | 0.83 |
| (count (keep odd? (range (* 1024 1024)))) | 121.18 ms | 104.63 ms | 111.46 ms | 0.92 |
| (transduce (comp (map inc) (map inc)) + (range (* 1024 1024))) | 100.40 ms | 86.28 ms | 67.17 ms | 0.67 |
| (reduce + 0 (range (* 2048 1024))) | 131.77 ms | 80.43 ms | 63.24 ms | 0.48 |
| (doall (range 0 31)) | 2.53 µs | 4.36 µs | 2.24 µs | 0.89 |
| (doall (range 0 32)) | 2.37 µs | 4.00 µs | 1.99 µs | 0.84 |
| (doall (range 0 128)) | 9.20 µs | 14.98 µs | 8.01 µs | 0.87 |
| (doall (range 0 512)) | 37.28 µs | 59.13 µs | 35.16 µs | 0.94 |
| (doall (range 0 4096)) | 331.28 µs | 471.57 µs | 291.76 µs | 0.88 |
| (into [] (map inc (range 31))) | 2.83 µs | 2.79 µs | 2.67 µs | 0.94 |
| (into [] (map inc) (range 31)) | 2.21 µs | 1.39 µs | 1.26 µs | 0.57 |
| (into [] (range 128)) | 6.72 µs | 3.25 µs | 3.09 µs | 0.46 |
| (doall (range 1/2 1000 1/3)) | 3.41 ms | 4.04 ms | 3.14 ms | 0.92 |
| (doall (range 0.5 1000 0.33)) | 281.04 µs | 530.92 µs | 244.14 µs | 0.87 |
| (into [] (range 1/2 1000 1/3)) | 3.32 ms | 3.71 ms | 2.99 ms | 0.90 |
| (into [] (range 0.5 1000 0.33)) | 215.53 µs | 165.93 µs | 138.86 µs | 0.64 |
Alex Miller added a comment -

This is looking very good and I think we should move forward with it as the preferred approach. Feel free to update the description appropriately. I'll file a separate ticket with the filter tweak. Some comments on the patch:

1) GenericRange/count - this math is broken if you start to mix infinity in there. I think just (count (seq this)) is safer.
2) GenericRange/iterable - I think if we are IReduceInit and Seqable, we can omit this. I had it in the Java one because I inherited Iterable via ASeq but that's not an issue in this impl.
3) LongRange/reduce - why Long/valueOf? Isn't start a long field?
4) LongRange/iterable - ditto #2
5) print-method - anytime I see print-method, there should probably be print-dup too. For example, this is broken: (binding [*print-dup* true] (println (range 0 10)))
6) Is serialization broken by this patch? Can you justify the test changes?

Ghadi Shayban added a comment -

regarding Serialization tests currently:

(defn roundtrip
  [v]
  (let [rt (-> v serialize deserialize)
        rt-seq (-> v seq serialize deserialize)]        ;; this
    (and (= v rt)            ;; this fails because the test ^ calls seq first
      (= (seq v) (seq rt))   ;; this passes
      (= (seq v) rt-seq))))  ;; this passes

These new types are merely seqable, so (not= (LongRange. 0 10 1) (seq (LongRange. 0 10 1)))

Not sure how to handle this 100%. Nothing precludes the LongRange itself from roundtripping, but just that calling seq on it returns a separate object.

Alex Miller added a comment -

More comments...

1) Instead of extending IReduce and IReduceInit, just extend IReduce and implement both arities (IReduce extends IReduceInit).
2) I'm slightly troubled by the .invokePrim now. Did you look at a macro that does prim type-hinting maybe?
3) On the serialization thing, is the problem really with serialization or with equality? The test that's failing is (= v rt). Because these are not IPersistentCollections, pcequiv won't be used and it's just calling LongRange.equals(), which is not implemented and falls back to identity, right? Probably need to implement the hashCode and equals stuff. Might need IHashEq too.

user=> (= (range 5) (range 5))
false
Ghadi Shayban added a comment -

Took care of 1) and 3). Punting on hiding the invokePrim behind a macro. It may be shameful but it works really fast.

Another case found:
(assoc {'(0 1 2 3 4) :foo} (range 5) :bar) needs to properly overwrite keys in the map.

Cause: Might be irrelevant in the face of CLJ-1375. Util/equiv for maps doesn't use IPC/equiv unless the collection is also java.util.Collection or Map. If it is then it checks for IPersistentCollection. I added a check for IPersistentCollection first.

https://github.com/ghadishayban/clojure/commits/for-screening

To get hasheq working, I added back the iterators for use by Murmur3/hashOrdered.
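With ordered hashing in place, an equal list and range should hash alike; a REPL check of the intended behavior:

(= (range 5) '(0 1 2 3 4))                 ;=> true
(= (hash (range 5)) (hash '(0 1 2 3 4)))   ;=> true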

Ghadi Shayban added a comment -

Handle hash and equality not through IPersistentCollection. w/ test-cases too.

Alex Miller added a comment -

Ghadi, we need to have range implement IObj too so with-meta works.

Ghadi Shayban added a comment -

Will add pronto, but maybe someone can clarify something: adding a simple clojure.lang.IObj/withMeta to the deftype results in an AbstractMethodError when trying to print a range. This is because the impl of vary-meta used in the default printer assumes that all IObjs are also IMetas. Seems like a problem with either vary-meta's impl or the interface separation.

I can fix by:
1) adding a _meta field to Ranges and handling IMeta as well as IObj.

;; vary-meta:
(with-meta obj (apply f (meta obj) args)))

;; default printer
(defmethod print-method :default [o, ^Writer w]
  (if (instance? clojure.lang.IObj o)
    (print-method (vary-meta o #(dissoc % :type)) w)
    (print-simple o w)))
Ghadi Shayban added a comment -

Ugh, never mind that last bit; I fell out of the hammock prematurely. IMeta << IObj.

Alex, CLJ-1603 needs IMeta/meta too.

Alex Miller added a comment -

Current direction is pending the outcome of CLJ-1603.

Fogus added a comment -

Screened. Though the amount of duplicated code saddens me, I don't hold it against the implementer. It's a fairly straightforward counting impl with a chunk cache.

Alex Miller added a comment -

New patch -13 removes the array allocation and performs better, still need to update timings and consider the non-long case.

Alex Miller added a comment -

The -14 patch just switches to leverage Repeat now that 1603 has been applied.

Fogus added a comment -

The latest patch (-14) handles the extraneous allocations from -12. All tests run and timings verified.

