Clojure

Speedup mapv for multiple collections using iterators

Details

  • Type: Enhancement Enhancement
  • Status: Open Open
  • Priority: Minor Minor
  • Resolution: Unresolved
  • Affects Version/s: Release 1.9
  • Fix Version/s: None
  • Component/s: None
  • Labels:

Description

mapv for multiple collections currently does:

(into [] (map f c1 c2 c3))

I propose to change these to:

(let [it1 (clojure.lang.RT/iter c1)
      it2 (clojure.lang.RT/iter c2)]
     (loop [out (transient [])]
       (if (and (.hasNext it1) (.hasNext it2))
         (recur (conj! out (f (.next it1) (.next it2))))
         (persistent! out))))

This is 5x faster.

For variadic arity we can check if colls is counted?:

  • If yes: Use MultiIterator
  • If no: Forward to map like the current implementation does.

Any thoughts on this or if this could brake something?

Activity

Hide
Alex Miller added a comment -

Iterators should be treated as dangerous and unsafe by default - they are mutable, stateful, (potentially) non-thread-safe objects. Thus, their use should be carefully scrutinized and in particular you don't want them to a) leak out of a localized context (so they should be eagerly walked and released) or b) be used in a way that triggers the execution of impure functions such that they would be reexecuted later. Sequences cache their realization (and are thread-safe) which makes them slower, but much safer in avoiding this problem.

In this case, you're eagerly producing a concrete collection (not a lazy seq) so the first concern is addressed. However, the second is a little harder to reason through. I think it might be ok, but would have to think about that more.

Note that into uses transduce which uses CollReduce, which will do an iterReduce on Iterable colls, so you could probably get the same effect by simply using the transducer form:

(into [] (map f) c1)

Except that this form only takes a single coll. Multi coll support for transducers exists (see TransformerIterator/createMulti) but is not well-surfaced in the current apis. Seems like leveraging it directly would be vastly preferable to the suggestion above though:

(defn mapv2 [f & cs]
  (let [iters (map #(.iterator %) cs)
        t (clojure.lang.TransformerIterator/createMulti (map f) iters)]
    (loop [out (transient [])]
      (if (.hasNext t)
        (recur (conj! out (.next t)))
        (persistent! out)))))

Note that this assumes all cs are Iterable, which is not a valid assumption. For example: (mapv identity "abc"). So this would still need some work on fallbacks and perf testing.

Show
Alex Miller added a comment - Iterators should be treated as dangerous and unsafe by default - they are mutable, stateful, (potentially) non-thread-safe objects. Thus, their use should be carefully scrutinized and in particular you don't want them to a) leak out of a localized context (so they should be eagerly walked and released) or b) be used in a way that triggers the execution of impure functions such that they would be reexecuted later. Sequences cache their realization (and are thread-safe) which makes them slower, but much safer in avoiding this problem. In this case, you're eagerly producing a concrete collection (not a lazy seq) so the first concern is addressed. However, the second is a little harder to reason through. I think it might be ok, but would have to think about that more. Note that into uses transduce which uses CollReduce, which will do an iterReduce on Iterable colls, so you could probably get the same effect by simply using the transducer form:
(into [] (map f) c1)
Except that this form only takes a single coll. Multi coll support for transducers exists (see TransformerIterator/createMulti) but is not well-surfaced in the current apis. Seems like leveraging it directly would be vastly preferable to the suggestion above though:
(defn mapv2 [f & cs]
  (let [iters (map #(.iterator %) cs)
        t (clojure.lang.TransformerIterator/createMulti (map f) iters)]
    (loop [out (transient [])]
      (if (.hasNext t)
        (recur (conj! out (.next t)))
        (persistent! out)))))
Note that this assumes all cs are Iterable, which is not a valid assumption. For example: (mapv identity "abc"). So this would still need some work on fallbacks and perf testing.
Hide
A. R added a comment -

Alex: I actually did use exactly this very implementation but found it to be quite a bit slower than manually keeping track of the iterators. My guess is it's because of xf.applyTo inside the step function of TransformIterator. We already know source.size() when creating it in createMulti. So this implementation could probably be made smarter by directly invoking the right arity.
Though, I haven't benchmarked anything. So just a guess for now.

Show
A. R added a comment - Alex: I actually did use exactly this very implementation but found it to be quite a bit slower than manually keeping track of the iterators. My guess is it's because of xf.applyTo inside the step function of TransformIterator. We already know source.size() when creating it in createMulti. So this implementation could probably be made smarter by directly invoking the right arity. Though, I haven't benchmarked anything. So just a guess for now.
Hide
Alex Miller added a comment -

Seems like a reasonable guess and something that could be optimized in TransformIterator.

Show
Alex Miller added a comment - Seems like a reasonable guess and something that could be optimized in TransformIterator.
Hide
A. R added a comment - - edited

I actually thought about this a bit more. I agree that the proper step here would be to properly leverage transducers and provide better support for multiple collections with transducers.

So I'd suggest changing the game plan to:

1. Make into and transduce work with multiple collections
2. Throw out MultiIterator. We can get much better performance doing this inline in TransformIterator, then we avoid the unnecessary step of "build up sequence in MultiIterator" followed by "spread out this just created sequence in apply".

Though, this would make the TransformIterator much more involved.

After this, we could give many transducers the option of working over multiple sequences (keep, filter, map-indexed, take, drop etc etc) and efficiently using them with into and transduce.

If you agree, I can change the title to reflect the new scope.

Show
A. R added a comment - - edited I actually thought about this a bit more. I agree that the proper step here would be to properly leverage transducers and provide better support for multiple collections with transducers. So I'd suggest changing the game plan to: 1. Make into and transduce work with multiple collections 2. Throw out MultiIterator. We can get much better performance doing this inline in TransformIterator, then we avoid the unnecessary step of "build up sequence in MultiIterator" followed by "spread out this just created sequence in apply". Though, this would make the TransformIterator much more involved. After this, we could give many transducers the option of working over multiple sequences (keep, filter, map-indexed, take, drop etc etc) and efficiently using them with into and transduce. If you agree, I can change the title to reflect the new scope.
Hide
Alex Miller added a comment -

You are moving from "enhancing performance of an existing function" into doing api design which has a whole host of larger considerations. I think it's unlikely that we will expand either into or transduce into multi-coll territory and I don't think there are very many transducers that make sense as natively multi-coll. So, I do not think you should expand the scope of this ticket, as I would then decline it.

Show
Alex Miller added a comment - You are moving from "enhancing performance of an existing function" into doing api design which has a whole host of larger considerations. I think it's unlikely that we will expand either into or transduce into multi-coll territory and I don't think there are very many transducers that make sense as natively multi-coll. So, I do not think you should expand the scope of this ticket, as I would then decline it.

People

  • Assignee:
    Unassigned
    Reporter:
    A. R
Vote (0)
Watch (1)

Dates

  • Created:
    Updated: