Clojure

reduce gives a SO for pathological seqs

Details

  • Type: Defect Defect
  • Status: Open Open
  • Priority: Major Major
  • Resolution: Unresolved
  • Affects Version/s: Release 1.5
  • Fix Version/s: None
  • Component/s: None
  • Labels:
    None
  • Environment:
    1.5.1
  • Patch:
    Code and Test

Description

reduce gives a StackOverflowError on long sequences that contain many transitions between chunked and unchunked:

(->> (repeat 50000 (cons :x [:y]))
     (apply concat)
     (reduce (constantly nil)))
;; throws StackOverflowError

Such a sequence is well behaved under most other sequence operations, and its underlying structure can even be masked such that reduce succeeds:

(->> (repeat 50000 (cons :x [:y]))
     (apply concat)
     (take 10000000)
     (reduce (constantly nil)))
;; => nil

I don't think Clojure developers normally worry about mixing chunked and unchunked seqs, so the existence of such a sequence is not at all unreasonable (and indeed this happened to me at work and was very difficult to debug).

It seems obvious what causes this given the implementation of reduce – it bounces back and forth between the chunked impl and the unchunked impl, consuming more and more stack as it goes. Without proper tail call optimization, it's not obvious to me what a good fix would be.

Presumed bad solutions

Degrade to naive impl after first chunk

In the IChunkedSeq implementation, instead of calling internal-reduce when the
sequence stops being chunked, it could have an (inlined?) unoptimized implementation,
ensuring that no further stack space is taken up. This retains the behavior that a
generic seq with a chunked tail will still run in an optimized fashion, but a seq with
two chunked portions would only be optimized the first time.

Use clojure.core/trampoline

This would presumably work, but requires wrapping the normal return values from all
implementations of internal-reduce.

Proposed Solution

(attached as CLJ-1237c.patch)

Similar to using trampoline, but create a special type (Unreduced) that signals
an implementation change. The two implementation-change points in internal-reduce
(in the IChunkedSeq impl and the Object impl) are converted to return an instance
of Unreduced instead of a direct call to internal-reduce.

Then seq-reduce is converted to check for instances of Unreduced before returning,
and recurs if it finds one.

Pros

  • Only requires one additional check in most cases
  • Reduces stack usage for existing heterogeneous reductions that weren't extreme enough to crash
  • Should be compatible with 3rd-party implementations of internal-reduce, which can still use the old style (direct recursive calls to internal-reduce) or the optimized style if desired.

Cons

  • internal-reduce is slightly more complicated
  • There's an extra check at the end of seq-reduce – is that a performance worry?

Activity

Gary Fredericks made changes -
Field Original Value New Value
Description {{reduce}} gives a {{StackOverflowError}} on long sequences that contain many transitions between chunked and unchunked:

{code}
(->> (repeat 50000 (cons :x [:y]))
     (apply concat)
     (reduce (constantly nil)))
;; throws StackOverflowError
{code}

Such a sequence is well behaved under most other sequence operations, and its underlying structure can even be masked such that {{reduce}} succeeds:

{code}
(->> (repeat 50000 (cons :x [:y]))
     (apply concat)
     (take 10000000)
     (reduce (constantly nil)))
;; => nil
{code}

I don't think Clojure developers normally worry about mixing chunked and unchunked seqs, so the existence of such a sequence is not at all unreasonable (and indeed this happened to me at work and was very difficult to debug).

It seems obvious what causes this given the implementation of reduce -- it bounces back and forth between the chunked impl and the unchunked impl, consuming more and more stack as it goes. Without proper tail call optimization, it's not obvious to me what a good fix would be.
{{reduce}} gives a {{StackOverflowError}} on long sequences that contain many transitions between chunked and unchunked:

{code}
(->> (repeat 50000 (cons :x [:y]))
     (apply concat)
     (reduce (constantly nil)))
;; throws StackOverflowError
{code}

Such a sequence is well behaved under most other sequence operations, and its underlying structure can even be masked such that {{reduce}} succeeds:

{code}
(->> (repeat 50000 (cons :x [:y]))
     (apply concat)
     (take 10000000)
     (reduce (constantly nil)))
;; => nil
{code}

I don't think Clojure developers normally worry about mixing chunked and unchunked seqs, so the existence of such a sequence is not at all unreasonable (and indeed this happened to me at work and was very difficult to debug).

It seems obvious what causes this given the implementation of reduce -- it bounces back and forth between the chunked impl and the unchunked impl, consuming more and more stack as it goes. Without proper tail call optimization, it's not obvious to me what a good fix would be.

h2. Presumed bad solutions

h3. Degrade to naive impl after first chunk

In the {{IChunkedSeq}} implementation, instead of calling {{internal-reduce}} when the
sequence stops being chunked, it could have an (inlined?) unoptimized implementation,
ensuring that no further stack space is taken up. This retains the behavior that a
generic seq with a chunked tail will still run in an optimized fashion, but a seq with
two chunked portions would only be optimized the first time.

h3. Use clojure.core/trampoline

This would presumably work, but requires wrapping the normal return values from all
implementations of {{internal-reduce}}.

h2. Proposed Solution

Similar to using {{trampoline}}, but create a special type ({{Unreduced}}) that signals
an implementation change. The two implementation-change points in {{internal-reduce}}
(in the {{IChunkedSeq}} impl and the {{Object}} impl) are converted to return an instance
of {{Unreduced}} instead of a direct call to {{internal-reduce}}.

Then {{seq-reduce}} is converted to check for instances of {{Unreduced}} before returning,
and recurs if it finds one.

h3. Pros

- Only requires one additional check in most cases
- Reduces stack usage for existing heterogeneous reductions that weren't extreme enough to crash
- Should be compatible with 3rd-party implementations of {{internal-reduce}}, which can still use the old style (direct recursive calls to {{internal-reduce}}) or the optimized style if desired.

h3. Cons

- {{internal-reduce}} is slightly more complicated
- There's an extra check at the end of {{seq-reduce}} -- is that a performance worry?
Gary Fredericks made changes -
Attachment CLJ-1237c.patch [ 12204 ]
Gary Fredericks made changes -
Description {{reduce}} gives a {{StackOverflowError}} on long sequences that contain many transitions between chunked and unchunked:

{code}
(->> (repeat 50000 (cons :x [:y]))
     (apply concat)
     (reduce (constantly nil)))
;; throws StackOverflowError
{code}

Such a sequence is well behaved under most other sequence operations, and its underlying structure can even be masked such that {{reduce}} succeeds:

{code}
(->> (repeat 50000 (cons :x [:y]))
     (apply concat)
     (take 10000000)
     (reduce (constantly nil)))
;; => nil
{code}

I don't think Clojure developers normally worry about mixing chunked and unchunked seqs, so the existence of such a sequence is not at all unreasonable (and indeed this happened to me at work and was very difficult to debug).

It seems obvious what causes this given the implementation of reduce -- it bounces back and forth between the chunked impl and the unchunked impl, consuming more and more stack as it goes. Without proper tail call optimization, it's not obvious to me what a good fix would be.

h2. Presumed bad solutions

h3. Degrade to naive impl after first chunk

In the {{IChunkedSeq}} implementation, instead of calling {{internal-reduce}} when the
sequence stops being chunked, it could have an (inlined?) unoptimized implementation,
ensuring that no further stack space is taken up. This retains the behavior that a
generic seq with a chunked tail will still run in an optimized fashion, but a seq with
two chunked portions would only be optimized the first time.

h3. Use clojure.core/trampoline

This would presumably work, but requires wrapping the normal return values from all
implementations of {{internal-reduce}}.

h2. Proposed Solution

Similar to using {{trampoline}}, but create a special type ({{Unreduced}}) that signals
an implementation change. The two implementation-change points in {{internal-reduce}}
(in the {{IChunkedSeq}} impl and the {{Object}} impl) are converted to return an instance
of {{Unreduced}} instead of a direct call to {{internal-reduce}}.

Then {{seq-reduce}} is converted to check for instances of {{Unreduced}} before returning,
and recurs if it finds one.

h3. Pros

- Only requires one additional check in most cases
- Reduces stack usage for existing heterogeneous reductions that weren't extreme enough to crash
- Should be compatible with 3rd-party implementations of {{internal-reduce}}, which can still use the old style (direct recursive calls to {{internal-reduce}}) or the optimized style if desired.

h3. Cons

- {{internal-reduce}} is slightly more complicated
- There's an extra check at the end of {{seq-reduce}} -- is that a performance worry?
{{reduce}} gives a {{StackOverflowError}} on long sequences that contain many transitions between chunked and unchunked:

{code}
(->> (repeat 50000 (cons :x [:y]))
     (apply concat)
     (reduce (constantly nil)))
;; throws StackOverflowError
{code}

Such a sequence is well behaved under most other sequence operations, and its underlying structure can even be masked such that {{reduce}} succeeds:

{code}
(->> (repeat 50000 (cons :x [:y]))
     (apply concat)
     (take 10000000)
     (reduce (constantly nil)))
;; => nil
{code}

I don't think Clojure developers normally worry about mixing chunked and unchunked seqs, so the existence of such a sequence is not at all unreasonable (and indeed this happened to me at work and was very difficult to debug).

It seems obvious what causes this given the implementation of reduce -- it bounces back and forth between the chunked impl and the unchunked impl, consuming more and more stack as it goes. Without proper tail call optimization, it's not obvious to me what a good fix would be.

h2. Presumed bad solutions

h3. Degrade to naive impl after first chunk

In the {{IChunkedSeq}} implementation, instead of calling {{internal-reduce}} when the
sequence stops being chunked, it could have an (inlined?) unoptimized implementation,
ensuring that no further stack space is taken up. This retains the behavior that a
generic seq with a chunked tail will still run in an optimized fashion, but a seq with
two chunked portions would only be optimized the first time.

h3. Use clojure.core/trampoline

This would presumably work, but requires wrapping the normal return values from all
implementations of {{internal-reduce}}.

h2. Proposed Solution

(attached as CLJ-1237c.patch)

Similar to using {{trampoline}}, but create a special type ({{Unreduced}}) that signals
an implementation change. The two implementation-change points in {{internal-reduce}}
(in the {{IChunkedSeq}} impl and the {{Object}} impl) are converted to return an instance
of {{Unreduced}} instead of a direct call to {{internal-reduce}}.

Then {{seq-reduce}} is converted to check for instances of {{Unreduced}} before returning,
and recurs if it finds one.

h3. Pros

- Only requires one additional check in most cases
- Reduces stack usage for existing heterogeneous reductions that weren't extreme enough to crash
- Should be compatible with 3rd-party implementations of {{internal-reduce}}, which can still use the old style (direct recursive calls to {{internal-reduce}}) or the optimized style if desired.

h3. Cons

- {{internal-reduce}} is slightly more complicated
- There's an extra check at the end of {{seq-reduce}} -- is that a performance worry?
Alex Miller made changes -
Patch Code and Test [ 10002 ]

People

Vote (1)
Watch (4)

Dates

  • Created:
    Updated: