tools.nrepl

Intermittent test failures for stdin

Details

  • Type: Defect Defect
  • Status: Closed Closed
  • Priority: Critical Critical
  • Resolution: Completed
  • Affects Version/s: None
  • Fix Version/s: 0.2.0
  • Component/s: None
  • Labels:
    None

Description

Failures like these are happening sometimes in the test suite, mainly around stdin:

FAIL in (request-multiple-read-objects-in) (nrepl_test.clj:211)
expected: (= (quote (:ohai)) (response-values (for [resp (repl-eval session "(read)")] (do (when (-> resp :status set (contains? "need-input")) (session {:op :stdin, :stdin ":ohai :kthxbai\n"})) resp))))
actual: (not (= (:ohai) nil))

FAIL in (request-multiple-read-objects-in) (nrepl_test.clj:217)
expected: (= [" :kthxbai"] (repl-values session "(read-line)"))
actual: (not (= [" :kthxbai"] nil))

I've done a ton of debugging and thinking about this, and haven't gotten far. A few theories I had that seemed decent at the time, but ultimately haven't gotten rid of these occasional failures:

The debugging I've been doing (print statements galore) suggests that when this problem happens, the client does send back the proper :stdin, but it doesn't get picked up by the add-stdin handler in time to write to the :stdin-writer and avoid the expiring timeout.

I'm pretty much stumped at this point, unfortunately.

Activity

Hide
Chas Emerick added a comment -

This is becoming a serious problem; the build fails half the time on build.clojure.org, making it incredibly frustrating to perform releases, etc.

Any new ideas / evaluation of this is most welcome.

Show
Chas Emerick added a comment - This is becoming a serious problem; the build fails half the time on build.clojure.org, making it incredibly frustrating to perform releases, etc. Any new ideas / evaluation of this is most welcome.
Hide
Colin Jones added a comment -

My current vague suspicion is that it's a CPU contention thing between the server and client.

It's kind of hard for me to reason about the execution of the response-values lazy seq, and where the blocking actually happens. And of course adding logging changes the behavior sometimes. I wouldn't be opposed to adding sleeps in there, but I expect those kinds of things would still be flaky later, on build machines in particular :/

Incidentally, bumping the client timeout up to Long.MAX_VALUE helps a bit the stdin failures, but doesn't solve them. It does create new ones: session-lifecycle, unknown-op, and read-timeout block forever trying to reduce across the whole seq in combine-responses. It was a bit surprising to me to see that reduce in the response-values pipeline, and I pursued that at one point as well, but it hasn't gotten me anywhere. Decreasing the timeout helps to reproduce the failures.

I got this thread dump from a blocking-forever stdin test (run via `lein test clojure.tools.nrepl-test`): https://refheap.com/paste/2141.

Show
Colin Jones added a comment - My current vague suspicion is that it's a CPU contention thing between the server and client. It's kind of hard for me to reason about the execution of the response-values lazy seq, and where the blocking actually happens. And of course adding logging changes the behavior sometimes. I wouldn't be opposed to adding sleeps in there, but I expect those kinds of things would still be flaky later, on build machines in particular :/ Incidentally, bumping the client timeout up to Long.MAX_VALUE helps a bit the stdin failures, but doesn't solve them. It does create new ones: session-lifecycle, unknown-op, and read-timeout block forever trying to reduce across the whole seq in combine-responses. It was a bit surprising to me to see that reduce in the response-values pipeline, and I pursued that at one point as well, but it hasn't gotten me anywhere. Decreasing the timeout helps to reproduce the failures. I got this thread dump from a blocking-forever stdin test (run via `lein test clojure.tools.nrepl-test`): https://refheap.com/paste/2141.
Hide
Colin Jones added a comment -

With this patch, I was able to run clojure.tools.nrepl-test 100 times successfully. Previous attempts would generally only get as far as 15 or so before failing for one reason or another.

It also gets rid of a race condition with my skip-if-eol stuff that was addressing multiple input reads. Lastly, it avoids the need for the 10-second timeout, so the tests are faster now as well.

Show
Colin Jones added a comment - With this patch, I was able to run clojure.tools.nrepl-test 100 times successfully. Previous attempts would generally only get as far as 15 or so before failing for one reason or another. It also gets rid of a race condition with my skip-if-eol stuff that was addressing multiple input reads. Lastly, it avoids the need for the 10-second timeout, so the tests are faster now as well.
Hide
Chas Emerick added a comment -

I'm psyched that you may have found the solution to these problems. I've applied the patch locally, and it tests well. I want to eyeball it for a minute (probably tomorrow) before moving on.

Show
Chas Emerick added a comment - I'm psyched that you may have found the solution to these problems. I've applied the patch locally, and it tests well. I want to eyeball it for a minute (probably tomorrow) before moving on.
Hide
Chas Emerick added a comment -

Pushed, will be in the next beta release. Thanks so much!

Show
Chas Emerick added a comment - Pushed, will be in the next beta release. Thanks so much!

People

Vote (0)
Watch (0)

Dates

  • Created:
    Updated:
    Resolved: