<< Back to previous view

[NREPL-12] Intermittent test failures for stdin Created: 26/Feb/12  Updated: 10/Jun/12  Resolved: 10/Jun/12

Status: Closed
Project: tools.nrepl
Component/s: None
Affects Version/s: None
Fix Version/s: 0.2.0

Type: Defect Priority: Critical
Reporter: Colin Jones Assignee: Chas Emerick
Resolution: Completed Votes: 0
Labels: None

Attachments: Text File 0001-Improve-timing-issues-in-tests.patch    

 Description   

Failures like these are happening sometimes in the test suite, mainly around stdin:

FAIL in (request-multiple-read-objects-in) (nrepl_test.clj:211)
expected: (= (quote (:ohai)) (response-values (for [resp (repl-eval session "(read)")] (do (when (-> resp :status set (contains? "need-input")) (session {:op :stdin, :stdin ":ohai :kthxbai\n"})) resp))))
actual: (not (= (:ohai) nil))

FAIL in (request-multiple-read-objects-in) (nrepl_test.clj:217)
expected: (= [" :kthxbai"] (repl-values session "(read-line)"))
actual: (not (= [" :kthxbai"] nil))

I've done a ton of debugging and thinking about this, and haven't gotten far. A few theories I had that seemed decent at the time, but ultimately haven't gotten rid of these occasional failures:

The debugging I've been doing (print statements galore) suggests that when this problem happens, the client does send back the proper :stdin, but it doesn't get picked up by the add-stdin handler in time to write to the :stdin-writer and avoid the expiring timeout.

I'm pretty much stumped at this point, unfortunately.



 Comments   
Comment by Chas Emerick [ 29/Mar/12 4:18 PM ]

This is becoming a serious problem; the build fails half the time on build.clojure.org, making it incredibly frustrating to perform releases, etc.

Any new ideas / evaluation of this is most welcome.

Comment by Colin Jones [ 15/Apr/12 10:24 PM ]

My current vague suspicion is that it's a CPU contention thing between the server and client.

It's kind of hard for me to reason about the execution of the response-values lazy seq, and where the blocking actually happens. And of course adding logging changes the behavior sometimes. I wouldn't be opposed to adding sleeps in there, but I expect those kinds of things would still be flaky later, on build machines in particular :/

Incidentally, bumping the client timeout up to Long.MAX_VALUE helps a bit the stdin failures, but doesn't solve them. It does create new ones: session-lifecycle, unknown-op, and read-timeout block forever trying to reduce across the whole seq in combine-responses. It was a bit surprising to me to see that reduce in the response-values pipeline, and I pursued that at one point as well, but it hasn't gotten me anywhere. Decreasing the timeout helps to reproduce the failures.

I got this thread dump from a blocking-forever stdin test (run via `lein test clojure.tools.nrepl-test`): https://refheap.com/paste/2141.

Comment by Colin Jones [ 03/Jun/12 5:07 AM ]

With this patch, I was able to run clojure.tools.nrepl-test 100 times successfully. Previous attempts would generally only get as far as 15 or so before failing for one reason or another.

It also gets rid of a race condition with my skip-if-eol stuff that was addressing multiple input reads. Lastly, it avoids the need for the 10-second timeout, so the tests are faster now as well.

Comment by Chas Emerick [ 09/Jun/12 10:57 AM ]

I'm psyched that you may have found the solution to these problems. I've applied the patch locally, and it tests well. I want to eyeball it for a minute (probably tomorrow) before moving on.

Comment by Chas Emerick [ 10/Jun/12 4:40 PM ]

Pushed, will be in the next beta release. Thanks so much!

Generated at Wed Apr 23 08:25:21 CDT 2014 using JIRA 4.4#649-r158309.