[NREPL-12] Intermittent test failures for stdin Created: 26/Feb/12 Updated: 10/Jun/12 Resolved: 10/Jun/12
|Reporter:||Colin Jones||Assignee:||Chas Emerick|
Failures like these are happening sometimes in the test suite, mainly around stdin:
FAIL in (request-multiple-read-objects-in) (nrepl_test.clj:211)
FAIL in (request-multiple-read-objects-in) (nrepl_test.clj:217)
I've done a ton of debugging and thinking about this, and haven't gotten far. A few theories I had that seemed decent at the time, but ultimately haven't gotten rid of these occasional failures:
The debugging I've been doing (print statements galore) suggests that when this problem happens, the client does send back the proper :stdin, but it doesn't get picked up by the add-stdin handler in time to write to the :stdin-writer and avoid the expiring timeout.
I'm pretty much stumped at this point, unfortunately.
|Comment by Chas Emerick [ 29/Mar/12 4:18 PM ]|
This is becoming a serious problem; the build fails half the time on build.clojure.org, making it incredibly frustrating to perform releases, etc.
Any new ideas / evaluation of this is most welcome.
|Comment by Colin Jones [ 15/Apr/12 10:24 PM ]|
My current vague suspicion is that it's a CPU contention thing between the server and client.
It's kind of hard for me to reason about the execution of the response-values lazy seq, and where the blocking actually happens. And of course adding logging changes the behavior sometimes. I wouldn't be opposed to adding sleeps in there, but I expect those kinds of things would still be flaky later, on build machines in particular :/
Incidentally, bumping the client timeout up to Long.MAX_VALUE helps a bit the stdin failures, but doesn't solve them. It does create new ones: session-lifecycle, unknown-op, and read-timeout block forever trying to reduce across the whole seq in combine-responses. It was a bit surprising to me to see that reduce in the response-values pipeline, and I pursued that at one point as well, but it hasn't gotten me anywhere. Decreasing the timeout helps to reproduce the failures.
I got this thread dump from a blocking-forever stdin test (run via `lein test clojure.tools.nrepl-test`): https://refheap.com/paste/2141.
|Comment by Colin Jones [ 03/Jun/12 5:07 AM ]|
With this patch, I was able to run clojure.tools.nrepl-test 100 times successfully. Previous attempts would generally only get as far as 15 or so before failing for one reason or another.
It also gets rid of a race condition with my skip-if-eol stuff that was addressing multiple input reads. Lastly, it avoids the need for the 10-second timeout, so the tests are faster now as well.
|Comment by Chas Emerick [ 09/Jun/12 10:57 AM ]|
I'm psyched that you may have found the solution to these problems. I've applied the patch locally, and it tests well. I want to eyeball it for a minute (probably tomorrow) before moving on.
|Comment by Chas Emerick [ 10/Jun/12 4:40 PM ]|
Pushed, will be in the next beta release. Thanks so much!