<< Back to previous view

[DGEN-5] `data.generators/reservoir-sample` is biased against early members of the collection Created: 01/Dec/16  Updated: 07/Dec/16

Status: Open
Project: data.generators
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: Thomas Athorne Assignee: Stuart Halloway
Resolution: Unresolved Votes: 0
Labels: bug, patch

Attachments: Text File reservoir-sample.patch    
Patch: Code

 Description   

An off-by-one error in the function `reservoir-sample` means that it can never return the first `ct` members of the collection. This makes it noticeably biased against members at the start of the collection, especially for small values of `ct`.



 Comments   
Comment by Alex Miller [ 06/Dec/16 10:16 AM ]

This example (which returns 7) seems in contradiction to the statement in the description:

user=> (g/reservoir-sample 10 (range 100))
[24 59 2 93 81 90 70 7 22 49]

The patch doesn't really make sense to me either in tandem with the description.

Comment by Thomas Athorne [ 07/Dec/16 6:40 AM ]

Hi Alex,

Sorry, my description was a little terse and 'never return the first
`ct` members of the collection' is misleading---I meant it can never
return exactly those members (ie. your example would never return
`[0 1 2 3 4 5 6 7 8 9]`, though it should have the possibility---very
unlikely of course---of doing that).

The bias I mentioned is statistical so to observe it properly you
should run the function many times---here is the example I've been
using to test this:

(sort-by first (frequencies (into [] cat (repeatedly 10000 #(g/reservoir-sample 5 (range 10))))))
=> ([0 4534] [1 4388] [2 4371] [3 4417] [4 4426] [5 5607] [6 5609] [7 5600] [8 5436] [9 5612])

As you can see, the first half of the vector is under-represented and
the second half is over-represented. A more extreme example is the
fact that the expression

(g/reservoir-sample 1 (range 2))

will always return `[1]` and never `[0]`.

The patch has the effect of slightly lowering the probability of
swapping out one of the result vector at each iteration of the loop.
The new version is correct in that the probability of each element
being sampled is the same. (You can see this with a simple proof by
induction, I will post that here if you wish.)





[DGEN-4] Document that floating point numbers aren't supported Created: 16/Feb/16  Updated: 16/Feb/16

Status: Open
Project: data.generators
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Saul Shanabrook Assignee: Stuart Halloway
Resolution: Unresolved Votes: 0
Labels: docstring, documentation, interop


 Description   

I noticed that when using `weighted` doesn't work when the weights are not integers. I think this is because `uniform` only returns integer values. For example `(clojure.data.generators/uniform 0.0 1.1)` always seems to return `0`.

It doesn't mention this explicitly in the documentation, so I am not sure if this is intended behavior.






[DGEN-3] Gauss and triangular distributions Created: 27/Jan/16  Updated: 27/Jan/16

Status: Open
Project: data.generators
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Enhancement Priority: Minor
Reporter: Martin Clausen Assignee: Stuart Halloway
Resolution: Unresolved Votes: 0
Labels: None


 Description   

Would a patch implementing a Gauss and/or triangular distribution be welcome?






Generated at Thu Jan 19 21:58:08 CST 2017 using JIRA 4.4#649-r158309.