<< Back to previous view

[CLJ-1036] hash is inconsistent with = for some BigInteger and floating point values Created: 02/Aug/12  Updated: 29/Jan/14  Resolved: 05/Sep/13

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.4, Release 1.5
Fix Version/s: None

Type: Defect Priority: Major
Reporter: Paul Stadig Assignee: Unassigned
Resolution: Declined Votes: 0
Labels: None

Attachments: Text File clj-1036-hasheq-for-biginteger-patch-v4.txt    
Patch: Code and Test
Approval: Triaged

 Description   

hash is documented to be consistent with = but Util/hasheq returns different hash values for some pairs of numbers that are =

user> (apply = [-1 -1N (biginteger -1)])
true
user> (map hash [-1 -1N (biginteger -1)])
(0 0 -1)

user> (apply = [(Float. 1e9) (Double. 1e9)])
true
user> (map hash [(Float. 1e9) (Double. 1e9)])
(1315859240 1104006501)

Consequences include incorrect behavior for hash-maps containing keys that are =, but have different hash values:

;; Incorrect return value with multiple keys = to each other
user> (assoc (hash-map -1N :should-be-replaced) (biginteger -1) :new-val)
{-1N :should-be-replaced, -1 :new-val}

;; array-map gives correct value, since it uses =, not hash
user> (assoc (array-map -1N :should-be-replaced) (biginteger -1) :new-val)
{-1N :new-val}

Patch: clj-1036-hasheq-for-biginteger-patch-v4.txt

Approach:

The only BigInteger values that have inconsistent hash values should be those in the range of a long. BigInteger and BigInt values outside the range of a long already both return BigInteger.hashCode().

All integer values will return consistent hash codes if we add a new case to Numbers.hasheq(Number) for BigIntegers that lie in the range of a long, returning the same hash that such a long value does.

For floating point values, the patch makes their hashes consistent by converting floats to doubles and then hashing.

One alternate approach would be to convert all double values to floats and hash float values only. However, this throws away half of the bits of the double value before hashing, leading to many undesirable hash collisions between different double values.



 Comments   
Comment by Paul Stadig [ 02/Aug/12 9:55 AM ]

Also, the biginteger function has metadata saying that it has been added since 1.0, but it was actually added in 1.3. The bigint function has metadata saying that it has been added since 1.3, but it has been added since 1.0.

I think during the work to implement BigInt someone renamed the existing bigint function (which used to return a BigInteger) to biginteger, and the metadata got carried with it, then a new bigint function was added with :since 1.3 metadata even though that function name has existed since 1.0.

Comment by Andy Fingerhut [ 26/Sep/12 11:59 AM ]

clj-1036-hasheq-for-biginteger-patch-v1.txt dated Sep 26 2012 makes BigInteger's return equal hash values for values considered equal by =.

It does the same for Float and Double types, which before returned different hash values for values considered equal by =

I went ahead and changed the :added metadata on bigint and biginteger, although I can see that without my change, the person who did that may have meant for the :added to go with the behavior of the function, not with the name. Paul's suggested change that I have in the patch is for the :added metadata to go with the name, not the function behavior. It is easy to remove that part of the patch if that change is not desired.

Comment by Rich Hickey [ 13/Nov/12 3:29 PM ]

You can't just consider only the lower long of bigints. Also, what's the rationale for the float stuff?

Comment by Andy Fingerhut [ 13/Nov/12 9:44 PM ]

clj-1036-hasheq-for-biginteger-patch-v2.txt dated Nov 13 2012 is identical to clj-1036-hasheq-for-biginteger-patch-v1.txt except that it addresses Rich's comment that for BigInt's and BigInteger values that don't fit in a long, their entire value must be hashed.

The rationale for the changes to hasheq for Float and Double types is the same as the rationale for the change for BigInteger: without that change, Float and Double types that are = can have different hasheq values.

Comment by Paul Stadig [ 14/Nov/12 5:18 AM ]

Although you are correct that Double and Float are =, but have different hashes:

user=> (apply = [(Double. -1.0) (Float. -1.0)])
true
user=> (map hash [(Double. -1.0) (Float. -1.0)])
(-1074790400 -1082130432)

I could not get the same errant behavior out of PHM:

user=> (assoc clojure.lang.PersistentHashMap/EMPTY (Float. -1.0) :oops! (Double. -1.0) :one)
{-1.0 :one}

I haven't taken the time to investigate exactly what is happening here, but either way I think this ticket is very specifically about BigInteger and the Float/Double issue could be explored in another ticket.

Comment by Andy Fingerhut [ 14/Nov/12 10:08 AM ]

I can open another ticket for the Float/Double issue if that is what people would prefer.

I think what is happening in the test case you give, Paul, is that the hash values for (Float. -1.0) and (Double. -1.0) happen to be the same in their least significant 20 bits, and PHM isn't using the upper bits where the hash values differ.

Clojure 1.5.0-beta without patch:
user=> (map #(format "%x" %) (map hash [(Float. -1.0) (Double. -1.0)]))
("bf800000" "bff00000")

There are other Float/Double values where this lucky accident doesn't happen, e.g.

Clojure 1.5.0-beta1 without patch:

user=> (= (Float. 1e9) (Double. 1e9))
true
user=> (map hash [(Float. 1e9) (Double. 1e9)])
(1315859240 1104006501)
user=> (assoc clojure.lang.PersistentHashMap/EMPTY (Float. 1e9) :oops! (Double. 1e9) :one)

{1.0E9 :one, 1.0E9 :oops!}

With 1.5.0-beta1 plus patch clj-1036-hasheq-for-biginteger-patch-v2.txt:

user=> (= (Float. 1e9) (Double. 1e9))
true
user=> (map hash [(Float. 1e9) (Double. 1e9)])
(1315859240 1315859240)
user=> (assoc clojure.lang.PersistentHashMap/EMPTY (Float. 1e9) :oops! (Double. 1e9) :one)

{1.0E9 :one}
Comment by Andy Fingerhut [ 01/Jan/13 11:30 AM ]

Presumptuously changing status from Not Approved to Vetted, since patch clj-1036-hasheq-for-biginteger-patch-v2.txt should address the reasons that Rich marked the previous patch as Not Approved. Changing it to Vetted on the assumption that if Stuart Halloway marked the previous patch as Screened, the ticket itself is good enough to be Vetted.

Comment by Rich Hickey [ 12/Apr/13 8:48 AM ]

Patches and tickets need to be better than this. Talks about BigInteger, changes hash for doubles. Lists problem but not approach, need to trawl through comments and code to see what's going on, etc.

Comment by Andy Fingerhut [ 07/Aug/13 6:49 PM ]

Updated summary to mention floating point types, and completely revamped description so that it should be self-contained, without having to read any comments.

Comment by Andy Fingerhut [ 08/Aug/13 3:51 PM ]

Add new patch clj-1036-hasheq-for-biginteger-patch-v4.txt dated Aug 8 2013 that handles floating point hashes better. Description will be updated to match.

Comment by Rich Hickey [ 05/Sep/13 8:27 AM ]

It is out of scope for Clojure to fix this for Java types Float/Double/BigInteger

Comment by Andy Fingerhut [ 29/Jan/14 3:05 PM ]

It seems to have become in scope to fix this for Java type BigInteger with this commit: https://github.com/clojure/clojure/commit/96e72517615cd2ccdb4fdbbeb6ffba5ad99dbdac

Generated at Fri Nov 28 07:35:24 CST 2014 using JIRA 4.4#649-r158309.