# Better hashing

## Key

• This line was removed.
• Formatting was changed.

...

Problem

Using Clojure 1.6.0-alpha3 hash (same as Clojure 1.5.1)

Clojure 1.6.0 just after this commit on Jan 29 2014

Clojure 1.6.0 just after this commit on Jan 30 2014Using Mark Engelberg's 2013-11-18 proposed hashUsing Murmur3 hash

Paul Butcher's N-queens problem with 6x6 board

180,568 solutions hash to N distinct hash values

average of A solutions per hash value (max M)

6.7 min (~33 times slower than with 2013-11-18 proposed hash)

N=301

A=599.9 (M=2,492)

12.8 sec

N=180,566

A=1.0 (M=2)

12.5 sec

N=180,561

A=1.0 (M=2)

12.2 sec

N=180,566

A=1.0 (M=2)

11.8 sec

N=180,567

A=1.0 (M=2)

with 6x9 board

20,136,752 solutions hash to N distinct hash values

average of A solutions per hash value (max M)

> 8 hours (did not wait for it to finish)

N=17,936

A=1,122.7 (M=81,610)

12.8 min

N=20,089,842

A=1.0 (M=3)

11.9 min

N=20,089,436

A=1.0 (M=3)

11.4 min

N=20,089,766

A=1.0 (M=4)

12.0 min

N=20,089,390

A=1.00 (M=3)

Hash all ints in range -1,000,000 to 1,000,000 inclusive

2,000,001 values hash to N distinct hash values

average of A integers per hash value (max M)

N=1,000,001

A=2.0 (M=2)

N=1,999,801

A=1.0 (M=2)

N=1,999,801

A=1.0 (M=2)

Hash all 2-element vectors with first element in range [0,999] and second element in same range

1,000,000 values hash to N distinct hash values

average of A values per hash value (max M)

N=31,969

A=31.3 (M=33)

N=999,846

A=1.0 (M=2)

N=999,900

A=1.0 (M=2)

Hash all 2-element vectors with first element in range [-500,499] and second element in same range

1,000,000 values hash to N distinct hash values

average of A values per hash value (max M)

N=15,969

A=62.6 (M=68)

N=999,860

A=1.0 (M=2)

N=999,883

A=1.0 (M=2)

All sets that are subsets of #{0 1 2 3 ... 19}

1,048,576 sets hash to N distinct hash values

average of A values per set (max M)

N=191

A=5,489.9 (M=16,440)

N=1,048,451

A=1.0 (M=2)

N=1,048,208

A=1.0 (M=2)

All maps {i j} where i is in the range [0,999] and j is in the same range

1,000,000 maps hash to N distinct hash values

average of A values per map (max M)

N=1,024

A=976.6 (M=1,000)

N=999,728

A=1.0 (M=2)

N=999,900

A=1.0 (M=2)

Compile Clojure 1.6.0-alpha3 source with "time ant jar", so no tests run

avg 20.1 sec

meas: 19.6, 19.8, 19.8, 20.2, 20.9

avg 20.6 sec (2.5% longer)

meas: 19.8, 20.4, 20.7, 20.8, 21.3

avg 20.6 sec (2.5% longer)

meas: 20.3, 20.5, 20.5, 20.8, 20.9

avg 20.0 sec

meas: 19.6, 19.7, 19.9, 20.2, 20.7

avg 21.4 sec (6.5% longer)

meas: 21.3, 21.4, 21.4, 21.4 21.6

Compile Clojure 1.6.0-alpha3 source with "time ant", which includes running tests, but with generative test duration reduced to 1.0 sec

avg 48.0 sec

meas: 47.3, 47.4, 47.6, 48.3, 49.5

120,353 unique values hash to 113,405 distinct hash values

average of 1.06 values per hash value

avg 47.7 sec

meas: 47.2, 47.4, 47.6, 47.9, 48.2

avg 47.3 sec

meas: 46.8, 47.2, 47.4, 47.5, 47.8

avg 48.0 sec

meas: 47.1, 47.6, 47.7, 48.2, 49.5

119,811 unique values hash to 114,329 distinct hash values

average of 1.05 values per hash value

avg 48.7 sec (1.5% longer)

meas: 47.8, 48.5, 48.9 49.1 49.3

Calc hashes of all integers in (range 1000000000) and return sum, using:

(time (hash-range-n 1000000000))

See here for definition of hash-range-n

avg 29.8 sec

meas: 29.6, 29.6, 29.9, 30.0, 30.1

avg 47.9 sec (60.7% longer)

meas: 47.9, 47.9, 47.9, 47.9, 48.0

avg 44.3 sec (48.7% longer)

meas: 43.2, 43.3, 45.0, 45.0, 45.0

avg 38.7 sec (30% longer)

meas: 38.6, 38.6, 38.7, 38.7, 38.7

Verified that hash values of first 500 million integers (those in (range 500000000)) are all different.

avg 30.2 sec (1.3% longer)

meas: 29.9, 30.2, 30.2, 30.2, 30.2

Calc hashes of 30,001 vectors [] [0] [0 1], etc. up to [0 1 ... 29999] and return sum, using

(let [cs (doall (reductions conj [] (range 30000)))]

(time (total-hash cs)))

See here for definition of total-hash

avg 10.7 sec

meas: 10.6, 10.6, 10.7, 10.7, 10.7

30,000 unique hash values, avg 1.00, max 2

avg 16.0 sec (49.5% longer)

meas: 15.9, 16.0, 16.1, 16.1, 16.1

All 30,001 hash values unique

avg 14.3 sec (33.6% longer)

meas: 14.1, 14.3, 14.3, 14.3, 14.3

All 30,001 hash values unique

avg 10.8 sec (pretty much same)

meas: 10.5, 10.7, 10.7, 10.7, 11.6

30,000 unique hash values, avg 1.00, max 2

avg 11.4 sec (6.5% longer)

meas: 11.3, 11.3, 11.4, 11.4, 11.4

All 30,001 hash values unique

Calc hashes of 30,001 sets #{} #{0} #{0 1}, etc. up to #{0 1 ... 29999} and return sum, using

(let [cs (doall (reductions conj #{} (range 30000)))]

(time (total-hash cs)))

avg 70.2 sec

meas: 66.4, 69.1, 69.4, 71.3, 74.8

30,000 unique hash values, avg 1.00, max 2

avg 86.3 sec (22.9% longer)

meas: 83.3, 84.1, 86.3, 88.2, 89.5

All 30,001 hash values unique

avg 81.8 sec (16.5% longer)

meas: 78.8, 79.1, 80.1, 81.2, 89.6

All 30,001 hash values unique

avg 71.4 sec (1.7% longer)

meas: 70.9, 71.0, 71.2, 72.0, 72.1

29,308 unique hash values, avg 1.02, max 2

avg 71.6 sec (2.0% longer)

meas: 67.9, 72.0, 72.5, 72.7, 73.1

All 30,001 hash values unique

Calc hashes of 30,001 maps {} {0 1} {0 1, 1 2} ... up to {0 1, 1 2, ..., 29999 30000} and return sum, using

(let [cs (doall (reductions (fn [m i] (assoc m i (inc i))) {} (range 30000)))]

(time (total-hash cs)))

avg 71.7 sec

meas: 68.4, 69.0, 69.1, 75.7, 76.3

All 30,001 hash values unique

avg 163.4 (127.9% longer)

meas: 159.2, 159.2, 164.3, 166.5, 167.5

All 30,001 hash values unique

avg 124.6 sec (73.8% longer)

meas: 115.7, 125.5, 126.6, 127.4, 127.8

All 30,001 hash values unique

avg 78.5 sec (9.5% longer)

meas: 74.1, 75.0, 77.0, 81.2, 85.4

All 30,001 hash values unique

avg 89.3 sec (24.5% longer)

meas: 83.9, 87.8, 89.8, 91.2, 93.7

All 30,001 hash values unique

...

### Tracking updates to other libraries for Clojure 1.6 hash changes

Clojure 1.6.0-beta1 is out as of this writing, and the hash method has been decided upon.  Now comes updating Clojure contrib libraries and 3rd party Clojure libraries that implement their own customized collections so that their hash function values for custom sets and maps (at least those intended to be clojure.core/= to built-in Clojure sets, vectors, or maps) have consistent hash values when using Clojure 1.6.

The table below gives some places that could use updating, and status of changes.

Notes: In Clojure 1.5.1 and 1.4.0 (and maybe also 1.3.0?), (.hashCode x) and (hash x) were calculated similarly, and had the same value for many values of x, even many sets, maps, and vectors.  However, they were not always the same, because (.hashCode x) uses .hashCode to hash each set/vector element or map key-value pair, whereas (hash x) uses hash to do so, and .hashCode and hash are intentionally different for Integers, Shorts, Bytes, and some other types.

In Clojure 1.6.0, (.hashCode x) should remain consistent with Java .hashCode for all sets, maps, and vectors, whereas (hash x) will almost always be different than (.hashCode x).  Unit tests should include special cases comparing hash of your custom collection against Clojure's corresponding built-in collection for the empty set, vector, and/or map, because that is no longer the simple value of 0/1 that it was before.  Also test the result of (empty my-custom-non-empty-collection), too.  Consider using the collection-check library.

Project/LibraryWhat needs updating?Status
Tests in Clojure itself

Some tests were disabled when the hash changes were first made, with this commit.

DONE (except one test still disabled): Most of them were updated to be independent of Clojure's hash function with this commit, but the following line at or near line 311 in file test/clojure/test_clojure/java_interop.clj is still commented out:

;;(test-to-passed-array-for hash-set)

core.cacheSome tests failed due to hash-dependent sequence ordersDONE: Patch for ticket CCACHE-33 committed
core.matchSome tests failed due to hash-dependent sequence ordersDONE: Fix for ticket MATCH-94 committed
core.matrixMike Anderson says it should be ready for Clojure 1.6 with no updates needed.DONE: No changes needed.
core.rrb-vectorHash calculation needs updatingDONE: Fix for ticket CRRBV-3 committed.
data.avlHash calculation needs updatingDONE: Fix for ticket DAVL-3 committed in commit 1, 2, and 3.
data.finger-treeHash calculation for finger trees needs updating for 1.6DONE: Fix for ticket DFINGER-2 committed.
immutable-bitsetHash calculation for bitsets needs updating for 1.6DONE: Fixed with this commit.  Note that this data structure is a special case, in that it can only contain Long values, thus in Clojure 1.5.1 and earlier it was guaranteed that (.hashCode immutable-bitset) and (hash immutable-bitset) were equal for all immutable-bitsets.  This is not true in general for other collections, because (.hashCode x) and (hash x) are in general different for Integers, Shorts, Bytes, etc.
instaparseMark Engelberg says this relies on incremental hashing, and will need some rethinking on how to update.Some changes have been made on branch "1.6" in instaparse's Github repository and tested for correctness and performance.  Progress reports here, here.
orderedNamespaces flatland.ordered.map and flatland.ordered.set appear to need updates
potemkinA map implementation needs updatingDONE: Fixed with this commit.
tools.namespaceSome tests failed due to hash-dependent sequence ordersDONE: Ticket TNS-16 created, fix comitted.
tools.nreplSome tests failed due to hash-dependent sequence ordersDONE: Fixed with this commit.
usefulflatland.useful.deftype/defmap probably needs updating.  The namespace flatland.useful.map appears not to need any updates, as it uses Clojure maps but does not customize them at a low level that overrides the hash calculation.