data.avl

Implement NavigableMap interface or similar

Details

  • Type: Enhancement Enhancement
  • Status: Resolved Resolved
  • Priority: Minor Minor
  • Resolution: Completed
  • Affects Version/s: None
  • Fix Version/s: None
  • Component/s: None
  • Labels:
    None

Description

This allows one to find the nearest matching key if no exact match is found.

I'm implementing a decaying history collection.
I started out with Clojures sorted-map, mapping from revision to value.
However, it turns out that I need a fast way to delete the nth element.
So your AVL tree saves the day.

Except, it would be really nice that if someone looked up an old revision that has been decayed, I could serve them the nearest one instead.

I think it's pretty easy to implement at least that part of it. Just the terminal case of the lookup fn needs to be adjusted. The submap part might be more involved.

  1. 0001-DAVL-1-first-cut-at-clojure.data.avl-nearest.patch
    12/Jan/14 1:13 PM
    3 kB
    Michał Marczyk
  2. 0001-NavigableMap-impl.patch
    12/Jan/14 10:37 AM
    4 kB
    Pepijn de Vos
  3. 0002-DAVL-1-tests-for-nearest.patch
    12/Jan/14 1:36 PM
    1 kB
    Michał Marczyk
  4. 0002-NavigableMap-tests.patch
    12/Jan/14 11:19 AM
    2 kB
    Pepijn de Vos
  5. 0003-test-reverse-sorted-set.patch
    12/Jan/14 1:12 PM
    1 kB
    Pepijn de Vos

Activity

Hide
Michał Marczyk added a comment - - edited

Hey Pepijn, thanks for the ticket!

Actually I was already thinking about possibly providing something like a "rank of nearest neighbour" function; getting key at nearest neighbour would be equally straightforward.

I think you can already do this with subseq / rsubseq:

(defn rank-of-<= [coll x]
  (if (empty? coll)
    -1
    (let [r (avl/rank-of coll x)]
      (if (== -1 r)
        (let [s (rsubseq coll <= x)]
          (avl/rank-of coll (first s)))
        r))))

(mapv #(rank-of-<= (avl/sorted-set 0 2 4) %) [-1 0 1 2 3 4 5])
;= [-1 0 0 1 1 2 2]

But a direct implementation would no doubt be faster. Any ideas as to how a Clojure API for that might look? (I was thinking about rank-of-<= etc., but perhaps we could do better.)

Implementing NavigableMap incompletely – without the submap methods – is a possibility. With submaps is a different story, unless of course the submap methods were to be O(n). I guess this might actually be acceptable, with appropriate documentation? View maps are a possibility as well, but I don't think they'd be as straightforward to implement, since I'd really want them to be first class (in particular, to accept assoc outside of the original range).

Also see the announcement thread for 0.0.10 on the mailing list, where I mentioned this thread and described rank-of-<=:

https://groups.google.com/d/msg/clojure/8T-Dorhq6xQ/HwvRi-1noecJ

Show
Michał Marczyk added a comment - - edited Hey Pepijn, thanks for the ticket! Actually I was already thinking about possibly providing something like a "rank of nearest neighbour" function; getting key at nearest neighbour would be equally straightforward. I think you can already do this with subseq / rsubseq:
(defn rank-of-<= [coll x]
  (if (empty? coll)
    -1
    (let [r (avl/rank-of coll x)]
      (if (== -1 r)
        (let [s (rsubseq coll <= x)]
          (avl/rank-of coll (first s)))
        r))))

(mapv #(rank-of-<= (avl/sorted-set 0 2 4) %) [-1 0 1 2 3 4 5])
;= [-1 0 0 1 1 2 2]
But a direct implementation would no doubt be faster. Any ideas as to how a Clojure API for that might look? (I was thinking about rank-of-<= etc., but perhaps we could do better.) Implementing NavigableMap incompletely – without the submap methods – is a possibility. With submaps is a different story, unless of course the submap methods were to be O(n). I guess this might actually be acceptable, with appropriate documentation? View maps are a possibility as well, but I don't think they'd be as straightforward to implement, since I'd really want them to be first class (in particular, to accept assoc outside of the original range). Also see the announcement thread for 0.0.10 on the mailing list, where I mentioned this thread and described rank-of-<=: https://groups.google.com/d/msg/clojure/8T-Dorhq6xQ/HwvRi-1noecJ
Hide
Pepijn de Vos added a comment -

I spent a lot of time today writing a function that can take a predicate and look up the nearest match. This covers floorEntry(<=), ceilEntry(>=), lowerEntry(<) and higherEntry(>).

The conditional could be cleaner, and might still contain edge cases, but it seems to work so far.

In human language:
The first two statements in the cond check if there is a better node in either branch.
Else return the current node, if it matches the predicate.
Else return nil(backtrack)
The third line sees if an exact key match is valid, and if so retruns it. (> and < dont match)
The remaining two lines check if the child nodes match the predicate and search there.

(defn nearest [comp pred node k]
  (if (nil? node)
    nil
    (let [c (.compare comp k (.getKey node))
          ln (.getLeft node)
          rn (.getRight node)]
      (cond
        (neg? c) (or (nearest comp pred ln k)
                     (when (pred (.getKey node) k) node))
        (pos? c) (or (nearest comp pred rn k)
                     (when (pred (.getKey node) k) node))
        (pred (.getKey node) k) node
        (and ln (pred (.getKey ln) k)) (nearest comp pred ln k)
        (and rn (pred (.getKey rn) k)) (nearest comp pred rn k)))))

My proposed clojure interface would be this very function, but with the comparator and tree taken from the map/set, so you'd write (nearest map < 4) instead of (nearest (.comparator m) < (.getTree m) 4).

Show
Pepijn de Vos added a comment - I spent a lot of time today writing a function that can take a predicate and look up the nearest match. This covers floorEntry(<=), ceilEntry(>=), lowerEntry(<) and higherEntry(>). The conditional could be cleaner, and might still contain edge cases, but it seems to work so far. In human language: The first two statements in the cond check if there is a better node in either branch. Else return the current node, if it matches the predicate. Else return nil(backtrack) The third line sees if an exact key match is valid, and if so retruns it. (> and < dont match) The remaining two lines check if the child nodes match the predicate and search there.
(defn nearest [comp pred node k]
  (if (nil? node)
    nil
    (let [c (.compare comp k (.getKey node))
          ln (.getLeft node)
          rn (.getRight node)]
      (cond
        (neg? c) (or (nearest comp pred ln k)
                     (when (pred (.getKey node) k) node))
        (pos? c) (or (nearest comp pred rn k)
                     (when (pred (.getKey node) k) node))
        (pred (.getKey node) k) node
        (and ln (pred (.getKey ln) k)) (nearest comp pred ln k)
        (and rn (pred (.getKey rn) k)) (nearest comp pred rn k)))))
My proposed clojure interface would be this very function, but with the comparator and tree taken from the map/set, so you'd write (nearest map < 4) instead of (nearest (.comparator m) < (.getTree m) 4).
Hide
Pepijn de Vos added a comment -

If you are okay with the proposal, I can continue implementing the interface and the public function.

I was also thinking about testing. Maybe a generative test would be good, to fuzz all the edge cases.

Show
Pepijn de Vos added a comment - If you are okay with the proposal, I can continue implementing the interface and the public function. I was also thinking about testing. Maybe a generative test would be good, to fuzz all the edge cases.
Hide
Pepijn de Vos added a comment -

initial implementation, no tests yet, but seems to work.

Show
Pepijn de Vos added a comment - initial implementation, no tests yet, but seems to work.
Hide
Pepijn de Vos added a comment -

some tests. life is good.

Show
Pepijn de Vos added a comment - some tests. life is good.
Hide
Pepijn de Vos added a comment -

It does work with (avl/sorted-set-by >) as well

Show
Pepijn de Vos added a comment - It does work with (avl/sorted-set-by >) as well
Hide
Michał Marczyk added a comment -

Hey,

Thanks for pushing on this, and for the actual proposal with tests! I think something along these lines would be a good addition to the API.

I do have a few comments on the approach you've taken with the current patch. First of all, the docstring on nearest promises something different to what is actually delivered: (1) it is not key that needs to satisfy pred, but rather key and the returned item together (this is perhaps slightly nitpicky); (2) in any case, this is not quite true, because one cannot pass in an arbitrary predicate – say, one that is satisfied only by a single item in the collection that lives very far away in the tree; the expectation that the predicate will be something like <, <=, >=, > is built-in. Secondly, with nearest implemented in this way, it is inconvenient to use with collections of arbitrary comparable items, because users are forced to supply ordering predicates which make sense for all items in the given collection.

I think that the first point above is not a problem – it's perfectly reasonable for nearest to accept the four basic ordering predicates exclusively and we should accept this as a design constraint. Given this constraint, I think the natural way to address the second point is the approach of subseq and rsubseq, both of which accept < etc. specifically and treat them in a slightly "magic" fashion.

I have implemented this approach on a branch, see [0] or the patch attached with this comment.

(One thing I didn't mention above is that I decided to skip the public protocol in my impl – for internal data.avl use, an interface is more appropriate, and I'm not sure where a public version of this should live. Will think about this some more.)

[0] https://github.com/clojure/data.avl/tree/1-nearest

Show
Michał Marczyk added a comment - Hey, Thanks for pushing on this, and for the actual proposal with tests! I think something along these lines would be a good addition to the API. I do have a few comments on the approach you've taken with the current patch. First of all, the docstring on nearest promises something different to what is actually delivered: (1) it is not key that needs to satisfy pred, but rather key and the returned item together (this is perhaps slightly nitpicky); (2) in any case, this is not quite true, because one cannot pass in an arbitrary predicate – say, one that is satisfied only by a single item in the collection that lives very far away in the tree; the expectation that the predicate will be something like <, <=, >=, > is built-in. Secondly, with nearest implemented in this way, it is inconvenient to use with collections of arbitrary comparable items, because users are forced to supply ordering predicates which make sense for all items in the given collection. I think that the first point above is not a problem – it's perfectly reasonable for nearest to accept the four basic ordering predicates exclusively and we should accept this as a design constraint. Given this constraint, I think the natural way to address the second point is the approach of subseq and rsubseq, both of which accept < etc. specifically and treat them in a slightly "magic" fashion. I have implemented this approach on a branch, see [0] or the patch attached with this comment. (One thing I didn't mention above is that I decided to skip the public protocol in my impl – for internal data.avl use, an interface is more appropriate, and I'm not sure where a public version of this should live. Will think about this some more.) [0] https://github.com/clojure/data.avl/tree/1-nearest
Hide
Michał Marczyk added a comment -

Tests for nearest using subseq / rsubseq. Also pushed to the previously mentioned branch.

Show
Michał Marczyk added a comment - Tests for nearest using subseq / rsubseq. Also pushed to the previously mentioned branch.
Hide
Michał Marczyk added a comment -

Oops, wasn't testing sorted-set-by, updated last patch.

Show
Michał Marczyk added a comment - Oops, wasn't testing sorted-set-by, updated last patch.
Hide
Pepijn de Vos added a comment - - edited

I agree on the doc string, and don't mind whether a protocol or interface is used. The protocol generates the function, which is nice.

No matter which implementation gets used, you could probably use my tests and java interop, right?

I'm not a big fan of magic predicates.
I tried quite hard not to special-case those predicates.
Though I have to admit I did not for a second think about non-numeric keys.
I just realized > does not work on anything but numbers.
My world has been shattered.

I suppose your magic is a necessary evil.
The alternative would be the Java interface, which is more verbose but less magic.

I'm tempted to suggest algo.generic.comparison, but that's probably madness.

Show
Pepijn de Vos added a comment - - edited I agree on the doc string, and don't mind whether a protocol or interface is used. The protocol generates the function, which is nice. No matter which implementation gets used, you could probably use my tests and java interop, right? I'm not a big fan of magic predicates. I tried quite hard not to special-case those predicates. Though I have to admit I did not for a second think about non-numeric keys. I just realized > does not work on anything but numbers. My world has been shattered. I suppose your magic is a necessary evil. The alternative would be the Java interface, which is more verbose but less magic. I'm tempted to suggest algo.generic.comparison, but that's probably madness.
Hide
Michał Marczyk added a comment -

I'm a believer in using protocols as low-level building blocks:

https://kotka.de/blog/2011/07/Separation_of_concerns.html

I'll absolutely pull in your tests regardless of impl. As for Java interop, I hope so, I certainly see no reason to rewrite anything. The one remaining question concerns submaps – I'm not sure if it's really appropriate to provide an incomplete implementation for this sort of interface, and as for complete implementations, O(n) submaps break Java expectations, I think, since the standard classes use view maps, and view maps are kind of inconvenient with assoc out of bounds. (Java classes throw in this situation, but is it ok for us to do so? Feels a bit unsavoury.)

As a side note, I suppose we could appropriate mk-bound-fn to bring nearest* close to subseq's test handling easily. I'd expect a very slight perf impact, if any.

Haven't done much benchmarking so far beyond this:

(require '[criterium.core :as c])
(require '[clojure.data.avl :as avl])
(def large-set (apply avl/sorted-set (range 0 100000 2)))
(c/bench (avl/nearest large-set < 91390))

(Getting to 91390 in this tree happens to involve a few hops in both directions.)

With the Navigable* patches applied (actually state of repo as of the "test reverse-sorted set" commit), that's 373.690939 ns. With the DAVL-1 patches, 329.592223 ns.

Show
Michał Marczyk added a comment - I'm a believer in using protocols as low-level building blocks: https://kotka.de/blog/2011/07/Separation_of_concerns.html I'll absolutely pull in your tests regardless of impl. As for Java interop, I hope so, I certainly see no reason to rewrite anything. The one remaining question concerns submaps – I'm not sure if it's really appropriate to provide an incomplete implementation for this sort of interface, and as for complete implementations, O(n) submaps break Java expectations, I think, since the standard classes use view maps, and view maps are kind of inconvenient with assoc out of bounds. (Java classes throw in this situation, but is it ok for us to do so? Feels a bit unsavoury.) As a side note, I suppose we could appropriate mk-bound-fn to bring nearest* close to subseq's test handling easily. I'd expect a very slight perf impact, if any. Haven't done much benchmarking so far beyond this:
(require '[criterium.core :as c])
(require '[clojure.data.avl :as avl])
(def large-set (apply avl/sorted-set (range 0 100000 2)))
(c/bench (avl/nearest large-set < 91390))
(Getting to 91390 in this tree happens to involve a few hops in both directions.) With the Navigable* patches applied (actually state of repo as of the "test reverse-sorted set" commit), that's 373.690939 ns. With the DAVL-1 patches, 329.592223 ns.
Hide
Michał Marczyk added a comment -

Ok, I implemented O(log n) submaps/subsets. The new function exposing this functionality is clojure.data.avl/subrange (name subject to change). See the branch mentioned above:

https://github.com/clojure/data.avl/tree/1-nearest

Any comments on the API?

In particular, I'm wondering whether to expose the split operation (map * key -> map * entry * map, meaning submap of all keys less than the given key, the entry at the given key (if any), submap of all keys greater than the given key) and whether subrange is the best name.

In any case, with this in place, we can totally implement NavigableMap with O(log n) complexity for all operations.

Show
Michał Marczyk added a comment - Ok, I implemented O(log n) submaps/subsets. The new function exposing this functionality is clojure.data.avl/subrange (name subject to change). See the branch mentioned above: https://github.com/clojure/data.avl/tree/1-nearest Any comments on the API? In particular, I'm wondering whether to expose the split operation (map * key -> map * entry * map, meaning submap of all keys less than the given key, the entry at the given key (if any), submap of all keys greater than the given key) and whether subrange is the best name. In any case, with this in place, we can totally implement NavigableMap with O(log n) complexity for all operations.
Hide
Michał Marczyk added a comment -

Also added split-at:

(avl/split-at (avl/sorted-set 0 1 2 3 4 5 6 7 8) 4)
;= [#{0 1 2 3} 4 #{5 6 7 8}]

It's a building block in subrange anyway and, I think, likely to be a useful primitive in other scenarios.

Show
Michał Marczyk added a comment - Also added split-at:
(avl/split-at (avl/sorted-set 0 1 2 3 4 5 6 7 8) 4)
;= [#{0 1 2 3} 4 #{5 6 7 8}]
It's a building block in subrange anyway and, I think, likely to be a useful primitive in other scenarios.
Hide
Pepijn de Vos added a comment -

Ship it

Observation: the result of split-at is a proper binary tree. Maybe not a balanced one.

Teasing: is there a parallel to split-at using rank? i.e. [(take n coll) (drop n coll)]

Wondering: Why don't clojures red-black trees do all this cool stuff...

Show
Pepijn de Vos added a comment - Ship it Observation: the result of split-at is a proper binary tree. Maybe not a balanced one. Teasing: is there a parallel to split-at using rank? i.e. [(take n coll) (drop n coll)] Wondering: Why don't clojures red-black trees do all this cool stuff...
Hide
Michał Marczyk added a comment -

split-at using rank:

(defn split-at-rank [coll n]
  (avl/split-at coll (nth coll n)))

I could certainly add it to the API, I wonder if there's a better name though...

Show
Michał Marczyk added a comment - split-at using rank:
(defn split-at-rank [coll n]
  (avl/split-at coll (nth coll n)))
I could certainly add it to the API, I wonder if there's a better name though...
Hide
Pepijn de Vos added a comment -

Nah. I'm not sure if it warrants api inclusion. Maybe.
And if it does, should it include the middle item? Less confusing to mirror core behavior IMO.

I'd rather give key-based split-at a new name, because it shadows a core function that does what the ranked version does.

Show
Pepijn de Vos added a comment - Nah. I'm not sure if it warrants api inclusion. Maybe. And if it does, should it include the middle item? Less confusing to mirror core behavior IMO. I'd rather give key-based split-at a new name, because it shadows a core function that does what the ranked version does.
Hide
Michał Marczyk added a comment -

Hm, that's a good point. My reasoning was that sorted collections are in a sense "indexed" by their keys, so the operation is analogous to the core split-at. I guess the ranked version can be viewed as being even closer...

As for not including the middle term – that would mean giving preference to left or right, where the concept behind the operation is symmetric. For maps, checking for nil tells us if the key is present in the original map (admittedly not too important). Additionally, I believe the version with the middle term is what's typically referred to as "split" in the literature.

Hm, split is actually a possible public name that I did consider for a moment...

I guess this wouldn't free up split-at for the ranked version – having both split and split-at would probably be confusing? But might be a better name anyway. Or maybe there's another possible split-*-style name?

What about subrange? Do you think this sounds all right? Any other ideas?

Show
Michał Marczyk added a comment - Hm, that's a good point. My reasoning was that sorted collections are in a sense "indexed" by their keys, so the operation is analogous to the core split-at. I guess the ranked version can be viewed as being even closer... As for not including the middle term – that would mean giving preference to left or right, where the concept behind the operation is symmetric. For maps, checking for nil tells us if the key is present in the original map (admittedly not too important). Additionally, I believe the version with the middle term is what's typically referred to as "split" in the literature. Hm, split is actually a possible public name that I did consider for a moment... I guess this wouldn't free up split-at for the ranked version – having both split and split-at would probably be confusing? But might be a better name anyway. Or maybe there's another possible split-*-style name? What about subrange? Do you think this sounds all right? Any other ideas?
Hide
Michał Marczyk added a comment - - edited

Just sent an RFC to clojure-dev. Now I'll take a short break from naming woes and do the ClojureScript port and such.

Show
Michał Marczyk added a comment - - edited Just sent an RFC to clojure-dev. Now I'll take a short break from naming woes and do the ClojureScript port and such.
Hide
Michał Marczyk added a comment -

I've changed the API a bit. split-at is now the index-based splitting function; the original one has been renamed to split-key. Also, subrange now takes subseq-like test arguments (and thus supports both single and double-ended ranges, as well as ranges exclusive of the given endpoints). nearest is available with the original parameters (coll, test, key).

Does this make sense?

I'm aiming at a release in the next couple of days, but I'll mark it alpha to keep the possibility of adjusting the API open for a while.

Show
Michał Marczyk added a comment - I've changed the API a bit. split-at is now the index-based splitting function; the original one has been renamed to split-key. Also, subrange now takes subseq-like test arguments (and thus supports both single and double-ended ranges, as well as ranges exclusive of the given endpoints). nearest is available with the original parameters (coll, test, key). Does this make sense? I'm aiming at a release in the next couple of days, but I'll mark it alpha to keep the possibility of adjusting the API open for a while.
Hide
Pepijn de Vos added a comment -

Very nice. Ship it!

Show
Pepijn de Vos added a comment - Very nice. Ship it!
Hide
Michał Marczyk added a comment -

Just pushed the ClojureScript port of the recent changes to master. Almost there now.

Show
Michał Marczyk added a comment - Just pushed the ClojureScript port of the recent changes to master. Almost there now.
Hide
Michał Marczyk added a comment -

Just fixed a bug in split-at whereby the split point would effectively be shifted by one. Also, split-at and split-key now take the split point argument first to match core split-at and split-with. Finally, the new features are now documented in the README.

Show
Michał Marczyk added a comment - Just fixed a bug in split-at whereby the split point would effectively be shifted by one. Also, split-at and split-key now take the split point argument first to match core split-at and split-with. Finally, the new features are now documented in the README.
Hide
Michał Marczyk added a comment -

I think subrange may become slice. (This whole naming issue is taking way too much time.)

Show
Michał Marczyk added a comment - I think subrange may become slice. (This whole naming issue is taking way too much time.)
Hide
Michał Marczyk added a comment -

Hey Pepijn, just wanted to let you know that your tests are now on master.

Show
Michał Marczyk added a comment - Hey Pepijn, just wanted to let you know that your tests are now on master.
Hide
Michał Marczyk added a comment -

0.0.12-alpha1 is out.

Here's the announcement thread on the mailing list:

https://groups.google.com/d/msg/clojure/UBYwNRd956g/GVio8bpgWO4J

Show
Michał Marczyk added a comment - 0.0.12-alpha1 is out. Here's the announcement thread on the mailing list: https://groups.google.com/d/msg/clojure/UBYwNRd956g/GVio8bpgWO4J

People

Vote (0)
Watch (0)

Dates

  • Created:
    Updated:
    Resolved: