Mailing List Archive: hash algorithms...

hash algorithms...

Apr 10, 2006, 1:36 PM

Post #1 of 10 (2752 views)

Anders: the is mostly n?rd talk, don't despair :-)

Between the phone and my email, I have been playing with hash
algorithms today.

I looked at the google code, and it's very neat, but it's C++
templates and I don't feel like having two languages involved.

Also, the google code is very focused on memory usage, but we're
going to store objects of typically a few hundred bytes for each
index, so it is not critical that we spend only a few bits per
item on the hashing.

In fact, for us it is more important that inserts/deletes interfere
as little as possible with lookups.

One thing struck me, and that is that ideally we would need to hash
more than just the URL. A given document could exist in different
languages or formats, selected by HTTP header, rather than by URL.
For now that is going to be an item in the "notes". One way to
handle it later on is with a VCL function doing the lookup, and the
lookup command in VCL taking arguments of some kind.

I will pack the hash-table functionality in a well defined interface
(Insert, Delete, Lookup) so that we can have multiple implementations
later on. Which one to use will then be a command-line parameter.

My guess is that the default hash table will be something like this:

calculate a hash of the URL. (exact algorithm TBD, I've used
MD5 for my experiments).

First byte of the hash indexes an array of 256 pointers.

while (object not resolved) {
Pointer either points to a sorted (by hash) linked
list or another 256 pointer array (indexed by the next
byte of the hash).
}

When a linked list grows more than X items, it is converted
to a 256 pointer array. Conversion from array to list
is never done.

I played with having both 256 and 16 bucket arrays, the closer one
gets to the leaf the less dense the tree is and therefore a linked
list seems to be behave better in the range from the twigs to the
leaves.

Up to around a thousand URLs this gives one array indexing and up
to approx 8 list entries to check.

Up to around 100k URLs it gives two index operations and a few list
entries.

Up to over 10M URls it gives three index operations and a few
list entries.

Overall, memory usage is around 13-19 bytes per URL.

Locking, is like a filesystems directory tree:

A global lock (actually inside the root 256-array) protects
all pointers in the root 256-array.

When the first array has been indexed, the listhead or
256-array will contain another lock which we grab before
releaseing the toplevel lock.

To change a list to an array, both the parent and the list
lock must be held.

I think mutexes will be OK for this because the fan-out is quite
big. Rw-locks could be used if there is too much contention, but
I'd prefer to avoid the lock-upgrade issue.

More later when I try to implement this.

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.