Mailing List Archive

hash algorithms...
Anders: this is mostly nørd talk, don't despair :-)

Between the phone and my email, I have been playing with hash
algorithms today.

I looked at the Google code, and it's very neat, but it's C++
templates and I don't feel like having two languages involved.

Also, the Google code is very focused on memory usage, but we're
going to store objects of typically a few hundred bytes for each
index, so it is not critical that we spend only a few bits per
item on the hashing.

In fact, for us it is more important that inserts/deletes interfere
as little as possible with lookups.

One thing struck me, and that is that ideally we would need to hash
more than just the URL. A given document could exist in different
languages or formats, selected by HTTP header, rather than by URL.
For now that is going to be an item in the "notes". One way to
handle it later on is with a VCL function doing the lookup, and the
lookup command in VCL taking arguments of some kind.


I will pack the hash-table functionality behind a well-defined interface
(Insert, Delete, Lookup) so that we can have multiple implementations
later on. Which one to use will then be a command-line parameter.
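
Roughly, I imagine it as a small struct of function pointers, something
like this (names and prototypes are placeholders only):

struct cacheobj;                        /* the cached object we index */

struct hash_impl {
        const char *name;
        void (*init)(void);
        void (*insert)(const unsigned char *key, struct cacheobj *obj);
        struct cacheobj *(*lookup)(const unsigned char *key);
        void (*delete)(const unsigned char *key);
};

Picking the implementation at startup is then just a matter of pointing
at one of these structs.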

My guess is that the default hash table will be something like this:

Calculate a hash of the URL (exact algorithm TBD; I've used
MD5 for my experiments).

First byte of the hash indexes an array of 256 pointers.

while (object not resolved) {
        Pointer either points to a sorted (by hash) linked
        list or another 256 pointer array (indexed by the next
        byte of the hash).
}

When a linked list grows to more than X items, it is converted
to a 256 pointer array. Conversion from array back to list
is never done.
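
In C, the data structure could look roughly like this (very much a
sketch, names and exact layout TBD):

#include <pthread.h>

#define FANOUT          256     /* one hash byte consumed per level */
#define LIST_LIMIT      8       /* the "X" above */

struct cacheobj;
struct harray;

struct hentry {                 /* one object on a hash-sorted list */
        struct hentry *next;
        unsigned char hash[16]; /* full MD5 of the URL */
        struct cacheobj *obj;
};

struct hlist {                  /* listhead, with its own lock */
        pthread_mutex_t mtx;
        struct hentry *head;
};

struct hslot {                  /* each of the 256 pointers */
        int is_array;
        union {
                struct harray *array;
                struct hlist *list;
        } u;
};

struct harray {                 /* interior node, 256-way fan-out */
        pthread_mutex_t mtx;    /* protects the slots below */
        struct hslot slot[FANOUT];
};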

I played with having both 256 and 16 bucket arrays; the closer one
gets to the leaves, the less dense the tree is, and therefore a linked
list seems to behave better in the range from the twigs to the
leaves.

Up to around a thousand URLs this gives one index operation and up
to approx. 8 list entries to check.

Up to around 100k URLs it gives two index operations and a few list
entries.

Up to well over 10M URLs it gives three index operations and a few
list entries.

Overall, memory usage is around 13-19 bytes per URL.

Locking is like a filesystem's directory tree:

A global lock (actually inside the root 256-array) protects
all pointers in the root 256-array.

When the first array has been indexed, the listhead or
256-array will contain another lock, which we grab before
releasing the toplevel lock.

To change a list to an array, both the parent and the list
lock must be held.

I think mutexes will be OK for this because the fan-out is quite
big. Rw-locks could be used if there is too much contention, but
I'd prefer to avoid the lock-upgrade issue.
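
In code the descent would be hand-over-hand: grab the child's lock
(array or listhead) before releasing the parent's. A sketch of lookup,
using the structs above:

#include <string.h>

static struct cacheobj *
hash_lookup(struct harray *a, const unsigned char *hash)
{
        struct harray *na;
        struct hlist *l;
        struct hslot *s;
        struct hentry *e;
        struct cacheobj *obj = NULL;
        int depth;

        pthread_mutex_lock(&a->mtx);            /* the "global" root lock */
        for (depth = 0; depth < 16; depth++) {
                s = &a->slot[hash[depth]];
                if (s->is_array) {
                        na = s->u.array;
                        pthread_mutex_lock(&na->mtx);   /* grab child... */
                        pthread_mutex_unlock(&a->mtx);  /* ...then drop parent */
                        a = na;
                        continue;
                }
                l = s->u.list;
                if (l == NULL)
                        break;                          /* nothing cached here */
                pthread_mutex_lock(&l->mtx);            /* grab listhead... */
                pthread_mutex_unlock(&a->mtx);          /* ...then drop parent */
                for (e = l->head; e != NULL; e = e->next)
                        if (memcmp(e->hash, hash, sizeof e->hash) == 0) {
                                obj = e->obj;
                                break;
                        }
                pthread_mutex_unlock(&l->mtx);
                return (obj);
        }
        pthread_mutex_unlock(&a->mtx);
        return (obj);
}

Insert and delete take the same walk; converting a list to an array
happens with both the parent's and the listhead's locks held, as above.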

More later when I try to implement this.

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
hash algorithms... [ In reply to ]
>
> Anders: this is mostly nørd talk, don't despair :-)

I'm cool :)

> One thing struck me, and that is that ideally we would need to hash
> more than just the URL. A given document could exist in different
> languages or formats, selected by HTTP header, rather than by URL.
> For now that is going to be an item in the "notes". One way to
> handle it later on is with a VCL function doing the lookup, and the
> lookup command in VCL taking arguments of some kind.

Yes, you are right. To implement support for gzip'ed files we need more
than the URL: probably URL, Content-Encoding and User-Agent.
Potentially that means many different versions of each file, which will
probably be a hassle to work with in VCL. The reason I threw in User-Agent
is clients that don't behave according to RFC 2616. I am told that they
exist. This really makes it hard to implement support for gzip'ed files
without a performance dent, and it could also render Varnish less useful,
given how many browsers have exotic names:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ABS)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SP/6.36/1.01/HP; .NET
CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; FunWebProducts;
.NET CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Crazy Browser 2.0.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Dialer/1.10/ADSL;
.NET CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322;
.NET CLR 1.0.3705; InfoPath.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; SIMBAR Enabled;
.NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)


(I am not making this up. This is an _extract_ from a log file covering 5 minutes.)

So to find a good solution to this problem we need to dig deep (do all
the MSIE 6.0 variants behave alike? etc.).
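
Just to show what I mean, the hash input could be built from more than
the URL, something like this (names made up, and whether the right
header is really the request's Accept-Encoding rather than
Content-Encoding is exactly the kind of detail to dig into):

#include <md5.h>        /* MD5Init() etc., libmd on FreeBSD */
#include <string.h>

/* "ua_class" would be some normalized form of the User-Agent,
 * not the raw strings above. */
static void
variant_key(const char *url, const char *encoding, const char *ua_class,
    unsigned char digest[16])
{
        MD5_CTX ctx;

        MD5Init(&ctx);
        MD5Update(&ctx, (const unsigned char *)url, strlen(url));
        MD5Update(&ctx, (const unsigned char *)encoding, strlen(encoding));
        MD5Update(&ctx, (const unsigned char *)ua_class, strlen(ua_class));
        MD5Final(digest, &ctx);
}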

But I guess that's for later. Be sure to tell me if you need logs; I have
loads of them.

Just my thoughts about "hashing" :)

Anders Berg
hash algorithms... [ In reply to ]
Poul-Henning Kamp <phk at phk.freebsd.dk> writes:
> Between the phone and my email, I have been playing with hash
> algorithms today.

I'd like to point out the difference between a hash function and a
hash table... I agree that we need a hash function, but if we use
md5, it's probably simpler and faster to keep the data in a red-black
tree with the hash value as lookup key than in a hash table where we
need to further hash the md5 digest and are guaranteed to get
collisions.

DES
--
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no
hash algorithms... [ In reply to ]
In message <ujrpsjov9tn.fsf at cat.linpro.no>, Dag-Erling Smørgrav writes:
>Poul-Henning Kamp <phk at phk.freebsd.dk> writes:
>> Between the phone and my email, I have been playing with hash
>> algorithms today.
>
>I'd like to point out the difference between a hash function and a
>hash table...

Sorry for sloppy terminology here.

>hash table... I agree that we need a hash function, but if we use
>md5, it's probably simpler and faster to keep the data in a red-black
>tree with the hash value as lookup key than in a hash table where we
>need to further hash the md5 digest and are guaranteed to get
>collisions.

The number of URLs you're managing will have a lot to do with which
algorithm is best. If you have a million URLs, the search will be
quite deep and the cost of using a high-entropy hash like MD5 would
quickly pay off, whereas if you have only 100 URLs it would never
pay off compared to a plain linked list.

I know red-black trees are all the rage these days, but I'm not
convinced they are the silver bullet for us.

If we use a high-entropy hash function like MD5, there is no need
to do red-black, a simple binary tree will be better. Red-black
trees shine when your keyspace is lumpy and/or sparse, but since
MD5 hashes are dense, a strict two-branch binary tree has less than
a quarter single-branch nodes when populated with more than a few
hundred MD5 outputs (I tested this up to 10M).
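
By "simple binary tree" I mean nothing fancier than a plain unbalanced
tree keyed on the digest, along these lines (sketch only):

#include <string.h>

struct cacheobj;

struct tnode {
        unsigned char key[16];          /* MD5 of the URL */
        struct tnode *left, *right;
        struct cacheobj *obj;
};

static struct cacheobj *
tree_lookup(const struct tnode *n, const unsigned char *key)
{
        int c;

        while (n != NULL) {
                c = memcmp(key, n->key, sizeof n->key);
                if (c == 0)
                        return (n->obj);
                n = (c < 0) ? n->left : n->right;
        }
        return (NULL);
}

With MD5 digests as keys the insertion order is effectively random, so
the tree stays close to balanced without any rebalancing machinery.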

But mostly, a red-black tree does not offer us a way to distribute
locking over the tree unless we put a lock in every node, which on
the other hand would be way too expensive.

One very interesting and simple idea we may want to try out at some
point is a tree (of some kind) indexed by the URL read back to
front. The idea is that the closer we get to the tail of the URL,
the more information there will be.

For instance, if practically all your URLs look like:
http://www.myplace.xxx/article.php/id=#######
there would be very little point in bothering with a hash in the
first place: what we need is right there at the end.
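
A comparison function for such a tree would simply walk the two strings
from the end, something like:

#include <string.h>

/* Compare two URLs back to front (sketch). */
static int
revcmp(const char *a, const char *b)
{
        size_t la = strlen(a);
        size_t lb = strlen(b);

        while (la > 0 && lb > 0) {
                la--;
                lb--;
                if (a[la] != b[lb])
                        return ((unsigned char)a[la] < (unsigned char)b[lb] ? -1 : 1);
        }
        if (la == lb)
                return (0);
        return (la < lb ? -1 : 1);      /* the shorter (suffix) sorts first */
}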

I fear that the object lookup may be a significant performance
hotspot, and I can see so many creative things one could
do, so I have decided to wrap the "URL lookup" facility in a named
API (struct hash_slinger :-), so that all you have to write is
three functions (maybe four if you want to report stats) to try
out a new idea.


--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
hash algorithms... [ In reply to ]
"Poul-Henning Kamp" <phk at phk.freebsd.dk> writes:
> I know red-black trees are all the rage these days, but I'm not
> convinced they are the silver bullet for us.
>
> If we use a high-entropy hash function like MD5, there is no need
> to do red-black, a simple binary tree will be better.

True. The hash distribution should be good enough to keep the tree
balanced.

DES
--
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no