Mailing List Archive

dict vs kjBuckets vs ???
I'm currently using a dict for storing references to objects (call them
customer objects, referenced by customer ID (a number)). I'm using a
dict (instead of SQL or something similar) because we're willing to
trade lots of memory to get speed. However, I've heard that after about
10K items in a dict, it starts having problems. Is this a specious
rumor, should I switch to kjBuckets, or should I do something else?
Does the answer change if we have 100K objects? A million?

(The only operations are object insert, update, delete, and lookup.)
--
--- Aahz (@netcom.com)

Hugs and backrubs -- I break Rule 6 <*> http://www.rahul.net/aahz/
Androgynous poly kinky vanilla queer het

"I'm not tense, i'm just terribly, terribly alert" -- unknown
dict vs kjBuckets vs ???
[Aahz Maruch]
> I'm currently using a dict for storing references to objects (call them
> customer objects, referenced by customer ID (a number)). I'm using a
> dict (instead of SQL or something similar) because we're willing to
> trade lots of memory to get speed. However, I've heard that after about
> 10K items in a dict, it starts having problems.

11,523 to be exact. After that, dicts drink to excess and show up for work
late the morning after. We don't like to talk about it, though.

> Is this a specious rumor,

It wasn't before you spread it <wink>. Seriously, haven't heard a peep
about this before, and routinely use dicts with 40K objects -- performance
stays O(1) all the way. It's certainly possible -- but unlikely -- that you
assign "customer IDs" in a pattern that interacts badly with the hash
function, leading to lots of collisions.
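
If you want to check this on your own box, here's a minimal sketch (it
assumes a reasonably modern Python with the timeit module, and uses
sequential integer IDs -- substitute your real customer-ID pattern to
see whether it collides badly):

    # Time a fixed batch of lookups as the dict grows; if dicts really
    # stay O(1), the numbers should be roughly flat across sizes.
    import random
    import timeit

    for n in (10000, 100000, 1000000):
        d = dict((cust_id, None) for cust_id in range(n))
        probes = [random.randrange(n) for _ in range(10000)]
        t = timeit.timeit(lambda: [d[k] for k in probes], number=10)
        print("%8d keys: %.4f s for 100K lookups" % (n, t))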

> should I switch to kjBuckets, or should I do something else?

The latter: investigate! Create a reproducible bad case and we'll fix it,
or retract your slander <wink>.

> Does the answer change if we have 100K objects?

No.

> A million?

No.

It all depends on how much memory you have and how well the hash function is
working. Note that dicts work hard to "randomize" the addresses at which
they store keys, so memory locality can be terrible. But if there's no
pattern to customer ID accesses, locality would be terrible in a 1M array
too.

> (The only operations are object insert, update, delete, and lookup.)

Kinda figured that, else you wouldn't be using a dict <wink>.

dicts-are-great-use-more-of-them-ly y'rs - tim
dict vs kjBuckets vs ???
On Tue, 8 Jun 1999 22:27:39 -0400, "Tim Peters"
<tim_one@email.msn.com> wrote:
>> A million?

>No.

>It all depends on how much memory you have and how well the hash function is
>working. Note that dicts work hard to "randomize" the addresses at which
>they store keys, so memory locality can be terrible. But if there's no
>pattern to customer ID accesses, locality would be terrible in a 1M array
>too.

In some book on algorithms I've read that after inserting a limited
number of items, the performance of operations on hash tables
drops dramatically. I plan to write a program that would store
lots (in the range of 10M or even more) of relatively small objects
(a few hundred bytes at most), so what do you think I should use? I
thought about dictionaries, kjBuckets, or maybe even a library called
Metakit for Python
(http://www.equi4.com/metakit/info/README-Python.html).

what-do-you-think-ly y'rs



--------------------------------------------------
Reality is something that does not disappear after
you cease believing in it - VALIS, Philip K. Dick
--------------------------------------------------

Delete _removethis_ from address to email me
dict vs kjBuckets vs ???
MK wrote:
>
> In some book on algorithms I've read that after inserting a limited
> number of items, the performance of operations on hash tables
> drops dramatically.

My understanding is that Python dicts expand their
hash tables before they become dangerously full,
so this behaviour doesn't occur.

Greg
dict vs kjBuckets vs ???
[MK]
> In some book on algorithms I've read that after inserting a limited
> number of items, the performance of operations on hash tables
> drops dramatically.

Depends on the details. What you read is true of some kinds of hash tables.
Python's dicts dynamically expand to keep the "load factor" under 2/3, so
what you read isn't applicable to Python in normal use.
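
If you want to watch that happening, here's a minimal sketch (it assumes
a modern CPython where sys.getsizeof exists; the exact resize points are
implementation details and vary between versions):

    # A dict's memory footprint jumps each time it resizes its table,
    # and the jumps arrive well before the table is anywhere near full.
    import sys

    d = {}
    last = sys.getsizeof(d)
    for i in range(100000):
        d[i] = None
        size = sys.getsizeof(d)
        if size != last:
            print("resized at %6d items: %8d -> %8d bytes" % (i + 1, last, size))
            last = size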

> I plan to write a program that would store lots (in the range of 10M or even
> more) of relatively small objects (a few hundred bytes at most), so what
> do you think I should use?

Let's do a little math <wink>: 10M * 100 = ?, a lower bound on what you're
contemplating. Do you have gigabytes of RAM?
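
Filling in the "?" -- a back-of-the-envelope sketch, where the 100-byte
payload is your own lower bound and the per-entry overhead is a rough
guess, not a measurement:

    # Rough memory estimate for keeping 10M smallish objects in a dict.
    n = 10 * 1000 * 1000        # 10M objects
    payload = 100               # bytes each, the low end of "a few hundred"
    overhead = 50               # guessed bytes per dict entry + object header
    print("%.2f GiB payload alone" % (n * payload / 2.0 ** 30))
    print("%.2f GiB with rough overhead" % (n * (payload + overhead) / 2.0 ** 30))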

> I thought about dictionaries, kjBuckets, or maybe even a library called
> Metakit for Python (http://www.equi4.com/metakit/info/README-Python.html).
>
> what-do-you-think-ly y'rs

You don't really want to know <wink>. Memory-based data structures aren't
going to work for the size of thing you have in mind. If you can make it
fly at all, you'll likely require a powerful database, so of those choices
Metakit is the only approach that's not dead on arrival.

better-still-write-it-in-perl<wink>-ly y'rs - tim
dict vs kjBuckets vs ???
On Thu, 10 Jun 1999 22:23:15 -0400, "Tim Peters"
<tim_one@email.msn.com> wrote:
>[MK]
>> In some book on algorithms I've read that after inserting a limited
>> number of items, the performance of operations on hash tables
>> drops dramatically.

>Depends on the details. What you read is true of some kinds of hash tables.
>Python's dicts dynamically expand to keep the "load factor" under 2/3, so
>what you read isn't applicable to Python in normal use.

Great.

>> I plan to write a program that would store lots (in the range of 10M or even
>> more) of relatively small objects (a few hundred bytes at most), so what
>> do you think I should use?

>Let's do a little math <wink>: 10M * 100 = ?, a lower bound on what you're
>contemplating. Do you have gigabytes of RAM?

I'm opening a boutique.

>> I thought about dictionaries, kjBuckets, or maybe even a library called
>> Metakit for Python (http://www.equi4.com/metakit/info/README-Python.html).

>> what-do-you-think-ly y'rs

>You don't really want to know <wink>. Memory-based data structures aren't
>going to work for the size of thing you have in mind. If you can make it
>fly at all, you'll likely require a powerful database, so of those choices
>Metakit is the only approach that's not dead on arrival.

A few additional pieces of information: the items stored would be
natural language text fragments (several sentences at most, typically
several words) plus binary descriptions, and the primary operation
would be lots of searching. Is there anything else that would be
better for this kind of program? An object database?

>better-still-write-it-in-perl<wink>-ly y'rs - tim

it's-a-stiff-ly yours




--------------------------------------------------
Reality is something that does not disappear after
you cease believing in it - VALIS, Philip K. Dick
--------------------------------------------------

Delete _removethis_ from address to email me
dict vs kjBuckets vs ???
Mark R. wrote:
> >> I plan to write a program that would store lots (in the range of 10M or even
> >> more) of relatively small objects (a few hundred bytes at most), so what
> >> do you think I should use?
[Tim]
> >Let's do a little math <wink>: 10M * 100 = ?, a lower bound on what you're
> >contemplating. Do you have gigabytes of RAM?
[Brian, er, Mark]
> I'm opening a boutique.
[Tim]
> >...Memory-based data structures aren't
> >going to work for the size of thing you have in mind. If you can make it
> >fly at all, you'll likely require a powerful database, so of those choices
> >Metakit is the only approach that's not dead on arrival.
[Mark]
> A few additional pieces of information: the items stored would be
> natural language text fragments (several sentences at most, typically
> several words) plus binary descriptions, and the primary operation
> would be lots of searching. Is there anything else that would be
> better for this kind of program? An object database?

Searching 10M items is going to strain just about anything. I get
wonderful response times out of MetaKit, but I've never tried
anything approaching that size. The state of the art for searching
large amounts of data is the big boys of SQL databases. But even
there you'll have an index of 6 or 10 levels, and unless you have
huge amounts of RAM, only a few of them will be in memory. So the
disk will grind and grind for each search. I'd try MetaKit first,
since it's a whole lot simpler and lighter weight.
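
Where the "6 or 10 levels" figure comes from, roughly -- a sketch with
made-up branching factors, not numbers from any particular database:

    # An index over N items with branching factor b is roughly
    # log-base-b of N levels deep.
    import math

    N = 10 * 1000 * 1000
    for b in (5, 10, 50, 200):
        print("branching factor %3d: about %d levels" % (b, math.ceil(math.log(N, b))))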

The best solution would be to exploit any interrelationships in your
items. If you could reduce it to 100K items, you could burn a whole
lot of CPU and still come out ahead.

and-good-luck-with-the-boutique-Brian-ly y'rs

- Gordon
dict vs kjBuckets vs ???
In article <000801beb3b1$5c5af240$329e2299@tim>,
Tim Peters <tim_one@email.msn.com> wrote:
>[MK]
>> I plan to write a program that would store lots (in the range of 10M or even
>> more) of relatively small objects (a few hundred bytes at most), so what
>> do you think I should use?
>
>Let's do a little math <wink>: 10M * 100 = ?, a lower bound on what you're
>contemplating. Do you have gigabytes of RAM?

That might actually be reasonable, if you do a little more math. RAM
is going for about a dollar a meg right now, so 2 gigs of RAM will
cost around $2000. This is price-competitive with tying up a
programmer for a week or two.

I'm always surprised by how effective brute force and ignorance can
be. :)


Neel
dict vs kjBuckets vs ???
[Tim]
> Let's do a little math <wink>: 10M * 100 = ?, a lower bound on
> what you're contemplating. Do you have gigabytes of RAM?

[Neel Krishnaswami]
> That might actually be reasonable, if you do a little more math. RAM
> is going for about a dollar a meg right now, so 2 gigs of RAM will
> cost around $2000. This is price-competitive with tying up a
> programmer for a week or two.

Alas, you still need to throw in somebody else to find and install a
motherboard capable of using that much RAM, and an OS capable of exposing
it -- we're bashing into the limits of 32-bit addressing here. I had the
feeling the original questioner didn't have a 64-bit supercomputer just
sitting around waiting to be applied <wink>.
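
A quick way to see which side of that wall a given box is on (a sketch
assuming a Python whose struct module understands the native "P" code,
i.e. the size of a C pointer on that build):

    # Pointer width sets the theoretical ceiling on address space; the
    # OS typically reserves a chunk of that for itself on top.
    import struct

    bits = struct.calcsize("P") * 8
    print("%d-bit build: at most %.0f GiB of address space" % (bits, 2.0 ** bits / 2 ** 30))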

> I'm always surprised by how effective brute force and ignorance can
> be. :)

Oh yes, the above is temporary. When the barbarians come knocking on your
door with an appetite for rape and pillage, it's rarely effective to point
out that this simply isn't the way it's done in your country <wink>.

the-shortest-distance-between-two-points-is-a-hammer-with-a-head-
that-wide-ly y'rs - tim