Mailing List Archive

shrinking dicts
Currently, dictionaries always grow until they are deallocated from
memory. This happens in PyDict_SetItem according to the following
code (before inserting the new item into the dict):

/* if fill >= 2/3 size, double in size */
if (mp->ma_fill*3 >= mp->ma_size*2) {
    if (dictresize(mp, mp->ma_used*2) != 0) {
        if (mp->ma_fill+1 > mp->ma_size)
            return -1;
    }
}

The symmetric case is missing, and this has intrigued me for a long time,
but I've never had the courage to look deeply into this portion of code
and try to propose a solution -- which would be: reduce the size of the dict
by half when the number of used items is <= 1/6 of the size.
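
A rough sketch of the two conditions in plain Python (the existing 2/3
trigger, plus the proposed 1/6 one; MINSIZE is 4, and the MINSIZE*4 guard
matches the patch appended below):

def should_grow(fill, size):
    # the existing rule, checked in PyDict_SetItem before an insert
    return fill * 3 >= size * 2

def should_shrink(used, size, MINSIZE=4):
    # the proposed rule, to be checked in PyDict_DelItem after a delete
    return size > MINSIZE * 4 and used * 6 <= size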

This situation occurs far less frequently than dict growing, but anyway,
it seems useful for the degenerate cases where a dict has a peak usage,
then most of the items are deleted. This is usually the case for global
dicts holding dynamic object collections, etc.

A bonus effect of shrinking big dicts with deleted items is that
the lookup speed may be improved, because of the cleaned <dummy> entries
and the reduced overall size (resulting in a better hit ratio).

The (only) solution I could come up with for this problem is the appended
patch. It is not immediately obvious, but in practice, it seems to work fine.
(Inserting a print statement after the condition, showing the dict size
and current usage, helps in monitoring what's going on.)

Any other ideas on how to deal with this? Thoughts, comments?

--
Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

-------------------------------[ cut here ]---------------------------
*** dictobject.c-1.5.2 Fri Aug 6 18:51:02 1999
--- dictobject.c Tue Aug 10 12:21:15 1999
***************
*** 417,423 ****
      ep->me_value = NULL;
      mp->ma_used--;
      Py_DECREF(old_value);
!     Py_DECREF(old_key);
      return 0;
  }

--- 417,430 ----
      ep->me_value = NULL;
      mp->ma_used--;
      Py_DECREF(old_value);
!     Py_DECREF(old_key);
!     /* For bigger dictionaries, if used <= 1/6 size, halve the size */
!     if (mp->ma_size > MINSIZE*4 && mp->ma_used*6 <= mp->ma_size) {
!         if (dictresize(mp, mp->ma_used*2) != 0) {
!             if (mp->ma_fill > mp->ma_size)
!                 return -1;
!         }
!     }
      return 0;
  }
Re: shrinking dicts
I wrote:
>
> The (only) solution I could come up with for this problem is the appended
> patch. It is not immediately obvious, but in practice, it seems to work fine.
> (Inserting a print statement after the condition, showing the dict size
> and current usage, helps in monitoring what's going on.)
>
> Any other ideas on how to deal with this? Thoughts, comments?
>

To clarify a bit what the patch does "as is", here's a short description:

The code is triggered in PyDict_DelItem only for sizes which are > MINSIZE*4,
i.e. greater than 4*4 = 16. Therefore, shrinking will first occur at a
dict size of 32.

one third:   32 / 3    = 10
two thirds:  32 * 2/3  = 21
one sixth:   32 / 6    = 5

So the shrinking will happen for a dict size of 32, of which 5 items are used
(the sixth was just deleted). After the dictresize, the size will be 16, of
which 5 items are used, i.e. roughly one third.
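
As a quick self-check of these numbers (using C-style truncating integer
division, which is also what / does on Python ints here):

size = 32
assert size / 3 == 10        # one third
assert size * 2 / 3 == 21    # two thirds
assert size / 6 == 5         # one sixth
assert 5 * 6 <= size         # so 5 used items satisfy the 1/6 condition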

The threshold is fixed by the first condition of the patch. It could be
made 64, instead of 32. This is subject to discussion...

Obviously, this is most useful for bigger dicts, not for small ones.
A threshold of 32 items seemed to me to be a reasonable compromise.

--
Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
Re: shrinking dicts
Vladimir Marangozov wrote:
>
> Currently, dictionaries always grow until they are deallocated from
> memory. This happens in PyDict_SetItem according to the following
> code (before inserting the new item into the dict):
>
> /* if fill >= 2/3 size, double in size */
> if (mp->ma_fill*3 >= mp->ma_size*2) {
>     if (dictresize(mp, mp->ma_used*2) != 0) {
>         if (mp->ma_fill+1 > mp->ma_size)
>             return -1;
>     }
> }
>
> The symmetric case is missing, and this has intrigued me for a long time,
> but I've never had the courage to look deeply into this portion of code
> and try to propose a solution -- which would be: reduce the size of the dict
> by half when the number of used items is <= 1/6 of the size.
>
> This situation occurs far less frequently than dict growing, but anyway,
> it seems useful for the degenerate cases where a dict has a peak usage,
> then most of the items are deleted. This is usually the case for global
> dicts holding dynamic object collections, etc.
>
> A bonus effect of shrinking big dicts with deleted items is that
> the lookup speed may be improved, because of the cleaned <dummy> entries
> and the reduced overall size (resulting in a better hit ratio).
>
> The (only) solution I could come up with for this problem is the appended
> patch. It is not immediately obvious, but in practice, it seems to work fine.
> (Inserting a print statement after the condition, showing the dict size
> and current usage, helps in monitoring what's going on.)
>
> Any other ideas on how to deal with this? Thoughts, comments?

I think that integrating this into the C code is not really that
effective, since the situation will not occur that often, and it is
often better to let the programmer decide rather than integrate
an automatic downsize.

You can call dict.update({}) to force an internal
resize (the empty dictionary can be made global since it is not
manipulated in any way and thus does not cause creation overhead).
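
As a sketch, the trick could be packaged once and reused (assuming the
update() path does give the dict a chance to resize itself, as described
above):

_empty = {}                  # module-level; created once, never written to

def compact(d):
    d.update(_empty)         # inserts nothing, but may shrink the table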

Perhaps a new method .resize(approx_size) would make this even
clearer. This would also have the benefit of allowing a programmer
to force allocation of the wanted size, e.g.

d = {}
d.resize(10000)
# Insert 10000 items in a batch insert

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 143 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: shrinking dicts
M.-A. Lemburg wrote:
>
> [me]
> > Any other ideas on how to deal with this? Thoughts, comments?
>
> I think that integrating this into the C code is not really that
> effective, since the situation will not occur that often, and it is
> often better to let the programmer decide rather than integrate
> an automatic downsize.

Agreed that the situation is rare. But if it occurs, it's Python's
responsibility to manage its data structures (and system resources)
efficiently. As a programmer, I really don't want to be bothered with
internals -- I trust the interpreter for that. Moreover, how could
I decide that at some point, some dict needs to be resized in my
fairly big app, say IDLE?

>
> You can call dict.update({}) to force an internal
> resize (the empty dictionary can be made global since it is not
> manipulated in any way and thus does not cause creation overhead).

I know that I can force the resize in other ways, but this is not
the point. I'm usually against the idea of changing the programming
logic because of my advanced knowledge of the internals.

>
> Perhaps a new method .resize(approx_size) would make this even
> clearer. This would also have the benefit of allowing a programmer
> to force allocation of the wanted size, e.g.
>
> d = {}
> d.resize(10000)
> # Insert 10000 items in a batch insert

This is interesting, but the two ideas are not mutually exclusive.
Python has to downsize dicts automatically (just the same way it doubles
the size automatically). Offering more through an API is a plus for
hackers. ;-)

--
Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
Re: shrinking dicts
Vladimir Marangozov wrote:
>
> M.-A. Lemburg wrote:
> >
> > [me]
> > > Any other ideas on how to deal with this? Thoughts, comments?
> >
> > I think that integrating this into the C code is not really that
> > effective, since the situation will not occur that often, and it is
> > often better to let the programmer decide rather than integrate
> > an automatic downsize.
>
> Agreed that the situation is rare. But if it occurs, it's Python's
> responsibility to manage its data structures (and system resources)
> efficiently. As a programmer, I really don't want to be bothered with
> internals -- I trust the interpreter for that. Moreover, how could
> I decide that at some point, some dict needs to be resized in my
> fairly big app, say IDLE?

You usually don't ;-) because "normal" dicts only grow (well, more or
less). The downsizing thing will only become a problem if you use
dictionaries in certain algorithms and there you handle the problem
manually.

My stack implementation uses the same trick, BTW. Memory is cheap
and with an extra resize method (which the mxStack implementation
has), problems can be dealt with explicitly for everyone to see
in the code.

> > You can call dict.update({}) to force an internal
> > resize (the empty dictionary can be made global since it is not
> > manipulated in any way and thus does not cause creation overhead).
>
> I know that I can force the resize in other ways, but this is not
> the point. I'm usually against the idea of changing the programming
> logic because of my advanced knowledge of the internals.

True, that's why I mentioned...

> >
> > Perhaps a new method .resize(approx_size) would make this even
> > clearer. This would also have the benefit of allowing a programmer
> > to force allocation of the wanted size, e.g.
> >
> > d = {}
> > d.resize(10000)
> > # Insert 10000 items in a batch insert
>
> This is interesting, but the two ideas are not mutually exclusive.
> Python has to downsize dicts automatically (just the same way it doubles
> the size automatically). Offering more through an API is a plus for
> hackers. ;-)

It's not really for hackers: the point is that it makes the technique
visible and understandable (as opposed to the hack above). The same
could be useful for lists too (the hack there is l = [None] * size,
which I find rather difficult to understand at first sight...).

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 143 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
RE: shrinking dicts
Looking over the messages from Marc and Vladimir, I'm going to add my 2c
worth.

IMO, Marc's position is untenable iff it can be demonstrated that the
"average" program is likely to see "sparse" dictionaries, and such
dictionaries have an adverse effect on either speed or memory.

The analogy is quite simple - you don't need to manually resize lists or
dicts before inserting (to allocate more storage - an internal
implementation issue), so neither should you need to manually resize when
deleting (to reclaim that storage - still an internal implementation
issue). Suggesting that the allocation of resources should be automatic,
but the recycling of them not be automatic, flies in the face of everything
else - e.g., you don't need to delete each object - when it is no longer
referenced, its memory is reclaimed automatically.

Marc's position is only reasonable if the specific case we are talking
about is very very rare, and unlikely to be hit by anyone with normal,
real-world requirements or programs. In this case, exposing the
implementation detail is reasonable.

So, the question comes down to: "What is the benefit of Vladimir's patch?"

Maybe we need some metrics on some dictionaries. For example, maybe a
doctored Python that kept stats for each dictionary and logged this info. The
output of this should be able to tell you what savings you could possibly
expect. If you find that the average program really would not benefit at
all (say only a few K from a small number of dicts) then the horse was
probably dead well before we started flogging it. If however you can
demonstrate serious benefits could be achieved, then interest may pick up
and I too would lobby for automatic downsizing.
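
One way to approximate such metrics without a doctored interpreter might be
a wrapper like the following sketch (it only sees the dicts you route
through it, not the interpreter's own internal dicts):

import UserDict

class StatDict(UserDict.UserDict):
    """A dict that remembers its peak number of live items."""
    def __init__(self):
        UserDict.UserDict.__init__(self)
        self.peak = 0
    def __setitem__(self, key, value):
        self.data[key] = value
        if len(self.data) > self.peak:
            self.peak = len(self.data)
    def stats(self):
        # a low current/peak ratio marks the sparse dicts that
        # automatic downsizing would help
        return len(self.data), self.peak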

Mark.
RE: shrinking dicts
[Vladimir]
> Currently, dictionaries always grow until they are deallocated from
> memory.

It's more accurate to say they never shrink <0.9 wink>. Even that has
exceptions, though, starting with:

> This happens in PyDict_SetItem according to the following
> code (before inserting the new item into the dict):
>
> /* if fill >= 2/3 size, double in size */
> if (mp->ma_fill*3 >= mp->ma_size*2) {
>     if (dictresize(mp, mp->ma_used*2) != 0) {
>         if (mp->ma_fill+1 > mp->ma_size)
>             return -1;
>     }
> }

This code can shrink the dict too. The load factor computation is based on
"fill", but the resize is based on "used". If you grow a huge dict, then
delete all the entries one by one, "used" falls to 0 but "fill" stays at its
high-water mark. At least 1/3rd of the entries are NULL, so "fill"
continues to climb as keys are added again: when the load factor
computation triggers again, "used" may be as small as 1, and dictresize can
shrink the dict dramatically.
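
Here's a toy model of just those counters (a sketch, not real dictobject.c:
probing is ignored, and the real size table is reduced to plain doubling):

MINSIZE = 4

def roundup(minused):
    # stand-in for dictresize's internal size table
    size = MINSIZE
    while size <= minused:
        size = size * 2
    return size

def set_item(size, fill, used, hits_dummy=0):
    # the 2/3 check is driven by fill; the new size by used
    if fill * 3 >= size * 2:
        size = roundup(used * 2)
        fill = used             # a resize rebuilds the table; dummies vanish
        hits_dummy = 0
    if not hits_dummy:
        fill = fill + 1         # a NULL slot becomes live
    return size, fill, used + 1

def del_item(size, fill, used):
    # a delete leaves a dummy behind: used drops, fill does not
    return size, fill, used - 1

# grow to a high-water mark, empty the dict, then insert once
size, fill, used = 1024, 683, 683
for i in range(683):
    size, fill, used = del_item(size, fill, used)
size, fill, used = set_item(size, fill, used)   # the 2/3 check fires
assert size == MINSIZE and used == 1            # shrunk dramatically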

The only clear a priori return I see in your patch is that I might save
memory if I delete gobs of stuff from a dict and then neither get rid of it
nor add keys to it again. But my programs generally grow dicts forever,
grow then delete them entirely, or cycle through fat and lean times (in
which case the code above already shrinks them from time to time). So I
don't expect that your patch would buy me anything I want, but it would cost
me more on every delete.

> ...
> Any other ideas on how to deal with this? Thoughts, comments?

Just that slowing the expected case to prevent theoretical bad cases is
usually a net loss -- I think the onus is on you to demonstrate that this
change is an exception to that rule. I do recall one real-life complaint
about it on c.l.py a couple years ago: the poster had a huge dict,
eventually deleted most of the items, and then kept it around purely for
lookups. They were happy enough to copy the dict into a fresh one a
key+value pair at a time; today they could just do

d = d.copy()

or even

d.update({})

to shrink the dict.
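
A crude way to watch the effect (another untested sketch; failing lookups
have to probe past dummies, so they slow down as dummies pile up and speed
up again once the dict is rebuilt):

import time
from random import randrange

d = {}
for i in xrange(100000):
    d[randrange(10000000)] = 1
for k in d.keys()[10:]:
    del d[k]                  # keep ~10 live items among heaps of dummies

def misses(d, n=100000):
    t = time.clock()
    for i in xrange(n):
        d.has_key(-1 - i)     # negative keys were never inserted
    return time.clock() - t

before = misses(d)
d = d.copy()                  # rebuilds the table without dummies
after = misses(d)
print round(before, 3), round(after, 3)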

It would certainly be good to document these tricks!

if-it-wasn't-a-problem-the-first-8-years-of-python's-life-it's-hard-to-
see-why-1999-is-special-ly y'rs - tim
Re: shrinking dicts
Tim Peters wrote:
>
> [Vladimir]
> > Currently, dictionaries always grow until they are deallocated from
> > memory.
>
> It's more accurate to say they never shrink <0.9 wink>. Even that has
> exceptions, though, starting with:
>
> > This happens in PyDict_SetItem according to the following
> > code (before inserting the new item into the dict):
> >
> > /* if fill >= 2/3 size, double in size */
> > if (mp->ma_fill*3 >= mp->ma_size*2) {
> >     if (dictresize(mp, mp->ma_used*2) != 0) {
> >         if (mp->ma_fill+1 > mp->ma_size)
> >             return -1;
> >     }
> > }
>
> This code can shrink the dict too. The load factor computation is based on
> "fill", but the resize is based on "used". If you grow a huge dict, then
> delete all the entries one by one, "used" falls to 0 but "fill" stays at its
> high-water mark. At least 1/3rd of the entries are NULL, so "fill"
> continues to climb as keys are added again: when the load factor
> computation triggers again, "used" may be as small as 1, and dictresize can
> shrink the dict dramatically.

Thanks for clarifying this!

> [snip]
>
> > ...
> > Any other ideas on how to deal with this? Thoughts, comments?
>
> Just that slowing the expected case to prevent theoretical bad cases is
> usually a net loss -- I think the onus is on you to demonstrate that this
> change is an exception to that rule.

I won't, because this case is rare in practice, which already classifies it
as an exception. A real exception. I'll have to think a bit more about
all this. Adding 1/3 new entries to trigger the next resize sounds
suboptimal (if it happens at all).

> I do recall one real-life complaint
> about it on c.l.py a couple years ago: the poster had a huge dict,
> eventually deleted most of the items, and then kept it around purely for
> lookups. They were happy enough to copy the dict into a fresh one a
> key+value pair at a time; today they could just do
>
> d = d.copy()
>
> or even
>
> d.update({})
>
> to shrink the dict.
>
> It would certainly be good to document these tricks!

I think that making these tricks official in the documentation is a bad idea.

>
> if-it-wasn't-a-problem-the-first-8-years-of-python's-life-it's-hard-to-
> see-why-1999-is-special-ly y'rs - tim
>

This is a good (your favorite ;-) argument, but don't forget that you've
been around, teaching people various tricks.

And 1999 is special -- we just had a solar eclipse today, the next being
scheduled for 2081.

--
Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
RE: shrinking dicts
[Tim]
>> ...slowing the expected case to prevent theoretical bad cases is
>> usually a net loss -- I think the onus is on you to demonstrate
>> that this change is an exception to that rule.

[Vladimir Marangozov]
> I won't, because this case is rare in practice, which already classifies it
> as an exception. A real exception. I'll have to think a bit more about
> all this. Adding 1/3 new entries to trigger the next resize sounds
> suboptimal (if it happens at all).

"Suboptimal" with respect to which specific cost model? Exhibiting a
specific bad case isn't compelling, and especially not when it's considered
to be "a real exception". Adding new expense to every delete is an obvious
new burden -- where's the payback, and is the expected net effect amortized
across all dict usage a win or loss? Offhand it sounds like a small loss to
me, although I haven't worked up a formal cost model either <wink>.

> ...
> I think that making these tricks official in the documentation is a
> bad idea.

It's rarely a good idea to keep truths secret, although
implementation-du-jour tricks don't belong in the current doc set. Probably
in a HowTo.

>> if-it-wasn't-a-problem-the-first-8-years-of-python's-life-it's-hard-
>> to-see-why-1999-is-special-ly y'rs - tim

> This is a good (your favorite ;-) argument,

I actually hate that kind of argument -- it's one of *Guido's* favorites,
and in his current silent state I'm simply channeling him <wink>.

> but don't forget that you've been around, teaching people various
> tricks.

As I said, this particular trick has come up only once in real life in my
experience; it's never come up in my own code; it's an anti-FAQ. People are
100x more likely to whine about theoretical quadratic-time list growth
nobody has ever encountered (although it looks like they may finally get it
under an out-of-the-box BDW collector!).

> And 1999 is special -- we just had a solar eclipse today, the next being
> scheduled for 2081.

Ya, like any of us will survive Y2K to see it <wink>.

1999-is-special-cuz-it's-the-end-of-civilization-ly y'rs - tim
Re: shrinking dicts
Tim Peters wrote:
>
> [Tim]
> >> ...slowing the expected case to prevent theoretical bad cases is
> >> usually a net loss -- I think the onus is on you to demonstrate
> >> that this change is an exception to that rule.
>
> [Vladimir Marangozov]
> > I won't, because this case is rare in practice, which already classifies it
> > as an exception. A real exception. I'll have to think a bit more about
> > all this. Adding 1/3 new entries to trigger the next resize sounds
> > suboptimal (if it happens at all).
>
> "Suboptimal" with respect to which specific cost model? Exhibiting a
> specific bad case isn't compelling, and especially not when it's considered
> to be "a real exception". Adding new expense to every delete is an obvious
> new burden -- where's the payback, and is the expected net effect amortized
> across all dict usage a win or loss? Offhand it sounds like a small loss to
> me, although I haven't worked up a formal cost model either <wink>.

C'mon Tim, don't try to impress me with cost models. I'm already impressed :-)
Anyway, I've looked at some traces. As expected, the conclusion is that
this case is extremely rare wrt the average dict usage, for three reasons:
(1) dicts are usually deleted entirely, (2) del d[key] is rare in practice,
and (3) d[key] = None is often used instead of (2).

There is, however, a small percentage of dicts which are used below 1/3 of
their size. I should say, below 1/3 of their peak size, because downsizing
is also rare. To trigger a downsize, 1/3 new entries of the peak size must
be inserted.

Besides these observations, after looking at the code one more time, I can't
really understand why the resize logic is based on the "fill" watermark
and not on "used". fill = used + dummy, but since lookdict returns the
first free slot (null or dummy), I don't really see the point of
using a fill watermark... Perhaps you can enlighten me on this. Using only
the "used" metric seems fine to me. I even deactivated "fill" and replaced
it with "used" to see what happens -- no visible changes, except a tiny
speedup I'm willing to neglect.

--
Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
RE: shrinking dicts
[Vladimir Marangozov, *almost* seems ready to give up on a counter-
productive dict pessimization <wink>]

> ...
> There is, however, a small percentage of dicts which are used
> below 1/3 of their size. I should say, below 1/3 of their peak size,
> because downsizing is also rare. To trigger a downsize, 1/3 new
> entries of the peak size must be inserted.

Not so, although "on average" 1/6 may be correct. Look at an extreme: Say
a dict has size 333 (it can't, but it makes the math obvious ...). Say it
contains 221 items. Now someone deletes them all, one at a time. used==0
and fill==221 at this point. They insert one new key that happens to hit
one of the 333-221 = 112 remaining NULL keys. Then used==1 and fill==222.
They insert a 2nd key, and before the dict is searched the new fill of 222
triggers the 2/3rds load-factor resizing -- which asks for a new size of 1*2
== 2.

For the minority of dicts that go up and down in size wildly many times, the
current behavior is fine.

> Besides these observations, after looking at the code one more
> time, I can't really understand why the resize logic is based on
> the "fill" watermark and not on "used". fill = used + dummy, but
> since lookdict returns the first free slot (null or dummy), I don't
> really see the point of using a fill watermark...

Let's just consider an unsuccessful search. Then it does return "the first"
free slot, but not necessarily at the time it *sees* the first free slot.
So long as it sees a dummy, it has to keep searching; the search doesn't end
until it finds a NULL. So consider this, assuming the resize triggered only
on "used":

import random, sys

d = {}
for i in xrange(50000):
    d[random.randrange(1000000)] = 1
for k in d.keys():
    del d[k]
# now there are 50000 dummy dict keys, and some number of NULLs

# loop invariant: used == 0
for i in xrange(sys.maxint):
    j = random.randrange(10000000)
    d[j] = 1
    del d[j]
    assert not d.has_key(i)

However many NULL slots remained, the last loop eventually transforms them
*all* into dummies. The dummies act exactly like "real keys" with respect
to expected time for an unsuccessful search, which is why it's thoroughly
appropriate to include dummies in the load factor computation. The loop
will run slower and slower as the percentage of dummies approaches 100%, and
each failing has_key approaches O(N) time.

In most hash table implementations that's the worst that can happen (and
it's a disaster), but under Python's implementation it's worse: Python
never checks whether the probe sequence "wraps around", so the first
search after the last NULL has been changed to a dummy never ends.

Counting the dummies in the load-factor computation prevents all that: no
matter how much inserts and deletes are intermixed, the "effective load
factor" stays under 2/3rds so gives excellent expected-case behavior; and it
also protects against an all-dummy dict, making the lack of an expensive
inner-loop "wrapped around?" check safe.

> Perhaps you can enlighten me on this. Using only the "used" metric
> seems fine to me. I even deactivated "fill" and replaced it with "used"
> to see what happens -- no visible changes, except a tiny speedup I'm
> willing to neglect.

You need a mix of deletes and inserts for the dummies to make a difference;
dicts that always grow don't have dummies, so they're not likely to have any
dummy-related problems either <wink>. Try this (untested):

import time
from random import randrange

N = 1000
thatmany = [None] * N
d = {}                      # the dict under test

while 1:
    start = time.clock()
    for i in thatmany:
        d[randrange(10000000)] = 1
    for i in d.keys():
        del d[i]
    finish = time.clock()
    print round(finish - start, 3)

Succeeding iterations of the outer loop should grow dramatically slower, and
finally get into an infinite loop, despite that "used" never exceeds N.

Short course rewording: for purposes of predicting expected search time, a
dummy is the same as a live key, because finding a dummy doesn't end a
search -- it has to press on until either finding the key it was looking
for, or finding a NULL. And with a mix of insertions and deletions, and if
the hash function is doing a good job, then over time all the slots in the
table will become either live or dummy, even if "used" stays within a very
small range.

So, that's why <wink>.

dictobject-may-be-the-subtlest-object-there-is-ly y'rs - tim