Mailing List Archive

Why does shelve make such large files?
Hi,

Is it really necessary for shelve to make such large files?
Have a look at this:
/tmp> python
Python 1.5.2 (#1, Apr 18 1999, 00:16:12) [GCC 2.7.2.3] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import shelve
>>> d = shelve.open('database')
>>> d['key'] = 'value'
>>>
/tmp> ls -l database
-rw-rw-r-- 1 gerrit gerrit 16384 Jul 1 21:13 database
^^^^^

16 KB for only one key!!?
pickle seems to make _much_ smaller files!
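
For illustration, a minimal comparison along these lines (the file names
are made up, and depending on which dbm library Python uses, the shelve
file may get an extension such as .db):

import os
import pickle
import shelve

# One key/value pair via pickle: a flat byte stream.
with open('data.pickle', 'wb') as f:
    pickle.dump({'key': 'value'}, f)

# The same pair via shelve: a DBM file with pre-allocated pages.
d = shelve.open('database')
d['key'] = 'value'
d.close()

print(os.path.getsize('data.pickle'))  # tens of bytes
print(os.path.getsize('database'))     # thousands of bytes (adjust the name
                                       # if your dbm backend adds an extension)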

Why is this?

regards,
Gerrit.

--
The Dutch Linuxgames homepage: http://linuxgames.nl.linux.org
Personal homepage: http://www.nl.linux.org/~gerrit/

Discoverb is a python program (in several languages) which tests the words you
learned by asking it. Homepage: http://www.nl.linux.org/~gerrit/discoverb/
Oh my god! They killed init! You bastards!
Why does shelve make such large files? [ In reply to ]
Gerrit Holl wrote:

> Is it really necessary for shelve to make such large files?
> Have a look at this:
> /tmp> python
> Python 1.5.2 (#1, Apr 18 1999, 00:16:12) [GCC 2.7.2.3] on linux2
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> >>> import shelve
> >>> d = shelve.open('database')
> >>> d['key'] = 'value'
> >>>
> /tmp> ls -l database
> -rw-rw-r-- 1 gerrit gerrit 16384 Jul 1 21:13 database
> ^^^^^
>
> 16 KB for only one key!!?
> pickle seems to make _much_ smaller files!
>
> Why is this?

The shelve module uses DBM, which is like a small database that lets
you store objects and look them up by key. DBM can hold millions of
objects and let you retrieve any of them later without loading the
whole file into memory first.

The pickle module, on the other hand, serializes objects with the
intent of deserializing them _all_ from the file later. Pickle offers
no way to search for data by key; you have to do that yourself after
the objects have been recreated from the file. Shelve handles this
differently: every key access or insertion on a shelve object is
actually a read from or write to the DBM file.

And to answer your question: DBM creates such big files because of
the way it manages the database. The data in the database file can
contain gaps as a result of repeated insertions and deletions. A
pickle file, by contrast, is just a flat representation of the objects
that were written, and there is no way to update it other than
rewriting it entirely.
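
As a sketch of the difference in access patterns (file names are
hypothetical; a shelve stores picklable values under string keys):

import pickle
import shelve

# pickle: everything comes back from the file in one sweep...
with open('data.pickle', 'rb') as f:
    everything = pickle.load(f)
value = everything['key']    # ...and any searching is up to you

# shelve: each access is a single keyed read or write on the DBM file,
# so nothing else has to be loaded into memory.
d = shelve.open('database')
value = d['key']
d.close()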

--
Ovidiu Predescu <ovidiu@cup.hp.com>
http://www.geocities.com/SiliconValley/Monitor/7464/
Why does shelve make such large files? [ In reply to ]
On Thu, Jul 01, 1999 at 10:47:57PM +0000, Ovidiu Predescu wrote:
> Gerrit Holl wrote:
>
> > Is it really necessary for shelve to make such large files?
> > Have a look at this:
> > /tmp> python
> > Python 1.5.2 (#1, Apr 18 1999, 00:16:12) [GCC 2.7.2.3] on linux2
> > Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> > >>> import shelve
> > >>> d = shelve.open('database')
> > >>> d['key'] = 'value'
> > >>>
> > /tmp> ls -l database
> > -rw-rw-r-- 1 gerrit gerrit 16384 Jul 1 21:13 database
> > ^^^^^
> >
> > 16 KB for only one key!!?
> > pickle seems to make _much_ smaller files!
> >
> > Why is this?
>
> The shelve module uses DBM, which is like a small database that lets
> you store objects and look them up by key. DBM can hold millions of
> objects and let you retrieve any of them later without loading the
> whole file into memory first.
>

Interesting...

> The pickle module, on the other hand, serializes objects with the
> intent of deserializing them _all_ from the file later. Pickle offers
> no way to search for data by key; you have to do that yourself after
> the objects have been recreated from the file. Shelve handles this
> differently: every key access or insertion on a shelve object is
> actually a read from or write to the DBM file.
>
> And to answer your question: DBM creates such big files because of
> the way it manages the database. The data in the database file can
> contain gaps as a result of repeated insertions and deletions. A
> pickle file, by contrast, is just a flat representation of the objects
> that were written, and there is no way to update it other than
> rewriting it entirely.
>

Ah, I understand.
So pickle is useful for very small databases, but when they're really huge, one
should use shelve. Is that right?

regards,
Gerrit.

--
The Dutch Linuxgames homepage: http://linuxgames.nl.linux.org
Personal homepage: http://www.nl.linux.org/~gerrit/

Discoverb is a python program (in several languages) which tests the words you
learned by asking it. Homepage: http://www.nl.linux.org/~gerrit/discoverb/
Oh my god! They killed init! You bastards!
Why does shelve make such large files? [ In reply to ]
Hi

Ovidiu Predescu wrote:
>[...]
>
> The shelve module uses DBM, which is like a small database that lets
> you store objects and look them up by key. DBM can hold millions of
> objects and let you retrieve any of them later without loading the
> whole file into memory first.
>
> The pickle module, on the other hand, serializes objects with the
> intent of deserializing them _all_ from the file later. Pickle offers
> no way to search for data by key; you have to do that yourself after
> the objects have been recreated from the file. Shelve handles this
> differently: every key access or insertion on a shelve object is
> actually a read from or write to the DBM file.
>
> And to answer your question: DBM creates such big files because of
> the way it manages the database. The data in the database file can
> contain gaps as a result of repeated insertions and deletions. A
> pickle file, by contrast, is just a flat representation of the objects
> that were written, and there is no way to update it other than
> rewriting it entirely.
>
> --
> Ovidiu Predescu <ovidiu@cup.hp.com>
> http://www.geocities.com/SiliconValley/Monitor/7464/

So, just to make sure I follow, in short terms:
* pickle makes "persistent objects" flushed to a file
* shelve + dbm is in fact a (relational?) database storing
(arbitrary**) objects in some hash-searchable order

(**) By arbitrary I mean _any_ kind of object, or do they need to be
of the same class?? (In order for searching to work)

About usage: shelve when I need to search, and pickle when I just need
to save my objects?? Are there other considerations based on the number
and size of objects?
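
For what it's worth, the stored objects do not need to share a class:
shelve pickles each value independently, and lookup is only ever by
string key, never by the value's contents. A quick sketch, with a
made-up file name:

import shelve

d = shelve.open('mixed')
d['numbers'] = [1, 2, 3]       # values may be any picklable objects,
d['greeting'] = 'hello'        # of entirely different classes;
d['point'] = {'x': 0, 'y': 1}  # keys, however, must be strings
print(d['numbers'])
d.close()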

Best regards
-- Thomas S. Strinnhed, thstr@serop.abb.se
Why does shelve make such large files? [ In reply to ]
Gerrit Holl wrote:

> Ah, I understand.
> So pickle is useful for very small databases, but when they're really huge, one
> should use shelve. Is that right?

Pickle serialises your data into a well-defined format. This way you can
"store" objects like lists as a whole. Serialising has the benefit that
you can, e.g., send your object (be it a list or something else) through
a pipe or socket and unpickle it on the other side. This way you can have
Python processes that exchange data in a kind of native Python format.
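
A minimal sketch of that round trip, using an in-memory byte string
where a real pipe or socket would sit:

import pickle

# Serialise the object to bytes that could be sent anywhere...
payload = pickle.dumps(['a', 'list', 'of', 'things'])

# ...and rebuild an equal object on the receiving side.
things = pickle.loads(payload)
assert things == ['a', 'list', 'of', 'things']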

Shelve gives you a lot more than just storing the data as one long
sequence. Think of the extra space as the administration needed for all
the extra functionality (like tracking which data items are valid, etc.).

The overhead induced by a database management system can be enormous
compared to the bare data you want to access, but this is mostly a
space/time tradeoff: the more space you use, the faster you can do
things like searching and sorting.

kind regards,
===================================================================
Johan Wouters === Easics ===
ASIC Designer === System-on-Chip design services ===
Tel: +32-16-395 616 ===================================
Fax: +32-16-395 619 Interleuvenlaan 86, B-3001 Leuven, BELGIUM
mailto:johanw@easics.be http://www.easics.com
Why does shelve make such large files? [ In reply to ]
Johan Wouters wrote:
>
> Gerrit Holl wrote:
>
> > Ah, I understand.
> > So pickle is useful for very small databases, but when they're really
> > huge, one should use shelve. Is that right?
>
> Pickle serialises your data into a well-defined format. This way you
> can "store" objects like lists as a whole. Serialising has the benefit
> that you can, e.g., send your object (be it a list or something else)
> through a pipe or socket and unpickle it on the other side. This way
> you can have Python processes that exchange data in a kind of native
> Python format.
>
> Shelve gives you a lot more than just storing the data as one long
> sequence. Think of the extra space as the administration needed for
> all the extra functionality (like tracking which data items are
> valid, etc.).
>
> The overhead induced by a database management system can be enormous
> compared to the bare data you want to access, but this is mostly a
> space/time tradeoff: the more space you use, the faster you can do
> things like searching and sorting.

Ahem, with all due respect to everyone... this is humbug.

Pickle serializes your data in one sweep to/from disk. It's compact.

Shelve stores serialized pieces in a keyed-access form, and uses file
space allocation to support add/modify/delete. It starts out with more
empty space, but as the amount of stored data grows, that space does not
expand linearly. Expect a shelve to have about half its file space in
use, on *average* (give or take a factor of two, perhaps).
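
A small experiment along these lines shows the stepwise allocation (the
file name is made up; if your dbm backend adds an extension, adjust the
getsize call accordingly):

import os
import shelve

d = shelve.open('growth')
for i in range(1000):
    d[str(i)] = 'x' * 100
    if i % 200 == 0:
        d.sync()    # flush before measuring
        print(i, os.path.getsize('growth'))
d.close()
# The size jumps in page-sized steps rather than tracking the stored
# bytes linearly, and it starts well above the size of the data.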

But the last claim is horrendously misleading, I'm afraid: data storage
is not a space/time tradeoff. It's about throughput (I/O bottlenecks)
and the overhead of managing the supported data and indexing schemes.
Because of this, there are order-of-magnitude performance differences
between solutions.

Shudder. The notion that a large database package, or a large datafile,
is faster, is so far from reality that it has to be corrected, even in
this Python-oriented newsgroup. My apologies for the S/N ratio drop.

-- Jean-Claude

P.S. Don't believe everything you read, Gerrit.
Why does shelve make such large files? [ In reply to ]
<snip>
> Ahem, with all due respect to everyone... this is humbug.
>

Why so?

> Pickle serializes your data in one sweep to/from disk. It's compact.

So? I don't recall having any trouble with that statement. As a matter
of fact, it is just what I stated.


> But the last claim is horrendously misleading, I'm afraid: data storage
> is not a space/time tradeoff. It's about throughput (I/O bottlenecks)
> and the overhead of managing the supported data and indexing schemes.
> Because of this, there are order-of-magnitude performance differences
> between solutions.

The general idea was a typical engineering tradeoff. Larger database
environments *might* give you better performance for some criteria, like
searching or combining data. Say you have an amount of data you want to
search. You could keep it as a serial stream (minimal space), or you
could use some clever hash method and fixed-size records to gain speed.
That way you bypass the overhead of scanning the whole database (and the
slow disks, ...) by simply calculating where the data should be. On the
other hand, structuring the data this way costs some extra space. That
looks like just the space/time tradeoff I was talking about ...
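
A toy version of that scheme, for illustration only: fixed-size slots
addressed by a stable hash, so a lookup seeks straight to where the
record should be (collisions are ignored here; a real DBM must handle
them):

import zlib

RECORD_SIZE = 64    # fixed-size slots
NUM_SLOTS = 1024    # pre-sized table: the 'extra space'

def slot_offset(key):
    # Calculate the record's position instead of scanning the file.
    return (zlib.crc32(key.encode()) % NUM_SLOTS) * RECORD_SIZE

with open('table.dat', 'wb') as f:
    f.truncate(RECORD_SIZE * NUM_SLOTS)    # mostly empty, by design

with open('table.dat', 'r+b') as f:
    f.seek(slot_offset('key'))
    f.write(b'value'.ljust(RECORD_SIZE, b'\0'))
    f.seek(slot_offset('key'))
    print(f.read(RECORD_SIZE).rstrip(b'\0'))    # b'value'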


>
> Shudder. The notion that a large database package, or a large datafile,
> is faster, is so far from reality that it has to be corrected, even in
> this Python-oriented newsgroup. My apologies for the S/N ratio drop.

I was talking about the datafile, not the package! And I never stated
that bigger files WILL give you better performance.

Maybe the whole story was a little simplistic, but with all respect: how
does your contribution add to the S?

So Gerrit, I hope you still got some insight from all this!

Kind regards,
Johan Wouters