Mailing List Archive

pickle vs .pyc
I need to be able to read a couple very complex (dictionary of arrays
of dictionaries, and array of dictionaries of array of dictionaries)
data structures into python. To generate it by hand takes too long,
so I want to generate it once, and read it each time (the data doesn't
change).

The obvious choice is, of course pickle, or some flavor thereof.
But can someone tell me why this wouldn't be faster:

In the code that does the "pickling", simply do:
f = open("cache.py", "w")
f.write("# cache file for fast,slow\n")
f.write("fast = "+`fast`+'\n')
f.write("slow = "+`slow`+'\n')
f.close()
import cache

Then, later, when I want the data, I just do:

from cache import fast,slow

and it's right there. It's compiled, and seems really fast (loading a
50k file in .12 seconds). I just tried the same data using cPickle, and
it took 1.4 seconds. It's also not as portable. There is a space savings
with pickle, but it's only 5% (well, 56% if you count both the .py and
.pyc files), but that doesn't really matter to me.

Am I missing something here? This sounds like an obvious, and fast,
way to do things. True, the caching part may take longer. But I
really don't care about that, since it's done only once, and in the
background.

Michael
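
For readers on a current interpreter, the approach described above can be sketched in Python 3 syntax, where repr() replaces the backtick form. The file name cache.py follows the post; the sample data, the temporary directory and the helper name write_cache are assumptions made only for illustration.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical sample data standing in for the real structures.
fast = {"a": [{"x": "1"}, {"y": "2"}]}
slow = [{"k": [{"z": "3"}]}]

def write_cache(directory, fast, slow):
    """Write the two structures as Python literals into cache.py."""
    path = os.path.join(directory, "cache.py")
    with open(path, "w") as f:
        f.write("# cache file for fast,slow\n")
        f.write("fast = " + repr(fast) + "\n")
        f.write("slow = " + repr(slow) + "\n")
    return path

cache_dir = tempfile.mkdtemp()
write_cache(cache_dir, fast, slow)

# Make the directory importable, then load the data back via import.
sys.path.insert(0, cache_dir)
importlib.invalidate_caches()
cache = importlib.import_module("cache")

assert cache.fast == fast and cache.slow == slow
```

This only works when every stored value has a repr() that is valid Python source, which holds for dictionaries and lists of strings.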
pickle vs .pyc [ In reply to ]
Michael Vezie <mlv@pobox.com> writes:

> from cache import fast,slow
>
> and it's right there. It's compiled, and seems really fast (loading
> a 50k file in .12 seconds).

I understand that .pyc files use marshal. Maybe this is simply a
matter of marshal being faster than cPickle? Have you tried using
marshal directly?

(It makes sense for pickling to be slower than marshaling because it
does more; for example, it takes care about recursive relationships
and such.)
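
One way to check this guess is to time the two loaders on the same data. The sketch below uses Python 3, where the pickle module already uses the C implementation that cPickle provided; the data shape and the repeat count are assumptions, and no particular result is implied.

```python
import marshal
import pickle
import timeit

# Hypothetical data shaped like the structures in the thread.
data = {
    "key%d" % i: [{"name": "item%d" % j, "value": str(j)} for j in range(20)]
    for i in range(200)
}

marshalled = marshal.dumps(data)
pickled = pickle.dumps(data)

# Both formats must reproduce the original structure.
assert marshal.loads(marshalled) == data
assert pickle.loads(pickled) == data

# Time only the loading step, since loading is what matters here.
t_marshal = timeit.timeit(lambda: marshal.loads(marshalled), number=50)
t_pickle = timeit.timeit(lambda: pickle.loads(pickled), number=50)
print("marshal.loads: %.4f s" % t_marshal)
print("pickle.loads:  %.4f s" % t_pickle)
```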
pickle vs .pyc [ In reply to ]
Michael Vezie <mlv@pobox.com> writes:

> I need to be able to read a couple very complex (dictionary of arrays
> of dictionaries, and array of dictionaries of array of dictionaries)
> data structures into python. To generate it by hand takes too long,
> so I want to generate it once, and read it each time (the data doesn't
> change).
>
> The obvious choice is, of course pickle, or some flavor thereof.
> But can someone tell me why this wouldn't be faster:
>
> In the code that does the "pickling", simply do:
> f = open("cache.py", "w")
> f.write("# cache file for fast,slow\n")
> f.write("fast = "+`fast`+'\n')
> f.write("slow = "+`slow`+'\n')
> f.close()
> import cache
>
> Then, later, when I want the data, I just do:
>
> from cache import fast,slow
>
> and it's right there. It's compiled, and seems really fast (loading a
> 50k file in .12 seconds). I just tried the same data using cPickle, and
> it took 1.4 seconds. It's also not as portable. There is a space savings
> with pickle, but it's only 5% (well, 56% if you count both the .py and
> .pyc files), but that doesn't really matter to me.
>
> Am I missing something here? This sounds like an obvious, and fast,
> way to do things. True, the caching part may take longer. But I
> really don't care about that, since it's done only once, and in the
> background.
>
> Michael

Hmm, you're relying on all the data you're storing having faithful
__repr__ methods. This certainly isn't universally true. I'd regard
this method as too fragile.
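
The fragility can be seen with a small check, written here in Python 3 syntax. The class Record and the helper round_trips are hypothetical names introduced only for this sketch.

```python
class Record:
    """A plain class with the default repr, which is not valid Python source."""
    def __init__(self, name):
        self.name = name

def round_trips(value):
    """Return True if eval(repr(value)) rebuilds an equal object."""
    try:
        return eval(repr(value)) == value
    except Exception:
        return False

plain = {"a": [{"x": "1"}], "b": ["two", "three"]}
print(round_trips(plain))          # True
print(round_trips(Record("foo")))  # False
```

Dictionaries and lists of strings survive the trip; the instance does not, because its default repr is of the form <...Record object at 0x...> and cannot be evaluated.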

If you're only storing simple data (by which I mean simple types of
data, not that the data is simple) (and I think you must be for the
approach you're using to work) give the marshal module a whirl.

I think it will be substantially faster than your repr-based method
(cryptic hint: if it wasn't, the marshal module probably wouldn't
exist).

Eg:

import marshal

complex_data_structure = {'key1':['nested list'],9:"mixed types"}

marshal.dump(complex_data_structure,open('/tmp/foo','wb'))

print marshal.load(open('/tmp/foo','rb'))

HTH
Michael

Random aside: something fishy's going on when I try to marshal
*arrays* (as opposed to mere lists):

>>> import array,marshal
>>> marshal.loads(marshal.dumps(array.array('f',[0,1])))
'\000\000\000\000\000\000\200?'
>>>

That shouldn't be happening should it? Surely that should be raising
an unmarshalable object exception? Oh well...
pickle vs .pyc [ In reply to ]
Michael Hudson <mwh21@cam.ac.uk> writes:

> Michael Vezie <mlv@pobox.com> writes:
>
> > I need to be able to read a couple very complex (dictionary of arrays
> > of dictionaries, and array of dictionaries of array of dictionaries)
> > data structures into python. To generate it by hand takes too long,
> > so I want to generate it once, and read it each time (the data doesn't
> > change).
> >
> > The obvious choice is, of course pickle, or some flavor thereof.
> > But can someone tell me why this wouldn't be faster:
> >
> > In the code that does the "pickling", simply do:
> > f = open("cache.py", "w")
> > f.write("# cache file for fast,slow\n")
> > f.write("fast = "+`fast`+'\n')
> > f.write("slow = "+`slow`+'\n')
> > f.close()
> > import cache
> >
> > Then, later, when I want the data, I just do:
> >
> > from cache import fast,slow
> >
> > and it's right there. It's compiled, and seems really fast (loading a
> > 50k file in .12 seconds). I just tried the same data using cPickle, and
> > it took 1.4 seconds. It's also not as portable. There is a space savings
> > with pickle, but it's only 5% (well, 56% if you count both the .py and
> > .pyc files), but that doesn't really matter to me.
> >
> > Am I missing something here? This sounds like an obvious, and fast,
> > way to do things. True, the caching part may take longer. But I
> > really don't care about that, since it's done only once, and in the
> > background.
> >
> > Michael
>
> Hmm, you're relying on all the data you're storing having faithful
> __repr__ methods. This certainly isn't universally true. I'd regard
> this method as too fragile.
>
> If you're only storing simple data (by which I mean simple types of
> data, not that the data is simple) (and I think you must be for the
> approach you're using to work) give the marshal module a whirl.
>
> I think it will be substantially faster than your repr-based method
> (cryptic hint: if it wasn't, the marshal module probably wouldn't
> exist).
>
> Eg:
>
> import marshal
>
> complex_data_structure = {'key1':['nested list'],9:"mixed types"}
>
> marshal.dump(complex_data_structure,open('/tmp/foo','wb'))
>
> print marshal.load(open('/tmp/foo','rb'))
>
> HTH
> Michael

Duh! Of course, once you've imported cache.py once, it's compiled to a
.pyc file and all the literals within it will be marshalled
anyway. Still, using marshal directly is certainly more robust and
probably faster...
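
The connection between .pyc files and marshal can be seen by reading a compiled file by hand. This is a sketch for Python 3; the 16-byte header size is an assumption that holds for CPython 3.7 and later and differs in older versions, and the file contents are invented for the example.

```python
import marshal
import os
import py_compile
import tempfile

src_dir = tempfile.mkdtemp()
src_path = os.path.join(src_dir, "cache.py")
with open(src_path, "w") as f:
    f.write("fast = {'a': [{'x': '1'}]}\n")

# Compile to an explicit .pyc path so we know where the file lands.
pyc_path = py_compile.compile(src_path, cfile=os.path.join(src_dir, "cache.pyc"))

with open(pyc_path, "rb") as f:
    f.read(16)              # assumed header size (CPython 3.7+)
    code = marshal.load(f)  # the rest is one marshalled code object

# Running the code object rebuilds the module-level names.
namespace = {}
exec(code, namespace)
print(namespace["fast"])
```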

one-day-I'll-learn-to-read-the-subject-as-part-of-the-message-ly y'rs
Michael
pickle vs .pyc [ In reply to ]
Michael Hudson <mwh21@cam.ac.uk> writes:

> Random aside: something fishy's going on when I try to marshal
> *arrays* (as opposed to mere lists):

The documentation for marshal says:

Not all Python object types are supported; in general, only
objects whose value is independent from a particular invocation of
Python can be written and read by this module. The following types
are supported: `None', integers, long integers, floating point
numbers, strings, tuples, lists, dictionaries, and code objects
(...)
pickle vs .pyc [ In reply to ]
Michael Vezie writes:
[saving / loading complex structures]
> The obvious choice is, of course pickle, or some flavor thereof. But
> can someone tell me why this wouldn't be faster:
>
> In the code that does the "pickling", simply do:
> f = open("cache.py", "w")
> f.write("# cache file for fast,slow\n")
> f.write("fast = "+`fast`+'\n')
> f.write("slow = "+`slow`+'\n')
> f.close()
> import cache
>
> Then, later, when I want the data, I just do:
>
> from cache import fast,slow

> Am I missing something here? This sounds like an obvious, and fast,
> way to do things. True, the caching part may take longer. But I
> really don't care about that, since it's done only once, and in the
> background.

Not at all. Where x == eval(repr(x)), this is dandy. You can even use
pprint to dump into a humanly digestible format. You can then edit
and reload.
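
A minimal sketch of the pprint variant, in Python 3 syntax with invented sample data. Using ast.literal_eval for the check is an addition not mentioned in the thread; it accepts literals only, so it is a more cautious way to read such a file than eval().

```python
import ast
import pprint

data = {"fast": [{"x": "1", "y": "2"}, {"z": "3"}], "slow": {"k": ["a", "b"]}}

# pformat produces a readable, possibly multi-line literal that is still
# valid Python source.
text = pprint.pformat(data)

# literal_eval parses literals only, so nothing in the text gets executed.
assert ast.literal_eval(text) == data

# The condition named above: x == eval(repr(x)).
assert eval(repr(data)) == data
```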

But marshal should be even faster.

- Gordon
pickle vs .pyc [ In reply to ]
In article <m3pv3eian1.fsf@atrus.jesus.cam.ac.uk>,
Michael Hudson <mwh21@cam.ac.uk> wrote:
>
>Hmm, you're relying on all the data you're storing having faithful
>__repr__ methods. This certainly isn't universally true. I'd regard
>this method as too fragile.

Granted. It wouldn't do for the file to contain:

fast = <__main__.fast instance at 8137590>
slow = <__main__.slow instance at 8137590>

But the data I'm dealing with is all dictionaries of arrays of dictionaries,
all containing strings. Pretty boring; no class at all.


>If you're only storing simple data (by which I mean simple types of
>data, not that the data is simple) (and I think you must be for the
>approach you're using to work) give the marshal module a whirl.

This is the first I've heard of marshal. I take it from various
hints here and there that the .pyc files are handled internally
by marshal, yes?

Thanks for your (and everyone else's) help.

One final aside, I've seen disagreements both online and in the
web docs about the spelling of marshal. I did a search and found
12 hits for "marshall" (now granted, some are probably unrelated, I
didn't trace them all). But there were enough to make me think
that 'marshall' was the right spelling.

But the fact that 'import marshal' worked while 'import marshall'
failed convinced me otherwise. :)

marshaling-up-the-courage-to-convert-working-code-to-marshal'ly yrs

... (hoping-this-'ly yrs-tag-isn't-trademarked-or-only-available-
to-those-who've-achieved-some-level-of-pythondom'ly yrs)

Michael
pickle vs .pyc [ In reply to ]
[Michael Vezie]
> ...
> This is the first I've heard of marshal. I take it from various
> hints here and there that the .pyc files are handled internally
> by marshal, yes?

Yes, "code objects" in particular are the one type marshal can handle that
pickle can't. I think we try to sell that as "a feature", though <wink>.
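
The difference is easy to try out. The sketch below is Python 3; the source string and the names in it are made up for the example, and the pickle failure is what current CPython releases are expected to do rather than something stated in this thread.

```python
import marshal
import pickle

code = compile("result = 6 * 7", "<example>", "exec")

# marshal can serialise and restore the code object.
restored = marshal.loads(marshal.dumps(code))
namespace = {}
exec(restored, namespace)
print(namespace["result"])  # 42

# pickle is expected to refuse code objects.
try:
    pickle.dumps(code)
    pickled_ok = True
except (TypeError, pickle.PicklingError):
    pickled_ok = False
print(pickled_ok)
```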

> ...
> One final aside, I've seen disagreements both online and in the
> web docs about the spelling of marshal.

Guido's spelling is the only one we'll mention here, thank you.

> ...
> marshaling-up-the-courage-to-convert-working-code-to-marshal'ly yrs
>
> ... (hoping-this-'ly yrs-tag-isn't-trademarked-or-only-available-
> to-those-who've-achieved-some-level-of-pythondom'ly yrs)

No particular qualifications, fees or exams are needed to use a "-ly y'rs"
closing. I do own exclusive worldwide rights to it, but grant everyone an
unlimited license in perpetuity. How can I afford to do this? Easy: there
are only so many people out there who want to come across as an ass year
after year <wink>.

it's-not-the-form-it's-the-content-ly y'rs - tim
pickle vs .pyc [ In reply to ]
At 12:01 AM 6/3/99 -0400, Tim Peters wrote:
>> This is the first I've heard of marshal. I take it from various
>> hints here and there that the .pyc files are handled internally
>> by marshal, yes?
>
>Yes, "code objects" in particular are the one type marshal can handle that
>pickle can't. I think we try to sell that as "a feature", though <wink>.

And code objects can't be handled by my repr/pyc method. There is one place
where I would have need to store a code object.

Sounds like I'm going to be spending some quality time with marshal...


>> One final aside, I've seen disagreements both online and in the
>> web docs about the spelling of marshal.
>
>Guido's spelling is the only one we'll mention here, thank you.

And 'marshalling' being spelled with two 'l's doesn't help.


>> ...
>> marshaling-up-the-courage-to-convert-working-code-to-marshal'ly yrs
>>
>> ... (hoping-this-'ly yrs-tag-isn't-trademarked-or-only-available-
>> to-those-who've-achieved-some-level-of-pythondom'ly yrs)
>
>No particular qualifications, fees or exams are needed to use a "-ly y'rs"
>closing. I do own exclusive worldwide rights to it, but grant everyone an
>unlimited license in perpetuity. How can I afford to do this? Easy: there
>are only so many people out there who want to come across as an ass year
>after year <wink>.

Of course I need to learn how to spell "y'rs"...

Michael
pickle vs .pyc [ In reply to ]
Hrvoje Niksic <hniksic@srce.hr> writes:

> Michael Hudson <mwh21@cam.ac.uk> writes:
>
> > Random aside: something fishy's going on when I try to marshal
> > *arrays* (as opposed to mere lists):
>
> The documentation for marshal says:
>
> Not all Python object types are supported; in general, only
> objects whose value is independent from a particular invocation of
> Python can be written and read by this module. The following types
> are supported: `None', integers, long integers, floating point
> numbers, strings, tuples, lists, dictionaries, and code objects
> (...)

Yes, but usually you get an exception to show that something's gone
awry:

>>> import md5
>>> marshal.loads(marshal.dumps(md5.md5()))
Traceback (innermost last):
  File "<stdin>", line 1, in ?
ValueError: unmarshallable object
>>> class F:
...     pass
...
>>> marshal.loads(marshal.dumps(F()))
Traceback (innermost last):
  File "<stdin>", line 1, in ?
ValueError: unmarshallable object

It's not that important I suppose, it's just odd.

Michael
pickle vs .pyc [ In reply to ]
Michael Hudson <mwh21@cam.ac.uk> writes:

> Yes, but usually you get an exception to show that something's gone
> awry:
>
> >>> import md5
> >>> marshal.loads(marshal.dumps(md5.md5()))
> Traceback (innermost last):
>   File "<stdin>", line 1, in ?
> ValueError: unmarshallable object

A good point. However, that is exactly the kind of behaviour I get
for array objects -- in Python 1.5.1. In 1.5.2, I can repeat the bug
you describe. Maybe 1.5.2 is supposed to allow marshalling arrays,
but the support is buggy?

<search search grep grep>

OK, I think I got it. In 1.5.2, marshal.c contains this code:

    else if ((pb = v->ob_type->tp_as_buffer) != NULL &&
             pb->bf_getsegcount != NULL &&
             pb->bf_getreadbuffer != NULL &&
             (*pb->bf_getsegcount)(v, NULL) == 1)
    {
        /* Write unknown buffer-style objects as a string */
        char *s;
        w_byte(TYPE_STRING, p);
        n = (*pb->bf_getreadbuffer)(v, 0, (void **)&s);
        w_long((long)n, p);
        w_string(s, n, p);
    }

...which means that objects with the tp_as_buffer property (whatever that
means) get marshalled as strings. This wasn't the case in 1.5.1, and I
have no idea why it would be useful to anyone, because during
unmarshalling such objects simply remain strings instead of getting
converted to the original form. Not that the conversion would work
across different architectures anyway.

I'm not sure what the new feature is supposed to buy us, but 1.5.1
behaviour looks more correct to me.
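
If an array does need to go through marshal, the conversion back has to be done by hand. A minimal sketch in Python 3, assuming the typecode is stored next to the raw bytes; the tuple layout is an invention for this example, and the raw bytes remain tied to the machine's byte order and item size, as noted above.

```python
import array
import marshal

original = array.array("f", [0, 1])

# Store the typecode next to the raw bytes so the array can be rebuilt.
payload = marshal.dumps((original.typecode, original.tobytes()))

# Unmarshalling gives back a plain (str, bytes) tuple, not an array.
typecode, raw = marshal.loads(payload)
rebuilt = array.array(typecode)
rebuilt.frombytes(raw)

assert rebuilt == original
```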
pickle vs .pyc [ In reply to ]
On 2 Jun 1999 16:22:15 -0400, Michael Vezie <mlv@pobox.com> wrote:

>from cache import fast,slow

Note that importing compiles the cache module only if the .py file is newer
than the .pyc.
While this should be no problem on a single machine, there are (unlikely) cases
where it could hurt you when working with multiple machines or transporting your
files. At least it is an external dependency you may want to avoid.

Stefan