Mailing List Archive: Saving a bit memory (was: [Bug 3383] Folded message-id unrecognized)

On Sunday 16 May 2004 10:53 CET Theo wrote:
> On Sat, May 15, 2004 at 06:26:27AM -0700, bugzilla-daemon wrote:
> > About your comment to my second patch, I don't really understand what
> > you mean. The points of these patch are:
> > 1. Don't reinvent the wheel and do the header parsing in spamd.raw but
> > use Message.pm. That's trivial.
>
> Right, we agree on that.

Ok.

> > 2. Don't store two copies of the mail in the memory. Currently the mail
> > is first completely read into @msglines which is then copied inside the
> > Message.pm ctor. At that point, two copies of the mail exist in memory
> > while we theoretically only need one if we could just feed the incoming
> > stuff directly line-per-line to the Message.pm. That's what the
> > anonymous subroutine is for. As that one also has to check for the
> > size, I can't just feed the ctor a reference to $client->getline().
>
> Well, a few things here.
>
> 1) The current method is cleaner.

Right, but the other one takes less memory :)

> 2) I don't think the patch works. It references non-global/non-local
> variables in the anon-sub. Not only that, but the variables it wants to
> use really have no way of being passed in without mucking up the Message
> API, which I don't want to do.

It does work, at least did it look like when I tested it. There's Perl
Voodoo involved: The anonymous subroutine is created in spamd's namespace
and has access to all variables of its parent routine. Even after it got
fed to the Message.pm ctor. I first couldn't believe it either, but it
works. I'll check it another time to be really sure though :)

> 3) We should do an 'undef @msglines' after the parse. The Message code
> specifically wants to get a copy of the message for itself, since it'll
> be messing with it during the parse.

Yes, that was my first idea too. But with this we still have two instances
of the messages in the memory, though for a shorter time:

We first read stuff into @msglines, feed that array to the ctor which then
copies it into its own "cache". At this point we've got the two instances
of the message. Even if we undef @msglines after the ctor returns, we still
had that "spike" of memory usage.

We also copied a whole potentially large array from A to B but I don't know
if that has much impact on performance.

My point is that we don't really need to cache the whole message first if we
can feed the incoming stuff directly line-per-line to the Message object.

Another idea I had was to create an "empty" object first and then offer a
method to feed later on. But that would have even more impact on the API.

> 4) Changing the M::SA::Message code in a non-trivial way is beyond this
> ticket IMNSHO.

Actually, the change is pretty trivial, I just reordered some of the other
variable handling and made sure that it doesn't bail out with a nasty error
if you feed it a ref to something it doesn't like (a has for example).

But you're right, that bug is fixed, that's why I took this to the list ;-)

> So what I was saying was we can easily put the parse before the use of
> the message-id. I was also commenting that we could probably put the
> expected!=actual test before the message-id output, so we don't have
> to worry about calling $mail->finish(). We can also undef the msglines
> array to free up some memory during processing.

Nope, we can't put the size checking before we extract the Message-Id
because else we would have some random messages about mail with incorrect
sizes in the syslog for which the user can't find out which the culprit
was.

Cheers,
Malte

--
[SGT] Simon G. Tatham: "How to Report Bugs Effectively"
<http://www.chiark.greenend.org.uk/~sgtatham/bugs.html>
[ESR] Eric S. Raymond: "How To Ask Questions The Smart Way"
<http://www.catb.org/~esr/faqs/smart-questions.html>

On Sun, May 16, 2004 at 12:18:13PM +0200, Malte S. Stretz wrote:
> fed to the Message.pm ctor. I first couldn't believe it either, but it
> works. I'll check it another time to be really sure though :)

Eww! ;)

> of the message. Even if we undef @msglines after the ctor returns, we still
> had that "spike" of memory usage.

Yeah, but it's pretty much moot since when the parsing actually happens,
there'll be a bunch of other "copies" in memory anyway via the parse tree,
raw vs decoded vs rendered, etc. So having the array temporarily twice
in memory isn't really a big deal. If we undef the first one after the
parse() call, the memory will get freed and then get used by the full
parse version. If we do the getline schwacky instead, the memory will
eventually be used by the parse version.

So as far as I can see, it'll be a wash at the end, so we may as well
keep it clean and get to the same end point.

> We also copied a whole potentially large array from A to B but I don't know
> if that has much impact on performance.

Well, it could theoretically be large, but in reality it'll be a max of
~300KB, which I doubt will be a problem anywhere.

> But you're right, that bug is fixed, that's why I took this to the list ;-)

:)

> Nope, we can't put the size checking before we extract the Message-Id
> because else we would have some random messages about mail with incorrect
> sizes in the syslog for which the user can't find out which the culprit
> was.

I was figuring that since we couldn't trust the message data enough to
scan it, we didn't want to trust that the header would be available.
But either way I guess...

--
Randomly Generated Tagline:
"Security is mostly a superstition. It does not exist in nature, nor
do the children of men as a whole experience it. Avoiding danger is no
safer in the long run than outright exposure. Life is either a daring
adventure, or nothing." - Helen Keller