Mailing List Archive: NNTP

Re: Cached Headers (was NNTP) [ In reply to ]

Mar 18, 2004, 9:46 AM

Post #26 of 31 (577 views)

> But I am still not sure how much the performance gain is by
> putting it into separate tables compared to searching on the
> message blocks.

Consider a hypothetical user who passes 2-5MB documents back and
forth between colleagues all day long, and for whatever reason (he
is part of management?) saves a copy of all mail sent and received.
(He must be using a native webmail program, it looks like :).
If he wants to search for a Subject including "Jan-Promo.doc" in
all his messages, there's going to be a huge difference in the two
setups.

Once again, the biggest benefit of this is for native applications
like webmail and the proposed nntp server (and even native nntp clients,
should any be written). Ilja thinks imap's important, too. :) If
it's written generically, the dbmail-supplied headers to be cached would
be be optimized for imap, and eg. weDBmail install doc could tell you
how to set it up for weDBmail. I still think it'd be benefitial to have
a seperate document or seperate doc section for optimizing for imap-only,
pop3-only, or combined use (eg. pop3 users could turn header caching off).

--
Jesse Norell

administrator@kci.net is not my email address;
change "administrator" to my first name.
--

Re: Cached Headers (was NNTP) [ In reply to ]

Magnus.Sundberg at dican

Mar 18, 2004, 10:10 AM

Post #27 of 31 (571 views)

Permalink

Jesse Norell wrote:
>
>>But I am still not sure how much the performance gain is by
>>putting it into separate tables compared to searching on the
>>message blocks.
>
>
> Consider a hypothetical user who passes 2-5MB documents back and
> forth between colleagues all day long, and for whatever reason (he
> is part of management?) saves a copy of all mail sent and received.
> (He must be using a native webmail program, it looks like :).
> If he wants to search for a Subject including "Jan-Promo.doc" in
> all his messages, there's going to be a huge difference in the two
> setups.
>

If I understand everything correct, the headers goes into the
first messageblock and the rest of the message goes into the next
message blocks.
With a flag on the message block containing marking it as a
header block will we only search the header blocks and these are
usually not that many kilobytes.

But it remains to be implemented.

My $0.02 is that we should make a simple solution first, that
maybe isn't that effective, but mostly good enough. We can then
refine it and add separate tables.

Magnus

Re: Cached Headers (was NNTP) [ In reply to ]

ed at hp

Mar 18, 2004, 10:22 AM

Post #28 of 31 (573 views)

Permalink

> >>But I am still not sure how much the performance gain is by
> >>putting it into separate tables compared to searching on the
> >>message blocks.
> >
> >
> > Consider a hypothetical user who passes 2-5MB documents back and
> > forth between colleagues all day long, and for whatever reason (he
> > is part of management?) saves a copy of all mail sent and received.
> > (He must be using a native webmail program, it looks like :).
> > If he wants to search for a Subject including "Jan-Promo.doc" in
> > all his messages, there's going to be a huge difference in the two
> > setups.
> >
>
> If I understand everything correct, the headers goes into the
> first messageblock and the rest of the message goes into the next
> message blocks.
> With a flag on the message block containing marking it as a
> header block will we only search the header blocks and these are
> usually not that many kilobytes.
>
> But it remains to be implemented.
>
> My $0.02 is that we should make a simple solution first, that
> maybe isn't that effective, but mostly good enough. We can then
> refine it and add separate tables.
>
> Magnus
>

yes, but don't you have to load, from the database, the entire field?
that would be the entire 5MB file.

ed

p.s. I've been known to archive, as proof of delivery, 5MB files in my
dbmail server.

Security on the internet is impossible without strong, open,
and unhindered encryption.

Re: Cached Headers (was NNTP) [ In reply to ]

jesse at kci

Mar 18, 2004, 10:24 AM

Post #29 of 31 (577 views)

Permalink

> If I understand everything correct, the headers goes into the
> first messageblock and the rest of the message goes into the next
> message blocks.
> With a flag on the message block containing marking it as a
> header block will we only search the header blocks and these are
> usually not that many kilobytes.

Right, that could help a bunch. With your (I think) comment of
using regex queries, that would be a lot better than where things
are today (ie. either make multiple queries to search the first block
of each message, or return all message blocks and parse in the client).

--
Jesse Norell

administrator@kci.net is not my email address;
change "administrator" to my first name.
--

RE: Cached Headers [ In reply to ]

mark at orcon

Mar 18, 2004, 1:30 PM

Post #30 of 31 (577 views)

Permalink

A flag on the messageblks field would certainly help, and allow webmail
applications to generate message lists with one query

E.g. select messages.*, messageblks.messageblk as headers from
messages,messageblks where
messages.message_idnr=messageblks.messageblks_idnr and
messageblks.isheader=1 order by <SORT KEY>

Currently this query requires a group to get the first block, which is very
costly in database CPU.

However what this does is only give the option to sort on sort-keys in the
messages table (e.g. itnernal date, etc) and still obviously requires
parsing of the headers. Certainly an improvement though.

It doesn't easly allow access to the 'metadata' discussed earlier though,
which would be nice. E.g. in a webmail application it's good to show whether
there is an attachment with the application. Whether it's been replied to,
etc. If we're heading down the Exchange replacement path this may be a
requirement -- as with Outlook you can add all sorts of custom storage
fields.

Ed's proposed structure (or perhaps Magnus' by storing the field
defininition (e.g. "from) in the header table certainly helps from a parsing
mechanism, but for a mailbox of 1000 messages, you're going to need 1 query
to get message list, and then 1000 queries -- vs. 1 join if you get the
table structure into a 1-to-1 form.

Parsing certainly isn't a huge overhead. So the 'isheader' flag on
messageblks means you get this. But parsing header blocks can be tedious at
times, as the format varies (and you get badly constructed headers being
sent through), etc. You need to decode quoted printable, etc or handle
varying character sets.

Maybe we could try a dbmail-specific header table / metadata format; that
was heavily optimised for simple-parsing; and perhaps only cached the
headers/metadata we needed. I'm guessing that at this time the charset would
be coverted to the 'native charset' for the user of the system.

E.g. "key:value|key2.value2"

Provided this is matched 1-to-1, joins would be pretty trivial;
output-parsers (not necessarily the parser that initially creates the
metadata) would be really basic, etc. For searches it would probably also be
easy to do just using SQL:

select ... From msgs,msgblks ... And metadata like
'%|SEARCHFIELD:SEARCHVALUE|%'

/Mark

RE: Cached Headers [ In reply to ]

aaron at serendipity

Mar 18, 2004, 2:05 PM

Post #31 of 31 (572 views)

Permalink

This query will get you the headers and the flags of each message in a give
mailbox. A strictly ascending messageblk_idnr column is assumed.

select messages.*, min(messageblk_idnr), messageblk from messages,
physmessage, messageblks where messages.physmessage_id = physmessage.id =
messageblks.physmessage_id and mailbox_idnr = 1 group by physmessage_id \G

However, that assumption is exactly one that needs to be thrown away. (gets
onto soapbox) We need to move towards the elimination of reliance on
autoincrement columns so that eventually we can use [statistically] unique
hashes that can be generated by a number of hosts without coordination and
with minimal or no risk of collisions.

So basically that's 2.1 and later stuff. But keep it in mind. (jumps off soapbox).

In the meantime, I'm beginning to believe in the JOIN stuff. I'm not convinced
that it's faster than an indexed char column, but I'm going to do some fake
data benchmarks to find out.

I'm still hugely against storing the headers in a parsed-only format. It just
doesn't seem worth the space savings not to keep a copy of the headers in
their original off-the-wire format, though provided that it gets eaiser, say
by adding an is_header column to messageblks.

By the way, the problems about passing the headers around if they huge sized
are greatly alleviated if they can be inserted into the database as they
arrive and are quickly looked up on an as-needed basis. Of course, that means
many more database calls during delivery...

Aaron

""Mark Mackay - Orcon"" <mark@orcon.net.nz> said:

> A flag on the messageblks field would certainly help, and allow webmail
> applications to generate message lists with one query
>
> E.g. select messages.*, messageblks.messageblk as headers from
> messages,messageblks where
> messages.message_idnr=messageblks.messageblks_idnr and
> messageblks.isheader=1 order by <SORT KEY>
>
> Currently this query requires a group to get the first block, which is very
> costly in database CPU.
>
> However what this does is only give the option to sort on sort-keys in the
> messages table (e.g. itnernal date, etc) and still obviously requires
> parsing of the headers. Certainly an improvement though.
>
> It doesn't easly allow access to the 'metadata' discussed earlier though,
> which would be nice. E.g. in a webmail application it's good to show whether
> there is an attachment with the application. Whether it's been replied to,
> etc. If we're heading down the Exchange replacement path this may be a
> requirement -- as with Outlook you can add all sorts of custom storage
> fields.
>
> Ed's proposed structure (or perhaps Magnus' by storing the field
> defininition (e.g. "from) in the header table certainly helps from a parsing
> mechanism, but for a mailbox of 1000 messages, you're going to need 1 query
> to get message list, and then 1000 queries -- vs. 1 join if you get the
> table structure into a 1-to-1 form.
>
> Parsing certainly isn't a huge overhead. So the 'isheader' flag on
> messageblks means you get this. But parsing header blocks can be tedious at
> times, as the format varies (and you get badly constructed headers being
> sent through), etc. You need to decode quoted printable, etc or handle
> varying character sets.
>
> Maybe we could try a dbmail-specific header table / metadata format; that
> was heavily optimised for simple-parsing; and perhaps only cached the
> headers/metadata we needed. I'm guessing that at this time the charset would
> be coverted to the 'native charset' for the user of the system.
>
> E.g. "key:value|key2.value2"
>
> Provided this is matched 1-to-1, joins would be pretty trivial;
> output-parsers (not necessarily the parser that initially creates the
> metadata) would be really basic, etc. For searches it would probably also be
> easy to do just using SQL:
>
> select ... From msgs,msgblks ... And metadata like
> '%|SEARCHFIELD:SEARCHVALUE|%'
>
> /Mark
>
>
>
>
>
>
> _______________________________________________
> Dbmail-dev mailing list
> Dbmail-dev@dbmail.org
> http://twister.fastxs.net/mailman/listinfo/dbmail-dev
>

--