Mailing List Archive

Current db schema & future improvements
Hi all,

i've been a bit too busy lately so i was pleasantly surprised by all
the new discussion started :-) We do have some plans about optimising
the database structure; i'll share out current view with y'all.

First off, i'd like to make clear the current dbmail database model:

* message general information (size, internaldate, flags, uid,
mailboxid) is stored in the messages table
* message data is stored in the messageblks table. These blocks are
inserted as following:
- the first block contains only the main header -i.e. the part until
the first double-newline is encountered
- the following blocks contain message data in chunks of (currently)
512K

This design was based upon POP: pop is regarding a message like this:

<header><newline newline><body>

hence the current model.

IMAP is far more advanced, it is based upon complete message parsing;
the FETCH command is capable of retrieving parts of a message specified
by their logical offset and size, not absolute byte counts. Current
performance problems/optimisation ideas are:

1) pre-parsed messages (from the imap-servre point of view)
2) heavy copy command
3) separate message header-fields storage
4) checking wheter a message already exists

i'll discuss these points below:

1) message parsing required each time a message is queried
the imap server has a *very* basic caching mechanism: the parsed
message structure is cached until another message is parsed, subsequent
calls to FETCH different parts of the same message only require one
parse.
Options here are letting the imap server parse messages and then save
the parse-information or the insertor process could do this. We prefer
the first option - it'll put all the parsing-load away from the imap
server.
Main problem is defining a neat database structure to save this
information: logical message pieces (i.e. BODY[1.3.4.MIME] etc.) should
refer directly to offset/count values within the messageblks table. My
goal is to be able to query all the information needed directly from
the database. Understanding the imap protocol is essential here: in
general first the unique-id, bodystructure/bodyenvelope are requested
for a list of messages (most likely the new ones that have arrived).
Afterwards, different parts of the messages are requested. This last
part is very dependend on the mailclient used - some just ask the whole
message and parse it itself, others do depend on the information an
imap server should provide.
A total solution should be found, having a fast FETCH BODYSTRUCTURE
with parsing required when the message parts are requested will not
gain much speed.

2) COPY is heavy
Major pain here is that IMAP does not support a move command - moving
your messages require them to be copied and deleted. We have 2 options
for solving this:

* adding an abstraction layer
this layer would link messages with mailboxes. Copying a message would
only require an insert of an mailboxid and a messageid. Drawback:
requesting the messages from a certain mailbox will require an extra
query as this table should be accessed as well as the messages and
messageblks table.

* dropping the unique-constraint on message_idnr, adding a unique
constraint on (message_idnr, mailbox_idnr)
This will enable the same ease of copying without having to access an
extra table when querying a message. However, (but that could be
personal) i find this from a database point of view less attractive.

3) message header fields
I think this would be not too much of a benefit for FETCH's - it would
certainly if fetches always used some standard, usefull fields (i.e.
subject, to, cc, from) but somehow, each client asks for different
fields yielding from X-Forwarded to X-even-more-exotic - making it a
big bunch of fields we would have to store. Moreover, the first
messageblk just contains all the fields and parsing these does not
require much overhead: in general the amount of data a header is very
limited.
Adding some separately stored header fields could help the SEARCH -
very much, indeed. Typical search fields are from, to, subject, cc so
these could be stored separately for performance issues. Creating
database indexes on these (combined with a mailboxid as imap can only
search folders) will make the search *very* fast.

4) duplicate messages on insertion
This i don't see as a real option using checksums. It would be nice,
but probably no two messages are exactly the same - deliver time,
internal date and stuff will probably break the checksum giving only
minimal storage benefits with a lot of program logic added.
I'm not sure how the MTA calls upon dbmail-smtp, but calling it with
all the receipients at once would enable storing the message once for
multiple dbmail users - this would give the scenario of sending 10
coworkers a 10M video a huge benefit.


Please comment!

regards roel


_________________________
R.A. Rozendaal
IC&S
T: +31 30 2322878
F: +31 30 2322305
www.ic-s.nl
RE: Current db schema & future improvements [ In reply to ]
Hello,

> 4) duplicate messages on insertion
> This i don't see as a real option using checksums. It would be nice,
> but probably no two messages are exactly the same - deliver time,
> internal date and stuff will probably break the checksum giving only
> minimal storage benefits with a lot of program logic added.
> I'm not sure how the MTA calls upon dbmail-smtp, but calling it with
> all the receipients at once would enable storing the message once for
> multiple dbmail users - this would give the scenario of sending 10
> coworkers a 10M video a huge benefit.

The checksum idea was concomitant with breaking messages down
and saving individual components (file attachments, etc.), and yes,
it would certainly be of little/no value against a whole message.

As for the MTA->dbmail-smtp interaction, I'm sure it depends on
what MTA you're using. Postfix hands the entire recipient address
list to to dbmail-smtp, so it will be able to take advantage of this.
From pipe(8) man page:

A command-line argument that contains ${recipient} expands
into as many command-line arguments as there are recipients.


Jesse


--
Jesse Norell
jesse (at) kci.net
Re: Current db schema & future improvements [ In reply to ]
Roel Rozendaal - IC&S wrote:
>...
> 4) duplicate messages on insertion
> This i don't see as a real option using checksums. It would be nice,
> but probably no two messages are exactly the same - deliver time,
> internal date and stuff will probably break the checksum giving only
> minimal storage benefits with a lot of program logic added.
> I'm not sure how the MTA calls upon dbmail-smtp, but calling it with all
> the receipients at once would enable storing the message once for
> multiple dbmail users - this would give the scenario of sending 10
> coworkers a 10M video a huge benefit.
>...

I see only one practical way to attain this, use the message body. This
would require abstracting messages from their headers. In other words,
each message would have a unique set of headers, but may share it's body
with another message. Attempting to split the body up into pieces along
MIME/etc. lines sounds like a fool's errand to me. I've dealt with code
to split/decode mail, and it was a mirky swamp.

I see this as a good middle ground solution, not too complicated, but
still useful.

Blake