Mailing List Archive: db schema changes discussion

db schema changes discussion

Jul 7, 2003, 9:59 AM

Post #1 of 17 (2904 views)

Hello,

(moving this to dbmail-dev)

> My vision here is a "fastheaders" table which has a message number and
> half a dozen columns defined. On some regular basis, or upon message
> insertion, the messageblks table is scanned and the first entry of each
> new message number, which is the header, is parsed for these fast headers
> and they are stored for future searches and quick/short header listings.

An idea we've started (not completed) implimenting in weDBmail
along these lines is dynamically parsing/caching the headers. Any
time the message headers are requested, it'll check the "header cache"
table and use entries if found, but if not, it'll parse the headers
from messageblks and use the results while saving appropriate ones in
the cache. Could make the pop3 and imap servers do that as well.

--
Jesse Norell
jesse (at) kci.net

Re: db schema changes discussion [ In reply to ]

Magnus.Sundberg at dican

Jul 8, 2003, 2:15 AM

Post #2 of 17 (2882 views)

Permalink

Hello,
I patched my imap server (uw :-() to log the requests from my
mail client (Mozilla 1.2.1).

/Magnus

The mozilla client generated the FETCH requests below

FETCH (FLAGS)
FETCH (FLAGS)
FETCH (UID RFC822.SIZE BODY[])
FETCH (UID RFC822.SIZE BODY[])
FETCH (UID RFC822.SIZE BODY[])
FETCH (BODYSTRUCTURE)
FETCH (BODY[HEADER] BODY[1.MIME] BODY[2.MIME])
FETCH (BODY[1])
FETCH (FLAGS)
FETCH (FLAGS)
FETCH (UID RFC822.SIZE FLAGS BODY.PEEK[.HEADER.FIELDS (From To Cc
Subject Date Message-ID Priority X-Priority References Newsgroups
In-Reply-To)])
FETCH (FLAGS)
FETCH (UID RFC822.SIZE FLAGS BODY.PEEK[.HEADER.FIELDS (From To Cc
Subject Date Message-ID Priority X-Priority References Newsgroups
In-Reply-To)])
FETCH (FLAGS)
FETCH (UID RFC822.SIZE BODY[])
FETCH (FLAGS)
FETCH (FLAGS)
FETCH (UID RFC822.SIZE FLAGS BODY.PEEK[.HEADER.FIELDS (From To Cc
Subject Date Message-ID Priority X-Priority References Newsgroups
In-Reply-To)])
FETCH (UID RFC822.SIZE FLAGS BODY.PEEK[.HEADER.FIELDS (From To Cc
Subject Date Message-ID Priority X-Priority References Newsgroups
In-Reply-To)])
FETCH (FLAGS)
FETCH (UID RFC822.SIZE BODY[])
FETCH (UID RFC822.SIZE BODY[])
FETCH (UID RFC822.SIZE BODY[])
FETCH (FLAGS)
FETCH (UID RFC822.SIZE FLAGS BODY.PEEK[.HEADER.FIELDS (From To Cc
Subject Date Message-ID Priority X-Priority References Newsgroups
In-Reply-To)])

Jesse Norell wrote:
> Hello,
>
> (moving this to dbmail-dev)
>
>
>>My vision here is a "fastheaders" table which has a message number and
>>half a dozen columns defined. On some regular basis, or upon message
>>insertion, the messageblks table is scanned and the first entry of each
>>new message number, which is the header, is parsed for these fast headers
>>and they are stored for future searches and quick/short header listings.
>
>
> An idea we've started (not completed) implimenting in weDBmail
> along these lines is dynamically parsing/caching the headers. Any
> time the message headers are requested, it'll check the "header cache"
> table and use entries if found, but if not, it'll parse the headers
> from messageblks and use the results while saving appropriate ones in
> the cache. Could make the pop3 and imap servers do that as well.
>
>
>
>
> --
> Jesse Norell
> jesse (at) kci.net
>
>
> _______________________________________________
> Dbmail-dev mailing list
> Dbmail-dev@dbmail.org
> http://twister.fastxs.net/mailman/listinfo/dbmail-dev
>

Re: db schema changes discussion [ In reply to ]

Magnus.Sundberg at dican

Jul 8, 2003, 2:49 AM

Post #3 of 17 (2874 views)

Permalink

Hello again,
You know, you can turn on quite heavy logging for the mozilla
imap client.
This is just a short log from my surfing my inbox and sending a
small mail to myself (That is copied to the sent folder).

To get the imap commands use "grep SendData imaplog.txt"

I am sorry, but I can not supply you with Outlook operations,
since I do not approve of outlook usage in my company network for
security reasons.

/Magnus

Jesse Norell wrote:
> Hello,
>
> (moving this to dbmail-dev)
>
>
>>My vision here is a "fastheaders" table which has a message number and
>>half a dozen columns defined. On some regular basis, or upon message
>>insertion, the messageblks table is scanned and the first entry of each
>>new message number, which is the header, is parsed for these fast headers
>>and they are stored for future searches and quick/short header listings.
>
>
> An idea we've started (not completed) implimenting in weDBmail
> along these lines is dynamically parsing/caching the headers. Any
> time the message headers are requested, it'll check the "header cache"
> table and use entries if found, but if not, it'll parse the headers
> from messageblks and use the results while saving appropriate ones in
> the cache. Could make the pop3 and imap servers do that as well.
>
>
>
>
> --
> Jesse Norell
> jesse (at) kci.net
>
>
> _______________________________________________
> Dbmail-dev mailing list
> Dbmail-dev@dbmail.org
> http://twister.fastxs.net/mailman/listinfo/dbmail-dev
>

Re: db schema changes discussion [ In reply to ]

Magnus.Sundberg at dican

Jul 8, 2003, 3:44 AM

Post #4 of 17 (2871 views)

Permalink

Hi again,
I have done some more thinking about the database structure.
I am not sure of the performance gain with a fast header table
structure.
This would be ideal with a single table and one VARCHAR column
for each _interesting_ header. How do we know what the clients
think are interesting headers?
The ideal world would be SQL statements like
SELECT subject, from, To FROM fastheadertable WHERE mailbox=

The question is, how many of these multiple selects do the imap
clients issue?

WE DO GAIN if the clients very often request headers from
multiple mail in the mailbox, but do the clients make these
requests often? Mozilla only do this when the mailbox cache is
non existant.

I beleive in splitting the message into two parts, header and body.

What is the mail server performance loss to do something like
1. SELECT header FROM headers WHERE mailbox=
2. With some carefully written C code extract the interesting
headers.
Operation 2 can be distributed to another server.

This last option shuffles more data, but there is only one select
statement, as well as in the first example.
With the last option we do not need to do any more client request
logging :-).

A header, might be around 4 kbyte. How much does this cost?

I know there are people with huge mail boxes.

Well, any comments?

/Magnus

P.S. Why is a message split into multiple message blocks? Is this
because of some limitations in the SQL servers?

Jesse Norell wrote:
> Hello,
>
> (moving this to dbmail-dev)
>
>
>>My vision here is a "fastheaders" table which has a message number and
>>half a dozen columns defined. On some regular basis, or upon message
>>insertion, the messageblks table is scanned and the first entry of each
>>new message number, which is the header, is parsed for these fast headers
>>and they are stored for future searches and quick/short header listings.
>
>
> An idea we've started (not completed) implimenting in weDBmail
> along these lines is dynamically parsing/caching the headers. Any
> time the message headers are requested, it'll check the "header cache"
> table and use entries if found, but if not, it'll parse the headers
> from messageblks and use the results while saving appropriate ones in
> the cache. Could make the pop3 and imap servers do that as well.
>
>
>
>
> --
> Jesse Norell
> jesse (at) kci.net
>
>
> _______________________________________________
> Dbmail-dev mailing list
> Dbmail-dev@dbmail.org
> http://twister.fastxs.net/mailman/listinfo/dbmail-dev
>

Re: db schema changes discussion [ In reply to ]

matthew at zeut

Jul 8, 2003, 6:54 AM

Post #5 of 17 (2882 views)

Permalink

On Tue, 2003-07-08 at 06:44, Magnus Sundberg wrote:
> Hi again,
> I have done some more thinking about the database structure.
> I am not sure of the performance gain with a fast header table
> structure.
> This would be ideal with a single table and one VARCHAR column
> for each _interesting_ header. How do we know what the clients
> think are interesting headers?

This is why I suggested the value pair design, since we won't always
know what's interesting.

> The ideal world would be SQL statements like
> SELECT subject, from, To FROM fastheadertable WHERE mailbox=
>
> The question is, how many of these multiple selects do the imap
> clients issue?

I have no idea, unfortunately, I don't really know how the imap clients
really work.

> WE DO GAIN if the clients very often request headers from
> multiple mail in the mailbox, but do the clients make these
> requests often? Mozilla only do this when the mailbox cache is
> non existant.
>
> I beleive in splitting the message into two parts, header and body.
>
> What is the mail server performance loss to do something like
> 1. SELECT header FROM headers WHERE mailbox=
> 2. With some carefully written C code extract the interesting
> headers.
> Operation 2 can be distributed to another server.
>
> This last option shuffles more data, but there is only one select
> statement, as well as in the first example.
> With the last option we do not need to do any more client request
> logging :-).
>
> A header, might be around 4 kbyte. How much does this cost?
>
> I know there are people with huge mail boxes.
>
> Well, any comments?

My real question is what queries are being done by the server. When I
use evolution and ask it to search the message body for a piece of text,
does it send this request to the server? Or does it only search what is
cached.

Also, IMAP clients are not all we care about. We also have POP3 (less
interesting) and direct clients that bypass the IMAP interface, querying
the database themselves. I know several web apps do this already.
Also, there is talk extending dbmail to include more exchange type
functionality and implementing a SOAP interface, thus I think it's a bit
foolish to optimize the design solely around IMAP.

I think dbmail is great, and really has the potential to evolve into a
killer groupware server. The best part is it already works well for
what it does, it's not some pie in the sky project that is trying to do
everything all at once.

> /Magnus
>
> P.S. Why is a message split into multiple message blocks? Is this
> because of some limitations in the SQL servers?

I was told it is due to a limitation of MySQL but also due to the fact
that the headers are stored in the first block, which helps to make
header lookups faster.

> Jesse Norell wrote:
> > Hello,
> >
> > (moving this to dbmail-dev)
> >
> >
> >>My vision here is a "fastheaders" table which has a message number and
> >>half a dozen columns defined. On some regular basis, or upon message
> >>insertion, the messageblks table is scanned and the first entry of each
> >>new message number, which is the header, is parsed for these fast headers
> >>and they are stored for future searches and quick/short header listings.
> >
> >
> > An idea we've started (not completed) implimenting in weDBmail
> > along these lines is dynamically parsing/caching the headers. Any
> > time the message headers are requested, it'll check the "header cache"
> > table and use entries if found, but if not, it'll parse the headers
> > from messageblks and use the results while saving appropriate ones in
> > the cache. Could make the pop3 and imap servers do that as well.
> >
> >
> >
> >
> > --
> > Jesse Norell
> > jesse (at) kci.net
> >
> >
> > _______________________________________________
> > Dbmail-dev mailing list
> > Dbmail-dev@dbmail.org
> > http://twister.fastxs.net/mailman/listinfo/dbmail-dev
> >
>
>
>
>
> _______________________________________________
> Dbmail-dev mailing list
> Dbmail-dev@dbmail.org
> http://twister.fastxs.net/mailman/listinfo/dbmail-dev
>

Re: db schema changes discussion [ In reply to ]

matthew at zeut

Jul 8, 2003, 7:04 AM

Post #6 of 17 (2878 views)

Permalink

On Mon, 2003-07-07 at 12:59, Jesse Norell wrote:
> > My vision here is a "fastheaders" table which has a message number and
> > half a dozen columns defined. On some regular basis, or upon message
> > insertion, the messageblks table is scanned and the first entry of each
> > new message number, which is the header, is parsed for these fast headers
> > and they are stored for future searches and quick/short header listings.
>
> An idea we've started (not completed) implimenting in weDBmail
> along these lines is dynamically parsing/caching the headers. Any
> time the message headers are requested, it'll check the "header cache"
> table and use entries if found, but if not, it'll parse the headers
> from messageblks and use the results while saving appropriate ones in
> the cache. Could make the pop3 and imap servers do that as well.

Could you expand a little on what you mean by "dynamically
parsing/caching the headers" do you mean writing it to the database,
keeping them in application memory? etc... Would be interesting to hear
this flushed out a little.

The idea of doing this at message header request time somewhat defeats
the purpose. I think doing this work at message delivery time is
optimal. Yes it adds overhead, but messages arrive in a fairly steady
stream thus distributing this load over time as opposed to having to do
all this work while the user is waiting. A little work at message
delivery time to get get the data into an "optimal" state (still to be
determined what optimal really means) will certainly help response
times, perhaps with a slight increase in overall server load as a
consequence.

Re: db schema changes discussion [ In reply to ]

jesse at kci

Jul 8, 2003, 8:18 AM

Post #7 of 17 (2881 views)

Permalink

> > This would be ideal with a single table and one VARCHAR column
> > for each _interesting_ header. How do we know what the clients
> > think are interesting headers?
>
> This is why I suggested the value pair design, since we won't always
> know what's interesting.

Yep. We need to be able to specify what it interesting (either
at compile time or preferrably in config file), which could
potentially vary from site to site, and also be able to turn it
off completely for performance (eg. on a pop3-only site).

> > WE DO GAIN if the clients very often request headers from
> > multiple mail in the mailbox, but do the clients make these
> > requests often? Mozilla only do this when the mailbox cache is
> > non existant.
> >
> > I beleive in splitting the message into two parts, header and body.

Yeah, either in seperate tables or as it is now, but add a flag
for the blocks which contain headers. I don't know what would be
faster, of if there'd really be a huge difference. Also, another
table for the (optional) common message headers.

> My real question is what queries are being done by the server. When I
> use evolution and ask it to search the message body for a piece of text,
> does it send this request to the server? Or does it only search what is
> cached.

Along with "what queries are being done by the server," make sure
we also look at whether these queries make sense. Ie. after looking
at common requests for IMAP clients, can the current queries done by
the server be improved upon?

I'm not at all familiar with IMAP either, but I know from previous
discussions, the header caching that would be done is to handle the
most common cases (eg. searches within From/To/Subject), but not 100%
of them (eg. full message searches). And it will be even more of a
help to direct access webmail applications (which typically read,
parse, and display From, Subject, Date, etc. every time you look at
the mailbox).

[From another message]
> > An idea we've started (not completed) implimenting in weDBmail
> > along these lines is dynamically parsing/caching the headers. Any
> > time the message headers are requested, it'll check the "header cache"
> > table and use entries if found, but if not, it'll parse the headers
> > from messageblks and use the results while saving appropriate ones in
> > the cache. Could make the pop3 and imap servers do that as well.
>
> Could you expand a little on what you mean by "dynamically
> parsing/caching the headers" do you mean writing it to the
> database,
> keeping them in application memory? etc... Would be interesting to hear
> this flushed out a little.

This is basically a workaround/workalike for what dbmail will
eventually do itsself. Just a seperate table along the lines of
(message_indr, hdr_name, header). In weDBMail we make large use of
From, To, Subject and Date headers, and have to parse it in php
for every message display, every folder view, etc. Instead of the
current {read message header, parse, use data} we'll have a process
of {check header cache table, read message header, parse, save in
header cache table, use data} for a non-cached message, and a
process of {check header cache table, use data} for cached messages.

> The idea of doing this at message header request time somewhat defeats
> the purpose. I think doing this work at message delivery time is
> optimal. Yes it adds overhead, but messages arrive in a fairly steady
> stream thus distributing this load over time as opposed to having to do
> all this work while the user is waiting. A little work at message
> delivery time to get get the data into an "optimal" state (still to be
> determined what optimal really means) will certainly help response
> times, perhaps with a slight increase in overall server load as a
> consequence.

Right - we're working on the dynamic caching because with us being
too lazy to change dbmail to handle it itsself, that's the only
practical option available. The only reason to add that to the pop3/
imap daemons would be for seemless migration (all new messages get
cached at injection time, old messages happen as needed). It may
be easier just to write that into dbmail-maintenance though, and
when you update from the old non-header-caching setup to the new,
you just have to run dbmail-maintenance to fix things up.

One other thought; I'm not recommending the approach for splitting
message components (eg. file attachments) and saving as discrete
components, as I think the work involved and complexity introduced
may not be worth the benefits, but one of the benefits not yet
mentioned could be less storage requirements for duplicate
attachments. As I watched my wife enjoy a flash application that
was emailed to her, and knew I'd seen it in her inbox at least once
in the past, I thought, "we could save the md5 checksum of each
decoded message component and would only have to store a single copy
of any given file within the entire mail spool!" That would
complicate matters even more for proper message reconstruction, but
is not entirely without appeal.

Jesse

--
Jesse Norell
jesse (at) kci.net

Re: db schema changes discussion [ In reply to ]

matthew at zeut

Jul 8, 2003, 8:34 AM

Post #8 of 17 (2869 views)

Permalink

On Tue, 2003-07-08 at 11:18, Jesse Norell wrote:
> One other thought; I'm not recommending the approach for splitting
> message components (eg. file attachments) and saving as discrete
> components, as I think the work involved and complexity introduced
> may not be worth the benefits, but one of the benefits not yet
> mentioned could be less storage requirements for duplicate
> attachments. As I watched my wife enjoy a flash application that
> was emailed to her, and knew I'd seen it in her inbox at least once
> in the past, I thought, "we could save the md5 checksum of each
> decoded message component and would only have to store a single copy
> of any given file within the entire mail spool!" That would
> complicate matters even more for proper message reconstruction, but
> is not entirely without appeal.

I agree this would be good to do and it's one of the things I think
exchange does well (I think...) if I sent a 10M video to 20 coworkers
the total data store on exchange only increases 10M, not 200M. I don't
think we need to store anything outside the database to do this (not
sure if that is what you were saying).

Question: Right now in dbmail, if I copy a message from my inbox to a
saved folder, does dbmail also copy all of the message_blks? Or does it
just make a new entry in the messages table that is also referenced by
by the message_blks? If not, I think this would be easy to do, and
since even moving a message in IMAP is actually a copy (I think), this
is probably a worthwhile optimization.

Matthew

Re: db schema changes discussion [ In reply to ]

Magnus.Sundberg at dican

Jul 8, 2003, 8:47 AM

Post #9 of 17 (2880 views)

Permalink

Matthew T. O'Connor wrote:
> On Tue, 2003-07-08 at 06:44, Magnus Sundberg wrote:
>
>>Hi again,
>>I have done some more thinking about the database structure.
>>I am not sure of the performance gain with a fast header table
>>structure.
>>This would be ideal with a single table and one VARCHAR column
>>for each _interesting_ header. How do we know what the clients
>>think are interesting headers?
>
>
> This is why I suggested the value pair design, since we won't always
> know what's interesting.
>
Do you have any ideas of how to create an index, to get good
request performance?
something like
CREATE TABLE interesting_headers (
header_id BIGINT DEFAULT '0' NOT NULL AUTO_INCREMENT,
message_idnr BIGINT DEFAULT '0' NOT NULL,
header_name VARCHAR(32),
header_content VARCHAR(255),
UNIQUE(message_idnr, header_name)
);

I beleive in that the interesting headers are defined in the
configuration file. But remember, you need to recreate the fast
header file every time you add another header to be stored in the
fast header table.

We could also add something to dbmail, so that dbmail keeps
statistics about which headers that are requested.

I actually beleive that these statistics could be collected and
then just stored into another table in the database, once an hour
or so, just to get it nonvolatile.

>
> Also, IMAP clients are not all we care about. We also have POP3 (less
> interesting) and direct clients that bypass the IMAP interface, querying
> the database themselves. I know several web apps do this already.
> Also, there is talk extending dbmail to include more exchange type
> functionality and implementing a SOAP interface, thus I think it's a bit
> foolish to optimize the design solely around IMAP.
>

I beleive that IMAP is the most important application, with a
well defined standard.
I think the IMAP way of operation is typical for a mail client, I
mean the data flow and the requests that are done between the
server and the client, wether it is SQL, IMAP or POP3.
I think the authors of the direct clients can come up with their
requests for indexes and database structure.

/Magnus

Re: db schema changes discussion [ In reply to ]

jesse at kci

Jul 8, 2003, 8:57 AM

Post #10 of 17 (2881 views)

Permalink

> On Tue, 2003-07-08 at 11:18, Jesse Norell wrote:
> > One other thought; I'm not recommending the approach for splitting
> > message components (eg. file attachments) and saving as discrete
> > components, as I think the work involved and complexity introduced
> > may not be worth the benefits, but one of the benefits not yet
> > mentioned could be less storage requirements for duplicate
> > attachments. As I watched my wife enjoy a flash application that
> > was emailed to her, and knew I'd seen it in her inbox at least once
> > in the past, I thought, "we could save the md5 checksum of each
> > decoded message component and would only have to store a single copy
> > of any given file within the entire mail spool!" That would
> > complicate matters even more for proper message reconstruction, but
> > is not entirely without appeal.
>
> I agree this would be good to do and it's one of the things I think
> exchange does well (I think...) if I sent a 10M video to 20 coworkers
> the total data store on exchange only increases 10M, not 200M. I don't

That scenario could/will(?) be handled by shared message blocks,
which has been discussed. I was thinking even further than that,
in which seperate emails sent to seperate users at different times
will still share the same storage space, if the decoded message
checksum is the same (ie. it is the exact same file).

> think we need to store anything outside the database to do this (not
> sure if that is what you were saying).

No, more what exactly were the mime part boundaries, encoding
scheme, etc. - to reproduce the exact original message, you'd have
to store all the individual mime headers for all the individual
messages and components, and be able to re-encode the attachment
in the poper form when supplying it to the client. You'd also have
to look at how many levels deep of decoding to do, etc. It all
gets pretty complex to be able to reproduce exactly the original
message but still get the advantage of shared message components
when they are identical. It's more a "neat idea" than something
I expect would be done. :)

> Question: Right now in dbmail, if I copy a message from my inbox to a
> saved folder, does dbmail also copy all of the message_blks? Or does it
> just make a new entry in the messages table that is also referenced by
> by the message_blks? If not, I think this would be easy to do, and
> since even moving a message in IMAP is actually a copy (I think), this
> is probably a worthwhile optimization.

I believe this is currently the inefficient copy-everything
approach. If you move a message from one folder to another, I don't
know what it does, it may just update the mailbox_idnr in the
messages table entry. Search for shared folder discussions, as this
is planned on being addressed (and thought to be doable within the
current db schema).

--
Jesse Norell
jesse (at) kci.net

Re: db schema changes discussion [ In reply to ]

jesse at kci

Jul 8, 2003, 9:06 AM

Post #11 of 17 (2887 views)

Permalink

> I believe this is currently the inefficient copy-everything
> approach. If you move a message from one folder to another, I don't
> know what it does, it may just update the mailbox_idnr in the
> messages table entry. Search for shared folder discussions, as this
> is planned on being addressed (and thought to be doable within the
> current db schema).

Sorry, try searching for shared message blocks - I'm confusing
the two. Shared folders is similar, and has been discussed at the
same time, though. Shared message blocks would be what handles the
scenarios of copying a message from one folder to another and of one
message being delivered to multiple recipients. You have only one
set of message_blks entries, and multiple messages entries pointing
to them (which cannot be done with the current schema). Shared
folders is multiple people having access to the messages stored in
a folder, and that can be done in the current schema (with changes to
the applications).

--
Jesse Norell
jesse (at) kci.net

Re: db schema changes discussion [ In reply to ]

Magnus.Sundberg at dican

Jul 8, 2003, 9:23 AM

Post #12 of 17 (2874 views)

Permalink

Matthew T. O'Connor wrote:
> On Tue, 2003-07-08 at 11:18, Jesse Norell wrote:
>
>> One other thought; I'm not recommending the approach for splitting
>>message components (eg. file attachments) and saving as discrete
>>components, as I think the work involved and complexity introduced
>>may not be worth the benefits, but one of the benefits not yet
>>mentioned could be less storage requirements for duplicate
>>attachments. As I watched my wife enjoy a flash application that
>>was emailed to her, and knew I'd seen it in her inbox at least once
>>in the past, I thought, "we could save the md5 checksum of each
>>decoded message component and would only have to store a single copy
>>of any given file within the entire mail spool!" That would
>>complicate matters even more for proper message reconstruction, but
>>is not entirely without appeal.
>
>
> I agree this would be good to do and it's one of the things I think
> exchange does well (I think...) if I sent a 10M video to 20 coworkers
> the total data store on exchange only increases 10M, not 200M. I don't
> think we need to store anything outside the database to do this (not
> sure if that is what you were saying).
>
> Question: Right now in dbmail, if I copy a message from my inbox to a
> saved folder, does dbmail also copy all of the message_blks? Or does it
> just make a new entry in the messages table that is also referenced by
> by the message_blks? If not, I think this would be easy to do, and
> since even moving a message in IMAP is actually a copy (I think), this
> is probably a worthwhile optimization.

The database size will increase with the size of your copy.
There is nothing in the database structure that supports multiple
entries of a single message.
This also impairs the performance of the IMAP coy command, as
perviously has been stated on this list.

I actually do not know where the soft spot between disk usage and
CPU usage is, you know,
The flow would be something like this:
1 Split the message
2 Calculate the MD5 hash for each part GREATER than 10kbyte (TBD)
(This way we get NULL in the MD5 index for small chunks and
the select speed will probably improve)
3 Store the unique parts.

In organisations where people store their mail in a sent folder
and then distribute the message to a lot of people in their
organisation, you will certainly save disk space, certainly when
they are attaching Word documents.
But as Jesse stated during the time I was writing this mail, it
is a nice feature, but quite troublesome to implement.

What are we aiming for
- Minimal disk space
- Minimum CPU usage, this can actually be done with the scheme
above, depending on the organization mail usage
- Minimum problems during the development phase

/Magnus

>
> Matthew
>
> _______________________________________________
> Dbmail-dev mailing list
> Dbmail-dev@dbmail.org
> http://twister.fastxs.net/mailman/listinfo/dbmail-dev
>

Re: db schema changes discussion [ In reply to ]

jesse at kci

Jul 8, 2003, 9:36 AM

Post #13 of 17 (2883 views)

Permalink

> Along with "what queries are being done by the server," make sure
> we also look at whether these queries make sense. Ie. after looking
> at common requests for IMAP clients, can the current queries done by
> the server be improved upon?

Another thing I forgot to mention; along with optimizing the
queries, etc. where appropriate, it almost goes without saying that
there would be optimized indexes, but it would be good if a note
could be made of what applications the indexes speed up, ie. the
pop3 server, the imap server, and maybe even include a file with
"other indexes" appropriate if you're running application X. That
would just make it easier for people to determine what indexes are
appropriate for their site (eg. we don't run IMAP here, so most of
the indexes recommended in the past are of no use to us - but they
are, of course, to other people).

--
Jesse Norell
jesse (at) kci.net

Re: db schema changes discussion [ In reply to ]

jesse at kci

Jul 8, 2003, 9:40 AM

Post #14 of 17 (2882 views)

Permalink

> Right - we're working on the dynamic caching because with us being
> too lazy to change dbmail to handle it itsself, that's the only
> practical option available. The only reason to add that to the pop3/
> imap daemons would be for seemless migration (all new messages get
> cached at injection time, old messages happen as needed). It may
> be easier just to write that into dbmail-maintenance though, and
> when you update from the old non-header-caching setup to the new,
> you just have to run dbmail-maintenance to fix things up.

Adding that support to dbmail-maintenance is probably not a bad
idea. Put the list of "interesting" headers in dbmail.conf and run
the maintenance program to bring the header cache up to date. Then
if you change your list of what is interesting, just run
dbmail-maintenance again (to both add in new and delete out old
headers).

--
Jesse Norell
jesse (at) kci.net

Re: db schema changes discussion [ In reply to ]

rbutler at adiis

Jul 8, 2003, 9:57 AM

Post #15 of 17 (2880 views)

Permalink

On Tue, 2003-07-08 at 08:54, Matthew T. O'Connor wrote:

> I have no idea, unfortunately, I don't really know how the imap clients
> really work.
>

> My real question is what queries are being done by the server. When I
> use evolution and ask it to search the message body for a piece of text,
> does it send this request to the server? Or does it only search what is
> cached.
>

Here's a neat little evolution trick:

evolution <=1.3

open a shell.
export CAMEL_VERBOSE_DEBUG=1
evolution-mail &
evolution &

evolution >= 1.4

open a shell
export CAMEL_VERBOSE_DEBUG=1
evolution &

This will load evolution, and if you look in the terminal window you
will see all the imap traffic generated. To answer your question, if
its not in its cache of searches (starts with no entries in the cache)
it sends it to the server to be searched.

> Also, IMAP clients are not all we care about. We also have POP3 (less
> interesting) and direct clients that bypass the IMAP interface, querying
> the database themselves. I know several web apps do this already.
> Also, there is talk extending dbmail to include more exchange type
> functionality and implementing a SOAP interface, thus I think it's a bit
> foolish to optimize the design solely around IMAP.
>

Expanding exchange like features into dbmail in my opinion should be an
add on product that uses the soap interface, or a forked project. I
work for an ISP and we have centered our whole mail system around
dbmail, and I really don't need the group functions of exchange for our
offering, and would prefer it not to be bloated up with them. We chose
dbmail as an email solution.

That being said, a separate add on, fork, or configuration directives
disabling these functions would be good solutions.

> I think dbmail is great, and really has the potential to evolve into a
> killer groupware server. The best part is it already works well for
> what it does, it's not some pie in the sky project that is trying to do
> everything all at once.
>
> > /Magnus
> >
>

--
Ryan Butler <rbutler@adiis.net>
ADI Internet Solutions

Re: db schema changes discussion [ In reply to ]

matthew at zeut

Jul 8, 2003, 10:05 AM

Post #16 of 17 (2883 views)

Permalink

From: "Jesse Norell" <jesse@kci.net>
> > Right - we're working on the dynamic caching because with us being
> > too lazy to change dbmail to handle it itsself, that's the only
> > practical option available. The only reason to add that to the pop3/
> > imap daemons would be for seemless migration (all new messages get
> > cached at injection time, old messages happen as needed). It may
> > be easier just to write that into dbmail-maintenance though, and
> > when you update from the old non-header-caching setup to the new,
> > you just have to run dbmail-maintenance to fix things up.
>
> Adding that support to dbmail-maintenance is probably not a bad
> idea. Put the list of "interesting" headers in dbmail.conf and run
> the maintenance program to bring the header cache up to date. Then
> if you change your list of what is interesting, just run
> dbmail-maintenance again (to both add in new and delete out old
> headers).

What is the advantage of adding this list to dbmail.conf rather than to a
"interesting_headers" lookup table that is inside the database. Putting it
in the database has several advantages: a) an application that has access to
the database but not to the dbmail.conf file can see the list. b) the list
can be joined with tables and used to enforce integrity checks etc... c)
easily updated via sql.

I think dbmail.conf should be used for external settings such as database
name, port number etc etc etc... I think settings that effect the internal
structure of the application should be stored in the database.

Re: db schema changes discussion [ In reply to ]

matthew at zeut

Jul 8, 2003, 10:14 AM

Post #17 of 17 (2883 views)

Permalink

From: "Magnus Sundberg" <Magnus.Sundberg@dican.se>
> Matthew T. O'Connor wrote:
> > This is why I suggested the value pair design, since we won't always
> > know what's interesting.
> >
> Do you have any ideas of how to create an index, to get good
> request performance?
> something like
> CREATE TABLE interesting_headers (
> header_id BIGINT DEFAULT '0' NOT NULL AUTO_INCREMENT,
> message_idnr BIGINT DEFAULT '0' NOT NULL,
> header_name VARCHAR(32),
> header_content VARCHAR(255),
> UNIQUE(message_idnr, header_name)
> );
>
> I beleive in that the interesting headers are defined in the
> configuration file. But remember, you need to recreate the fast
> header file every time you add another header to be stored in the
> fast header table.

one reason to keep the list of "interesting headers" in a table rather than
in dbmail.conf is so that the interesting_headers table you defined above
would have the header_name be an int that references a primary key in the
intereesting_headers_list table. This would reduce the size of the value
pair table, and speed queries to it.

As for indexes on the table, the one you defined is probably the best for
general purpose queries. But it might be better with two separate indexes.
The optimizations will be query specific, and probably somewhat database
specific (mysql vs. postgresql).

> We could also add something to dbmail, so that dbmail keeps
> statistics about which headers that are requested.

This would certainly be good for a research tool, not sure if it would be
usefull in the wild, unless it would automatically update the list of
"interesting headers".

> > Also, IMAP clients are not all we care about. We also have POP3 (less
> > interesting) and direct clients that bypass the IMAP interface, querying
> > the database themselves. I know several web apps do this already.
> > Also, there is talk extending dbmail to include more exchange type
> > functionality and implementing a SOAP interface, thus I think it's a bit
> > foolish to optimize the design solely around IMAP.
>
> I beleive that IMAP is the most important application, with a
> well defined standard.
> I think the IMAP way of operation is typical for a mail client, I
> mean the data flow and the requests that are done between the
> server and the client, wether it is SQL, IMAP or POP3.
> I think the authors of the direct clients can come up with their
> requests for indexes and database structure.

Probably true, I just wanted people to remember that IMAP is not everything,
but as you said, is probably the most relevant, certainly the largest
customer.

Mailing List Archive

Mailing List Archive

Attached Files: