Dear all,
First, I would like to congratulate you for working several years on
dbmail. After fetching source code from GIThub, I could count 46682
lines code in src and 152170 in total. Not bad!
For years, I have been looking for a reliable mail server relying on SQL
(i.e. PostgreSQL in my case) and I am happy I probably found it.
I would like to participate in coding and I have some questions.
Here are some questions:
1) SQL Table structure
Are there reasons for splitting the message table into several
sub-tables.
For example: dbmail_messages and dbmail_physmessage and
dbmail_messageblks
CREATE TABLE dbmail_messages
(
message_idnr bigint NOT NULL DEFAULT
nextval('dbmail_message_idnr_seq'::regclass),
mailbox_idnr bigint,
physmessage_id bigint,
seen_flag smallint NOT NULL DEFAULT 0::smallint,
answered_flag smallint NOT NULL DEFAULT 0::smallint,
deleted_flag smallint NOT NULL DEFAULT 0::smallint,
flagged_flag smallint NOT NULL DEFAULT 0::smallint,
recent_flag smallint NOT NULL DEFAULT 0::smallint,
draft_flag smallint NOT NULL DEFAULT 0::smallint,
unique_id character varying(70) NOT NULL,
status smallint NOT NULL DEFAULT 0::smallint,
CONSTRAINT dbmail_messages_pkey PRIMARY KEY (message_idnr),
CONSTRAINT dbmail_messages_mailbox_idnr_fkey FOREIGN KEY
(mailbox_idnr)
REFERENCES dbmail_mailboxes (mailbox_idnr) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT dbmail_messages_physmessage_id_fkey FOREIGN KEY
(physmessage_id)
REFERENCES dbmail_physmessage (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
)
WITH (
OIDS=FALSE
);
and
CREATE TABLE dbmail_physmessage
(
id bigint NOT NULL DEFAULT
nextval('dbmail_physmessage_id_seq'::regclass),
messagesize bigint NOT NULL DEFAULT 0::bigint,
rfcsize bigint NOT NULL DEFAULT 0::bigint,
internal_date timestamp without time zone,
CONSTRAINT dbmail_physmessage_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE dbmail_physmessage
OWNER TO dbmail;
and
CREATE TABLE dbmail_messageblks
(
messageblk_idnr bigint NOT NULL DEFAULT
nextval('dbmail_messageblk_idnr_seq'::regclass),
physmessage_id bigint,
messageblk bytea NOT NULL,
blocksize bigint NOT NULL DEFAULT 0::bigint,
is_header smallint NOT NULL DEFAULT 0::smallint,
CONSTRAINT dbmail_messageblks_pkey PRIMARY KEY (messageblk_idnr),
CONSTRAINT dbmail_messageblks_physmessage_id_fkey FOREIGN KEY
(physmessage_id)
REFERENCES dbmail_physmessage (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
)
WITH (
OIDS=FALSE
);
ALTER TABLE dbmail_messageblks
OWNER TO dbmail;
The only reasons I see for separating those tables would be storing them
in different tablespaces. But even this is questionable.
I am worried that this kind of separation might result on sequential
scans instead of memory and index scans. A JOIN without a reason is
always an expansive CPU task. Each row needs several CPU cycles.
Under modern systems like PostgreSQL, it seems more interesting to have
a single table with advanced indexing: partial indexes and/or full-text
seach (FTS) and/or materialized views. These indexes can be placed on a
different tablespace, i.e. an SSD drive. Rebuilding FTS happens in
background, so that inserts are fast. This is a much modern approach
that splitting a table in three.
Also, I am worried that running a lot of LEFT JOINs in the database
might open security breaches. This is the second issue also.
2) Privilege separation
I read the post on the WIKI with interest and would like to add this
information:
PostgreSQL supports CREATE SCHEMA:
http://www.postgresql.org/docs/9.4/static/sql-createschema.html
CREATE SCHEMA is compatible with the prefix of tables. But it is more
powerful as we can restrict a connection to a schema.
When connecting, you may restrict to a schema:
set search_path to 'schema'
You may also "glue" a user to a specific schema:
ALTER USER user_name SET search_path to 'schema'
Using schemas, it may be possible to create as many database structures
using an imap identifier (john@doe.fr).
This is to prevent the scope of any SQL injection and limit it to the
sole account of the user.
3) Database performance / SQL injection
Could you point me to any filter against database injection in the
source code. Where is it located in source code?
4) Performance / memory usage
I found this kind of code in dm_mailbox.c
c = db_con_get();
TRY
stmt = db_stmt_prepare(c,
"SELECT id,message_idnr FROM %sphysmessage p "
"LEFT JOIN %smessages m ON p.id=m.physmessage_id "
"LEFT JOIN %smailboxes b ON b.mailbox_idnr=m.mailbox_idnr "
"WHERE b.mailbox_idnr=? ORDER BY message_idnr",
DBPFX,DBPFX,DBPFX);
db_stmt_set_u64(stmt, 1, self->id);
r = db_stmt_query(stmt);
OK, this is only a prepare statement ... but
as I receive logs using mail, some of my mailboxes usually have more
than 100.000 message. I
This query results in a huge sequential scan, followed by an ORDER BY
statement and then is transferred to memory. I did not test with EXPLAIN
ANALYSE but it must be very CPU and memory expansive to run.
ids = p_list_new(self->pool);
while (db_result_next(r)) {
physid = db_result_get_u64(r,0);
msgid = db_result_get_u64(r,1);
if (g_tree_lookup(uids,&msgid)) {
id = mempool_pop(self->pool, sizeof(uint64_t));
*id = physid;
ids = p_list_append(ids,id);
}
}
For performance, there should be more LIMIT and OFFSET statements to
trim recordsets and avoid filling memory.
***
IMHO , DBmail is not usable in production:
I am worried that it contains source code that allows SQL injection and
gives access to all messages at once. Without SCHEMA support I will
probably not install DBmail.
Would you welcome a discussion followed by a patch/review for PostgreSQL
to make it work using SCHEMAs? It should not alter table structure.
Would you welcome a SCHEMA patch? What are your recommendations and
guidelines?
Kind regards,
Kellogs
_______________________________________________
Dbmail-dev mailing list
Dbmail-dev@dbmail.org
http://mailman.fastxs.nl/cgi-bin/mailman/listinfo/dbmail-dev
First, I would like to congratulate you for working several years on
dbmail. After fetching source code from GIThub, I could count 46682
lines code in src and 152170 in total. Not bad!
For years, I have been looking for a reliable mail server relying on SQL
(i.e. PostgreSQL in my case) and I am happy I probably found it.
I would like to participate in coding and I have some questions.
Here are some questions:
1) SQL Table structure
Are there reasons for splitting the message table into several
sub-tables.
For example: dbmail_messages and dbmail_physmessage and
dbmail_messageblks
CREATE TABLE dbmail_messages
(
message_idnr bigint NOT NULL DEFAULT
nextval('dbmail_message_idnr_seq'::regclass),
mailbox_idnr bigint,
physmessage_id bigint,
seen_flag smallint NOT NULL DEFAULT 0::smallint,
answered_flag smallint NOT NULL DEFAULT 0::smallint,
deleted_flag smallint NOT NULL DEFAULT 0::smallint,
flagged_flag smallint NOT NULL DEFAULT 0::smallint,
recent_flag smallint NOT NULL DEFAULT 0::smallint,
draft_flag smallint NOT NULL DEFAULT 0::smallint,
unique_id character varying(70) NOT NULL,
status smallint NOT NULL DEFAULT 0::smallint,
CONSTRAINT dbmail_messages_pkey PRIMARY KEY (message_idnr),
CONSTRAINT dbmail_messages_mailbox_idnr_fkey FOREIGN KEY
(mailbox_idnr)
REFERENCES dbmail_mailboxes (mailbox_idnr) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT dbmail_messages_physmessage_id_fkey FOREIGN KEY
(physmessage_id)
REFERENCES dbmail_physmessage (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
)
WITH (
OIDS=FALSE
);
and
CREATE TABLE dbmail_physmessage
(
id bigint NOT NULL DEFAULT
nextval('dbmail_physmessage_id_seq'::regclass),
messagesize bigint NOT NULL DEFAULT 0::bigint,
rfcsize bigint NOT NULL DEFAULT 0::bigint,
internal_date timestamp without time zone,
CONSTRAINT dbmail_physmessage_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE dbmail_physmessage
OWNER TO dbmail;
and
CREATE TABLE dbmail_messageblks
(
messageblk_idnr bigint NOT NULL DEFAULT
nextval('dbmail_messageblk_idnr_seq'::regclass),
physmessage_id bigint,
messageblk bytea NOT NULL,
blocksize bigint NOT NULL DEFAULT 0::bigint,
is_header smallint NOT NULL DEFAULT 0::smallint,
CONSTRAINT dbmail_messageblks_pkey PRIMARY KEY (messageblk_idnr),
CONSTRAINT dbmail_messageblks_physmessage_id_fkey FOREIGN KEY
(physmessage_id)
REFERENCES dbmail_physmessage (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
)
WITH (
OIDS=FALSE
);
ALTER TABLE dbmail_messageblks
OWNER TO dbmail;
The only reasons I see for separating those tables would be storing them
in different tablespaces. But even this is questionable.
I am worried that this kind of separation might result on sequential
scans instead of memory and index scans. A JOIN without a reason is
always an expansive CPU task. Each row needs several CPU cycles.
Under modern systems like PostgreSQL, it seems more interesting to have
a single table with advanced indexing: partial indexes and/or full-text
seach (FTS) and/or materialized views. These indexes can be placed on a
different tablespace, i.e. an SSD drive. Rebuilding FTS happens in
background, so that inserts are fast. This is a much modern approach
that splitting a table in three.
Also, I am worried that running a lot of LEFT JOINs in the database
might open security breaches. This is the second issue also.
2) Privilege separation
I read the post on the WIKI with interest and would like to add this
information:
PostgreSQL supports CREATE SCHEMA:
http://www.postgresql.org/docs/9.4/static/sql-createschema.html
CREATE SCHEMA is compatible with the prefix of tables. But it is more
powerful as we can restrict a connection to a schema.
When connecting, you may restrict to a schema:
set search_path to 'schema'
You may also "glue" a user to a specific schema:
ALTER USER user_name SET search_path to 'schema'
Using schemas, it may be possible to create as many database structures
using an imap identifier (john@doe.fr).
This is to prevent the scope of any SQL injection and limit it to the
sole account of the user.
3) Database performance / SQL injection
Could you point me to any filter against database injection in the
source code. Where is it located in source code?
4) Performance / memory usage
I found this kind of code in dm_mailbox.c
c = db_con_get();
TRY
stmt = db_stmt_prepare(c,
"SELECT id,message_idnr FROM %sphysmessage p "
"LEFT JOIN %smessages m ON p.id=m.physmessage_id "
"LEFT JOIN %smailboxes b ON b.mailbox_idnr=m.mailbox_idnr "
"WHERE b.mailbox_idnr=? ORDER BY message_idnr",
DBPFX,DBPFX,DBPFX);
db_stmt_set_u64(stmt, 1, self->id);
r = db_stmt_query(stmt);
OK, this is only a prepare statement ... but
as I receive logs using mail, some of my mailboxes usually have more
than 100.000 message. I
This query results in a huge sequential scan, followed by an ORDER BY
statement and then is transferred to memory. I did not test with EXPLAIN
ANALYSE but it must be very CPU and memory expansive to run.
ids = p_list_new(self->pool);
while (db_result_next(r)) {
physid = db_result_get_u64(r,0);
msgid = db_result_get_u64(r,1);
if (g_tree_lookup(uids,&msgid)) {
id = mempool_pop(self->pool, sizeof(uint64_t));
*id = physid;
ids = p_list_append(ids,id);
}
}
For performance, there should be more LIMIT and OFFSET statements to
trim recordsets and avoid filling memory.
***
IMHO , DBmail is not usable in production:
I am worried that it contains source code that allows SQL injection and
gives access to all messages at once. Without SCHEMA support I will
probably not install DBmail.
Would you welcome a discussion followed by a patch/review for PostgreSQL
to make it work using SCHEMAs? It should not alter table structure.
Would you welcome a SCHEMA patch? What are your recommendations and
guidelines?
Kind regards,
Kellogs
_______________________________________________
Dbmail-dev mailing list
Dbmail-dev@dbmail.org
http://mailman.fastxs.nl/cgi-bin/mailman/listinfo/dbmail-dev