Am 2022-05-13 08:37, schrieb bugzilla-daemon@spamassassin.apache.org:
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
>
> Henrik Krohns <apache@hege.li> changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> Status|NEW |RESOLVED
> Resolution|--- |FIXED
>
> --- Comment #9 from Henrik Krohns <apache@hege.li> ---
> Feel free to discuss here or on dev-list if something is still unclear.
First of all, I would like to thank Henrik for all the work he has done
on SpamAssassin. The further I get with my code review, the more I see
all the places where he has made minor or major changes that have made
the code faster, more readable and thus more maintainable.
Despite all these changes, however, I still see room for improvement. At
the moment I'm not very happy about how the do_meta_tests subroutine is
implemented. It seems to work now, but the query for the dependency of
the meta rules looks too complicated to me:
foreach my $r (@{$md->{$rulename}|[]}) {
next RULE unless exists $h->{$r} && !$tp->{$r} && !$pl{$r};
}
Instead of three dependencies, there should really only be one query at
this point for "is the rule ready or not". If we knew exactly the state
of a rule at each point in time, then it would also be possible to move
from a brute force algorithm to a deterministic algorithm, where a rule
that becomes ready automatically sets all meta rules to ready if that
rule was the last dependency analogous to the algorithm for tags.
do_meta_tests would then simply process a queue of ready meta rules. If
a new meta rule is ready it will be added to the end of the queue. If
the queue is empty, all rules of a priority class are processed.
The next possible step would be the implementation of short-circuit
behavior of && and || analogous to Perl itself. E.g. the rule
meta BITCOIN_SPAM_10 __BITCOIN_ID && ( HTML_IMAGE_ONLY_04 ||
HTML_IMAGE_ONLY_08 )
would immediately evaluate to false if __BITCOIN_ID is false.
HTML_IMAGE_ONLY_04 and HTML_IMAGE_ONLY_08 would then no longer need to
be evaluated.
Recently the question was asked why Check.pm is a plugin if it is not
optional. Check.pm is a plugin so that you can implement more than one
check plugin. Currently SpamAssassin still assumes that filtering
happens postqueue, where you can respond to the different wishes of
individual users. If you use SpamAssassin in a prequeue filtering, then
only the decision rejection or acceptance is possible for ALL
recipients. A consideration of different recipient wishes is not
possible.
Due to the possibility to switch between admin and user rules per
evaluation of an email, one accepts that the evaluation of the rules
must be rebuilt for each email. What we need instead is the possibility
to build the rules once at the start of the daemon and then use it to
evaluate any number of emails.
You can go one step further and first run a SpamAssassin instance with a
lightweight ruleset in prequeue mode to be able to decide as quickly as
possible whether to reject an email and then run another instance with a
heavyweight ruleset that makes a more precise distinction between
marking the email as spam or ham. The second instance can then also
evaluate user rules. There should be a feedback loop between the
instances so that the first instance can use the results of the second
instance, e.g. to quickly reject emails via a local blocklist using the
HashBL.pm plugin. Currently we use a feedback loop between the
SpamAssassin instance in prequeue mode and Postfix to (temporarily)
reject emails for 24 hours.
In any case, SpamAssassin should arrive in 2022 and offer a highly
optimized version of the analysis for MTAs that process millions of
emails per day.
These are my general remarks about the evaluation of rules. An email
with some minor cosmetic changes will follow.
Michael