On Sunday, March 28, 2004, 10:00:11 PM, Justin Mason wrote:
> Daniel Quinlan writes:
>> No FPs, but the SPAM% is rather low. I suspect the problem is that
>> SURBL is a direct listing of URIs whereas URIBL does the NS->A->RBL
>> mapping.
> It's also *very* new -- I suspect it could do with more data ;) Anyone got
> an address for the operator? I can send on over a partial spamtrap feed
> from our server (100MBytes of spam per day), or similar.
> IMO, expiring after 4 days is *way* too early. At least a month would
> be better -- otherwise it allows spammers to "recycle" old domains very
> quickly after their use in spam.
Hi All,
I'm the person behind SURBL. First I'd like to thank Sidney for
relaying my announcement to you guys and also for letting
me know about this developer forum. I'd also like to thank the SA
developer community for building a great tool for fighting spam.
In that spirit, I'm trying to make a contribution to the efforts
in a way that perhaps has not been tried before. I'll try to
explain what I'd hope SURBL can help accomplish.
First, I think there may be some misunderstanding about the
intended use of the SURBL data, partially caused by my somewhat
shallow understanding of what URIDNSBL currently does, and also
because my own ideas on how SURBL should be used apparently
differ somewhat from how URIDNSBL appears to work. It seems
that URIDNSBL wants to do address resolution on domain names
found in message bodies and compare the resulting addresses
against numeric RBLs. For name-based URIs that's very different
from my intended use for SURBL so I may have been partially in
error in suggesting that an unmodified URIDNSBL use SURBL
directly.
Second, we can make the expiration of records and therefore
number of days any arbitrary length. Four days was chosen
because I felt it was a good match for the freshness of
the SpamCop (SC) Spamvertised site data. It was also chosen
to keep the amount of data reasonably small. If more of a
historical record would be useful, we can keep data for a
week or month. The shortness was partially meant to ensure
that the RBL data tracked current SC data fairly tightly and
also did not result in too large of an RBL. Presently the
RBL only has about 250 records; perhaps that's on the small
side. I'm not too worried about Joe Jobs and other problems
in the data due to some of the averaging effects explained
further on.
More fundamentally the question of number of days may somewhat
miss the idea of what I'm trying to accomplish with SURBL.
SURBL is not trained on spam in the sense of Bayesian rules, etc.
It is simply meant to be a record of the most frequently
reported domains in spam message bodies that SpamCop users
choose to report. In this sense it's like a broadly-based,
hand-tuned black list of domains commonly found in spam. Because
quite a few reports need to be received for a domain to get
added to SURBL, it effectively represents a consensus voting
system on what body domains are spammy. One improvement might be
to encode the frequency data in the RBL so that more frequently
reported domains could be used to give higher scores.
About the only tuning of the data I see as necessary or
possible is in the number of expiration days and the report
count threshold for inclusion in the list (with the caveats
about how those counts are generated, as mentioned in the
documentation). Some statistical analysis could help with
the thresholding question.
http://sc.surbl.org/
As another example of how my views on the use of the SURBL data
differ, off-list Sidney brought up the question of
processing deliberately randomized host names that spammers
sometimes use and how that could confuse or defeat a spam message
body domain RBL. He implied that such deliberate attempts
at randomization might be a reason my data was not working too
well with URIDNSBL, and I partially agree. This observation
points out potential differences in how the data might best
be used.
My take on the randomized host or subdomain problem highlights
a different viewpoint we took into consideration when designing
our data structure. Instead of checking every randomized FQDN
against the RBL, we prefer to try to strip off the random portion
and pass only the basic, unchanging domain. The SURBL data only
gets the parent of these randomized FQDNs since it builds its
(inverted) tree from the root (TLD) direction down toward the
leaves. (It actually starts counting reports from the second
level, not the top level, which would be way too broad.) It
accumulates a count of the children under the second level so
that:
lkjhlkjh.random.com
089yokhl.random.com
asdsdfsd.random.com
gives one entry for each FQDN, but gives the useful and desirable
count of *3* for random.com. The randomizers *cannot hide* from
this approach. The non-random portion of their domains
shows up clearly and conspicuously as a parent domain with an
increased count (3 is greater than 1). Every time a spammer gets
reported using a randomized host or subdomain name, it increases
the count of their parent domain. In the words of the original,
Apple II version of Castle Wolfenstein, "You're caught."
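To make the counting concrete, here is a minimal Python sketch of the idea. It is not the actual SURBL code: the names parent_domain and count_by_parent are made up for illustration, and it naively treats the last two labels as the parent, which would need adjusting for names under country-code second levels.

```python
from collections import Counter

def parent_domain(fqdn, levels=2):
    """Return the rightmost `levels` labels of a host name (hypothetical helper)."""
    labels = fqdn.lower().rstrip(".").split(".")
    return ".".join(labels[-levels:])

def count_by_parent(reported_hosts, levels=2):
    """Tally one report per host under its parent domain."""
    return Counter(parent_domain(h, levels) for h in reported_hosts)

reports = [
    "lkjhlkjh.random.com",
    "089yokhl.random.com",
    "asdsdfsd.random.com",
]
counts = count_by_parent(reports)
print(counts["random.com"])  # 3
```

Each randomized host adds one to the same parent, so the more variants a spammer generates, the faster the parent crosses any inclusion threshold.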
So a technique to defeat the randomizers' greater count is to look
at the higher levels of the domain, under which SURBL will always
count the randomized children of the "bad" parent. In this case
the URI diversity created through randomization hurts the spammer
by increasing the number of unique reports and increasing the
report count of their parent domain, making them more likely to
be added to SURBL. (Dooh, this paragraph is redundant...)
A quick look at the data will confirm that almost all of the
most often reported domains have just two levels (a few have
three levels):
http://spamcheck.freeapp.net/top-sites-domains
This simply reflects the nature of the data, including the
positive and constructive handling of randomizers. The real
strength of SURBL is that the domains are very strongly spam
domains. This approach would be prone to failure if the FP rate
of these base domains was significantly above zero. Due to the
law of averages and fairly careful SpamCop reporters, that
seldom seems to happen.
My suggested alternative approach to parsing spam URIs would be to
start with the second level domains, compare those against SURBL,
try the third levels next, up to some limit. (Levels 1 with 2,
then 1 through 3 are probably enough, i.e. two DNS queries into
the SURBL domain). Since the DNS RBL lookups are all cached and
very fast there should not be too much of a performance penalty
for this. Probably it's less of a penalty than trying to resolve
spam body FQDNs into numeric addresses, then do reverse lookups
or name server record checks on the addresses, etc. Some of the
three-level domains are supersets of two-level domains, for
example to.discreetvaluepills.com and discreetvaluepills.com are
both listed, so the two level comparison may be the best place to
start.
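A rough Python sketch of that lookup order, for anyone who wants to experiment. The zone name sc.surbl.org is an assumption taken from the URL in this message, and candidate_names, dns_listed and first_listed are hypothetical names, not existing SA functions. The DNS check is passed in as a parameter so the logic can be tried against a local set without touching the network.

```python
import socket

SURBL_ZONE = "sc.surbl.org"  # assumed zone name, not confirmed here

def candidate_names(host, max_levels=3):
    """Two-level name first, then three-level, up to max_levels."""
    labels = host.lower().rstrip(".").split(".")
    top = min(max_levels, len(labels))
    return [".".join(labels[-n:]) for n in range(2, top + 1)]

def dns_listed(name, zone=SURBL_ZONE):
    """True if name.zone resolves to an A record (RBL-style query)."""
    try:
        socket.gethostbyname(f"{name}.{zone}")
        return True
    except OSError:
        return False

def first_listed(host, is_listed=dns_listed, max_levels=3):
    """Return the first candidate that is listed, or None."""
    for name in candidate_names(host, max_levels):
        if is_listed(name):
            return name
    return None

# Offline example using an in-memory set in place of DNS.
listed = {"discreetvaluepills.com", "to.discreetvaluepills.com"}
print(first_listed("asdf.to.discreetvaluepills.com", is_listed=listed.__contains__))
# discreetvaluepills.com
```

With the default max_levels of 3 this makes at most two queries per host, and it stops at the two-level match when there is one.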
Implementing this approach may require a new code branch off of
URIDNSBL to be started. But I'm convinced my approach may have
some definite merit if implemented.
The results of feeding SURBL directly into URIDNSBL may not
be too strong because the two approaches seem to have fairly
different background assumptions and design approaches in mind.
I now believe my data may work better when used as I describe
above than when fed directly into unmodified URIDNSBL.
I've never written any SA code, so could I convince someone
to consider implementing this approach or give me a pointer to
learn how to do it?
> And finally, I think we should add a new rule eval fn to URIBL, to
> allow URIs to be looked up against an RHSBL-style list. That should
> be faster, as it'd mean no need for the NS and A sets of lookups.
Justin's last comment would seem to include a key part of the
puzzle. As I mention above, I believe the SURBL data could
and quite possibly should be compared without any DNS resolution
of any domains in the message body. If the domain (or numeric
address) in the spam URI matches SURBL, you almost certainly hold
a gen-u-ine spam.
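As an illustration of comparing without resolving anything, here is a small Python sketch that pulls host names out of a message body and checks them against a local copy of the list. The regular expression is deliberately simple, and uri_hosts and body_hits are hypothetical names; a real implementation would reuse SA's own URI parsing.

```python
import re
from urllib.parse import urlsplit

URI_RE = re.compile(r'https?://[^\s<>"]+', re.IGNORECASE)

def uri_hosts(body):
    """Collect the host part of every http(s) URI found in the body."""
    hosts = set()
    for uri in URI_RE.findall(body):
        try:
            host = urlsplit(uri).hostname
        except ValueError:
            continue  # malformed URI, skip it
        if host:
            hosts.add(host)
    return hosts

def body_hits(body, listed):
    """Hosts whose full name (covers numeric addresses) or
    two-level parent appears in `listed`. No DNS resolution is done."""
    hits = []
    for host in uri_hosts(body):
        parent = ".".join(host.split(".")[-2:])
        if host in listed or parent in listed:
            hits.append(host)
    return sorted(hits)

body = 'Buy now http://lkjhlkjh.random.com/offer?id=1 or see <http://example.org/>'
print(body_hits(body, {"random.com"}))  # ['lkjhlkjh.random.com']
```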
This also ties in with Daniel's earlier observation after testing
SURBL using URIDNSBL:
> No FPs, but the SPAM% is rather low. I suspect the problem is that
> SURBL is a direct listing of URIs whereas URIBL does the NS->A->RBL
> mapping.
He's exactly right about the intention of SURBL. It is a direct
list of spam URI domains, intended for direct comparison against
domains in incoming message URIs without resorting to any DNS
resolution. I consider that a feature rather than a bug. ;)
TIA and Cheers,
Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org
http://sc.surbl.org/