Mailing List Archive: New RBL for use with URIDNSBL plugin

New RBL for use with URIDNSBL plugin

Mar 28, 2004, 11:45 AM

Post #1 of 36 (610 views)

I received this announcement from someone I know online from a sonic.net
users antispam newsgroup. He has set up a new RBL that automates the
extraction of domain and host names from SpamCop's list of spamvertised
websites. It is designed to be used with SpamAssassin's URIDNSBL plugin.

Is this something that we would put into the test rules to see what
mass-check has to say about its effectiveness?

The RBL ages entries after four days. Does that degree of freshness
affect the stats of testing?

Details are at

http://sc.surbl.org/

-- sidney

Re: New RBL for use with URIDNSBL plugin [ In reply to ]

quinlan at pathname

Mar 28, 2004, 1:21 PM

Post #2 of 36 (609 views)

Sidney Markowitz <sidney@sidney.com> writes:

> I received this announcement from someone I know online from a sonic.net
> users antispam newsgroup. He has set up a new RBL that automates the
> extraction of domain and host names from SpamCop's list of spamvertised
> websites. It is designed to be used with SpamAssassin's URIDNSBL plugin.
>
> Is this something that we would put into the test rules to see what
> mass-check has to say about its effectiveness?

Yes.

> The RBL ages entries after four days. Does that degree of freshness
> affect the stats of testing?

Yes. Four days is low for this application, I think. SpamCop is a pain
to test too, though.

--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting

Re: New RBL for use with URIDNSBL plugin [ In reply to ]

sidney at sidney

Mar 28, 2004, 3:44 PM

Post #3 of 36 (609 views)

Daniel Quinlan wrote:
> Four days is low for this application, I think. SpamCop is a pain
> to test too, though.

Is there so much churn among spamvertised sites on the SpamCop list that
this would be a problem? Keeping the expiration time low would minimize
the effect of false positives, especially from joe jobs. What do you
think an appropriate expiration time would be, and why do you think so?

Do you know anything about how SpamCop generates their spamvertised
sites page?
http://www.spamcop.net/w3m?action=inprogress&type=www
It appears to be a close to real time report based on submissions from
ISPs or other spam gatherers. I wonder how much checking they are able
to do before putting a URL on the list.

-- sidney

Re: New RBL for use with URIDNSBL plugin [ In reply to ]

quinlan at pathname

Mar 28, 2004, 4:23 PM

Post #4 of 36 (609 views)

Sidney Markowitz <sidney@sidney.com> writes:

> Is there so much churn among spamvertised sites on the SpamCop list
> that this would be a problem? Keeping the expiration time low would
> minimize the effect of false positives, especially from joe jobs. What
> do you think an appropriate expiration time would be, and why do you
> think so?

Using expiration time as the method for handling joe jobs (and false
positives in general) seems very insufficient to me. I think the FP
rate is going to be the same order-of-magnitude with or without quick
expiration.

> Do you know anything about how SpamCop generates their spamvertised
> sites page?
> http://www.spamcop.net/w3m?action=inprogress&type=www
>
> It appears to be a close to real time report based on submissions from
> ISPs or other spam gatherers. I wonder how much checking they are able
> to do before putting a URL on the list.

We can always just test it...

Daniel

--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting

Re: New RBL for use with URIDNSBL plugin [ In reply to ]

quinlan at pathname

Mar 28, 2004, 4:48 PM

Post #5 of 36 (609 views)

Daniel Quinlan <quinlan@pathname.com> writes:

> We can always just test it...

Okay, I tested it on my last 7 days of spam and ham (which I just
generated today).

OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
    4895     2868     2027  0.586   0.00   0.00  (all messages)
 100.000  58.5904  41.4096  0.586   0.00   0.00  (all messages as %)
  42.451  71.0948   1.9240  0.974   1.00   1.00  URIBL_SBL
   0.204   0.3487   0.0000  1.000   0.97   0.01  T_URIBL_SC_SURBL
   0.756   0.9763   0.4440  0.687   0.22   1.00  URIBL_DSBL

No FPs, but the SPAM% is rather low. I suspect the problem is that
SURBL is a direct listing of URIs whereas URIBL does the NS->A->RBL
mapping.
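
For readers unfamiliar with these tables: the S/O column appears to be
the share of a rule's hits that are spam, computed from the SPAM% and
HAM% columns. The following is a minimal Python sketch under that
assumption. The helper name s_over_o is made up for illustration and is
not part of the mass-check tooling; the numbers are copied from the
table above.

```python
def s_over_o(spam_pct: float, ham_pct: float) -> float:
    """Share of a rule's hits that are spam: SPAM% / (SPAM% + HAM%)."""
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule did not hit any message")
    return spam_pct / total


# (SPAM%, HAM%) pairs copied from the table above.
rows = {
    "URIBL_SBL": (71.0948, 1.9240),
    "T_URIBL_SC_SURBL": (0.3487, 0.0000),
    "URIBL_DSBL": (0.9763, 0.4440),
}

for name, (spam_pct, ham_pct) in rows.items():
    print(f"{name:<18} {s_over_o(spam_pct, ham_pct):.3f}")
```

Under this reading, a rule with no ham hits gets S/O = 1.000 however
few spam messages it catches, which is why T_URIBL_SC_SURBL can look
perfect on S/O and still have a low SPAM%.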

Also, my hits were largely confined to the last 4 days as expected
despite the corpus including the last 7 days of my spam:

first message in corpus: Fri Mar 19 23:11:07 2004
last message in corpus: Sun Mar 28 05:16:17 2004

hits:

Sun Mar 21 10:15:04 2004
Sun Mar 21 11:16:25 2004
Wed Mar 24 15:06:53 2004
Thu Mar 25 12:30:52 2004
Thu Mar 25 23:56:50 2004
Fri Mar 26 01:42:13 2004
Fri Mar 26 01:59:56 2004
Fri Mar 26 03:45:22 2004
Fri Mar 26 08:28:00 2004
Sat Mar 27 05:57:20 2004

distribution of messages in corpus:

count  received date
   23  Mar 19
  360  Mar 20
  335  Mar 21
  369  Mar 22
  324  Mar 23
  372  Mar 24
  390  Mar 25
  398  Mar 26
  295  Mar 27
    2  Mar 28

This may or may not help with accuracy, but definitely will make delayed
testing harder.

Daniel

--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting

Re: New RBL for use with URIDNSBL plugin [ In reply to ]

jm at jmason

Mar 28, 2004, 11:00 PM

Post #6 of 36 (609 views)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Daniel Quinlan writes:
> No FPs, but the SPAM% is rather low. I suspect the problem is that
> SURBL is a direct listing of URIs whereas URIBL does the NS->A->RBL
> mapping.

It's also *very* new -- I suspect it could do with more data ;) Anyone got
an address for the operator? I can send on over a partial spamtrap feed
from our server (100MBytes of spam per day), or similar.

IMO, expiring after 4 days is *way* too early. At least a month would
be better -- otherwise it allows spammers to "recycle" old domains very
quickly after their use in spam.

And finally, I think we should add a new rule eval fn to URIBL, to
allow URIs to be looked up against an RHSBL-style list. That should
be faster, as it'd mean no need for the NS and A sets of lookups.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAZ7trQTcbUG5Y7woRAhotAKDu3jEOTmsdPXAV2fun6ZITeqUplQCgvWeh
DAaReyOO5vy5rDDC8r5eiUM=
=rfqH
-----END PGP SIGNATURE-----

Re: New RBL for use with URIDNSBL plugin [ In reply to ]

sidney at sidney

Mar 29, 2004, 12:51 AM

Post #7 of 36 (609 views)

Justin Mason wrote:
> Anyone got
> an address for the operator?

I already invited Jeff Chan to join in the discussion here. Last I heard
from him on the newsgroup today he was downloading the archived threads
about URIDNSBL to read before posting anything here. He did post an
email address for feedback before he got the surbl.org domain. Rather
than post the old email address here, I'll let him know that you are
looking to talk to him and also remind him that he forgot to put any
contact address on his new website :-)

> And finally, I think we should add a new rule eval fn to URIBL, to
> allow URIs to be looked up against an RHSBL-style list. That should
> be faster, as it'd mean no need for the NS and A sets of lookups.

That fits in with things that Jeff said in response to the stats that
Daniel posted. We'll see what he has to say when he gets here.

-- sidney

Re[2]: New RBL for use with URIDNSBL plugin [ In reply to ]

jeffc at surbl

Mar 29, 2004, 1:51 AM

Post #8 of 36 (609 views)

On Sunday, March 28, 2004, 10:00:11 PM, Justin Mason wrote:
> Daniel Quinlan writes:
>> No FPs, but the SPAM% is rather low. I suspect the problem is that
>> SURBL is a direct listing of URIs whereas URIBL does the NS->A->RBL
>> mapping.

> It's also *very* new -- I suspect it could do with more data ;) Anyone got
> an address for the operator? I can send on over a partial spamtrap feed
> from our server (100MBytes of spam per day), or similar.

> IMO, expiring after 4 days is *way* too early. At least a month would
> be better -- otherwise it allows spammers to "recycle" old domains very
> quickly after their use in spam.

Hi All,
I'm the person behind SURBL. First I'd like to thank Sidney for
relaying my announcement to you guys and also for letting
me know about this developer forum. I'd also like to thank the SA
developer community for building a great tool for fighting spam.
In that spirit, I'm trying to make a contribution to the efforts
in a way that perhaps has not been tried before. I'll try to
explain what I'd hope SURBL can help accomplish.

First, I think there may be some misunderstanding about the
intended use of the SURBL data, partially caused by my somewhat
shallow understanding of what URIDNSBL currently does, and also
because my own ideas on how SURBL should be used apparently
differ somewhat from how URIDNSBL appears to work. It seems
that URIDNSBL wants to do address resolution on domain names
found in message bodies and compare the resulting addresses
against numeric RBLs. For name-based URIs that's very different
from my intended use for SURBL so I may have been partially in
error in suggesting that an unmodified URIDNSBL use SURBL
directly.

Second, we can make the expiration of records and therefore
number of days any arbitrary length. Four days was chosen
because I felt it was a good match for the freshness of
the SpamCop (SC) Spamvertised site data. It was also chosen
to keep the amount of data reasonably small. If more of a
historical record would be useful, we can keep data for a
week or month. The shortness was partially meant to ensure
that the RBL data tracked current SC data fairly tightly and
also did not result in too large of an RBL. Presently the
RBL only has about 250 records; perhaps that's on the small
side. I'm not too worried about Joe Jobs and other problems
in the data due to some of the averaging effects explained
further on.
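
The expiration policy under discussion can be sketched in a few lines
of Python. This is only an illustration of the idea, not SURBL's actual
code: the function name, the dictionary of last-report times, and the
example domains are all assumptions.

```python
from datetime import datetime, timedelta


def expire_entries(last_reported, now, max_age_days=4):
    """Keep only domains whose most recent report is within max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return {domain: seen for domain, seen in last_reported.items() if seen >= cutoff}


# Hypothetical data: domain -> time of its most recent report.
now = datetime(2004, 3, 28, 12, 0)
last_reported = {
    "fresh.example": datetime(2004, 3, 27, 5, 57),
    "stale.example": datetime(2004, 3, 21, 10, 15),
}

print(sorted(expire_entries(last_reported, now, max_age_days=4)))
print(sorted(expire_entries(last_reported, now, max_age_days=30)))
```

With a four-day window only the recently reported domain survives; with
a thirty-day window both do, at the cost of a larger list.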

More fundamentally the question of number of days may somewhat
miss the idea of what I'm trying to accomplish with SURBL.
SURBL is not trained on spam in the sense of Bayesian rules, etc.
It is simply meant to be a record of the most frequently
reported domains in spam message bodies that SpamCop users
choose to report. In this sense it's like a broadly-based,
hand-tuned black list of domains commonly found in spam. Because
quite a few reports need to be received for a domain to get
added to SURBL, it effectively represents a consensus voting
system on what body domains are spammy. One improvement might be
to encode the frequency data in the RBL so that more frequently
reported domains could be used to give higher scores.

About the only tuning of the data I see as necessary or
possible is in the number of expiration days and the report
count threshold for inclusion in the list (with the caveats
about how those counts are generated, as mentioned in the
documentation). Some statistical analysis could help with
the thresholding question.

http://sc.surbl.org/

As another example of difference about my views on the use of
the SURBL data, off-list Sidney brought up the question of
processing deliberately randomized host names that spammers
sometimes use and how that could confuse or defeat a spam message
body domain RBL. He implied that such deliberate attempts
at randomization might be a reason my data was not working too
well with URIDNSBL, and I partially agree. This observation
points out potential differences in how the data might best
be used.

My take on the randomized host or subdomain problem highlights
a different viewpoint we took into consideration when designing
our data structure. Instead of checking every randomized FQDN
against the RBL, we prefer to try to strip off the random portion
and pass only the basic, unchanging domain. The SURBL data only
gets the parent of these randomized FQDNs since it builds its
(inverted) tree from the root (TLD) direction down toward the
leaves. (It actually starts counting reports from the second
level, not the top level, which would be way too broad.) It
accumulates a count of the children under the second level so
that:

lkjhlkjh.random.com
089yokhl.random.com
asdsdfsd.random.com

gives one entry for each FQDN, but gives the useful and desirable
count of *3* for random.com. The randomizers *cannot hide* from
this approach. The non-random child portion of their domains
shows up clearly and conspicuously as a parent domain with an
increased count (3 is greater than 1). Every time a spammer gets
reported using a randomized host or subdomain name, it increases
the count of their parent domain. In the words of the original,
Apple II version of Castle Wolfenstein, "You're caught."

So a technique to defeat the randomizers' greater count is to look
at the higher levels of the domain, under which SURBL will always
count the randomized children of the "bad" parent. In this case
the URI diversity created through randomization hurts the spammer
by increasing the number of unique reports and increasing the
report count of their parent domain, making them more likely to
be added to SURBL. (Dooh, this paragraph is redundant...)
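
The roll-up of randomized host names into their parent domain can be
sketched as follows. This is a simplified Python illustration, not the
real SURBL data engine: the function names are invented, every reported
host counts once, and the parent is naively taken to be the last two
labels (which would be wrong for registries such as co.uk).

```python
from collections import Counter


def parent_domain(fqdn, levels=2):
    """Return the last `levels` labels of a host name, e.g. random.com."""
    labels = fqdn.lower().rstrip(".").split(".")
    return ".".join(labels[-levels:])


def count_reports(reported_hosts, levels=2):
    """Count reports per parent domain, whatever the random prefix was."""
    return Counter(parent_domain(host, levels) for host in reported_hosts)


reports = [
    "lkjhlkjh.random.com",
    "089yokhl.random.com",
    "asdsdfsd.random.com",
    "www.example.org",
]
counts = count_reports(reports)
print(counts["random.com"], counts["example.org"])
```

Each randomized child adds to the same parent counter, so the parent
crosses any inclusion threshold sooner than a domain that is reported
under a single host name.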

A quick look at the data will confirm that almost all of the
most often reported domains have just two levels (a few have
three levels):

http://spamcheck.freeapp.net/top-sites-domains

This simply reflects the nature of the data, including the
positive and constructive handling of randomizers. The real
strength of SURBL is that the domains are very strongly spam
domains. This approach would be prone to failure if the FP rate
of these base domains was significantly above zero. Due to the
law of averages and fairly careful SpamCop reporters, that
seldom seems to happen.

My suggested alternative approach to parsing spam URIs would be to
start with the second level domains, compare those against SURBL,
try the third levels next, up to some limit. (Levels 1 with 2,
then 1 through 3 are probably enough, i.e. two DNS queries into
the SURBL domain). Since the DNS RBL lookups are all cached and
very fast there should not be too much of a performance penalty
for this. Probably it's less of a penalty than trying to resolve
spam body FQDNs into numeric addresses, then do reverse lookups
or name server record checks on the addresses, etc. Some of the
three-level domains are supersets of two-level domains, for
example to.discreetvaluepills.com and discreetvaluepills.com are
both listed, so the two level comparison may be the best place to
start.
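
A rough Python sketch of this lookup order follows. Several things here
are assumptions made for illustration: the zone name sc.surbl.org is
inferred from the web site address, the function names are invented,
IP-literal URIs are not handled, and the DNS check is passed in as a
callable so the logic can be exercised without network access.

```python
import socket

SURBL_ZONE = "sc.surbl.org"  # assumed zone name, inferred from the site URL


def candidate_names(host, max_levels=3):
    """Two-level name first, then three-level, up to max_levels labels."""
    labels = host.lower().rstrip(".").split(".")
    top = min(max_levels, len(labels))
    return [".".join(labels[-n:]) for n in range(2, top + 1)]


def is_listed(host, resolves, zone=SURBL_ZONE, max_levels=3):
    """True if any candidate name resolves inside the blacklist zone."""
    return any(resolves(f"{name}.{zone}") for name in candidate_names(host, max_levels))


def dns_resolves(name):
    """One possible `resolves` callable: plain A-record lookup."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:
        return False


print(candidate_names("x.to.discreetvaluepills.com"))
```

At most two queries are issued per host with max_levels=3, and no
resolution of the spamvertised host itself is needed.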

Implementing this approach may require a new code branch off of
URIDNSBL to be started. But I'm convinced my approach may have
some definite merit if implemented.

The results of feeding SURBL directly into URIDNSBL may not
be too strong because the two approaches seem to have fairly
different background assumptions and design approaches in mind.
I now believe my data may work better when used as I describe
above than when fed directly into unmodified URIDNSBL.

I've never written any SA code, so could I convince someone
to consider implementing this approach or give me a pointer to
learn how to do it?

> And finally, I think we should add a new rule eval fn to URIBL, to
> allow URIs to be looked up against an RHSBL-style list. That should
> be faster, as it'd mean no need for the NS and A sets of lookups.

Justin's last comment would seem to include a key part of the
puzzle. As I mention above, I believe the SURBL data could
and quite possibly should be compared without any DNS resolution
of any domains in the message body. If the domain (or numeric
address) in the spam URI matches SURBL, you almost certainly hold
a gen-u-ine spam.

This also ties in with Daniel's earlier observation after testing
SURBL using URIDNSBL:

> No FPs, but the SPAM% is rather low. I suspect the problem is that
> SURBL is a direct listing of URIs whereas URIBL does the NS->A->RBL
> mapping.

He's exactly right about the intention of SURBL. It is a direct
list of spam URI domains, intended for direct comparison against
domains in incoming message URIs without resorting to any DNS
resolution. I consider that a feature rather than a bug. ;)

TIA and Cheers,

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org
http://sc.surbl.org/

Re[2]: New RBL for use with URIDNSBL plugin [ In reply to ]

dot at dotat

Mar 29, 2004, 3:15 AM

Post #10 of 36 (609 views)

On Mon, 29 Mar 2004, Jeff Chan wrote:
>
> So a technique to defeat the randomizers' greater count is to look
> at the higher levels of the domain, under which SURBL will always
> count the randomized children of the "bad" parent. In this case
> the URI diversity created through randomization hurts the spammer
> by increasing the number of unique reports and increasing the
> report count of their parent domain, making them more likely to
> be added to SURBL. (Dooh, this paragraph is redundant...)

Another approach is to blacklist nameservers that host spamvertized
domains. If an email address or a URI uses a domain name whose nameservers
are blacklisted (e.g. the SBL has appropriate listing criteria), or if the
reverse DNS is hosted on blacklisted nameservers, these may be grounds for
increasing the score.

I don't know if SA does this check yet.
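
The nameserver check described here (the NS->A->RBL mapping mentioned
earlier in the thread) might look roughly like the Python sketch below.
It is not SpamAssassin's implementation. The function names are
invented, the three lookup callables are placeholders for real DNS
queries, and the default zone name is an assumption. The one fixed
convention is that an IP-based DNSBL is queried with the IPv4 octets
reversed and the zone appended.

```python
def dnsbl_query_name(ipv4, zone):
    """Build the query name for an IP-based DNSBL: reversed octets + zone."""
    octets = ipv4.split(".")
    if len(octets) != 4:
        raise ValueError(f"not a dotted-quad IPv4 address: {ipv4!r}")
    return ".".join(reversed(octets)) + "." + zone


def nameservers_listed(domain, lookup_ns, lookup_a, resolves, zone="sbl.spamhaus.org"):
    """True if any nameserver of `domain` has an address listed in `zone`.

    lookup_ns(domain) -> list of nameserver host names   (placeholder)
    lookup_a(host)    -> list of IPv4 address strings    (placeholder)
    resolves(name)    -> bool, whether the DNSBL answers (placeholder)
    """
    for ns_host in lookup_ns(domain):
        for ip in lookup_a(ns_host):
            if resolves(dnsbl_query_name(ip, zone)):
                return True
    return False


print(dnsbl_query_name("192.0.2.10", "sbl.spamhaus.org"))
```

Note that this path needs an NS lookup and an A lookup before the
blacklist query can even be built, which is the extra cost a direct
domain-name list avoids.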

--
Tony Finch <dot@dotat.at> http://dotat.at/

Re: Re[2]: New RBL for use with URIDNSBL plugin [ In reply to ]

quinlan at pathname

Mar 29, 2004, 3:57 AM

Post #11 of 36 (609 views)

Jeff Chan <jeffc@surbl.org> writes:

> For name-based URIs that's very different from my intended use for
> SURBL so I may have been partially in error in suggesting that an
> unmodified URIDNSBL use SURBL directly.

Yeah, I didn't expect it would work based on the explanation on the
SURBL web page, but I figured I'd give it a try anyway. No harm, no
foul. I think we'll need to add another method to the URIDNSBL plugin
to support direct RHS query blacklists like SURBL.

> Presently the RBL only has about 250 records; perhaps that's on the
> small side.

250 seems small relative to the number of domains I see in spam each day
(very roughly about 4 domains mentioned per email, average of 2 domains
in each spam unique to a week-long period).

> One improvement might be to encode the frequency data in the RBL so
> that more frequently reported domains could be used to give higher
> scores.

We could do that, but let's see where we are once we start doing direct
lookups and if perhaps you increase your timeout and lower your
threshold to increase the number of records somewhat.

The key thing with the threshold is that we want SURBL to be accurate as
a spam rule. Joe jobs are something you want to think about now as
opposed to later.

One way you could reduce the possibility of joe jobs is to remove old
domains, ones that have been around a while. Stuff like amazon.com,
ebay.com, etc. have been around for a long time. SenderBase has easily
accessed data for this (first email from domain was initialized long
enough ago to be useful now) and there are also the whois records. You
could also build-up a whitelist for repeated joe-jobs.
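
As a sketch of how these safeguards could combine on the list-building
side, here is a small Python filter. Everything in it is hypothetical:
the function name, the one-year age cutoff, and the first_seen mapping,
which stands in for whatever source (whois records, SenderBase, or
something else) supplies a date for when a domain was first observed.

```python
from datetime import date


def eligible_for_listing(domain, report_count, threshold, whitelist,
                         first_seen, today, min_age_days=365):
    """Decide whether a reported domain should be added to the list."""
    if domain in whitelist:
        return False  # known legitimate site, possibly a joe-job target
    if report_count < threshold:
        return False  # not enough independent reports yet
    seen = first_seen.get(domain)
    if seen is not None and (today - seen).days >= min_age_days:
        return False  # long-established domain, treat as unlikely spam
    return True


today = date(2004, 3, 29)
first_seen = {"oldshop.example": date(1998, 1, 1), "medz.example": date(2004, 3, 22)}
whitelist = {"ebay.com", "amazon.com"}

for name, count in [("medz.example", 20), ("ebay.com", 50), ("oldshop.example", 50)]:
    print(name, eligible_for_listing(name, count, 10, whitelist, first_seen, today))
```

Domains with no known first-seen date fall through to the threshold
test alone, which is a design choice of this sketch rather than
anything stated in the thread.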

You might also want to increase the timeout on domains that appear again and
again.

> As another example of difference about my views on the use of the
> SURBL data, off-list Sidney brought up the question of processing
> deliberately randomized host names that spammers sometimes use and how
> that could confuse or defeat a spam message body domain RBL. He
> implied that such deliberate attempts at randomization might be a
> reason my data was not working too well with URIDNSBL, and I partially
> agree. This observation points out potential differences in how the
> data might best be used.

Yes, but the SBL rule works pretty well, so I don't think randomized
host names are a problem yet.

> My take on the randomized host or subdomain problem highlights
> a different viewpoint we took into consideration when designing
> our data structure.

I *think* we also currently only do queries of the domain itself, so
it shouldn't be an issue.

> Instead of checking every randomized FQDN against the RBL, we prefer
> to try to strip off the random portion and pass only the basic,
> unchanging domain. The SURBL data only gets the parent of these
> randomized FQDNs since it builds its (inverted) tree from the root
> (TLD) direction down toward the leaves. (It actually starts counting
> reports from the second level, not the top level, which would be way
> too broad.) It accumulates a count of the children under the second
> level so that:
>
> lkjhlkjh.random.com
> 089yokhl.random.com
> asdsdfsd.random.com
>
> gives one entry for each FQDN, but gives the useful and desirable
> count of *3* for random.com. The randomizers *cannot hide* from
> this approach. The non-random child portion of their domains
> shows up clearly and conspicuously as a parent domain with an
> increased count (3 is greater than 1). Every time a spammer gets
> reported using a randomized host or subdomain name, it increases
> the count of their parent domain. In the words of the original,
> Apple II version of Castle Wolfenstein, "You're caught."

This is a good idea.

> My suggested alternative approach to parsing spam URIs would be to
> start with the second level domains, compare those against SURBL,
> try the third levels next, up to some limit. (Levels 1 with 2,
> then 1 through 3 are probably enough, i.e. two DNS queries into
> the SURBL domain). Since the DNS RBL lookups are all cached and
> very fast there should not be too much of a performance penalty
> for this.

Whatever we do, we really want to do all the queries at once as early as
possible in the message check for performance reasons.
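
To illustrate issuing all the queries up front, here is a small Python
sketch that runs them concurrently in a thread pool. It is only a
stand-in for whatever asynchronous DNS mechanism SpamAssassin itself
uses; check_all and the resolves callable are names invented for this
example.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(query_names, resolves, max_workers=8):
    """Run every blacklist query concurrently and return {name: listed}."""
    names = list(dict.fromkeys(query_names))  # drop duplicates, keep order
    if not names:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(names, pool.map(resolves, names)))


# Stand-in resolver so the example runs without network access.
listed = {"random.com.sc.surbl.org"}
results = check_all(
    ["random.com.sc.surbl.org", "example.org.sc.surbl.org", "random.com.sc.surbl.org"],
    lambda name: name in listed,
)
print(results)
```

Collecting every URI domain first and firing the lookups together means
the slowest single answer, not the sum of all of them, bounds the wait.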

> Probably it's less of a penalty than trying to resolve spam body FQDNs
> into numeric addresses, then do reverse lookups or name server record
> checks on the addresses, etc.

Definitely.

> Implementing this approach may require a new code branch off of
> URIDNSBL to be started. But I'm convinced my approach may have
> some definite merit if implemented.

I think it belongs in the URIDNSBL code, but another plugin would
perhaps be okay.

> I've never written any SA code, so could I convince someone to
> consider implementing this approach or give me a pointer to learn how
> to do it?

It sounds like Justin is thinking about it, and perhaps Sidney is
interested. If you want to do it yourself, my advice would be to check
out the SVN tree and start hacking. :-)

Daniel

--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting

Re[3]: New RBL for use with URIDNSBL plugin [ In reply to ]

jeffc at surbl

Mar 29, 2004, 4:32 AM

Post #12 of 36 (609 views)

On Monday, March 29, 2004, 2:15:27 AM, Tony Finch wrote:
> On Mon, 29 Mar 2004, Jeff Chan wrote:
>>
>> So a technique to defeat the randomizers' greater count is to look
>> at the higher levels of the domain, under which SURBL will always
>> count the randomized children of the "bad" parent. In this case
>> the URI diversity created through randomization hurts the spammer
>> by increasing the number of unique reports and increasing the
>> report count of their parent domain, making them more likely to
>> be added to SURBL. (Dooh, this paragraph is redundant...)

> Another approach is to blacklist nameservers that host spamvertized
> domains. If an email address or a URI uses a domain name whose nameservers
> are blacklisted (e.g. the SBL has appropriate listing criteria), or if the
> reverse DNS is hosted on blacklisted nameservers, these may be grounds for
> increasing the score.

> I don't know if SA does this check yet.

Yes Eric and I discussed this approach, and I know others have
also, but I tend to think it could be overbroad and could catch
too many innocent domains. For example, a non-rogue ISP who got
burned by a spamming (ex-)customer could poison the legitimate
domains of all their other customers who use the same name
servers.

Our feeling is that addressing the *domains that actually
appear in spam* is more direct and therefore much less prone
to collateral damage.

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://sc.surbl.org/

Re[3]: New RBL for use with URIDNSBL plugin [ In reply to ]

dot at dotat

Mar 29, 2004, 4:35 AM

Post #14 of 36 (609 views)

On Mon, 29 Mar 2004, Jeff Chan wrote:
>
> Yes Eric and I discussed this approach, and I know others have
> also, but I tend to think it could be overbroad and could catch
> too many innocent domains. For example, a non-rogue ISP who got
> burned by a spamming (ex-)customer could poison the legitimate
> domains of all their other customers who use the same name
> servers.
>
> Our feeling is that addressing the *domains that actually
> appear in spam* is more direct and therefore much less prone
> to collateral damage.

Yes, this is why you have to be careful about the nameservers that are
blacklisted. They must be controlled by spammers rather than merely used
by spammers, which is why the SBL is an appropriate blacklist for this
purpose.

--
Tony Finch <dot@dotat.at> http://dotat.at/

Re[4]: New RBL for use with URIDNSBL plugin [ In reply to ]

jeffc at surbl

Mar 29, 2004, 5:20 AM

Post #15 of 36 (609 views)

On Monday, March 29, 2004, 3:35:07 AM, Tony Finch wrote:
> On Mon, 29 Mar 2004, Jeff Chan wrote:
>>
>> Yes Eric and I discussed this approach, and I know others have
>> also, but I tend to think it could be overbroad and could catch
>> too many innocent domains. For example, a non-rogue ISP who got
>> burned by a spamming (ex-)customer could poison the legitimate
>> domains of all their other customers who use the same name
>> servers.
>>
>> Our feeling is that addressing the *domains that actually
>> appear in spam* is more direct and therefore much less prone
>> to collateral damage.

> Yes, this is why you have to be careful about the nameservers that are
> blacklisted. They must be controlled by spammers rather than merely used
> by spammers, which is why the SBL is an appropriate blacklist for this
> purpose.

I appreciate your feedback. Certainly known spam gangs should be
tracked and blocked wherever they can be found, and SBL is a
good way to do that. We're not really suggesting the SURBL
replace SBL, but they certainly could and probably should be used
together. SURBL is meant to be more like frosting on an existing
cake.

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://sc.surbl.org/

Re[4]: New RBL for use with URIDNSBL plugin [ In reply to ]

jeffc at surbl

Mar 29, 2004, 5:38 AM

Post #17 of 36 (609 views)

On Monday, March 29, 2004, 2:57:53 AM, Daniel Quinlan wrote:
> Jeff Chan <jeffc@surbl.org> writes:

>> For name-based URIs that's very different from my intended use for
>> SURBL so I may have been partially in error in suggesting that an
>> unmodified URIDNSBL use SURBL directly.

> Yeah, I didn't expect it would work based on the explanation on the
> SURBL web page, but I figured I'd give it a try anyway. No harm, no
> foul. I think we'll need to add another method to the URIDNSBL plugin
> to support direct RHS query blacklists like SURBL.

Sounds like a plan, and I certainly appreciate your giving it a
try with URIDNSBL as it is now.

>> Presently the RBL only has about 250 records; perhaps that's on the
>> small side.

> 250 seems small relative to the number of domains I see in spam each day
> (very roughly about 4 domains mentioned per email, average of 2 domains
> in each spam unique to a week-long period).

An interesting thing is that the data seems pretty "normal" in a
statistical sense. Halving of the threshold approximately
doubles the size of the resulting list in the range of thresholds
I looked at (approx 5 to 25 "report counts"). Lengthening the
expiration period should also increase the size of the list for a
given threshold, and the additional data gained from doing so
could be pretty valid.

One thing I did notice from top-sites.html is that there is a
persistent pharmaspammer hosted in China or Brazil that almost
always seems to be near the top of the list. They had used
domain names like medz4cheap.com, and some other names.
Currently they're using medicalfhtjk.com:

http://spamcheck.freeapp.net/top-sites.html

What's interesting is that their domains only last a week or
so before they switch to a new one, with very similar-style
spams referencing all of them. In their case at least, that
kind of argues for a one week or so expiration, but that's
only one anecdotal example and not really a basis for a
policy. Perhaps it's not a coincidence that 7 days is also
a typical minimum zone file expire time, i.e. a length of
time the spam domain zone file might be cached on name
servers.

>> One improvement might be to encode the frequency data in the RBL so
>> that more frequently reported domains could be used to give higher
>> scores.

> We could do that, but let's see where we are once we start doing direct
> lookups and if perhaps you increase your timeout and lower your
> threshold to increase the number of records somewhat.

Agreed. And I'm not even sure most RBL code would know what to
do with information other than "yep it resolves, so it's a match,
and I'm done." That said, the count could easily be added to an RBL
resource record, for example the TXT record.

> The key thing with the threshold is that we want SURBL to be accurate as
> a spam rule. Joe jobs are something you want to think about now as
> opposed to later.

And why I want to start with a somewhat high threshold and an
effective whitelist.

> One way you could reduce the possibility of joe jobs is to remove old
> domains, ones that have been around a while.

That's an interesting idea that assumes spam body domains go
away eventually. My current code expires all domains equally,
but could be modified to look for persistent ones and treat
them differently. The averaging effect seems to be very strong
however, and very few FPs seem to get in. The fact that the
manual SpamCop reports can be and probably are mostly hand-tuned
by every SC user seems to help. I.e. most SC users probably
make an effort to uncheck legitimate domains to prevent false
reporting.

> Stuff like amazon.com,
> ebay.com, etc. have been around for a long time. SenderBase has easily
> accessed data for this (first email from domain was initialized long
> enough ago to be useful now) and there are also the whois records. You
> could also build-up a whitelist for repeated joe-jobs.

Certainly the existing SURBL whitelist could be used for that.
I've already added some of the common domains like yahoo,
hotmail, etc. and have just added ebay and amazon due to your
reminder. None of those has actually appeared above the
threshold yet, however, so the law of averages and careful
reporting seem to be on our side so far.

I'm not too familiar with SenderBase. Do they have a web site
or domain whitelist? For that matter, does anyone know of any
such whitelists that we could incorporate? Basically it would
just be a list of known, legitimate, popular sites or domains. I
would assume such whitelists exist but am somewhat new to working
on anti-spam technologies.

> You might also want to increase the timeout on domains that appear again and
> again.

Interesting idea. Would that be for spam domains or legitimate
ones? The idea of variable expiration is interesting though.

>> As another example of difference about my views on the use of the
>> SURBL data, off-list Sidney brought up the question of processing
>> deliberately randomized host names that spammers sometimes use and how
>> that could confuse or defeat a spam message body domain RBL. He
>> implied that such deliberate attempts at randomization might be a
>> reason my data was not working too well with URIDNSBL, and I partially
>> agree. This observation points out potential differences in how the
>> data might best be used.

> Yes, but the SBL rule works pretty well, so I don't think randomized
> host names are a problem yet.

We've seen quite a few randomized or customized (to a username
for example) host names in some of the top pharmaspam sites. The
idea is exactly as others have mentioned: add chaos to the names
to throw off message body checkers. Doesn't throw us off though;
we thrive on it as long as their main domain is behind it!

>> My take on the randomized host or subdomain problem highlights
>> a different viewpoint we took into consideration when designing
>> our data structure.

> I *think* we also currently only do queries of the domain itself, so
> it shouldn't be an issue.

If so, great. If not, the approach I outlined could be worth a
try.

>> Instead of checking every randomized FQDN against the RBL, we prefer
>> to try to strip off the random portion and pass only the basic,
>> unchanging domain. The SURBL data only gets the parent of these
>> randomized FQDNs since it builds its (inverted) tree from the root
>> (TLD) direction down toward the leaves. (It actually starts counting
>> reports from the second level, not the top level, which would be way
>> too broad.) It accumulates a count of the children under the second
>> level so that:
>>
>> lkjhlkjh.random.com
>> 089yokhl.random.com
>> asdsdfsd.random.com
>>
>> gives one entry for each FQDN, but gives the useful and desirable
>> count of *3* for random.com. The randomizers *cannot hide* from
>> this approach. The non-random portion of their domains
>> shows up clearly and conspicuously as a parent domain with an
>> increased count (3 is greater than 1). Every time a spammer gets
>> reported using a randomized host or subdomain name, it increases
>> the count of their parent domain. In the words of the original,
>> Apple II version of Castle Wolfenstein, "You're caught."

> This is a good idea.

Thanks!
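
The counting scheme quoted above can be sketched roughly as follows. This is an illustration under assumptions, not the actual SURBL code: it splits names naively on dots, so it does not handle suffixes such as co.uk, and `accumulate` is a made-up name.

```python
from collections import Counter


def accumulate(reported_hosts):
    """Count reports at every level from the second level down to the FQDN."""
    counts = Counter()
    for host in reported_hosts:
        labels = host.lower().rstrip(".").split(".")
        # Start at depth 2 (e.g. random.com); the bare TLD would be too broad.
        for depth in range(2, len(labels) + 1):
            counts[".".join(labels[-depth:])] += 1
    return counts


counts = accumulate([
    "lkjhlkjh.random.com",
    "089yokhl.random.com",
    "asdsdfsd.random.com",
])
print(counts["random.com"])           # parent domain collects all three reports
print(counts["lkjhlkjh.random.com"])  # each randomized FQDN is seen once
```

Each randomized name adds to its own entry and to its parent, so the parent's count grows with every new random host.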

>> My suggested alternative approach to parsing spam URIs would be to
>> start with the second level domains, compare those against SURBL,
>> try the third levels next, up to some limit. (Levels 1 with 2,
>> then 1 through 3 are probably enough, i.e. two DNS queries into
>> the SURBL domain). Since the DNS RBL lookups are all cached and
>> very fast there should not be too much of a performance penalty
>> for this.

> Whatever we do, we really want to do all the queries at once as early as
> possible in the message check for performance reasons.

Agreed, though local DNS caching helps quite a bit....
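
A sketch of how the per-level query names might be built, so that they can all be issued together early in the message check. It assumes the usual DNSBL convention of prepending the name to the list zone, and takes the zone sc.surbl.org from the URL in this thread. The function name and the `max_level` parameter are hypothetical, and no DNS lookup is performed here.

```python
def surbl_query_names(host, zone="sc.surbl.org", max_level=3):
    """Build lookup names for levels 1-2, then 1-3, of a URI host name."""
    labels = host.lower().rstrip(".").split(".")
    names = []
    # depth 2 is the second-level domain, depth 3 adds one more label, etc.
    for depth in range(2, min(max_level, len(labels)) + 1):
        names.append(".".join(labels[-depth:]) + "." + zone)
    return names


# Two candidate names for a deep, randomized host; a caller could pass
# both to a resolver in one batch.
print(surbl_query_names("lkjhlkjh.www.random.com"))
```

With the default limit this yields at most two names per host, matching the "two DNS queries" estimate in the quoted proposal.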

>> Probably it's less of a penalty than trying to resolve spam body FQDNs
>> into numeric addresses, then do reverse lookups or name server record
>> checks on the addresses, etc.

> Definitely.

Agreed. It's *definitely* quicker to do DNS lookups of the
single, cached SURBL domain than DNS lookups on all the random
domains appearing in spam (and in legitimate messages).

>> Implementing this approach may require a new code branch off of
>> URIDNSBL to be started. But I'm convinced my approach may have
>> some definite merit if implemented.

> I think it belongs in the URIDNSBL code, but another plugin would
> perhaps be okay.

If it can be done in the existing code, I'm all for that!

If not, we could consider forking it off.

>> I've never written any SA code, so could I convince someone to
>> consider implementing this approach or give me a pointer to learn how
>> to do it?

> It sounds like Justin is thinking about it, or perhaps Sidney is
> interested, or my advice if you want to do it would be to check out the
> SVN tree and start hacking. :-)

> Daniel

Someone please try it. I think it could rock! :)

Eric Kolve, if you're reading this, would you care to try, per
my previously suggested design?

If no one else will, I may give it a hack or two. It would
probably be immensely faster for someone already familiar with SA
to give it a try though.... ;)

Thanks for your feedback!

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://sc.surbl.org/

Re[4]: New RBL for use with URIDNSBL plugin [ In reply to ]

jeffc at surbl

Mar 29, 2004, 5:38 AM

Post #18 of 36 (609 views)

Permalink

Re: New RBL for use with URIDNSBL plugin [ In reply to ]

marc at perkel

Mar 29, 2004, 10:35 AM

Post #19 of 36 (609 views)

Permalink

For what it's worth - I've been blacklisting against my own URI list for
over a year now and quite frankly - it's the best thing I have for
trapping spam of anything I do. It's 100% accurate and if I see new spam
getting through all I have to do is add to the list and no more of them.

So - YES !!!!

Glad to see SA implementing this.

Just want to say though - the ability to add my own URIs to the
blacklist is important. Should also support a flat text file with regex
expressions.

My 2 centz .....

Re: New RBL for use with URIDNSBL plugin [ In reply to ]

marc at perkel

Mar 29, 2004, 11:01 AM

Post #20 of 36 (609 views)

Permalink

Here's some of my initial thoughts.

Within each hostname is what I would call the "real" part of the domain, for example:

farmsex.com
farmsex.co.uk

The part before the "farmsex" should be ignored. Anyone who controls the
domains also probably controls the subdomains and that is likely the
rotating part.
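
As a rough sketch of stripping the rotating part, assuming a small hand-made table of two-label suffixes such as co.uk. The table is deliberately incomplete and the name `real_domain` is hypothetical; real code would need a much fuller suffix list.

```python
# Illustrative, incomplete set of suffixes that take up two labels.
TWO_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp"}


def real_domain(host):
    """Drop subdomain labels, keeping only the registrable part of a host."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return ".".join(labels)
    # Keep three labels under a two-label suffix, otherwise two.
    keep = 3 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


print(real_domain("x7f3q.example.com"))
print(real_domain("a.b.example.co.uk"))
```

Both a .com and a .co.uk name reduce to the part the owner registered, whatever is placed in front of it.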

Additionally - a reverse lookup should be done on the IPs of the links
for the purposes of statistical tracking. We might find that the
resolved IP is always spam - or always not spam - or sometimes spam and
sometimes not spam. We may be able to return a score on the resolved IP
addresses. I believe that we are going to see a lot of spam linking to
the same IP or groups of IPs and that if a new URI resolves to the same
IP address as farmsex.com then it is likely also spam.
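
The statistical tracking could be sketched like this, assuming the host names have already been resolved to IP addresses elsewhere. The class name `IPStats` and its methods are made up for illustration, and the addresses come from the documentation range 192.0.2.0/24.

```python
from collections import defaultdict


class IPStats:
    """Track how often each resolved IP appears in spam versus nonspam."""

    def __init__(self):
        # ip -> [spam_count, ham_count]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, ip, is_spam):
        """Note one message whose link resolved to this IP."""
        self._counts[ip][0 if is_spam else 1] += 1

    def spam_ratio(self, ip):
        """Fraction of sightings that were spam, or None if never seen."""
        spam, ham = self._counts.get(ip, (0, 0))
        total = spam + ham
        return None if total == 0 else spam / total


stats = IPStats()
for _ in range(3):
    stats.record("192.0.2.10", is_spam=True)
stats.record("192.0.2.20", is_spam=True)
stats.record("192.0.2.20", is_spam=False)
print(stats.spam_ratio("192.0.2.10"))  # always spam
print(stats.spam_ratio("192.0.2.20"))  # sometimes spam, sometimes not
```

An unseen IP returns None instead of a ratio, so "no data" is not confused with "never spam".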

The thought is that spammers might start linking to cnn.com or something
to try to raise the score - even if it's in hidden text. And - that's an
issue - but live links to other sites might defeat the purpose of the
spam and mixing blacklisted sites with nonblacklisted might even become
a stronger indicator of spam.

Re: New RBL for use with URIDNSBL plugin [ In reply to ]

marc at perkel

Mar 29, 2004, 11:07 AM

Post #21 of 36 (609 views)

Permalink

Yes Tony - I really like that idea. I have done something like that
myself using Exim rules to put the nameserver info into a header and
then let Bayes chew on it. I noticed that the nameservers (of the last
received IP) of spam look different from nonspam and that it gave Bayes
more useful data to score from.

I believe that the nameserver records of the IP address that the URIs
resolve to would be very hot and would be a strong factor to identify
spam. I'm wondering if dumping the data into Bayes may be a better way of
automatically scoring that.

But I REALLY like your idea!

Tony Finch wrote:

>Another approach is to blacklist nameservers that host spamvertized
>domains. If an email address or a URI uses a domain name whose nameservers
>are blacklisted (e.g. the SBL has appropriate listing criteria), or if the
>reverse DNS is hosted on blacklisted nameservers, these may be grounds for
>increasing the score.
>
>I don't know if SA does this check yet.

Re: Re[2]: New RBL for use with URIDNSBL plugin [ In reply to ]

jm at jmason

Mar 29, 2004, 11:31 AM

Post #22 of 36 (609 views)

Permalink

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Tony Finch writes:
> On Mon, 29 Mar 2004, Jeff Chan wrote:
> >
> > So a technique to defeat the randomizers greater count is to look
> > at the higher levels of the domain, under which SURBL will always
> > count the randomized children of the "bad" parent. In this case
> > the URI diversity created through randomization hurts the spammer
> > by increasing the number of unique reports and increasing the
> > report count of their parent domain, making them more likely to
> > be added to SURBL. (Dooh, this paragraph is redundant...)
>
> Another approach is to blacklist nameservers that host spamvertized
> domains. If an email address or a URI uses a domain name whose nameservers
> are blacklisted (e.g. the SBL has appropriate listing criteria), or if the
> reverse DNS is hosted on blacklisted nameservers, these may be grounds for
> increasing the score.
>
> I don't know if SA does this check yet.

Yep, it does -- that's what the URIBL plugin does currently.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAaGuKQTcbUG5Y7woRAra7AKDemalfVvMXd9in3b+DTTuCSco4mACg5ySk
HmqOkWPCNJUam1alF1aqnP8=
=cBGw
-----END PGP SIGNATURE-----

Re: New RBL for use with URIDNSBL plugin [ In reply to ]

marc at perkel

Mar 29, 2004, 11:41 AM

Post #23 of 36 (609 views)

Permalink

I agree - but if you were using your own Bayes to do the scoring then -
hopefully - nameservers that were split spam/nonspam wouldn't affect the
score, whereas the ones that were all spam would.

Tony Finch wrote:

>Yes, this is why you have to be careful about the nameservers that are
>blacklisted. They must be controlled by spammers rather than merely used
>by spammers, which is why the SBL is an appropriate blacklist for this
>purpose.

Re[2]: New RBL for use with URIDNSBL plugin [ In reply to ]

jeffc at surbl

Mar 29, 2004, 1:45 PM

Post #24 of 36 (609 views)

Permalink

> Tony Finch wrote:
>>Yes, this is why you have to be careful about the nameservers that are
>>blacklisted. They must be controlled by spammers rather than merely used
>>by spammers, which is why the SBL is an appropriate blacklist for this
>>purpose.

On Monday, March 29, 2004, 10:41:39 AM, Marc Perkel wrote:
> I agree - but if you were using your own Bayes to do the scoring then -
> hopefully - nameservers that were split spam/nonspam wouldn't affect the
> score, whereas the ones that were all spam would.

But doesn't that assume (perhaps prejudicially) that an SA installation can
see *every* domain that a name server serves? What if there were
a name server whose domains only appeared in spams as far as a
particular SA could see, but which in fact served up many more
legitimate domains that *seldom appeared in spam*? Those legitimate
domains would be blocked by this approach of blacklisting name
servers. I find that potentially somewhat unfair.

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://sc.surbl.org/