Mailing List Archive

Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
I am pleased to announce that Eric Kolve has added SURBL support
to his SpamAssassin 2.63 plugin called SpamCopURI:

http://sourceforge.net/projects/spamcopuri/

In order to use the new RBL method, please comment out the
previous tests SPAMCOP_URI and SPAMCOP_URI_HOST and raise
the score for the new test to something like 2.5:

score SPAMCOP_URI_RBL 2.5

in the spamcop_uri.cf file. Values higher than 2.5 may be
appropriate because the test is a highly accurate indicator
of spam, for some of the reasons mentioned at the SURBL site:

http://www.surbl.org/
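
For example, after the change the relevant part of
spamcop_uri.cf might look something like this (a sketch only:
the commented-out scores for the old tests are placeholders,
and your copy of the file may differ):

  # old exact-match tests, disabled:
  # score SPAMCOP_URI 0.5
  # score SPAMCOP_URI_HOST 0.5

  # new RBL-based test:
  score SPAMCOP_URI_RBL 2.5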

Note that unlike URIDNSBL, we are comparing *domains* found in
message bodies to *domains* in SURBL (i.e. a name-based list, or RHSBL), rather
than resolving the names into IP addresses (representing the spam
web site's hosting server) and comparing those addresses to a
number-based RBL.
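
As a rough illustration of the name-based lookup (a minimal
sketch in Perl, not SpamCopURI's actual code, and assuming the
list is served under sc.surbl.org):

  # Check a domain taken from a message-body URI against SURBL.
  use Net::DNS;

  sub surbl_listed {
      my ($domain) = @_;                  # e.g. "example.com"
      my $res   = Net::DNS::Resolver->new;
      my $reply = $res->query("$domain.sc.surbl.org", 'A');
      return 0 unless $reply;
      # Any A record returned (conventionally 127.0.0.x) means
      # the domain is listed.
      return scalar grep { $_->type eq 'A' } $reply->answer;
  }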

We consider this a direct approach to the problem of URIs
advertised in spam, and we're confident that the URI data
we get from SpamCop, scored by report counts, is very
useful and relevant. More information about
the data SURBL is built on can be found at:

http://spamcheck.freeapp.net/

Would someone with access to large spam and ham corpora please
give SpamCopURI a try against their recent data, as Daniel
Quinlan did with URIDNSBL + SURBL, and kindly let us know what
kind of results they obtain? Currently four trailing days of
SpamCop URI reports are represented in SURBL.

Thanks!

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://sc.surbl.org/
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
Jeff Chan <jeffc@surbl.org> writes:

> Would someone with access to large spam and ham corpora please give
> SpamCopURI a try against their recent data, as Daniel Quinlan did with
> URIDNSBL + SURBL, and kindly let us know what kind of results they
> obtain? Currently four trailing days of SpamCop URI reports are
> represented in SURBL.

2.6x modules, rules, and patches aren't very interesting right now.
Give me a patch against URIDNSBL in 3.0 to add domain-to-domain testing
and I'll gladly give it a whirl.

Four days still seems rather low. Bear in mind that we're testing
corpora that have spams somewhere between 0 and 3 months old (on
average). SpamCop is very hard to accurately gauge because stuff leaves
so quickly. Expiring stuff quickly doesn't really reduce FPs unless
you're testing old ham vs. new spam. I care more about the S/O ratio
(spam/overall where overall=ham+spam for a 50/50 mix of spam and ham).

Daniel

--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
On Thursday, April 1, 2004, 11:37:54 PM, Daniel Quinlan wrote:
> Jeff Chan <jeffc@surbl.org> writes:

>> Would someone with access to large spam and ham corpora please give
>> SpamCopURI a try against their recent data, as Daniel Quinlan did with
>> URIDNSBL + SURBL, and kindly let us know what kind of results they
>> obtain? Currently four trailing days of SpamCop URI reports are
>> represented in SURBL.

> 2.6x modules, rules, and patches aren't very interesting right now.
> Give me a patch against URIDNSBL in 3.0 to add domain-to-domain testing
> and I'll gladly give it a whirl.

I would do that immediately if I knew how to write one. I've
been rewriting my data stuff lately, while letting Eric update
SpamCopURI to now use SURBL. (The somewhat frustrating thing is
that someone already familiar with SA 3.0 plugins could probably
make such a patch for URIDNSBL in a small fraction of the time it
would take me to come up to speed. But I realize everyone else
is short of time also.)

> Four days still seems rather low.

What would be a better expiration time, and how do you suggest
removing from the blacklist domains that are no longer active in
spams?

We can expire after any arbitrary number of days. I'm leaning
towards seven days right now since that's a typical DNS cache
timeout interval.

> Bear in mind that we're testing
> corpora that have spams somewhere between 0 and 3 months old (on
> average). SpamCop is very hard to accurately gauge because stuff leaves
> so quickly.

True, but it also accurately reflects spams that people are
actually getting and reporting at any given moment. To me
that timeliness gives the data significant value.

If it's the case that domains expire out of the SpamCop
URI data sooner than the particular spam domains remain
a problem, then I could definitely see a need for a longer
expiration. Being somewhat new to the game, I don't
have any data to support either argument.

My intuition is that if a domain continued to appear
in spam, people would continue to report it, and it
would therefore continue to show up in our SURBL data.
I'm interested in finding out what I may be overlooking
in this assumption.

Do you or anyone else here have some data that might shed
some light on this question?

> Expiring stuff quickly doesn't really reduce FPs unless
> you're testing old ham vs. new spam. I care more about the S/O ratio
> (spam/overall where overall=ham+spam for a 50/50 mix of spam and ham).

My priorities are near zero FPs and near 100% accuracy in
the spams we do tag. I don't guarantee that we will tag
all spams, but I'd like the ones we say are spam to actually
*be* spam. Verity is important to me.

Other techniques may be able to catch spams which we miss, and we
may be able to improve our process to catch more spams our way.
I also think our spam% will be very high if the SpamCop reports
represent a good cross-section of actual spams at any given time.

Comments? Surely I'm missing something... ;)

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://www.surbl.org/
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
On Friday, April 2, 2004, 2:11:39 AM, Jeff Chan wrote:
> If it's the case that domains expire out of the SpamCop
> URI data sooner than the particular spam domains remain
> a problem, then I could definitely see a need for a longer
> expiration. Being somewhat new to the game, I don't
> have any data to support either argument.

OK, I can see one flaw in my argument: if message body
domain blocking were already popular and successful, then
*reporting* of spam URIs would taper off as fewer spams
reached victims, even if the spam-referenced domains stayed
up. In that case we could simply increase our expiration
time to make the blocking persist long after the reports
tapered off. (But there still should be some mechanism for
expiring domains off the block list, whatever time period
is used. Or there should be some other method of removing
domains from the list.)

Does anyone have any data about the persistence of spam URI
domains? I'll even settle for any data about spam web server
IP addresses. :-)

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://www.surbl.org/
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
Jeff Chan writes:
> On Friday, April 2, 2004, 2:11:39 AM, Jeff Chan wrote:
> > If it's the case that domains expire out of the SpamCop
> > URI data sooner than the particular spam domains remain
> > a problem, then I could definitely see a need for a longer
> > expiration. Being somewhat new to the game, I don't
> > have any data to support either argument.
>
> OK, I can see one flaw in my argument: if message body
> domain blocking were already popular and successful, then
> *reporting* of spam URIs would taper off as fewer spams
> reached victims, even if the spam-referenced domains stayed
> up. In that case we could simply increase our expiration
> time to make the blocking persist long after the reports
> tapered off. (But there still should be some mechanism for
> expiring domains off the block list, whatever time period
> is used. Or there should be some other method of removing
> domains from the list.)
>
> Does anyone have any data about the persistence of spam URI
> domains? I'll even settle for any data about spam web server
> IP addresses. :-)

I've seen the same domain being used for several months.

BTW I would suggest a TTL in the list of at least 1 month for reported
URIs. If you're worried about FPs hanging around for long, provide a very
easy removal method (e.g. web form or email). Don't bother trying to
assess the spamminess or otherwise of the requester, just remove the URL
ASAP (and log the action, of course).

Side issue: why use easy removal without questions? Spammers do not have
the bandwidth to remove themselves from every list. If they *do* go to
the bother, and a URL does get removed, then repeatedly crops up in spam
again, it should be raised as an alarm -- and possibly brought to the
notice of other people -- e.g. this list or others.

If it really is a spammy URL and the spammer just keeps removing it, I'd
imagine the URL would be noted as such and quickly find its way into
systems that *don't* offer easy removal. ;) If it isn't a spammy URL,
then you've saved yourself a lot of FPs and annoyed users, without
requiring much legwork on your part.

Basically the philosophy is to make it easy for anyone to remove an
URL from the list.

--j.
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
On Friday, April 2, 2004, 3:27:13 PM, Justin Mason wrote:
> Jeff Chan writes:
>> Does anyone have any data about the persistence of spam URI
>> domains? I'll even settle for any data about spam web server
>> IP addresses. :-)

> I've seen the same domain being used for several months.

Thanks much for the feedback. Can you cite some persistent
spam domains?

I'd like to check their histories against my data from SpamCop
reporting. We have enough history built up that I should be able
to see if they would have fallen off my lists at certain points
due to our relatively short expiration. I might be able to use
that information to tune the expirations better.

> BTW I would suggest a TTL in the list of at least 1 month for reported
> URIs. If you're worried about FPs hanging around for long, provide a very
> easy removal method (e.g. web form or email). Don't bother trying to
> assess the spamminess or otherwise of the requester, just remove the URL
> ASAP (and log the action, of course).

> Side issue: why use easy removal without questions? Spammers do not have
> the bandwidth to remove themselves from every list. If they *do* go to
> the bother, and a URL does get removed, then repeatedly crops up in spam
> again, it should be raised as an alarm -- and possibly brought to the
> notice of other people -- e.g. this list or others.

> If it really is a spammy URL and the spammer just keeps removing it, I'd
> imagine the URL would be noted as such and quickly find its way into
> systems that *don't* offer easy removal. ;) If it isn't a spammy URL,
> then you've saved yourself a lot of FPs and annoyed users, without
> requiring much legwork on your part.

> Basically the philosophy is to make it easy for anyone to remove an
> URL from the list.

It's a useful approach to know about. I'm sure as I get more
experience I'll be better able to make judgements about what
can work best. It definitely helps to have input from the
"spam war veterans" so I appreciate it!

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://www.surbl.org/
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
Jeff Chan writes:
> On Friday, April 2, 2004, 3:27:13 PM, Justin Mason wrote:
> > Jeff Chan writes:
> >> Does anyone have any data about the persistence of spam URI
> >> domains? I'll even settle for any data about spam web server
> >> IP addresses. :-)
>
> > I've seen the same domain being used for several months.
>
> Thanks much for the feedback. Can you cite some persistent
> spam domains?

530000x.com, 530000x.org, 530000x.net -- these stuck around
for quite a while before being dropped. They're now almost
definitely not around any more ;)

> I'd like to check their histories against my data from SpamCop
> reporting. We have enough history built up that I should be able
> to see if they would have fallen off my lists at certain points
> due to our relatively short expiration. I might be able to use
> that information to tune the expirations better.
>
> > BTW I would suggest a TTL in the list of at least 1 month for reported
> > URIs. If you're worried about FPs hanging around for long, provide a very
> > easy removal method (e.g. web form or email). Don't bother trying to
> > assess the spamminess or otherwise of the requester, just remove the URL
> > ASAP (and log the action, of course).
>
> > Side issue: why use easy removal without questions? Spammers do not have
> > the bandwidth to remove themselves from every list. If they *do* go to
> > the bother, and a URL does get removed, then repeatedly crops up in spam
> > again, it should be raised as an alarm -- and possibly brought to the
> > notice of other people -- e.g. this list or others.
>
> > If it really is a spammy URL and the spammer just keeps removing it, I'd
> > imagine the URL would be noted as such and quickly find its way into
> > systems that *don't* offer easy removal. ;) If it isn't a spammy URL,
> > then you've saved yourself a lot of FPs and annoyed users, without
> > requiring much legwork on your part.
>
> > Basically the philosophy is to make it easy for anyone to remove an
> > URL from the list.
>
> It's a useful approach to know about. I'm sure as I get more
> experience I'll be better able to make judgements about what
> can work best. It definitely helps to have input from the
> "spam war veterans" so I appreciate it!

np ;)

--j.
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
jm@jmason.org (Justin Mason) writes:

> Side issue: why use easy removal without questions? Spammers do not have
> the bandwidth to remove themselves from every list. If they *do* go to
> the bother, and a URL does get removed, then repeatedly crops up in spam
> again, it should be raised as an alarm -- and possibly brought to the
> notice of other people -- e.g. this list or others.

I'm not so sure easy removal is actually a good idea. I think it's
better to have FP-prevention mechanisms that don't require attention of
the email sender.

Why? Because it's a mechanism biased towards savvy users, people who
use blacklists, SpamAssassin, etc. In addition, it's exactly the same
folks who are already overrepresented in our ham corpus. So, the
effective FP rate will be higher than it appears in our corpus *and*
non-savvy senders will be penalized.

Daniel

--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
On Friday, April 2, 2004, 4:23:51 PM, Daniel Quinlan wrote:
> jm@jmason.org (Justin Mason) writes:

>> Side issue: why use easy removal without questions? Spammers do not have
>> the bandwidth to remove themselves from every list. If they *do* go to
>> the bother, and a URL does get removed, then repeatedly crops up in spam
>> again, it should be raised as an alarm -- and possibly brought to the
>> notice of other people -- e.g. this list or others.

> I'm not so sure easy removal is actually a good idea. I think it's
> better to have FP-prevention mechanisms that don't require attention of
> the email sender.

Can you cite some examples of FP-prevention strategies?

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://www.surbl.org/
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
On Friday, April 2, 2004, 3:52:31 PM, Jeff Chan wrote:
> We have enough history built up that I should be able
> to see if they would have fallen off my lists at certain points
> due to our relatively short expiration. I might be able to use
> that information to tune the expirations better.

After I wrote this I remembered a strategy someone else may have
suggested here earlier (unfortunately I can't remember where I
first saw it): tie the expiration to the amount of reporting.
That could make SURBL somewhat self-tuning:

1. Domains with many reports over a short period of time
probably really are spam domains and would get a longer
expiration. I.e., with a longer expiration we keep watching
this domain for a longer period of time, making it easier
to catch repeat offenses and keep the domain on the list for
longer. Something like an inverse logarithmic function
where the input is the spam count and the output is the
number of days to keep it on the list might be nice.

2. Domains with fewer reports get a shorter expiration.
This lets FPs roll off the list sooner, all automatically.

In other words we don't let small things bother us for
very long, but big offenders get the big hairy eye on
them for a long time. How does this sound?


Something that should probably be clarified about our
expirations is that they are "refreshed" by fresh spam.
If a domain keeps getting more than 10 spam reports over
a 4 day sliding window (current values), it will *stay on
the list for longer than 4 days*. Domains stay on the list
for as long as a certain rate of reports keep coming in,
which could in principle be forever.

It's not like the domains automatically get off the list
after 4 days. If reports keep coming in, they stay on
the list.
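
A sketch of that rule using the current values (hypothetical
Perl, not the actual list-building code):

  use constant { THRESHOLD => 10, WINDOW_DAYS => 4 };

  # A domain is listed while more than THRESHOLD reports fall
  # inside the sliding window; each fresh report can extend
  # its stay indefinitely.
  sub is_listed {
      my ($now, @report_times) = @_;  # epoch seconds per report
      my $cutoff = $now - WINDOW_DAYS * 24 * 60 * 60;
      my $recent = grep { $_ >= $cutoff } @report_times;
      return $recent > THRESHOLD;
  }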

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://www.surbl.org/
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
On Friday, April 2, 2004, 4:42:34 PM, Jeff Chan wrote:
> Something like an inverse logarithmic function
> where the input is the spam count and the output is the
> number of days to keep it on the list

Correction: that should be a log function.

A linear function like reports/10 + 1 could also work. With such
a linear function 10 reports would get a domain on the list for 2
days, and 600 reports would get a domain on the list for 61 days.
(That's roughly the range of counts we're seeing now.) The data
itself already has a normal-looking curve to the counts.
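
As a sketch, that linear rule is just (a hypothetical helper,
not production code):

  # reports/10 + 1 days on the list:
  # 10 reports -> 2 days, 600 reports -> 61 days.
  sub expire_days {
      my ($reports) = @_;
      return int($reports / 10) + 1;
  }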

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://www.surbl.org/
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
> On Friday, April 2, 2004, 3:27:13 PM, Justin Mason wrote:
> > Jeff Chan writes:
> >> Does anyone have any data about the persistence of spam URI
> >> domains? I'll even settle for any data about spam web server
> >> IP addresses. :-)

Jeff, it seems to me that you are in a good place to start figuring this
out. Simply make some sort of database by reported URL (or the part you
keep) and log the dates you find it reported. Any persistent domain is
going to be reported either persistently, or at least repeatedly.

You could even make an exception report that showed the domains that are
being reported consistently for more than, say, 7 days.

Loren
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
Jeff, I had a look at your list at some random time a few days ago. I
noticed that the top 90% or so of the reports looked pretty solid. At the
instant I looked, the bottom 10% of the reports were almost all highly
suspect. This is where the yahoo and geocities and other whitelist stuff
was showing up. Some other reports (and I can't remember what any of them
were) also seemed somewhat suspect, even though they probably weren't on a
whitelist.

I concluded that only the top 90% of your reports should be used in the
blocking test, ignoring the reports with less than 10% of the count of the
highest-scoring report. Now, perhaps this percentage fluctuates with time; I
certainly haven't made multiple checks to see. And maybe after whitelist
removal the rest of the bottom 10% really is spam.

But I think it would be an interesting experiment to compare the reliability
of the top 90% to the reliability of the entire collection.

Loren
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
On Friday, April 2, 2004, 8:53:46 PM, Loren Wilton wrote:
>> On Friday, April 2, 2004, 3:27:13 PM, Justin Mason wrote:
>> > Jeff Chan writes:
>> >> Does anyone have any data about the persistence of spam URI
>> >> domains? I'll even settle for any data about spam web server
>> >> IP addresses. :-)

> Jeff, it seems to me that you are in a good place to start figuring this
> out. Simply make some sort of database by reported URL (or the part you
> keep) and log the dates you find it reported. Any persistent domain is
> going to be reported either persistently, or at least repeatedly.

> You could even make an exception report that showed the domains that are
> being reported consistently for more than, say, 7 days.

Hehe, you're right about who can answer this question.
One issue is that we've only been collecting data for
a little over a month so we don't have the historical
view of some of these issues yet. So the feedback from
folks is definitely helpful. :)

Oddly enough, just before you wrote I added logging of the
new spam domains as they reach the threshold and get added
to the RBL. So now we will be able to see whether any domains
drop off the list and come back on under the current
expiration. The log is visible at:

http://spamcheck.freeapp.net/top-sites-domains.new.log

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://www.surbl.org/
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
Jeff Chan <jeffc@surbl.org> writes:

> Can you cite some examples of FP-prevention strategies?

1. Automated testing. We're testing URLs (web sites). That allows a
   large number of strategies which could be used on each aspect of
   the URL (a sketch of one such check appears after this list):

   A record
     check other blacklists
     check IP owner against SBL
   domain name
     check name servers in other blacklists
     check registrar
     check age of domain (SenderBase information)
     check ISP / IP block owner (SenderBase, SBL, etc.)
   web content
     check web site for common spam web site content (porn, drugs,
     credit card forms, empty top-level page, etc.)

Any of those can also be used in concert with threshold tuning. For
example, lower thresholds if a good blacklist hits and somewhat
higher thresholds for older domains.

2. Building up a long and accurate whitelist of good URLs over time
would also help. Maybe work with places that vouch for domains'
anti-spam policies (Habeas, BondedSender, IADB) to develop longer
whitelists.

3. Using a corpus to tune results and thresholds (also whitelist
seeding).
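
As an illustration, the "check name servers in other
blacklists" idea from item 1 might look roughly like this (a
sketch only, assuming the SBL is queried at sbl.spamhaus.org;
this is not SpamAssassin code):

  use Net::DNS;

  my $res = Net::DNS::Resolver->new;

  # True if any of the domain's name servers resolves to an IP
  # that is listed on the SBL.
  sub ns_on_sbl {
      my ($domain) = @_;
      my $ns_reply = $res->query($domain, 'NS') or return 0;
      for my $ns (grep { $_->type eq 'NS' } $ns_reply->answer) {
          my $a_reply = $res->query($ns->nsdname, 'A') or next;
          for my $a (grep { $_->type eq 'A' } $a_reply->answer) {
              my $rev = join '.', reverse split /\./, $a->address;
              return 1
                  if $res->query("$rev.sbl.spamhaus.org", 'A');
          }
      }
      return 0;
  }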

Daniel

--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting
Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
On Friday, April 2, 2004, 9:02:59 PM, Loren Wilton wrote:
> Jeff, I had a look at your list at some random time a few days ago. I
> noticed that the top 90% or so of the reports looked pretty solid. At the
> instant I looked, the bottom 10% of the reports were almost all highly
> suspect. This is where the yahoo and geocities and other whitelist stuff
> was showing up. Some other reports (and I can't remember what any of them
> were) also seemed somewhat suspect, even though they probably weren't on a
> whitelist.

> I concluded that only the top 90% of your reports should be used in the
> blocking test, ignoring the reports with less than 10% of the count of the
> highest-scoring report. Now, perhaps this percentage fluctuates with time; I
> certainly haven't made multiple checks to see. And maybe after whitelist
> removal the rest of the bottom 10% really is spam.

> But I think it would be an interesting experiment to compare the reliability
> of the top 90% to the reliability of the entire collection.

Thanks for checking this over for us! It looks like you visited:

http://spamcheck.freeapp.net/top-sites.html

which does not have the whitelist entries removed from it and
which does not go all the way down to the threshold of 10 spams.

The full list, which has about 11000 entries, can be seen at:

http://spamcheck.freeapp.net/top-sites.txt

This is the basis for the thresholded 400 or so domains at:

http://spamcheck.freeapp.net/top-sites-domains

which doesn't show the counts used for thresholding, but all
of its entries got more than 10 counts. It does, however, have
duplicates like www.domain.com for domain.com eliminated, and
perhaps most importantly it *has had the whitelisted domains
and two-level ccTLDs removed*. It is the basis for the RBL:

http://spamcheck.freeapp.net/surbl.bind

Due to the whitelisting and thresholding, the domains that make
it into SURBL are quite spammy: hopefully, and probably, more
than the 90% you estimated on the unfiltered list.
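
The build steps amount to something like the following (a rough
Perl sketch with placeholder input handling and whitelist
entries; the two-level ccTLD handling is omitted, and this is
not the actual script):

  my %count;
  my %whitelist = map { $_ => 1 } qw(yahoo.com geocities.com);

  # Input lines look like "<count> <domain>", as in
  # top-sites.txt.
  while (my $line = <STDIN>) {
      my ($n, $dom) = split ' ', $line;
      $dom =~ s/^www\.//;   # fold www.domain.com into domain.com
      $count{$dom} += $n;
  }

  for my $dom (sort keys %count) {
      next if $count{$dom} <= 10;  # threshold: >10 reports
      next if $whitelist{$dom};    # drop whitelisted domains
      print "$dom\n";              # feeds the surbl.bind zone
  }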

Cheers,

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://www.surbl.org/
