On Saturday, April 3, 2004, 12:34:24 AM, Daniel Quinlan wrote:
> Jeff Chan <jeffc@surbl.org> writes:
>> Can you cite some examples of FP-prevention strategies?
> 1. Automated testing. We're testing URLs (web sites). That allows a
> large number of strategies which could be used from each aspect of
> the URL.
>     A record
>       check other blacklists
>       check IP owner against SBL
>     domain name
>       check name servers in other blacklists
>       check registrar
>       check age of domain (SenderBase information)
>       check ISP / IP block owner (SenderBase, SBL, etc.)
>     web content
>       check web site for common spam web site content (porn, drugs,
>       credit card forms, empty top-level page, etc.)
> Any of those can also be used in concert with threshold tuning. For
> example, lower thresholds if a good blacklist hits and somewhat
> higher thresholds for older domains.
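For concreteness, the A record check above amounts to a standard DNSBL lookup: resolve the URL's host to an IP, reverse the octets, and query the blacklist zone. A minimal sketch of building that query name (the zone and listing convention are ordinary DNSBL practice; this is not SURBL's or Spamhaus's actual code):

```python
# Sketch: construct the DNS name used to test an IPv4 address against
# a DNSBL zone such as sbl.spamhaus.org. A resolver answer
# (conventionally 127.0.0.x) means "listed"; NXDOMAIN means "not listed".

def dnsbl_query_name(ip: str, zone: str = "sbl.spamhaus.org") -> str:
    """Reverse the IPv4 octets and append the blacklist zone name."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address")
    return ".".join(reversed(octets)) + "." + zone

print(dnsbl_query_name("192.0.2.99"))  # 99.2.0.192.sbl.spamhaus.org
```

The actual lookup would then be an ordinary DNS A-record query on that name.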
I agree with the content check, but will step on many toes here
by proclaiming that other blacklists (other than SBL), name
servers, registrars, ISP address blocks, and similar approaches
are overly broad and have too much potential for collateral
damage *for my sensibilities*. I really, really hate
blacklisting innocent victims. I consider that a false
accusation or even false punishment. Policies which allow
blacklisting an entire ISP or even an entire web server IP
address have the potential to harm too many innocent bystanders,
IMO. Your mileage may and probably does vary. ;)
Our approach is to start with some likely good data in the
SpamCop URIs. See comments below.
> 2. Building up a long and accurate whitelist of good URLs over time
> would also help. Maybe work with places that vouch for domain's
> anti-spam policies (Habeas, BondedSender, IADB) to develop longer
> whitelists.
I agree in principle; however, I feel that the SpamCop-reported
URIs tend to have relatively few FPs. They are domains that
people took the time to report; in essence they are *voting with
their time that these are spam domains*.
That's one of the reasons our whitelist is quite small now (about
35 entries), yet it catches the few legitimate domains and
subdomains that survive the reporting and thresholding, i.e. are
mistakenly reported often enough to get onto the list, before I
can notice and whitelist them. That need has been small so far.
http://spamcheck.freeapp.net/whitelist-domains
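The whitelist pass can be sketched roughly as follows. The entries and the parent-domain matching rule here are my illustration, not the list's actual contents or implementation; the idea is simply that a whitelisted domain also covers its subdomains:

```python
# Hypothetical sketch: check a reported domain against a small manual
# whitelist before it is published. Matching walks up the labels so
# that a whitelisted "example.com" also covers "mail.example.com".

WHITELIST = {"example.com", "w3.org"}  # illustrative entries only

def is_whitelisted(domain: str, whitelist=WHITELIST) -> bool:
    labels = domain.lower().rstrip(".").split(".")
    # Test the domain itself, then each parent: a.b.c -> a.b.c, b.c, c
    for i in range(len(labels)):
        if ".".join(labels[i:]) in whitelist:
            return True
    return False

print(is_whitelisted("mail.example.com"))  # True
print(is_whitelisted("spam-site.test"))    # False
```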
> 3. Using a corpus to tune results and thresholds (also whitelist
> seeding).
Agreed. Currently we lack spam and ham corpora of our own and
have not had a chance to set any up yet. That may come later
though.
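Once corpora exist, the tuning in item 3 could look something like this: given per-domain report counts and ham/spam labels from a corpus, pick the lowest inclusion threshold that lists no ham domains. All names and numbers below are made up for illustration:

```python
# Illustrative sketch of corpus-driven threshold tuning: find the
# lowest report-count threshold whose false-positive count on the
# ham corpus stays within an acceptable bound (zero here).

def pick_threshold(counts, ham, spam, max_fp=0):
    """counts: domain -> report count; ham/spam: sets of corpus domains."""
    for threshold in sorted(set(counts.values())):
        listed = {d for d, n in counts.items() if n >= threshold}
        if len(listed & ham) <= max_fp:
            return threshold
    return None  # no threshold meets the FP bound

counts = {"good.example": 2, "shop.example": 1,
          "pills.example": 40, "casino.example": 25}
ham = {"good.example", "shop.example"}
spam = {"pills.example", "casino.example"}
print(pick_threshold(counts, ham, spam))  # 25: lists both spam domains, no ham
```

The same loop could just as easily minimize missed spam subject to an FP bound, which is the trade-off threshold tuning is really about.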
I hope I'm not taking too confrontational a tone here. I'm
just trying to defend our approach, which I think can be valid.
I also realize people have a lot of work invested in other
approaches, but I hope they will eventually give ours a try.
I feel it has value, even if I can't prove it conclusively
myself yet. LOL! :-)
Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://www.surbl.org/