Mailing List Archive: Re: Making spam scores public

David MacQuigg wrote on Monday, March 05, 2007 4:27 PM -0600:

> It seems to me the real goal of our work on these reputation systems
> is to provide a universal solution to the spam problem.

While laudable, this is not possible.

> Keeping the data private means there is no fundamental difference
> between what you are doing and what any large ISP or spam-appliance
> company does. How can you expect your solution to be any better
> than what these private companies are doing?

Having private data is a large advantage, which is why large ISPs don't
publish their internal listing criteria. Attacking a small network that
uses local reputation data followed by Bayesian content filtering is
inherently harder than attacking a system using public DNSBLs and
content filters with public rule sets. You can get some unwanted
messages through, but you can't test the messages for deliverability
ahead of time. The only commonality among networks that use local data
is the code that generates it. The data itself, and the system
parameters that drive the decisions, are all unknown to attackers.

The private data advantage is reduced if your incoming mail flow is too
small. Communications among a few peer systems can help greatly in this
case. As an aside, Bayesian filters don't necessarily work better than
carefully maintained rule sets, but they do it with a fraction of the
maintenance. Private reputation data created from your own mail flow
holds the same promise.

I don't mean to imply that there is no use for public reputation data.
Evaluating whether to use data from a particular source means knowing
who they are. This exposes them to legal action, a risk most companies
do not want. An alternative is creating composite data from all
submitters, which is the SpamCop approach that many sites find too
unreliable. In the end, the most successful public lists are created
from networks of trusted private sources and are carefully managed.

> I think the way to deal with threats of costly lawsuits is to set up
> the company in a jurisdiction with more common sense in their legal
> system than the USA.

This is the precise reason that U.S. companies will not likely make
their reputation data public. Even if there were no threat of lawsuits,
publishing this data tells your attackers how effective they were with
each spam run.

> If some rating service is put out of business by a lawsuit, others
> will take its place.

Even the threat of lawsuits is enough to deter most people.

--
Seth Goodman

-------
Sender Policy Framework: http://www.openspf.org/
Archives at http://archives.listbox.com/spf-discuss/current/
To unsubscribe, change your address, or temporarily deactivate your subscription,
please go to http://v2.listbox.com/member/?list_id=735

At 11:42 PM 3/6/2007 -0600, Seth Goodman wrote:
>David MacQuigg wrote on Monday, March 05, 2007 4:27 PM -0600:
>
> > It seems to me the real goal of our work on these reputation systems
> > is to provide a universal solution to the spam problem.
>
>While laudable, this is not possible.

Nattering Nabob of Negativism!!! :>)
http://en.wikipedia.org/wiki/Spiro_Agnew

> > Keeping the data private means there is no fundamental difference
> > between what you are doing and what any large ISP or spam-appliance
> > company does. How can you expect your solution to be any better
> > than what these private companies are doing?
>
>Having private data is a large advantage, which is why large ISPs don't
>publish their internal listing criteria. Attacking a small network that
>uses local reputation data followed by Bayesian content filtering is
>inherently harder than attacking a system using public DNSBLs and
>content filters with public rule sets. You can get some unwanted
>messages through, but you can't test the messages for deliverability
>ahead of time. The only commonality among networks that use local data
>is the code that generates it. The data itself, and the system
>parameters that drive the decisions, are all unknown to attackers.

I think we need to make a distinction between private reputation data and
private rule sets or methods. I can see your point if we are talking about
rule sets and methods, and I am inclined to agree, but not entirely
convinced. The SpamAssassin folks argue that making their rule sets public
is not a problem. If I understand their argument, it is that their rule
sets are so large and hard to work around that it takes spammers months to
adapt, and by that time, the next update on their rule set is available. I
wish I had a link for you, but I do recall seeing a graph of spam vs time
supporting this argument. Whatever the conclusion, it doesn't really
matter, because we use SpamAssassin only to process messages that are not
whitelisted, and as a means of generating reputation scores for domains
that have not yet qualified to be whitelisted. SpamAssassin scores are
plenty accurate for that purpose.

If spammers get very good at avoiding SpamAssassin's rule sets, we can
switch to another filter, or even use different filters for different
receivers, and keep the choices secret. As long as we get some feedback on
the filter's decisions, we can include this feedback in an overall average
rating for a domain.

As to the tactical advantage of keeping the reputation data secret, I see
none. These are long-term averages of results from many receivers, not a
rapid-feedback loop to help spammers improve their methods. On the
contrary, I see a large advantage to publishing the data. This is what
will motivate legitimate senders to block the zombies in their networks by
publishing better authentication records.

When email recipients can see a direct comparison of spam ratings for
comcast.net and aol.com, Comcast might just decide that publishing a strict
authentication record would be in their best interest. They might lose
some of their spamming customers, but so will every other large ISP that
tolerates spammers. My guess is that they would welcome an opportunity to
tell their spammers - "Hey guys, we have to do this. It's not our fault."

>The private data advantage is reduced if your incoming mail flow is too
>small. Communications among a few peer systems can help greatly in this
>case.

The main problem I see with private data is that it doesn't allow for the
kind of rapid global communication needed to make spamming unprofitable. I
realize that peers can be located anywhere in the world, so when I say
"global", I don't mean geography. I mean no isolated "islands" of peers
that can be attacked one at a time. If it takes even a few hours to
downgrade a reputation, that will be plenty of time for spammers to
inundate one island, then move on to the next.

If I understand the Gossip system, reputation information "diffuses"
throughout the network of peers, one link at a time. How long will it take
before the whole world knows that a particular domain has been taken over
by spammers?
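
As a rough way to reason about that question, here is a minimal sketch
that counts how many exchange rounds a report needs to reach every peer.
It is not the actual Gossip protocol: it assumes a simplified model in
which every peer that has heard the news passes it to all of its direct
neighbors once per round. The function name and the `peers` structure
are hypothetical.

```python
from collections import deque


def rounds_to_reach_all(peers, origin):
    """Return {peer: round in which it first hears the news}.

    peers  -- dict mapping each peer to an iterable of its neighbors
              (hypothetical structure, assumed for illustration)
    origin -- the peer that first detects the hijacked domain
    """
    heard = {origin: 0}
    queue = deque([origin])
    # Breadth-first search: each hop away from the origin costs one round.
    while queue:
        node = queue.popleft()
        for neighbor in peers.get(node, ()):
            if neighbor not in heard:
                heard[neighbor] = heard[node] + 1
                queue.append(neighbor)
    return heard


# A chain of four peers: the news needs three rounds to reach the far end.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
rounds = rounds_to_reach_all(chain, "a")
worst_case = max(rounds.values())
```

Under this model the delay is the worst-case hop count multiplied by
whatever the interval between exchanges turns out to be, so a sparsely
connected network of peers is the slow case.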

>As an aside, Bayesian filters don't necessarily work better than
>carefully maintained rule sets, but they do it with a fraction of the
>maintenance. Private reputation data created from your own mail flow
>holds the same promise.

Bayesian filters, heuristic rule sets, and IP blacklists are all inferior to
feedback from recipients. The trick is to make the amount of spam from
whitelisted senders small enough that recipients don't mind having to
report it. In the last three weeks, I've seen only 5 whitelisted spams in
my inbox, 3 from google.com, and 2 from comcast.net. With only one or two
spams a week, recipients won't mind dropping what they are doing, quickly
reviewing the content of the message, and forwarding it to a spam-reporting
address.

We can also make things nicer for recipients by sending them an immediate
acknowledgement of their report, and a link to a website where they can see
their report listed along with any others for the domain in question, the
response of the domain postmaster, and any actions taken by the Rating
Services watching this domain.

A few months ago, I saw a burst of spam from Yahoo's webmail servers
lasting a few days. I expect they will be much quicker in shutting down
these sources when they are prodded by our spam reports, and when all they
have to do is publish one DNS record.

>I don't mean to imply that there is no use for public reputation data.
>Evaluating whether to use data from a particular source means knowing
>who they are. This exposes them to legal action, a risk most companies
>do not want. An alternative is creating composite data from all
>submitters, which is the SpamCop approach that many sites find too
>unreliable. In the end, the most successful public lists are created
>from networks of trusted private sources and are carefully managed.

Our ratings will come from many sources, including Gossip, if we can find a
way to interface. The simplest system, which we are testing now on a small
scale, simply takes an "average" of the SpamAssassin scores from many
receivers over a long period of time, discarding the "outliers", which we
define to be any source that attempts to move the average too much in
either direction. This will eliminate the most obvious attack, sending
huge volumes of phony mail to a collaborating recipient, so as to drive up
the "ham" count.
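
A minimal sketch of one way to read that outlier rule: a receiver is
discarded when removing its reports would shift the overall average by
more than some limit. The function name, the `reports` structure, and
the `max_shift` value are assumptions made for illustration, not the
parameters used in the test system.

```python
from statistics import fmean


def robust_domain_score(reports, max_shift=0.5):
    """Average SpamAssassin scores for one domain, ignoring outlier sources.

    reports   -- dict mapping receiver name -> list of scores (assumed shape)
    max_shift -- largest change in the overall average that a single
                 receiver is allowed to cause (assumed value)
    Returns (average of the kept scores or None, list of dropped receivers).
    """
    all_scores = [s for scores in reports.values() for s in scores]
    if not all_scores:
        return None, []
    overall = fmean(all_scores)

    kept, dropped = [], []
    for receiver, scores in reports.items():
        others = [s for r, sc in reports.items() if r != receiver for s in sc]
        # Influence: how far this one receiver pulls the overall average.
        influence = abs(overall - fmean(others)) if others else 0.0
        if influence > max_shift:
            dropped.append(receiver)
        else:
            kept.extend(scores)
    return (fmean(kept) if kept else None), dropped


# One collaborating receiver floods the system with very "hammy" scores.
reports = {
    "rx1": [1.0, 1.2, 0.8],
    "rx2": [1.1, 0.9],
    "rx3": [0.9, 1.1, 1.0],
    "shill": [-5.0] * 50,
}
score, dropped = robust_domain_score(reports)
```

Here the flooding receiver is the only one dropped, and the remaining
scores are averaged as if it had never reported.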

While I hope that spammers will simply give up, and not force us to the
next step, I am fully prepared for a battle of wits, as clever spammers try
to fool equally clever managers at the Rating Services using our
Registry. I'm working now on some Python scripts to display the data on a
domain in a way that will allow managers to quickly spot anomalies. It
should be very difficult for a spammer to generate a broad distribution of
"ham" over a long period of time, enough to look like a normal legitimate
sender.

I still don't understand the legal threat you keep referring to. There is
no such thing as a "bad" reputation in our system. Ratings range from C
(unknown) to A (less than one spam in 100 messages). We don't bother with
lower ratings, because we assume that no spammer will continue to use a
name with a rating lower than a fresh new "unknown" name. If a spammer is
thwarted in an attempt to gain a higher reputation, who is he going to
sue? What would be the allegation - "Spamhaus, you failed to give me the
A-rating I deserve after 3 months of diligently faking a legitimate mail-flow?"
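
The rating scale can be sketched as follows. Only the two endpoints
come from the description above: C means unknown, and A means less than
one spam in 100 messages. The intermediate B threshold and the minimum
message count needed before a domain leaves "unknown" are assumptions
chosen purely for illustration.

```python
def rate_domain(spam_count, total_count,
                min_messages=1000,   # assumed: below this, still "unknown"
                a_max_rate=0.01,     # from the text: under 1 spam in 100
                b_max_rate=0.05):    # assumed intermediate threshold
    """Return 'A', 'B', or 'C' for a domain; there is no rating below C."""
    if total_count < min_messages:
        return "C"  # not enough history, so the domain stays unknown
    spam_rate = spam_count / total_count
    if spam_rate < a_max_rate:
        return "A"
    if spam_rate < b_max_rate:
        return "B"
    # A poor record is treated the same as a fresh, unknown name.
    return "C"


rating = rate_domain(spam_count=5, total_count=10000)
```

Because the scale bottoms out at C, a domain that behaves badly ends up
no worse off than a newly registered name, which is the property the
argument about lawsuits relies on.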

I don't see any threat from legitimate senders who lose their reputation
through innocent mistakes. A well-managed rating service will work with a
legitimate mailer to correct the mistake quickly. Let's say yahoo.com is
doing quite well with their current default record, authorizing 84992 IP
addresses. Suddenly spammers discover that they can forge Yahoo's name, at
least on the zombies that lie within one of these huge IP blocks. What
will Yahoo do, hire lawyers to sue rating services all over the world, or
simply assert control of their Registry record, and de-authorize the zombies?

> > I think the way to deal with threats of costly lawsuits is to set up
> > the company in a jurisdiction with more common sense in their legal
> > system than the USA.
>
>This is the precise reason that U.S. companies will not likely make
>their reputation data public.

The exception would be large companies like Ironport. No spammer would
dare sue them. They do, in fact, make their data public, just not in a way
that can be automated without paying a fee. I expect that fees to rating
services like Ironport will be the biggest cost in providing Registry
services. That is as it should be. We need the best services in the world
to provide the most reliable domain ratings. Everything else can be automated.

I believe the reason we don't have public reputation data is not fear of
lawsuits, but rather a desire by companies to maintain a competitive
advantage in selling their bundled products.

>Even if there were no threat of lawsuits,
>publishing this data tells your attackers how effective they were with
>each spam run.

The published data are long-term averages of data from many
sources. This will be of very little value to the spammer. The only
immediate feedback a spammer might see is an alert that goes out when a
reputable domain is suddenly hijacked.

> > If some rating service is put out of business by a lawsuit, others
> > will take its place.
>
>Even the threat of lawsuits is enough to deter most people.

The few Rating Services brave enough not to fear harassment in U.S.
courts will include the ones listed in our Registry records.

A much bigger worry regarding the reliability of Rating Services will be
the possibility of bogus services controlled by spammers. Our strategy
here is to pick the best services by allowing Registry subscribers to
designate what fraction of their subscription fee goes to each
Service. Corrupt or incompetent services will quickly lose their income,
and eventually be dropped from the Registry.
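
The fee-allocation idea amounts to simple arithmetic, sketched below.
The function name, the data layout, and the service names are
hypothetical; how the Registry would actually record subscriber choices
is not specified here.

```python
def service_income(subscribers):
    """Total income per Rating Service from subscriber-designated shares.

    subscribers -- list of (fee, {service_name: fraction}) pairs, where
                   the fractions for one subscriber sum to at most 1
                   (assumed layout, for illustration only)
    """
    income = {}
    for fee, shares in subscribers:
        if sum(shares.values()) > 1.0 + 1e-9:
            raise ValueError("fractions for one subscriber exceed 100%")
        for service, fraction in shares.items():
            income[service] = income.get(service, 0.0) + fee * fraction
    return income


# Two subscribers paying the same fee; the second gives nothing to "Y".
income = service_income([
    (100.0, {"X": 0.5, "Y": 0.5}),
    (100.0, {"X": 1.0}),
])
```

A service that subscribers stop trusting sees its share fall toward
zero without anyone having to remove it by hand.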

-- Dave
