Mailing List Archive

Spam URI TLD report sizes
FWIW, here's a du -sk directory size summary of the reports
SURBL grabbed from SpamCop Spamvertised sites over the past
4 days or so, stored by TLD or first octet of a numeric URI:

KBytes TLD or first octet of numeric address
====== =====================================
7 140
7 163
7 196
7 199
34 200
13 202
8 203
7 204
37 205
25 207
3 208
14 209
1 210
67 211
19 213
7 216
7 217
31 218
41 219
7 220
13 24
11 61
7 63
31 64
27 66
13 68
33 69
13 80
7 82
1 ae
1 an
5 ar
5 aspa
9 au
5 be
5550 biz
5 bogeyme
60 br
5 bz
1 ca
38 cc
3 celer
9 ch
21 cl
57 cn
7653 com
57 de
5 edu
9 es
3 f
17 fr
5 gg
9 gr
3 grand
9 hk
1 hostingp
11 il
3 imabigpimp
5 in
5798 info
21 it
9 jp
25 kr
5 mx
5 name
946 net
21 nl
5 no
3 nort
5 nu
305 org
5 pe
75 ph
1 pl
9 pt
21 ro
51 ru
5 se
1 sg
1 sk
1 st
1 st1
3 tabletswh
29 tc
5 thesed
5 tk
11 to
5 tr
51 tv
50 tw
9 ua
32 uk
1880 us
5 whole
69 ws
13 za

Looks like .com is the top spam site TLD reported to SpamCop,
followed by .info and .biz, then .us. And 211 is the top
first octet among numeric URIs.
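
For anyone who wants to reproduce the ranking, here is a minimal
Python sketch that sorts a size summary like the one above. It assumes
each data line is "KBytes name" separated by whitespace; the function
name rank_sizes is made up for illustration and is not part of any
SURBL tooling.

```python
def rank_sizes(du_output):
    """Split a 'KBytes name' listing into TLD rows and numeric rows,
    each sorted largest first. Header and separator lines are skipped
    because their first field is not a number."""
    rows = []
    for line in du_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        rows.append((int(parts[0]), parts[1].strip()))
    # A purely numeric name is treated as the first octet of an address.
    names = sorted((r for r in rows if not r[1].isdigit()), reverse=True)
    numeric = sorted((r for r in rows if r[1].isdigit()), reverse=True)
    return names, numeric


sample = """KBytes TLD or first octet of numeric address
====== =====================================
67 211
41 219
5550 biz
7653 com
5798 info
1880 us
"""
names, numeric = rank_sizes(sample)
print(names[:4])
print(numeric[:1])
```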

The obviously wrong TLDs like "grand" and "tabletswh" are either
sloppy URIs or an attempt to take advantage of an implicit .com
some browsers apparently add when no TLD is specified in
a URI. If the latter, it could be an attempt to get around
message body scanning: sort of "obfuscation by underspecification".
We could counter this by appending ".com" before processing any
domain lacking a legitimate-looking TLD.
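
A rough Python sketch of that countermeasure, under some assumptions:
KNOWN_TLDS below is only a tiny illustrative subset (a real check
would need the full registry of valid TLDs), and normalize_host is a
hypothetical name, not something SURBL actually runs.

```python
import ipaddress

# Illustrative subset only; a real list would have to cover every valid TLD.
KNOWN_TLDS = {"com", "net", "org", "info", "biz", "us", "uk", "de", "cn", "ru"}


def is_numeric_host(host):
    """Return True if host is a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(host)
        return True
    except ValueError:
        return False


def normalize_host(host):
    """Append '.com' when the last label is not a known TLD.

    Numeric addresses and hosts that already end in a known TLD are
    returned unchanged (apart from lowercasing and trimming).
    """
    host = host.strip().rstrip(".").lower()
    if not host or is_numeric_host(host):
        return host
    tld = host.rsplit(".", 1)[-1]
    if tld in KNOWN_TLDS:
        return host
    return host + ".com"


for h in ("grand", "tabletswh", "example.info", "211.23.103.36"):
    print(h, "->", normalize_host(h))
```

One caveat with this approach: a host that ends in a valid TLD missing
from the list would wrongly get ".com" appended, so the list needs to
be complete before this is used on real data.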

Individual record lines vary in size somewhat, so something like a
record count (line count) would be a more accurate way to measure
the number of minute-unique spam reports, but as a general
estimate of reported activity, directory size is probably pretty good.
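
If someone wanted the line counts instead, something like the
following Python sketch would do it. It assumes a local copy of the
tree where every file holds one report record per line; that layout is
a guess, and count_records is a made-up name for illustration.

```python
import os
from collections import Counter


def count_records(root):
    """Count lines in all files under each top-level directory of root.

    Returns a Counter mapping the top-level name (TLD or first octet)
    to the total number of lines found beneath it.
    """
    counts = Counter()
    for entry in sorted(os.listdir(root)):
        top = os.path.join(root, entry)
        if not os.path.isdir(top):
            continue
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                # Binary mode avoids decoding errors on odd report bytes.
                with open(os.path.join(dirpath, name), "rb") as fh:
                    counts[entry] += sum(1 for _ in fh)
    return counts
```

Calling count_records("domains").most_common(4) on such a copy would
list the four busiest top-level entries by record count rather than by
KBytes.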

Source data is the "domains" directory SURBL uses as a text
database of reports, stored into a tree of domain levels:

http://spamcheck.freeapp.net/domains/

Hope this kind of info is not too redundant; I'm new here...

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://sc.surbl.org/
Re: Spam URI TLD report sizes
Jeff Chan <jeffc@surbl.org> writes:

> FWIW Here's an du -sk directory size summary of the reports
> SURBL grabbed from SpamCop Spamvertised sites over the past
> 4 days or so, stored by TLD or first octet of a numeric URI:
>
> KBytes TLD or first octet of numeric address

It might be interesting to do checks on /24 networks since spammers will
often get a whole block of addresses and divvy up their current domains
amongst them.

If it's possible and not too much work for you, it might be worth trying
a bunch of different approaches on different temporary subdomains and
then we can compare each against our corpora.

- longer timeout vs. shorter timeout
- lower threshold vs. higher threshold
- gathering /24 networks for numeric addresses (combined with an A
lookup of non-numeric addresses on your end).
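
As a sketch of the /24 idea, the Python below groups numeric addresses
by their enclosing /24 using the standard ipaddress module. The name
group_by_slash24 is hypothetical, and the A lookup for non-numeric
addresses is left out; the input is assumed to be IPv4 strings already.

```python
import ipaddress
from collections import defaultdict


def group_by_slash24(addresses):
    """Map each /24 network (as a string) to the addresses inside it."""
    groups = defaultdict(list)
    for addr in addresses:
        # strict=False masks off the host bits instead of raising.
        net = ipaddress.ip_network(addr + "/24", strict=False)
        groups[str(net)].append(addr)
    return dict(groups)


reported = ["211.23.103.36", "211.23.103.40", "211.147.5.9"]
for net, members in sorted(group_by_slash24(reported).items()):
    print(net, len(members))
```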


Daniel

--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting
Re: Spam URI TLD report sizes
On Tuesday, March 30, 2004, 3:18:02 PM, Daniel Quinlan wrote:
> Jeff Chan <jeffc@surbl.org> writes:
>> FWIW Here's an du -sk directory size summary of the reports
>> SURBL grabbed from SpamCop Spamvertised sites over the past
>> 4 days or so, stored by TLD or first octet of a numeric URI:
>>
>> KBytes TLD or first octet of numeric address

> It might be interesting to do checks on /24 networks since spammers will
> often get a whole block of addresses and divvy up their current domains
> amongst them.

> If it's possible and not too much work for you, it might be worth trying
> a bunch of different approaches on different temporary subdomains and
> then we can compare each against our corpora.

> - longer timeout vs. shorter timeout
> - lower threshold vs. higher threshold
> - gathering /24 networks for numeric addresses (combined with an A
> lookup of non-numeric addresses on your end).

I misspoke somewhat: the data used for this is the source for
SURBL (and not vice versa :-). None of the thresholding that SURBL
does is reflected in my previous posting. That was based on the
raw data of reports, including any that don't happen to come up to
the SURBL threshold.

/24s are visible in the data as:

http://spamcheck.freeapp.net/domains/N/N/N where N is a number

Similarly for /16 and /32:

http://spamcheck.freeapp.net/domains/N/N/
http://spamcheck.freeapp.net/domains/N/N/N/N

with the exception that we skipped /8s because they would be
too inclusive, just like we skipped accumulated reports of TLDs.

So for example, under 211 there is data for:

http://spamcheck.freeapp.net/domains/211/23/
http://spamcheck.freeapp.net/domains/211/23/103/
http://spamcheck.freeapp.net/domains/211/23/103/36/
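
To make the mapping from directory depth to prefix length explicit,
here is a small Python sketch. The helper name path_to_cidr is made up
for illustration; it takes the part of the path after "domains/" and
rejects single-octet paths, since /8s are not accumulated.

```python
def path_to_cidr(path):
    """Convert 'N/N', 'N/N/N' or 'N/N/N/N' into /16, /24 or /32 notation."""
    octets = [p for p in path.strip("/").split("/") if p]
    if not 2 <= len(octets) <= 4:
        raise ValueError("expected 2 to 4 octets, got %d" % len(octets))
    values = [int(o) for o in octets]  # non-numeric parts raise ValueError
    if any(not 0 <= v <= 255 for v in values):
        raise ValueError("octet out of range in %r" % path)
    # Pad the missing low-order octets with zeros; prefix is 8 bits per octet.
    padded = values + [0] * (4 - len(values))
    return "{}.{}.{}.{}/{}".format(*padded, 8 * len(values))


for p in ("211/23/", "211/23/103/", "211/23/103/36/"):
    print(p, "->", path_to_cidr(p))
```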

Some of the other 211/16s are much larger, such as:

http://spamcheck.freeapp.net/domains/211/147/

No name resolution is done on the domain name data.
No reverse name resolution is done on the numeric IP address
data from URIs. It's all just a record of what was reported.
We could do resolution, and may in the future, but that's not our
primary focus for the use of the data.

Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://sc.surbl.org/