Mailing List Archive

RFH: using SOUGHT logic to combat phish
Hello,

I have a fairly large archive of phish mail (bank and mail accounts), in
which many words and phrases repeat.

I was thinking about processing them manually and creating rules, but that
would be a lot of work.

I remember that the SOUGHT ruleset used to contain phrases that appear
repeatedly, so I'd like to try that logic, if possible.

So far I have found:
- a description of how it works: https://taint.org/2007/03/05/134447a.html
- a script to search a corpus:
https://svn.apache.org/repos/asf/spamassassin/trunk/masses/rule-dev/seek-phrases-in-corpus

which seems to use plugins (Dumptext.pm, GrepRenderedBody.pm) that I found at:
https://svn.apache.org/repos/asf/spamassassin/branches/3.3/masses/plugins/


Are these still working, or are there newer versions?

Does anyone have hints on how to process a phish archive?

I mean, I could apparently weed out any repeated non-phish phrases manually
to avoid FPs, or manually check which mail they hit, so that I wouldn't need
to keep much ham mail.

--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Due to unexpected conditions Windows 2000 will be released
in first quarter of year 1901
Re: RFH: using SOUGHT logic to combat phish
On 10/11/2022 5:38 AM, Matus UHLAR - fantomas wrote:
> Are these still working, or are there newer versions?
>
> Does anyone have hints on how to process a phish archive?
>
> I mean, I could apparently weed out any repeated non-phish phrases
> manually to avoid FPs, or manually check which mail they hit, so that
> I wouldn't need to keep much ham mail.
There was some interest in a SOUGHT2, but no, the tooling hasn't been
looked at in some time.  It would show promise if you want to dig into it!

--
Kevin A. McGrail
KMcGrail@Apache.org

Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171
Re: RFH: using SOUGHT logic to combat phish
Matus UHLAR - fantomas wrote:
> Hello,
>
> I have a fairly large archive of phish mail (bank and mail accounts),
> in which many words and phrases repeat.
>
> I was thinking about processing them manually and creating rules, but
> that would be a lot of work.
>
> I remember that the SOUGHT ruleset used to contain phrases that appear
> repeatedly, so I'd like to try that logic, if possible.
>
> So far I have found:
> - a description of how it works: https://taint.org/2007/03/05/134447a.html
> - a script to search a corpus:
>
> https://svn.apache.org/repos/asf/spamassassin/trunk/masses/rule-dev/seek-phrases-in-corpus
>
>
> which seems to use plugins (Dumptext.pm, GrepRenderedBody.pm) that I
> found at:
> https://svn.apache.org/repos/asf/spamassassin/branches/3.3/masses/plugins/
>
>
> Are these still working, or are there newer versions?

I'm a little hazy on the deep internals, but all the parts are still in
SVN trunk. I've been using this locally with a growing collection of
configuration wrappers to generate a number of rule sets for different
subgroups of spam.

I've just tried a test in a current trunk checkout and everything seems
to work without issue. Some components may need a little more tweaking
for local conditions.

> Does anyone have hints on how to process a phish archive?
>
> I mean, I could apparently weed out any repeated non-phish phrases
> manually to avoid FPs, or manually check which mail they hit, so that
> I wouldn't need to keep much ham mail.

The minimal setup is to modify
masses/rule-dev/sought/example_backend/run for your local pathnames,
change the rule fragment names however you like, and run that script.
I've attached a patch showing my own changes for my quick test above.

You *do* need a collection of ham, however; as-is it relies on that to
weed out patterns you don't actually want to fire on, as well as
sorting/grouping the patterns by hit-rate thresholds.

You could probably still use one of the intermediate files to bootstrap
what you might have done manually, but you risk including poor patterns
(either those that don't hit much, or those that also hit ham).
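
To illustrate why the ham collection matters, here is a rough Python sketch
of the general idea only. It is NOT the SOUGHT tooling and does not reflect
how seek-phrases-in-corpus is implemented; the function names, the fixed
word-window phrase length, and the threshold values are all assumptions made
up for this example. It takes already-rendered message bodies as plain
strings, collects word n-grams from the phish set, drops any phrase that
also occurs in ham, and groups the survivors by phish hit rate.

```python
import re
from collections import Counter


def phrases(text, n=4):
    """Return the set of n-word phrases in one message body (hypothetical helper)."""
    words = re.sub(r"\s+", " ", text.lower()).strip().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def candidate_phrases(phish, ham, n=4, min_hit_rate=0.1):
    """Map each phish-only phrase to the fraction of phish messages it hits."""
    if not phish:
        return {}
    # Any phrase seen in ham is discarded outright to avoid false positives.
    ham_phrases = set()
    for msg in ham:
        ham_phrases |= phrases(msg, n)
    # phrases() returns a set, so each message counts at most once per phrase.
    hits = Counter()
    for msg in phish:
        hits.update(phrases(msg, n))
    rates = {}
    for phrase, count in hits.items():
        rate = count / len(phish)
        if phrase not in ham_phrases and rate >= min_hit_rate:
            rates[phrase] = rate
    return rates


def group_by_threshold(rates, thresholds=(0.5, 0.2, 0.05)):
    """Bucket phrases under the highest hit-rate threshold they reach."""
    groups = {t: [] for t in thresholds}
    ordered = sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))
    for phrase, rate in ordered:
        for t in sorted(thresholds, reverse=True):
            if rate >= t:
                groups[t].append(phrase)
                break
    return groups
```

With an empty ham list nothing is ever discarded, so common boilerplate
phrases survive as candidates; that is the risk of skipping the ham corpus.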

-kgd