Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.
The following page has been changed by ChristianHoller:
http://wiki.apache.org/spamassassin/FuzzyOcrPlugin
The comment on the change is:
Version 1 of the FuzzyOcr Plugin, an improvement of the OcrPlugin
New page:
== How it works ==
NOTE: This plugin is based on the OcrPlugin written by Maarten de Boer and was extended and improved.
This plugin checks for specific keywords in image/gif or image/jpeg attachments, using {{{gocr}}} (an ''optical character recognition'' program).
This plugin can be used to detect spam that puts all the real spam content in an attached image. The mail itself contains only random text and random HTML, without any URLs or identifiable information.
In addition to what the original OcrPlugin does, it can do approximate matches on words, so errors in recognition or attempts to obfuscate the text inside the image will not cause the detection to fail. Another improvement is that the wordlist now lives in the configuration file, so it can be extended easily.
== Requirements ==
You will need {{{convert}}} (ImageMagick) and {{{gocr}}} installed.
Additionally, you will need the Perl module {{{String::Approx}}}.
== Installation ==
Save the two files below in your local configuration directory. Open {{{FuzzyOcr.cf}}} and extend the wordlist as you wish.
The scoring is dynamic: more word matches lead to a higher score. The scoring is done at the end of {{{check_fuzzy_ocr}}} in {{{FuzzyOcr.pm}}} (see ToDo). Once {{{$countreq}}} matches have been found, the rule gets a fixed score (default 4), plus 1 point for every additional match. This can be adjusted easily as you wish.
The variable {{{$countreq}}} can be adjusted via the configuration file parameter {{{focr_counts_required}}} (default 2) and indicates the number of matches that need to be found before any score is triggered.
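The scoring rule described above can be sketched as a small standalone function. This is only an illustration of the arithmetic; the name {{{focr_score}}} is hypothetical and does not appear in the plugin, which computes the score inline.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the scoring rule described above:
# no score below the required number of matches, then a base score
# of 4 plus one point for every additional match.
sub focr_score {
    my ( $cnt, $countreq ) = @_;
    return 0 if $cnt < $countreq;
    return 4 + ( $cnt - $countreq );
}

# Show the score for 1 to 5 matches with the default requirement of 2.
printf "%d matches -> score %d\n", $_, focr_score( $_, 2 ) for 1 .. 5;
```

With the default of 2 required matches, one match scores nothing, two matches score 4, and five matches score 7.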
The variable {{{$treshold}}} is similarly adjusted with the configuration file parameter {{{focr_treshold}}}. This is a float value between 0 and 1 and indicates the maximum relative edit distance between the wordlist word and the obfuscated version (a smaller value means the words need to be more similar, 0 means identical). The default of 0.3 normally does not need any change. Note that this module also matches substrings (see example).
== Example of work ==
Let's say you have defined {{{focr_word investor}}} in your configuration. Now you receive an image which, after conversion and recognition, yields:
{{{ATTENTION ALL IN\lESTORS AND DAY TRADERS}}}
Then the plugin will find the word {{{investor}}}. It would even succeed if the text was {{{ATTENTION ALL STUPUDIN\lESTORSHAHA}}} or {{{INVSTORSZ}}}.
Generally, the plugin follows these rules:
 * Case is ignored.
 * All characters other than letters and spaces (special characters, numbers) are stripped before any matching is done.
 * Your wordlist word will be found even if it is inside another word (submatching).
 * The distance is calculated from the number of character insertions, deletions and substitutions that are needed.
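The rules above can be sketched in plain Perl as follows. Note that this is not the plugin's code: the plugin relies on {{{adistr}}} from {{{String::Approx}}}, while this sketch swaps in a simple dynamic-programming approximate substring search so that it runs with core modules only. The function names {{{normalize}}}, {{{substring_distance}}} and {{{relative_distance}}} are made up for this illustration, and the values it produces are not guaranteed to be identical to what {{{adistr}}} returns.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Strip everything except letters and spaces, then lowercase.
sub normalize {
    my ($text) = @_;
    $text =~ s/[^a-zA-Z ]//g;
    return lc $text;
}

# Smallest number of insertions, deletions and substitutions needed
# to turn $word into some substring of $text.
sub substring_distance {
    my ( $word, $text ) = @_;
    my @w = split //, $word;
    my @t = split //, $text;

    # Row 0 is all zeros: a match may start anywhere in the text.
    my @prev = (0) x ( scalar(@t) + 1 );
    for my $i ( 1 .. scalar @w ) {
        my @cur = ($i);
        for my $j ( 1 .. scalar @t ) {
            my $cost = $w[ $i - 1 ] eq $t[ $j - 1 ] ? 0 : 1;
            push @cur,
              min( $prev[$j] + 1, $cur[ $j - 1 ] + 1, $prev[ $j - 1 ] + $cost );
        }
        @prev = @cur;
    }

    # The match may also end anywhere in the text.
    return min(@prev);
}

# Edit distance relative to the length of the wordlist word.
sub relative_distance {
    my ( $word, $text ) = @_;
    return substring_distance( $word, $text ) / length $word;
}

for my $raw ( 'ATTENTION ALL IN\lESTORS AND DAY TRADERS', 'INVSTORSZ' ) {
    my $line = normalize($raw);
    printf "%-40s %.3f\n", $line, relative_distance( 'investor', $line );
}
```

Both sample lines are one edit away from {{{investor}}} (8 letters), so the sketch reports a relative distance of 0.125, which is below the default threshold of 0.3.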
== Remarks ==
 * The words checked for are specific to some spam I have received a lot of recently.
 * {{{gocr}}} can take up quite a bit of resources, so be careful. However, it is only executed for messages that contain gif or jpeg attachments.
== ToDo ==
* Make the score adjustable in the configuration.
 * Avoid the use of temporary files for gocr; redirect its output directly back to the script.
-- Author: Christian Holler, decoder_at_own-hero_dot_net
== Code ==
=== FuzzyOcr.cf ===
{{{
loadplugin FuzzyOcr FuzzyOcr.pm
body FUZZY_OCR eval:check_fuzzy_ocr()
describe FUZZY_OCR Mail contains an image with common spam text inside
# Here we define the words to scan for
focr_word stock
focr_word investor
focr_word international
focr_word company
focr_word money
focr_word million
focr_word thousand
focr_word buy
focr_word price
focr_word trade
focr_word banking
focr_word service
focr_word kunde
focr_word volksbank
focr_word sparkasse
focr_word software
focr_word viagra
focr_word cialis
focr_word levitra
# These parameters can be used to change other detection settings
# Normally these don't need to be changed.
#
#focr_treshold 0.3
#focr_counts_required 2
}}}
=== FuzzyOcr.pm ===
{{{
# FuzzyOcr plugin, version 1
# written by Christian Holler decoder_at_own-hero_dot_net
package FuzzyOcr;

use strict;
use Mail::SpamAssassin;
use Mail::SpamAssassin::Util;
use Mail::SpamAssassin::Plugin;
use String::Approx 'adistr';

our @ISA = qw(Mail::SpamAssassin::Plugin);

our @words = ();
our $cnt   = 0;

# Default values
our $treshold = 0.3;
our $countreq = 2;

# constructor: register the eval rule
sub new {
    my ( $class, $mailsa ) = @_;
    $class = ref($class) || $class;
    my $self = $class->SUPER::new($mailsa);
    bless( $self, $class );
    $self->register_eval_rule("check_fuzzy_ocr");
    return $self;
}

# pick up the focr_* settings from the configuration file
sub parse_config {
    my ( $self, $opts ) = @_;
    if ( $opts->{key} eq "focr_word" ) {
        push( @words, $opts->{value} );
    } elsif ( $opts->{key} eq "focr_treshold" ) {
        $treshold = $opts->{value};
    } elsif ( $opts->{key} eq "focr_counts_required" ) {
        $countreq = $opts->{value};
    } else {
        return 0;    # not one of our settings
    }
    $self->inhibit_further_callbacks();
    return 1;        # setting was handled by this plugin
}

sub check_fuzzy_ocr {
    my ( $self, $pms ) = @_;
    $cnt = 0;

    # temporary file for the gocr output (see ToDo)
    my $tmpfile = "/tmp/spamassassin.focr.$$";

    foreach my $p ( $pms->{msg}->find_parts("image") ) {
        my ( $ctype, $boundary, $charset, $name ) =
          Mail::SpamAssassin::Util::parse_content_type(
            $p->get_header('content-type') );
        next unless $ctype eq "image/gif" || $ctype eq "image/jpeg";

        # convert the image to PNM and let gocr write the text to $tmpfile
        open( my $ocr, '|-',
            "/usr/bin/convert - pnm:- | /usr/bin/gocr -i - > $tmpfile" )
          or next;
        binmode $ocr;
        print {$ocr} $p->decode();
        close $ocr;

        # read the recognized text back line by line
        open( my $in, '<', $tmpfile ) or next;
        while ( my $line = <$in> ) {
            $line =~ s/[^a-zA-Z ]//g;    # keep letters and spaces only
            $line = lc $line;
            foreach my $w (@words) {
                # relative edit distance of the best approximate match
                my $matched = adistr( lc $w, $line );
                $cnt++ if abs($matched) < $treshold;
            }
        }
        close $in;
        unlink $tmpfile;
    }

    # fixed score of 4 once $countreq matches are found, +1 per extra match
    if ( $cnt >= $countreq ) {
        my $score = 4 + ( $cnt - $countreq );
        $pms->_handle_hit( "FUZZY_OCR", $score, "BODY: ",
            $pms->{conf}->{descriptions}->{FUZZY_OCR}
              . " ($cnt word occurrences found)" );
    }
    return 0;
}

1;
}}}