Mailing List Archive

[Spamassassin Wiki] Update of "FuzzyOcrPlugin" by ChristianHoller
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by ChristianHoller:
http://wiki.apache.org/spamassassin/FuzzyOcrPlugin

The comment on the change is:
Version 1 of the FuzzyOcr Plugin, an improvement of the OcrPlugin

New page:
== How it works ==

NOTE: This plugin is based on the OcrPlugin written by Maarten de Boer and was extended and improved.

This plugin checks for specific keywords in image/gif or image/jpeg attachments, using {{{gocr}}} (an ''optical character recognition'' program).

This plugin can be used to detect spam that puts all the real spam content in an attached image. The mail itself contains only random text and random HTML, without any URLs or identifiable information.

In addition to what the original OcrPlugin does, it can do approximate matches on words, so errors in recognition or attempts to obfuscate the text inside the image will not cause the detection to fail. Another improvement is that the wordlist has moved into the configuration file, so it can be easily extended.

== Requirements ==

You will need {{{convert}}} (imagemagick) and {{{gocr}}} installed.

Additionally, you will need the perl module {{{ String::Approx }}}.

== Installation ==

Save the two files below in your local configuration directory. Open {{{FuzzyOcr.cf}}} and extend the wordlist as you wish.

The scoring is dynamic: more word matches lead to a higher score. The scoring is done in {{{FuzzyOcr.pm}}} in line 74 (see ToDo) and is basically a fixed score (default 4) once {{{$countreq}}} matches (default 2) have been found, plus 1 point for every additional match. This can easily be adjusted as you wish.

The variable $countreq can be adjusted via the configuration file parameter {{{focr_counts_required}}} and indicates the number of matches that need to be found before any score will be triggered.
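
As a rough illustration of the scoring rule described above, here is a small Python sketch (it is not the plugin's Perl code). The function name and the parameters {{{base_score}}} and {{{add_score}}} are hypothetical names chosen for this example; the default values 4, 1 and 2 are the ones used in the posted {{{FuzzyOcr.pm}}}.

```python
def fuzzy_ocr_score(matches: int,
                    counts_required: int = 2,
                    base_score: float = 4.0,
                    add_score: float = 1.0) -> float:
    """Return the score for a number of word matches.

    Below `counts_required` matches nothing is scored (0.0 stands for
    "no hit" in this sketch). At the minimum the fixed base score applies,
    and every further match adds `add_score`.
    """
    if matches < counts_required:
        return 0.0
    return base_score + (matches - counts_required) * add_score
```

With the defaults, two matches would score 4 points and five matches would score 7 points, while a single match would not trigger the rule at all.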

The variable $treshold is similarly adjusted with the configuration file parameter {{{focr_treshold}}}. This is a float value between 0 and 1 and indicates the maximum relative edit distance between the wordlist word and the obfuscated version (less means the words need to be more similar, 0 means identical). The default of 0.3 normally does not need any change. Note that this module also matches substrings (see example).


== Example of work ==

Let's say you have defined {{{focr_word investor}}} in your configuration. Now you receive an image which, after conversion and recognition, yields:

{{{ATTENTION ALL IN\lESTORS AND DAY TRADERS}}}

Then the plugin will find the word investor. It would even succeed if the text was {{{ATTENTION ALL STUPUDIN\lESTORSHAHA}}} or {{{INVSTORSZ}}} etc.

Generally, the plugin follows these rules:

* The case is not relevant
* All special characters or numbers are stripped before any matching is done
* Your wordlist word will be found even if it is inside another word (submatching)
* The distance is calculated from the number of character insertions, deletions and substitutions that are needed to turn one string into the other.
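
The rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the plugin itself relies on {{{adistr}}} from the Perl module {{{String::Approx}}}, whereas this sketch uses a standard dynamic-programming approximate substring search, so the exact numbers may differ from what the plugin computes. The function names {{{normalize}}}, {{{relative_distance}}} and {{{word_matches}}} are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Drop everything except ASCII letters and spaces, then lower-case."""
    return re.sub(r"[^a-zA-Z ]", "", text).lower()


def relative_distance(word: str, text: str) -> float:
    """Smallest edit distance between `word` and any substring of `text`,
    divided by the length of `word` (0.0 means an exact substring match)."""
    if not word:
        raise ValueError("word must not be empty")
    # prev[j]: cheapest way to match the first i characters of `word`
    # against some substring of `text` that ends at position j.
    prev = [0] * (len(text) + 1)  # an empty pattern matches anywhere for free
    for i, wc in enumerate(word, start=1):
        curr = [i] + [0] * len(text)
        for j, tc in enumerate(text, start=1):
            cost = 0 if wc == tc else 1
            curr[j] = min(prev[j - 1] + cost,  # match or substitution
                          prev[j] + 1,         # character of `word` missing in text
                          curr[j - 1] + 1)     # extra character in text
        prev = curr
    return min(prev) / len(word)


def word_matches(word: str, ocr_text: str, threshold: float = 0.3) -> bool:
    """True if `word` occurs in `ocr_text` within the relative threshold."""
    return relative_distance(normalize(word), normalize(ocr_text)) < threshold
```

For the example text, {{{IN\lESTORS}}} normalizes to {{{inlestors}}}, which differs from {{{investor}}} by one substitution, so the relative distance in this sketch is 1/8 = 0.125 and stays below the default threshold of 0.3.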


== Remarks ==

* The words checked for are specific to some spam I have received a lot of recently.
* {{{gocr}}} can take up quite a bit of resources, so be careful. However, it is only executed for messages that contain gif or jpeg attachments.

== ToDo ==

* Make the score adjustable in the configuration.
* Avoid usage of tmp files for gocr, redirect output directly back to the script
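
One possible way to approach the second ToDo item is to feed the image data through the helper programs and read the result back directly, without a temporary file. The Python sketch below only illustrates that idea and is not part of the plugin; {{{run_pipeline}}} is a hypothetical helper. It runs the stages one after another rather than as a true shell pipe, which keeps the example simple and avoids blocking on full pipe buffers.

```python
import subprocess
from typing import List, Sequence


def run_pipeline(data: bytes, commands: Sequence[List[str]]) -> bytes:
    """Feed `data` through each command in turn and return the final output.

    Each command reads from stdin and writes to stdout; the output of one
    stage becomes the input of the next, so no temporary file is needed.
    A failing stage raises subprocess.CalledProcessError.
    """
    for cmd in commands:
        result = subprocess.run(cmd, input=data,
                                stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data
```

Assuming the same helper invocations as in the posted plugin, a call might look like {{{run_pipeline(image_bytes, [["/usr/bin/convert", "-", "pnm:-"], ["/usr/bin/gocr", "-i", "-"]])}}}; this has not been tested against real images here.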

-- Author: Christian Holler, decoder_at_own-hero_dot_net

== Code ==

=== FuzzyOcr.cf ===

{{{
loadplugin FuzzyOcr FuzzyOcr.pm
body FUZZY_OCR eval:check_fuzzy_ocr()
describe FUZZY_OCR Mail contains an image with common spam text inside

# Here we define the words to scan for

focr_word stock
focr_word investor
focr_word international
focr_word company
focr_word money
focr_word million
focr_word thousand
focr_word buy
focr_word price
focr_word trade
focr_word banking
focr_word service
focr_word kunde
focr_word volksbank
focr_word sparkasse
focr_word software
focr_word viagra
focr_word cialis
focr_word levitra

# These parameters can be used to change other detection settings
# Normally these don't need to be changed.
#
#focr_treshold 0.3
#focr_counts_required 2

}}}

=== FuzzyOcr.pm ===

{{{
# FuzzyOcr plugin, version 1
# written by Christian Holler decoder_at_own-hero_dot_net

package FuzzyOcr;

use strict;
use Mail::SpamAssassin;
use Mail::SpamAssassin::Util;
use Mail::SpamAssassin::Plugin;

use String::Approx 'adistr';

our @ISA = qw (Mail::SpamAssassin::Plugin);

our @words = ( );
our $cnt = 0;

# Default values
our $treshold = "0.3";
our $countreq = 2;

# constructor: register the eval rule
sub new {
    my ( $class, $mailsa ) = @_;
    $class = ref($class) || $class;
    my $self = $class->SUPER::new($mailsa);
    bless( $self, $class );
    $self->register_eval_rule("check_fuzzy_ocr");
    return $self;
}

sub parse_config {
    my ($self, $opts) = @_;
    if ($opts->{key} eq "focr_word") {
        push(@words, $opts->{value});    # one wordlist entry per focr_word line
    } elsif ($opts->{key} eq "focr_treshold") {
        $treshold = $opts->{value};      # maximum relative edit distance
    } elsif ($opts->{key} eq "focr_counts_required") {
        $countreq = $opts->{value};      # matches needed before the rule scores
    }
}

sub check_fuzzy_ocr {
    my ( $self, $pms ) = @_;
    $cnt = 0;    # reset the match counter for every message
    foreach my $p ( $pms->{msg}->find_parts("image") ) {
        my ( $ctype, $boundary, $charset, $name ) =
          Mail::SpamAssassin::Util::parse_content_type(
            $p->get_header('content-type') );
        if (($ctype eq "image/gif") || ($ctype eq "image/jpeg")) {
            open OCR, "|/usr/bin/convert - pnm:-|/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";    # image -> PNM -> gocr -> tmp file
            foreach $p ( $p->decode() ) {
                print OCR $p;
            }
            close OCR;
            open OCR, "/tmp/spamassassin.focr.$$";    # read back the recognized text
            while (<OCR>) {
                s/[^a-zA-Z ]//g;    # strip everything except letters and spaces
                $_ = lc;
                my $w;
                foreach $w (@words) {
                    $w = lc $w;
                    my $matched = adistr($w, $_);    # relative distance of the best approximate match
                    if (abs($matched) < $treshold) {
                        $cnt++;
                    }
                }
            }
            close OCR;
            unlink "/tmp/spamassassin.focr.$$";
        }
    }
    if ($cnt >= $countreq) {
        my $score = 4 + ($cnt - $countreq);    # fixed 4 points plus 1 per additional match
        $pms->_handle_hit("FUZZY_OCR", $score, "BODY: ", $pms->{conf}->{descriptions}->{FUZZY_OCR}." ($cnt word occurrences found)");
    }
    return 0;    # the hit is registered above, so the eval rule itself never fires
}

1;

}}}
[Spamassassin Wiki] Update of "FuzzyOcrPlugin" by ChristianHoller

------------------------------------------------------------------------------

== Requirements ==

- You will need {{{convert}}} (imagemagick) and {{{gocr}}} installed.
+ You will need {{{giftopnm and jpegtopnm}}} (netpbm) and {{{gocr}}} installed.

Additionally, you will need the perl module {{{ String::Approx }}}.

@@ -147, +147 @@

Mail::SpamAssassin::Util::parse_content_type(
$p->get_header('content-type') );
if (($ctype eq "image/gif") || ($ctype eq "image/jpeg")) {
+ if ($ctype eq "image/gif") {
- open OCR, "|/usr/bin/convert - pnm:-|/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";
+ open OCR, "|/usr/bin/giftopnm - |/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";
+ } else {
+ open OCR, "|/usr/bin/jpegtopnm - |/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";
+ }
foreach $p ( $p->decode() ) {
print OCR $p;
}
[Spamassassin Wiki] Update of "FuzzyOcrPlugin" by ChristianHoller

The comment on the change is:
version 2.0

------------------------------------------------------------------------------

NOTE: This plugin is based on the OcrPlugin written by Maarten de Boer and was extended and improved.

- This plugin checks for specific keywords in image/gif or image/jpeg attachments, using {{{gocr}}} (an ''optical character recognition'' program).
+ This plugin checks for specific keywords in image/gif, image/jpeg or image/png attachments, using {{{gocr}}} (an ''optical character recognition'' program).

This plugin can be used to detect spam that puts all the real spam content in an attached image. The mail itself contains only random text and random HTML, without any URLs or identifiable information.

@@ -12, +12 @@


== Requirements ==

- You will need {{{giftopnm and jpegtopnm}}} (netpbm) and {{{gocr}}} installed.
+ You will need {{{giftopnm, jpegtopnm and pngtopnm}}} (from netpbm) and {{{gocr}}} installed.

- Additionally, you will need the perl module {{{ String::Approx }}}.
+ Additionally, you will need the perl module {{{ String::Approx }}} and {{{giffix}}} (from giflib).
+
+ == Changelog ==
+
+ Version 2.0: * Replaced imagemagick with netpbm tools
+ * Plugin invokes giffix now on gifs to handle intentionally corrupted gifs
+ * Added png support
+ * Added magic byte detection to detect the correct file format independently of the content-type
+ * Added 3 verbosity levels
+ * Added configuration option for tmp file path and scores

== Installation ==

Save the two files below in your local configuration directory. Open {{{FuzzyOcr.cf}}} and extend the wordlist as you wish.

- The scoring is dynamic, more word matches lead to a higher score. The scoring is done in {{{FuzzyOcr.pm}}} in line 74 (see ToDo) and is basically a fixed score for the first {{{$countreq}}} matches (default 4) + 1 Point for every additional match. This can be adjusted easily as you wish.
+ The scoring is dynamic, more word matches lead to a higher score. The scoring is done as soon as {{focr_counts_required}} matches were found. It scores exactly {{focr_base_score}} points then. For every additional match, it scores additionally {{focr_add_score}} points.
+
+ Attention: Do not add a score line to the config file. It will not be used! Scoring is done INTERNALLY and can only be configured with the two parameters described above.

The variable $countreq can be adjusted via the configuration file parameter {{{focr_counts_required}}} and indicates the number of matches that need to be found before any score will be triggered.

The variable $treshold is similarly adjusted with the configuration file parameter {{{focr_treshold}}}. This is a float value between 0 and 1 and indicates the maximum relative edit distance between the wordlist word and the obfuscated version (less means the words need to be more similar, 0 means identical). The default of 0.3 normally does not need any change. Note that this module also matches substrings (see example).
+
+
+ Explanation of the additional options:
+
+ {{{focr_tmp_path}}} - String determining the absolute path to a directory where the plugin may write temporary files to (without trailing slash)
+ {{{focr_verbose}}} - Verbose level (0 - 2).
+
+ * 0 means normal operation.
+ * 1 means output all words and the corresponding measured distance in the rule output:
+
+ 6.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
+ Words found:
+ "viagra" with fuzz of 0.2
+ "cialis" with fuzz of 0
+ "viagra" with fuzz of 0.2
+ "levitra" with fuzz of 0
+ (4 word occurrences found)
+ * 2 means same as 1 with an additional output of the text recognized by gocr in a file debug.<number>.focr in the local directory
+ This file also contains the recognized format type in the first line (1 means gif, 2 jpeg, 3 png).
+
+


== Example of work ==
@@ -46, +78 @@

== Remarks ==

* The words checked for are specific for some spam I received a lot of recently.
- * {{{gocr}}} can take up quite a bit of resources, so be careful. But it is only executed for messages that contain gif or jpeg attachments.
+ * {{{gocr}}} can take up quite a bit of resources, so be careful. But it is only executed for messages that contain gif, png or jpeg attachments.

== ToDo ==

- * Make the score adjustable in the configuration.
* Avoid usage of tmp files for gocr, redirect output directly back to the script

-- Author: Christian Holler, decoder_at_own-hero_dot_net
@@ -85, +116 @@

focr_word viagra
focr_word cialis
focr_word levitra
+ focr_word medicine
+ focr_word legal
+ focr_word medication
+ focr_word click here
+ focr_word penis
+ focr_word growth
+ focr_word drugs
+ focr_word pharmacy

# These parameters can be used to change other detection settings
- # Normally these don't need to be changed.
#
+ # Detection treshold (see manual)
#focr_treshold 0.3
+ #
+ # This is the score for a hit after focr_counts_required matches
+ #focr_base_score 4
+ #
+ # This is the additional score for every additional match after focr_counts_required matches
+ #focr_add_score 1
+ #
+ # Number of minimum matches before the rule scores
#focr_counts_required 2
-
+ #
+ # Verbosity level (see manual)
+ #focr_verbose 2
+ #
+ # Path for temporary files
+ #focr_tmp_path "/tmp"
}}}

=== FuzzyOcr.pm ===

{{{
- # FuzzyOcr plugin, version 1
+ # FuzzyOcr plugin, version 2.0
+ # Changelog:
+ # version 2.0
+ # Replaced imagemagick with netpbm
+ # Invoke giffix to fix broken gifs before conversion
+ # Support png images
+ # Analyze the file to detect the format without content-type
+ # Added several configuration parameters
+ #
+ #
# written by Christian Holler decoder_at_own-hero_dot_net

package FuzzyOcr;
@@ -112, +173 @@

our @ISA = qw (Mail::SpamAssassin::Plugin);

our @words = ( );
- our $cnt = 0;

# Default values
our $treshold = "0.3";
+ our $base_score = "4";
+ our $add_score = "1";
our $countreq = 2;
+ our $verbose = 1;
+ our $tmppath = "/tmp";
+

# constructor: register the eval rule
sub new {
@@ -134, +199 @@

push(@words, $opts->{value});
} elsif ($opts->{key} eq "focr_treshold") {
$treshold = $opts->{value};
+ } elsif ($opts->{key} eq "focr_base_score") {
+ $base_score = $opts->{value};
+ } elsif ($opts->{key} eq "focr_add_score") {
+ $add_score = $opts->{value};
} elsif ($opts->{key} eq "focr_counts_required") {
$countreq = $opts->{value};
+ } elsif ($opts->{key} eq "focr_verbose") {
+ $verbose = $opts->{value};
+ } elsif ($opts->{key} eq "focr_tmp_path") {
+ $tmppath = $opts->{value};
}
}

sub check_fuzzy_ocr {
my ( $self, $pms ) = @_;
+ my @found = ( );
+ my $image_type = 0;
- $cnt = 0;
+ my $cnt = 0;
foreach my $p ( $pms->{msg}->find_parts("image") ) {
my ( $ctype, $boundary, $charset, $name ) =
Mail::SpamAssassin::Util::parse_content_type(
$p->get_header('content-type') );
- if (($ctype eq "image/gif") || ($ctype eq "image/jpeg")) {
- if ($ctype eq "image/gif") {
+ if ($ctype =~ /image/) {
+ my $firstline = ($p->decode())[0];
+ my $tempfile = $tmppath . "/" . "spamassassin.$$.focr";
+ if ($firstline =~ /^\x47\x49\x46/) {
+ $image_type = 1;
+ open IMAGE_PROCESSOR, "|/usr/bin/giffix | /usr/bin/giftopnm - |/usr/bin/gocr -i - > $tempfile";
+ } elsif ($firstline =~ /^\xff\xd8/) {
+ $image_type = 2;
+ open IMAGE_PROCESSOR, "|/usr/bin/jpegtopnm - |/usr/bin/gocr -i - > $tempfile";
+ } elsif ($firstline =~ /^\x89\x50\x4e\x47/) {
+ $image_type = 3;
- open OCR, "|/usr/bin/giftopnm - |/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";
+ open IMAGE_PROCESSOR, "|/usr/bin/pngtopnm - |/usr/bin/gocr -i - > $tempfile";
} else {
- open OCR, "|/usr/bin/jpegtopnm - |/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";
+ $image_type = 0;
+ print "No compatible file type detected... skipping image...\n";
+ next;
}
foreach $p ( $p->decode() ) {
- print OCR $p;
+ print IMAGE_PROCESSOR $p;
}
+ if ($verbose > 1) {
+ open DEBUG, ">debug.$$.focr";
+ print DEBUG "File type: $image_type\n\n"
+ }
- close OCR;
+ close IMAGE_PROCESSOR;
- open OCR, "/tmp/spamassassin.focr.$$";
+ open OCR_DATA, "<$tempfile";
- while (<OCR>) {
+ while (<OCR_DATA>) {
s/[^a-zA-Z ]//g;
$_ = lc;
+ if ($verbose > 1) {
+ print DEBUG $_;
+ }
my $w;
foreach $w (@words) {
$w = lc $w;
my $matched = adistr($w, $_);
if (abs($matched) < $treshold) {
$cnt++;
+ if ($verbose > 0) {
+ push(@found, "\"$w\"" . " with fuzz of " . abs($matched));
+ }
}
}
}
- close OCR;
+ close OCR_DATA;
- unlink "/tmp/spamassassin.focr.$$";
+ unlink $tempfile;
+ if ($verbose > 1) {
+ close DEBUG;
+ }
}
}
if ($cnt >= $countreq) {
- my $score = 4 + ($cnt - $countreq);
+ my $score = $base_score + ($cnt - $countreq) * $add_score;
+ my $debuginfo = "";
+ if ($verbose > 0) {
+ $debuginfo = ("\nWords found:\n" . join("\n", @found) ."\n($cnt word occurrences found)");
+ }
- $pms->_handle_hit("FUZZY_OCR", $score, "BODY: ", $pms->{conf}->{descriptions}->{FUZZY_OCR}." ($cnt word occurrences found)");
+ $pms->_handle_hit("FUZZY_OCR", $score, "BODY: ", $pms->{conf}->{descriptions}->{FUZZY_OCR} . $debuginfo);
}
return 0;
}

1;
-
}}}
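
The magic byte detection introduced in version 2.0 can be illustrated with a short Python sketch. It is not the plugin's code; the function name {{{sniff_image_type}}} and the returned MIME strings are assumptions made for this example. The byte signatures are the same ones tested by the regular expressions in the diff above ({{{GIF}}}, {{{\xff\xd8}}} and {{{\x89PNG}}}).

```python
from typing import Optional

# Leading bytes of each supported format, as checked in FuzzyOcr.pm 2.0.
MAGIC_BYTES = (
    (b"GIF", "image/gif"),          # \x47\x49\x46
    (b"\xff\xd8", "image/jpeg"),
    (b"\x89PNG", "image/png"),      # \x89\x50\x4e\x47
)


def sniff_image_type(data: bytes) -> Optional[str]:
    """Guess the image format from its first bytes, ignoring any declared
    content-type. Returns None when no known signature matches."""
    for magic, mime in MAGIC_BYTES:
        if data.startswith(magic):
            return mime
    return None
```

Checking the data itself rather than the declared content-type means that an image sent with a misleading content-type header would still be routed to the matching converter.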
[Spamassassin Wiki] Update of "FuzzyOcrPlugin" by ChristianHoller

------------------------------------------------------------------------------
You will need {{{giftopnm, jpegtopnm and pngtopnm}}} (from netpbm) and {{{gocr}}} installed.

Additionally, you will need the perl module {{{ String::Approx }}} and {{{giffix}}} (from giflib).
+
+ Notes for Redhat/FC users: The packages libungif and libungif-progs should be installed to provide giffix.
+ Notes for Debian users: The package libungif-bin provides giffix.

== Changelog ==

@@ -29, +32 @@


Save the two files below in your local configuration directory. Open {{{FuzzyOcr.cf}}} and extend the wordlist as you wish.

- The scoring is dynamic, more word matches lead to a higher score. The scoring is done as soon as {{focr_counts_required}} matches were found. It scores exactly {{focr_base_score}} points then. For every additional match, it scores additionally {{focr_add_score}} points.
+ The scoring is dynamic, more word matches lead to a higher score. The scoring is done as soon as {{{focr_counts_required}}} matches were found. It scores exactly {{focr_base_score}} points then. For every additional match, it scores additionally {{{focr_add_score}}} points.

Attention: Do not add a score line to the config file. It will not be used! Scoring is done INTERNALLY and can only be configured with the two parameters described above.
[Spamassassin Wiki] Update of "FuzzyOcrPlugin" by ChristianHoller

------------------------------------------------------------------------------

== Changelog ==

+ Version 2.0:
- Version 2.0: * Replaced imagemagick with netpbm tools
+ * Replaced imagemagick with netpbm tools
* Plugin invokes giffix now on gifs to handle intentionally corrupted gifs
* Added png support
* Added magic byte detection to detect the correct file format independently of the content-type
* Added 3 verbosity levels
* Added configuration option for tmp file path and scores

+ Version 2.1:
+ * Added scoring for wrong content-type
+ * Added scoring for broken gif images
+ * Added configuration for helper applications
+ * Added autodisable_score feature to disable the OCR engine if the message already has enough points
+
== Installation ==

- Save the two files below in your local configuration directory. Open {{{FuzzyOcr.cf}}} and extend the wordlist as you wish.
+ Download the tarball (see How to Obtain) to your spamassassin configuration directory and unpack it. Open {{{FuzzyOcr.cf}}} and extend the wordlist as you wish.

- The scoring is dynamic, more word matches lead to a higher score. The scoring is done as soon as {{{focr_counts_required}}} matches were found. It scores exactly {{focr_base_score}} points then. For every additional match, it scores additionally {{{focr_add_score}}} points.
+ The scoring is dynamic, more word matches lead to a higher score. The scoring is done as soon as {{{focr_counts_required}}} matches were found. It scores exactly {{{focr_base_score}}} points then. For every additional match, it scores additionally {{{focr_add_score}}} points.

Attention: Do not add a score line to the config file. It will not be used! Scoring is done INTERNALLY and can only be configured with the two parameters described above.

@@ -44, +51 @@

Explanation of the additional options:

{{{focr_tmp_path}}} - String determining the absolute path to a directory where the plugin may write temporary files to (without trailing slash)
+
{{{focr_verbose}}} - Verbose level (0 - 2). (1 is currently the default)

* 0 means normal operation.
* 1 means output all words and the corresponding measured distance in the rule output:
-
+ {{{
- 6.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
+ 6.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
Words found:
"viagra" with fuzz of 0.2
"cialis" with fuzz of 0
"viagra" with fuzz of 0.2
"levitra" with fuzz of 0
(4 word occurrences found)
+ }}}
* 2 means same as 1 with an additional output of the text recognized by gocr in a file debug.<number>.focr in the local directory
This file also contains the recognized format type in the first line (1 means gif, 2 jpeg, 3 png).

+ {{{focr_bin_*}}} - Tells the plugin about the helper applications, change to the full path + binary name if your applications are not found.

+ {{{focr_wrongctype_score}}} - Score to give for a wrong content-type (e.g. Image is GIF but content-type says image/jpeg)
+
+ {{{focr_corrupt_score}}} - Score to give for a corrupted image (Currently only used with GIF images)
+
+ {{{focr_autodisable_score}}} - If the message already has more points than this value, the plugin will cancel all further OCR checking.


== Example of work ==
@@ -89, +104 @@


-- Author: Christian Holler, decoder_at_own-hero_dot_net

- == Code ==
+ == How to obtain ==

- === FuzzyOcr.cf ===
+ You can download the latest tarball containing the {{{FuzzyOcr.pm}}} and {{{FuzzyOcr.cf}}} from http://users.own-hero.net/~decoder/fuzzyocr/

- {{{
- loadplugin FuzzyOcr FuzzyOcr.pm
- body FUZZY_OCR eval:check_fuzzy_ocr()
- describe FUZZY_OCR Mail contains an image with common spam text inside
-
- # Here we defined the words to scan for
-
- focr_word stock
- focr_word investor
- focr_word international
- focr_word company
- focr_word money
- focr_word million
- focr_word thousand
- focr_word buy
- focr_word price
- focr_word trade
- focr_word banking
- focr_word service
- focr_word kunde
- focr_word volksbank
- focr_word sparkasse
- focr_word software
- focr_word viagra
- focr_word cialis
- focr_word levitra
- focr_word medicine
- focr_word legal
- focr_word medication
- focr_word click here
- focr_word penis
- focr_word growth
- focr_word drugs
- focr_word pharmacy
-
- # These parameters can be used to change other detection settings
- #
- # Detection treshold (see manual)
- #focr_treshold 0.3
- #
- # This is the score for a hit after focr_counts_required matches
- #focr_base_score 4
- #
- # This is the additional score for every additional match after focr_counts_required matches
- #focr_add_score 1
- #
- # Number of minimum matches before the rule scores
- #focr_counts_required 2
- #
- # Verbosity level (see manual)
- #focr_verbose 2
- #
- # Path for temporary files
- #focr_tmp_path "/tmp"
- }}}
-
- === FuzzyOcr.pm ===
-
- {{{
- # FuzzyOcr plugin, version 2.0
- # Changelog:
- # version 2.0
- # Replaced imagemagick with netpbm
- # Invoke giffix to fix broken gifs before conversion
- # Support png images
- # Analyze the file to detect the format without content-type
- # Added several configuration parameters
- #
- #
- # written by Christian Holler decoder_at_own-hero_dot_net
-
- package FuzzyOcr;
-
- use strict;
- use Mail::SpamAssassin;
- use Mail::SpamAssassin::Util;
- use Mail::SpamAssassin::Plugin;
-
- use String::Approx 'adistr';
-
- our @ISA = qw (Mail::SpamAssassin::Plugin);
-
- our @words = ( );
-
- # Default values
- our $treshold = "0.3";
- our $base_score = "4";
- our $add_score = "1";
- our $countreq = 2;
- our $verbose = 1;
- our $tmppath = "/tmp";
-
-
- # constructor: register the eval rule
- sub new {
- my ( $class, $mailsa ) = @_;
- $class = ref($class) || $class;
- my $self = $class->SUPER::new($mailsa);
- bless( $self, $class );
- $self->register_eval_rule("check_fuzzy_ocr");
- return $self;
- }
-
- sub parse_config {
- my ($self, $opts) = @_;
- if ($opts->{key} eq "focr_word") {
- push(@words, $opts->{value});
- } elsif ($opts->{key} eq "focr_treshold") {
- $treshold = $opts->{value};
- } elsif ($opts->{key} eq "focr_base_score") {
- $base_score = $opts->{value};
- } elsif ($opts->{key} eq "focr_add_score") {
- $add_score = $opts->{value};
- } elsif ($opts->{key} eq "focr_counts_required") {
- $countreq = $opts->{value};
- } elsif ($opts->{key} eq "focr_verbose") {
- $verbose = $opts->{value};
- } elsif ($opts->{key} eq "focr_tmp_path") {
- $tmppath = $opts->{value};
- }
- }
-
- sub check_fuzzy_ocr {
- my ( $self, $pms ) = @_;
- my @found = ( );
- my $image_type = 0;
- my $cnt = 0;
- foreach my $p ( $pms->{msg}->find_parts("image") ) {
- my ( $ctype, $boundary, $charset, $name ) =
- Mail::SpamAssassin::Util::parse_content_type(
- $p->get_header('content-type') );
- if ($ctype =~ /image/) {
- my $firstline = ($p->decode())[0];
- my $tempfile = $tmppath . "/" . "spamassassin.$$.focr";
- if ($firstline =~ /^\x47\x49\x46/) {
- $image_type = 1;
- open IMAGE_PROCESSOR, "|/usr/bin/giffix | /usr/bin/giftopnm - |/usr/bin/gocr -i - > $tempfile";
- } elsif ($firstline =~ /^\xff\xd8/) {
- $image_type = 2;
- open IMAGE_PROCESSOR, "|/usr/bin/jpegtopnm - |/usr/bin/gocr -i - > $tempfile";
- } elsif ($firstline =~ /^\x89\x50\x4e\x47/) {
- $image_type = 3;
- open IMAGE_PROCESSOR, "|/usr/bin/pngtopnm - |/usr/bin/gocr -i - > $tempfile";
- } else {
- $image_type = 0;
- print "No compatible file type detected... skipping image...\n";
- next;
- }
- foreach $p ( $p->decode() ) {
- print IMAGE_PROCESSOR $p;
- }
- if ($verbose > 1) {
- open DEBUG, ">debug.$$.focr";
- print DEBUG "File type: $image_type\n\n"
- }
- close IMAGE_PROCESSOR;
- open OCR_DATA, "<$tempfile";
- while (<OCR_DATA>) {
- s/[^a-zA-Z ]//g;
- $_ = lc;
- if ($verbose > 1) {
- print DEBUG $_;
- }
- my $w;
- foreach $w (@words) {
- $w = lc $w;
- my $matched = adistr($w, $_);
- if (abs($matched) < $treshold) {
- $cnt++;
- if ($verbose > 0) {
- push(@found, "\"$w\"" . " with fuzz of " . abs($matched));
- }
- }
- }
- }
- close OCR_DATA;
- unlink $tempfile;
- if ($verbose > 1) {
- close DEBUG;
- }
- }
- }
- if ($cnt >= $countreq) {
- my $score = $base_score + ($cnt - $countreq) * $add_score;
- my $debuginfo = "";
- if ($verbose > 0) {
- $debuginfo = ("\nWords found:\n" . join("\n", @found) ."\n($cnt word occurrences found)");
- }
- $pms->_handle_hit("FUZZY_OCR", $score, "BODY: ", $pms->{conf}->{descriptions}->{FUZZY_OCR} . $debuginfo);
- }
- return 0;
- }
-
- 1;
- }}}
-
[Spamassassin Wiki] Update of "FuzzyOcrPlugin" by ChristianHoller

------------------------------------------------------------------------------

Additionally, you will need the perl module {{{ String::Approx }}} and {{{giffix}}} (from giflib).

+ Notes for Fedora Core 5 (or higher) users: The package libungif-utils provides giffix.
- Notes for Redhat/FC users: The packages libungif and libungif-progs should be installed to provide giffix.
+ Notes for other Redhat/FC users: The packages libungif and libungif-progs should be installed to provide giffix.
Notes for Debian users: The package libungif-bin provides giffix.

== Changelog ==
[Spamassassin Wiki] Update of "FuzzyOcrPlugin" by ChristianHoller

------------------------------------------------------------------------------
* Added scoring for broken gif images
* Added configuration for helper applications
* Added autodisable_score feature to disable the OCR engine if the message already has enough points
+ Version 2.2
+ * Several bugfixes
+ * New debug system
+ * Logfile support
+ * Proper error handling for most errors

== Installation ==

- Download the tarball (see How to Obtain) to your spamassassin configuration directory and unpack it. Open {{{FuzzyOcr.cf}}} and extend the wordlist as you wish.
+ Download the tarball (see How to Obtain) to your spamassassin configuration directory and unpack it to /etc/mail/spamassassin/ (You may choose another location but all necessary adjustments to the configuration file are up to you then). Open {{{FuzzyOcr.cf}}} and extend the wordlist as you wish. If you have the helper binaries in a different location than the default in the config file specifies, then change these to the correct path.

The scoring is dynamic, more word matches lead to a higher score. The scoring is done as soon as {{{focr_counts_required}}} matches were found. It scores exactly {{{focr_base_score}}} points then. For every additional match, it scores additionally {{{focr_add_score}}} points.

@@ -55, +60 @@


{{{focr_tmp_path}}} - String determining the absolute path to a directory where the plugin may write temporary files to (without trailing slash)

+ {{{focr_logfile}}} - String determining the file to send log messages to. Make sure this is writable!
+
{{{focr_verbose}}} - Verbose level (0 - 2). (1 is currently the default)

* 0 means normal operation.
@@ -68, +75 @@

"levitra" with fuzz of 0
(4 word occurrences found)
}}}
+ * 2 means same as 1 with an additional output to the logfile (more messages) and temporary files don't get deleted (so you can inspect them)
- * 2 means same as 1 with an additional output of the text recognized by gocr in a file debug.<number>.focr in the local directory
- This file also contains the recognized format type in the first line (1 means gif, 2 jpeg, 3 png).

{{{focr_bin_*}}} - Tells the plugin about the helper applications, change to the full path + binary name if your applications are not found.
[Spamassassin Wiki] Update of "FuzzyOcrPlugin" by ChristianHoller

------------------------------------------------------------------------------
* Proper error handling for most errors

== Installation ==
+
+ Attention: If you need help installing this plugin or have other questions, please use the mailinglist created for this plugin.
+
+ It can be found at http://lists.own-hero.net/mailman/listinfo/devel-spam

Download the tarball (see How to Obtain) to your spamassassin configuration directory and unpack it to /etc/mail/spamassassin/ (You may choose another location but all necessary adjustments to the configuration file are up to you then). Open {{{FuzzyOcr.cf}}} and extend the wordlist as you wish. If you have the helper binaries in a different location than the default in the config file specifies, then change these to the correct path.

@@ -117, +121 @@


You can download the latest tarball containing the {{{FuzzyOcr.pm}}} and {{{FuzzyOcr.cf}}} from http://users.own-hero.net/~decoder/fuzzyocr/

+ For support use the mailing list found at http://lists.own-hero.net/mailman/listinfo/devel-spam
+
[Spamassassin Wiki] Update of "FuzzyOcrPlugin" by ChristianHoller

------------------------------------------------------------------------------

You can download the latest tarball containing the {{{FuzzyOcr.pm}}} and {{{FuzzyOcr.cf}}} from http://users.own-hero.net/~decoder/fuzzyocr/

- For support use the mailing list found at http://lists.own-hero.net/mailman/listinfo/devel-spam
+ For support, you can email me or catch me on IRC (server: irc.own-hero.net, channel: #nmg, my nick is decoder :P)

+ You can also subscribe to the mailing list found at http://lists.own-hero.net/mailman/listinfo/devel-spam but please understand that it is private, since we also talk about development there. If you want to test the newest alpha releases, this is the place where you'd want to be ;)
+
[Spamassassin Wiki] Update of "FuzzyOcrPlugin" by ChristianHoller

The following page has been changed by ChristianHoller:
http://wiki.apache.org/spamassassin/FuzzyOcrPlugin

------------------------------------------------------------------------------

You will need {{{giftopnm, jpegtopnm and pngtopnm}}} (from netpbm) and {{{gocr}}} installed.

- Additionally, you will need the perl module {{{ String::Approx }}} and {{{giffix}}} (from giflib).
+ Additionally, you will need the perl module {{{ String::Approx }}} and several tools from {{{ giflib }}} (also known as libungif).
+
+ ATTENTION: A segfault has been discovered in both {{{ giftext }}} and {{{ gocr }}}.
+
+ Source patches for both are available in the FuzzyOcr download directory. Running without them can cause problems under certain circumstances.
+

Notes for Fedora Core 5 (or higher) users: The package libungif-utils provides giffix.

Notes for other Redhat/FC users: The packages libungif and libungif-progs should be installed to provide giffix.

Notes for Debian users: The package libungif-bin provides giffix.
+
+ Attention when using RedHat! The gocr RPM for RedHat is faulty and causes bad recognition results. Here is a quote from Ken Bass:
+
+ {{{
+
+ I installed gocr 0.40 using an RPM / (source RPM). For whatever reason,
+ the RPM configures the gocr using 'configure --with-netpbm=no'. This
+ netpbm=no option causes some images to not be decoded properly. I get
+ much more garbage. I had to rebuild/reinstall gocr 0.40 without
+ disabling netpbm for best results.
+
+ I've reported this to the gocr mailing list, sent him a sample image,
+ and hopefully it will be fixed. He seemed to think it would be fixed in
+ gocr, but I have no timeframe. In the meantime, those with problems,
+ try modifying the RPM gocr.spec file or build/install by hand.
+
+ }}}
+

== Changelog ==
[Spamassassin Wiki] Update of "FuzzyOcrPlugin" by ChristianHoller

The following page has been changed by ChristianHoller:
http://wiki.apache.org/spamassassin/FuzzyOcrPlugin

------------------------------------------------------------------------------

== Requirements ==

- You will need {{{giftopnm, jpegtopnm and pngtopnm}}} (from netpbm) and {{{gocr}}} installed.
+ You will need {{{giftopnm, jpegtopnm and pngtopnm}}} (from netpbm), imagemagick and {{{gocr}}} installed.

Additionally, you will need the perl module {{{ String::Approx }}} and several tools from {{{ giflib }}} (also known as libungif).
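
As a quick sanity check before installing, the helper binaries can be looked up on the PATH. The following is only a minimal sketch in Python, not part of the plugin: the function name {{{missing_tools}}} is hypothetical, {{{convert}}} is assumed to be the imagemagick binary, and {{{giffix}}} and {{{giftext}}} are assumed to be the giflib tools meant here because they are the ones named elsewhere on this page. It does not check for the perl module.

```python
import shutil

# Helper binaries named on this page. Which giflib tools are actually
# required is an assumption; adjust the list to match your version.
REQUIRED_TOOLS = [
    "giftopnm", "jpegtopnm", "pngtopnm",  # netpbm
    "convert",                            # imagemagick
    "gocr",
    "giffix", "giftext",                  # giflib / libungif
]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All helper binaries found.")
```

Any tool reported as missing either needs to be installed or needs its full path set in the corresponding {{{focr_bin_*}}} entry.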

@@ -65, +65 @@

* New debug system
* Logfile support
* Proper error handling for most errors
+ Version 2.3
+ * Multiple scans with different pnm preprocessing and gocr arguments possible
+ * Support for interlaced gifs
+ * Support for animated gifs
+ * Temporary file handling reorganized
+ * External wordlist support
+ * Personalized wordlist support
+ * Spaces are now stripped from wordlist words and OCR results before matching
+ * Experimental MD5 Database feature

== Installation ==

- Attention: If you need help installing this plugin or have other questions, please use the mailinglist created for this plugin.
+ Attention: If you need help installing this plugin or have other questions, please use the mailing list created for this plugin or contact me on IRC (see the end of this page for more information).

It can be found at http://lists.own-hero.net/mailman/listinfo/devel-spam
+
+ Since version 2.3, the tarball contains an INSTALL file and a FAQ file. Read both for installation instructions.
+
+ The following information is somewhat older and may no longer be accurate for version 2.3. Most new parameters are not covered here.

Download the tarball (see How to Obtain) into your SpamAssassin configuration directory and unpack it to /etc/mail/spamassassin/. You may choose another location, but you will then have to adjust the configuration file yourself. Open {{{FuzzyOcr.cf}}} and extend the wordlist as you wish. If your helper binaries are in a different location than the defaults in the config file, change those entries to the correct paths.

@@ -84, +97 @@



Explanation of the additional options:
-
- {{{focr_tmp_path}}} - String determining the absolute path to a directory where the plugin may write temporary files to (without trailing slash)

{{{focr_logfile}}} - String determining the file to send log messages to. Make sure this is writable!

@@ -124, +135 @@

Generally, the plugin follows these rules:

* The case is not relevant
- * All special characters or numbers are stripped before any matching is done
+ * All special characters, spaces or numbers are stripped before any matching is done
* Your wordlist word will be found even if it is inside another word (submatching)
* The distance is the number of character insertions, deletions and substitutions needed to turn the wordlist word into the recognized text.
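
The rules above can be illustrated with a small Python sketch. This is not the plugin's code (the plugin uses the perl module {{{ String::Approx }}}); it is a stand-in that uses a standard dynamic-programming approximate substring search. The function names are hypothetical, and dividing the distance by the length of the wordlist word is an assumption about how the relative threshold is applied.

```python
import re


def normalize(text):
    """Lowercase the text and strip everything except the letters a-z."""
    return re.sub(r"[^a-z]", "", text.lower())


def min_substring_distance(word, text):
    """Smallest edit distance between `word` and any substring of `text`."""
    # Row for the empty word: it matches anywhere in the text at cost 0,
    # which is what turns plain edit distance into substring matching.
    prev = [0] * (len(text) + 1)
    for i, wc in enumerate(word, 1):
        cur = [i]  # word[:i] against empty text: i characters unmatched
        for j, tc in enumerate(text, 1):
            cur.append(min(
                prev[j] + 1,               # character of word missing in text
                cur[j - 1] + 1,            # extra character in text
                prev[j - 1] + (wc != tc),  # match or substitution
            ))
        prev = cur
    return min(prev)


def fuzzy_match(word, ocr_text, threshold=0.3):
    """True if `word` occurs in `ocr_text` within the relative threshold."""
    word = normalize(word)
    text = normalize(ocr_text)
    if not word:
        return False
    return min_substring_distance(word, text) / len(word) <= threshold
```

For example, {{{fuzzy_match("viagra", "Buy V1AGRA n0w!")}}} is true in this sketch: after stripping, the text contains "vagra", which is one edit away from the six-letter word, and 1/6 is below the default threshold of 0.3.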

@@ -136, +147 @@


== ToDo ==

- * Avoid usage of tmp files for gocr, redirect output directly back to the script
+ * Rework animated gif handling
+ * Replace plain MD5 database with a DBM file

-- Author: Christian Holler, decoder_at_own-hero_dot_net