ExtractText tuning
Hello,

I have successfully set up the ExtractText plugin with the proposed settings
(those in the POD/manual page), and here are two tips:

- put extracttext.pm into /etc/spamassassin or a similar directory
  (extracttext settings aren't loaded from user_prefs)

- tesseract takes too much time to process (at least on my server),
  so I recommend setting:

extracttext_timeout 20 60
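
A minimal sketch of how the two tips above might be combined, assuming the
plugin is already enabled via loadplugin in one of the .pre files. The file
name extracttext.cf is hypothetical; any .cf file in the site-wide
configuration directory should work. Going by the plugin's POD, the first
value appears to be the timeout in seconds per attachment and the optional
second value the total limit for all extraction on one message:

```
# /etc/spamassassin/extracttext.cf  (hypothetical file name)
ifplugin Mail::SpamAssassin::Plugin::ExtractText
  # assumed meaning: 20 s per attachment, 60 s in total per message
  extracttext_timeout 20 60
endif
```

The ifplugin guard only keeps the file harmless on hosts where the plugin is
not loaded; it is optional.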

--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Windows 2000: 640 MB ought to be enough for anybody
Re: ExtractText tuning
Hi,

> I have successfully set up the ExtractText plugin with the proposed settings
> (those in the POD/manual page), and here are two tips:
>
> - put extracttext.pm into /etc/spamassassin or a similar directory
>   (extracttext settings aren't loaded from user_prefs)
>
> - tesseract takes too much time to process (at least on my server),
>   so I recommend setting:
>
> extracttext_timeout 20 60
>

Have you noticed an increase in false positives due to legitimate "invoice"
PDFs or other attachments being processed by body filters and getting
tagged incorrectly?
Re: ExtractText tuning
>> I have successfully set up the ExtractText plugin with the proposed settings
>> (those in the POD/manual page), and here are two tips:
>>
>> - put extracttext.pm into /etc/spamassassin or a similar directory
>>   (extracttext settings aren't loaded from user_prefs)
>>
>> - tesseract takes too much time to process (at least on my server),
>>   so I recommend setting:
>>
>> extracttext_timeout 20 60

On 06.03.23 12:23, Alex wrote:
>Have you noticed an increase in false positives due to legitimate "invoice"
>PDFs or other attachments being processed by body filters and getting
>tagged incorrectly?

None so far, as I have only implemented it on my personal machine, and I
don't receive many invoices.
I will try it on other machines, though.

--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Re: ExtractText tuning
>> I have successfully set up the ExtractText plugin with the proposed settings
>> (those in the POD/manual page), and here are two tips:
>>
>> - put extracttext.pm into /etc/spamassassin or a similar directory
>>   (extracttext settings aren't loaded from user_prefs)
>>
>> - tesseract takes too much time to process (at least on my server),
>>   so I recommend setting:
>>
>> extracttext_timeout 20 60

On 06.03.23 12:23, Alex wrote:
>Have you noticed an increase in false positives due to legitimate "invoice"
>PDFs or other attachments being processed by body filters and getting
>tagged incorrectly?

Update:

So far I have only been happy with it: it helps catch spam using BAYES. For example:

X-Spam-ExtractText-Chars: 118
X-Spam-ExtractText-Words: 19
X-Spam-ExtractText-Tools: pdftotext
X-Spam-ExtractText-Types: application/pdf
X-Spam-ExtractText-Extensions: pdf
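
Headers like the ones above would typically come from add_header lines in the
site configuration. A sketch, assuming the template tag names listed in the
plugin's POD (worth checking against the installed version before relying on
them):

```
ifplugin Mail::SpamAssassin::Plugin::ExtractText
  # SpamAssassin prepends "X-Spam-" to each header name given here
  add_header all ExtractText-Chars      _EXTRACTTEXTCHARS_
  add_header all ExtractText-Words      _EXTRACTTEXTWORDS_
  add_header all ExtractText-Tools      _EXTRACTTEXTTOOLS_
  add_header all ExtractText-Types      _EXTRACTTEXTTYPES_
  add_header all ExtractText-Extensions _EXTRACTTEXTEXTENSIONS_
endif
```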

I believe training Bayes on invoices would quickly fix any such problem.

--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/