[Bug 7579] PDFInfo: pdfinfo:pdf_has_uri
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7579

--- Comment #12 from Henrik Krohns <apache@hege.li> ---
(In reply to Giovanni Bechis from comment #7)
>
> Extract URIs from pdf files (at least some of them) and add them to the pool
> of URIs to be checked (URIBL, etc...).

We have ExtractText.pm too, so which is the better tool for the job? How will we
manage things in the future when we have 10 plugins all adding some metadata? Do we
actually want "uri" or URIBL to match _anything_, and how do we manage on a
per-rule basis which sources should be used?
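
One way to picture the per-rule source question is to tag each URI with the plugin or message part it came from and let each rule name the sources it accepts. The sketch below is purely illustrative: `TaggedUri`, `uris_for_rule`, and the source labels "body" and "pdf" are hypothetical names, not part of SpamAssassin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedUri:
    """A URI plus a label for where it was found (hypothetical structure)."""
    uri: str
    source: str  # e.g. "body" or "pdf"; labels are assumptions for this sketch


def uris_for_rule(uris, allowed_sources):
    """Return only the URIs whose source the rule has opted in to."""
    return [u.uri for u in uris if u.source in allowed_sources]


pool = [
    TaggedUri("http://a.example/", "body"),
    TaggedUri("http://b.example/", "pdf"),
]
body_only = uris_for_rule(pool, {"body"})
everything = uris_for_rule(pool, {"body", "pdf"})
```

With this shape, a rule that says nothing about PDF sources never sees PDF URIs, which keeps existing rules unchanged by default.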

--
You are receiving this mail because:
You are the assignee for the bug.

--- Comment #13 from Henrik Krohns <apache@hege.li> ---
Let's say some large PDF has a hundred unique "uris" for one reason or another.
How would we manage this? Should we prefer to query them against URIBLs instead
of body URIs? Or shuffle and take n URIs from here and there? How will the
various __URI* rules react, which depend on the count / number of hits?
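
To make the "shuffle and take n" option concrete, here is a minimal sketch that caps the number of URIs sent to lookups, filling the quota from body URIs first and from PDF URIs second. The function name `select_uris`, the body-first preference, and the fixed seed are all assumptions made for illustration; none of this reflects how SpamAssassin actually selects URIs.

```python
import random


def select_uris(body_uris, pdf_uris, limit, seed=0):
    """Pick at most `limit` unique URIs, preferring body URIs over PDF URIs.

    Each source is deduplicated and shuffled before sampling, so a PDF with
    a hundred links cannot crowd out the URIs found in the message body.
    """
    rng = random.Random(seed)  # fixed seed only to keep the sketch repeatable
    seen = set()
    chosen = []
    for source in (body_uris, pdf_uris):
        # dict.fromkeys() removes duplicates while keeping first-seen order
        unique = [u for u in dict.fromkeys(source) if u not in seen]
        rng.shuffle(unique)
        for uri in unique:
            if len(chosen) >= limit:
                return chosen
            seen.add(uri)
            chosen.append(uri)
    return chosen


body = ["http://a.example/", "http://b.example/"]
pdf = [f"http://pdf{i}.example/" for i in range(100)]
picked = select_uris(body, pdf, limit=5)
```

Note that a cap like this changes what count-based rules see, which is exactly the concern raised above: the number of hits then depends on the limit and not only on the message.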

I'm quite sceptical that even ExtractText makes any sense. It has the same
problems, along with possibly filling Bayes with semi-random stuff from badly
OCR'd images or wonkily rendered PDFs etc.

I think I would just vote to have a pdf_has_uri() which can match URIs from PDFs
and that's it. No complex metadata hassles.
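
As a rough illustration of what a standalone pdf_has_uri() check could look like, the sketch below scans raw PDF bytes for URI actions written as literal strings, `/URI (...)`, and tests them against a regular expression. It is an assumption-laden sketch, not the PDFInfo.pm implementation: it ignores compressed object streams, hex-encoded strings, and nested parentheses, and the helper name `pdf_uris` is hypothetical.

```python
import re

# Matches "/URI (....)" where the value is a PDF literal string.
# Backslash escapes inside the string are skipped over but not decoded.
URI_ACTION = re.compile(rb"/URI\s*\(((?:\\.|[^\\)])*)\)")


def pdf_uris(pdf_bytes):
    """Return the URI strings found in uncompressed URI actions."""
    return [m.group(1).decode("latin-1") for m in URI_ACTION.finditer(pdf_bytes)]


def pdf_has_uri(pdf_bytes, pattern):
    """True if any extracted PDF URI matches the given regular expression."""
    rx = re.compile(pattern)
    return any(rx.search(uri) for uri in pdf_uris(pdf_bytes))


sample = b"<< /Type /Annot /A << /S /URI /URI (http://spam.example/offer) >> >>"
found = pdf_uris(sample)
```

The point of the design is that the PDF URIs stay local to this one check and never enter the shared URI pool, so URIBL lookups and count-based rules are unaffected.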


--- Comment #14 from Giovanni Bechis <giovanni@paclan.it> ---
(In reply to Henrik Krohns from comment #12)
> (In reply to Giovanni Bechis from comment #7)
> >
> > Extract URIs from pdf files (at least some of them) and add them to the pool
> > of URIs to be checked (URIBL, etc...).
>
> We have ExtractText.pm too, so which is the better tool for the job? How will
> we manage things in the future when we have 10 plugins all adding some
> metadata? Do we actually want "uri" or URIBL to match _anything_, and how do
> we manage on a per-rule basis which sources should be used?

IMHO ExtractText.pm is more OCR-oriented and it covers more than just PDF
files; PDFInfo.pm is more about attached PDF file names and other info strictly
related to PDFs. Maybe they could be merged, but I do not think it's worth the
effort.


--- Comment #15 from Giovanni Bechis <giovanni@paclan.it> ---
(In reply to Henrik Krohns from comment #13)
> Let's say some large PDF has a hundred unique "uris" for one reason or
> another. How would we manage this? Should we prefer to URIBL query them
> instead of body uris? Or shuffle and take n-amount of uris from here and
> there? How will different __URI* rules react, which depend on count / number
> of hits?
>
> I'm quite sceptical that even ExtractText makes any sense. It has the same
> problems, along with possibly filling Bayes with semi-random stuff from
> badly OCR'd images or wonky rendered PDF's etc.
>
> I think I would just vote to have a pdf_has_uri() which can match URIs from
> PDFs and that's it. No complex metadata hassles.

ExtractText could poison Bayes databases, but a lot of other sources can do the
same. On the other hand, it can parse .docx files and images as well, not
just PDF files.
A warning about using ExtractText together with Bayes is a good idea anyway.
