Mailing List Archive: ExtractText and docx

ExtractText and docx

mysqlstudent at gmail

May 6, 2021, 6:20 PM

Post #1 of 7 (876 views)

Hi,

I'm trying to use the latest ExtractText plugin, but the docx2txt
program the plugin references is no longer available from
http://docx2txt.sourceforge.net

I've located a working replacement at
https://github.com/ankushshah89/python-docx2txt/ (although it's
written in Python and I don't have a distro package for it), but it
doesn't appear to write to stdout.

extracttext_external docx2txt /usr/local/bin/docx2txt {} -
extracttext_use docx2txt .docx application/docx

Do you have any recommendations for an alternative or how to modify
this python script to pipe its text to stdout?

# /usr/local/bin/docx2txt -h
usage: docx2txt [-h] [-i IMG_DIR] docx

A pure python-based utility to extract text and images from docx files.

positional arguments:
docx path of the docx file

optional arguments:
-h, --help show this help message and exit
-i IMG_DIR, --img_dir IMG_DIR
path of directory to extract images

Also, has anyone written any meta rules for use with ExtractText that
they'd like to share? I'd like to block all PDF files that contain any
type of JavaScript, malicious or otherwise. I'd also like to block
all PDFs that are a single page and contain a single URL; that appears
to describe the vast majority of malicious PDFs.

Re: ExtractText and docx

lwilton at earthlink

May 6, 2021, 8:09 PM

Post #2 of 7 (876 views)

> I'm trying to use the latest ExtractText plugin, but the docx2txt
> program the plugin references is no longer available from
> http://docx2txt.sourceforge.net

The latest version appears to be 1.4 from several years ago.
I just tried downloading the 1.4 version and the CVS version, and in both
cases was rewarded with an archive file.

Loren

Re: ExtractText and docx

jhardin at impsec

May 6, 2021, 9:30 PM

Post #3 of 7 (876 views)

On Thu, 6 May 2021, Alex wrote:

> Hi,
>
> I'm trying to use the latest ExtractText plugin, but the docx2txt
> program the plugin references is no longer available from
> http://docx2txt.sourceforge.net

> Do you have any recommendations for an alternative...?

Perhaps one of (from Stack Overflow):

unzip -p some.docx word/document.xml |\
sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

unzip -p document.docx word/document.xml |\
sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'

unzip -p document.docx word/document.xml |\
sed -e 's/<\/w:p>/ /g; s/<[^>]\{1,\}>/ /g; s/[^[:print:]]\{1,\}/ /g'

...though html2text might be better than sed for reliably de-XMLizing the
document text.
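
For anyone who would rather not depend on unzip and sed, the same idea can be sketched in Python using only the standard library. This is a rough equivalent of the second one-liner above, not a tested ExtractText helper: the function name docx_to_text is made up for illustration, and the sketch assumes document.xml is UTF-8 encoded.

```python
#!/usr/bin/env python3
"""Print the body text of a .docx file to stdout (regex-based sketch)."""
import html
import re
import sys
import zipfile


def docx_to_text(path):
    """Return the text of word/document.xml with all tags stripped."""
    # A .docx file is a zip archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # The end of a paragraph (</w:p>) becomes a newline, as in the sed variant.
    xml = xml.replace("</w:p>", "\n")
    # Drop every remaining tag, then decode entities such as &amp;.
    text = re.sub(r"<[^>]+>", "", xml)
    return html.unescape(text)


# Only act as a command-line tool when given exactly one file argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.write(docx_to_text(sys.argv[1]))
```

Because the text goes to stdout, a script like this could in principle be pointed at by an extracttext_external line of the form shown in the first post, though that wiring is untested here. Like the sed versions, it ignores headers, footers and footnotes, which live in other files inside the archive.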

There's also this:

http://abisource.com/downloads/wv/

There's conflicting information on whether Antiword groks .docx; you may
want to try it and see. It may be available from your distro, otherwise:

http://www.winfield.demon.nl/index.html

It might be worthwhile to use native Perl utilities to unzip the file,
extract the document.xml content and pass it through XML::XPath to extract
the text, but that would probably involve code changes to ExtractText
rather than just configuring it to use an external utility.
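
To illustrate the parse-the-XML idea without touching ExtractText, here is a minimal sketch in Python rather than Perl (so it stands in for XML::XPath, it does not use it). It walks the w:p paragraph elements and joins the w:t text runs inside each one. The name docx_paragraphs is hypothetical; the namespace URI is the standard WordprocessingML one.

```python
#!/usr/bin/env python3
"""Extract paragraph text from a .docx by parsing word/document.xml."""
import sys
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the main WordprocessingML vocabulary, in ElementTree's {uri} form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return a list with the text of each w:p element, in document order."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    result = []
    for para in root.iter(W + "p"):
        # A paragraph's visible text is spread across w:t elements in its runs.
        result.append("".join(node.text or "" for node in para.iter(W + "t")))
    return result


# Only act as a command-line tool when given exactly one file argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    for line in docx_paragraphs(sys.argv[1]):
        print(line)
```

Two caveats: paragraphs nested inside other paragraphs (text boxes, for example) would have their text emitted twice by this loop, and Python's built-in XML parsers are not hardened against maliciously constructed input, which matters when the input is an email attachment.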

Caveat: I have never looked at the ExtractText plugin.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Are you a mildly tech-literate politico horrified by the level of
ignorance demonstrated by lawmakers gearing up to regulate online
technology they don't even begin to grasp? Cool. Now you have a
tiny glimpse into a day in the life of a gun owner. -- Sean Davis
-----------------------------------------------------------------------
2 days until the 76th anniversary of VE day

Re: ExtractText and docx

May 6, 2021, 9:35 PM

Post #4 of 7 (876 views)

If you have a JVM lying around, you can extract docx text with Apache Tika.

—
Peter West
pbw@ehealth.id.au
“I am the vine; you are the branches.”


Re: ExtractText and docx

Olivier.Nicole at cs

May 6, 2021, 9:58 PM

Post #5 of 7 (876 views)

Peter West <pbw@pbw.id.au> writes:

> If you have a JVM lying around, you can extract docx text with Apache Tika.

I use LibreOffice for that purpose. It's not the most efficient option,
but I am sure it covers everything, and it stays current each time I
update LibreOffice:

/usr/local/bin/soffice --headless --convert-to pdf $dir/message.raw \
--outdir $dir

Once in a while a LibreOffice process will hang, so I have a cron job
that kills any such process older than 5 minutes.

Note that it converts the document to PDF, so I still have to do PDF
extraction afterward.
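
As an alternative to cleaning up hung processes from cron, the conversion could be wrapped so that each run carries its own time limit. The following is only a sketch: the helper names soffice_command and run_with_timeout are invented for illustration, the soffice path and flags are copied from the command above, and the 300-second limit mirrors the 5 minutes mentioned.

```python
#!/usr/bin/env python3
"""Run a LibreOffice conversion with a hard time limit (sketch)."""
import subprocess
import sys


def soffice_command(raw_path, out_dir, soffice="/usr/local/bin/soffice"):
    """Build the same command line as the one quoted in the post."""
    return [soffice, "--headless", "--convert-to", "pdf", raw_path,
            "--outdir", out_dir]


def run_with_timeout(command, seconds):
    """Return True on exit status 0, False on failure or timeout."""
    try:
        # On timeout, subprocess.run kills the child and waits for it.
        completed = subprocess.run(command, timeout=seconds,
                                   stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return False
    return completed.returncode == 0


# Usage: script.py MESSAGE.raw OUTDIR
if __name__ == "__main__" and len(sys.argv) == 3:
    ok = run_with_timeout(soffice_command(sys.argv[1], sys.argv[2]), 300)
    sys.exit(0 if ok else 1)
```

One limitation to keep in mind: the timeout only kills the process that was started directly. If the soffice launcher spawns a separate worker process, that worker may survive, so a cron sweep could still be needed as a backstop.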

Best regards,

Olivier


Re: ExtractText and docx

May 6, 2021, 9:58 PM

Post #6 of 7 (876 views)

On Thu, May 06, 2021 at 09:20:28PM -0400, Alex wrote:
>
> Also, has anyone written any meta rules for use with ExtractText that
> they'd like to share? I'd like to block all PDF files that contain any
> type of JavaScript, malicious or otherwise. I'd also like to block
> all PDFs that are a single page and contain a single URL; that appears
> to describe the vast majority of malicious PDFs.

That's something for PDFInfo or the like.

ExtractText simply extracts text and pretends it's _part_ of the message
body (for body rules etc). How would that retain any info about what is "a
single PDF page"? You don't even know what the text was extracted from.
Which is why I'm debating whether the whole plugin is useful at all or
is just feeding Bayes crap.
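
To make the distinction concrete: the traits asked about (JavaScript, page count, URL count) are properties of the PDF's structure, not of its extracted text, so they have to be read from the file itself. The sketch below is a crude standalone heuristic that scans the raw bytes for the relevant PDF name tokens. It is not how PDFInfo works, the function name pdf_traits is made up, and it will miss anything stored in compressed object streams or written with hex-escaped names.

```python
#!/usr/bin/env python3
"""Crude structural checks on raw PDF bytes (heuristic sketch)."""
import re
import sys

# /JS and /JavaScript mark JavaScript actions in a PDF dictionary.
JS_TOKENS = re.compile(rb"/(?:JavaScript|JS)\b")
# "/Type /Page" marks a page object; \b keeps "/Type /Pages" from matching.
PAGE_OBJECTS = re.compile(rb"/Type\s*/Page\b")
# A URI action carries its target as a literal string: /URI (http://...)
URI_STRINGS = re.compile(rb"/URI\s*\(")


def pdf_traits(data):
    """Return JavaScript presence plus page and URI counts for PDF bytes."""
    return {
        "javascript": bool(JS_TOKENS.search(data)),
        "pages": len(PAGE_OBJECTS.findall(data)),
        "uris": len(URI_STRINGS.findall(data)),
    }


# Usage: script.py FILE.pdf
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], "rb") as handle:
        print(pdf_traits(handle.read()))
```

A result with pages equal to 1 and uris equal to 1 would correspond to the "single page, single URL" pattern from the original question, but given the limitations above this is at best a starting point for experiments, not something to score mail on directly.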

Re: ExtractText and docx

May 7, 2021, 4:26 AM

Post #7 of 7 (876 views)

On 2021-05-07 06:58, Henrik K wrote:

> Which is why I'm debating if the whole plugin is useful at all or
> just feeding Bayes crap.

Oh dear :=)

Bayes can only be fooled by feeding it poisoned data through autolearn;
if a message is manually trained as spam, the poisoned data loses.

Maybe there is another problem; YMMV.

ClamAV with the Foxhole third-party signatures can check for embedded
JavaScript, which is a cleaner solution IMHO.