Hi,
I'm trying to use the latest ExtractText plugin, but the docx2txt
program the plugin references is no longer available from
http://docx2txt.sourceforge.net
I've located a working replacement at
https://github.com/ankushshah89/python-docx2txt/ (although it's
written in Python and I don't have a distro package for it), but it
doesn't appear to write its output to stdout. This is my current
configuration:
extracttext_external docx2txt /usr/local/bin/docx2txt {} -
extracttext_use docx2txt .docx application/docx
Do you have any recommendations for an alternative, or for how to
modify this Python script so that it writes its text to stdout?
# /usr/local/bin/docx2txt -h
usage: docx2txt [-h] [-i IMG_DIR] docx

A pure python-based utility to extract text and images from docx files.

positional arguments:
  docx                  path of the docx file

optional arguments:
  -h, --help            show this help message and exit
  -i IMG_DIR, --img_dir IMG_DIR
                        path of directory to extract images
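
One possible workaround, in case the script itself can't be made to print: a .docx file is a zip archive whose body text lives in word/document.xml, so a small standalone wrapper can pull the text out and write it to stdout using only the Python standard library. The following is a minimal sketch under that assumption, not the python-docx2txt implementation; the function name docx_to_text and the script name docx2stdout.py are hypothetical. It ignores headers, footers, and footnotes, and text inside text boxes may appear twice.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    `path` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    lines = []
    for para in root.iter(W_NS + "p"):
        parts = []
        for node in para.iter():
            if node.tag == W_NS + "t":          # run of text
                parts.append(node.text or "")
            elif node.tag == W_NS + "tab":      # explicit tab
                parts.append("\t")
            elif node.tag in (W_NS + "br", W_NS + "cr"):  # line break
                parts.append("\n")
        lines.append("".join(parts))
    return "\n".join(lines)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Write UTF-8 bytes directly so the output does not depend on the locale.
    sys.stdout.buffer.write((docx_to_text(sys.argv[1]) + "\n").encode("utf-8"))
```

If this were saved as, say, /usr/local/bin/docx2stdout.py (a hypothetical path), the extracttext_external line could presumably point at it with only the {} placeholder, since the script always writes to stdout and needs no trailing "-" argument.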
Also, has anyone written any meta rules for use with ExtractText that
they'd like to share? I'd like to block all PDF files that contain any
type of JavaScript, malicious or otherwise. I'd also like to block
all PDFs that are a single page long and contain a single URL, since
that appears to describe the vast majority of malicious PDFs.
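
To make the two conditions concrete, here is a rough sketch of the checks I have in mind, written as plain Python rather than as ExtractText rules. It scans the raw PDF bytes for the /JavaScript and /JS name tokens, counts /Type /Page objects, and counts distinct /URI action targets. The names pdf_flags and looks_suspicious are hypothetical, and the approach is only a heuristic: it will miss anything stored in compressed object streams or written with hex-escaped names.

```python
import re

# Name tokens that mark JavaScript actions in a PDF
JS_NAME = re.compile(rb"/(?:JavaScript|JS)\b")
# Page objects; the \b keeps "/Type /Pages" (the page tree) from matching
PAGE_OBJ = re.compile(rb"/Type\s*/Page\b")
# URI actions written as literal strings, e.g. /URI (http://...)
URI_ACTION = re.compile(rb"/URI\s*\(([^)]*)\)")


def pdf_flags(data):
    """Return simple indicators extracted from raw PDF bytes."""
    uris = {m.group(1) for m in URI_ACTION.finditer(data)}
    return {
        "has_js": bool(JS_NAME.search(data)),
        "pages": len(PAGE_OBJ.findall(data)),
        "uris": len(uris),
    }


def looks_suspicious(data):
    """True if the PDF has any JavaScript, or is one page with one URL."""
    flags = pdf_flags(data)
    return flags["has_js"] or (flags["pages"] == 1 and flags["uris"] == 1)
```

Counting distinct URI targets rather than every occurrence is a deliberate choice, so that a single link repeated across several annotations still counts as one URL.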