Mailing List Archive

ripOLE 0.0.3
Mostly cleaned up everything now.

ripOLE decodes MS Office files (all the ones I've tested). If you have any MS
Office files which don't decode (i.e. the attached files aren't extracted),
please send them to me so that I can disassemble them and work out what
trickery (most probably just the header elements of the stream) still needs to
be handled.

Latest ripOLE is at:

http://www.pldaniels.com/ripole

(version 0.0.3)

Kind Regards

--
Paul L Daniels http://www.pldaniels.com
Linux/Unix systems Internet Development
ICQ#103642862,AOL:pldsoftware,IRC:inflex irc.freenode.net
A.B.N. 19 500 721 806
RE: ripOLE 0.0.3
> ripOLE decodes MS Office files (all the ones i've tested), if you have any MS
> Office files which don't decode (extract the attached files) please send them to
> me so that I can disassemble them and find out what trickery (most probably just
> the header-elements) of the stream is needed to be worked out.

Hopefully the attached patch to ole.c should fix the following warning I
received when compiling on Redhat 6.2 (without breaking anything).

gcc -Wall -Werror -O2 -march=i686 -c ole.c
cc1: warnings being treated as errors
ole.c: In function `OLE_load_chain':
ole.c:818: warning: `buffer' might be used uninitialized in this function
make: *** [ole.o] Error 1

I've also included a Word 2000 document that has 2 image files within it.
One is drag-and-dropped into it, which is extracted correctly, the other is
added using Insert|Picture|From File, which isn't extracted. Is this
expected behaviour?

Other than that, it's looking good so far.
Chris
Re: ripOLE 0.0.3
> Hopefully the attached patch to ole.c should fix the following warning I
> received when compiling on Redhat 6.2 (without breaking anything).

Ah warnings - I'm using gcc 2.95.3, but thanks for the patch - strange that my
-Werror and -Wall didn't pick it up.



> I've also included a Word 2000 document that has 2 image files within it.
> One is drag-and-dropped into it, which is extracted correctly, the other is
> added using Insert|Picture|From File, which isn't extracted. Is this
> expected behaviour?

Yes - the latter is part of what I'm trying to track down as being 'other'
objects (other than direct file attachments) which should be extracted. I
imagine that's where things will get messy :-\

The choice to decode or not is left to the olestream-unwrap module, because I
felt that the ole module should concentrate strictly on converting the file
into the streams. If we try to do more, we risk breaking encapsulation/object
properties.

Thanks again for the files - I'll update ASAP. I also posted to Freshmeat.


--
Paul L Daniels http://www.pldaniels.com
Re: ripOLE 0.0.3
> I've also included a Word 2000 document that has 2 image files within it.
> One is drag-and-dropped into it, which is extracted correctly, the other is

Chris,

I've checked the output from --debug with --save-unknown-streams; it seems the
image is embedded directly in the WordDocument element, with no apparent
indicator (yet) of how or where to find the image precisely. I can see the data
of the image, but I can't accurately get to it.

Do you have the original image that you 'inserted' into the document handy? So
I can do a byte-match/offset. I did notice the word EMBED in ole-stream.2 of
your decoded document.
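
(For anyone wanting to try the same byte-match: below is a minimal sketch in
Python, not part of ripOLE. The function name match_offsets and the 16-byte
probe length are made up for this illustration, and it assumes the inserted
image is stored byte-for-byte unmodified inside the dumped stream.)

```python
def match_offsets(stream: bytes, original: bytes, probe_len: int = 16):
    """Find where `original` appears to start inside `stream`.

    Searches for the first `probe_len` bytes of the original file, then
    counts how many consecutive bytes agree from that point on.
    Returns a list of (offset, matched_length) tuples, one per probe hit.
    """
    probe = original[:probe_len]
    hits = []
    if not probe:
        return hits
    pos = stream.find(probe)
    while pos != -1:
        matched = 0
        limit = min(len(original), len(stream) - pos)
        while matched < limit and stream[pos + matched] == original[matched]:
            matched += 1
        hits.append((pos, matched))
        pos = stream.find(probe, pos + 1)
    return hits


# Synthetic demo: a fake "image" buried 10 bytes into a fake stream.
image = bytes(range(64))
stream = b"\x00" * 10 + image + b"\xff" * 5
print(match_offsets(stream, image))  # prints [(10, 64)]
```

Fed the bytes of a dumped stream (for example ole-stream.2) and of the original
image, an empty result would suggest the image was re-encoded or split up
rather than stored verbatim. A matched length shorter than the file would point
to where the two diverge.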

Back to you...


--
Paul L Daniels http://www.pldaniels.com
RE: ripOLE 0.0.3
> I can see the data of the image, but I can't accurately get to it.
>
> Do you have the original image that you 'inserted' into the document handy? So
> I can do a byte-match/offset. I did notice the word EMBED in ole-stream.2 of
> your decoded document.
>
> Back to you...

Paul,

OK, I've been doing a bit of digging, and it appears that Word (2000)
converts GIFs to PNGs when embedding them, which complicates things
slightly (it seems to do the same with TIFFs as well).
I tried extracting a Word document containing a GIF and a JPEG. The JPEG is
there, but I couldn't see the GIF anywhere until I noticed the letters PNG.
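
(Spotting the letters PNG by eye can be automated. Every PNG file starts with
the same fixed 8-byte signature, followed by length-prefixed chunks that end
with an IEND chunk. The Python sketch below is only an illustration and not
ripOLE code: the names carve_pngs and _png_end are made up here, it assumes
the PNG sits contiguously in the stream, and it does not verify chunk CRCs.)

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG file signature


def _png_end(stream: bytes, start: int):
    """Walk PNG chunks from `start`; return the offset just past IEND, or None."""
    cursor = start + len(PNG_SIGNATURE)
    # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    while cursor + 12 <= len(stream):
        (length,) = struct.unpack(">I", stream[cursor:cursor + 4])
        chunk_type = stream[cursor + 4:cursor + 8]
        cursor += 12 + length
        if cursor > len(stream):
            return None  # chunk claims more data than the stream holds
        if chunk_type == b"IEND":
            return cursor
    return None  # ran out of data before reaching IEND


def carve_pngs(stream: bytes):
    """Return a list of (offset, png_bytes) for each complete PNG in `stream`."""
    found = []
    pos = stream.find(PNG_SIGNATURE)
    while pos != -1:
        end = _png_end(stream, pos)
        if end is not None:
            found.append((pos, stream[pos:end]))
        pos = stream.find(PNG_SIGNATURE, pos + 1)
    return found
```

Each carved byte string could be written to its own file and opened in an
image viewer to confirm it is the converted GIF.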

Looking at the raw file in a hex editor, it looks like the original path
(including the filename) appears, followed by 16 bytes, then 24 bytes, then the
same 16 bytes again immediately before the file data starts.

Just an educated guess, but the last 2 bytes of the 16 that are repeated
might be the file size (big endian), as converted within the document.
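
(The layout and the size guess above can be checked mechanically. The Python
sketch below is only an illustration under the assumptions stated in this
message: a 16-byte block, 24 other bytes, then the same 16-byte block
immediately before the file data. The names find_repeated_headers and
size_guesses are made up here. Note that a 2-byte field can only hold values up
to 65535, so the guess could only hold for images smaller than 64 KB.)

```python
import struct

HEADER_LEN = 16  # length of the repeated block, as observed in the hex dump
GAP_LEN = 24     # bytes between the two copies, as observed in the hex dump


def find_repeated_headers(stream: bytes):
    """Return (header_offset, data_offset, header) for each candidate match.

    A candidate is a 16-byte block that reappears exactly 24 bytes after it
    ends. Blocks made of a single repeated byte value (e.g. zero padding)
    are skipped to cut down on false positives.
    """
    span = HEADER_LEN + GAP_LEN
    hits = []
    for i in range(len(stream) - span - HEADER_LEN + 1):
        header = stream[i:i + HEADER_LEN]
        if len(set(header)) > 1 and header == stream[i + span:i + span + HEADER_LEN]:
            hits.append((i, i + span + HEADER_LEN, header))
    return hits


def size_guesses(header: bytes):
    """Read the last 2 bytes of the block as (big-endian, little-endian)."""
    big = struct.unpack(">H", header[-2:])[0]
    little = struct.unpack("<H", header[-2:])[0]
    return big, little
```

Comparing both values against the byte length of the converted image would
show whether either byte order supports the file-size guess.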

Anyway, I've included the original GIF and also converted it to PNG.

Hope this helps.
Chris
Re: ripOLE 0.0.3
> OK, I've been doing a bit of digging, and it appears that Word (2000)
> converts GIF's to PNG's when embedding them, which complicates things
> slightly (seems to do the same with TIFF's as well).

Interesting that MS would be converting them to PNGs, considering they have a
broken PNG support library in IE (alpha blend). I'm presuming you're referring
to when you do an Insert->Picture->FromFile operation, as opposed to
Insert->Object->File.

> Just an educated guess, but the last 2 bytes of the 16 that are repeated
> might be the file size (big endian), as converted within the document.

Ugh, once again - I think that a structured-parsing approach needs to be taken;
it seems like there's no consistent 'header' which can be searched for :-\

I've put up a copy of an HTML-formatted Word document-format specification - I
just haven't had time to go through it myself yet.

http://www.pldaniels.com/ripole/wword8.html

Regards.


--
Paul L Daniels http://www.pldaniels.com