Mailing List Archive

storing XML can be good... was Re: ANNOUNCE: PT_XPath
I hear more and more often that XML should not be stored, and just be an
interchange format. And in most cases, that is likely to be correct, but...

I think there are some very convincing reasons to, indeed, store XML on the
server for particular use cases. The best possible illustration that I can
think of is a use case that I have; sorry that this 'manifesto for storing
XML' is so lengthy, but I think this position needs definitive justification
- and I think that for this use case below, I'm right... ;)

Eventually, for our internal use, my company will likely be creating a
product that will support the use and storage of NITF (News Industry Text
Format) and NewsML - both standards for news content defined by the IPTC
(International Press Telecommunications Council) and/or the Newspaper
Association of America (NAA). These DTDs were designed to address
deficiencies in current information flow in the newspaper industry; NITF, in
particular, began in the pre-XML days, as an SGML DTD with the specific goal
of allowing news producers/vendors to scrap the 150 or so different
proprietary text interchange and storage formats used by the industry. The
idea was that translating (filtering) between lots of formats almost always
loses information, and the loss is always costly.

One primary idea behind the migration to NITF XML was that 'value-added'
data keeps being added at every step of the workflow. That is, rather than
losing content quality at every step of the workflow, content gains quality:
more markup, more specificity, more info to help increase the quality of the
content, which CAN be monetized, because it avoids human labor. Two things
prevent this: filtering between formats (which is not lossless), and
'flattening' data into a prescribed set of flat fields. If your DTD supports
hundreds of fields and less-than-'tabular' datasets, flattening to, say, 30
or 40 fields is a bad, bad thing - but to date, AFAIK, all news vendors
supporting this standard do it. The lifecycle of a news story clearly
demonstrates the need to add value (not subtract it with flattening) at
every step:

>News assignment created by editor, kept as metadata
->Reporter creates story [Editorial System]
-->Editor edits story [Editorial System]
--->Copy is edited for print [Pagination System]
---->Edited copy is fed back (merged) into the editorial system from paginated output
----->Stories are imported into an online CMS for www/wireless/PDA/web-services publishing, etc.
------>Stories are edited by online staff (adding data like links to photos, cleaning up cutlines to make them usable)
------->Stories are merged back into a database or archive, either in the editorial system or a 3rd-party archive/library system
-------->Librarians add lots of data and subject/topic info
--------->Librarians export stories to vendors like AOL, Lexis-Nexis, and wire services
--------->Wire services add markup to the story for re-use
---------->Newspaper X uses the story from the wire service, edits it, and adds its own text and metadata
----------->[Cycle is repeated until story end-of-life]
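
To make the flattening-loss point concrete, here is a minimal sketch in
Python (standard library only) of the difference between flattening a story
to a handful of fields and keeping the full parsed document; the sample
markup and element names are only NITF-ish illustrations, not a real feed:

import xml.etree.ElementTree as ET

# Illustrative NITF-style story; real NITF documents carry far more
# structure (document metadata, revision history, more inline markup)
# than the handful of elements shown here.
NITF_SAMPLE = """
<nitf>
  <body>
    <body.head>
      <hedline><hl1>City council approves budget</hl1></hedline>
      <byline>By <person>Jane Doe</person>, Staff Writer</byline>
    </body.head>
    <body.content>
      <p>The council voted 7-2 on <chron norm="20011022">Monday</chron>
      to approve the plan backed by <person>Mayor Smith</person>.</p>
    </body.content>
  </body>
</nitf>
"""

doc = ET.fromstring(NITF_SAMPLE)

# "Flattened" view: a fixed set of fields, the way vendor systems tend to
# store a story.  Inline markup (person, chron) and anything without a
# matching column is silently discarded.
flat = {
    "headline": doc.findtext(".//hl1"),
    "byline": "".join(doc.find(".//byline").itertext()),
    "body": "\n".join("".join(p.itertext()) for p in doc.iter("p")),
}
print(flat["body"])   # the text survives, but person/chron markup is gone

# Keeping the parsed tree (or the original XML text) preserves everything,
# so every later step in the workflow can keep adding value to it.
print(ET.tostring(doc, encoding="unicode"))

In the flattened view, the machine-readable date on <chron> and the <person>
tagging are lost to every downstream consumer; in the preserved document
they are still there for librarians, wire services, and Newspaper X to
build on.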

At every step of the way, revision history metadata is kept in the XML. If,
at every step of the way, the XML is NOT flattened, the maximum possible
amount of data is kept throughout this lifecycle. No news vendor systems
(that I know of) do this yet, which is disappointing, considering that this
was the 'vision' behind the creation of this DTD (NITF) in the first place.
If I republish a syndicated story from, say, the Washington Post or AP,
there is no reason for metadata and/or unused content that my APIs do not
support to be stripped out.

The problem is, IMHO, ignorant newspaper production system vendors who love
relational data stores so much that they would love nothing more than to
flatten XML into a tabular structure, with a finite number of relations
between tables - which cannot possibly express a multi-dimensional,
hierarchical XML structure without loss. This is the promise that Zope with
ParsedXML holds: by coupling a post-parsed DOM with object persistence
machinery, we are able to:
- Treat the XML as an online document
- Treat the XML as a queryable, fielded database
- Use the XML without data loss, thanks to DOM-based storage
- Bolt on APIs, including web-services APIs, that allow reuse of the
  document via 'flattened' field queries (like an RDB) while storing the
  most possible data without loss. XPath, object query languages, and
  custom extraction and editing APIs would provide for this need. These
  APIs could be used by TAL, search/catalog systems, and distributed /
  web-services-based clients.
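
As a rough illustration of what such a 'bolt-on' field API over stored XML
could look like - plain standard-library Python rather than the actual
ParsedXML/PT_XPath API; the class name, methods, and paths are all made up:

import xml.etree.ElementTree as ET

class StoryDocument:
    """Hypothetical bolt-on field API over a stored XML story.

    The document stays the single, lossless source of truth; 'fields' are
    just path queries computed on demand, which is the sort of job an
    XPath API over a persistent DOM would take on.
    """

    def __init__(self, xml_text):
        self.xml_text = xml_text              # lossless original
        self._tree = ET.fromstring(xml_text)

    # RDB-style 'flattened' field access, derived rather than stored:
    def headline(self):
        return self._tree.findtext(".//hl1")

    def byline(self):
        node = self._tree.find(".//byline")
        return "".join(node.itertext()) if node is not None else None

    # ...while full-fidelity access stays available to other clients
    # (TAL templates, catalogs, web-services consumers):
    def paragraphs(self):
        return [ET.tostring(p, encoding="unicode")
                for p in self._tree.iter("p")]

    def as_xml(self):
        return self.xml_text

A search/catalog system would index the derived fields, while editing tools
and exports keep working against the full document.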

Frankly, this is the holy grail of newspaper production technology: the
centralized content database, which to date is usually an RDB store with
lots of complex relations constituting a lossy system; the crazy thing about
it is that newspapers are paying tens of millions of dollars for these
centralized database systems to run their editorial and pagination systems.
I think that XML storage done right (DOM storage coupled with an ODB and a
data access API accessible to a broad range of clients) is really something
that needs to be done for the news media industry, and I strongly feel that
any advances made in this manner in online production will eventually
trickle into the print production systems of newspapers and magazines. For
that, it seems to be justification enough to store XML on the server as XML
(or as a DOM, which is even better). And in the realm of possibility, this
is one thing that differentiates Zope from almost any other technology
platform; I hope that's a good thing...
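
For the storage half of that, a minimal sketch of 'XML storage coupled with
an ODB' using ZODB, the object database underneath Zope; the Story class and
its attributes are hypothetical, not part of any existing product:

# Stories are kept whole inside ZODB, and flattened/field views are
# derived from them rather than being the storage format.
import persistent
import transaction
import ZODB, ZODB.FileStorage
from BTrees.OOBTree import OOBTree

class Story(persistent.Persistent):
    def __init__(self, story_id, xml_text):
        self.story_id = story_id
        self.xml_text = xml_text          # the full, unflattened XML

storage = ZODB.FileStorage.FileStorage("stories.fs")
db = ZODB.DB(storage)
conn = db.open()
root = conn.root()

if "stories" not in root:
    root["stories"] = OOBTree()           # id -> Story mapping

root["stories"]["story-001"] = Story("story-001", "<nitf>...</nitf>")
transaction.commit()
conn.close()
db.close()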

Anyway, that's my rant,
Sean

=========================
Sean Upton
Senior Programmer/Analyst
SignOnSanDiego.com
The San Diego Union-Tribune
619.718.5241
sean.upton@uniontrib.com
=========================
Re: storing XML can be good... was Re: ANNOUNCE: PT_XPath
On Mon, Oct 22, 2001 at 10:40:09AM -0700, sean.upton@uniontrib.com wrote:
> I hear more and more often that XML should not be stored, and just be an
> interchange format. And in most cases, that is likely to be correct, but...
>
> I think there are some very convincing reasons to, indeed, store XML on the
> server for particular use cases. The best possible illustration that I can
> think of is a use case that I have; sorry that this 'manifesto for storing
> XML' is so lengthy, but I think this position needs definitive justification
> - and I think that for this use case below, I'm right... ;)
>
> Eventually, for our internal use, my company will likely be creating a
> product that will support the use and storage of NITF (News Industry Text
> Format) and NewsML - both standards for news content defined by the IPTC
> (International Press Telecommunications Council) and/or the Newspaper
> Association of America (NAA). These DTDs were designed to address
> deficiencies in current information flow in the newspaper industry; NITF, in
> particular, began in the pre-XML days, as an SGML DTD with the specific goal
> of allowing news producers/vendors to scrap the 150 or so different
> proprietary text interchange and storage formats used by the industry. The
> idea was that translating (filtering) between lots of formats almost always
> loses information, and the loss is always costly.

To me, this is still no reason to justify XML storage. Note that Zope
doesn't force you to use a fixed number of fields; the object tree and
Python provide you with much more flexibility. I am convinced that it is
possible to build a Zope app with the same fidelity as NITF, with lossless
conversion between the internal storage format and NITF.

Using Zope objects instead of XML would buy you speed and scalability. DOMs
are memory hogs, and CPU intensive to build and manipulate. Only use a DOM
when a custom API isn't feasible. Don't carry around all that extra weight
if you can avoid it!
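
A minimal sketch of the kind of lossless object model this suggests - plain
Python, nothing Zope-specific, with a hypothetical Node class: keep tag,
attributes, text, tail, and ordered children per element, and the element
structure round-trips cleanly.

import xml.etree.ElementTree as ET

# One small object per element, keeping everything ElementTree tracks.
# In Zope these would be persistent objects; comments and processing
# instructions would need extra handling in a real implementation.
class Node:
    def __init__(self, tag, attrib, text, tail, children):
        self.tag, self.attrib = tag, dict(attrib)
        self.text, self.tail = text, tail
        self.children = children

def from_xml(elem):
    return Node(elem.tag, elem.attrib, elem.text, elem.tail,
                [from_xml(child) for child in elem])

def to_xml(node):
    elem = ET.Element(node.tag, node.attrib)
    elem.text, elem.tail = node.text, node.tail
    elem.extend(to_xml(child) for child in node.children)
    return elem

source = '<p>Voted on <chron norm="20011022">Monday</chron>, 7-2.</p>'
assert ET.tostring(to_xml(from_xml(ET.fromstring(source))),
                   encoding="unicode") == source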

--
Martijn Pieters
| Software Engineer mailto:mj@zope.com
| Zope Corporation http://www.zope.com/
| Creators of Zope http://www.zope.org/
---------------------------------------------
Re: storing XML can be good... was Re: ANNOUNCE: PT_XPath
I am in a situation similar to Sean's, sending and receiving a bunch of newsfeeds in NITF format. (Hi Sean, I think we met up on the NITF list a while back.)

Because I'm adding on to (and hoping to replace) an ASP system, I've got all my data in MSSQL. My tables have the usual columns that Sean alluded to: headline, byline, date, etc. I also keep the original NITF XML in a bigtext field. I run a cron job to rewrite the XML if any of the other fields gets edited.

Right now I don't think I would want to give up the RDB storage, even if I didn't need it for historical reasons. While each article is a tree, the organization at higher levels is definitely tabular. And many people understand RDBs.

I would love to be able to access the tree structure of each article within Zope, without first parsing XML to a DOM. Having done a fair amount of work with DOM, I hear what Martijn is saying about it being fat and slow. What about an object representation, like Martijn suggested, that could live in a BLOB in an RDB?
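
A self-contained sketch of that hybrid arrangement, with sqlite3 standing in
for MSSQL and made-up table/column names: the usual flat columns for
querying, the original XML kept alongside them, and a pickled object
representation of the tree in a BLOB.

import pickle
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE articles (
        id        INTEGER PRIMARY KEY,
        headline  TEXT,
        byline    TEXT,
        pub_date  TEXT,
        nitf_xml  TEXT,      -- the original, unflattened document
        tree_blob BLOB       -- pickled object representation of the tree
    )
""")

nitf_xml = ('<nitf><body><body.head><hedline><hl1>Budget approved</hl1>'
            '</hedline><byline>By Jane Doe</byline></body.head></body></nitf>')
doc = ET.fromstring(nitf_xml)

# A trivial nested-tuple representation; anything picklable would do.
def to_tuples(elem):
    return (elem.tag, dict(elem.attrib), elem.text,
            [to_tuples(child) for child in elem])

conn.execute(
    "INSERT INTO articles (headline, byline, pub_date, nitf_xml, tree_blob) "
    "VALUES (?, ?, ?, ?, ?)",
    (doc.findtext(".//hl1"), doc.findtext(".//byline"), "2001-10-22",
     nitf_xml, pickle.dumps(to_tuples(doc))),
)

# Tabular queries work as usual; the full tree is one unpickle away.
headline, blob = conn.execute(
    "SELECT headline, tree_blob FROM articles").fetchone()
print(headline, pickle.loads(blob)[0])   # -> Budget approved nitf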

Wade Leftwich
Ithaca, NY



10/22/2001 2:26:41 PM, Martijn Pieters <mj@zope.com> wrote:

>[...]
>Using Zope objects instead of XML would buy you speed and scalability. DOMs
>are memory hogs, and CPU intensive to build and manipulate. Only use a DOM
>when a custom API isn't feasible. Don't carry around all that extra weight
>if you can avoid it!