Mailing List Archive: XML Considered Harmful

XML Considered Harmful

michael.stemper at gmail

Sep 21, 2021, 11:12 AM

Post #1 of 75 (668 views)

On the prolog thread, somebody posted a link to:
<https://dirtsimple.org/2004/12/python-is-not-java.html>

One thing that it tangentially says is "XML is not the answer."

I read this page right when I was about to write an XML parser
to get data into the code for a research project I'm working on.
It seems to me that XML is the right approach for this sort of
thing, especially since the data is hierarchical in nature.

Does the advice on that page mean that I should find some other
way to get data into my programs, or does it refer to some kind
of misuse/abuse of XML for something that it wasn't designed
for?

If XML is not the way to package data, what is the recommended
approach?
--
Michael F. Stemper
Life's too important to take seriously.
--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful [ In reply to ]

python-list at python

Sep 21, 2021, 11:42 AM

Post #2 of 75 (668 views)

On 2021-09-21, Michael F. Stemper <michael.stemper@gmail.com> wrote:
> On the prolog thread, somebody posted a link to:
><https://dirtsimple.org/2004/12/python-is-not-java.html>
>
> One thing that it tangentially says is "XML is not the answer."
>
> I read this page right when I was about to write an XML parser
> to get data into the code for a research project I'm working on.
> It seems to me that XML is the right approach for this sort of
> thing, especially since the data is hierarchical in nature.
>
> Does the advice on that page mean that I should find some other
> way to get data into my programs, or does it refer to some kind
> of misuse/abuse of XML for something that it wasn't designed
> for?
>
> If XML is not the way to package data, what is the recommended
> approach?

I'd agree that you should not use XML unless the data is being supplied
already in XML format or perhaps if there is already a schema defined in
XML for exactly your purpose.

If there is nothing pre-existing to build upon then I'd suggest JSON.

If anyone suggests YAML, then you should just back slowly away while
speaking in a low calm voice until you have reached sufficient safe
distance, then turn and run.

Re: XML Considered Harmful [ In reply to ]

python-list at python

Sep 21, 2021, 11:49 AM

Post #3 of 75 (668 views)

On Tue, 21 Sep 2021 13:12:10 -0500, Michael F. Stemper wrote:

> On the prolog thread, somebody posted a link to:
> <https://dirtsimple.org/2004/12/python-is-not-java.html>
>
> One thing that it tangentially says is "XML is not the answer."
>
> I read this page right when I was about to write an XML parser to get
> data into the code for a research project I'm working on.
> It seems to me that XML is the right approach for this sort of thing,
> especially since the data is hierarchical in nature.
>
> Does the advice on that page mean that I should find some other way to
> get data into my programs, or does it refer to some kind of misuse/abuse
> of XML for something that it wasn't designed for?
>
> If XML is not the way to package data, what is the recommended approach?

1st, can I say: don't write your own XML parser. There are already a
number of existing parsers that should do everything you will need. This
is a wheel that does not need re-inventing.

2nd, if you are not generating the data then you have to use whatever data
format you are supplied.

As far as I can see, the main issue with XML is bloat: it tries to do too
many things and is a very verbose format, and the quantity of mark-up can
easily exceed the data contained within it.

Other formats such as JSON and CSV have far less overhead, although again
they are not always suitable.

As in all such cases, it is a matter of choosing the most appropriate tool
for the job in hand.

--
Antonym, n.:
The opposite of the word you're trying to think of.

Re: XML Considered Harmful [ In reply to ]

michael.stemper at gmail

Sep 21, 2021, 12:22 PM

Post #4 of 75 (668 views)

On 21/09/2021 13.49, alister wrote:
> On Tue, 21 Sep 2021 13:12:10 -0500, Michael F. Stemper wrote:
>
>> On the prolog thread, somebody posted a link to:
>> <https://dirtsimple.org/2004/12/python-is-not-java.html>
>>
>> One thing that it tangentially says is "XML is not the answer."
>>
>> I read this page right when I was about to write an XML parser to get
>> data into the code for a research project I'm working on.
>> It seems to me that XML is the right approach for this sort of thing,
>> especially since the data is hierarchical in nature.
>>
>> Does the advice on that page mean that I should find some other way to
>> get data into my programs, or does it refer to some kind of misuse/abuse
>> of XML for something that it wasn't designed for?
>>
>> If XML is not the way to package data, what is the recommended approach?
>
> 1st, can I say: don't write your own XML parser. There are already a
> number of existing parsers that should do everything you will need. This
> is a wheel that does not need re-inventing.

I was going to build it on top of xml.etree.ElementTree
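
For illustration, a loader built on xml.etree.ElementTree might look
roughly like the sketch below. The element and attribute names
(generators, generator, fuel, curve, point) and all sample values are
assumptions invented for this example; they are not taken from the
poster's actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical document layout: every element/attribute name is invented.
SAMPLE = """<generators>
  <generator name="Unit1">
    <fuel name="coal" uom="ton" heat_content="24.0" price="55.0"/>
    <curve name="summer">
      <point x="100" y="9.5"/>
      <point x="200" y="10.1"/>
    </curve>
  </generator>
</generators>
"""


def load_generators(xml_text):
    """Return a list of plain dicts built from the XML text."""
    root = ET.fromstring(xml_text)
    generators = []
    for gen in root.findall("generator"):
        fuel = gen.find("fuel")
        curves = {}
        for curve in gen.findall("curve"):
            # Each curve becomes an ordered list of (x, y) pairs.
            curves[curve.get("name")] = [
                (float(p.get("x")), float(p.get("y")))
                for p in curve.findall("point")
            ]
        generators.append({
            "name": gen.get("name"),
            "fuel": {
                "name": fuel.get("name"),
                "uom": fuel.get("uom"),
                "heat_content": float(fuel.get("heat_content")),
                "price": float(fuel.get("price")),
            },
            "curves": curves,
        })
    return generators


generators = load_generators(SAMPLE)
```

ElementTree does the parsing; the function only walks the resulting
tree and converts attribute strings to numbers.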

> 2nd if you are not generating the data then you have to use whatever data
> format you are supplied

It's my own research, so I can give myself the data in any format that I
like.

> as far as I can see the main issue with XML is bloat, it tries to do too
> many things & is a very verbose format, often the quantity of mark-up can
> easily exceed the data contained within it.
>
> other formats such as JSON & csv have far less overhead, although again
> not always suitable.

I've heard of JSON, but never done anything with it.

How does CSV handle hierarchical data? For instance, I have
generators[1], each of which has a name, a fuel and one or more
incremental heat rate curves. Each fuel has a name, UOM, heat content,
and price. Each incremental cost curve has a name, and a series of
ordered pairs (representing a piecewise linear curve).

Can CSV files model this sort of situation?

> As in all such cases it is a matter of choosing the most appropriate tool
> for the job in hand.

Naturally. That's what I'm exploring.

[1] The kind made of tons of iron and copper, filled with oil, and
rotating at 1800 rpm.

--
Michael F. Stemper
This sentence no verb.

Re: XML Considered Harmful [ In reply to ]

petef4+usenet at gmail

Sep 21, 2021, 2:21 PM

Post #5 of 75 (668 views)

"Michael F. Stemper" <michael.stemper@gmail.com> writes:

> On 21/09/2021 13.49, alister wrote:
>> On Tue, 21 Sep 2021 13:12:10 -0500, Michael F. Stemper wrote:
> It's my own research, so I can give myself the data in any format that I
> like.
>
>> as far as I can see the main issue with XML is bloat, it tries to do
>> too many things & is a very verbose format, often the quantity of
>> mark-up can easily exceed the data contained within it. other formats
>> such as JSON & csv have far less overhead, although again not always
>> suitable.
>
> I've heard of JSON, but never done anything with it.

Then you should certainly try to get a basic understanding of it. One
thing JSON shares with XML is that it is best left to machines to
produce and consume. Because both can be viewed in a text editor there
is a common misconception that they are easy to edit. Not so, commas are
a common bugbear in JSON and non-trivial edits in (XML unaware) text
editors are tricky.

Consider what overhead you should worry about. If you are concerned
about file sizes then XML, JSON and CSV should all compress to a similar
size.
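
One way to check the size claim against your own data is simply to
measure it. The sketch below writes the same invented records as CSV,
JSON and XML and compares raw and gzipped sizes. The record fields and
the XML layout are assumptions for illustration only, and the numbers
will differ for real data.

```python
import csv
import gzip
import io
import json
import xml.etree.ElementTree as ET

# Invented sample records; real data will compress differently.
records = [{"name": f"unit{i}", "x": i * 10, "y": i * 0.5} for i in range(500)]


def as_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "x", "y"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def as_json(rows):
    return json.dumps(rows).encode("utf-8")


def as_xml(rows):
    root = ET.Element("points")
    for row in rows:
        # XML attribute values must be strings.
        ET.SubElement(root, "point", {key: str(value) for key, value in row.items()})
    return ET.tostring(root)


# Map each format name to (raw size, gzipped size) in bytes.
sizes = {}
for label, encode in (("csv", as_csv), ("json", as_json), ("xml", as_xml)):
    raw = encode(records)
    sizes[label] = (len(raw), len(gzip.compress(raw)))

for label, (raw_size, gz_size) in sizes.items():
    print(f"{label}: {raw_size} bytes raw, {gz_size} bytes gzipped")
```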

> How does CSV handle hierarchical data? For instance, I have
> generators[1], each of which has a name, a fuel and one or more
> incremental heat rate curves. Each fuel has a name, UOM, heat content,
> and price. Each incremental cost curve has a name, and a series of
> ordered pairs (representing a piecewise linear curve).
>
> Can CSV files model this sort of situation?

The short answer is no. CSV files represent spreadsheet row-column
values with nothing fancier such as formulas or other redirections.

CSV is quite good as a lowest common denominator exchange format. I say
quite because I would characterize it by 8 attributes and you need to
pick a dialect such as MS Excel which sets out what those are. XML and
JSON are controlled much better. You can easily verify that you conform
to those and guarantee that *any* conformant parser can read your
content. XML is more powerful in that respect than JSON in that you can
define and enforce schemas. In your case the fuel name, UOM, etc. can be
validated with standard tools. In JSON all that checking is entirely
handled by the consuming program(s).

>> As in all such cases it is a matter of choosing the most appropriate tool
>> for the job in hand.
>
> Naturally. That's what I'm exploring.

You might also like to consider HDF5. It is targeted at large volumes of
scientific data and its capabilities are well above what you need.
MATLAB, Octave and Scilab use it as their native format. PyTables and
h5py provide Python/NumPy bindings to it.

--
Pete Forman

Re: XML Considered Harmful [ In reply to ]

python-list at python

Sep 21, 2021, 3:30 PM

Post #6 of 75 (668 views)

On Tue, 21 Sep 2021 14:22:52 -0500, Michael F. Stemper wrote:

> On 21/09/2021 13.49, alister wrote:
>> On Tue, 21 Sep 2021 13:12:10 -0500, Michael F. Stemper wrote:
>>
>>> On the prolog thread, somebody posted a link to:
>>> <https://dirtsimple.org/2004/12/python-is-not-java.html>
>>>
>>> One thing that it tangentially says is "XML is not the answer."
>>>
>>> I read this page right when I was about to write an XML parser to get
>>> data into the code for a research project I'm working on.
>>> It seems to me that XML is the right approach for this sort of thing,
>>> especially since the data is hierarchical in nature.
>>>
>>> Does the advice on that page mean that I should find some other way to
>>> get data into my programs, or does it refer to some kind of
>>> misuse/abuse of XML for something that it wasn't designed for?
>>>
>>> If XML is not the way to package data, what is the recommended
>>> approach?
>>
>> 1st, can I say: don't write your own XML parser. There are already a
>> number of existing parsers that should do everything you will need.
>> This is a wheel that does not need re-inventing.
>
> I was going to build it on top of xml.etree.ElementTree
>
So not writing a parser, just using one. That's OK.

>> 2nd if you are not generating the data then you have to use whatever
>> data format you are supplied
>
> It's my own research, so I can give myself the data in any format that I
> like.
>
>> as far as I can see the main issue with XML is bloat, it tries to do
>> too many things & is a very verbose format, often the quantity of
>> mark-up can easily exceed the data contained within it.
>>
>> other formats such as JSON & csv have far less overhead, although again
>> not always suitable.
>
> I've heard of JSON, but never done anything with it.
The Python json library makes it simple.
It was originally invented for JavaScript, and it looks very much like the
repr of a list/dictionary, but if you are using the standard library you
don't really need to know the details except out of academic interest.
>
> How does CSV handle hierarchical data?
It doesn't. If you have hierarchical data, it is not a suitable format.
> For instance, I have
> generators[1], each of which has a name, a fuel and one or more
> incremental heat rate curves. Each fuel has a name, UOM, heat content,
> and price. Each incremental cost curve has a name, and a series of
> ordered pairs (representing a piecewise linear curve).
>
> Can CSV files model this sort of situation?
>
>> As in all such cases it is a matter of choosing the most appropriate
>> tool for the job in hand.
>
> Naturally. That's what I'm exploring.
>
>
> [1] The kind made of tons of iron and copper, filled with oil, and
> rotating at 1800 rpm.

--
Riches cover a multitude of woes.
-- Menander

Re: XML Considered Harmful [ In reply to ]

Sep 21, 2021, 3:49 PM

Post #7 of 75 (668 views)

ram@zedat.fu-berlin.de (Stefan Ram) writes:
<snip>
> - S expressions (i.e., LISP notation)

If you're looking at hierarchical data and you don't have some good
reason to use something else, this is very likely to be your simplest
option.
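
To make the suggestion concrete, here is a rough sketch of what the
generator data could look like as an S-expression, with a tiny reader
for it. The layout of the expression and the parse_sexpr helper are
hypothetical, written for this example; the reader handles only
parentheses, double-quoted strings without escapes, numbers and bare
symbols.

```python
import re

# Tokens: "(", ")", a double-quoted string (no escapes), or a bare atom.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')

# Invented sample data in an invented layout.
SAMPLE = '''
(generator "Unit1"
  (fuel "coal" "ton" 24.0 55.0)
  (curve "summer" (100 9.5) (200 10.1)))
'''


def parse_sexpr(text):
    """Parse one S-expression into nested Python lists."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # skip the closing parenthesis
            return items
        if tok == ")":
            raise ValueError("unexpected ')'")
        if tok.startswith('"'):
            return tok[1:-1]
        try:
            return float(tok)
        except ValueError:
            return tok  # a bare symbol such as: generator

    result = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return result


parsed = parse_sexpr(SAMPLE)
```

The nesting of the parentheses maps directly onto nested lists, which
is the appeal of the notation for hierarchical data.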

Re: XML Considered Harmful [ In reply to ]

python-list at python

Sep 21, 2021, 3:58 PM

Post #8 of 75 (668 views)

On 2021-09-21, Pete Forman <petef4+usenet@gmail.com> wrote:
> CSV is quite good as a lowest common denominator exchange format. I say
> quite because I would characterize it by 8 attributes and you need to
> pick a dialect such as MS Excel which sets out what those are. XML and
> JSON are controlled much better. You can easily verify that you conform
> to those and guarantee that *any* conformant parser can read your
> content. XML is more powerful in that respect than JSON in that you can
> define and enforce schemas. In your case the fuel name, UOM, etc. can be
> validated with standard tools. In JSON all that checking is entirely
> handled by the consuming program(s).

That's not true. You can use "JSON Schema" to create a schema
for validating JSON files, and there appear to be at least four
implementations in Python.

Re: XML Considered Harmful [ In reply to ]

Sep 21, 2021, 5:30 PM

Post #9 of 75 (668 views)

In comp.lang.python, Michael F. Stemper <michael.stemper@gmail.com> wrote:
> I've heard of JSON, but never done anything with it.

You probably have used it inadvertently on a regular basis over the
past few years. Websites live on it.

> How does CSV handle hierarchical data? For instance, I have
> generators[1], each of which has a name, a fuel and one or more
> incremental heat rate curves. Each fuel has a name, UOM, heat content,
> and price. Each incremental cost curve has a name, and a series of
> ordered pairs (representing a piecewise linear curve).
>
> Can CSV files model this sort of situation?

Can a string of ones and zeros encode the sounds of Bach, the images
of his sheet music, the details to reproduce his bust in melted plastic
extruded from nozzle under the control of machines?

Yes, CSV files can model that. But it would not be my first choice of
data format. (Neither would JSON.) I'd probably use XML.

I rather suspect that all (many) of those genomes that end up in
Microsoft Excel files get there via a CSV export from a command line
tool. Once you can model life in CSV, everything seems possible.

> [1] The kind made of tons of iron and copper, filled with oil, and
> rotating at 1800 rpm.

Those are rather hard to model in CSV, too, but I'm sure it could be
done.

Elijah
------
for bonus round, use punched holes in paper to encode the ones and zeros

Re: XML Considered Harmful [ In reply to ]

Sep 21, 2021, 6:27 PM

Post #10 of 75 (668 views)

Eli the Bearded <*@eli.users.panix.com> writes:

> In comp.lang.python, Michael F. Stemper <michael.stemper@gmail.com> wrote:
>> I've heard of JSON, but never done anything with it.
>
> You probably have used it inadvertently on a regular basis over the
> past few years. Websites live on it.

If the user has any interaction whatever with the formats being used to
transfer data then something is very, very wrong. Someone using a
website built on JSON isn't using JSON in any meaningful sense of the
term.

>> How does CSV handle hierarchical data? For instance, I have
>> generators[1], each of which has a name, a fuel and one or more
>> incremental heat rate curves. Each fuel has a name, UOM, heat content,
>> and price. Each incremental cost curve has a name, and a series of
>> ordered pairs (representing a piecewise linear curve).
>>
>> Can CSV files model this sort of situation?
>
> Can a string of ones and zeros encode the sounds of Bach, the images
> of his sheet music, the details to reproduce his bust in melted plastic
> extruded from nozzle under the control of machines?
>
> Yes, CSV files can model that. But it would not be my first choice of
> data format. (Neither would JSON.) I'd probably use XML.
>
> I rather suspect that all (many) of those genomes that end up in
> Microsoft Excel files get there via a CSV export from a command line
> tool. Once you can model life in CSV, everything seems possible.

Whenever someone asks "can this be done?" in any sort of computer
related question, the real question is "is this practical?" I have hazy
memories of seeing a Turing Machine implemented in an Excel spreadsheet,
so *anything* can, with sufficiently ridiculous amounts of work. That's
not really helpful here.

>> [1] The kind made of tons of iron and copper, filled with oil, and
>> rotating at 1800 rpm.
>
> Those are rather hard to model in CSV, too, but I'm sure it could be
> done.

So let's try to point him at representations that are easy.

Re: XML Considered Harmful [ In reply to ]

ethan at stoneleaf

Sep 21, 2021, 7:36 PM

Post #11 of 75 (668 views)

On 9/21/21 11:12 AM, Michael F. Stemper wrote:

> It seems to me that XML is the right approach for this sort of
> thing, especially since the data is hierarchical in nature.

If you're looking for a format that you can read (as a human) and possibly hand-edit,
check out NestedText:

https://nestedtext.org/en/stable/

--
~Ethan~

Re: XML Considered Harmful [ In reply to ]

drsalists at gmail

Sep 21, 2021, 7:46 PM

Post #12 of 75 (668 views)

On Tue, Sep 21, 2021 at 7:26 PM Michael F. Stemper <
michael.stemper@gmail.com> wrote:

> If XML is not the way to package data, what is the recommended
> approach?
>

I prefer both JSON and YAML over XML.

XML has both elements and attributes, but it didn't really need both. This
results in more complexity than necessary. Also, XSLT and XPath are not
really all that simple.

But there's hope. If you're stuck with XML, you can use xmltodict, which
makes XML almost as easy as JSON.

HTH.

Re: XML Considered Harmful [ In reply to ]

petef4+usenet at gmail

Sep 22, 2021, 12:56 AM

Post #13 of 75 (661 views)

Jon Ribbens <jon+usenet@unequivocal.eu> writes:

> On 2021-09-21, Pete Forman <petef4+usenet@gmail.com> wrote:
>> CSV is quite good as a lowest common denominator exchange format. I
>> say quite because I would characterize it by 8 attributes and you
>> need to pick a dialect such as MS Excel which sets out what those
>> are. XML and JSON are controlled much better. You can easily verify
>> that you conform to those and guarantee that *any* conformant parser
>> can read your content. XML is more powerful in that respect than JSON
>> in that you can define and enforce schemas. In your case the fuel
>> name, UOM, etc. can be validated with standard tools. In JSON all
>> that checking is entirely handled by the consuming program(s).
>
> That's not true. You can use "JSON Schema" to create a schema for
> validating JSON files, and there appear to be at least four
> implementations in Python.

Fair point. It has been a while since I looked at JSON schemas and they
were rather less mature then.

--
Pete Forman

Re: XML Considered Harmful [ In reply to ]

michael.stemper at gmail

Sep 22, 2021, 7:40 AM

Post #14 of 75 (661 views)

On 21/09/2021 16.21, Pete Forman wrote:
> "Michael F. Stemper" <michael.stemper@gmail.com> writes:
>> On 21/09/2021 13.49, alister wrote:
>>> On Tue, 21 Sep 2021 13:12:10 -0500, Michael F. Stemper wrote:
>> It's my own research, so I can give myself the data in any format that I
>> like.
>>
>>> as far as I can see the main issue with XML is bloat, it tries to do
>>> too many things & is a very verbose format, often the quantity of
>>> mark-up can easily exceed the data contained within it. other formats
>>> such as JSON & csv have far less overhead, although again not always
>>> suitable.
>>
>> I've heard of JSON, but never done anything with it.
>
> Then you should certainly try to get a basic understanding of it. One
> thing JSON shares with XML is that it is best left to machines to
> produce and consume. Because both can be viewed in a text editor there
> is a common misconception that they are easy to edit. Not so, commas are
> a common bugbear in JSON and non-trivial edits in (XML unaware) text
> editors are tricky.

Okay, after playing around with the example in Lubanovic's book[1]
I've managed to create a dict of dicts of dicts and write it to a
json file. It seems to me that this is how json handles hierarchical
data. Is that understanding correct?

Is this then the process that I would use to create a *.json file
to provide data to my various programs? Copy and paste the current
hard-coded assignment statements into REPL, use json.dump(dict,fp)
to write it to a file, and then read the file into each program
with json.load(fp)? (Actually, I'd write a function to do that,
just as I would with XML.)
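
A minimal sketch of that round trip, assuming the data is held as
nested dicts: the sample values, the file name generators.json and the
save/load helper names are all invented for illustration.

```python
import json
import os
import tempfile

# Invented sample data: a dict of dicts of dicts, plus lists for the curves.
generators = {
    "Unit1": {
        "fuel": {"name": "coal", "uom": "ton", "heat_content": 24.0, "price": 55.0},
        "curves": {"summer": [[100, 9.5], [200, 10.1]]},
    }
}


def save(data, path):
    """Write the nested structure to a JSON file."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(data, fp, indent=2)


def load(path):
    """Read the nested structure back from a JSON file."""
    with open(path, encoding="utf-8") as fp:
        return json.load(fp)


# Use a temporary directory so the example does not clutter the working dir.
path = os.path.join(tempfile.mkdtemp(), "generators.json")
save(generators, path)
restored = load(path)
```

Two things to keep in mind: JSON has no tuple type, so ordered pairs
written as tuples come back as lists, and JSON object keys are always
strings.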

> Consider what overhead you should worry about. If you are concerned
> about file sizes then XML, JSON and CSV should all compress to a similar
> size.

Not a concern at all for my current application.

>> How does CSV handle hierarchical data? For instance, I have
>> generators[1], each of which has a name, a fuel and one or more
>> incremental heat rate curves. Each fuel has a name, UOM, heat content,
>> and price. Each incremental cost curve has a name, and a series of
>> ordered pairs (representing a piecewise linear curve).
>>
>> Can CSV files model this sort of situation?
>
> The short answer is no. CSV files represent spreadsheet row-column
> values with nothing fancier such as formulas or other redirections.

Okay, that was what I suspected.

> CSV is quite good as a lowest common denominator exchange format. I say
> quite because I would characterize it by 8 attributes and you need to
> pick a dialect such as MS Excel which sets out what those are. XML and
> JSON are controlled much better. You can easily verify that you conform
> to those and guarantee that *any* conformant parser can read your
> content. XML is more powerful in that respect than JSON in that you can
> define and enforce schemas. In your case the fuel name, UOM, etc. can be
> validated with standard tools.

Yeah, validating against a DTD is pretty easy, since lxml.etree does all
of the work.

> In JSON all that checking is entirely
> handled by the consuming program(s).
Well, the consumer's (almost) always going to need to do *some*
validation. For instance, as far as I can tell, a DTD can't specify
that there must be at least two of a particular item.

The designers of DTD seem to have taken the advice of MacLennan[2]:
"The only reasonable numbers are zero, one, or infinity."

Which is great until you need to make sure that you have enough
points to define at least one line segment.
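
A check of that kind is easy to do in the consuming program, whatever
the file format. The sketch below is one possible shape for it: the
function name check_curves and the data layout (a dict mapping curve
names to lists of (x, y) pairs) are assumptions, and the rule that x
values must be strictly increasing is an extra assumption added for the
example.

```python
def check_curves(curves):
    """Return a list of error messages; an empty list means the data passed."""
    errors = []
    for name, points in curves.items():
        # A piecewise linear curve needs at least one segment, i.e. two points.
        if len(points) < 2:
            errors.append(
                f"curve {name!r}: need at least 2 points, got {len(points)}"
            )
            continue
        xs = [x for x, _ in points]
        # Assumed rule: x values must be strictly increasing.
        if any(b <= a for a, b in zip(xs, xs[1:])):
            errors.append(f"curve {name!r}: x values must be strictly increasing")
    return errors


good = {"summer": [(100, 9.5), (200, 10.1)]}
bad = {"winter": [(100, 9.5)], "spring": [(200, 1.0), (100, 2.0)]}

good_errors = check_curves(good)
bad_errors = check_curves(bad)
```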

>>> As in all such cases it is a matter of choosing the most appropriate tool
>>> for the job in hand.
>>
>> Naturally. That's what I'm exploring.
>
> You might also like to consider HDF5. It is targeted at large volumes of
> scientific data and its capabilities are well above what you need.

Yeah, I won't be looking at more than five or ten generators at most. A
small number is enough to confirm or refute the behavior that I'm
testing.

[1] _Introducing Python: Modern Computing in Simple Packages_,
Second Release, (c) 2015, Bill Lubanovic, O'Reilly Media, Inc.
[2] _Principles of Programming Languages: Design, Evaluation,
and Implementation_, Second Edition, (c) 1987, Bruce J. MacLennan,
Holt, Rinehart, & Winston
--
Michael F. Stemper
No animals were harmed in the composition of this message.

Re: XML Considered Harmful [ In reply to ]

michael.stemper at gmail

Sep 22, 2021, 7:52 AM

Post #15 of 75 (661 views)

On 21/09/2021 19.30, Eli the Bearded wrote:
> In comp.lang.python, Michael F. Stemper <michael.stemper@gmail.com> wrote:
>> I've heard of JSON, but never done anything with it.
>
> You probably have used it inadvertently on a regular basis over the
> past few years. Websites live on it.

I used to use javascript when I was running Windows (up until 2009),
since it was the only programming language to which I had ready
access. Then I got a linux box and quickly discovered python. I
dropped javascript like a hot potato.

>> How does CSV handle hierarchical data? For instance, I have
>> generators[1], each of which has a name, a fuel and one or more
>> incremental heat rate curves. Each fuel has a name, UOM, heat content,
>> and price. Each incremental cost curve has a name, and a series of
>> ordered pairs (representing a piecewise linear curve).
>>
>> Can CSV files model this sort of situation?
>
> Can a string of ones and zeros encode the sounds of Bach, the images
> of his sheet music, the details to reproduce his bust in melted plastic
> extruded from nozzle under the control of machines?
>
> Yes, CSV files can model that. But it would not be my first choice of
> data format. (Neither would JSON.) I'd probably use XML.

Okay. 'Go not to the elves for counsel, for they will say both no
and yes.' (I'm not actually surprised to find differences of opinion.)

>> [1] The kind made of tons of iron and copper, filled with oil, and
>> rotating at 1800 rpm.
>
> Those are rather hard to model in CSV, too, but I'm sure it could be
> done.

> for bonus round, use punched holes in paper to encode the ones and zeros

I've done cardboard.

--
Michael F. Stemper
No animals were harmed in the composition of this message.

Re: XML Considered Harmful [ In reply to ]

Sep 22, 2021, 9:31 AM

Post #16 of 75 (661 views)

On Tue, 21 Sep 2021 13:12:10 -0500, "Michael F. Stemper"
<michael.stemper@gmail.com> declaimed the following:

>On the prolog thread, somebody posted a link to:
><https://dirtsimple.org/2004/12/python-is-not-java.html>
>
>One thing that it tangentially says is "XML is not the answer."
>
>I read this page right when I was about to write an XML parser
>to get data into the code for a research project I'm working on.
>It seems to me that XML is the right approach for this sort of
>thing, especially since the data is hierarchical in nature.
>
>Does the advice on that page mean that I should find some other
>way to get data into my programs, or does it refer to some kind
>of misuse/abuse of XML for something that it wasn't designed
>for?

There are some that try to use XML as a /live/ data /storage/ format
(such as http://www.drivehq.com/web/brana/pandora.htm which has to parse
XML files for all configuration data and filter definitions on start-up,
and update those files on any changes).

If you control both the data generation and the data consumption,
finding some format with less overhead than XML is probably to be
recommended. XML is more a self-documented (in theory) means of packaging
data for transport between widely disparate applications, which are likely
written by different teams, if not different companies, who only interface
via the definition of the data as seen by XML.

>
>If XML is not the way to package data, what is the recommended
>approach?

Again, if you control both generation and consumption... I'd probably
use an RDBMS. SQLite tends to be packaged with Python [Windows] or, at the
least, the DB-API adapter [Linux tends to expect SQLite as a standard
installed item]. SQLite is a "file server" model (as is the JET engine used
by M$ Access) -- each application (instance) is directly accessing the
database file; there is no server process mediating access.

Hierarchical (since you mention that in later posts) would be
represented by relations (terminology from relational theory -- a "table"
to most) linked by foreign keys.
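
As a rough sketch of that idea using the standard sqlite3 module: the
table and column names below are assumptions made up for this example,
as are the sample values. Each level of the hierarchy becomes a table,
and each child row points at its parent through a foreign key.

```python
import sqlite3

# Hypothetical schema: one table per level of the hierarchy.
SCHEMA = """
CREATE TABLE fuel (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,
    uom          TEXT NOT NULL,
    heat_content REAL NOT NULL,
    price        REAL NOT NULL
);
CREATE TABLE generator (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE,
    fuel_id INTEGER NOT NULL REFERENCES fuel(id)
);
CREATE TABLE curve (
    id           INTEGER PRIMARY KEY,
    generator_id INTEGER NOT NULL REFERENCES generator(id),
    name         TEXT NOT NULL
);
CREATE TABLE point (
    curve_id INTEGER NOT NULL REFERENCES curve(id),
    seq      INTEGER NOT NULL,  -- keeps the ordered pairs in order
    x        REAL NOT NULL,
    y        REAL NOT NULL,
    PRIMARY KEY (curve_id, seq)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript(SCHEMA)

fuel_id = conn.execute(
    "INSERT INTO fuel (name, uom, heat_content, price) VALUES (?, ?, ?, ?)",
    ("coal", "ton", 24.0, 55.0),
).lastrowid
gen_id = conn.execute(
    "INSERT INTO generator (name, fuel_id) VALUES (?, ?)", ("Unit1", fuel_id)
).lastrowid
curve_id = conn.execute(
    "INSERT INTO curve (generator_id, name) VALUES (?, ?)", (gen_id, "summer")
).lastrowid
conn.executemany(
    "INSERT INTO point (curve_id, seq, x, y) VALUES (?, ?, ?, ?)",
    [(curve_id, 0, 100.0, 9.5), (curve_id, 1, 200.0, 10.1)],
)

# Walk back down the hierarchy with joins.
points = conn.execute(
    """
    SELECT p.x, p.y
    FROM point AS p
    JOIN curve AS c ON c.id = p.curve_id
    JOIN generator AS g ON g.id = c.generator_id
    WHERE g.name = ? AND c.name = ?
    ORDER BY p.seq
    """,
    ("Unit1", "summer"),
).fetchall()
```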

--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/


Re: XML Considered Harmful [ In reply to ]

Sep 22, 2021, 3:37 PM

Post #17 of 75 (661 views)

On Wed, 22 Sep 2021 09:52:59 -0500, "Michael F. Stemper"
<michael.stemper@gmail.com> declaimed the following:

>On 21/09/2021 19.30, Eli the Bearded wrote:
>> In comp.lang.python, Michael F. Stemper <michael.stemper@gmail.com> wrote:
>>> How does CSV handle hierarchical data? For instance, I have
>>> generators[1], each of which has a name, a fuel and one or more
>>> incremental heat rate curves. Each fuel has a name, UOM, heat content,
>>> and price. Each incremental cost curve has a name, and a series of
>>> ordered pairs (representing a piecewise linear curve).
>>>
>>> Can CSV files model this sort of situation?
>>
<SNIP>
>> Yes, CSV files can model that. But it would not be my first choice of
>> data format. (Neither would JSON.) I'd probably use XML.
>
>Okay. 'Go not to the elves for counsel, for they will say both no
>and yes.' (I'm not actually surprised to find differences of opinion.)
>
You'd have to include a "level" field (and/or a data type, if multiple
objects can be at the same level) as the first field in the CSV, which
identifies how to parse the rest of the record. (Technically, the CSV
module has already "parsed" it, in terms of splitting at commas and
handling quoted strings, which may contain commas that are not split points.)

1-generator, name
2-fuel, name, UOM, heat-content, price
2-curve, name
3-point, X, Y
3-point, X, Y
...
2-curve, name
3-point, X, Y
3-point, X, Y
...

You extract objects at each level; if the level is the same or "lower"
(numerically -- higher in hierarchy) you attach the "previously" extracted
object to the parent object... Whether list or dictionary, or class
instance(s):

class Point():
    # Point may be overkill, easier to just use a tuple (X, Y)
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

class Curve():
    def __init__(self, name):
        self.name = name
        self.points = []

# use as aCurve.points.append(currentPoint)

class Fuel():
    def __init__(self, name, ..., price):
        self.name = name
        ...
        self.price = price

class Generator():
    def __init__(self, name):
        self.name = name
        self.fuel = None
        self.curves = []

# aGenerator.fuel = currentFuel
# aGenerator.curves.append(currentCurve)
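
A reader for a level-tagged file like the one above could be sketched
as follows. This is only one possible reading of the scheme: it uses
plain dicts and tuples instead of classes, attaches each object to its
parent as soon as it is created, and assumes the record tags are
written exactly as "1-generator", "2-fuel", "2-curve" and "3-point".
The sample values are invented.

```python
import csv
import io

# Invented sample data following the level-tagged layout.
SAMPLE = """\
1-generator,Unit1
2-fuel,coal,ton,24.0,55.0
2-curve,summer
3-point,100,9.5
3-point,200,10.1
2-curve,winter
3-point,100,9.9
3-point,200,10.6
"""


def parse(text):
    """Build a list of nested generator dicts from level-tagged CSV text."""
    generators = []
    gen = curve = None  # the most recently seen parent objects
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        tag = row[0].strip()
        fields = [field.strip() for field in row[1:]]
        if tag == "1-generator":
            gen = {"name": fields[0], "fuel": None, "curves": []}
            generators.append(gen)
            curve = None
        elif tag == "2-fuel":
            name, uom, heat_content, price = fields
            gen["fuel"] = {
                "name": name,
                "uom": uom,
                "heat_content": float(heat_content),
                "price": float(price),
            }
        elif tag == "2-curve":
            curve = {"name": fields[0], "points": []}
            gen["curves"].append(curve)
        elif tag == "3-point":
            curve["points"].append((float(fields[0]), float(fields[1])))
        else:
            raise ValueError(f"unknown record type: {tag!r}")
    return generators


generators = parse(SAMPLE)
```

The csv module only splits the fields; all of the hierarchy lives in
the tag convention and in the code that interprets it.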

--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/


Re: XML Considered Harmful [ In reply to ]

auriocus at gmx

Sep 23, 2021, 2:21 AM

Post #18 of 75 (642 views)

Am 22.09.21 um 16:52 schrieb Michael F. Stemper:
> On 21/09/2021 19.30, Eli the Bearded wrote:
>> Yes, CSV files can model that. But it would not be my first choice of
>> data format. (Neither would JSON.) I'd probably use XML.
>
> Okay. 'Go not to the elves for counsel, for they will say both no
> and yes.' (I'm not actually surprised to find differences of opinion.)

That is wrong: CSV has no model of hierarchical data. A CSV file is a 2D
table, just like a database table or an Excel sheet.

You can /layer/ high-dimensional data on top of a 2D table, there is the
relational algebra theory behind this, but it is wrong (or misleading at
best) to say that CSV can model hierarchical data.

It's the same as saying "CSV supports images". Of course it doesn't; it's
a text file, but you could encode a JPEG as base64 and then put this
string into the cell of a CSV table. That definitely isn't what a sane
person would understand as "support".
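
The base64 trick can be sketched in a few lines, which also shows why
it is layering rather than support: the CSV file only ever sees an
opaque string. The file name logo.jpg and the stand-in payload are
invented for the example; no real image is involved.

```python
import base64
import csv
import io

# Stand-in for the bytes of an image file.
payload = bytes(range(256))

# Write one CSV row: a name and the payload encoded as base64 text.
buf = io.StringIO()
csv.writer(buf).writerow(["logo.jpg", base64.b64encode(payload).decode("ascii")])

# Read the row back and decode the cell; CSV itself knows nothing about images.
name, cell = next(csv.reader(io.StringIO(buf.getvalue())))
decoded = base64.b64decode(cell)
```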

Christian


Re: XML Considered Harmful [ In reply to ]

mats at wichmann

Sep 23, 2021, 5:53 AM

Post #19 of 75 (642 views)

On 9/22/21 10:31, Dennis Lee Bieber wrote:

> If you control both the data generation and the data consumption,
> finding some format ...

This is really the key. I rant at people seeming to believe that csv is
THE data interchange format, and it's about as bad as it gets at that,
if you have a choice. xml is noisy but at least (potentially)
self-documenting, and ought to be able to recover from certain errors.
The problem with csv is that a substantial chunk of the world seems to
live inside Excel, and so data is commonly both generated in csv so it
can be imported into excel and generated in csv as a result of exporting
from excel, so the parts often are *not* in your control.

Sigh.

--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful [ In reply to ]

rosuav at gmail

Sep 23, 2021, 6:27 AM

Post #20 of 75 (642 views)

On Thu, Sep 23, 2021 at 10:55 PM Mats Wichmann <mats@wichmann.us> wrote:
>
> On 9/22/21 10:31, Dennis Lee Bieber wrote:
>
> > If you control both the data generation and the data consumption,
> > finding some format ...
>
> This is really the key. I rant at people seeming to believe that csv is
> THE data interchange format, and it's about as bad as it gets at that,
> if you have a choice. xml is noisy but at least (potentially)
> self-documenting, and ought to be able to recover from certain errors.
> The problem with csv is that a substantial chunk of the world seems to
> live inside Excel, and so data is commonly both generated in csv so it
> can be imported into excel and generated in csv as a result of exporting
> from excel, so the parts often are *not* in your control.
>
> Sigh.

The only people who think that CSV is *the* format are people who
habitually live in spreadsheets. People who move data around the
internet, from program to program, are much more likely to assume that
JSON is the sole format. Of course, there is no single ultimate data
interchange format, but JSON is a lot closer to one than CSV is.

(Or to be more precise: any such thing as a "single ultimate data
interchange format" will be so generic that it isn't enough to define
everything. For instance, "a stream of bytes" is a universal data
interchange format, but that's not ultimately a very useful claim.)

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful [ In reply to ]

michael.stemper at gmail

Sep 23, 2021, 10:23 AM

Post #21 of 75 (642 views)

On 22/09/2021 17.37, Dennis Lee Bieber wrote:
> On Wed, 22 Sep 2021 09:52:59 -0500, "Michael F. Stemper"
> <michael.stemper@gmail.com> declaimed the following:
>> On 21/09/2021 19.30, Eli the Bearded wrote:
>>> In comp.lang.python, Michael F. Stemper <michael.stemper@gmail.com> wrote:

>>>> How does CSV handle hierarchical data? For instance, I have

>>>> Can CSV files model this sort of situation?
>>>
> <SNIP>
>>> Yes, CSV files can model that. But it would not be my first choice of
>>> data format. (Neither would JSON.) I'd probably use XML.
>>
>> Okay. 'Go not to the elves for counsel, for they will say both no
>> and yes.' (I'm not actually surprised to find differences of opinion.)
>>
> You'd have to include a "level" (and/or data type if multiple objects
> can be at the same level) field (as the first field) in CSV which
> identifies how to parse the rest of the CSV data (well, technically, the
> CSV module has "parsed" it -- in terms of splitting at commas, handling
> quoted strings (which may contain commas which are not split points), etc.).
>
> 1-generator, name
> 2-fuel, name, UOM, heat-content, price
> 2-curve, name
> 3-point, X, Y
> 3-point, X, Y
> ...
> 2-curve, name
> 3-point, X, Y
> 3-point, X, Y

This reminds me of how my (former) employer imported data models into
our systems from the 1970s until the mid-2000s. We had 80-column records
(called "card images"), that would have looked like:

FUEL0 LIGNITE TON 13.610 043.581
UNIT1 COAL CREK1
UNIT2 ...

The specific columns for the start and end of each field on each record
were defined in a thousand-plus page document. (We modeled all of a
power system, not just economic data about generators.)

However, this doesn't seem like it would fit too well with the csv
module, since it requires a lot more logic on the part of the consuming
program.
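
To give a feel for how much logic that is, here is a rough sketch of a consumer for the level-tagged layout quoted above. The record tags and field order follow the quoted example; the sample values, the dict layout, and the function name load_generators are made up for illustration.

```python
import csv
import io

SAMPLE = """\
1-generator, unit-1
2-fuel, lignite, ton, 13.61, 43.58
2-curve, heat-rate
3-point, 100, 9.5
3-point, 200, 9.1
2-curve, emissions
3-point, 100, 1.2
"""


def load_generators(text):
    """Rebuild the hierarchy from level-tagged CSV rows."""
    generators = []
    generator = None  # most recent level-1 record
    curve = None      # most recent level-2 curve record
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue
        tag, fields = row[0], row[1:]
        if tag == "1-generator":
            generator = {"name": fields[0], "fuel": None, "curves": []}
            generators.append(generator)
            curve = None
        elif tag == "2-fuel":
            name, uom, heat, price = fields
            generator["fuel"] = {"name": name, "uom": uom,
                                 "heat_content": float(heat),
                                 "price": float(price)}
        elif tag == "2-curve":
            curve = {"name": fields[0], "points": []}
            generator["curves"].append(curve)
        elif tag == "3-point":
            x, y = fields
            curve["points"].append((float(x), float(y)))
        else:
            raise ValueError(f"unknown record type: {tag!r}")
    return generators


generators = load_generators(SAMPLE)
```

The csv module only splits the fields; tracking which generator and curve a row belongs to is all done by hand, and a file with a point before any curve would fail in this sketch.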

Interesting flashback, though.

--
Michael F. Stemper
Deuteronomy 24:17
--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful [ In reply to ]

Sep 23, 2021, 10:51 AM

Post #22 of 75 (642 views)

In comp.lang.python, Christian Gollwitzer <auriocus@gmx.de> wrote:
> Am 22.09.21 um 16:52 schrieb Michael F. Stemper:
>> On 21/09/2021 19.30, Eli the Bearded wrote:
>>> Yes, CSV files can model that. But it would not be my first choice of
>>> data format. (Neither would JSON.) I'd probably use XML.
>> Okay. 'Go not to the elves for counsel, for they will say both no
>> and yes.' (I'm not actually surprised to find differences of opinion.)

Well, I have a recommendation with my answer.

> It's the same as saying "CSV supports images". Of course it doesn't, it's
> a text file, but you could encode a JPEG as base64 and then put this
> string into the cell of a CSV table. That definitely isn't what a sane
> person would understand as "support".

I'd use one of the netpbm formats instead of JPEG. PBM for one bit
bitmaps, PGM for one channel (typically grayscale), PPM for three
channel RGB, and PAM for anything else (two channel gray plus alpha,
CMYK, RGBA, HSV, YCbCr, and more exotic formats). JPEG is tricky to
map to CSV since it is a three channel format (YCbCr), where the
channels are typically not at the same resolution. Usually Y is full
size and the Cb and Cr channels are one quarter size ("4:2:0 chroma
subsampling"). The unequal size of the channels does not lend itself
to CSV, but I can't say it's impossible.

But maybe you meant the whole JFIF or Exif JPEG file format base64
encoded with no attempt to understand the image. That sort of thing
is common in JSON, and I've seen it in YAML, too. It wouldn't surprise
me if people do that in CSV or XML, but I have so far avoided seeing
that. I used that method for sticking a tiny PNG in a CSS file just
earlier this month. The whole PNG was smaller than the typical headers
of an HTTP/1.1 request and response, so I figured "don't make it a
separate file".
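
For anyone who has not done this, a minimal sketch of the embedding trick: base64-encode the bytes and wrap them in a data: URI. The helper name data_uri is made up, and the payload here is only the fixed PNG signature rather than a complete image.

```python
import base64


def data_uri(payload: bytes, mime_type: str) -> str:
    """Return a data: URI that embeds payload as base64."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The fixed 8-byte PNG signature; a real file continues with chunks.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# JPEG files start with the SOI marker FF D8, followed by another FF.
JPEG_START = b"\xff\xd8\xff"

png_prefix = base64.b64encode(PNG_SIGNATURE).decode("ascii")
jpeg_prefix = base64.b64encode(JPEG_START).decode("ascii")
uri = data_uri(PNG_SIGNATURE, "image/png")
```

Because the leading bytes of these formats are fixed, the first few base64 characters are fixed too, which is why they become recognizable after a while.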

Elijah
------
can at this point recognize a bunch of "magic numbers" in base64

--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful [ In reply to ]

michael.stemper at gmail

Sep 23, 2021, 1:06 PM

Post #23 of 75 (642 views)

On 23/09/2021 12.51, Eli the Bearded wrote:
>> Am 22.09.21 um 16:52 schrieb Michael F. Stemper:
>>> On 21/09/2021 19.30, Eli the Bearded wrote:
>>>> Yes, CSV files can model that. But it would not be my first choice of
>>>> data format. (Neither would JSON.) I'd probably use XML.
>>> Okay. 'Go not to the elves for counsel, for they will say both no
>>> and yes.' (I'm not actually surprised to find differences of opinion.)
>
> Well, I have a recommendation with my answer.

Sorry, didn't mean that to be disparaging.

--
Michael F. Stemper
This post contains greater than 95% post-consumer bytes by weight.
--
https://mail.python.org/mailman/listinfo/python-list

RE: XML Considered Harmful [ In reply to ]

python-list at python

Sep 23, 2021, 2:26 PM

Post #24 of 75 (642 views)

Can we agree that there are far more general ways to store data than
anything currently in common use, and that CSV and cousins like TSV are,
in a sense, a subset of the others? There are trees and arbitrary
graphs and many complex data structures often encountered while a program is
running as in-memory objects. Many are not trivial to store.

But some are if all you see is table-like constructs including matrices and
data.frames.

I mean any rectangular data format with umpteen rows and N columns can
trivially be stored in many other formats and especially when it allows some
columns to have NA values. The other format would simply have major
categories that each contain one component per column, with a missing
component representing an NA. Is there any reason JSON or XML cannot include the
contents of any CSV with headers and without loss of info?
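
As a rough illustration of the easy direction, here is a sketch that turns a CSV with headers into a list of JSON objects. The sample rows and the function name csv_to_records are made up, and treating an empty cell as NA (None, which becomes null in JSON) is an assumption of this sketch.

```python
import csv
import io
import json

CSV_TEXT = """\
title,author,year
Book A,Author One,1990
Book B,,
"""


def csv_to_records(text):
    """Turn each CSV row into a dict keyed by the header names."""
    reader = csv.DictReader(io.StringIO(text))
    # Assumption for this sketch: an empty cell means NA, stored as None.
    return [{key: (value if value != "" else None)
             for key, value in row.items()}
            for row in reader]


records = csv_to_records(CSV_TEXT)
as_json = json.dumps(records, indent=2)
```

Every value stays a string here, because the CSV itself carries no type information; any conversion to numbers would be a separate decision by the reader.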

Going the other way is harder. Note that a data.frame type of structure
often imposes restrictions on a CSV and requires everything in a column to
be of the same type, or coercible to a common type. (well, not always true
as in using list columns in R.) But given some arbitrary structure in XML,
can you look at all possible labels and if it is not too complex, make a CSV
with one or more columns for every possible need? It can be a problem if say
a record for an Author allows multiple actual co-authors. Normal books may
let you get by with multiple columns (mostly containing an NA) with names
like author1, author2, author3, ...

But scientific papers seemingly allow oodles of authors and any time you
update the data, you may need yet another column. And, of course, processing
data where many columns have the same meaning is a bit of a pain. Data
structures can also often be nested multiple levels and at some point, CSV
is not a reasonable fit unless you play database games and make multiple
tables you can store and retrieve to make complex queries, as in many
relational database systems. Yes, each such table can be a CSV.

But if you give someone a hammer, they tend to stop using thumbtacks or
other tools. The real question is what kind of data makes good sense for an
application. If a nice rectangular format works, great. Even if not, the
Author problem above can fairly easily be handled by making the author
column something like a character string you compose as "Last1, First1;
Last2, First2; Last3, First3" and that fits fine in a CSV but can be taken
apart in your software if looking for any book by a particular author. Not
optimal, but a workaround I am sure is used.
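
A small sketch of taking such a cell apart again. The function names are made up, and it assumes that names never contain a semicolon and that each author has exactly one comma between last and first name.

```python
def parse_authors(cell):
    """Split 'Last1, First1; Last2, First2' into (last, first) pairs."""
    authors = []
    for chunk in cell.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        last, _, first = chunk.partition(",")
        authors.append((last.strip(), first.strip()))
    return authors


def has_author(cell, last_name):
    """True if any author in the cell has the given last name."""
    return any(last == last_name for last, _ in parse_authors(cell))


cell = "Last1, First1; Last2, First2; Last3, First3"
pairs = parse_authors(cell)
```

Since the cell itself contains commas, it has to be quoted when written to the CSV, which the csv module's writer does by default.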

But using the most abstract and complex storage method is very often
overkill and unless you are very good at it, may well be a fairly slow and
even error-prone way to solve a problem.

-----Original Message-----
From: Python-list <python-list-bounces+avigross=verizon.net@python.org> On
Behalf Of Chris Angelico
Sent: Thursday, September 23, 2021 9:27 AM
To: Python <python-list@python.org>
Subject: Re: XML Considered Harmful

On Thu, Sep 23, 2021 at 10:55 PM Mats Wichmann <mats@wichmann.us> wrote:
>
> On 9/22/21 10:31, Dennis Lee Bieber wrote:
>
> > If you control both the data generation and the data
> > consumption, finding some format ...
>
> This is really the key. I rant at people seeming to believe that csv
> is THE data interchange format, and it's about as bad as it gets at
> that, if you have a choice. xml is noisy but at least (potentially)
> self-documenting, and ought to be able to recover from certain errors.
> The problem with csv is that a substantial chunk of the world seems to
> live inside Excel, and so data is commonly both generated in csv so it
> can be imported into excel and generated in csv as a result of
> exporting from excel, so the parts often are *not* in your control.
>
> Sigh.

The only people who think that CSV is *the* format are people who habitually
live in spreadsheets. People who move data around the internet, from program
to program, are much more likely to assume that JSON is the sole format. Of
course, there is no single ultimate data interchange format, but JSON is a
lot closer to one than CSV is.

(Or to be more precise: any such thing as a "single ultimate data
interchange format" will be so generic that it isn't enough to define
everything. For instance, "a stream of bytes" is a universal data
interchange format, but that's not ultimately a very useful claim.)

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list


RE: XML Considered Harmful [ In reply to ]

python-list at python

Sep 23, 2021, 2:59 PM

Post #25 of 75 (642 views)

What you are describing, Stefan, is what I meant by emulating a relational database with tables.

And, FYI, there is no guarantee that two authors with the same name will not be assumed to be the same person.

Besides the lack of any one official CSV format, there are oodles of features I have seen that are normally external to the CSV. For example, I have often read in data from a CSV or similar where you could tell the software to treat a blank or 999 as NA, what marks a line to be ignored as a comment, whether the separator is a space or any run of whitespace, what quoting character lets you, say, hide a comma inside a field, how to handle escapes, whether to skip blank lines, and more.
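
A sketch of what that looks like with Python's csv module, where the separator and quote character are reader options and the rest is done by hand. The sample data, the NA markers, and the name read_table are all made up for illustration.

```python
import csv
import io

RAW = """\
# station readings, 999 means missing
id;label;value

1;'north; upper';12.5
2;south;999
3;east;
"""

NA_MARKERS = {"", "999"}  # hypothetical choice for this particular file


def read_table(text):
    """Read a ';'-separated table, skipping comments and mapping NA markers."""
    # Drop blank lines and '#' comment lines before the csv module sees them.
    lines = (line for line in io.StringIO(text)
             if line.strip() and not line.lstrip().startswith("#"))
    reader = csv.DictReader(lines, delimiter=";", quotechar="'")
    return [{key: (None if value in NA_MARKERS else value)
             for key, value in row.items()}
            for row in reader]


table = read_table(RAW)
```

None of these choices is recorded in the file; a different reader with different settings would get different data. Filtering lines up front also breaks if a quoted field contains a newline, and the NA rule is applied to every column, so an id of 999 would be lost as well.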

Now a really good design might place some metadata into the file that can be used to set defaults for things like that or incorporate them into the format unambiguously. It might calculate the likely data type for various fields and store that in the metadata. So even if you stored rectangular data in a CSV file, perhaps the early lines would be in some format that can be read as comments and supply some info like the above.

Are any of the CSV variants more like that?

-----Original Message-----
From: Python-list <python-list-bounces+avigross=verizon.net@python.org> On Behalf Of Stefan Ram
Sent: Thursday, September 23, 2021 5:43 PM
To: python-list@python.org
Subject: Re: XML Considered Harmful

"Avi Gross" <avigross@verizon.net> writes:
>But scientific papers seemingly allow oodles of authors and any time
>you update the data, you may need yet another column.

You can use three CSV files: papers, persons, and authors:

papers.csv

1, "Is the accelerated expansion evidence of a change of signature?"

persons.csv

1, Marc Mars

authors.csv

1, 1

I.e., paper 1 is authored by person 1.

Now, when we learn that José M. M. Senovilla also is a
co-author of "Is the accelerated expansion evidence of a
forthcoming change of signature?", we only have to add
new rows, no new columns.

papers.csv

1, "Is the accelerated expansion evidence of a change of signature?"

persons.csv

1, "Marc Mars"
2, "José M. M. Senovilla"

authors.csv

1, 1
1, 2
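
A minimal sketch of reading the three tables above back into one answer, assuming they are held as strings rather than files. The helper names rows and authors_of are made up.

```python
import csv
import io

PAPERS = '1, "Is the accelerated expansion evidence of a change of signature?"\n'
PERSONS = '1, "Marc Mars"\n2, "José M. M. Senovilla"\n'
AUTHORS = "1, 1\n1, 2\n"


def rows(text):
    """Parse CSV text, ignoring the space that follows each comma."""
    return list(csv.reader(io.StringIO(text), skipinitialspace=True))


papers = {int(pid): title for pid, title in rows(PAPERS)}
persons = {int(pid): name for pid, name in rows(PERSONS)}
authorship = [(int(paper), int(person)) for paper, person in rows(AUTHORS)]


def authors_of(paper_id):
    """Names of all persons linked to the paper via the authors table."""
    return [persons[person]
            for paper, person in authorship
            if paper == paper_id]


names = authors_of(1)
```

Adding a co-author is one new row in persons and one in authors; the papers table and the code stay unchanged.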

The real problem with CSV is that there is no CSV.

This is not a specific data language with a specific
specification. Instead it is a vague designation for
a plethora of CSV dialects, which usually do not even
have a specification. Compare this with XML. XML has
a sole specification managed by the W3C.

--
https://mail.python.org/mailman/listinfo/python-list
