Mailing List Archive: XML Considered Harmful

Re: XML Considered Harmful [ In reply to ]

Sep 23, 2021, 3:02 PM

Post #26 of 75 (953 views)

On Fri, Sep 24, 2021 at 7:11 AM Eli the Bearded <*@eli.users.panix.com> wrote:
>
> In comp.lang.python, Christian Gollwitzer <auriocus@gmx.de> wrote:
> > On 22.09.21 at 16:52, Michael F. Stemper wrote:
> >> On 21/09/2021 19.30, Eli the Bearded wrote:
> >>> Yes, CSV files can model that. But it would not be my first choice of
> >>> data format. (Neither would JSON.) I'd probably use XML.
> >> Okay. 'Go not to the elves for counsel, for they will say both no
> >> and yes.' (I'm not actually surprised to find differences of opinion.)
>
> Well, I have a recommendation with my answer.
>
> > It's the same as saying "CSV supports images". Of course it doesn't, it's
> > a text file, but you could encode a JPEG as base64 and then put this
> > string into the cell of a CSV table. That definitely isn't what a sane
> > person would understand as "support".
>
> I'd use one of the netpbm formats instead of JPEG. PBM for one bit
> bitmaps, PGM for one channel (typically grayscale), PPM for three
> channel RGB, and PAM for anything else (two channel gray plus alpha,
> CMYK, RGBA, HSV, YCbCr, and more exotic formats). JPEG is tricky to
> map to CSV since it is a three channel format (YCbCr), where the
> channels are typically not at the same resolution. Usually Y is full
> size and the Cb and Cr channels are one quarter size ("4:2:0 chroma
> subsampling"). The unequal size of the channels does not lend itself
> to CSV, but I can't say it's impossible.
>

Examine prior art, and I truly do mean art, from Matt Parker:

https://www.youtube.com/watch?v=UBX2QQHlQ_I

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list

Re: XML Considered Harmful [ In reply to ]

python-list at python

Sep 23, 2021, 3:55 PM

Post #27 of 75 (953 views)

On 2021-09-23, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> The real problem with CSV is that there is no CSV.
>
> This is not a specific data language with a specific
> specification. Instead it is a vague designation for
> a plethora of CSV dialects, which usually do not even
> have a specification.

Indeed. For example, at least at some points in its history,
Excel has been unable to import CSV written by itself, because
its importer was incompatible with its own exporter.

> Compare this with XML. XML has a sole specification managed
> by the W3C.

Other well-defined formats are also available ;-)

Re: XML Considered Harmful [ In reply to ]

python-list at python

Sep 23, 2021, 6:02 PM

Post #28 of 75 (953 views)

On 22/09/2021 07.22, Michael F. Stemper wrote:
> On 21/09/2021 13.49, alister wrote:
>> On Tue, 21 Sep 2021 13:12:10 -0500, Michael F. Stemper wrote:
>>
>>> On the prolog thread, somebody posted a link to:
>>> <https://dirtsimple.org/2004/12/python-is-not-java.html>

Given the source, shouldn't one take any criticism of Python (or Java)
with at least the proverbial grain of salt!

>>> One thing that it tangentially says is "XML is not the answer."

"tangential" as in 'spinning off'?

...

> It's my own research, so I can give myself the data in any format that I
> like.
...
With that, why not code it as Python expressions, and include the module?
--
Regards,
=dn

Re: XML Considered Harmful [ In reply to ]

rosuav at gmail

Sep 23, 2021, 8:11 PM

Post #29 of 75 (953 views)

On Fri, Sep 24, 2021 at 12:22 PM Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>
> dn <PythonList@DancesWithMice.info> writes:
> >With that, why not code it as Python expressions, and include the module?
>
> This might create a code execution vulnerability if such
> files are exchanged between multiple parties.
>
> If code execution vulnerabilities and human-readability are
> not an issue, then one could also think about using pickle.
>
> If one ignores security concerns for a moment, serialization into
> a text format and subsequent deserialization can be as easy as:
>
> |>>> eval( str( [1, (2, 3)] ))
> |[1, (2, 3)]
>

One good hybrid is to take a subset of Python syntax (so it still
looks like a Python script for syntax highlighting etc), and then
parse that yourself, using the ast module. For instance, you can strip
out comments, then look for "VARNAME = ...", and parse the value using
ast.literal_eval(), which will give you a fairly flexible file format
that's still quite safe.
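
A minimal sketch of that idea, with one swap: instead of stripping comments by hand, it lets ast.parse() read the file (which parses but never executes it) and then hands each assigned value to ast.literal_eval(). The function name load_settings and the sample text are made up for illustration.

```python
import ast

def load_settings(text):
    """Parse 'NAME = <literal>' assignments without executing any code."""
    settings = {}
    tree = ast.parse(text)  # parsing only; comments are discarded here
    for node in tree.body:
        if (isinstance(node, ast.Assign)
                and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)):
            # literal_eval accepts an AST node and raises ValueError
            # for anything that is not a plain literal.
            settings[node.targets[0].id] = ast.literal_eval(node.value)
        else:
            raise ValueError(f"unsupported statement on line {node.lineno}")
    return settings

# Hypothetical settings file content.
SAMPLE = '''
# comments are fine
name = "Skunk Creek 1"
points = [(63, 8.513), (105, 8.907)]
'''

settings = load_settings(SAMPLE)
```

Anything that is not a simple assignment of a literal (imports, function calls, attribute access) ends in a ValueError rather than being run.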

ChrisA

Re: XML Considered Harmful [ In reply to ]

drsalists at gmail

Sep 23, 2021, 8:44 PM

Post #30 of 75 (953 views)

On Thu, Sep 23, 2021 at 8:12 PM Chris Angelico <rosuav@gmail.com> wrote:

> One good hybrid is to take a subset of Python syntax (so it still
> looks like a Python script for syntax highlighting etc), and then
> parse that yourself, using the ast module. For instance, you can strip
> out comments, then look for "VARNAME = ...", and parse the value using
> ast.literal_eval(), which will give you a fairly flexible file format
> that's still quite safe.
>

Restricting Python with the ast module is interesting, but I don't think
I'd want to bet my career on the actual safety of such a thing. Given that
Java bytecode was a frequent problem inside web browsers, imagine all the
messiness that could accidentally happen with a subset of Python syntax
from untrusted sources.

ast.literal_eval might be a little better - or a list of such, actually.

Better still to use JSON or ini format - IOW something designed for the
purpose.

Re: XML Considered Harmful [ In reply to ]

rosuav at gmail

Sep 23, 2021, 8:49 PM

Post #31 of 75 (953 views)

On Fri, Sep 24, 2021 at 1:44 PM Dan Stromberg <drsalists@gmail.com> wrote:
>
>
> On Thu, Sep 23, 2021 at 8:12 PM Chris Angelico <rosuav@gmail.com> wrote:
>>
>> One good hybrid is to take a subset of Python syntax (so it still
>> looks like a Python script for syntax highlighting etc), and then
>> parse that yourself, using the ast module. For instance, you can strip
>> out comments, then look for "VARNAME = ...", and parse the value using
>> ast.literal_eval(), which will give you a fairly flexible file format
>> that's still quite safe.
>
>
> Restricting Python with the ast module is interesting, but I don't think I'd want to bet my career on the actual safety of such a thing. Given that Java bytecode was a frequent problem inside web browsers, imagine all the messiness that could accidentally happen with a subset of Python syntax from untrusted sources.
>
> ast.literal_eval might be a little better - or a list of such, actually.

Uhh, I specifically mention literal_eval in there :) Simple text
parsing followed by literal_eval for the bulk of it is a level of
safety that I *would* bet my career on.

> Better still to use JSON or ini format - IOW something designed for the purpose.

It all depends on how human-editable it needs to be. JSON has several
problems in that respect, including some rigidities, and a lack of
support for comments. INI format doesn't have enough data types for
many purposes. YAML might be closer, but it's not for every situation
either.
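
To make the INI and JSON complaints concrete, here is a small sketch using only the standard library. The section name, key names and values are invented for the example.

```python
import configparser
import json

# INI: every value comes back as a string unless you convert it yourself.
ini = configparser.ConfigParser()
ini.read_string("[fuel]\nname = Montana Sub-Bituminous\nprice = 21.96\n")
raw_price = ini["fuel"]["price"]        # a str, not a float
price = ini["fuel"].getfloat("price")   # explicit conversion

# JSON: the standard parser rejects comments outright.
def parses_as_json(text):
    """Return True if json.loads accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

with_comment = '{\n  // price per ton\n  "price": 21.96\n}'
without_comment = '{\n  "price": 21.96\n}'
comment_ok = parses_as_json(with_comment)
plain_ok = parses_as_json(without_comment)
```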

That's why we have options.

ChrisA

Re: XML Considered Harmful [ In reply to ]

python-list at python

Sep 23, 2021, 10:40 PM

Post #32 of 75 (953 views)

On 24/09/2021 14.07, Stefan Ram wrote:
> dn <PythonList@DancesWithMice.info> writes:
>> With that, why not code it as Python expressions, and include the module?
>
> This might create a code execution vulnerability if such
> files are exchanged between multiple parties.

The OP's spec, as quoted earlier(!), reads:

"It's my own research, so I can give myself the data in any format that
I like."

Whither "files are exchanged" and/or "multiple parties"? Are these
anticipations of problems that may/won't ever apply? aka YAGNI.

Concern about such an approach *is* warranted.

However, the preceding question to be considered during the design-stage
is: 'does such concern apply?'. The OP describes full and unique agency.
Accordingly, "KISS"!

NB my personal choice would likely be JSON or YAML, but see reservations
(eg @Chris) - and with greater relevance: shouldn't we consider the OP's
'learning curve'?
(such deduced only from OP's subsequent reactions/responses 'here' -
with any and all due apologies)
--
Regards,
=dn

Re: XML Considered Harmful [ In reply to ]

python-list at python

Sep 23, 2021, 11:33 PM

Post #33 of 75 (953 views)

I had to use XML once because that was demanded by the receiving machine over which I had no say. I wouldn't use it otherwise because staring at it makes you dizzy.

I would want to know how the data are derived from the multiple sources and transmitted to the collating platform before pontificating. Then I would ignore any potential future enhancements and choose the easiest possible mechanism.

I have used json with python and been delighted at the ease of converting data into dicts and even arbitrary nesting where data values can also be dicts etc.

Good luck
--
(Unsigned mail from my phone)

Re: XML Considered Harmful [ In reply to ]

bursejan at gmail

Sep 24, 2021, 6:46 AM

Post #34 of 75 (953 views)

BTW: I think it's problematic to associate Java with XML.

Michael F. Stemper wrote on Tuesday, 21 September 2021 at 20:12:33 UTC+2:
> On the prolog thread, somebody posted a link to:
> <https://dirtsimple.org/2004/12/python-is-not-java.html>

The above link is very old, from 2004, and might apply to
how Java presented itself back in those days. But since
the Jigsaw project, XML has practically left Java.

It is no longer part of the javax.* or java.* namespaces;
Oracle got rid of the XML technologies housed in these
namespaces, and there is now the jakarta.* namespace.

Example JAXB:
Jakarta XML Binding (JAXB; formerly Java Architecture for XML Binding)
https://de.wikipedia.org/wiki/Jakarta_XML_Binding

If I remember correctly, XML also never went into the Java
Language Specification, unlike the Scala programming
language, where you can have XML literals:

XML literals in scala
https://tuttlem.github.io/2015/02/24/xml-literals-in-scala.html

An easy protection against tampered XML data vulnerabilities
is DTD or some other XML schema language. It can at least catch
problems that are in the scope of the schema language.

Re: XML Considered Harmful [ In reply to ]

bursejan at gmail

Sep 24, 2021, 7:55 AM

Post #35 of 75 (953 views)

Or else use cryptographic methods, such as encryption and/or
signatures, to protect your XML file when in transit.

Mostowski Collapse wrote on Friday, 24 September 2021 at 15:46:27 UTC+2:
> BTW: I think it's problematic to associate Java with XML.
> Michael F. Stemper wrote on Tuesday, 21 September 2021 at 20:12:33 UTC+2:
> > On the prolog thread, somebody posted a link to:
> > <https://dirtsimple.org/2004/12/python-is-not-java.html>
> The above link is very old, from 2004, and might apply to
> how Java presented itself back in those days. But since
> the Jigsaw project, XML has practically left Java.
>
> It is no longer part of the javax.* or java.* namespaces;
> Oracle got rid of the XML technologies housed in these
> namespaces, and there is now the jakarta.* namespace.
>
> Example JAXB:
> Jakarta XML Binding (JAXB; formerly Java Architecture for XML Binding)
> https://de.wikipedia.org/wiki/Jakarta_XML_Binding
>
> If I remember correctly, XML also never went into the Java
> Language Specification, unlike the Scala programming
> language, where you can have XML literals:
>
> XML literals in scala
> https://tuttlem.github.io/2015/02/24/xml-literals-in-scala.html
>
> An easy protection against tampered XML data vulnerabilities
> is DTD or some other XML schema language. It can at least catch
> problems that are in the scope of the schema language.

Re: XML Considered Harmful [ In reply to ]

hjp-python at hjp

Sep 24, 2021, 11:29 AM

Post #36 of 75 (953 views)

On 2021-09-21 19:46:19 -0700, Dan Stromberg wrote:
> On Tue, Sep 21, 2021 at 7:26 PM Michael F. Stemper <
> michael.stemper@gmail.com> wrote:
> > If XML is not the way to package data, what is the recommended
> > approach?
> >
>
> I prefer both JSON and YAML over XML.
>
> XML has both elements and tags, but it didn't really need both.

I think you meant "both elements and attributes". Tags are how you
denote elements, so they naturally go together.

I agree that for representing data (especially object-oriented data) the
distinction between (sub-)elements and attributes seems moot (should I
represent that field as an attribute or a sub-element?), but don't forget
that XML was intended to replace SGML, and that SGML was intended to mark
up text, not to represent arbitrary data.

Would you really want to write

<p>Mr. <party role="defendant">Smith</party>'s point was corroborated by
Ms. <witness>Jones</witness>'s point that <quote>bla, bla</quote>, which
seemed more plausible than Mr. <party role="plaintiff">William</party>'s
claim that <quote>blub, blub</quote>.</p>

as

<p>Mr. <party><role><defendant/></role>Smith</party>'s point was corroborated by
Ms. <witness>Jones</witness>'s point that <quote>bla, bla</quote>, which
seemed more plausible than Mr. <party><role><plaintiff/></role>William</party>'s
claim that <quote>blub, blub</quote>.</p>

or

<p>Mr. <party><defendant>Smith</defendant></party>'s point was
corroborated by Ms. <witness>Jones</witness>'s point that <quote>bla,
bla</quote>, which seemed more plausible than Mr.
<party><plaintiff>William</plaintiff></party>'s claim that <quote>blub,
blub</quote>.</p>

?

I probably chose an example (no doubt influenced by the fact that SGML
was originally invented to digitize court decisions) which is too simple
(in HTML I often see many attributes on a single element, even with
CSS), but even here you can see that attributes add clarity.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

Re: XML Considered Harmful [ In reply to ]

hjp-python at hjp

Sep 24, 2021, 11:34 AM

Post #37 of 75 (953 views)

On 2021-09-23 06:53:10 -0600, Mats Wichmann wrote:
> The problem with csv is that a substantial chunk of the world seems to
> live inside Excel,

This is made so much worse by Excel being exceptionally bad at reading
CSV.

Several hundred genes were recently renamed because Excel was unable to
read their names simply as strings and insisted on interpreting them as
something else (e.g. dates).

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

Re: XML Considered Harmful [ In reply to ]

hjp-python at hjp

Sep 24, 2021, 11:59 AM

Post #38 of 75 (953 views)

On 2021-09-21 13:12:10 -0500, Michael F. Stemper wrote:
> I read this page right when I was about to write an XML parser
> to get data into the code for a research project I'm working on.
> It seems to me that XML is the right approach for this sort of
> thing, especially since the data is hierarchical in nature.
>
> Does the advice on that page mean that I should find some other
> way to get data into my programs, or does it refer to some kind
> of misuse/abuse of XML for something that it wasn't designed
> for?
>
> If XML is not the way to package data, what is the recommended
> approach?

There are a gazillion formats and depending on your needs one of them
might be perfect. Or you may have to define your own bespoke format (I
mean, nobody (except Matt Parker) tries to represent images or videos as
CSVs: There's PNG and JPEG and WEBP and H.264 and AV1 and whatever for
that).

Of the three formats discussed here my take is:

CSV: Good for tabular data of a single data type (strings). As soon as
there's a second data type (numbers, dates, ...) you leave standard
territory and are into "private agreements".

JSON: Has a few primitive data types (bool, number, string) and two
compound types (list, dict(string -> any)). Still missing many
frequently used data types (e.g. dates) and has no standard way to
denote composite types. But it's simple and if it's sufficient for your
needs, use it.

XML: Originally invented for text markup, and that shows. Can represent
different types (via tags), can define those types (via DTD and/or
schemas), can identify schemas in a globally-unique way and you can mix
them all in a single document (and there are tools available to validate
your files). But those features make it very complex (you almost
certainly don't want to write your own parser) and you really have to
understand the data model (especially namespaces) to use it.

You can of course represent any data in any format if you jump through
enough hoops, but the real question is "does the data I have fit
naturally within the data model of the format I'm trying to use". If it
doesn't, look for something else. For me, CSV, JSON and XML form a
hierarchy where each can naturally represent all the data of its
predecessors, but not vice versa.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

Re: XML Considered Harmful [ In reply to ]

python-list at python

Sep 24, 2021, 3:51 PM

Post #39 of 75 (953 views)

On 25/09/2021 06.59, Peter J. Holzer wrote:
> There are a gazillion formats and depending on your needs one of them
> might be perfect. Or you may have to define your own bespoke format (I
> mean, nobody (except Matt Parker) tries to represent images or videos as
> CSVs: There's PNG and JPEG and WEBP and H.264 and AV1 and whatever for
> that).
>
> Of the three formats discussed here my take is:
>
> CSV: Good for tabular data of a single data type (strings). As soon as
> there's a second data type (numbers, dates, ...) you leave standard
> territory and are into "private agreements".
>
> JSON: Has a few primitive data types (bool, number, string) and two
> compound types (list, dict(string -> any)). Still missing many
> frequently used data types (e.g. dates) and has no standard way to
> denote composite types. But it's simple and if it's sufficient for your
> needs, use it.
>
> XML: Originally invented for text markup, and that shows. Can represent
> different types (via tags), can define those types (via DTD and/or
> schemas), can identify schemas in a globally-unique way and you can mix
> them all in a single document (and there are tools available to validate
> your files). But those features make it very complex (you almost
> certainly don't want to write your own parser) and you really have to
> understand the data model (especially namespaces) to use it.

and YAML?
--
Regards,
=dn

Re: XML Considered Harmful [ In reply to ]

rosuav at gmail

Sep 24, 2021, 4:00 PM

Post #40 of 75 (953 views)

On Sat, Sep 25, 2021 at 8:53 AM dn via Python-list
<python-list@python.org> wrote:
>
> On 25/09/2021 06.59, Peter J. Holzer wrote:
> > There are a gazillion formats and depending on your needs one of them
> > might be perfect. Or you may have to define your own bespoke format (I
> > mean, nobody (except Matt Parker) tries to represent images or videos as
> > CSVs: There's PNG and JPEG and WEBP and H.264 and AV1 and whatever for
> > that).
> >
> > Of the three formats discussed here my take is:
> >
> > CSV: Good for tabular data of a single data type (strings). As soon as
> > there's a second data type (numbers, dates, ...) you leave standard
> > territory and are into "private agreements".
> >
> > JSON: Has a few primitive data types (bool, number, string) and two
> > compound types (list, dict(string -> any)). Still missing many
> > frequently used data types (e.g. dates) and has no standard way to
> > denote composite types. But it's simple and if it's sufficient for your
> > needs, use it.
> >
> > XML: Originally invented for text markup, and that shows. Can represent
> > different types (via tags), can define those types (via DTD and/or
> > schemas), can identify schemas in a globally-unique way and you can mix
> > them all in a single document (and there are tools available to validate
> > your files). But those features make it very complex (you almost
> > certainly don't want to write your own parser) and you really have to
> > understand the data model (especially namespaces) to use it.
>
> and YAML?

Invented because there weren't enough markup languages, so we needed another?

ChrisA

Re: XML Considered Harmful [ In reply to ]

python-list at python

Sep 24, 2021, 4:26 PM

Post #41 of 75 (953 views)

On 25/09/2021 11.00, Chris Angelico wrote:

> Invented because there weren't enough markup languages, so we needed another?

Anything You Can Do I Can Do Better
https://www.youtube.com/watch?v=_UB1YAsPD6U

--
Regards =dn

Re: XML Considered Harmful [ In reply to ]

python-list at python

Sep 24, 2021, 4:32 PM

Post #42 of 75 (953 views)

On 2021-09-24, Chris Angelico <rosuav@gmail.com> wrote:
> On Sat, Sep 25, 2021 at 8:53 AM dn via Python-list
><python-list@python.org> wrote:
>> On 25/09/2021 06.59, Peter J. Holzer wrote:
>> > CSV: Good for tabular data of a single data type (strings). As soon as
>> > there's a second data type (numbers, dates, ...) you leave standard
>> > territory and are into "private agreements".

CSV is not good for strings, as there is no one specification of how to
encode things like newlines and commas within the strings, so you may
find that your CSV data transfer fails or even silently corrupts data.

>> > JSON: Has a few primitive data types (bool, number, string) and two
>> > compound types (list, dict(string -> any)). Still missing many
>> > frequently used data types (e.g. dates) and has no standard way to
>> > denote composite types. But it's simple and if it's sufficient for your
>> > needs, use it.

JSON Schema provides a way to denote composite types.

>> > XML: Originally invented for text markup, and that shows. Can represent
>> > different types (via tags), can define those types (via DTD and/or
>> > schemas), can identify schemas in a globally-unique way and you can mix
>> > them all in a single document (and there are tools available to validate
>> > your files). But those features make it very complex (you almost
>> > certainly don't want to write your own parser) and you really have to
>> > understand the data model (especially namespaces) to use it.
>>
>> and YAML?
>
> Invented because there weren't enough markup languages, so we needed
> another?

Invented as a drunken bet that got out of hand, and used by people who
don't realise this.

Re: XML Considered Harmful [ In reply to ]

greg.ewing at canterbury

Sep 24, 2021, 4:46 PM

Post #43 of 75 (953 views)

On 25/09/21 6:29 am, Peter J. Holzer wrote:
> don't forget that
> XML was intended to replace SGML, and that SGML was intended to mark up
> text, not represent any data.

And for me this is the number one reason why XML is the wrong
tool for almost everything it's used for nowadays.

It's bizarre. It's as though there were a large community of
professional builders who insisted on using hammers to drive
screws, and extolled the advantages of doing so.

--
Greg


Re: XML Considered Harmful [ In reply to ]

greg.ewing at canterbury

Sep 24, 2021, 5:01 PM

Post #44 of 75 (953 views)

On 25/09/21 6:34 am, Peter J. Holzer wrote:
> Several hundred genes were recently renamed because Excel was unable to
> read their names simply as strings and insisted on interpreting them as
> something else (e.g. dates).

Another fun one I've come across is interpreting phone numbers
as floating point and writing them out again with exponents...
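
For what it's worth, that kind of mangling comes from guessing types on import, not from the file itself. A small sketch with Python's csv module, which hands every field back as a string; the sample row (a gene-like name and a phone number) is invented:

```python
import csv
import io

# Invented sample: a gene-like name and a phone number with a leading zero.
data = "gene,phone\r\nMARCH1,0412345678\r\n"
rows = list(csv.DictReader(io.StringIO(data, newline="")))

gene = rows[0]["gene"]      # still a str, never turned into a date
phone = rows[0]["phone"]    # still a str, leading zero intact

# The damage only happens once something decides the field is a number.
as_number = float(phone)          # the leading zero is gone
mangled = f"{as_number:.5E}"      # written back out with an exponent
```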

--
Greg

Re: XML Considered Harmful [ In reply to ]

greg.ewing at canterbury

Sep 24, 2021, 5:14 PM

Post #45 of 75 (953 views)

On 25/09/21 10:51 am, dn wrote:
>> XML: Originally invented for text markup, and that shows. Can represent
>> different types (via tags), can define those types (via DTD and/or
>> schemas), can identify schemas in a globally-unique way and you can mix
>> them all in a single document (and there are tools available to validate
>> your files). But those features make it very complex

And for all that complexity, it still doesn't map very well
onto the kinds of data structures used inside programs (lists,
structs, etc.), so you end up having to build those structures
on top of it, and everyone does that in a different way.

--
Greg

Re: XML Considered Harmful [ In reply to ]

greg.ewing at canterbury

Sep 24, 2021, 5:16 PM

Post #46 of 75 (953 views)

On 25/09/21 11:00 am, Chris Angelico wrote:
> On Sat, Sep 25, 2021 at 8:53 AM dn via Python-list
> <python-list@python.org> wrote:
>>
>> and YAML?
>
> Invented because there weren't enough markup languages, so we needed another?

There were *too many* markup languages, so we invented another!

--
Greg

Re: XML Considered Harmful [ In reply to ]

hjp-python at hjp

Sep 25, 2021, 3:46 AM

Post #47 of 75 (953 views)

On 2021-09-24 23:32:47 -0000, Jon Ribbens via Python-list wrote:
> On 2021-09-24, Chris Angelico <rosuav@gmail.com> wrote:
> > On Sat, Sep 25, 2021 at 8:53 AM dn via Python-list
> ><python-list@python.org> wrote:
> >> On 25/09/2021 06.59, Peter J. Holzer wrote:
> >> > CSV: Good for tabular data of a single data type (strings). As soon as
> >> > there's a second data type (numbers, dates, ...) you leave standard
> >> > territory and are into "private agreements".
>
> CSV is not good for strings, as there is no one specification of how to
> encode things like newlines and commas within the strings, so you may
> find that your CSV data transfer fails or even silently corrupts data.

Those two cases are actually pretty straightforward: Just enclose the
field in quotes.

Handling quotes is less standardized. I think doubling quotes is much more
common than an escape character, but I've certainly seen both.
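
As an illustration of the quote-and-double convention, here is a short sketch with Python's csv module using its default dialect; the sample row is made up. Commas and newlines survive because the field is quoted, and embedded quotes are doubled.

```python
import csv
import io

row = ["plain", "comma, inside", "line\nbreak", 'say "hi"']

# Write one record with the default ("excel") dialect.
buf = io.StringIO(newline="")
csv.writer(buf).writerow(row)
encoded = buf.getvalue()

# Read it back; newline="" lets the reader see the embedded newline as data.
decoded = next(csv.reader(io.StringIO(encoded, newline="")))
```

A reader that expects backslash escapes instead of doubled quotes would not get the same row back, which is exactly the kind of dialect mismatch described above.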

But if you get down to it, the problems with CSV start at a much lower
level:

1) The encoding is not defined. These days UTF-8 (with or without BOM)
is pretty common, but I still regularly get files in Windows-1252
encoding and occasionally something else.

2) The record separator isn't defined. CRLF is most common, followed by
LF. But just recently I got a file with CR (Does Eurostat still use
some Macs with MacOS 9?)

3) The field separator isn't defined. Officially the format is known as
"comma separated values", but in my neck of the woods it's actually
semicolon-separated in the vast majority of cases.

So even for the most simple files there are three parameters the sender
and the receiver have to agree on.
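
A minimal sketch of what agreeing on those three parameters looks like on the reading side in Python. The helper name read_csv and both sample files are invented; the encoding and the field separator have to be passed in explicitly, while the csv module copes with CR, LF and CRLF on its own as long as the text is handed over untranslated.

```python
import csv
import io

def read_csv(data: bytes, encoding: str, delimiter: str):
    """Decode raw bytes and split them into rows of string fields."""
    text = data.decode(encoding)
    # newline="" leaves CR, LF and CRLF untouched so the csv module
    # can recognise whichever record separator the sender used.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))

# Windows-1252 bytes, CRLF records, semicolon-separated fields.
windows_file = "Name;Ort\r\nM\u00fcller;Wien\r\n".encode("cp1252")
# UTF-8 bytes, bare-CR records, comma-separated fields.
old_mac_file = "Name,Ort\rM\u00fcller,Wien\r".encode("utf-8")

rows_a = read_csv(windows_file, "cp1252", ";")
rows_b = read_csv(old_mac_file, "utf-8", ",")
```

Both calls yield the same rows, but only because the caller supplied the right encoding and delimiter for each file.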

> >> > JSON: Has a few primitive data types (bool, number, string) and two
> >> > compound types (list, dict(string -> any)). Still missing many
> >> > frequently used data types (e.g. dates) and has no standard way to
> >> > denote composite types. But it's simple and if it's sufficient for your
> >> > needs, use it.
>
> JSON Schema provides a way to denote composite types.

I probably wasn't clear what I meant. In XML, every element has a tag,
which is basically its type. So by looking at an XML file (without
reference to a schema) you can tell what each element is. And a
validator can say something like "expected a 'product' or 'service'
element here but found a 'person'".

In JSON everything is just an object or a list. You may guess that an
object with a field "product_id" is a product, but is one with "name":
"Billy" a person or a piece of furniture?

I'm not familiar with JSON schema (I know that it exists and I've read a
tutorial or two but I've never used it in a real project), but as far as
I know it doesn't change that. It describes the structure of a JSON
document but it doesn't add type information to that document. So a
validator can at best guess what the malformed thing it just found was
supposed to be.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

Re: XML Considered Harmful [ In reply to ]

python-list at python

Sep 25, 2021, 4:49 AM

Post #48 of 75 (952 views)

On 2021-09-25, Peter J. Holzer <hjp-python@hjp.at> wrote:
> On 2021-09-24 23:32:47 -0000, Jon Ribbens via Python-list wrote:
>> JSON Schema provides a way to denote composite types.
>
> I probably wasn't clear what I meant. In XML, every element has a tag,
> which is basically its type. So by looking at an XML file (without
> reference to a schema) you can tell what each element is. And a
> validator can say something like "expected a 'product' or 'service'
> element here but found a 'person'".
>
> In JSON everything is just an object or a list. You may guess that an
> object with a field "product_id" is a product, but is one with "name":
> "Billy" a person or a piece of furniture?
>
> I'm not familiar with JSON schema (I know that it exists and I've read a
> tutorial or two but I've never used it in a real project), but as far as
> I know it doesn't change that. It describes the structure of a JSON
> document but it doesn't add type information to that document. So a
> validator can at best guess what the malformed thing it just found was
> supposed to be.

JSON Schema absolutely does change that. You can create named types
and specify where they may appear in the document. With a well-defined
schema you do not need to make any guesses about what type something is.
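
A rough sketch of what such a schema can look like, written as a Python dict so it can be dumped with the json module. The type names product and person and their fields are hypothetical, and $defs is the keyword used by recent drafts (older drafts call it definitions). Actually validating a document needs a third-party validator, which is not shown here.

```python
import json

schema = {
    "$defs": {
        # Two named types; additionalProperties keeps them from overlapping.
        "product": {
            "type": "object",
            "properties": {
                "product_id": {"type": "integer"},
                "name": {"type": "string"},
            },
            "required": ["product_id", "name"],
            "additionalProperties": False,
        },
        "person": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
            "additionalProperties": False,
        },
    },
    "type": "object",
    "properties": {
        # The "item" field must be exactly one of the named types.
        "item": {
            "oneOf": [
                {"$ref": "#/$defs/product"},
                {"$ref": "#/$defs/person"},
            ]
        }
    },
    "required": ["item"],
}

schema_text = json.dumps(schema, indent=2)
```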

Re: XML Considered Harmful [ In reply to ]

Karsten.Hilbert at gmx

Sep 25, 2021, 6:24 AM

Post #49 of 75 (953 views)

On Fri, Sep 24, 2021 at 08:59:23PM +0200, Peter J. Holzer wrote:

> JSON: Has a few primitive data types (bool, number, string) and two
> compound types (list, dict(string -> any)). Still missing many
> frequently used data types (e.g. dates)

But that (dates) at least has a well-known mapping to string,
which makes it usable within JSON.
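
A minimal sketch of that mapping, assuming ISO 8601 strings are the
convention both sides agree on. The record and its "commissioned" field
are invented for the example. Note that the reader still has to know
which fields hold dates, since the JSON itself only says "string".

```python
import json
from datetime import date


def encode_date(value):
    """json.dumps fallback: turn date objects into ISO 8601 strings."""
    if isinstance(value, date):
        return value.isoformat()
    raise TypeError(f"not JSON serializable: {value!r}")


# Hypothetical record with one date-valued field.
record = {"name": "Skunk Creek 1", "commissioned": date(2021, 9, 25)}

text = json.dumps(record, default=encode_date)
decoded = json.loads(text)

# The date comes back as a plain string and must be converted explicitly.
restored = date.fromisoformat(decoded["commissioned"])
print(text)
print(restored)
```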

Karsten
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B

Re: XML Considered Harmful [ In reply to ]

michael.stemper at gmail

Sep 25, 2021, 1:20 PM

Post #50 of 75 (952 views)

On 21/09/2021 13.12, Michael F. Stemper wrote:

> If XML is not the way to package data, what is the recommended
> approach?

Well, there have been a lot of ideas put forth on this thread,
many more than I expected. I'd like to thank everyone who
took the time to contribute.

Most of the reasons given for avoiding XML appear to be along
the lines of "XML has all of these different options that it
supports."

However, it seems that I could ignore 99% of those things and
just use a teeny subset of its capabilities. For instance, if
I modeled a fuel like this:

  <Fuel name="Montana Sub-Bituminous">
    <uom>ton</uom>
    <price>21.96</price>
    <heat_content>18.2</heat_content>
  </Fuel>

and a generating unit like this:

  <Generator name="Skunk Creek 1">
    <IHRcurve name="normal">
      <point P="63" IHR="8.513"/>
      <point P="105" IHR="8.907"/>
      <point P="241" IHR="9.411"/>
      <point P="455" IHR="10.202"/>
    </IHRcurve>
    <IHRcurve name="constrained">
      <point P="63" IHR="8.514"/>
      <point P="103" IHR="9.022"/>
      <point P="223" IHR="9.511"/>
      <point P="415" IHR="10.102"/>
    </IHRcurve>
  </Generator>

why would the fact that I could have chosen, instead, to model
the unit of measure as an attribute of the fuel, or its name
as a sub-element matter? Once the modeling decision has been
made, all of the decisions that might have been would seem to
be irrelevant.
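
For what it's worth, reading the generator sketched above takes only a
few lines with the standard library's xml.etree.ElementTree. This is
just a sketch: the function name load_curves and the choice of a dict of
(P, IHR) tuples are assumptions, not anything the format dictates.

```python
import xml.etree.ElementTree as ET

GENERATOR_XML = """<Generator name="Skunk Creek 1">
  <IHRcurve name="normal">
    <point P="63" IHR="8.513"/>
    <point P="105" IHR="8.907"/>
    <point P="241" IHR="9.411"/>
    <point P="455" IHR="10.202"/>
  </IHRcurve>
  <IHRcurve name="constrained">
    <point P="63" IHR="8.514"/>
    <point P="103" IHR="9.022"/>
    <point P="223" IHR="9.511"/>
    <point P="415" IHR="10.102"/>
  </IHRcurve>
</Generator>"""


def load_curves(xml_text):
    """Return (generator name, {curve name: [(P, IHR), ...]})."""
    root = ET.fromstring(xml_text)
    curves = {}
    for curve in root.findall("IHRcurve"):
        # Attribute values arrive as strings, so convert them here.
        curves[curve.get("name")] = [
            (float(point.get("P")), float(point.get("IHR")))
            for point in curve.findall("point")
        ]
    return root.get("name"), curves


name, curves = load_curves(GENERATOR_XML)
print(name, sorted(curves))
```

The code only touches elements and attributes, so none of XML's other
features come into play once the model is fixed.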

Some years back, IEC's TC57 came up with CIM[1]. This nailed down
a lot of decisions. The fact that other decisions could have been
made doesn't seem to keep utilities from going forward with it as
an enterprise-wide data model.

My current interests are not anywhere so expansive, but it seems
that the situations are at least similar:
1. Look at an endless range of options for a data model.
2. Pick one.
3. Run with it.

To clearly state my (revised) question:

Why does the existence of XML's many options cause a problem
for my use case?

Other reactions:

Somebody pointed out that some approaches would require that I
climb a learning curve. That's appreciated, although learning
new things is always good.

NestedText looks cool, and a lot like YAML. Having not gotten
around to playing with YAML yet, I was surprised to learn that it
tries to guess data types. This sounds as if it could lead to the
same type of problems that led to the names of some genes being
turned into dates.

It was suggested that I use an RDBMS, such as sqlite3, for the
input data. I've used sqlite3 for real-time data exchange between
concurrently-running programs. However, I don't see syntax like:

sqlite> INSERT INTO Fuels
...> (name,uom,price,heat_content)
...> VALUES ("Montana Sub-Bituminous", "ton", 21.96, 13.65);

as being nearly as readable as the XML that I've sketched above.
Yeah, I could write a program to do this, but that doesn't really
change anything, since I'd still need to get the data into the
program.

(Changing a value would be even worse, requiring the dreaded
UPDATE statement, instead of five seconds in vi.)
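
For comparison, here is roughly what that round trip looks like from
Python's sqlite3 module. The table definition, the in-memory database,
and the new price of 22.50 are all assumptions made up for the sketch;
changing a row uses SQL's UPDATE ... SET form.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# Hypothetical table layout matching the columns in the INSERT above.
conn.execute(
    "CREATE TABLE Fuels ("
    "name TEXT PRIMARY KEY, uom TEXT, price REAL, heat_content REAL)"
)

# Placeholders (?) keep the values out of the SQL text itself.
conn.execute(
    "INSERT INTO Fuels (name, uom, price, heat_content) VALUES (?, ?, ?, ?)",
    ("Montana Sub-Bituminous", "ton", 21.96, 13.65),
)

# Changing one value: UPDATE <table> SET <column> = ... WHERE ...
conn.execute(
    "UPDATE Fuels SET price = ? WHERE name = ?",
    (22.50, "Montana Sub-Bituminous"),
)

row = conn.execute(
    "SELECT uom, price FROM Fuels WHERE name = ?",
    ("Montana Sub-Bituminous",),
).fetchone()
print(row)
conn.close()
```

That is more ceremony than editing a text file, which is the point being
made above.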

Many of the problems listed for CSV, which come from its lack of
standardization, seem similar to those given for XML. "Commas
or tabs?" "How are new-lines represented?" If I were to use CSV,
I'd be able to just pick answers. However, fitting hierarchical
data into rows/columns just seems wrong, so I doubt that I'll
end up going that way.
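
To make both halves of that concrete, here is a small sketch with the
standard csv module. Tabs and "\n" are simply the answers picked for the
example, and the column names and the helpers dump and load are invented.
Flattening the curves means repeating the generator and curve names on
every row.

```python
import csv
import io

# A few points from the generator above, flattened to one row per point.
POINTS = [
    ("Skunk Creek 1", "normal", 63, 8.513),
    ("Skunk Creek 1", "normal", 105, 8.907),
    ("Skunk Creek 1", "constrained", 63, 8.514),
]


def dump(rows):
    """Write rows as tab-separated text with '\\n' line endings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["generator", "curve", "P", "IHR"])
    writer.writerows(rows)
    return buffer.getvalue()


def load(text):
    """Read the text back; every value is returned as a string."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


text = dump(POINTS)
records = load(text)
print(text)
print(records[2]["curve"])
```

The reader has to be told the same delimiter the writer used, and the
hierarchy only survives as repeated values in the first two columns.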

As far as disambiguating authors, I believe that most journals
are now expecting an ORCID[2] (which doesn't help with papers
published before that came around).

As far as use of XML to store program state, I wouldn't ever
consider that. As noted above, I've used an RDBMS to do so.
It handles all of the concurrency issues for me. The current use
case is specifically for raw, static input.

Fascinating to find out that XML was originally designed to
mark up text, especially legal text.

It was nice to be reminded of what Matt Parker looked like when
he had hair.

[1] <https://en.wikipedia.org/wiki/Common_Information_Model_(electricity)>
[2] <https://orcid.org/>
--
Michael F. Stemper
Psalm 82:3-4
