Mailing List Archive

Re: Newline (NuBe Question)
On Mon, 27 Nov 2023 at 13:52, AVI GROSS via Python-list
<python-list@python.org> wrote:
> Be that as it
> may, and I have no interest in this topic, in the future I may use the ever
> popular names of Primus, Secundus and Tertius and get blamed for using
> Latin.
>

Imperious Prima flashes forth her edict to "begin it". In gentler tone
Secunda hopes there will be nonsense in it. While Tertia interrupts
the tale not more than once a minute.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
RE: Newline (NuBe Question)
Dave,

Back on a hopefully more serious note, I want to make a bit of an analogy
with what happens when you save data in a format like a .CSV file.

Often you have a choice of including a header line giving names to the
resulting columns, or not.

If you read in the data to some structure, often to some variation I would
loosely call a data.frame or perhaps something like a matrix, then without
headers you have to specify what you want positionally or create your own
names for columns to use. If names are already there, your program can
manipulate things by using the names and if they are well chosen, with no
studs among them, the resulting code can be quite readable. More
importantly, if the data being read changes and includes additional columns
or in a different order, your original program may run fine as long as the
names of the columns you care about remain the same.
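To make that concrete in Python terms, here is a minimal sketch (the file names and column names are invented for illustration):

import csv

# With a header line, rows arrive as dictionaries keyed by column name:
with open("with_header.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["lastname"])    # survives added or reordered columns

# Without one, access is positional and breaks silently if columns move:
with open("no_header.csv", newline="") as f:
    for row in csv.reader(f):
        print(row[1])             # hope column 1 is still the last name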

Positional programs can be positioned to fail in quite subtle ways if the
positions no longer apply.

As I see it, many situations where some aspects are variable are not ideal
for naming. A dictionary is an example that is useful when you have no idea
how many items with unknown keys may be present. You can iterate over the
names that are there, or use techniques that detect and deal with keys from
your list that are not present. Not using names/keys here might involve a
longer list with lots of empty slots to designate missing items. This
clearly is not great when the data present is sparse or when the number of
items is not known in advance or cannot be maintained in the right order.
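A quick hedged illustration (the field names are invented):

record = {"name": "Ada", "email": "ada@example.com"}   # only what exists

# Iterate over whatever keys happen to be present:
for key, value in record.items():
    print(key, value)

# Detect-and-default for keys that may be absent:
phone = record.get("phone", "unknown")

# The positional alternative needs a slot for every possible field:
record_as_list = ["Ada", "ada@example.com", None, None]   # order-fragile filler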

There are many other situations with assorted tradeoffs and to insist on
using lists/tuples exclusively would be silly but at the same time, if you
are using a list to hold the real and imaginary parts of a complex number,
or the X/Y[/Z] coordinates of a point where the order is almost universally
accepted, then maybe it is not worth using a data structure more complex or
derived as the use may be obvious.

I do recall odd methods sometimes used way back when I programmed in C/C++
or similar languages when some method was used to declare small constants
like:

#define FIRSTNAME 1
#define LASTNAME 2

Or concepts like "const GPA = 3"

And so on, so code asking for student_record[LASTNAME] would be a tad more
readable and if the order of entries somehow were different, just redefine
the constant.
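The direct Python translation of that trick might look like this (indices invented, and zero-based as Python would have them):

FIRSTNAME = 0
LASTNAME = 1

student_record = ["Grace", "Hopper"]
print(student_record[LASTNAME])   # readable, and one edit absorbs a reordering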

In some sense, some of the data structures we are discussing, under the
hood, actually may do something very similar as they remap the name to a
small integer offset. Others may do much more or be slower but often add
value in other ways. A full-blown class may not just encapsulate the names
of components of an object but verify the validity of the contents or do
logging or any number of other things. Using a list or tuple does nothing
else.

So if you need nothing else, they are often suitable and sometimes even
preferable.
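And when you do need more, the "full-blown class" end of the spectrum might look like this hedged sketch (the field names and GPA bounds are invented):

from dataclasses import dataclass

@dataclass
class StudentRecord:
    firstname: str
    lastname: str
    gpa: float

    def __post_init__(self):
        # validate contents on construction - a bare list/tuple would not
        if not 0.0 <= self.gpa <= 4.0:
            raise ValueError(f"GPA out of range: {self.gpa}")

StudentRecord("Grace", "Hopper", 3.9)     # fine
# StudentRecord("Grace", "Hopper", 42.0)  # would raise ValueError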


-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On
Behalf Of DL Neil via Python-list
Sent: Sunday, November 26, 2023 5:19 PM
To: python-list@python.org
Subject: Re: Newline (NuBe Question)

On 11/27/2023 10:04 AM, Peter J. Holzer via Python-list wrote:
> On 2023-11-25 08:32:24 -0600, Michael F. Stemper via Python-list wrote:
>> On 24/11/2023 21.45, avi.e.gross@gmail.com wrote:
>>> Of course, for serious work, some might suggest avoiding constructs like a
>>> list of lists and switch to using modules and data structures [...]
>>
>> Those who would recommend that approach do not appear to include Mr.
>> Rossum, who said:
>> Avoid overengineering data structures.
> ^^^^^^^^^^^^^^^
>
> The key point here is *over*engineering. Don't make things more
> complicated than they need to be. But also don't make them simpler than
> necessary.
>
>> Tuples are better than objects (try namedtuple too though).
>
> If Guido thought that tuples would always be better than objects, then
> Python wouldn't have objects. Why would he add such a complicated
> feature to the language if he thought it was useless?
>
> The (unspoken?) context here is "if tuples are sufficient, then ..."


At recent PUG-meetings I've listened to a colleague asking questions and
conducting research on Python data-structures*, eg lists-of-lists cf
lists-of-tuples, etc, etc. The "etc, etc" goes on for some time!
Respecting the effort, even as it becomes boringly-detailed, am
encouraging him to publish his findings.

* sadly, he is resistant to OOP and included only a cursory look at
custom-objects, and early in the process. His 'new thinking' has been to
look at in-core databases and the speed-ups SQL (or other) might offer...

However, his motivation came from a particular application, and to
create a naming-system so that he could distinguish a list-of-lists
structure from some other tabular abstraction. The latter enables the
code to change data-format to speed the next process, without the coder
losing-track of the data-type/format.

The trouble is, whereas the research reveals which is faster
(in-isolation, and (only) on his 'platform'), my suspicion is that he
loses all gains by reformatting the data between 'the most efficient'
structure for each step. A problem of only looking at the 'micro',
whilst ignoring wider/macro concerns.

Accordingly, as to the word "engineering" (above), a reminder that we
work in two domains: code and data. The short 'toy examples' in training
courses discourage us from a design-stage for the former - until we
enter 'the real world' and meet a problem/solution too large to fit in a
single human-brain. Sadly, too many of us are pre-disposed to be
math/algorithmically-oriented, and thus data-design is rarely-considered
(in the macro!). Yet, here we are...

--
Regards =dn
--
https://mail.python.org/mailman/listinfo/python-list

--
https://mail.python.org/mailman/listinfo/python-list
Re: Newline (NuBe Question)
Avi,

On 11/27/2023 4:15 PM, avi.e.gross@gmail.com wrote:
> Dave,
>
> Back on a hopefully more serious note, I want to make a bit of an analogy
> with what happens when you save data in a format like a .CSV file.
>
> Often you have a choice of including a header line giving names to the
> resulting columns, or not.
>
> If you read in the data to some structure, often to some variation I would
> loosely call a data.frame or perhaps something like a matrix, then without
> headers you have to specify what you want positionally or create your own
> names for columns to use. If names are already there, your program can
> manipulate things by using the names and if they are well chosen, with no
> studs among them, the resulting code can be quite readable. More
> importantly, if the data being read changes and includes additional columns
> or in a different order, your original program may run fine as long as the
> names of the columns you care about remain the same.
>
> Positional programs can be positioned to fail in quite subtle ways if the
> positions no longer apply.

Must admit to avoiding .csv files, if possible, and working directly
with the .xls? original (cf expecting the user to export the .csv - and
NOT change the worksheet thereafter).

However, have recently been using the .csv format (as described) as a
placeholder or introduction to formatting data for an RDBMS.

In a tabular structure, the expectation is that every field (column/row
intersection) will contain a value. In the RDBMS-world, if the value is
not-known then it will be recorded as NULL (equivalent of Python's None).

Accordingly, two points:
1 the special case of missing/unavailable data can be handled with ease,
2 most 'connector' interfaces will give the choice of retrieving data
into a tuple or a dictionary (where the keys are the column-names). The
latter easing data-identification issues (as described) both in terms of
improving over relational-positioning and name-continuity (or column
changes/expansions).
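The standard library's sqlite3 shows both retrieval styles; a minimal sketch (table and column names invented):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (firstname TEXT, lastname TEXT)")
conn.execute("INSERT INTO student VALUES ('Grace', 'Hopper')")

# Default: rows come back as plain tuples - positional access.
row = conn.execute("SELECT firstname, lastname FROM student").fetchone()
print(row[1])                      # 'Hopper'

# With a row factory, the same row is addressable by column name.
conn.row_factory = sqlite3.Row
row = conn.execute("SELECT firstname, lastname FROM student").fetchone()
print(row["lastname"])             # 'Hopper'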


The point about data 'appearing' without headings should be considered
carefully. The phrase "create your own names for columns" only vaguely
addresses the problem. If someone else has created/provided the data,
then we need to know the exact design (schema = rules). What is the
characteristic of each component? Not only column-names, but also what
is the metric (eg the infamous confusion between feet and meters)...


> As I see it, many situations where some aspects are variable are not ideal
> for naming. A dictionary is an example that is useful when you have no idea
> how many items with unknown keys may be present. You can iterate over the
> names that are there, or use techniques that detect and deal with keys from
> your list that are not present. Not using names/keys here might involve a
> longer list with lots of empty slots to designate missing items. This
> clearly is not great when the data present is sparse or when the number of
> items is not known in advance or cannot be maintained in the right order.

Agreed, and this is the drawback incurred by folk who wish to take
advantage of the (possibly) schema-less NoSQL DBs. The DB enjoys
flexibility, but the downstream-coder has to contort and flex to cope.

In this case, JSON files are an easy place-holder/intro for NoSQL DBs -
in fact, Python dicts and MongoDB go hand-in-glove.
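The round-trip really is that direct, as a minimal sketch:

import json

doc = {"name": "Ada", "tags": ["math", "computing"]}   # arbitrary nesting is fine

text = json.dumps(doc)           # dict -> JSON text (what a document DB stores)
assert json.loads(text) == doc   # JSON text -> dict, unchanged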


The next issue raised is sparseness. In a table, the assumption is that
all fields, or at least most of them, will be filled with values.
However, a sparse matrix would make such a table very 'expensive' in
terms of storage-space (efficiency).

Accordingly, there are other ways of doing things. All of these involve
labeling each data-item (thus, the data expressed as a table needs to be
at least 50% empty to justify the structural change).
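In Python, the usual such labeling is a dict keyed by position, holding only the occupied cells (a sketch with invented values):

# 1000x1000 grid with three occupied cells: label each item by coordinates.
sparse = {(0, 5): 1.5, (400, 12): -2.0, (999, 999): 7.25}

value = sparse.get((3, 3), 0.0)   # absent cells default to zero
# versus a dense list-of-lists: 1_000_000 slots, almost all of them filler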

In this case, one might consider a tree-type of structure - and if we
have to continue the pattern, we might look at a Network Database
methodology (as distinct from a DB on a network!)


> There are many other situations with assorted tradeoffs and to insist on
> using lists/tuples exclusively would be silly but at the same time, if you
> are using a list to hold the real and imaginary parts of a complex number,
> or the X/Y[/Z] coordinates of a point where the order is almost universally
> accepted, then maybe it is not worth using a data structure more complex or
> derived as the use may be obvious.

No argument (in case anyone thought I might...)

See @Peter's earlier advice.

Much of the consideration (apart from mutable/immutable) is likely to be
ease of coding. Getting down 'into the weeds' is probably pointless
unless questions are being asked about (execution-time) performance...


Isn't the word "obvious" where this discussion started? Whereas "studs"
might be an "obvious" abbreviation for "students" to some, it is not to
others (quite aside from the abbreviation being unnecessary in this
day-and-age).

Curiously, whereas I DO happen to think a point as ( x, y, ) or ( x, y,
z, ) and thus quite happily interpret ( 1, 2, 3, ) as a location in 3D
space, I had a trainee bring a 'problem' on this exact assumption:-

He had two positions ( x1, y1, ) and ( x2, y2, ) and was computing the
vector between them ( x2 - x1, y2 - y1 ), accordingly:

def compute_distance( x1, x2, y1, y2, ):
    # with return calculated as above

Trouble is, the function-call was:

result = compute_distance( x1, y1, x2, y2, )

In other words, the function's signature was consistent with the
calculation. Whereas, the function-call was consistent with the way the
data had 'arrived'. Oops!

As soon as a (data)class Point( x, y, ) was created, the function's
signature became:

def compute_distance( starting_point:Point, ending_point:Point, ):
    # with amended return calculation

and the function-call became congruent, naturally.
(in fact, the function was moved into the dataclass to become a method
which simplified the signature and call(s) )
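Pulled together as one runnable sketch (the method body is my reconstruction of the ( x2 - x1, y2 - y1 ) calculation, not the trainee's actual code):

from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

    def compute_distance(self, ending_point: "Point") -> "Point":
        # the ( x2 - x1, y2 - y1 ) vector; the argument order can no
        # longer be muddled at the call-site
        return Point(ending_point.x - self.x, ending_point.y - self.y)

print(Point(1, 2).compute_distance(Point(4, 6)))   # Point(x=3, y=4)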

Thus, what was "obvious" to the same guy's brain when he was writing the
function, and what seemed "obvious" when the function was being used,
were materially (and catastrophically) different!

So, even though we (two) might think in terms of "universally", we
are/were wrong!

Thus, a DESIGNED data-type helps to avoid errors, and even when the
data-usage seems "obvious", offers advantage!


Once again, am tempted to suggest that the saving of:

point = ( 1, 2, )

over:

@dataclass
class Point():
    x: float
    y: float

is about as easily justified as preferring "studs" over the complete
word "students".

YMMV!
(excepting Code Review expectations)


* will an AI-Assistant code this for us, and thus remove any 'amount of
typing' complaint?


> I do recall odd methods sometimes used way back when I programmed in C/C++
> or similar languages when some method was used to declare small constants
> like:
>
> #define FIRSTNAME 1
> #define LASTNAME 2
>
> Or concepts like "const GPA = 3"
>
> And so on, so code asking for student_record[LASTNAME] would be a tad more
> readable and if the order of entries somehow were different, just redefine
> the constant.

I've been known to do this in Python too! This example is congruent with
what was mentioned (elsewhere/earlier): that LASTNAME is considerably
more meaningful than 2.

Programming principles include advice that all 'magic constants' should
be hoisted to the top of the code (along with import-statements). Aren't
those positional indices 'magic constants'?


> In some sense, some of the data structures we are discussing, under the
> hood, actually may do something very similar as they remap the name to a
> small integer offset. Others may do much more or be slower but often add
> value in other ways. A full-blown class may not just encapsulate the names
> of components of an object but verify the validity of the contents or do
> logging or any number of other things. Using a list or tuple does nothing
> else.

Not in Python: dictionary keys must be hashable values - for that reason.

Argh! The docs (https://docs.python.org/3/tutorial/datastructures.html)
don't say that - or don't say it any more. Did it change when key-order
became guaranteed, or do I mis-remember?

Those docs say "immutable" - but whilst "hashable" and "immutable" have
related meanings, they are not exactly the same in effect.

Alternately, the wiki (https://wiki.python.org/moin/DictionaryKeys) does
say "hashable"!


> So if you need nothing else, they are often suitable and sometimes even
> preferable.

Yes, (make a conscious choice to) use the best tool for the job - but
don't let bias cloud your judgement, don't take the ideas of the MD's
nephew as 'Gospel', and DO design the way forward...

--
Regards =dn
--
https://mail.python.org/mailman/listinfo/python-list
RE: Newline (NuBe Question)
Dave, I gave an example, again, and make no deep claims, so your comments may well be valid; no argument from me.

I mentioned CSV and a related family such as TSV as they are a common and simple data format that has long been used. There are oodles of others and, yes, these days many programs can read directly from formats like EXCEL's. But for data that can be shared with almost anyone using anything, something like Comma Separated Values is often used.

And some programs that generate such data simply keep appending a line at a time to a file and do not have any header line. There are even some programs that may not tolerate a file with a header line, or comments, or other optional things, and some where the header lines you create would cause problems, such as by using an extended character set or escaped characters.

I have worked with these files in many languages and environments, and my thought process here focused on recent work in R, albeit much applies everywhere. My point was really not about CSV but about the convenience and advantages of data structures you can access by name when you want, and sometimes also by position when you want. Too many errors can happen when the humans doing the programming are not able to concentrate.

It is similar to arguments about file names. In the old UNIX days, and the same for other systems like VMS, a filename tended to have a format where relatively few characters were allowed, and it might have two parts with the latter being an extension of up to 3 characters, or whatever. So file names like A321G12.dat were common, sitting next to other similarly unpronounceable file names. It was easy to confuse them, and even people who worked with them regularly would forget what a name meant or use the wrong one.

Well, if I load in a CSV in a language like R and there is no header line, as with some other data structures, it may make up a placeholder set of names like V1, V2 and so on. Yes, there are ways to specify the names as they are read in or afterward and they can be changed. But I have seen lots of CSV files offered with way too many columns and no names as well as documentation suggesting what names can be added if you wish.
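For comparison, pandas behaves much the same way, as in this sketch (file name invented; pandas uses the integer labels 0, 1, ... where R makes up V1, V2):

import pandas as pd

df = pd.read_csv("no_header.csv", header=None)   # columns labelled 0, 1, 2, ...

df = pd.read_csv("no_header.csv", header=None,
                 names=["firstname", "lastname", "gpa"])   # or name them up front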

This may be a bit off topic, but I want to add a bit in this context about additional concepts regarding names. As mentioned, there is a whole set of add-ons people sometimes use; in R, I like the tidyverse family, and it allows some fairly sophisticated things to be done using names. There are ways to specify that you want a subset of a data.frame (sometimes a version called a tibble), and you can ask for, say, all columns starting with "xyz", or containing it, or ending with it. That can be very helpful if, say, we have columns containing the height and weight and other metrics of people in three clinics and your column names embed the name of the clinic, or other such examples, and you want to select one grouping for processing. You cannot easily do that without external info if it is just positional.

An extension of this is how compactly you can do fairly complex things such as asking to create lots of new columns using calculations. You can specify, as above, which sets of columns to do this to, and that you want the results for each column XYZ in XYZ.mean and XYZ.std and so on. You can skip oodles of carefully crafted and nested loops because of the ability to manipulate using column names at a high and often abstract level.
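A hedged pandas rendering of both ideas (column names invented; tidyverse's starts_with() becomes a regex here):

import pandas as pd

df = pd.DataFrame({"clinic_a_weight": [70, 82], "clinic_a_height": [1.7, 1.8],
                   "clinic_b_weight": [65, 90]})

clinic_a = df.filter(regex="^clinic_a")   # select columns by name pattern
stats = clinic_a.agg(["mean", "std"])     # mean/std per selected column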

And, just FYI, many other structures such as lists in R also support names for components. It can be very useful. But the overall paradigm compared to Python has major differences and I see strengths and weaknesses and tradeoffs.

Your dictionary example is one of them, as numpy/pandas often make good use of dictionaries as part of dealing with similar data.frame-type structures that are often simpler or easier to code with.

There is lots of AI discussion these days and some of what you say is applicable in that additional info besides names might be useful in the storage format to make processing it more useful. That is available in formats related to XML where fairly arbitrary markup can be made available.

Have to head out as this is already long enough.

--
https://mail.python.org/mailman/listinfo/python-list
