Mailing List Archive

Organizing modules and their code
Here is the situation. There is a top-level module (see designs below) containing code that, as the name suggests, manages an ETL pipeline. A directory is created called etl_helpers that organizes several modules responsible for making up the pipeline. The discussion concerns the Python language, which supports OOP as well as Structural/Functional approaches to programming.

I am interested in opinions on which design adheres best to standard architectural practices and the SOLID principles. I understand that this is one of those topics where people may have strong opinions one way or the other. I am interested in those opinions.

Allow me to give my thoughts. First, I don't think there would be much difference if I was using OOP for the functionality, or using a structural paradigm. A structural paradigm in my opinion, along the lines of Rich Hickey's comments on simple versus complex, would be a simpler implementation. In this case there is no reason to create a construct with state. So let's assume the code is structural and not OOP.
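To make the "structural, no state" option concrete, here is a minimal sketch of a pipeline built from plain functions. The function names, the 'name,amount' line format, and the list used as a stand-in for a database are all assumptions made for illustration; none of them come from real code in this thread.

```python
from typing import Iterable

Row = dict[str, str]  # hypothetical record type, assumed for this sketch


def extract(lines: Iterable[str]) -> list[Row]:
    """Parse 'name,amount' lines into rows; nothing is remembered between calls."""
    rows = []
    for line in lines:
        name, amount = line.strip().split(",")
        rows.append({"name": name, "amount": amount})
    return rows


def transform(rows: list[Row]) -> list[Row]:
    """Return new rows with title-cased names; the input rows are not mutated."""
    return [{**row, "name": row["name"].title()} for row in rows]


def load(rows: list[Row], sink: list[Row]) -> int:
    """Append rows to a sink (a list standing in for a database) and count them."""
    sink.extend(rows)
    return len(rows)


def run_pipeline(lines: Iterable[str], sink: list[Row]) -> int:
    """Compose the three steps; each step sees only its arguments."""
    return load(transform(extract(lines)), sink)
```

Because every step is a function of its inputs, the three functions could live in one module or in three without any change to their bodies, which is why the layout question is separate from the OOP question.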

I would go with Design I. Succinctly stated, Design I supports readability and maintainability at least as well as, if not better than, the other designs. I think Design I best adheres to the SOLID principles, whose goal is the creation of mid-level software structures that (Software Architecture: SA Martin):
---- Tolerate change,
---- Are easy to understand, and
---- Are the basis of components that can be used in many software systems.

I could point to the Single Responsibility Principle which is defined as (SA Martin): a module should be responsible to one, and only one, actor. It should satisfy the Liskov Substitution Principle as well. Further, each module in the etl_helpers directory is at the same level of abstraction.

I could also mention that as Dijkstra stressed, at every level, from the smallest function to the largest component, software is like a science and, therefore, is driven by falsifiability. Software architects strive to define modules, components, and services that are easily falsifiable (testable). To do so, they employ restrictive disciplines similar to structured programming,
albeit at a much higher level (SA Martin).
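As an illustration of the testability point, a module with one narrow job can be checked by a handful of assertions, each of which would fail if the code were wrong. The `transform` function and its two rules below are hypothetical, chosen only to keep the sketch short.

```python
import unittest


def transform(rows):
    """Hypothetical transform step: drop rows without an id, cast amount to float."""
    return [
        {"id": row["id"], "amount": float(row["amount"])}
        for row in rows
        if row.get("id") is not None
    ]


class TransformTest(unittest.TestCase):
    def test_casts_amount_to_float(self):
        self.assertEqual(
            transform([{"id": 1, "amount": "2.5"}]),
            [{"id": 1, "amount": 2.5}],
        )

    def test_drops_rows_without_id(self):
        self.assertEqual(transform([{"id": None, "amount": "1"}]), [])


def run_tests() -> bool:
    """Run the test case and report whether every assertion held."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TransformTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

The tests import nothing about extraction or loading, which is the practical payoff of keeping the transform step separate.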

One can point to multiple reasons why Design I might be preferred, but what are the compelling reasons, if there are any, that would suggest another design was superior?

Finally, let me reference an interesting research paper I read recently that seems to support the other designs as anti-patterns: Architecture_Anti-patterns_Automatically.pdf

---- (https://www.cs.drexel.edu/~yfcai/papers/2019/tse2019.pdf)

SEVERAL DESIGNS FOR COMPARISON

DESIGN I:

---- manage_the_etl_pipeline.py
---- etl_helpers
-------- extract.py
-------- transform.py
-------- load.py

Of course one could also

DESIGN II:

---- manage_the_etl_pipeline.py
---- etl_helpers
-------- extract_transform_load.py

or probably even:

DESIGN III:

---- manage_the_etl_pipeline.py
---- extract_transform_load.py
--
https://mail.python.org/mailman/listinfo/python-list
Re: Organizing modules and their code
On 2023-02-03 at 13:18:46 -0800,
transreductionist <transreductionist@gmail.com> wrote:

> Here is the situation. There is a top-level module (see designs below)
> containing code that, as the name suggests, manages an ETL pipeline. A
> directory is created called etl_helpers that organizes several modules
> responsible for making up the pipeline. The discussion concerns the
> Python language, which supports OOP as well as Structural/Functional
> approaches to programming.

> I am interested in opinions on which design adheres best to standard
> architectural practices and the SOLID principles. I understand that
> this is one of those topics where people may have strong opinions one
> way or the other. I am interested in those opinions.

Okay, I'll start: unless one of extract, transform, or load is already,
or will certainly at some point become, complex/complicated enough to be
its own architectural module with its own architectural substructure; or
you're constructing specific ETL pipelines for specific ETL jobs at the
times the jobs are defined; then I think you're overthinking it.

Note that I say that speaking as a notorious overthinker. ;-)

Keep It Simple: Put all four modules at the top level, and run with it
until you falsify it. Yes, I would give you that same advice no matter
what language you're using.

FWIW, I'm not a big fan of OO, but based on what little I know about
your ETL pipelines, I agree with you that it probably doesn't make a big
difference at this level. Define solid (in pretty much any/every sense
of the word, capitalized or not) interfaces between your modules, and
write your code against those interfaces, whether OO or any other
paradigm.
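One way to write down such interfaces in Python, sketched here under the assumption that each step is a callable, is with `typing.Protocol`. The names `Extractor`, `Transformer`, `Loader`, and `run` are hypothetical; plain functions with matching signatures satisfy the protocols, so no class hierarchy is required.

```python
from typing import Iterable, Protocol

Record = dict  # hypothetical record type, assumed for this sketch


class Extractor(Protocol):
    def __call__(self) -> Iterable[Record]: ...


class Transformer(Protocol):
    def __call__(self, records: Iterable[Record]) -> Iterable[Record]: ...


class Loader(Protocol):
    def __call__(self, records: Iterable[Record]) -> int: ...


def run(extract: Extractor, transform: Transformer, load: Loader) -> int:
    """Depend only on the three call signatures, not on where the code lives."""
    return load(transform(extract()))


# Plain functions that happen to match the interfaces.
def read_numbers() -> list[Record]:
    return [{"n": 1}, {"n": 2}]


def double(records: Iterable[Record]) -> list[Record]:
    return [{"n": r["n"] * 2} for r in records]


stored: list[Record] = []


def keep(records: Iterable[Record]) -> int:
    batch = list(records)
    stored.extend(batch)
    return len(batch)
```

Calling `run(read_numbers, double, keep)` exercises the whole chain, and a static type checker can verify the signatures without running anything.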
--
Re: Organizing modules and their code
On 2/3/2023 4:18 PM, transreductionist wrote:
> Here is the situation. There is a top-level module (see designs below) containing code that, as the name suggests, manages an ETL pipeline. A directory is created called etl_helpers that organizes several modules responsible for making up the pipeline. The discussion concerns the Python language, which supports OOP as well as Structural/Functional approaches to programming.
>
> I am interested in opinions on which design adheres best to standard architectural practices and the SOLID principles. I understand that this is one of those topics where people may have strong opinions one way or the other. I am interested in those opinions.

Well, you have pretty well stacked the deck to make DESIGN 1 the
obviously preferred choice. I don't think it has much to do with Python
per se, or even with OO vs imperative style.

As a practical matter, once you got into working with
extract_transform_load.py (for the other designs), I would expect that
you would start wanting to refactor it and eventually end up more like
DESIGN 1. So you might as well start out that way.

The reasons are 1) what you said about separation of concerns, 2) a
desire to keep each module or file relatively coherent and easy to read,
and 3, as you also suggested, making each of them easier to test.
Decoupling is important too (one of the SOLID prescriptions), but you
can violate that with any architecture if you don't think carefully
about what you are doing.

On the subject of OO, I think it is a very good approach to think about
architecture and design in object terms - meaning conceptual objects
from the users' point of view. For example, here you have a pipeline (a
metaphorical or userland object). It will need functionality to load,
transform, and output data so logically it can be composed of a loader,
one or more transformers, and one or more output formatters (more
objects). You may also need a scheduler and a configuration manager
(more objects).

(*Please* let's not have any quibbling about "class" vs "object". We
are at a conceptual level here!)

When it comes to implementation, you can choose to implement those
userland objects with either imperative, OO, or functional techniques,
or a mixture.
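A minimal sketch of that composition, assuming the conceptual objects are represented as callables held by a small dataclass. The `Pipeline` class, its field names, and the sample functions are hypothetical and only mirror the loader / transformers / output formatters vocabulary used above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Pipeline:
    """A conceptual 'pipeline' object composed of smaller conceptual objects."""

    loader: Callable[[], list]
    transformers: list[Callable[[list], list]] = field(default_factory=list)
    formatters: list[Callable[[list], str]] = field(default_factory=list)

    def run(self) -> list[str]:
        data = self.loader()
        for step in self.transformers:  # apply transformers in order
            data = step(data)
        return [fmt(data) for fmt in self.formatters]  # one output per formatter


def square_all(values: list) -> list:
    return [v * v for v in values]


def as_csv(values: list) -> str:
    return ",".join(str(v) for v in values)


def as_count(values: list) -> str:
    return f"count={len(values)}"


pipeline = Pipeline(
    loader=lambda: [3, 1, 2],
    transformers=[sorted, square_all],
    formatters=[as_csv, as_count],
)
```

Here the container is a class while the parts are ordinary functions, which is one example of the "mixture" of techniques.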



--
Re: Organizing modules and their code
On Friday, February 3, 2023 at 5:31:56 PM UTC-5, Thomas Passin wrote:
> Well, you have pretty well stacked the deck to make DESIGN 1 the
> obviously preferred choice. I don't think it has much to do with Python
> per se, or even with OO vs imperative style.
>
> As a practical matter, once you got into working with
> extract_transform_load.py (for the other designs), I would expect that
> you would start wanting to refactor it and eventually end up more like
> DESIGN 1. So you might as well start out that way.
>
> The reasons are 1) what you said about separation of concerns, 2) a
> desire to keep each module or file relatively coherent and easy to read,
> and 3, as you also suggested, making each of them easier to test.
> Decoupling is important too (one of the SOLID prescriptions), but you
> can violate that with any architecture if you don't think carefully
> about what you are doing.


One point that I think is worth making, and that I forgot to make, is that namespaces are ubiquitous in Python: built-in, global, function, and enclosing namespaces, as well as user namespaces, e.g. dictionaries, SimpleNamespace, and dataclasses, to list just a few. Modules ARE namespaces. Namespaces organize programming constructs like classes, functions, variables, etc. into coherent groups of "things". To have a namespace that complects extract constructs with transform constructs and load constructs in one module seems un-Pythonic.
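A small sketch of the "modules are namespaces" point. The two modules are built in memory with `types.ModuleType` purely so that the example is self-contained; in real code they would be extract.py and transform.py on disk. The function name `run` and the helper `public_names` are assumptions made for illustration.

```python
import types
from types import SimpleNamespace

# Two throwaway module objects standing in for extract.py and transform.py.
extract = types.ModuleType("extract")
transform = types.ModuleType("transform")


def _extract_run():
    return ["a", "b"]


def _transform_run(rows):
    return [r.upper() for r in rows]


# The same name, 'run', lives in two namespaces without clashing.
extract.run = _extract_run
transform.run = _transform_run

result = transform.run(extract.run())

# A user-level namespace works the same way: names grouped under one object.
settings = SimpleNamespace(batch_size=100)


def public_names(namespace) -> list[str]:
    """List the non-underscore names a namespace object holds."""
    return sorted(n for n in vars(namespace) if not n.startswith("_"))
```

`vars()` works on both the module objects and the `SimpleNamespace`, which shows that they are the same kind of thing from the caller's point of view: a mapping from names to objects.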
--
Re: Organizing modules and their code
On 2/3/2023 5:14 PM, 2QdxY4RzWzUUiLuE@potatochowder.com wrote:
> Keep It Simple: Put all four modules at the top level, and run with it
> until you falsify it. Yes, I would give you that same advice no matter
> what language you're using.

In my recent message I supported DESIGN 1. But I really don't care much
about the directory organization. It's designing modules whose business
is to handle various kinds of operations that counts, not so much the
actual directory organization.

--
Re: Organizing modules and their code
On 04/02/2023 16.24, Thomas Passin wrote:
> On 2/3/2023 5:14 PM, 2QdxY4RzWzUUiLuE@potatochowder.com wrote:
>> Keep It Simple:  Put all four modules at the top level, and run with it
>> until you falsify it.  Yes, I would give you that same advice no matter
>> what language you're using.
>
> In my recent message I supported DESIGN 1.  But I really don't care much
> about the directory organization.  It's designing modules whose business
> is to handle various kinds of operations that counts, not so much the
> actual directory organization.

+1 (and to comments made in preceding post)

With ETL the 'reasons to change' (SRP) come from different 'actors'. For
example, the data-source may be altered either in format or by changing
the tool you'll utilise to access it. Hence the virtue of keeping it
separate from other parts. If you have multiple data-sources, then each
should be separate for the same reason.

The transform is likely dictated by your client's specification. So,
another separation. Hence Design 1.

There is a strong argument for suggesting that we're going out of our
way to imagine problems or future-changes (which may never happen). If
this is (definitely?) a one-off, then why-bother? If permanence is
likely, (so many 'temporary' solutions end-up lasting years!) then
re-use can?should be considered.

Thus, when it comes to loading the data into your own DB; perhaps this
should be separate, because it is highly likely that the mechanisms you
build for loading will be matched by at least one 'someone else' wanting
to access the same data for the desired end-purposes. Accordingly, a
shareable module and/or class for that.


We can't see the code-structure, so some of the other parts of your
question(s) are too broad. Here's hoping you and Liskov have a good time
together...


My preference is for (what I term) the 'circles' diagram (see copy at
https://mahu.rangi.cloud/CraftingSoftware/CleanArchitecture.jpg). This
illustrates the 'rule' that code handling the inner functionality should
not know what happens at the more detailed/lower-level functional level
of the outer rings.

With ETL, there's precious little to embody various circles, but the
content of the outer ring is obvious. The "T" rules comprise the inner
"Use Case", even if you eschew "Entities" insofar as OOP-avoidance is
concerned. This 'inversion', where the inner controls don't need to care
about the details of outer-ring implementation (is it an RDBMS, MySQL or
Postgres; or is it some NoSQL system?) brings to life the "D" of SOLID,
ie Dependency Inversion.
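A minimal sketch of that inversion, assuming the "T" rule is a plain function and the storage detail is passed in as a callable. The table name, the cents-to-dollars rule, and the factory names `make_sqlite_loader` and `make_list_loader` are hypothetical; SQLite from the standard library stands in for whichever database sits in the outer ring.

```python
import sqlite3
from typing import Callable, Iterable

Row = tuple[str, float]  # hypothetical output record: (name, dollars)


def transform(rows: Iterable[tuple[str, int]]) -> list[Row]:
    """Inner rule: convert cents to dollars. Knows nothing about storage."""
    return [(name, cents / 100) for name, cents in rows]


def run_etl(
    extract: Callable[[], Iterable[tuple[str, int]]],
    load: Callable[[Iterable[Row]], None],
) -> None:
    """The inner code calls whatever 'load' it is handed."""
    load(transform(extract()))


def make_sqlite_loader(conn: sqlite3.Connection) -> Callable[[Iterable[Row]], None]:
    """Outer-ring detail: an SQL-backed loader."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, dollars REAL)")

    def load(rows: Iterable[Row]) -> None:
        conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
        conn.commit()

    return load


def make_list_loader(sink: list) -> Callable[[Iterable[Row]], None]:
    """Outer-ring detail: an in-memory loader, handy for tests."""

    def load(rows: Iterable[Row]) -> None:
        sink.extend(rows)

    return load
```

Swapping `make_sqlite_loader(conn)` for `make_list_loader(sink)` changes where the data ends up without touching `transform` or `run_etl`.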


You may pick-up some ideas or reassurance from "Making a Simple Data
Pipeline Part 1: The ETL Pattern"
(https://www.codeproject.com/Articles/5324207/Making-a-Simple-Data-Pipeline-Part-1-The-ETL-Patte).

Let us know how it turns-out...
--
Regards,
=dn
--
Re: Organizing modules and their code
On 2/4/2023 12:24 AM, dn via Python-list wrote:
> The transform is likely dictated by your client's specification. So,
> another separation. Hence Design 1.
>
> There is a strong argument for suggesting that we're going out of our
> way to imagine problems or future-changes (which may never happen). If
> this is (definitely?) a one-off, then why-bother? If permanence is
> likely, (so many 'temporary' solutions end-up lasting years!) then
> re-use can?should be considered.

With practice, it gets to be more automatic to set things up from the
beginning to more-or-less honor separation of concerns, decoupled
modules and APIs, and so forth. Doing this does not require a full,
future-proof suite of alternative database adapters, for example, right
from the start. On top of everything else, you can't know the future
perfectly. And you can't know enough at the beginning to get every
design and architectural path optimal. You learn as you go.

I have a Tomcat application where I separated the output formatting from
the calculation of results. At the time I wrote only an XML formatter.
A decade later, here comes JSON, and customers are asking about it. I
was able to write a JSON formatter with the same API in about half an
hour, and now we have optional JSON output. Separating out the
formatting functionality with its own API was not an example of wasting
time with YAGNI (You Aren't Gonna Need It); it was just plain good
practice that probably didn't even cost me any more development time -
since it simplified the calculation code.
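The same idea can be sketched in Python (the application described above was a Tomcat one, so this is an analogy, not its code). The calculation returns plain data and each formatter shares one signature, so a second output format is a new function plus a dictionary entry. All names and the sample numbers are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def calculate() -> dict[str, float]:
    """Stand-in for the calculation code; it returns data, not text."""
    return {"mean": 2.0, "total": 6.0}


def format_xml(results: dict[str, float]) -> str:
    """The formatter written first."""
    root = ET.Element("results")
    for name, value in results.items():
        item = ET.SubElement(root, "result", name=name)
        item.text = str(value)
    return ET.tostring(root, encoding="unicode")


def format_json(results: dict[str, float]) -> str:
    """The formatter added later, with the same signature."""
    return json.dumps(results, sort_keys=True)


# Callers pick a formatter by name; the calculation never changes.
FORMATTERS = {"xml": format_xml, "json": format_json}


def report(fmt: str) -> str:
    return FORMATTERS[fmt](calculate())
```

With this layout, `report("json")` and `report("xml")` differ only in the last step.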

OTOH, you may be just trying to learn how to do the bits and pieces. You
may be learning how to connect to the database. You may be learning how
to make a pipeline multithreaded. You have to experiment a thousand
ways, and in a hurry. Until you learn how to do the basic techniques,
sure, quick and dirty is fine. But it shouldn't be the way you design
your actual product, unless it's just for you and needs to be done
quickly, and will probably be ephemeral.

Why do I get the feeling that the OP was asking about a homework problem?

--
Re: Organizing modules and their code
You're overthinking it. It doesn't really matter. Having small chunks of code in separate files can be a hassle when trying to find out what the program does. Having one file with 2,000 lines in it can be a hassle. This is art / opinion, not science.

From: Python-list <python-list-bounces+gweatherby=uchc.edu@python.org> on behalf of transreductionist <transreductionist@gmail.com>
Date: Friday, February 3, 2023 at 4:48 PM
To: python-list@python.org <python-list@python.org>
Subject: Organizing modules and their code
--
Re: Organizing modules and their code
Thank you for all the helpful replies and consideration. I do hope for other opinions.

I would rather say it is more like engineering than art. Whether it is a matter of overthinking, or under thinking, is another matter. I enjoyed Dijkstra's letter to his colleagues on the role of scientific thought in computer programming. It is located at:

---- https://www.cs.utexas.edu/users/EWD/transcriptions/EWD04xx/EWD447.html

It is my academic training in physics that makes me enjoy picking up an idea and examining it from all sides, and sharing thoughts with friends. Just inquisitive curiosity, and not a homework problem. Thanks for the great link to the ETL site. That was a good read. A few years back I built a prod ETL application in Golang using gRPC with a multiprocessing pipeline throughout. It handled GB of data and was fast.

This analogy came to me the other day. For me, I would rather walk into a grocery store where the bananas, apples, and oranges are separated into their own bins, instead of one common crate.


--
Re: Organizing modules and their code
On 5/02/23 11:18 am, transreductionist wrote:
> This analogy came to me the other day. For me, I would rather walk into a grocery store where the bananas, apples, and oranges are separated into their own bins, instead of one common crate.

On the other hand, if the store has an entire aisle devoted to each
fruit, but only ever one crate of fruit in each aisle, one would think
they could make better use of their shelf space.

--
Greg
--
Re: Organizing modules and their code
Well, first of all, while there is no doubt as to Dijkstra's contribution to computer science, I don't think his description of scientific thought is correct. The acceptance of Einstein's theory of relativity has nothing to do with internal consistency or how easy or difficult it is to explain, but rather with repeated experimental results validating it. Or, more precisely, not disproving it. See Feynman: https://www.youtube.com/watch?v=0KmimDq4cSU


Engineering is simply maximizing the ratio: benefit / cost. Highly recommend To Engineer is Human by Henry Petroski.

Regarding the initial question: none of the suggested designs would work because they lack an __init__.py file.

Once the __init__.py is added, the construct of the import statements within it will determine how the API looks. All three designs (Design I, Design II, and Design III) can be implemented with the same API. (I'm pretty sure that's true. If it's not, I'd be interested in a counterexample.)
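The point about `__init__.py` shaping the API can be sketched by building two throwaway packages in a temporary directory, one laid out like Design I and one like Design II, and importing both. The package names `helpers_design_1` and `helpers_design_2` and the toy step functions are hypothetical. As a side note, since Python 3.3 a directory without `__init__.py` can still be imported as a namespace package, so the sketch speaks only to how the imports inside `__init__.py` determine what callers see.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Toy source for the three steps (hypothetical, for illustration only).
STEP_SOURCE = {
    "extract": "def extract():\n    return [1, 2, 3]\n",
    "transform": "def transform(rows):\n    return [r * 10 for r in rows]\n",
    "load": "def load(rows):\n    return len(rows)\n",
}


def build_package(root: Path, name: str, files: dict[str, str]) -> None:
    """Write a package directory called `name` under `root`."""
    pkg = root / name
    pkg.mkdir()
    for filename, source in files.items():
        (pkg / filename).write_text(source)


def run_demo() -> dict[str, int]:
    """Import both layouts and call them through the same three names."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Design I: one module per step, re-exported by __init__.py.
        build_package(root, "helpers_design_1", {
            "extract.py": STEP_SOURCE["extract"],
            "transform.py": STEP_SOURCE["transform"],
            "load.py": STEP_SOURCE["load"],
            "__init__.py": (
                "from .extract import extract\n"
                "from .transform import transform\n"
                "from .load import load\n"
            ),
        })
        # Design II: one combined module, re-exported by __init__.py.
        build_package(root, "helpers_design_2", {
            "extract_transform_load.py": "".join(STEP_SOURCE.values()),
            "__init__.py": (
                "from .extract_transform_load import extract, transform, load\n"
            ),
        })
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            for name in ("helpers_design_1", "helpers_design_2"):
                pkg = importlib.import_module(name)
                results[name] = pkg.load(pkg.transform(pkg.extract()))
        finally:
            # Leave the interpreter as we found it.
            sys.path.remove(tmp)
            for mod in [m for m in sys.modules if m.startswith("helpers_design_")]:
                del sys.modules[mod]
    return results
```

Both packages expose `extract`, `transform`, and `load` at the package level, so calling code cannot tell which file layout sits behind them.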





From: Python-list <python-list-bounces+gweatherby=uchc.edu@python.org> on behalf of transreductionist <transreductionist@gmail.com>
Date: Saturday, February 4, 2023 at 7:42 PM
To: python-list@python.org <python-list@python.org>
Subject: Re: Organizing modules and their code
*** Attention: This is an external email. Use caution responding, opening attachments or clicking on links. ***

Thank you for all the helpful replies and consideration. I do hope for other opinions

I would rather say it is more like engineering than art. Whether it is a matter of overthinking, or under thinking, is another matter. I enjoyed Dijkstra's letter to his colleagues on the role of scientific thought in computer programming. It is located at:

---- https://urldefense.com/v3/__https://www.cs.utexas.edu/users/EWD/transcriptions/EWD04xx/EWD447.html__;!!Cn_UX_p3!nME8OhiOxAzmzM3jzg6uXZU851dhWWD9JGB8ZRZIzyUzGkmCN-C6SSXrL59eA2KVIh-y-W0VycJSNb8aYcNnc3jd5Pi2fw$<https://urldefense.com/v3/__https:/www.cs.utexas.edu/users/EWD/transcriptions/EWD04xx/EWD447.html__;!!Cn_UX_p3!nME8OhiOxAzmzM3jzg6uXZU851dhWWD9JGB8ZRZIzyUzGkmCN-C6SSXrL59eA2KVIh-y-W0VycJSNb8aYcNnc3jd5Pi2fw$>

It is my academic training in physics that makes me enjoy picking up an idea and examining it from all sides, and sharing thoughts with friends. Just inquisitive curiosity, and not a homework problem. Thanks for the great link to the ETL site. That was a good read. A few years back I built a prod ETL application in Golang using gRPC with a multiprocessing pipeline throughout. It handled GB of data and was fast.

This analogy came to me the other day. For me, I would rather walk into a grocery store where the bananas, apples, and oranges are separated into their own bins, instead of one common crate.


On Friday, February 3, 2023 at 4:18:57 PM UTC-5, transreductionist wrote:
> Here is the situation. There is a top-level module (see designs below) containing code, that as the name suggests, manages an ETL pipeline. A directory is created called etl_helpers that organizes several modules responsible for making up the pipeline. The discussion concerns the Python language, which supports OOP as well as Structural/Functional approaches to programming.
>
> I am interested in opinions on which design adheres best to standard architectural practices and the SOLID principles. I understand that this is one of those topics where people may have strong opinions one way or the other. I am interested in those opinions.
>
> Allow me to give my thoughts. First, I don't think there would be much difference if I was using OOP for the functionality, or using a structural paradigm. A structural paradigm in my opinion, along the lines of Rich Hickey's comments on simple versus complex, would be a simpler implementation. In this case there is no reason to create a construct with state. So let's assume the code is structural and not OOP.
>
> I would go with Design I. Succinctly stated, Design I supports readability and maintainability at least as well, if not better than the other designs. The goal of the SOLID principles are the creation of mid-level software structures that (Software Architecture: SA Martin). I think Design I best adheres to these principles of:
> ---- Tolerate change,
> ---- Are easy to understand, and
> ---- Are the basis of components that can be used in many software systems.
>
> I could point to the Single Responsibility Principle which is defined as (SA Martin): a module should be responsible to one, and only one, actor. It should satisfy the Liskov Substitution Principle as well. Further, each module in the etl_helpers directory is at the same level of abstraction.
>
> I could also mention that as Dijkstra stressed, at every level, from the smallest function to the largest component, software is like a science and, therefore, is driven by falsifiability. Software architects strive to define modules, components, and services that are easily falsifiable (testable). To do so, they employ restrictive disciplines similar to structured programming,
> albeit at a much higher level (SA Martin).
>
> One can point to multiple reasons why Design I might be preferred, but what are the compelling reasons, if there are any, that would suggest another design was superior.
>
> Finally, let me reference an interesting research paper I read recently that seems to support the other designs as anti-patterns: Architecture_Anti-patterns_Automatically.pdf
>
> ---- (https://www.cs.drexel.edu/~yfcai/papers/2019/tse2019.pdf)
>
> SEVERAL DESIGNS FOR COMPARISON
>
> DESIGN I:
>
> ---- manage_the_etl_pipeline.py
> ---- etl_helpers
> -------- extract.py
> -------- transform.py
> -------- load.py
>
> Of course one could also
>
> DESIGN II:
>
> ---- manage_the_etl_pipeline.py
> ---- etl_helpers
> -------- extract_transform_load.py
>
> or probably even:
>
> DESIGN III:
>
> ---- manage_the_etl_pipeline.py
> ---- extract_transform_load.py
--
https://mail.python.org/mailman/listinfo/python-list
Re: Organizing modules and their code
On 6/02/23 4:23 am, Weatherby,Gerard wrote:
> Well, first of all, while there is no doubt as to Dijkstra’s contribution to computer science, I don’t think his description of scientific thought is correct. The acceptance of Einstein’s theory of relativity has nothing to do with internal consistency or with how easy or difficult it is to explain, but rather with experimental results repeatedly validating it.

I don't think Dijkstra was claiming that what he was talking about
was a *complete* description of scientific thought, only that the
ability to separate out independent concerns is an important part
of it, and that was something he saw his colleagues failing to do.

--
Greg