Mailing List Archive

help: pandas and 2d table
Hi everyone.
I state that I don't know anything about 'pandas' but I intuited that
it could do what I want. I get, through the "read_excel" method, a
table similar to this:

obj| foo1 foo2 foo3 foo4 foo5 foo6
-----------------------------------
foo1| aa ab zz ad ae af
|
foo2| ba bb bc bd zz bf
|
foo3| ca zz cc cd ce zz
|
foo4| da db dc dd de df
|
foo5| ea eb ec zz ee ef
|
foo6| fa fb fc fd fe ff


And I would like to get a result similar to this:

{
'zz':[('foo1','foo3'),
('foo2','foo5'),
('foo3','foo2'),
('foo3','foo6'),
('foo5','foo4')
]
}

Would you show me the path, please?
Thank you in advance.

--
https://mail.python.org/mailman/listinfo/python-list
Re: help: pandas and 2d table [ In reply to ]
Stefan Ram ha scritto:
> jak <nospam@please.ty> wrote or quoted:
>> Would you show me the path, please?
>
> I was not able to read xls here, so I used csv instead; Warning:
> the script will overwrite file "file_20240412201813_tmp_DML.csv"!
>
> import pandas as pd
>
> with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
> print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
> foo1,aa,ab,zz,ad,ae,af
> foo2,ba,bb,bc,bd,zz,bf
> foo3,ca,zz,cc,cd,ce,zz
> foo4,da,db,dc,dd,de,df
> foo5,ea,eb,ec,zz,ee,ef
> foo6,fa,fb,fc,fd,fe,ff''', file=out )
>
> df = pd.read_csv( 'file_20240412201813_tmp_DML.csv' )
>
> result = {}
>
> for rownum, row in df.iterrows():
> iterator = row.items()
> _, rowname = next( iterator )
> for colname, value in iterator:
> if value not in result: result[ value ]= []
> result[ value ].append( ( rowname, colname ))
>
> print( result )
>

In reality what I wanted to achieve was this:

what = 'zz'
result = {what: []}

for rownum, row in df.iterrows():
iterator = row.items()
_, rowname = next(iterator)
for colname, value in iterator:
if value == what:
result[what] += [(rowname, colname)]
print(result)

In any case, thank you again for pointing me in the right direction. I
had lost myself looking for a pandas method that would do this in a
single shot or almost.


--
https://mail.python.org/mailman/listinfo/python-list
Re: help: pandas and 2d table [ In reply to ]
On 4/13/24 07:00, jak via Python-list wrote:
> Stefan Ram ha scritto:
>> jak <nospam@please.ty> wrote or quoted:
>>> Would you show me the path, please?
>>
>>    I was not able to read xls here, so I used csv instead; Warning:
>>    the script will overwrite file "file_20240412201813_tmp_DML.csv"!
>>
>> import pandas as pd
>>
>> with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
>>      print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
>> foo1,aa,ab,zz,ad,ae,af
>> foo2,ba,bb,bc,bd,zz,bf
>> foo3,ca,zz,cc,cd,ce,zz
>> foo4,da,db,dc,dd,de,df
>> foo5,ea,eb,ec,zz,ee,ef
>> foo6,fa,fb,fc,fd,fe,ff''', file=out )
>>
>> df = pd.read_csv( 'file_20240412201813_tmp_DML.csv' )
>>
>> result = {}
>>
>> for rownum, row in df.iterrows():
>>      iterator = row.items()
>>      _, rowname = next( iterator )
>>      for colname, value in iterator:
>>          if value not in result: result[ value ]= []
>>          result[ value ].append( ( rowname, colname ))
>>
>> print( result )
>>
>
> In reality what I wanted to achieve was this:
>
>     what = 'zz'
>     result = {what: []}
>
>     for rownum, row in df.iterrows():
>         iterator = row.items()
>         _, rowname = next(iterator)
>         for colname, value in iterator:
>             if value == what:
>                 result[what] += [(rowname, colname)]
>     print(result)
>
> In any case, thank you again for pointing me in the right direction. I
> had lost myself looking for a pandas method that would do this in a
> single shot or almost.
>
>

doesn't Pandas have a "where" method that can do this kind of thing? Or
doesn't it match what you are looking for? Pretty sure numpy does, but
that's a lot to bring in if you don't need the rest of numpy.


--
https://mail.python.org/mailman/listinfo/python-list
Re: help: pandas and 2d table [ In reply to ]
On Sat, Apr 13, 2024 at 1:10?PM Mats Wichmann via Python-list <
python-list@python.org> wrote:

> On 4/13/24 07:00, jak via Python-list wrote:
>
> doesn't Pandas have a "where" method that can do this kind of thing? Or
> doesn't it match what you are looking for? Pretty sure numpy does, but
> that's a lot to bring in if you don't need the rest of numpy.
>
> pandas.DataFrame.where — pandas 2.2.2 documentation (pydata.org)
<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html#pandas.DataFrame.where>
--
https://mail.python.org/mailman/listinfo/python-list
Re: help: pandas and 2d table [ In reply to ]
Stefan Ram ha scritto:
> df = df.where( df == 'zz' ).stack().reset_index()
> result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}

Since I don't know Pandas, I will need a month at least to understand
these 2 lines of code. Thanks again.
--
https://mail.python.org/mailman/listinfo/python-list
Re: help: pandas and 2d table [ In reply to ]
Stefan Ram ha scritto:
> jak <nospam@please.ty> wrote or quoted:
>> Stefan Ram ha scritto:
>>> df = df.where( df == 'zz' ).stack().reset_index()
>>> result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}
>> Since I don't know Pandas, I will need a month at least to understand
>> these 2 lines of code. Thanks again.
>
> Here's a technique to better understand such code:
>
> Transform it into a program with small statements and small
> expressions with no more than one call per statement if possible.
> (After each litte change check that the output stays the same.)
>
> import pandas as pd
>
> # Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
> with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
> print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
> foo1,aa,ab,zz,ad,ae,af
> foo2,ba,bb,bc,bd,zz,bf
> foo3,ca,zz,cc,cd,ce,zz
> foo4,da,db,dc,dd,de,df
> foo5,ea,eb,ec,zz,ee,ef
> foo6,fa,fb,fc,fd,fe,ff''', file=out )
> # Note the "index_col=0" below, which is important here!
> df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
>
> selection = df.where( df == 'zz' )
> selection_stack = selection.stack()
> df = selection_stack.reset_index()
> df0 = df.iloc[ :, 0 ]
> df1 = df.iloc[ :, 1 ]
> z = zip( df0, df1 )
> l = list( z )
> result ={ 'zz': l }
> print( result )
>
> I suggest to next insert print statements to print each intermediate
> value:
>
> # Note the "index_col=0" below, which is important here!
> df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
> print( 'df = \n', type( df ), ':\n"', df, '"\n' )
>
> selection = df.where( df == 'zz' )
> print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
> selection, '"\n' )
>
> selection_stack = selection.stack()
> print( 'result of stack() = \n', type( selection_stack ), ':\n"',
> selection_stack, '"\n' )
>
> df = selection_stack.reset_index()
> print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )
>
> df0 = df.iloc[ :, 0 ]
> print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )
>
> df1 = df.iloc[ :, 1 ]
> print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )
>
> z = zip( df0, df1 )
> print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )
>
> l = list( z )
> print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )
>
> result ={ 'zz': l }
> print( "value of { 'zz': l }= \n", type( result ), ':\n"',
> result, '"\n' )
>
> print( result )
>
> Now you can see what each single step does!
>
> df =
> <class 'pandas.core.frame.DataFrame'> :
> " foo1 foo2 foo3 foo4 foo5 foo6
> obj
> foo1 aa ab zz ad ae af
> foo2 ba bb bc bd zz bf
> foo3 ca zz cc cd ce zz
> foo4 da db dc dd de df
> foo5 ea eb ec zz ee ef
> foo6 fa fb fc fd fe ff "
>
> result of where( df == 'zz' ) =
> <class 'pandas.core.frame.DataFrame'> :
> " foo1 foo2 foo3 foo4 foo5 foo6
> obj
> foo1 NaN NaN zz NaN NaN NaN
> foo2 NaN NaN NaN NaN zz NaN
> foo3 NaN zz NaN NaN NaN zz
> foo4 NaN NaN NaN NaN NaN NaN
> foo5 NaN NaN NaN zz NaN NaN
> foo6 NaN NaN NaN NaN NaN NaN "
>
> result of stack() =
> <class 'pandas.core.series.Series'> :
> " obj
> foo1 foo3 zz
> foo2 foo5 zz
> foo3 foo2 zz
> foo6 zz
> foo5 foo4 zz
> dtype: object "
>
> result of reset_index() =
> <class 'pandas.core.frame.DataFrame'> :
> " obj level_1 0
> 0 foo1 foo3 zz
> 1 foo2 foo5 zz
> 2 foo3 foo2 zz
> 3 foo3 foo6 zz
> 4 foo5 foo4 zz "
>
> value of .iloc[ :, 0 ]=
> <class 'pandas.core.series.Series'> :
> " 0 foo1
> 1 foo2
> 2 foo3
> 3 foo3
> 4 foo5
> Name: obj, dtype: object "
>
> value of .iloc[ :, 1 ] =
> <class 'pandas.core.series.Series'> :
> " 0 foo3
> 1 foo5
> 2 foo2
> 3 foo6
> 4 foo4
> Name: level_1, dtype: object "
>
> result of zip( df0, df1 )=
> <class 'zip'> :
> " <zip object at 0x000000000B3B9548>"
>
> result of list( z )=
> <class 'list'> :
> " [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]"
>
> value of { 'zz': l }=
> <class 'dict'> :
> " {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}"
>
> {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}
>
> The script reads a CSV file and stores the data in a Pandas
> DataFrame object named "df". The "index_col=0" parameter tells
> Pandas to use the first column as the index for the DataFrame,
> which is kinda like column headers.
>
> The "where" creates a new DataFrame selection that contains
> the same data as df, but with all values replaced by NaN (Not
> a Number) except for the values that are equal to 'zz'.
>
> "stack" returns a Series with a multi-level index created
> by pivoting the columns. Here it gives a Series with the
> row-col-addresses of a all the non-NaN values. The general
> meaning of "stack" might be the most complex operation of
> this script. It's explained in the pandas manual (see there).
>
> "reset_index" then just transforms this Series back into a
> DataFrame, and ".iloc[ :, 0 ]" and ".iloc[ :, 1 ]" are the
> first and second column, respectively, of that DataFrame. These
> then are zipped to get the desired form as a list of pairs.
>

And this is a technique very similar to reverse engineering. Thanks for
the explanation and examples. All this is really clear and I was able to
follow it easily because I have already written a version of this code
in C without any kind of external library that uses the .CSV version of
the table as data ( 234 code lines :^/ ).


--
https://mail.python.org/mailman/listinfo/python-list