Mailing List Archive

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
I tried to convert a xls file into csv with the following command, but failed:

$ in2csv --sheet 'Sheet1' 2021-2022-1.xls
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

The above testing file is located at here [1].

[1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls

Any hints for fixing this problem?

Regards,
HZ
--
https://mail.python.org/mailman/listinfo/python-list
Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n' [ In reply to ]
On 29/09/2021 10.22, hongy...@gmail.com wrote:
> I tried to convert a xls file into csv with the following command, but failed:
>
> $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
>
> The above testing file is located at here [1].
>
> [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
>
> Any hints for fixing this problem?

You need to delete the 13 first lines in the file or you see to that
your code does first trim the data before start xml parse it.

--

//Aho
--
https://mail.python.org/mailman/listinfo/python-list
Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n' [ In reply to ]
On Wednesday, September 29, 2021 at 5:40:58 PM UTC+8, J.O. Aho wrote:
> On 29/09/2021 10.22, hongy...@gmail.com wrote:
> > I tried to convert a xls file into csv with the following command, but failed:
> >
> > $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> > XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
> >
> > The above testing file is located at here [1].
> >
> > [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
> >
> > Any hints for fixing this problem?
> You need to delete the 13 first lines in the file

Yes. After deleting the top 3 lines, the problem has been fixed.

> or you see to that your code does first trim the data before start xml parse it.

Yes. I really want to do this trick programmatically, but how do I do it without manually editing the file?

HZ
--
https://mail.python.org/mailman/listinfo/python-list
Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n' [ In reply to ]
On 29/09/2021 13.10, hongy...@gmail.com wrote:
> On Wednesday, September 29, 2021 at 5:40:58 PM UTC+8, J.O. Aho wrote:
>> On 29/09/2021 10.22, hongy...@gmail.com wrote:
>>> I tried to convert a xls file into csv with the following command, but failed:
>>>
>>> $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
>>> XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
>>>
>>> The above testing file is located at here [1].
>>>
>>> [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
>>>
>>> Any hints for fixing this problem?
>> You need to delete the 13 first lines in the file
>
> Yes. After deleting the top 3 lines, the problem has been fixed.
>
>> or you see to that your code does first trim the data before start xml parse it.
>
> Yes. I really want to do this trick programmatically, but how do I do it without manually editing the file?


You could do something like loading the XML into a string (myxmlstr) and
then find the fist < in that string

xmlstart = myxmlstr.find('<')

xmlstr = myxmlstr[xmlstart:]

then use the xmlstr in the xml parser, sure not as convenient as loading
the file directly to the xml parser.

I don't say this is the best way of doing it, I'm sure some python wiz
here would have a smarter solution.

--

//Aho

--
https://mail.python.org/mailman/listinfo/python-list
Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n' [ In reply to ]
On Wednesday, September 29, 2021 at 8:12:08 PM UTC+8, J.O. Aho wrote:
> On 29/09/2021 13.10, hongy...@gmail.com wrote:
> > On Wednesday, September 29, 2021 at 5:40:58 PM UTC+8, J.O. Aho wrote:
> >> On 29/09/2021 10.22, hongy...@gmail.com wrote:
> >>> I tried to convert a xls file into csv with the following command, but failed:
> >>>
> >>> $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> >>> XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
> >>>
> >>> The above testing file is located at here [1].
> >>>
> >>> [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
> >>>
> >>> Any hints for fixing this problem?
> >> You need to delete the 13 first lines in the file
> >
> > Yes. After deleting the top 3 lines, the problem has been fixed.
> >
> >> or you see to that your code does first trim the data before start xml parse it.
> >
> > Yes. I really want to do this trick programmatically, but how do I do it without manually editing the file?
> You could do something like loading the XML into a string (myxmlstr)

How to do this operation? As you have seen, the file refused to be loaded at all.

> and then find the fist < in that string
>
> xmlstart = myxmlstr.find('<')
>
> xmlstr = myxmlstr[xmlstart:]
>
> then use the xmlstr in the xml parser, sure not as convenient as loading
> the file directly to the xml parser.
>
> I don't say this is the best way of doing it, I'm sure some python wiz
> here would have a smarter solution.

Another very strange thing: I trimmed the first 3 lines in the original file and saved it into a new one named as 2021-2022-1-trimmed-top-3-lines.xls. [1]

Then I read the file with the following python script named as pandas-excel.py:

------
import pandas as pd

excel_file='2021-2022-1-trimmed-top-3-lines.xls'

#print(pd.ExcelFile(excel_file).sheet_names)

newpd=pd.read_excel(excel_file, sheet_name='Sheet1')

for i in newpd.index:
if i >1:
for j in newpd.columns:
if int(j.split()[1]) > 2:
if not pd.isnull(newpd.loc[i][j]):
print(newpd.loc[i][j])
------

$ python pandas-excel.py | sort -u
?????? [1-8]? 1-4? 38 ???413???????II ??1932?
?????????????? [1-12]? 1-4? 38 ???416?????????? ??1932?

OTOH, I also tried to read the file with in2csv as follows:

$ in2csv --sheet Sheet1 2021-2022-1-trimmed-top-3-lines.xls 2>/dev/null |tr ',' '\n' | \
sed -re '/^$/d' | sort -u | awk '{print length($0),$0}' | sort -k1n | tail -3 | cut -d ' ' -f2-
?????? [1-8]? 1-4? 38 ???413???????II ??1932?
???????? [1-8]? 6-9? 45 ???511????????? ??1931?
?????????????? [1-12]? 1-4? 38 ???416?????????? ??1932?

As you can see, the above two methods give different results. I'm very puzzled by this phenomenon. Any hints/tips/comments will be greatly appreciated.

[1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1-trimmed-top-3-lines.xls

Regards,
HZ
--
https://mail.python.org/mailman/listinfo/python-list
Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n' [ In reply to ]
On 2021-09-29 01:22:03 -0700, hongy...@gmail.com wrote:
> I tried to convert a xls file into csv with the following command, but failed:
>
> $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
>
> The above testing file is located at here [1].
>
> [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls

Why is that file name .xls when it's obviously an HTML file?

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n' [ In reply to ]
On Thursday, September 30, 2021 at 5:20:04 AM UTC+8, Peter J. Holzer wrote:
> On 2021-09-29 01:22:03 -0700, hongy...@gmail.com wrote:
> > I tried to convert a xls file into csv with the following command, but failed:
> >
> > $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> > XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
> >
> > The above testing file is located at here [1].
> >
> > [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
> Why is that file name .xls when it's obviously an HTML file?

Good catch! Thank you for pointing this out. This file is automatically exported from my university's teaching management system, and it was assigned the .xls extension by default.

HZ
--
https://mail.python.org/mailman/listinfo/python-list
Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n' [ In reply to ]
On Thursday, September 30, 2021 at 9:20:37 AM UTC+8, hongy...@gmail.com wrote:
> On Thursday, September 30, 2021 at 5:20:04 AM UTC+8, Peter J. Holzer wrote:
> > On 2021-09-29 01:22:03 -0700, hongy...@gmail.com wrote:
> > > I tried to convert a xls file into csv with the following command, but failed:
> > >
> > > $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> > > XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
> > >
> > > The above testing file is located at here [1].
> > >
> > > [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
> > Why is that file name .xls when it's obviously an HTML file?
> Good catch! Thank you for pointing this out. This file is automatically exported from my university's teaching management system, and it was assigned the .xls extension by default.

According to the above comment, after I change the extension to html, the following python code will do the trick:


import sys
import pandas as pd

if len(sys.argv) != 2:
print('Usage: ' + sys.argv[0] + ' input-file')
exit(1)

myhtml_pd = pd.read_html(sys.argv[1])
#In [25]: len(myhtml_pd)
#Out[25]: 3

for i in myhtml_pd[2].index:
if i > 0:
for j in myhtml_pd[2].columns:
if j >1 and not pd.isnull(myhtml_pd[2].loc[i][j]):
print(myhtml_pd[2].loc[i][j])

HZ
--
https://mail.python.org/mailman/listinfo/python-list