Mailing List Archive: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

Sep 29, 2021, 1:22 AM

Post #1 of 8 (140 views)

I tried to convert an xls file to CSV with the following command, but it failed:

$ in2csv --sheet 'Sheet1' 2021-2022-1.xls
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

The test file is available here [1].

[1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls

Any hints for fixing this problem?

Regards,
HZ
--
https://mail.python.org/mailman/listinfo/python-list

Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

user at example

Sep 29, 2021, 2:40 AM

Post #2 of 8 (140 views)

On 29/09/2021 10.22, hongy...@gmail.com wrote:
> I tried to convert a xls file into csv with the following command, but failed:
>
> $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
>
> The above testing file is located at here [1].
>
> [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
>
> Any hints for fixing this problem?

You need to delete the first 13 lines of the file, or make sure that
your code trims the data before it starts parsing the XML.

--

//Aho

Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

hongyi.zhao at gmail

Sep 29, 2021, 4:10 AM

Post #3 of 8 (140 views)

On Wednesday, September 29, 2021 at 5:40:58 PM UTC+8, J.O. Aho wrote:
> On 29/09/2021 10.22, hongy...@gmail.com wrote:
> > I tried to convert a xls file into csv with the following command, but failed:
> >
> > $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> > XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
> >
> > The above testing file is located at here [1].
> >
> > [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
> >
> > Any hints for fixing this problem?
> You need to delete the 13 first lines in the file

Yes. After deleting the top 3 lines, the problem has been fixed.

> or you see to that your code does first trim the data before start xml parse it.

Yes. I really want to do this trick programmatically, but how do I do it without manually editing the file?

HZ

Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

user at example

Sep 29, 2021, 5:11 AM

Post #4 of 8 (140 views)

On 29/09/2021 13.10, hongy...@gmail.com wrote:
> On Wednesday, September 29, 2021 at 5:40:58 PM UTC+8, J.O. Aho wrote:
>> On 29/09/2021 10.22, hongy...@gmail.com wrote:
>>> I tried to convert a xls file into csv with the following command, but failed:
>>>
>>> $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
>>> XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
>>>
>>> The above testing file is located at here [1].
>>>
>>> [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
>>>
>>> Any hints for fixing this problem?
>> You need to delete the 13 first lines in the file
>
> Yes. After deleting the top 3 lines, the problem has been fixed.
>
>> or you see to that your code does first trim the data before start xml parse it.
>
> Yes. I really want to do this trick programmatically, but how do I do it without manually editing the file?

You could do something like loading the XML into a string (myxmlstr) and
then finding the first < in that string:

xmlstart = myxmlstr.find('<')

xmlstr = myxmlstr[xmlstart:]

Then pass xmlstr to the XML parser. It is not as convenient as loading
the file directly into the parser, though.

I'm not saying this is the best way of doing it; I'm sure some Python
wizard here has a smarter solution.
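
A minimal sketch of the idea above, assuming the file fits in memory and that everything before the first '<' is junk. The helper names trim_to_markup and load_trimmed, the default utf-8 encoding, and the errors='replace' choice are illustrative assumptions, not details taken from the file in question:

```python
from pathlib import Path


def trim_to_markup(text: str) -> str:
    """Return text starting at the first '<'; unchanged if there is none."""
    start = text.find('<')
    return text if start == -1 else text[start:]


def load_trimmed(path: str, encoding: str = 'utf-8') -> str:
    """Read the whole file as text and drop anything before the markup."""
    # errors='replace' keeps the read from failing on an unexpected encoding;
    # undecodable bytes become U+FFFD instead of raising an exception.
    raw = Path(path).read_text(encoding=encoding, errors='replace')
    return trim_to_markup(raw)
```

The returned string can then be handed to whichever parser is in use. If the real encoding is not UTF-8, passing the correct one as the encoding argument avoids replacement characters in the result.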

--

//Aho


Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

hongyi.zhao at gmail

Sep 29, 2021, 6:22 AM

Post #5 of 8 (140 views)

On Wednesday, September 29, 2021 at 8:12:08 PM UTC+8, J.O. Aho wrote:
> On 29/09/2021 13.10, hongy...@gmail.com wrote:
> > On Wednesday, September 29, 2021 at 5:40:58 PM UTC+8, J.O. Aho wrote:
> >> On 29/09/2021 10.22, hongy...@gmail.com wrote:
> >>> I tried to convert a xls file into csv with the following command, but failed:
> >>>
> >>> $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> >>> XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
> >>>
> >>> The above testing file is located at here [1].
> >>>
> >>> [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
> >>>
> >>> Any hints for fixing this problem?
> >> You need to delete the 13 first lines in the file
> >
> > Yes. After deleting the top 3 lines, the problem has been fixed.
> >
> >> or you see to that your code does first trim the data before start xml parse it.
> >
> > Yes. I really want to do this trick programmatically, but how do I do it without manually editing the file?
> You could do something like loading the XML into a string (myxmlstr)

How do I do this? As you have seen, the file refuses to load at all.

> and then find the first < in that string
>
> xmlstart = myxmlstr.find('<')
>
> xmlstr = myxmlstr[xmlstart:]
>
> then use the xmlstr in the xml parser, sure not as convenient as loading
> the file directly to the xml parser.
>
> I don't say this is the best way of doing it, I'm sure some python wiz
> here would have a smarter solution.

Another very strange thing: I trimmed the first 3 lines from the original file and saved the result as 2021-2022-1-trimmed-top-3-lines.xls. [1]

Then I read the file with the following Python script, pandas-excel.py:

------
import pandas as pd

excel_file = '2021-2022-1-trimmed-top-3-lines.xls'

#print(pd.ExcelFile(excel_file).sheet_names)

newpd = pd.read_excel(excel_file, sheet_name='Sheet1')

# Print every non-empty cell below the first two rows, in columns whose
# label has an integer greater than 2 as its second word.
for i in newpd.index:
    if i > 1:
        for j in newpd.columns:
            if int(j.split()[1]) > 2:
                if not pd.isnull(newpd.loc[i][j]):
                    print(newpd.loc[i][j])
------

$ python pandas-excel.py | sort -u
?????? [1-8]? 1-4? 38 ???413???????II ??1932?
?????????????? [1-12]? 1-4? 38 ???416?????????? ??1932?

OTOH, I also tried to read the file with in2csv as follows:

$ in2csv --sheet Sheet1 2021-2022-1-trimmed-top-3-lines.xls 2>/dev/null |tr ',' '\n' | \
sed -re '/^$/d' | sort -u | awk '{print length($0),$0}' | sort -k1n | tail -3 | cut -d ' ' -f2-
?????? [1-8]? 1-4? 38 ???413???????II ??1932?
???????? [1-8]? 6-9? 45 ???511????????? ??1931?
?????????????? [1-12]? 1-4? 38 ???416?????????? ??1932?

As you can see, the above two methods give different results. I'm very puzzled by this phenomenon. Any hints/tips/comments will be greatly appreciated.

[1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1-trimmed-top-3-lines.xls

Regards,
HZ

Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

hjp-python at hjp

Sep 29, 2021, 2:19 PM

Post #6 of 8 (140 views)

On 2021-09-29 01:22:03 -0700, hongy...@gmail.com wrote:
> I tried to convert a xls file into csv with the following command, but failed:
>
> $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
>
> The above testing file is located at here [1].
>
> [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls

Why is that file named .xls when it's obviously an HTML file?

hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

hongyi.zhao at gmail

Sep 29, 2021, 6:20 PM

Post #7 of 8 (140 views)

On Thursday, September 30, 2021 at 5:20:04 AM UTC+8, Peter J. Holzer wrote:
> On 2021-09-29 01:22:03 -0700, hongy...@gmail.com wrote:
> > I tried to convert a xls file into csv with the following command, but failed:
> >
> > $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> > XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
> >
> > The above testing file is located at here [1].
> >
> > [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
> Why is that file name .xls when it's obviously an HTML file?

Good catch! Thank you for pointing this out. This file is automatically exported from my university's teaching management system, and it was assigned the .xls extension by default.
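
For exports like this one, a quick look at the first bytes shows what the file really is, whatever its extension says. The sketch below is only a heuristic: the function names sniff_format and sniff_file and the 512-byte read size are assumptions made for illustration, and the 'xls' and 'xlsx' labels only mean "OLE2 container" and "ZIP archive", which other document types also use.

```python
OLE2_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'  # legacy binary Office files
ZIP_MAGIC = b'PK\x03\x04'                          # .xlsx is a ZIP archive
UTF8_BOM = b'\xef\xbb\xbf'


def sniff_format(head: bytes) -> str:
    """Guess a spreadsheet export's real format from its first bytes."""
    if head.startswith(OLE2_MAGIC):
        return 'xls'
    if head.startswith(ZIP_MAGIC):
        return 'xlsx'
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    # Markup exports often begin with blank lines, so skip leading whitespace.
    if head.lstrip().startswith(b'<'):
        return 'markup'  # HTML or XML saved under a spreadsheet extension
    return 'unknown'


def sniff_file(path: str, size: int = 512) -> str:
    """Read the first `size` bytes of a file and classify them."""
    with open(path, 'rb') as fh:
        return sniff_format(fh.read(size))
```

A file that starts with b'\r\n\r\n\r\n\r\n', as in the error message, and then continues with a tag would be reported as 'markup', which points to an HTML reader rather than an Excel one.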

HZ

Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

hongyi.zhao at gmail

Sep 29, 2021, 8:53 PM

Post #8 of 8 (139 views)

On Thursday, September 30, 2021 at 9:20:37 AM UTC+8, hongy...@gmail.com wrote:
> On Thursday, September 30, 2021 at 5:20:04 AM UTC+8, Peter J. Holzer wrote:
> > On 2021-09-29 01:22:03 -0700, hongy...@gmail.com wrote:
> > > I tried to convert a xls file into csv with the following command, but failed:
> > >
> > > $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> > > XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
> > >
> > > The above testing file is located at here [1].
> > >
> > > [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
> > Why is that file name .xls when it's obviously an HTML file?
> Good catch! Thank you for pointing this out. This file is automatically exported from my university's teaching management system, and it was assigned the .xls extension by default.

Following the comment above, after I changed the extension to .html, the following Python code does the trick:

import sys
import pandas as pd

if len(sys.argv) != 2:
    print('Usage: ' + sys.argv[0] + ' input-file')
    sys.exit(1)

myhtml_pd = pd.read_html(sys.argv[1])
#In [25]: len(myhtml_pd)
#Out[25]: 3

# The data is in the third table; skip row 0 and columns 0 and 1,
# and print every remaining non-empty cell.
for i in myhtml_pd[2].index:
    if i > 0:
        for j in myhtml_pd[2].columns:
            if j > 1 and not pd.isnull(myhtml_pd[2].loc[i][j]):
                print(myhtml_pd[2].loc[i][j])
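
Since the original goal was a CSV file, here is a rough standard-library sketch of going from an HTML table straight to CSV text. It is not what pandas.read_html does internally; it is a simpler substitute that assumes flat tables with no nesting and no rowspan/colspan. The names TableCollector and table_to_csv are made up for this example.

```python
import csv
import io
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the cell text of every <table> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _end_cell(self):
        if self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(' '.join(''.join(self._cell).split()))
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self._end_row()
            self.tables.append([])
        elif tag == 'tr' and self.tables:
            self._end_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._end_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self._end_cell()
        elif tag in ('tr', 'table'):
            self._end_row()


def table_to_csv(html_text: str, index: int = 0) -> str:
    """Return table number `index` (0-based) of html_text as CSV text."""
    parser = TableCollector()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator='\n').writerows(parser.tables[index])
    return out.getvalue()
```

For the file in this thread, pandas reported three tables and the data was in the third, so index=2 would be the matching call; that has not been tried on the actual file here.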

HZ