Mailing List Archive: [OT] Differences between wget and browser file retrieval?

waltdnes at waltdnes

Jan 14, 2021, 12:49 PM

Post #1 of 8 (524 views)

I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
sub-forum, on the Covid-19 case counts for Ontario, using provincial
data. I download 2 files daily as source data. One of them is a PDF
file, which is run through "pdftotext" and then parsed by a bash script
(don't ask). Today, the command...

wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf

...returns a zero-byte file. *BUT*, sticking the URL into the URL bar
of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
PDF file just fine. Is "wget" being blocked? Going through a browser
means extra steps to get the PDF saved to the standard work area where
my script expects it, so that the script can work its magic and parse
out the daily breakdown by PHU (Public Health Unit).
BTW, today's posts requiring the PDF file are...
https://www.dslreports.com/forum/r33002718-
https://www.dslreports.com/forum/r33002752-

I've tried setting --user-agent= with my browser's string as shown by
https://www.whatismybrowser.com/detect/what-is-my-user-agent but no
luck. Is there some way to get around this? I have not updated my
system this past week, so I don't think the problem is at my end.

--
Walter Dnes <waltdnes@waltdnes.org>
I don't run "desktop environments"; I run useful applications

Re: [OT] Differences between wget and browser file retrieval?

ostroffjh at users

Jan 14, 2021, 1:10 PM

Post #2 of 8 (524 views)

On 2021.01.14 15:49, Walter Dnes wrote:
> I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
> sub-forum, on the Covid-19 case counts for Ontario, using provincial
> data. I download 2 files daily as source data. One of them is a PDF
> file, which is run through "pdftotext" and then parsed by a bash script
> (don't ask). Today, the command...
>
> wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
>
> ...returns a zero-byte file. *BUT*, sticking the URL into the URL bar
> of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
> PDF file just fine. Is "wget" being blocked? I have to do extra steps
> to get from the browser-invoked PDF to get the PDF file saved to the
> standard work area where my script expects it to be, so it can work its
> magic and parse out the daily breakdown by PHU (Public Health Unit).
> BTW, today's posts requiring the PDF file are...
> https://www.dslreports.com/forum/r33002718-
> https://www.dslreports.com/forum/r33002752-
>
> I've tried setting --user-agent= with my browser's string as shown by
> https://www.whatismybrowser.com/detect/what-is-my-user-agent but no
> luck. Is there some way to get around this? I have not updated this
> past week, so I don't think the problem is at my end.

I just copy/pasted that wget command into my terminal, and it got me a
1.7M PDF doc. I'm in the US, but I have no idea if location/IP is an
issue or not.

Jack

Re: [OT] Differences between wget and browser file retrieval?

finkandreas at web

Jan 14, 2021, 1:36 PM

Post #3 of 8 (524 views)

On Thu, 14 Jan 2021 16:10:09 -0500
Jack <ostroffjh@users.sourceforge.net> wrote:

> On 2021.01.14 15:49, Walter Dnes wrote:
> > I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
> > sub-forum, on the Covid-19 case counts for Ontario, using provincial
> > data. I download 2 files daily as source data. One of them is a PDF
> > file, which is run through "pdftotext" and then parsed by a bash script
> > (don't ask). Today, the command...
> >
> > wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
> >
> > ...returns a zero-byte file. *BUT*, sticking the URL into the URL bar
> > of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
> > PDF file just fine. Is "wget" being blocked? I have to do extra steps
> > to get from the browser-invoked PDF to get the PDF file saved to the
> > standard work area where my script expects it to be, so it can work its
> > magic and parse out the daily breakdown by PHU (Public Health Unit).
> > BTW, today's posts requiring the PDF file are...
> > https://www.dslreports.com/forum/r33002718-
> > https://www.dslreports.com/forum/r33002752-
> >
> > I've tried setting --user-agent= with my browser's string as shown by
> > https://www.whatismybrowser.com/detect/what-is-my-user-agent but no
> > luck. Is there some way to get around this? I have not updated this
> > past week, so I don't think the problem is at my end.
>
> I just copy/pasted that wget command into my terminal, and it got me a
> 1.7M PDF doc. I'm in the US, but I have no idea if location/IP is an
> issue or not.
>
> Jack
>

I could download the file too with the wget command that you posted. If
you still have trouble, you could try using curl and pretend to be
Firefox:

curl 'https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
  -H 'Accept-Language: en,de;q=0.7,en-US;q=0.3' \
  --compressed \
  -H 'DNT: 1' \
  -H 'Connection: keep-alive' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  > moh-covid-19-report-en-2021-01-14.pdf

Andreas

Re: [OT] Differences between wget and browser file retrieval?

gentoo at dhaller

Jan 14, 2021, 2:00 PM

Post #4 of 8 (524 views)

Hello,

On Thu, 14 Jan 2021, Walter Dnes wrote:
> I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
>sub-forum, on the Covid-19 case counts for Ontario, using provincial
>data. I download 2 files daily as source data. One of them is a PDF
>file, which is run through "pdftotext" and then parsed by a bash script
>(don't ask). Today, the command...
>
> wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
>
>...returns a zero-byte file. *BUT*, sticking the URL into the URL bar
>of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
>PDF file just fine. Is "wget" being blocked?
[..]
> I've tried setting --user-agent= with my browser's string as shown by
>https://www.whatismybrowser.com/detect/what-is-my-user-agent but no
>luck. Is there some way to get around this? I have not updated this
>past week, so I don't think the problem is at my end.

I could download that file just fine just now[1]. Try running 'wget'
with the '-S' option. Oh and:

[..]
WARNING: cannot verify files.ontario.ca's certificate, issued by
[..]

If you sent stderr to /dev/null, you would not have seen that warning.

So, try:

wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf

BTW: you know that you can let date format that URL? e.g.:

wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
"$(date '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf')"

Just note that no unescaped '%' is allowed apart from the date/time
format specifiers. So if a URL contains one, you need to escape it
with another '%', e.g.
$(date '+foo%%20bar-%Y-%m-%d.pdf')
            ^^ this fella

In your case, the URL is clean ;)
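
As a quick check of the escaping rule, here is a small sketch that
formats a fixed date instead of today's. It assumes GNU date (for the
-d option); the function name is made up, and 'foo%20bar' is just the
placeholder file name from above.

```shell
# Assumes GNU date; -d pins the date so the result is reproducible.
# '%%' yields a literal '%', so the URL-encoded space '%20' survives.
escaped_name() {
    date -d "$1" '+foo%%20bar-%Y-%m-%d.pdf'
}

escaped_name 2021-01-14
```

With a single '%' before the "20", date would try to treat it as part
of a format specifier instead of passing it through.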

HTH,
-dnh

[1] $ TZ=America/Toronto date
Thu Jan 14 16:50:15 EST 2021

--
"Airplane travel is nature's way of making you look like your passport
photo." -- Al Gore

Re: [OT] Differences between wget and browser file retrieval?

Jan 14, 2021, 11:40 PM

Post #5 of 8 (522 views)

210114 David Haller wrote:
> On Thu, 14 Jan 2021, Walter Dnes wrote:
>> I download daily a PDF. Today, the command ...
>> wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
>> returns a zero-byte file. *BUT*, sticking the URL into the URL bar
>> of Pale Moon and Google Chrome brings up the PDF file just fine.
>> Is "wget" being blocked ?
> I could download that file just fine just now[1].
> Try running 'wget' with the '-S' option.
> Oh and :
>> WARNING: cannot verify files.ontario.ca's certificate, issued by
> So, try:
> wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
> https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
> BTW: you know that you can let date format that URL? e.g.:
> wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
> "$(date '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf')"

Here in Toronto, I get the same result as Walter via his URL
& similar results from the 2 longer versions above,
except that the escaped version gives "ERROR 403: Forbidden".

When I drop Walter's URL into the address bar of Firefox, no problem :
a 1.75 MB PDF which appears to have all the info.

It looks as if the site is refusing 'wget' requests from Ontario,
but allowing them from e.g. Germany (!).

What Walter is doing is well worthwhile. Press reports are very shallow
& the Ontario government doesn't appear to have any clear idea
just where & how the virus is being spread between humans. HTH.

--
========================,,============================================
SUPPORT ___________//___, Philip Webb
ELECTRIC /] [] [] [] [] []| Cities Centre, University of Toronto
TRANSIT `-O----------O---' purslowatcadotinterdotnet

Re: [OT] Differences between wget and browser file retrieval?

waltdnes at waltdnes

Jan 15, 2021, 12:24 AM

Post #6 of 8 (522 views)

On Thu, Jan 14, 2021 at 11:00:38PM +0100, David Haller wrote

> So, try:
>
> wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
> https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf

No luck. For DNS, I use my ISP's servers (Teksavvy) with fallback to
Google 8.8.8.8.

########################################################################
[i3][waltdnes][/dev/shm] wget -S --no-check-certificate -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0' https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
--2021-01-15 02:15:30-- https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
Resolving files.ontario.ca... 13.33.160.117, 13.33.160.123, 13.33.160.45, ...
Connecting to files.ontario.ca|13.33.160.117|:443... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Content-Type: application/pdf
Content-Length: 0
Connection: keep-alive
Date: Thu, 14 Jan 2021 15:15:50 GMT
Last-Modified: Thu, 14 Jan 2021 15:15:50 GMT
ETag: "d41d8cd98f00b204e9800998ecf8427e"
x-amz-meta-ctime: 1610637349
x-amz-meta-mode: 33188
x-amz-meta-gid: 500
x-amz-meta-uid: 500
x-amz-meta-mtime: 1610637349
Accept-Ranges: bytes
Server: AmazonS3
X-Cache: Hit from cloudfront
Via: 1.1 47dbad48e25df8c5ccf2822e46c2aaa6.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: YTO50-C3
X-Amz-Cf-Id: ARgHfF6QMVfUtkxqkr0AL5ljxIfE7Yd5xPmA4eDMx46NdPXOwIftnQ==
Age: 57573
Length: 0 [application/pdf]
Saving to: 'moh-covid-19-report-en-2021-01-14.pdf'

moh-covid-19-report [ <=> ] 0 --.-KB/s in 0s

2021-01-15 02:15:30 (0.00 B/s) - 'moh-covid-19-report-en-2021-01-14.pdf' saved [0/0]
########################################################################
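
One detail in those headers stands out: the ETag value is the MD5
digest of zero bytes of input. Together with "Content-Length: 0",
"X-Cache: Hit from cloudfront" and an Age of roughly 16 hours, that is
consistent with an empty object sitting in the CDN cache, though that
reading is only an inference from the headers shown. The digest itself
can be checked locally; this sketch assumes md5sum (coreutils) is
installed, and the function name is made up.

```shell
# MD5 of empty input; compare the result with the ETag header above.
empty_md5() {
    printf '' | md5sum | cut -d' ' -f1
}

empty_md5
```

This prints d41d8cd98f00b204e9800998ecf8427e, the same string as the
ETag in the response.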

> BTW: you know that you can let date format that URL? e.g.:
>
> wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
> "$(date '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf')"

Nice, but civil servants get stat holidays off. I downloaded Dec 25th
and 26th PDFs on the 26th. Monday Dec 28th was a lieu day for Boxing
Day, so I downloaded the 28th and 29th PDFs on the 29th. And of course
Jan 1st and 2nd PDFs on Jan 2nd. That's why I can't automate the date.
I have a script "getone"...

[i3][waltdnes][~/covid] cat getone
#!/bin/bash
wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-${1}.pdf

On the 14th it was invoked as "../getone 14" (called from the working
directory, one level below the main "covid" directory). I tweak the
script once a month to match year+month. In a worst-case scenario, I
can go to
https://covid-19.ontario.ca/covid-19-epidemiologic-summaries-public-health-ontario#daily
to manually retrieve a daily PDF. Note that on this page, they list
the date that the report is up to. The report issued 10:15 AM on the
14th shows up in the listing as "COVID-19 in Ontario: January 13, 2021".
That's because it contains data up to the 13th.
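
A possible variation on "getone" that avoids the monthly edit: take the
day as before and default the year and month to the current ones, with
an optional second argument to override them (e.g. on the 1st, when
fetching the previous month's last report). The function name
report_url and the optional argument are made up for this sketch; the
URL pattern is the one used above.

```shell
# Hypothetical helper: build the report URL for a given day.
# $1 = day of month (e.g. 14), $2 = optional YYYY-MM (defaults to now).
report_url() {
    case $1 in
        [0-9]) day=0$1 ;;   # zero-pad a single-digit day
        *)     day=$1 ;;
    esac
    ym=${2:-$(date '+%Y-%m')}
    printf 'https://files.ontario.ca/moh-covid-19-report-en-%s-%s.pdf\n' "$ym" "$day"
}

# Print the URL first; once it looks right, hand it to wget, e.g.
#   wget "$(report_url 14)"
report_url 14 2021-01
```

The day still has to be given by hand, so the stat-holiday gaps are
handled the same way as with the original script.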

--
Walter Dnes <waltdnes@waltdnes.org>
I don't run "desktop environments"; I run useful applications

Re: [OT] Differences between wget and browser file retrieval?

waltdnes at waltdnes

Jan 15, 2021, 7:09 AM

Post #7 of 8 (522 views)

On Fri, Jan 15, 2021 at 02:40:51AM -0500, Philip Webb wrote
>
> Here in Toronto, I get the same result as Walter via his URL
> & similar results from the 2 longer versions above,
> except that the escaped version give "ERROR 403: Forbidden".

I get "ERROR 403: Forbidden" when downloading a nonexistent file,
e.g. when I make a typo, or when the government site is late updating
and they haven't posted the file by the time I request it.

--
Walter Dnes <waltdnes@waltdnes.org>
I don't run "desktop environments"; I run useful applications

Re: [OT SOLVED] Differences between wget and browser file retrieval?

waltdnes at waltdnes

Jan 15, 2021, 8:28 AM

Post #8 of 8 (522 views)

It looks like it was a temporary server hiccup yesterday. wget correctly
pulled down the PDF file for the 15th today. I checked and it also
pulled down the file for the 14th.
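
Since the failure mode here was a "200 OK" with an empty body, a small
guard before the parsing step might catch it earlier next time. This is
only a sketch: the function name is made up, and checking for the
'%PDF' signature is an extra assumption about what a good download
should look like, not something taken from the thread.

```shell
# Hypothetical check: succeed only if the file is non-empty
# and its first four bytes are the PDF signature '%PDF'.
looks_like_pdf() {
    [ -s "$1" ] || return 1              # missing or zero bytes
    [ "$(head -c 4 "$1")" = '%PDF' ]     # PDF files begin with this
}

# Example use after a download (file name taken from this thread).
f=moh-covid-19-report-en-2021-01-14.pdf
if looks_like_pdf "$f"; then
    echo "ok: $f"
else
    echo "empty or not a PDF: $f" >&2
fi
```

Called before pdftotext, this would flag a zero-byte download instead
of handing it to the parser.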

--
Walter Dnes <waltdnes@waltdnes.org>
I don't run "desktop environments"; I run useful applications