Mailing List Archive: Problem when scraping the 100 Movie titles.

Problem when scraping the 100 Movie titles.

Feb 18, 2021, 7:22 AM

Post #1 of 4 (245 views)

I'm learning web scraping and would like to scrape the movie titles
from https://www.empireonline.com/movies/features/best-movies-2 .
In the course I'm taking, I was supposed to do it with bs4:
titles = soup.find_all(name='h3', class_='title')

but I guess the site has changed since then, and the class is now:
jsx-2692754980

<h3 class="jsx-2692754980">100) Stand By Me</h3>

Even so, when I try to get those titles by tag name and class, my list
is empty:
titles = soup.find_all(name='h3', class_='jsx-2692754980')

I also tried Selenium and managed to get those titles with:
driver.get('https://www.empireonline.com/movies/features/best-movies-2')
#driver.find_element_by_xpath('/html/body/div/div[3]/div[5]/button[2]').click()

titles = driver.find_elements_by_css_selector("h3.jsx-2692754980")

tit = []
for e in titles:
    tit.append(e.text)

print(tit)

But in Chrome I get a popup asking me to accept cookies, and I have to
click to accept them.

Does anyone here know how I can get those titles with BeautifulSoup, and
how to deal with the cookie popup when using Selenium?

--
Thanks
--
https://mail.python.org/mailman/listinfo/python-list

Re: Problem when scraping the 100 Movie titles.

aakashjana2002 at gmail

Feb 18, 2021, 9:43 AM

Post #2 of 4 (245 views)

I have done some web scraping before, and I think you need a slightly more
tactical way to get these titles scraped.
Try to see which classes identify the cards (in which the movie title is
given) and then pull the heading out of those.
Try to get the divs into a list, something like
<div class="jsx-2692754980 listicle-item-image "> in my case, and then
pull the h3 tag out of each one. One thing to note is that React or other
single-page-heavy web apps have seemed difficult to scrape; maybe
BeautifulSoup isn't made for JSX.
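
A minimal sketch of that "find the card divs, then pull the h3 out of each"
idea. It uses only the standard library's html.parser as a stand-in for
BeautifulSoup, so it runs without extra installs. The sample markup, the
"99) Example Movie" entry, and the CardTitleParser name are made up for
illustration; only the class names come from this thread, and the nesting on
the real page may differ.

```python
from html.parser import HTMLParser

# Hypothetical sample markup modelled on the class names quoted in this thread.
SAMPLE_HTML = """
<h3 class="jsx-2692754980">Not a movie title</h3>
<div class="jsx-2692754980 listicle-item-image ">
  <h3 class="jsx-2692754980">100) Stand By Me</h3>
</div>
<div class="jsx-2692754980 listicle-item-image ">
  <h3 class="jsx-2692754980">99) Example Movie</h3>
</div>
"""


class CardTitleParser(HTMLParser):
    """Collect the text of every <h3> nested inside a div carrying card_class."""

    def __init__(self, card_class):
        super().__init__()
        self.card_class = card_class
        self.card_depth = 0   # how many divs deep we are inside a card
        self.in_title = False
        self.buffer = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.card_depth > 0:
                self.card_depth += 1      # nested div inside a card
            elif self.card_class in classes:
                self.card_depth = 1       # entering a card
        elif tag == "h3" and self.card_depth > 0:
            self.in_title = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.card_depth > 0:
            self.card_depth -= 1
        elif tag == "h3" and self.in_title:
            self.titles.append("".join(self.buffer).strip())
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.buffer.append(data)


parser = CardTitleParser("listicle-item-image")
parser.feed(SAMPLE_HTML)
titles = parser.titles
print(titles)
```

Matching on one stable class name rather than the generated jsx-... one keeps
the h3 outside the cards out of the result. This only helps if the headings
are present in the HTML the server sends; if the page builds them in the
browser, no HTML parser will see them, which may be why the find_all call in
the first post returned an empty list.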

On Thu, Feb 18, 2021 at 9:09 PM Bischoop <Bischoop@vimart.net> wrote:

>
> I'm learning web scraping and would like to scrape the movie titles
> from https://www.empireonline.com/movies/features/best-movies-2 .

Re: Problem when scraping the 100 Movie titles.

mats at wichmann

Feb 18, 2021, 10:21 AM

Post #3 of 4 (243 views)

On 2/18/21 10:43 AM, Aakash Jana wrote:
> I have done some web scraping before, and I think you need a slightly more
> tactical way to get these titles scraped.
> Try to see which classes identify the cards (in which the movie title is
> given) and then pull the heading out of those.

Just in general, most websites don't want you to scrape them. Some go to
considerable effort to make it difficult, and some explicitly disallow
downloading any content except for caching purposes. If the website
provides an API, that's how they expect you to consume data that isn't
rendered through a web browser.

Just sayin' ... there's no reason not to learn the concepts of web
scraping, but you should ALSO be aware of the terms of use.


Re: Problem when scraping the 100 Movie titles.

fabianjoseph063 at gmail

Sep 22, 2022, 12:36 AM

Post #4 of 4 (202 views)

# Try this; the page keeps its data as JSON in the #__NEXT_DATA__ element:
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.empireonline.com/movies/features/best-movies-2/"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").contents[0])

# uncomment this to print all data:
#print(json.dumps(data, indent=4))

def find_articles(data):
    # Walk the nested JSON and yield the image name stored under
    # every key that starts with "ImageMeta:".
    if isinstance(data, dict):
        for k, v in data.items():
            if k.startswith("ImageMeta:"):
                yield v['image']['name']
            else:
                yield from find_articles(v)
    elif isinstance(data, list):
        for i in data:
            yield from find_articles(i)

for a in find_articles(data):
    print(a)
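
To see how that recursive generator walks the data without hitting the site,
here is a small offline sketch. The sample dictionary is a made-up stand-in
for the parsed JSON: the "pageData", "Article:9" and "items" keys and the
"99) Example Movie" entry are assumptions for illustration, and the real
structure may differ.

```python
def find_articles(data):
    # Same walk as above: descend through dicts and lists, yielding the
    # image name under every key that starts with "ImageMeta:".
    if isinstance(data, dict):
        for k, v in data.items():
            if k.startswith("ImageMeta:"):
                yield v["image"]["name"]
            else:
                yield from find_articles(v)
    elif isinstance(data, list):
        for i in data:
            yield from find_articles(i)


# Hypothetical stand-in for json.loads(...) of the #__NEXT_DATA__ element.
sample = {
    "props": {
        "pageData": {
            "ImageMeta:1": {"image": {"name": "100) Stand By Me"}},
            "Article:9": {"title": "ignored", "tags": ["a", "b"]},
            "items": [
                {"ImageMeta:2": {"image": {"name": "99) Example Movie"}}},
            ],
        }
    }
}

names = list(find_articles(sample))
print(names)
```

Strings and other scalar values are skipped because they are neither dicts
nor lists, so only the matching keys contribute to the result, however deeply
they are nested.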