I am looking to scrape the data that feeds the SVG elements of this page:
https://www.beinsports.com/au/livescores/match-center/2019/23/1074885
The page appears to be JavaScript-rendered, so traditional BeautifulSoup approaches in Python do not work. Refreshing with the Network panel filtered to XHR shows no JSON response holding the data either. However, with the filter set to JS, I see F24_8.js, whose preview shows exactly the data feeding the SVG elements that I want to capture:
Is there a way, with Selenium for example, to mimic that JavaScript rendering and retrieve the underlying data?
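One route I've been sketching is to skip the rendering entirely and request the data file directly, since its URL is visible in the Network tab. This is untested against the live page: the URL below is a placeholder, and the assumption that the F24_8.js body wraps a JSON-like payload in a JavaScript assignment is mine, based only on the DevTools preview:

```python
import json
import re


def extract_payload(js_body):
    """Strip a 'var name = <payload>;' JavaScript wrapper and parse the
    remaining JSON. The wrapper shape is an assumption based on how the
    DevTools preview of F24_8.js appears."""
    match = re.search(r'=\s*(\{.*\}|\[.*\])\s*;?\s*$', js_body, re.S)
    if not match:
        raise ValueError('no JSON-like payload found in the JS body')
    return json.loads(match.group(1))


# Hypothetical usage (copy the real address of F24_8.js from the Network tab):
#   import requests
#   body = requests.get('https://example.com/.../F24_8.js',
#                       headers={'User-Agent': 'Mozilla/5.0'}).text
#   data = extract_payload(body)
```

If the payload turns out to be XML rather than JSON, the same idea applies with a different parse step.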
Per a request in the comments, I've included below a script that worked against a similar page which the domain no longer supports. In that case, serializing the XML via execute_script was straightforward because the page exposed an XML object: the page did not display the row-level detail, but that object fed the row-level data used in the rendered tools:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait
import re
import math
import time

games = []
browser = webdriver.PhantomJS()  # PhantomJS is deprecated; headless Chrome/Firefox also work
browser.get("http://www.squawka.com/match-results")
WebDriverWait(browser, 10)
mySelect = Select(browser.find_element_by_id("league-filter-list"))
mySelect.select_by_visible_text("German Bundesliga")
seasons = ['Season 2012/2013', 'Season 2013/2014', 'Season 2014/2015',
           'Season 2015/2016', 'Season 2016/2017', 'Season 2017/2018']
for season in seasons:
    nextSelect = Select(browser.find_element_by_id("league-season-list"))
    nextSelect.select_by_visible_text(season)
    source = browser.page_source
    soup = BeautifulSoup(source, 'html.parser')
    games.extend([a.get('href') for a in soup.find_all('a', attrs={'href': re.compile('matches')})])
    # 30 results per page; the total count sits in the "displaying-num" span
    pages = math.ceil(float(soup.find('span', {'class': 'displaying-num'}).get_text().split('of')[-1].strip()) / 30)
    for page in range(2, int(pages) + 1):
        browser.find_element_by_xpath('//a[contains(@href,"pg=' + str(page) + '")]').click()
        source = browser.page_source
        soup = BeautifulSoup(source, 'html.parser')
        games.extend([a.get('href') for a in soup.find_all('a', attrs={'href': re.compile('matches')})])
    print('---------\n' + season + ' Games Appended')
import pandas as pd
import numpy as np
import lxml.etree as etree

frames = []
count = 0
for game in g2:  # g2: list of match URLs, built elsewhere from `games` above
    try:
        browser = webdriver.PhantomJS()
        browser.get(game)
        time.sleep(10)
        # serialize the in-page XML object that feeds the rendered tools
        page = browser.execute_script('return new XMLSerializer().serializeToString(squawkaDp.xml);')
        root = etree.XML(page.encode('utf-8'))
        # events
        gm = pd.DataFrame()
        for f in root.iter('filters'):
            for a in f:
                for event in a.iter('event'):
                    records = {x: [y] for x, y in event.attrib.items()}
                    new_records = {x.tag: [x.text] for x in event}
                    j = pd.DataFrame(records).join(pd.DataFrame(new_records))
                    j['category'] = a.tag
                    # DataFrame.append is removed in recent pandas; use concat
                    gm = pd.concat([gm, j])
    except Exception:
        count += 1  # placeholder; the original error handling is omitted here
        continue
I recognize the incompleteness of the script, but the remaining details are not necessary to the question at hand.
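Since the beIN page exposes no XML object like squawkaDp.xml, another route I'm considering is reading Chrome's performance log through Selenium to recover the URL of the data file the page fetches, then requesting that URL directly. Below is the log-filtering piece; the Selenium wiring is left as untested comments, and the entry shape follows the Chrome DevTools Protocol `Network.responseReceived` event:

```python
import json


def find_data_urls(perf_entries, needle):
    """From Chrome performance-log entries, pull the URLs of network
    responses whose URL contains `needle` (e.g. 'F24'). Each entry's
    'message' field is a JSON string per the Chrome DevTools Protocol."""
    urls = []
    for entry in perf_entries:
        msg = json.loads(entry['message'])['message']
        if msg.get('method') == 'Network.responseReceived':
            url = msg['params']['response']['url']
            if needle in url:
                urls.append(url)
    return urls


# Hypothetical wiring with Selenium 4 + headless Chrome; untested against
# the live page:
#   from selenium import webdriver
#   options = webdriver.ChromeOptions()
#   options.add_argument('--headless=new')
#   options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
#   driver = webdriver.Chrome(options=options)
#   driver.get('https://www.beinsports.com/au/livescores/match-center/2019/23/1074885')
#   urls = find_data_urls(driver.get_log('performance'), 'F24')
```

Once the URL is in hand, a plain requests.get should retrieve the same body the browser received, which sidesteps scraping the SVG elements altogether.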
