I have a Python scraper that uses Selenium to scrape a website whose content is loaded dynamically with JavaScript.
The scraper itself works fine, but pages sometimes fail to load with a 404 error.
The problem is that the public URL doesn't have the data I need but loads every time, while the JavaScript URL that has the data I need sometimes won't load for a random period of time.
Even weirder, the same JavaScript URL loads in one browser but not in another, and vice versa.
I tried the WebDriver for Chrome, Firefox, Firefox Developer Edition and Opera. Not a single one loads all pages every time.
The public link that doesn't have the data I need looks like this: <https://www.sazka.cz/kurzove-sazky/fotbal/*League*/>.
The JavaScript link that has the data I need looks like this: <https://rsb.sazka.cz/fotbal/*League*/>.
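
To make the two URL shapes concrete, here is a small sketch that builds both links from a league slug. The helper name `league_urls` and the slug `some-league` are placeholders made up for illustration; only the two URL patterns come from the site.

```python
# Base URLs taken from the two link patterns described above.
PUBLIC_BASE = "https://www.sazka.cz/kurzove-sazky/fotbal/"
JS_BASE = "https://rsb.sazka.cz/fotbal/"


def league_urls(league):
    """Return (public_url, js_url) for a league slug (hypothetical helper)."""
    slug = league.strip("/")
    return PUBLIC_BASE + slug + "/", JS_BASE + slug + "/"


# "some-league" is a placeholder, not a real league name on the site.
public_url, js_url = league_urls("some-league")
print(public_url)
print(js_url)
```
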
On average, about 8 of roughly 30 links fail to load, although the same link loads flawlessly in a different browser at the same time.
I searched the page source for clues but found nothing.
Can anyone help me find out where the problem might be? Thank you.
Edit: here is the code that I think is relevant:
    import random
    import re
    import time

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome(executable_path='chromedriver',
                              service_args=['--ssl-protocol=any',
                                            '--ignore-ssl-errors=true'])
    driver.maximize_window()
    # urls is the list of league links, built earlier (not shown)
    for single_url in urls:
        # Random pause of 4 to 6 seconds between page loads
        randomLoadTime = random.randint(400, 600)/100
        time.sleep(randomLoadTime)
        driver1 = driver
        driver1.get(single_url)
        htmlSourceRedirectCheck = driver1.page_source
        # Redirect Check
        redirectCheck = re.findall('404 - Page not found', htmlSourceRedirectCheck)
        if '404 - Page not found' in redirectCheck:
            leaguer1 = single_url
            leagueFinal = re.findall('fotbal/(.*?)/', leaguer1)
            print(str(leagueFinal) + '' + '404 - Page not found')
        else:
            # Wait up to 25 seconds for the odds headers to become clickable
            try:
                loadedOddsCheck = WebDriverWait(driver1, 25)
                loadedOddsCheck.until(EC.element_to_be_clickable(
                    (By.XPATH, ".//h3[contains(@data-params, 'hideShowEvents')]")))
            except TimeoutException:
                pass
            # Expand every section whose odds have not been loaded yet
            unloadedOdds = driver1.find_elements_by_xpath(
                ".//h3[contains(@data-params, 'loadExpandEvents')]")
            for clicking in unloadedOdds:
                clicking.click()
                randomLoadTime2 = random.randint(50, 100)/100
                time.sleep(randomLoadTime2)
            matchArr = []
            leaguer = single_url
            htmlSourceOrig = driver1.page_source
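
One thing I am considering is retrying a link when the 404 text shows up, since the failures seem to be temporary. Below is a rough sketch of that idea, not something I have verified against the site. `load_with_retry`, `fetch_source` and `fake_fetch` are hypothetical names; the fake loader only stands in for the browser so the snippet runs without Selenium.

```python
import time

NOT_FOUND_MARKER = "404 - Page not found"


def load_with_retry(fetch_source, url, attempts=3, delay=0.0):
    """Call fetch_source(url) until the 404 marker is absent.

    Returns the page source on success, or None if every attempt
    still contained the 404 marker.
    """
    for attempt in range(1, attempts + 1):
        source = fetch_source(url)
        if NOT_FOUND_MARKER not in source:
            return source
        if attempt < attempts:
            # Wait a little longer after each failed attempt
            time.sleep(delay * attempt)
    return None


# Stand-in for the browser (assumption): fails twice, then returns a page.
calls = {"count": 0}


def fake_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        return "<h1>404 - Page not found</h1>"
    return "<h3 data-params='hideShowEvents'>odds</h3>"


# "some-league" is a placeholder slug, not a real league on the site.
result = load_with_retry(fake_fetch, "https://rsb.sazka.cz/fotbal/some-league/")
print(result is not None, calls["count"])
```

In the real scraper, `fetch_source` could be a small function that calls `driver.get(url)` and returns `driver.page_source`, with `delay` set to a few seconds.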