Web scraping a JavaScript page in Python

I am trying to scrape a JavaScript-rendered page with Python: https://search.gleif.org/#/search/

However, while I am able to retrieve some information, it is not what I am looking for. All I get back is the page shell:

<!DOCTYPE html>
<html>
<head><meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<title>LEI Search 2.0</title>
<link href="/static/icons/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="https://fonts.googleapis.com/css?family=Open+Sans:200,300,400,600,700,900&amp;subset=cyrillic,cyrillic-ext,greek,greek-ext,latin-ext,vietnamese" rel="stylesheet"/>
<link href="/static/css/main.045139db483277222eb714c1ff8c54f2.css" rel="stylesheet"/></head>
<body>
<div id="app"></div>
<script src="/static/js/manifest.2ae2e69a05c33dfc65f8.js" type="text/javascript"></script>
<script src="/static/js/vendor.6bd9028998d5ca3bb72f.js" type="text/javascript"></script>
<script src="/static/js/main.5da23c5198041f0ec5af.js" type="text/javascript"></script>
</body>
</html>

Instead of the script references (e.g. script src="/static/js/manifest.2ae2e69a05c33dfc65f8.js" type="text/javascript"), I would like the content of the results table so that I can store it. My understanding is that the table is filled in by JavaScript after the page loads, which would explain why requests only sees the empty div id="app".

So far, I have been able to scrape other pages with the following code:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd  # used to store the table later

# Define the different proxies.
# Due to the firewall, a user and password must be provided.
http_proxy  = "http://username:pwd@proxy"
https_proxy = "https://username:pwd@proxy"
ftp_proxy   = "ftp://username:pwd@proxy"

proxyDict = {"http"  : http_proxy,
             "https" : https_proxy,
             "ftp"   : ftp_proxy}

url = 'https://www.bundesbank.de/dynamic/action/en/homepage/search/statistics/749206/real-time-data'

response = requests.get(url, proxies=proxyDict)
soup = bs(response.content, 'html.parser')  # parse the returned HTML
rows = soup.find_all('tr')                  # collect the table rows

print(soup)

The previous code gives me the content I am looking for on ordinary pages, but not on JavaScript-rendered ones.
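One idea I had: since the table is rendered client-side, the data presumably arrives from a backend JSON endpoint that I could call directly with requests instead of scraping the HTML. The endpoint URL and field names below are guesses that would need to be confirmed in the browser's DevTools Network tab; this is only a sketch of how I would flatten such a payload:

```python
# Sketch: flatten a JSON payload of the shape the search page presumably
# fetches. The endpoint and field names are assumptions, not confirmed --
# e.g. something like https://api.gleif.org/api/v1/lei-records
# (check DevTools -> Network while the search runs).

def records_to_rows(payload):
    """Turn a JSON:API-style payload into a list of flat dicts."""
    rows = []
    for item in payload.get("data", []):
        attrs = item.get("attributes", {})
        rows.append({
            "lei": attrs.get("lei"),
            "name": attrs.get("entity", {}).get("legalName", {}).get("name"),
        })
    return rows

# Dummy payload in the assumed shape (values are placeholders):
sample = {"data": [{"attributes": {
    "lei": "00000000000000000000",
    "entity": {"legalName": {"name": "Example Corp"}}}}]}

rows = records_to_rows(sample)
# rows -> [{"lei": "00000000000000000000", "name": "Example Corp"}]
```

If the real endpoint exists, the resulting list of dicts could be fed straight into pandas.DataFrame for storage.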

I have even tried to use Selenium with Firefox. However, Firefox requires authentication (a username and password prompt, presumably from the corporate proxy) to open the URL, so I cannot access the page.

Please find below the code used.

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

binary = r'C:\Users\user\AppData\Local\Mozilla Firefox\firefox.exe'
gecko = r'C:\Users\user\geckodriver\geckodriver.exe'

options = Options()
options.headless = False          # set_headless() is deprecated
options.binary_location = binary

cap = DesiredCapabilities().FIREFOX
cap["marionette"] = True  # optional

# firefox_options= is deprecated in favour of options=
driver = webdriver.Firefox(options=options, capabilities=cap, executable_path=gecko)
driver.get("http://google.com/")

print("Headless Firefox Initialized")
driver.quit()
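Another option I have seen mentioned for the proxy problem is selenium-wire, a drop-in extension of Selenium that accepts authenticated proxies directly, which plain Firefox preferences do not. This is only a sketch, assuming the selenium-wire package is installed; the proxy host and port are placeholders:

```python
# Sketch, assuming the selenium-wire package ("pip install selenium-wire").
# It wraps Selenium and takes user:password proxies via an options dict;
# "proxy:8080" below is a placeholder for the real corporate proxy.
seleniumwire_options = {
    "proxy": {
        "http": "http://username:pwd@proxy:8080",
        "https": "https://username:pwd@proxy:8080",
        "no_proxy": "localhost,127.0.0.1",
    }
}

# from seleniumwire import webdriver  # drop-in replacement for selenium's webdriver
# driver = webdriver.Firefox(seleniumwire_options=seleniumwire_options)
# driver.get("https://search.gleif.org/#/search/")
```

The driver lines are commented out here because they need a running browser; the point is that the credentials live in the options dict rather than in a browser prompt.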

Please do not hesitate to ask if you need more information.

Have a good day

