Channel: Active questions tagged selenium - Stack Overflow

Extract content from a continuously updating web page, using Python


I am trying to extract table data from the following page:

http://www.mfinante.gov.ro/patrims.html?adbAdbId=4283

The problem is that the page keeps adding rows dynamically, so using requests returns only the initial HTML without the table. I also tried Selenium, waiting for the page to load fully (the number of rows is finite), but Selenium keeps waiting while the page loads until the browser runs out of memory and crashes (at about 100K rows).

My question is: how do I get the content being sent to the page, perhaps in chunks, and save it? Is there a way to simulate the call the browser is making?
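What I imagine is something like the sketch below: open the browser's developer tools, watch the Network tab (XHR/Fetch filter) while the table grows, find the request the rows come from, and replay it page by page. The endpoint URL and parameter names here are guesses on my part, not the site's real API:

```python
def paginate(fetch, limit=1000):
    # Yield successive chunks of rows until the server returns an empty page.
    offset = 0
    while True:
        chunk = fetch(offset, limit)
        if not chunk:
            break
        yield chunk
        offset += len(chunk)

def fetch_rows(offset, limit, adb_id=4283):
    # HYPOTHETICAL endpoint and parameter names: the real URL and params
    # must be copied from the request seen in the browser's Network tab.
    import requests  # third-party: pip install requests
    resp = requests.get(
        "http://www.mfinante.gov.ro/patrims-data",  # placeholder URL
        params={"adbAdbId": adb_id, "start": offset, "count": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# for chunk in paginate(fetch_rows, limit=1000):
#     ... save each chunk as it arrives instead of keeping everything in memory
```

Is this roughly the right approach, and how would I discover the real request?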

Here is what I have managed with Selenium, which works for smaller samples (e.g. adbAdbId=30):

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Generous timeout (seconds) for the table element to appear
delay = 800

options = webdriver.ChromeOptions()
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path="chromedriver.exe")
driver.set_page_load_timeout(1000)
url = 'http://www.mfinante.gov.ro/patrims.html?adbAdbId=30'
driver.get(url)

try:
    WebDriverWait(driver, delay).until(
        EC.presence_of_element_located((By.ID, 'patrims')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")


rows = driver.find_elements_by_xpath("//table[@id='patrims']/tbody/tr")
print(len(rows))

listofdicts = []

def builder(outputlist, inputlist):
    """Parse each table row's inner HTML and append a dict of its fields."""
    for row in inputlist:
        soup = BeautifulSoup(row.get_attribute('innerHTML'), 'html.parser')
        td = soup.find_all('td')

        # Leading cells are indexed from the front, trailing cells from the
        # end of the row, since the cell count can vary in between.
        d = {
            "Legend": soup.find("legend").get_text().strip(),
            "Localitatea": td[2].get_text().strip(),
            "Strada": td[4].get_text().strip(),
            "Descriere Tehnica": td[6].get_text().strip(),
            "Cod de identificare": td[-7].get_text().strip(),
            "Anul dobandirii sau darii in folosinta": td[-6].get_text().strip(),
            "Valoare": td[-5].get_text().strip(),
            "Situatie juridica": td[-4].get_text().strip(),
            "Situatie juridica actuala": td[-3].get_text().strip(),
            "Tip bun": td[-2].get_text().strip(),
            "Stare bun": td[-1].get_text().strip(),
        }
        outputlist.append(d)
    print('done!')



builder(listofdicts, rows)

print('writing result')
frame = pd.DataFrame(listofdicts)
frame.to_csv(r'output30.csv')
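I suspect part of the memory problem is my own script holding all the parsed dicts in a list before building the DataFrame. One change I am considering (a sketch, using only the stdlib csv module and the same column names as my code above) is appending each parsed batch straight to the output file:

```python
import csv

FIELDNAMES = [
    "Legend", "Localitatea", "Strada", "Descriere Tehnica",
    "Cod de identificare", "Anul dobandirii sau darii in folosinta",
    "Valoare", "Situatie juridica", "Situatie juridica actuala",
    "Tip bun", "Stare bun",
]

def append_batch(path, batch, write_header=False):
    # Append a batch of parsed row dicts to the CSV; nothing accumulates
    # in memory between batches.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerows(batch)
```

That way I could call `append_batch` every few thousand rows instead of building one giant list, but I would still need the rows to stop arriving at some point.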

