
Handling timeout with Selenium and Python


Can anybody help me with this? I have written code to scrape articles from a Chinese news site using Selenium. Since many of the URLs do not load, I tried to add code to catch timeout exceptions. That works, but the browser then seems to stay on the page that timed out rather than moving on to try the next URL.

I've tried adding driver.quit() and driver.close() after handling the error, but then the driver is no longer usable when the loop continues to the next URL.
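
Roughly, this is the pattern that fails (a simplified sketch; webdriver.Chrome() stands in for whichever driver I actually use, and the URL is a made-up example):

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
# a page-load timeout has to be set somewhere for TimeoutException to fire
driver.set_page_load_timeout(30)

urls = ['http://news.163.com/17/0101/00/example.html']  # hypothetical URL

for url in urls:
    try:
        driver.get(url)
    except TimeoutException:
        print("Page load time out")
        driver.quit()  # quit() ends the whole session, so the next
                       # driver.get() fails unless a new driver is created

My full code: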

import os
import re

from selenium import webdriver
from selenium.common.exceptions import (NoSuchElementException,
                                        TimeoutException, WebDriverException)

driver = webdriver.Chrome()        # whichever driver is in use
driver.set_page_load_timeout(30)   # needed for TimeoutException to fire
results = []

with open('url_list_XB.txt', 'r') as f:
    url_list = [line.strip() for line in f]  # strip trailing newlines

for idx, url in enumerate(url_list):
    status = str(idx) + ' ' + url
    print(status)

    try:
        driver.get(url)
        try:
            tblnks = driver.find_elements_by_class_name("post_topshare_wrap")
            for a in tblnks:
                html = a.get_attribute('innerHTML')
                try:
                    link = re.findall('href="http://comment(.+?)" title', str(html))[0]
                    tb_link = 'http://comment' + link
                    print(tb_link)
                    ID = tb_link.replace("http://comment.tie.163.com/","").replace(".html","")
                    print(ID)
                    with open('tb_links.txt', 'a') as p:
                        p.write(tb_link + '\n')
                    try:
                        text = str(driver.find_element_by_class_name("post_text").text)
                        headline = driver.find_element_by_tag_name('h1').text
                        date_elems = driver.find_elements_by_class_name("post_time_source")
                        for d_el in date_elems:
                            date = str(d_el.text)
                            dt = date.split(" 来源")[0]   # "来源" = "source" label on the page
                            dt2 = dt.replace(":", "_").replace("-", "_").replace(" ", "_")

                        count = driver.find_element_by_class_name("post_tie_top").text

                        with open('SCS_DATA/' + dt2 + '_' + ID + '_INT_' + count + '_WY.txt', 'w') as d:
                            d.write(headline)
                            d.write(text + '\n')
                        path = 'SCS_DATA/' + ID
                        os.mkdir(path)

                    except NoSuchElementException as exception:
                        print("Element not found ")
                except IndexError as g:
                    print("Index Error")


            node = [url, tb_link]
            results.append(node)

        except NoSuchElementException as exception:
            print("TB link not found ")
        continue


    except TimeoutException as ex:
        print("Page load time out")

    except WebDriverException:
        print('WD Exception')

I want the code to move through a list of URLs, loading each one and grabbing the article text as well as a link to the discussion page. It works until a page times out while loading; then the program will not move on.
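
For reference, the minimal shape of what I am trying to achieve is below. The recovery step (calling window.stop() via execute_script so the same session can abandon the hung page) is a guess on my part, not something I know to be correct:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException

driver = webdriver.Chrome()        # assumed browser
driver.set_page_load_timeout(30)   # raise TimeoutException after 30 s

with open('url_list_XB.txt', 'r') as f:
    url_list = [line.strip() for line in f]

for idx, url in enumerate(url_list):
    try:
        driver.get(url)
    except TimeoutException:
        print(str(idx) + ' timed out')
        # guess: stop the hung load so the same session can move on
        try:
            driver.execute_script('window.stop()')
        except WebDriverException:
            pass
        continue
    except WebDriverException:
        print(str(idx) + ' WD exception')
        continue
    # ...scrape the loaded page as in the full code above...

driver.quit()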

