Channel: Active questions tagged selenium - Stack Overflow

Seeking a faster way to scrape these pages

I have 612 pages to scrape, and all of the links are stored in a list called 'links'. Here is a link to one of them:

http://www.tennisabstract.com/cgi-bin/player-classic.cgi?p=NovakDjokovic&f=ACareerqq

I have some code that successfully scrapes what I want: the large table with the headers 'date', 'tournament', 'surface', and so on.

The problem for me is how slow it is.

Here is what I've got at the moment:

from selenium import webdriver

driver = webdriver.Firefox()
for x in links:

    driver.get(x)

    rows = driver.find_elements_by_xpath('//table[@id="matches"]//tbody//tr')

    data = driver.find_elements_by_xpath('//table[@id="matches"]//tbody//tr//td')

    for i in range(len(rows)-1):
        date = data[17*i].text
        tournament = data[17*i+1].text
        surface = data[17*i+2].text
        rd = data[17*i+3].text
        rk = data[17*i+4].text
        vrk = data[17*i+5].text
        match = data[17*i+6].text
        score = data[17*i+7].text
        more = data[17*i+8].text
        dr = data[17*i+9].text
        ace = data[17*i+10].text
        df = data[17*i+11].text
        first_in = data[17*i+12].text
        first_won = data[17*i+13].text
        second_won = data[17*i+14].text
        bp_saved = data[17*i+15].text
        duration = data[17*i+16].text

        with open('serve.csv','a', encoding = 'utf-8') as r:
            r.write(date + "," + tournament + "," + surface + "," + rd + "," + rk + "," + vrk + "," + match + "," + score + "," + more + "," + dr + "," + ace + "," + df + "," + first_in + "," + first_won + "," + second_won + "," + bp_saved + "," + duration + "\n")

This gives me my desired output, but running it all through the night yielded only 12 scraped pages.

Why is this code so slow and what can I do to improve it?

Thanks
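
One likely contributor to the slowness, offered as a guess rather than a measured diagnosis: every `.text` access in the loop above is a separate round trip between Python and the browser, so a page with many rows triggers thousands of WebDriver calls, and the CSV file is also reopened once per row. The sketch below shows one way to reduce that, assuming the same `table#matches` layout with 17 cells per data row. It pulls all cell text in a single `execute_script` call and opens the output file once. The names `CELLS_JS`, `scrape_rows`, and `scrape_all` are hypothetical helpers, not part of Selenium, and unlike the loop above this version keeps every 17-cell row instead of skipping the last one.

```python
import csv

# JavaScript executed inside the browser: gathers the text of every cell of the
# matches table and returns it as a list of rows (each row a list of strings).
# Assumption: the table still has id="matches", as in the XPath used above.
CELLS_JS = """
return Array.from(document.querySelectorAll('table#matches tbody tr')).map(
    tr => Array.from(tr.querySelectorAll('td')).map(td => td.innerText.trim())
);
"""


def scrape_rows(driver, expected_cols=17):
    """Return the data rows of the current page using one WebDriver call."""
    rows = driver.execute_script(CELLS_JS)
    # Keep only rows with the expected number of cells; rows with a
    # different cell count (headers, spacers) are dropped.
    return [row for row in rows if len(row) == expected_cols]


def scrape_all(driver, links, out_path="serve.csv"):
    """Visit each link and append its rows to a CSV file that is opened once."""
    with open(out_path, "a", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)  # quotes any field that itself contains a comma
        for url in links:
            driver.get(url)
            writer.writerows(scrape_rows(driver))
```

With the existing `driver` and `links`, this would be called as `scrape_all(driver, links)`. Using `csv.writer` instead of joining strings with commas also protects against fields such as tournament names or scores that contain a comma.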

