I have 612 pages to scrape; here is a link to one of them. All the links are stored in a list 'links':
http://www.tennisabstract.com/cgi-bin/player-classic.cgi?p=NovakDjokovic&f=ACareerqq
I have some code that successfully scrapes what I want (the large table with headers 'date', 'tournament', 'surface', etc.). The problem for me is how slow it is.
Here is what I've got at the moment:
```python
from selenium import webdriver

driver = webdriver.Firefox()
for x in links:
    driver.get(x)
    rows = driver.find_elements_by_xpath('//table[@id="matches"]//tbody//tr')
    data = driver.find_elements_by_xpath('//table[@id="matches"]//tbody//tr//td')
    for i in range(len(rows) - 1):
        date = data[17*i].text
        tournament = data[17*i + 1].text
        surface = data[17*i + 2].text
        rd = data[17*i + 3].text
        rk = data[17*i + 4].text
        vrk = data[17*i + 5].text
        match = data[17*i + 6].text
        score = data[17*i + 7].text
        more = data[17*i + 8].text
        dr = data[17*i + 9].text
        ace = data[17*i + 10].text
        df = data[17*i + 11].text
        first_in = data[17*i + 12].text
        first_won = data[17*i + 13].text
        second_won = data[17*i + 14].text
        bp_saved = data[17*i + 15].text
        duration = data[17*i + 16].text
        with open('serve.csv', 'a', encoding='utf-8') as r:
            r.write(date + "," + tournament + "," + surface + "," + rd + "," +
                    rk + "," + vrk + "," + match + "," + score + "," + more + "," +
                    dr + "," + ace + "," + df + "," + first_in + "," + first_won + "," +
                    second_won + "," + bp_saved + "," + duration + "\n")
```
This gives me my desired output, but running it all through the night only got through 12 of the 612 pages.
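My guess is that the bottleneck is all the individual `.text` accesses: each one is a separate round trip from Python to the browser, so every row costs 17 driver calls. Would it be faster to grab `driver.page_source` once per page and parse the table locally? Here is a self-contained sketch of the kind of local parse I mean, using the stdlib `html.parser` on a made-up HTML snippet (in the real script the string would come from `page_source`):

```python
from html.parser import HTMLParser

# Invented sample standing in for driver.page_source.
SAMPLE = """
<table id="matches"><tbody>
<tr><td>2023-01-16</td><td>Australian Open</td><td>Hard</td></tr>
<tr><td>2022-08-29</td><td>US Open</td><td>Hard</td></tr>
</tbody></table>
"""

class TableParser(HTMLParser):
    """Collect all <td> texts, grouped into one list per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows = []        # list of rows, each a list of cell strings
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'td':
            self._in_td = True
            self.rows[-1].append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.rows[-1][-1] += data

parser = TableParser()
parser.feed(SAMPLE)
print(parser.rows)
# → [['2023-01-16', 'Australian Open', 'Hard'],
#    ['2022-08-29', 'US Open', 'Hard']]
```

That way there would be one `page_source` call per page instead of ~17 driver calls per row, but I have not tested whether this actually fixes the slowness on the real site.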
Why is this code so slow and what can I do to improve it?
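One thing I did notice while writing this up: building each CSV line by joining fields with "," will silently misalign columns if a scraped value itself contains a comma (a tournament name might). Independent of the speed issue, I plan to switch to the csv module, which quotes such fields. A sketch with invented values:

```python
import csv
import io

# Write to an in-memory buffer here just to show the quoting; in the real
# script this would be the open('serve.csv', 'a', ...) file handle.
buf = io.StringIO()
writer = csv.writer(buf)

# The middle field contains a comma; ",".join() would have split this into
# four columns, but csv.writer quotes it so it stays one column.
writer.writerow(['2023-01-16', 'Melbourne, Australia', 'Hard'])
print(buf.getvalue())
# → 2023-01-16,"Melbourne, Australia",Hard
```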
Thanks