I'm trying to scrape content from a page similar to this: https://www.newsweek.pl/nwpl_2018002_20181231
It has a "More" (pl. Więcej) button at the bottom of the page, which dynamically loads the next batch of articles. I would prefer to use Scrapy for the task, because my other spiders use it, but first I need all of the article URLs, so I'm trying to click()
this button with Selenium as follows:
def parse_issue(self, response):
    self.logger.info('Parse function called parse_issue on {}'.format(response.url))
    self.driver.get(response.url)
    while True:
        try:
            more_button = self.driver.find_element_by_xpath('//div[@class="showMoreBtn"]')
            time.sleep(2)
            more_button.click()
            time.sleep(5)
            print('clicked.')
        except Exception as e:
            print(e)
            break
    articles_elements = self.driver.find_elements_by_xpath('.//div[@class="pure-u-1-1 pure-u-md-1-4 smallItem"]/a')
    articles_url = [element.get_attribute("href") for element in articles_elements]
    print(articles_url, response.url)
Unfortunately, as a result I only get the URLs of the articles that are already present in the initial page source. Can someone suggest what I'm doing wrong?
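For completeness, here is a standalone version of the same loop, rewritten to use Selenium's explicit waits instead of time.sleep (a sketch only, assuming Selenium's WebDriverWait/expected_conditions API; the XPaths are the ones from my spider above, and collect_article_urls is just an illustrative name):

```python
def collect_article_urls(driver, url):
    """Click the "More" button until it stops appearing, then gather article links.

    Sketch only: `driver` is assumed to be an already-created Selenium WebDriver.
    """
    # Imports kept inside the function so the sketch can be pasted as-is.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException

    driver.get(url)
    while True:
        try:
            # Wait up to 10 s for the button to become clickable instead of
            # sleeping for a fixed interval.
            more_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable(
                    (By.XPATH, '//div[@class="showMoreBtn"]')))
            more_button.click()
        except TimeoutException:
            # No clickable "More" button within the timeout: assume all
            # articles have been loaded.
            break

    elements = driver.find_elements(
        By.XPATH, '//div[@class="pure-u-1-1 pure-u-md-1-4 smallItem"]/a')
    return [element.get_attribute("href") for element in elements]
```

This version still exits the loop via an exception, like my original code, but only after Selenium has actively waited for the button, so slow AJAX loads should not end the loop prematurely.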