Channel: Active questions tagged selenium - Stack Overflow

Scrapy-Selenium NYTimes issue

I'm stuck trying to parse a NYTimes page using Scrapy-Selenium. Link to the page: https://www.nytimes.com/2019/03/21/travel/what-to-do-in-hoi-an-vietnam.html

As far as I understand, it's a JavaScript-driven page. When I disable JavaScript with a Chrome browser extension, I see grey placeholders in place of some photographs.

[Screenshots: the page with JavaScript enabled and with JavaScript disabled]

The following snippet is the markup for this image with JS enabled:

<div data-testid="lazyimage-container" style="height: auto; cursor: pointer;">
<img alt="" class="css-1h6w7uo e1t57l6r0" src="https://static01.nyt.com/images/2019/03/21/travel/21Hours-Hoi-An1/merlin_151549596_96de6b6d-174d-4cdb-add2-b77b5612ffab-articleLarge.jpg?quality=75&amp;auto=webp&amp;disable=upscale" srcset="https://static01.nyt.com/images/2019/03/21/travel/21Hours-Hoi-An1/merlin_151549596_96de6b6d-174d-4cdb-add2-b77b5612ffab-articleLarge.jpg?quality=90&amp;auto=webp 600w,https://static01.nyt.com/images/2019/03/21/travel/21Hours-Hoi-An1/merlin_151549596_96de6b6d-174d-4cdb-add2-b77b5612ffab-jumbo.jpg?quality=90&amp;auto=webp 1024w,https://static01.nyt.com/images/2019/03/21/travel/21Hours-Hoi-An1/merlin_151549596_96de6b6d-174d-4cdb-add2-b77b5612ffab-superJumbo.jpg?quality=90&amp;auto=webp 2048w" sizes="((min-width: 600px) and (max-width: 1004px)) 84vw, (min-width: 1005px) 80vw, 100vw" itemprop="url" itemid="https://static01.nyt.com/images/2019/03/21/travel/21Hours-Hoi-An1/merlin_151549596_96de6b6d-174d-4cdb-add2-b77b5612ffab-articleLarge.jpg?quality=75&amp;auto=webp&amp;disable=upscale" style="opacity: 1;">
</div>

Without JS there's just an empty div:

<div data-testid="lazyimage-container" style="height:257.77777777777777px"></div>
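
One way to check which of the two states the spider actually receives is to count the `lazyimage-container` divs in the response and see how many of them contain an `img`. The following is only a minimal sketch using the standard library's `html.parser`; the names `LazyContainerCounter` and `count_lazy_containers` are made up for illustration, and it assumes the containers are not nested inside each other.

```python
from html.parser import HTMLParser


class LazyContainerCounter(HTMLParser):
    """Counts lazyimage-container divs with and without an <img> inside."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # div nesting depth inside the current container (0 = outside)
        self.has_img = False  # whether the current container has held an <img> so far
        self.filled = 0       # containers that held at least one <img>
        self.empty = 0        # containers that held no <img> (placeholders)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth:
                # A nested div inside the container we are tracking.
                self.depth += 1
            elif dict(attrs).get("data-testid") == "lazyimage-container":
                self.depth = 1
                self.has_img = False
        elif tag == "img" and self.depth:
            self.has_img = True

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # The container itself has just been closed.
                if self.has_img:
                    self.filled += 1
                else:
                    self.empty += 1


def count_lazy_containers(html):
    """Return (filled, empty) counts of lazyimage-container divs in html."""
    parser = LazyContainerCounter()
    parser.feed(html)
    parser.close()
    return parser.filled, parser.empty
```

Inside a spider callback this could be called as `count_lazy_containers(response.text)`; a non-zero second value would suggest that some placeholders had not been filled when the page source was captured.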

My Scrapy spider:

import scrapy
from scrapy_selenium import SeleniumRequest


class NytimesSpider(scrapy.Spider):
    name = "nyt"

    start_urls = ["https://www.nytimes.com/2019/03/21/travel/what-to-do-in-hoi-an-vietnam.html"]

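    def start_requests(self):
        # Send every start URL through Selenium so the page is rendered in a browser.
        for url in self.start_urls:
            yield SeleniumRequest(url=url, callback=self.parse_result)

    def parse_result(self, response):
        # Print the src of every <img> found in the rendered page source.
        print("=" * 60)
        imgs = response.css("img::attr(src)").getall()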
        for img in imgs:
            print(img)
            print("")
        print("=" * 60)

The output:

============================================================
https://static01.nyt.com/images/2019/03/24/travel/21Hours-Hoi-An6/merlin_151545219_ba2c9daa-c40a-4d52-80fe-ba679f3a98c2-articleLarge.jpg?quality=75&auto=webp&disable=upscale

https://static01.nyt.com/images/2019/03/24/travel/21Hours-Hoi-An6/merlin_151545219_ba2c9daa-c40a-4d52-80fe-ba679f3a98c2-articleLarge.jpg?quality=75&auto=webp&disable=upscale

https://static01.nyt.com/images/2018/02/25/travel/25vietnam1/merlin_133277466_698b9b08-f2d5-43c4-a44e-978ddc23cbac-videoLarge.jpg

https://static01.nyt.com/images/2018/12/26/travel/26PTG-LAOS-COMBO-promo/26PTG-LAOS-COMBO-promo-threeByTwoSmallAt2X-v6.jpg

https://static01.nyt.com/images/2019/03/21/travel/21Hours-Hoi-An2/merlin_151543719_ee268c49-2cac-47a6-855c-dedcb8fc7676-articleLarge.jpg?quality=75&auto=webp&disable=upscale

https://static01.nyt.com/images/2019/01/07/travel/52-PROMO/52-PROMO-articleLarge.jpg

https://mwcm.nyt.com/dam/mkt_assets/exo/img/nyt-logo-379x64.svg

https://et.nytimes.com/pixel?url=https://www.nytimes.com/2019/03/21/travel/what-to-do-in-hoi-an-vietnam.html&referrer=&subject=module-interactions&moduleData=%7B%22module%22%3A%22nyt-vi-page-pixel%22%2C%22pgType%22%3A%22%22%2C%22eventName%22%3A%22Impression%22%2C%22action%22%3A%22Impression%22%7D&sourceApp=nyt-vi&instant=1&_=1553234896724

https://et.nytimes.com/pixel.gif?subject=ab-expose&test=PER_MoreIn_World&variant=3_au_most_popular&url=https%3A%2F%2Fwww.nytimes.com%2F2019%2F03%2F21%2Ftravel%2Fwhat-to-do-in-hoi-an-vietnam.html&instant=1&skipAugment=true&gtm=GTM-P528B3-284-Production&et2_pageview_id=yrkmw_cn5c1oW40tVV_VdoTl

============================================================

The problem is that the required picture is missing from the result list. Its src is https://static01.nyt.com/images/2019/03/21/travel/21Hours-Hoi-An1/merlin_151549596_96de6b6d-174d-4cdb-add2-b77b5612ffab-articleLarge.jpg?quality=75&auto=webp&disable=upscale
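
To make this check repeatable instead of scanning the printed list by eye, the expected `src` values can be compared with the extracted ones by URL path, so that differences in the query string (for example `&amp;` in the markup versus `&` in the output) do not matter. This is just a sketch; `image_path` and `missing_images` are hypothetical helper names, not part of Scrapy or scrapy-selenium.

```python
from urllib.parse import urlsplit


def image_path(src):
    """Return only the path component of an image URL, dropping the query string."""
    return urlsplit(src).path


def missing_images(expected_srcs, found_srcs):
    """Return the expected srcs whose path does not appear among the found srcs."""
    found_paths = {image_path(src) for src in found_srcs}
    return [src for src in expected_srcs if image_path(src) not in found_paths]
```

In `parse_result`, `missing_images([expected_src], imgs)` would then return the expected URL when the picture is absent and an empty list when it is present.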

The whole command line log is:

(nlp2) D:\Python\_Project\Scraping_train_data\snyt>scrapy crawl nyt
2019-03-22 09:08:11 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: snyt)
2019-03-22 09:08:11 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.5, Platform Windows-10-10.0.17763-SP0
2019-03-22 09:08:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'snyt', 'NEWSPIDER_MODULE': 'snyt.spiders', 'SPIDER_MODULES': ['snyt.spiders']}
2019-03-22 09:08:11 [scrapy.extensions.telnet] INFO: Telnet Password: 4d9b971e8de9258e
2019-03-22 09:08:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-03-22 09:08:14 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:56203/session {"capabilities": {"firstMatch": [{}], "alwaysMatch": {"browserName": "firefox", "acceptInsecureCerts": true, "moz:firefoxOptions": {"args": ["--headless"]}}}, "desiredCapabilities": {"browserName": "firefox", "acceptInsecureCerts": true, "marionette": true, "moz:firefoxOptions": {"args": ["--headless"]}}}
2019-03-22 09:08:14 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 127.0.0.1:56203
2019-03-22 09:08:16 [urllib3.connectionpool] DEBUG: http://127.0.0.1:56203 "POST /session HTTP/1.1" 200 702
2019-03-22 09:08:16 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2019-03-22 09:08:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy_selenium.SeleniumMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-03-22 09:08:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-03-22 09:08:16 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-03-22 09:08:16 [scrapy.core.engine] INFO: Spider opened
2019-03-22 09:08:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-22 09:08:16 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-03-22 09:08:16 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:56203/session/fa7fe711-db01-4b58-8d86-2efd31b23529/url {"url": "https://www.nytimes.com/2019/03/21/travel/what-to-do-in-hoi-an-vietnam.html"}
2019-03-22 09:08:24 [urllib3.connectionpool] DEBUG: http://127.0.0.1:56203 "POST /session/fa7fe711-db01-4b58-8d86-2efd31b23529/url HTTP/1.1" 200 14
2019-03-22 09:08:24 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2019-03-22 09:08:24 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:56203/session/fa7fe711-db01-4b58-8d86-2efd31b23529/source {}
2019-03-22 09:08:24 [urllib3.connectionpool] DEBUG: http://127.0.0.1:56203 "GET /session/fa7fe711-db01-4b58-8d86-2efd31b23529/source HTTP/1.1" 200 1971834
2019-03-22 09:08:24 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2019-03-22 09:08:24 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:56203/session/fa7fe711-db01-4b58-8d86-2efd31b23529/url {}
2019-03-22 09:08:24 [urllib3.connectionpool] DEBUG: http://127.0.0.1:56203 "GET /session/fa7fe711-db01-4b58-8d86-2efd31b23529/url HTTP/1.1" 200 87
2019-03-22 09:08:24 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2019-03-22 09:08:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nytimes.com/2019/03/21/travel/what-to-do-in-hoi-an-vietnam.html> (referer: None)
============================================================
https://static01.nyt.com/images/2019/03/24/travel/21Hours-Hoi-An6/merlin_151545219_ba2c9daa-c40a-4d52-80fe-ba679f3a98c2-articleLarge.jpg?quality=75&auto=webp&disable=upscale

https://static01.nyt.com/images/2019/03/24/travel/21Hours-Hoi-An6/merlin_151545219_ba2c9daa-c40a-4d52-80fe-ba679f3a98c2-articleLarge.jpg?quality=75&auto=webp&disable=upscale

https://static01.nyt.com/images/2018/02/25/travel/25vietnam1/merlin_133277466_698b9b08-f2d5-43c4-a44e-978ddc23cbac-videoLarge.jpg

https://static01.nyt.com/images/2018/12/26/travel/26PTG-LAOS-COMBO-promo/26PTG-LAOS-COMBO-promo-threeByTwoSmallAt2X-v6.jpg

https://static01.nyt.com/images/2019/03/21/travel/21Hours-Hoi-An2/merlin_151543719_ee268c49-2cac-47a6-855c-dedcb8fc7676-articleLarge.jpg?quality=75&auto=webp&disable=upscale

https://static01.nyt.com/images/2019/01/07/travel/52-PROMO/52-PROMO-articleLarge.jpg

https://mwcm.nyt.com/dam/mkt_assets/exo/img/nyt-logo-379x64.svg

https://et.nytimes.com/pixel?url=https://www.nytimes.com/2019/03/21/travel/what-to-do-in-hoi-an-vietnam.html&referrer=&subject=module-interactions&moduleData=%7B%22module%22%3A%22nyt-vi-page-pixel%22%2C%22pgType%22%3A%22%22%2C%22eventName%22%3A%22Impression%22%2C%22action%22%3A%22Impression%22%7D&sourceApp=nyt-vi&instant=1&_=1553234896724

https://et.nytimes.com/pixel.gif?subject=ab-expose&test=PER_MoreIn_World&variant=3_au_most_popular&url=https%3A%2F%2Fwww.nytimes.com%2F2019%2F03%2F21%2Ftravel%2Fwhat-to-do-in-hoi-an-vietnam.html&instant=1&skipAugment=true&gtm=GTM-P528B3-284-Production&et2_pageview_id=yrkmw_cn5c1oW40tVV_VdoTl

============================================================
2019-03-22 09:08:25 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-22 09:08:25 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:56203/session/fa7fe711-db01-4b58-8d86-2efd31b23529 {}
2019-03-22 09:08:26 [urllib3.connectionpool] DEBUG: http://127.0.0.1:56203 "DELETE /session/fa7fe711-db01-4b58-8d86-2efd31b23529 HTTP/1.1" 200 14
2019-03-22 09:08:26 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2019-03-22 09:08:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 1915145,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 3, 22, 6, 8, 25, 30708),
 'log_count/DEBUG': 18,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 3, 22, 6, 8, 16, 33466)}
2019-03-22 09:08:26 [scrapy.core.engine] INFO: Spider closed (finished)

I added these lines to settings.py according to the instructions (https://github.com/clemfromspace/scrapy-selenium):

from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

I'm new to scraping JavaScript-based sites, but I've successfully parsed the https://edition.cnn.com/search/?q=war page with Scrapy-Selenium, so the Scrapy project settings are probably right.

Where is my mistake? Why doesn't the spider see all the pictures?

Thank you in advance.

