I'm trying to get Selenium-Wire to work in an AWS Lambda function. I've found very few Stack Overflow posts about it, but it seems some people have succeeded. My Lambda is stateless and doesn't need any other AWS feature (such as S3). It would scrape a page and capture the JSON response of one specific AJAX call made by that page.
Here is my Dockerfile:
FROM public.ecr.aws/lambda/python:3.9
# Should I go with python:3.8 instead?

# Install the function's dependencies using file requirements.txt
# from your project folder.
RUN yum makecache
# https://stackoverflow.com/questions/73056540/no-module-named-amazon-linux-extras-when-running-amazon-linux-extras-install-epe
RUN yum install -y amazon-linux-extras
# https://stackoverflow.com/questions/72077341/how-do-you-install-chrome-on-amazon-linux-2
RUN PYTHON=python2 amazon-linux-extras install epel -y
# https://stackoverflow.com/questions/72850004/no-package-zbar-available-in-lambda-layer
RUN yum makecache
RUN yum install -y chromium
ENV CHROMIUM_PATH=/usr/bin/chromium-browser
# or RUN yum install -y google-chrome-stable
# or https://intoli.com/blog/installing-google-chrome-on-centos/
# curl https://intoli.com/install-google-chrome.sh | bash
# https://devopsqa.wordpress.com/2018/03/08/install-google-chrome-and-chromedriver-in-amazon-linux-machine/
# https://www.usessionbuddy.com/post/How-To-Install-Selenium-Chrome-On-Centos-7/
RUN yum install -y chromedriver

RUN pip install --upgrade pip
COPY requirements.txt .
RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

# Copy function code
COPY app.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "app.handler" ]
My requirements.txt, pretty minimal:
selenium-wire==5.1.0
And my Lambda function:
from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service


def handler(event, context):
    # https://gist.github.com/rengler33/f8b9d3f26a518c08a414f6f86109863c
    # https://github.com/wkeeling/selenium-wire/issues/131
    chrome_options = webdriver.ChromeOptions()
    chrome_option_list = {
        "disable-extensions",
        "disable-gpu",
        "no-sandbox",
        "headless",  # for Jenkins
        "disable-dev-shm-usage",  # Jenkins
        "window-size=800x600",  # Jenkins
        "window-size=800,600",
        "disable-setuid-sandbox",
        "allow-insecure-localhost",
        "no-cache",
        "user-data-dir=/tmp/user-data",
        "hide-scrollbars",
        "enable-logging",
        "log-level=0",
        "single-process",
        "data-path=/tmp/data-path",
        "ignore-certificate-errors",
        "homedir=/tmp",
        "disk-cache-dir=/tmp/cache-dir",
        "start-maximized",
        "disable-software-rasterizer",
        "ignore-certificate-errors-spki-list",
        "ignore-ssl-errors",
    }
    for chrome_option in chrome_option_list:
        chrome_options.add_argument(f"--{chrome_option}")

    selenium_options = {
        "request_storage_base_dir": "/tmp",  # Use /tmp to store captured data
        "exclude_hosts": ""
    }

    ser = Service("/usr/bin/chromedriver")
    ser.service_args=["--verbose", "--log-path=test.log"]
    driver = webdriver.Chrome(service=ser, options=chrome_options, seleniumwire_options=selenium_options)

    # The meat
    # ...

    return result
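A side note on the option handling above: chrome_option_list is a set, so the order in which the switches reach Chrome can differ between runs, and it contains two window-size entries. This may well be unrelated to the crash, but a minimal sketch like the following would make the arguments reproducible while debugging. The helper name build_chrome_args is hypothetical and not part of Selenium or Selenium-Wire; it assumes that keeping only the last value per switch name is acceptable.

```python
def build_chrome_args(switches):
    """Return '--switch' arguments in a stable order, one per switch name.

    Hypothetical helper: if a switch name (the part before '=') occurs
    more than once, only its last occurrence is kept.
    """
    latest = {}
    for switch in switches:
        name = switch.split("=", 1)[0]
        # Drop any earlier entry so the last occurrence decides the position.
        latest.pop(name, None)
        latest[name] = switch
    return [f"--{switch}" for switch in latest.values()]
```

With a list instead of a set as input, the loop in the handler could then become `for arg in build_chrome_args(chrome_option_list): chrome_options.add_argument(arg)`.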
I built an image from the Dockerfile and uploaded it to AWS ECR. The Docker image passes the classic "it works on my machine (TM)" test: it scrapes fine in a Docker container on my laptop. However, it returns an error when I run it as a Lambda function (based on my own image):
START RequestId: 3f767106-e6f5-4c5c-8930-e77b7314eb3b Version: $LATEST
[ERROR] WebDriverException: Message: Service /usr/bin/chromedriver unexpectedly exited. Status code was: 1
Traceback (most recent call last):
  File "/var/task/app.py", line 43, in handler
    driver = webdriver.Chrome(service=ser, options=chrome_options, seleniumwire_options=selenium_options)
  File "/var/task/seleniumwire/webdriver.py", line 218, in __init__
    super().__init__(*args, **kwargs)
  File "/var/task/selenium/webdriver/chrome/webdriver.py", line 80, in __init__
    super().__init__(
  File "/var/task/selenium/webdriver/chromium/webdriver.py", line 101, in __init__
    self.service.start()
  File "/var/task/selenium/webdriver/common/service.py", line 104, in start
    self.assert_process_still_running()
  File "/var/task/selenium/webdriver/common/service.py", line 117, in assert_process_still_running
    raise WebDriverException(f"Service {self.path} unexpectedly exited. Status code was: {return_code}")
END RequestId: 3f767106-e6f5-4c5c-8930-e77b7314eb3b
REPORT RequestId: 3f767106-e6f5-4c5c-8930-e77b7314eb3b Duration: 758.10 ms Billed Duration: 1361 ms Memory Size: 128 MB Max Memory Used: 91 MB Init Duration: 602.74 ms
I also experimented with other Chrome switches, such as those mentioned in the question "selenium.common.exceptions.webdriverexception: message: 'chromedriver.exe' unexpectedly exited.status code was: 1", with no luck. I always get status code 1, but I couldn't find any documentation on what exactly it means. I assume it's some very blatant error.
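Since the exception only reports the exit code, one way to get more detail might be to start the binaries directly inside the Lambda and log whatever they print. The following is a minimal diagnostic sketch using only the standard library. The function names run_and_capture and diagnose are hypothetical, the paths are the ones from the Dockerfile and handler above, and the exit codes 127 and -1 are conventions chosen here for "could not start" and "timed out", not values reported by chromedriver.

```python
import subprocess


def run_and_capture(cmd, timeout=10):
    """Run a command and return (exit_code, combined stdout and stderr)."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            timeout=timeout,
        )
        return proc.returncode, proc.stdout
    except OSError as exc:
        # Binary missing or not executable (assumed convention: 127).
        return 127, str(exc)
    except subprocess.TimeoutExpired:
        # Process did not finish in time (assumed convention: -1).
        return -1, f"timed out after {timeout}s"


def diagnose():
    """Print exit code and output of both binaries; call from the handler."""
    for cmd in (
        ["/usr/bin/chromedriver", "--version"],
        ["/usr/bin/chromium-browser", "--version"],
    ):
        code, output = run_and_capture(cmd)
        print(f"{cmd[0]} exited with {code}: {output!r}")
```

Calling diagnose() at the top of handler would put the output in the CloudWatch log next to the traceback, which could show whether the binary fails to start at all (for example a missing shared library) or only fails with the arguments Selenium passes.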
Does anyone have a working image / Dockerfile + skeleton function I can try?