When downloading PDFs with Selenium in Python from a list of URLs stored in an Excel file, files that share a filename overwrite each other. Several of the PDFs have identical filenames, so each new download replaces the previous one, and at the end of the run only the last file downloaded under that name remains. The code currently loops through the DataFrame, downloads each PDF URL, and saves it into a designated destination folder. The task is to find a way, using Selenium in Python, to stop downloads with identical filenames from overwriting each other.
What I Have Tried:
I am using Selenium in Python to download PDFs from a list of URLs stored in an Excel file. I have set up the necessary options for the Chrome webdriver and implemented the logic to download the PDFs one by one by looping through a DataFrame containing the URLs and corresponding filenames.
Expectations:
My goal is to modify the download process so that PDFs with the same filename are not overwritten. Instead, each download should be saved under a unique filename, such as filename.pdf, filename (1).pdf, filename (2).pdf, and so on, so that all the files are preserved.
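To make the expected naming scheme concrete, here is a minimal sketch of a helper that picks the next free name in a folder. The function name `unique_path` and its parameters are hypothetical and not part of my current script. It only uses the standard library and does not touch Selenium.

```python
import os


def unique_path(folder, filename):
    """Return a path inside `folder` that does not exist yet.

    Hypothetical helper: tries "name.pdf" first, then "name (1).pdf",
    "name (2).pdf", and so on until an unused name is found.
    """
    stem, ext = os.path.splitext(filename)
    candidate = os.path.join(folder, filename)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(folder, f"{stem} ({counter}){ext}")
        counter += 1
    return candidate
```

Called once per row of the DataFrame, before a file is written or moved into the destination folder, this would give every download its own name. It is not safe against two processes writing to the same folder at the same time.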
Packages and Options Imported:
Here are the packages and options I have imported for my Selenium script:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
import time
import os
import warnings
import pyperclip
import Xlib.display
from pyvirtualdisplay.display import Display

options = Options()
options.add_argument("--ignore-certificate-errors")
options.add_argument("--headless")
options.add_argument('disable-infobars')
options.add_argument("--no-sandbox")
options.page_load_strategy = 'normal'
options.add_argument("--disable-cache")
options.add_argument("--disable-gpu")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('prefs', {'download.default_directory': file_path})
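One possible direction, sketched under assumptions: point `download.default_directory` at an empty staging folder instead of the final destination, wait for each download to finish there, and then move the file out before the next download starts. The helper below covers only the waiting step. The name `wait_for_download`, the staging-folder idea, and the timeout values are assumptions, not part of my script. It relies on Chrome giving in-progress downloads a `.crdownload` extension.

```python
import os
import time


def wait_for_download(staging_dir, timeout=60.0, poll=0.5):
    """Wait until a finished file shows up in `staging_dir` and return its path.

    Assumes `staging_dir` is empty before the download starts and that
    Chrome marks in-progress downloads with a ".crdownload" suffix.
    Raises TimeoutError if nothing finishes within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = os.listdir(staging_dir)
        partial = [n for n in names if n.endswith(".crdownload")]
        finished = [n for n in names if not n.endswith(".crdownload")]
        # Only report a file once no partial download is left in the folder.
        if finished and not partial:
            return os.path.join(staging_dir, finished[0])
        time.sleep(poll)
    raise TimeoutError(f"no finished download in {staging_dir!r}")
```

After the helper returns, the file could be moved with `shutil.move` into the destination folder under a name that is not taken yet, which leaves the staging folder empty again for the next URL.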