I have a Python script to scrape data from a webpage with infinite scroll. Currently, I use a loop that scrolls down to the bottom of the page and waits for 5 seconds each time to allow the page to load new content:
for i in range(100):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
However, waiting a fixed 5 seconds each time may not be the most efficient approach, as the page could finish loading new content much faster.
Is there a way to detect when the page has finished loading new content after scrolling down? This would allow me to scroll again only when necessary, making the process more time-efficient.
The WebDriver automatically waits for a page to load completely when using the .get() method.
If you’re waiting for a specific element, it’s best to use WebDriverWait (from selenium.webdriver.support.ui) together with a condition from Selenium’s expected_conditions module to wait until that element appears on the page:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")

# Maximum time to wait, in seconds
delay = 3
try:
    # Wait until the element with ID 'IdOfMyElement' is present on the page
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
This code snippet uses WebDriverWait to wait up to 3 seconds (delay) for the element with ID ‘IdOfMyElement’ to be present on the page. If the element is found within the specified time, it prints “Page is ready!”. If not, it prints “Loading took too much time!” after a TimeoutException occurs.
It’s important to note that while WebDriver waits for the initial page load by default, it doesn’t wait for elements loaded inside frames or for AJAX requests.
In such cases, you should use WebDriverWait with appropriate expected conditions to handle waiting for specific elements or conditions on the page.
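For the infinite-scroll loop in the question, one option is to poll the page height after each scroll and continue as soon as it grows. The following is a minimal sketch under stated assumptions, not a definitive implementation: the function name, the timeout, poll and max_scrolls parameters, and the rule "no growth within the timeout means the end of the feed" are all assumptions made for illustration. It only relies on the driver's execute_script method and polls by hand so that it stays self-contained; WebDriverWait with a custom condition could play the same role.

```python
import time


def scroll_until_no_new_content(driver, timeout=5.0, poll=0.2, max_scrolls=100):
    """Scroll to the bottom repeatedly and stop when the page stops growing.

    `driver` is any object exposing Selenium's `execute_script` method.
    Returns the number of scrolls that actually loaded new content.
    (Hypothetical helper; names and defaults are assumptions.)
    """
    get_height = "return document.body.scrollHeight;"
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        deadline = time.monotonic() + timeout
        new_height = last_height
        # Poll instead of sleeping a fixed time: move on as soon as the height grows.
        while time.monotonic() < deadline:
            new_height = driver.execute_script(get_height)
            if new_height > last_height:
                break
            time.sleep(poll)
        if new_height <= last_height:
            # Nothing new arrived within `timeout`; assume the feed is exhausted.
            break
        last_height = new_height
        loaded += 1
    return loaded
```

With a real session, calling scroll_until_no_new_content(driver) would replace the fixed range(100) loop. The timeout then acts only as an upper bound for the final scroll, while every earlier scroll proceeds as soon as new content changes the page height.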
You can write a helper that checks the page’s document.readyState property using JavaScript. Note that readyState is a property of the document, not a Selenium method. The helper executes a script that returns the value and reports the page as loaded when it equals 'complete'.
However, this approach is not entirely reliable: if it is called right after a click, before the browser has started loading the new page, it can return True prematurely because the old page is still in the 'complete' state.
def page_has_loaded(self):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    page_state = self.driver.execute_script('return document.readyState;')
    return page_state == 'complete'
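A single readyState check only reports the state at one moment, so in practice it is usually wrapped in a polling loop. Below is a minimal sketch of such a loop. The function name and the timeout and poll parameters are assumptions for illustration, and the driver is assumed to be any object with Selenium's execute_script method.

```python
import time


def wait_for_ready_state(driver, timeout=10.0, poll=0.1):
    """Poll document.readyState until it is 'complete'.

    Returns True if the page reported 'complete' before the timeout,
    False otherwise. (Hypothetical helper; names are assumptions.)
    """
    deadline = time.monotonic() + timeout
    while True:
        # Always check at least once, even with a very small timeout.
        if driver.execute_script("return document.readyState;") == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

The same caveat applies as for the single check: if navigation has not started yet, the old page still reports 'complete', so this loop is best combined with a signal that the old page is gone.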
id Comparison
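This method compares the id of the new page’s html element with that of the old one to determine whether a new page has loaded. Here, id is the internal reference that Selenium assigns to each WebElement (the WebElement.id property), not the HTML id attribute. The method locates the html element on the current page and checks whether its id differs from the one captured on the old page. However, this approach may not be as effective due to potential issues with stale reference exceptions.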
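from selenium.common.exceptions import NoSuchElementException

def page_has_loaded_id(self, old_page):
    # old_page: the <html> WebElement captured before the navigation was triggered
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    try:
        new_page = self.driver.find_element(By.TAG_NAME, 'html')
        return new_page.id != old_page.id
    except NoSuchElementException:
        return False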
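To turn the id comparison into a wait, the check can be repeated until the html element's id differs from the one recorded before the click. This is a minimal sketch under assumptions: the function name and parameters are hypothetical, the locator is passed as the plain string "tag name" (the value of By.TAG_NAME) so the block has no imports beyond the standard library, and lookup errors are caught broadly for the same reason.

```python
import time

# String value of selenium.webdriver.common.by.By.TAG_NAME
TAG_NAME = "tag name"


def wait_for_new_html(driver, old_id, timeout=10.0, poll=0.1):
    """Wait until the current <html> element has an id different from old_id.

    `old_id` is the WebElement.id recorded before triggering navigation.
    Returns True when a different id is seen, False on timeout.
    (Hypothetical helper; names are assumptions.)
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if driver.find_element(TAG_NAME, "html").id != old_id:
                return True
        except Exception:
            # Lookups can fail mid-navigation; with Selenium imported,
            # narrow this to NoSuchElementException.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

A typical use would record old_id = driver.find_element(By.TAG_NAME, 'html').id, perform the click, and then call wait_for_new_html(driver, old_id).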
staleness_of
This method uses the staleness_of expected condition together with a context manager to wait for the page to reload. On entering the with block it captures the current html element as old_page. After the body of the block has run (for example, a click that triggers navigation), it waits up to timeout seconds for old_page to become stale, which indicates that the old page has been replaced.
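import contextlib
from selenium.webdriver.support.expected_conditions import staleness_of

@contextlib.contextmanager
def wait_for_page_load(self, timeout=10):
    self.log.debug("Waiting for page to load at {}.".format(self.driver.current_url))
    # Capture the current <html> element before the caller triggers navigation
    old_page = self.driver.find_element(By.TAG_NAME, 'html')
    yield
    # After the with-block body has run, wait for the old element to go stale
    WebDriverWait(self.driver, timeout).until(staleness_of(old_page))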
Each method tackles the challenge of determining if a page has fully loaded after an action like a click, using different techniques such as checking readyState, comparing id attributes, and waiting for staleness. These approaches cater to different scenarios and considerations regarding page loading in Selenium WebDriver.
Note that find_element_by_tag_name was deprecated in Selenium 4 and removed in version 4.3. Use driver.find_element(By.TAG_NAME, 'html') instead, which requires importing By from selenium.webdriver.common.by.
The generic find_element(by, value) call replaces all of the per-strategy find_element_by_* helpers.