I have learned (using Python) to build a few different web scrapers, with the purpose of scraping image URLs from one of our part manufacturers' websites so I can mass-upload a load sheet of products, one column of which consists of the image URLs.
The URLs aren’t simple (I can’t just iterate through a list of product numbers and append each one to a base URL, nor use any of the other simpler methods; I’m here because I have to be here), and the site doesn’t have a “search by product number” function, so I turned to the lists feature on their site. It has some really handy tools: you can add products by product number, and when you’re done you can export that list as a .csv, with the option to include the links to all of the corresponding product pages. Which was great, until I built my script and found out the hard way that they have a 250-item limit per list. For perspective, I have a little under 5,000 products to scrape, meaning I will need about 20 lists: 19 full and the last one nearly full.
I mention all of this because the context is relevant to the code and the issue at hand.
Since I have no other real options, my goal now is to modify my code a bit to do the scraping across 20 separate lists. Right now, at the relevant stage, the script gets the URL of a list on their website that I have named testlist, then refreshes the page just to make sure all of the elements are in order.
That was the right page when I only needed one list, but here is issue one: we can’t use a single link anymore, because we will have to set something up that iterates through 250 items and then moves on to a new list about 20 times (or I can create the lists manually and have specific URLs to point to).
The second issue is the item limit itself. My for loop is one large loop designed to iterate through the entire list of about 4,800 product numbers that I have, adding them one by one to the list on the same page. This needs to be broken into chunks of at most 250 items per page, after which the script should load the next list URL. I could create those lists manually so I would have specific URLs to point to, but if it would be easier to add a function that just clicks and names a new list, that would be awesome. I can probably figure that part out myself.
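For the chunking itself, something like this slicing helper is what I’m imagining (the names `chunked`, `product_numbers`, and the placeholder item numbers are all just stand-ins for illustration, not my real data):

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items from `items`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in for my real list of a little under 5,000 product numbers.
product_numbers = [f"PN{i:04d}" for i in range(4950)]

# 4,950 numbers in chunks of 250 -> 20 chunks: 19 full, the last one nearly full.
chunks = list(chunked(product_numbers, 250))
```

Each chunk would then become one 250-item website list, with the existing inner loop running unchanged over `numbers` within each chunk.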
I don’t know where to go from here. My code is built to handle one website list at one URL: it iterates through the product numbers in my Python list and exports the result at the end. I need the script to iterate through that same Python list, but stop after 250 product numbers, load the next URL, and then continue the process.
The part of my code that gets the list URL and continues into the scraper portion is as follows.
```python
select_am = Select(driver.find_element_by_css_selector('#listActions'))
alert_accept()
print("Found it. Selecting...")
select_am.select_by_value('addItems')
print('Selected. Next...')

# paste our item number into the box
print('Locating model number search....')
inputidbox = driver.find_element_by_id('model-number-search')
print('Located? Pasting model number...')
inputidbox.send_keys(number)

# finally add our item
additembutton = driver.find_element_by_css_selector('.gtmAddItemToList')
print('Located add item button...')
additembutton.click()
print('Item number added. Next...')

print('Locating blank space...')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#addItemsToListModal > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > button:nth-child(1) > svg:nth-child(1) > path:nth-child(1)")))
time.sleep(1)
xbutton = driver.find_element_by_css_selector('#addItemsToListModal > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > button:nth-child(1) > svg:nth-child(1) > path:nth-child(1)')
xbutton.click()
time.sleep(1)

# now we find the "export excel" option to get our csv for that list
listactions = Select(driver.find_element_by_css_selector('#listActions'))
listactions.select_by_value('exportExcel')

# clicky clicky. a dialog will show up on screen asking if you want to save the file. user must manually click on save
exportbutton = driver.find_element_by_css_selector('#btnExportToExcel')
exportbutton.click()
```
My question is: how can I rearrange and/or modify this code to accomplish what I need? Is this the most efficient method? What would you do, how would you handle this, and what code could I implement to achieve my goal if there are no better options?
It would be pretty useless to share the actual website links, as you need an account with them in order to access lists and such.