I wanted to scrape the images from this website: https://example.com/.
To do that, I first needed to understand how the site is structured.
It turns out the site is organized into a handful of regional collections, which is exactly what I want.
I made a list of all the regions by copy-pasting their URLs:
https://example.com/collections/affiches-auvergne-rhone-alpes
https://example.com/collections/affiches-de-bretagne
https://example.com/collections/affiches-de-paca
https://example.com/collections/affiches-des-hauts-de-france
https://example.com/collections/affiches-occitanie
https://example.com/collections/affiches-de-normandie
https://example.com/collections/affiches-de-nouvelle-aquitaine
https://example.com/collections/affiches-grand-est
https://example.com/collections/affiches-de-bourgogne-franche-comte
https://example.com/collections/affiches-pays-de-la-loire
Each region has about three pages, so I added the ?page= parameter to each URL (a quick way to generate the full list is sketched after the example below):
...
"https://example.com/collections/affiches-pays-de-la-loire",
...
"https://example.com/collections/affiches-pays-de-la-loire?page=2"
...
"https://example.com/collections/affiches-pays-de-la-loire?page=3"
Now I put all those URLs into a Python scraping script that an AI provided:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# List of pages to scrape
urls = [
    <URLs here>
]

all_links = []

for url in urls:
    print(f"Scraping: {url}")
    response = requests.get(url)

    # Check if the request succeeded
    if response.status_code != 200:
        print(f"Failed to load {url}")
        continue

    soup = BeautifulSoup(response.text, "html.parser")

    # Find the div with the specific ID
    product_grid = soup.find("div", id="CollectionProductGrid")
    if not product_grid:
        print(f"No 'CollectionProductGrid' found on {url}")
        continue

    # Extract all <a> tags inside the div
    links = product_grid.find_all("a", href=True)
    for link in links:
        href = link["href"]
        # Make relative links absolute
        if href.startswith("/"):
            href = urljoin(url, href)
        all_links.append(href)

# Display results
print("\nCollected links:")
for link in all_links:
    print(link)
This script collects every link inside the product grid of each page I listed, and on this site those links point straight at the image files. The first few look like this:
https://example.com/files/image1.jpg?v=1729157116&width=480
https://example.com/files/image2.jpg?v=1726953900&width=480
https://example.com/files/image3.jpg?v=1729157109&width=480
Which is great! As you can see, each URL carries a few query parameters, and those might cause problems further down the line. However, I've found that requesting width=2048 always gets you the biggest image the site will serve. So I'll just replace width=[0-9]+ with width=2048.
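First, though, I saved the scraper's output to unique_urls.txt so the next commands can work on a file. A minimal way to do that, tacked onto the end of the script above:

# Hypothetical addition to the end of the scraper: write the collected
# links to unique_urls.txt for the follow-up shell commands.
with open("unique_urls.txt", "w", encoding="utf-8") as f:
    for link in all_links:
        f.write(link + "\n")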
cat unique_urls.txt | sed -E 's/width=[0-9]+/width=2048/g' > link-adjusted-size.txt
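The regex works fine here, but if the query strings ever get messier, the same rewrite can be done by actually parsing the URLs. A sketch, assuming width is the only parameter we want to force:

# Hypothetical alternative to the sed one-liner: force width=2048 by
# parsing and rebuilding each URL's query string.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def force_width(url, width=2048):
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["width"] = str(width)
    return urlunsplit(parts._replace(query=urlencode(query)))

with open("unique_urls.txt", encoding="utf-8") as f_in, \
     open("link-adjusted-size.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        line = line.strip()
        if line:
            f_out.write(force_width(line) + "\n")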
Here's a small script (filter-unique.sh) to remove duplicates:
#!/bin/bash

if [ $# -lt 1 ]; then
    echo "Usage: $0 <file>"
    exit 1
fi

input_file="$1"
output_file="links-adjusted-size-no-duplicates.txt"

# Remove duplicates while preserving order
awk '!seen[$0]++' "$input_file" > "$output_file"

echo "Unique URLs saved to $output_file"
So let's remove the duplicates (the awk expression prints a line only the first time it appears, which keeps the original order):
./filter-unique.sh link-adjusted-size.txt
Unique URLs saved to links-adjusted-size-no-duplicates.txt
Now, I realize there are still some .png files in there that I don't want, so let's keep only the .jpg links:
cat links-adjusted-size-no-duplicates.txt | grep '\.jpg' > links.txt
And there we go.
Now, for some reason the following while loop didn’t work:
while read url ; do curl -O '$url' && sleep 1 ; done < links.txt
It kept erroring about bad names. In hindsight, the single quotes around '$url' stop the shell from expanding the variable, so curl was handed the literal string $url; double quotes would have fixed that, and even then -O would have kept the ?v=...&width=... query string in the saved filenames. Rather than debug it, I used an AI's Python script:
import os
import requests
from urllib.parse import urlparse, unquote
from time import sleep

# Path to the file with URLs
INPUT_FILE = "links.txt"
OUTPUT_DIR = "downloaded_images"

# Create the output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Read all URLs from the file
with open(INPUT_FILE, "r", encoding="utf-8") as f:
    urls = [line.strip().replace("\\", "") for line in f if line.strip()]

print(f"Found {len(urls)} URLs to process.\n")

downloaded = set()

for i, url in enumerate(urls, start=1):
    if url in downloaded:
        print(f"[{i}] Skipping duplicate: {url}")
        continue

    print(f"[{i}] Downloading: {url}")
    try:
        response = requests.get(url, stream=True, timeout=15)
        response.raise_for_status()

        # Extract the filename from the URL path
        path = urlparse(url).path
        filename = os.path.basename(path)

        # Remove query params like ?v=xxx
        if "?" in filename:
            filename = filename.split("?")[0]
        filename = unquote(filename)  # Decode %20 etc.

        file_path = os.path.join(OUTPUT_DIR, filename)

        # Save the image
        with open(file_path, "wb") as f_out:
            for chunk in response.iter_content(8192):
                f_out.write(chunk)

        print(f"✅ Saved as: {file_path}")
        downloaded.add(url)
        sleep(1)  # polite delay between downloads

    except requests.RequestException as e:
        print(f"❌ Error downloading {url}: {e}")

print("\nAll done!")
And there it worked! It’s downloading now.
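Once it finishes, a quick sanity check is to compare the number of downloaded files against the number of URLs (it can be a bit lower if several URLs share a filename). A small sketch, using the same file and directory names as above:

# Hypothetical sanity check: count downloaded files vs. URLs in the list.
import os

with open("links.txt", encoding="utf-8") as f:
    url_count = sum(1 for line in f if line.strip())

file_count = len(os.listdir("downloaded_images"))
print(f"{file_count} files downloaded for {url_count} URLs")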