I wanted to scrape the images from this website: https://example.com/.
To do that, I first needed to understand how the site is structured.
It turns out the site is organized into a handful of regional collections, which is exactly what I want.
I made a list of all the regions by copy-pasting their URLs:
https://example.com/collections/affiches-auvergne-rhone-alpes
https://example.com/collections/affiches-de-bretagne
https://example.com/collections/affiches-de-paca
https://example.com/collections/affiches-des-hauts-de-france
https://example.com/collections/affiches-occitanie
https://example.com/collections/affiches-de-normandie
https://example.com/collections/affiches-de-nouvelle-aquitaine
https://example.com/collections/affiches-grand-est
https://example.com/collections/affiches-de-bourgogne-franche-comte
https://example.com/collections/affiches-pays-de-la-loire
Each region has about three pages, so I added the ?page= parameter to each URL (a quick way to generate the full list is sketched after the example below):
...
"https://example.com/collections/affiches-pays-de-la-loire",
...
"https://example.com/collections/affiches-pays-de-la-loire?page=2"
...
"https://example.com/collections/affiches-pays-de-la-loire?page=3"
Now I put all those URLs into a Python scraping script that an AI provided:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# List of pages to scrape
urls = [
    <URLs here>
]

all_links = []

for url in urls:
    print(f"Scraping: {url}")
    response = requests.get(url)

    # Check if the request succeeded
    if response.status_code != 200:
        print(f"Failed to load {url}")
        continue

    soup = BeautifulSoup(response.text, "html.parser")

    # Find the div with the specific ID
    product_grid = soup.find("div", id="CollectionProductGrid")
    if not product_grid:
        print(f"No 'CollectionProductGrid' found on {url}")
        continue

    # Extract all <a> tags inside the div
    links = product_grid.find_all("a", href=True)
    for link in links:
        href = link["href"]
        # Make relative links absolute
        if href.startswith("/"):
            href = urljoin(url, href)
        all_links.append(href)

# Display results
print("\nCollected links:")
for link in all_links:
    print(link)
This script collects every link inside the product grid of each page I listed, and on this site those links point straight at the image files. The first few look like this:
https://example.com/files/image1.jpg?v=1729157116&width=480
https://example.com/files/image2.jpg?v=1726953900&width=480
https://example.com/files/image3.jpg?v=1729157109&width=480
Which is great! As you can see, each URL carries a few query parameters, and those might cause problems further down the line. However, I've found that requesting width=2048 always gets you the biggest image the site will serve. So I'll just replace width=[0-9]+ with width=2048.
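First, though, I saved the scraper's output to unique_urls.txt so the next commands can work on a file. A minimal way to do that, tacked onto the end of the script above:

# Hypothetical addition to the end of the scraper: write the collected
# links to unique_urls.txt for the follow-up shell commands.
with open("unique_urls.txt", "w", encoding="utf-8") as f:
    for link in all_links:
        f.write(link + "\n")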
cat unique_urls.txt | sed -E 's/width=[0-9]+/width=2048/g' > link-adjusted-size.txt
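The regex works fine here, but if the query strings ever get messier, the same rewrite can be done by actually parsing the URLs. A sketch, assuming width is the only parameter we want to force:

# Hypothetical alternative to the sed one-liner: force width=2048 by
# parsing and rebuilding each URL's query string.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def force_width(url, width=2048):
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["width"] = str(width)
    return urlunsplit(parts._replace(query=urlencode(query)))

with open("unique_urls.txt", encoding="utf-8") as f_in, \
     open("link-adjusted-size.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        line = line.strip()
        if line:
            f_out.write(force_width(line) + "\n")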
Here's a small script (filter-unique.sh) to remove duplicates:
#!/bin/bash

if [ $# -lt 1 ]; then
    echo "Usage: $0 <file>"
    exit 1
fi

input_file="$1"
output_file="links-adjusted-size-no-duplicates.txt"

# Remove duplicates while preserving order
awk '!seen[$0]++' "$input_file" > "$output_file"

echo "Unique URLs saved to $output_file"
So let's remove the duplicates (the awk expression prints a line only the first time it appears, which keeps the original order):
./filter-unique.sh link-adjusted-size.txt
Unique URLs saved to links-adjusted-size-no-duplicates.txt
Now, I realize there are still some .png files in there that I don't want, so let's keep only the .jpg links:
cat links-adjusted-size-no-duplicates.txt | grep '\.jpg' > links.txt
And there we go.
Now, for some reason the following while loop didn’t work:
while read url ; do curl -O '$url' && sleep 1 ; done < links.txt
It kept erroring about bad names. In hindsight, the single quotes around '$url' stop the shell from expanding the variable, so curl was handed the literal string $url; double quotes would have fixed that, and even then -O would have kept the ?v=...&width=... query string in the saved filenames. Rather than debug it, I used an AI's Python script:
import os
import requests
from urllib.parse import urlparse, unquote
from time import sleep

# Path to the file with URLs
INPUT_FILE = "links.txt"
OUTPUT_DIR = "downloaded_images"

# Create the output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Read all URLs from the file
with open(INPUT_FILE, "r", encoding="utf-8") as f:
    urls = [line.strip().replace("\\", "") for line in f if line.strip()]

print(f"Found {len(urls)} URLs to process.\n")

downloaded = set()

for i, url in enumerate(urls, start=1):
    if url in downloaded:
        print(f"[{i}] Skipping duplicate: {url}")
        continue

    print(f"[{i}] Downloading: {url}")
    try:
        response = requests.get(url, stream=True, timeout=15)
        response.raise_for_status()

        # Extract the filename from the URL path
        path = urlparse(url).path
        filename = os.path.basename(path)

        # Remove query params like ?v=xxx
        if "?" in filename:
            filename = filename.split("?")[0]
        filename = unquote(filename)  # Decode %20 etc.

        file_path = os.path.join(OUTPUT_DIR, filename)

        # Save the image
        with open(file_path, "wb") as f_out:
            for chunk in response.iter_content(8192):
                f_out.write(chunk)

        print(f"✅ Saved as: {file_path}")
        downloaded.add(url)
        sleep(1)  # polite delay between downloads

    except requests.RequestException as e:
        print(f"❌ Error downloading {url}: {e}")

print("\nAll done!")
And there it worked! It’s downloading now.
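Once it finishes, a quick sanity check is to compare the number of downloaded files against the number of URLs (it can be a bit lower if several URLs share a filename). A small sketch, using the same file and directory names as above:

# Hypothetical sanity check: count downloaded files vs. URLs in the list.
import os

with open("links.txt", encoding="utf-8") as f:
    url_count = sum(1 for line in f if line.strip())

file_count = len(os.listdir("downloaded_images"))
print(f"{file_count} files downloaded for {url_count} URLs")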