Raspberry Pi Web Scraper: Python Automation and Scheduling

A Raspberry Pi running a Python web scraper is one of the most cost-effective ways to automate data collection. Whether you’re monitoring competitor prices, tracking job postings, archiving news articles, or watching stock levels on e-commerce sites, a Pi can run your scraper 24/7 on a few watts of power, store results locally, and alert you when something interesting happens — all without paying for cloud compute.

This tutorial covers the full stack: static site scraping with BeautifulSoup, dynamic JavaScript-rendered sites with Playwright, scheduling with cron and systemd timers, local data storage, and practical anti-detection techniques to keep your scraper running reliably.

Choosing the Right Scraping Tools

Web scraping tools fall into two main categories based on the type of site you’re targeting:

Static Sites (HTML delivered server-side)

Sites where the content is fully present in the HTML response — government data portals, Wikipedia, most news sites, job boards with server-side rendering. For these, use:

requests — HTTP library for fetching pages
BeautifulSoup4 — HTML parser with an intuitive API
lxml — faster parser backend for BeautifulSoup

Dynamic Sites (JavaScript-rendered content)

Sites like modern e-commerce platforms, social media, or any site where content loads after the initial HTML via AJAX calls. You need a real browser (or browser emulator) that executes JavaScript:

Playwright — Microsoft’s browser automation library; supports Chromium, Firefox, WebKit; excellent ARM64 support
Selenium — older alternative; works but Playwright is generally preferred for new projects

For many dynamic sites, you can skip the browser entirely by inspecting the site’s network requests in browser DevTools, finding the underlying API endpoint, and calling it directly with requests. This is significantly faster than running a full browser.
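
As a minimal sketch of that approach, assume DevTools reveals a paginated JSON endpoint whose responses look like {"products": [...], "next_page": 2}. The endpoint shape, the field names, and the fetch_all_products helper are all hypothetical, so adjust them to what your target site actually returns. The function accepts any object with a requests-style get() method, such as requests.Session():

```python
import time


def fetch_all_products(session, api_url, max_pages=50, delay=0.0):
    """Collect products from a paginated JSON API.

    `session` is anything with a requests-style get(); in real use pass
    requests.Session(). The 'products' and 'next_page' keys are assumptions
    about a hypothetical endpoint, not a real site's schema.
    """
    products = []
    page = 1
    for _ in range(max_pages):
        resp = session.get(api_url, params={'page': page}, timeout=15)
        resp.raise_for_status()
        payload = resp.json()
        products.extend(payload.get('products', []))
        page = payload.get('next_page')
        if not page:  # No further pages advertised
            break
        time.sleep(delay)  # Polite pause between API calls
    return products
```

In a real run you might call fetch_all_products(requests.Session(), 'https://example.com/api/products', delay=2.0), keeping the same polite delay you would use for HTML pages.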

Recommended: Raspberry Pi 5 Model 4GB RAM — for running Playwright (Chromium) on Raspberry Pi, the 4GB RAM model is the minimum comfortable configuration. Chromium’s renderer processes consume 200–400MB each, and running multiple parallel browser instances on 2GB RAM will cause constant swapping.

Setting Up Your Raspberry Pi for Scraping

Start with a fresh Raspberry Pi OS Lite (64-bit) installation on your Pi 5. The Lite version skips the desktop environment, saving RAM and CPU cycles for your scrapers. Enable SSH via Raspberry Pi Imager’s Advanced Options so you can manage it headlessly.

After first boot, install the core Python dependencies:

sudo apt update && sudo apt install -y python3-pip python3-venv chromium-browser chromium-chromedriver

# Create a virtual environment for your scraper project
cd ~
python3 -m venv scrapers_env
source scrapers_env/bin/activate

# Install Python packages
pip install requests beautifulsoup4 lxml playwright sqlite-utils pandas

Install Playwright browsers (uses bundled Chromium compatible with ARM64):

playwright install chromium
playwright install-deps chromium

Verify everything works:

python3 -c "from playwright.sync_api import sync_playwright; print('Playwright OK')"
python3 -c "import bs4; print('BS4 OK')"

Static Scraping with Requests and BeautifulSoup

Let’s build a real example: scraping electronics component prices from a product category page. This pattern applies to most e-commerce and catalogue sites.

import requests
from bs4 import BeautifulSoup
import sqlite3
from datetime import datetime
import time
import random
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
    handlers=[
        logging.FileHandler('/home/pi/scraper.log'),  # /var/log is not writable by the pi user
        logging.StreamHandler()
    ]
)

SESSION = requests.Session()
SESSION.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-IN,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',  # Add 'br' only if the brotli package is installed
    'DNT': '1',
    'Connection': 'keep-alive',
})

def init_db(db_path='/home/pi/scraper_data.db'):
    conn = sqlite3.connect(db_path)
    conn.execute('''
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            scraped_at TEXT,
            url TEXT,
            name TEXT,
            price_inr REAL,
            availability TEXT,
            sku TEXT
        )
    ''')
    conn.commit()
    return conn

def scrape_product_page(url):
    try:
        resp = SESSION.get(url, timeout=15)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, 'lxml')
        
        # These selectors are examples — adjust to your target site
        name = soup.select_one('h1.product-title, h1.entry-title')
        price = soup.select_one('.price .woocommerce-Price-amount bdi, '
                                '.product-price .amount')
        availability = soup.select_one('.stock')
        sku_elem = soup.select_one('.sku')
        
        return {
            'url': url,
            'name': name.get_text(strip=True) if name else None,
            'price_inr': float(price.get_text(strip=True).replace('₹','').replace(',','')) if price else None,
            'availability': availability.get_text(strip=True) if availability else 'Unknown',
            'sku': sku_elem.get_text(strip=True) if sku_elem else None,
        }
    except Exception as e:
        logging.error(f'Failed to scrape {url}: {e}')
        return None

def scrape_category_urls(category_url):
    """Extract all product URLs from a category page, handling pagination."""
    urls = []
    page = 1
    while True:
        paged_url = f'{category_url}page/{page}/' if page > 1 else category_url
        resp = SESSION.get(paged_url, timeout=15)
        if resp.status_code == 404:
            break
        soup = BeautifulSoup(resp.text, 'lxml')
        product_links = soup.select('a.woocommerce-loop-product__link')
        if not product_links:
            break
        urls.extend([a['href'] for a in product_links])
        page += 1
        time.sleep(random.uniform(1.5, 3.5))  # Polite delay
    return list(set(urls))  # Deduplicate

def run_scraper(category_url):
    conn = init_db()
    logging.info(f'Starting scrape of {category_url}')
    
    product_urls = scrape_category_urls(category_url)
    logging.info(f'Found {len(product_urls)} products')
    
    for url in product_urls:
        data = scrape_product_page(url)
        if data:
            data['scraped_at'] = datetime.now().isoformat()
            conn.execute(
                'INSERT INTO products (scraped_at,url,name,price_inr,availability,sku) '
                'VALUES (:scraped_at,:url,:name,:price_inr,:availability,:sku)',
                data
            )
            conn.commit()
            logging.info(f'Saved: {data["name"]} — ₹{data["price_inr"]}')
        time.sleep(random.uniform(2, 5))  # Polite inter-request delay
    
    conn.close()
    logging.info('Scrape complete')

if __name__ == '__main__':
    run_scraper('https://example-electronics-site.in/product-category/sensors/')
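
The inline price parsing in scrape_product_page only strips the rupee sign and commas, so a label such as "Rs. 450" or "Price: ₹ 7,200 incl. GST" would raise an error and the whole product would be skipped. One possible hardening, sketched here with a hypothetical parse_price helper that is not part of the original script, is to pull the first number out with a regular expression:

```python
import re


def parse_price(text):
    """Return the first number in a price string as a float, or None.

    Tolerates currency symbols, thousands separators, and surrounding words,
    e.g. '₹1,299.00', 'Rs. 450', 'Price: ₹ 7,200 incl. GST'.
    """
    if not text:
        return None
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if not match:
        return None
    return float(match.group(0).replace(',', ''))
```

With this in place, the 'price_inr' entry could become parse_price(price.get_text(strip=True)) if price else None. Note that the helper returns only the first number, so a price range would yield its lower bound.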

Dynamic Sites with Playwright on Raspberry Pi

For sites that load content via JavaScript, or have anti-bot measures that block plain HTTP requests, Playwright launches a real Chromium browser (headless on the Pi) and controls it programmatically:

from playwright.sync_api import sync_playwright
import sqlite3
import time

def scrape_with_playwright(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                '--no-sandbox',
                '--disable-dev-shm-usage',  # Important for Pi (limited /dev/shm)
                '--disable-gpu',
                '--disable-extensions',
                '--disable-background-networking',
            ]
        )
        context = browser.new_context(
            user_agent='Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
            viewport={'width': 1280, 'height': 720},
            locale='en-IN',
            timezone_id='Asia/Kolkata',
        )
        page = context.new_page()
        
        # Block images, video, and fonts to speed up loading
        page.route('**/*.{png,jpg,jpeg,gif,webp,svg,mp4,woff,woff2}',
                   lambda route: route.abort())
        
        page.goto(url, wait_until='networkidle', timeout=30000)
        
        # Wait for specific element if needed
        page.wait_for_selector('.product-price', timeout=10000)
        
        # Extract data using page.evaluate() for complex JS interactions
        product_data = page.evaluate('''
            () => ({
                name: document.querySelector('h1.product-title')?.innerText,
                price: document.querySelector('.product-price .amount')?.innerText,
                stock: document.querySelector('.stock')?.innerText,
            })
        ''')
        
        # Or take a screenshot for debugging
        # page.screenshot(path='/tmp/debug_screenshot.png')
        
        browser.close()
        return product_data

# For infinite scroll pages:
def scrape_infinite_scroll(url, max_scrolls=10):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, args=['--no-sandbox'])
        page = browser.new_page()
        page.goto(url)
        
        all_items = []
        previous_count = 0
        for _ in range(max_scrolls):
            # Scroll to the bottom to trigger lazy loading
            page.keyboard.press('End')
            time.sleep(2)

            # Click "Load More" if the site uses a button instead of auto-loading
            load_more = page.query_selector('button.load-more')
            if load_more:
                load_more.click()
                page.wait_for_load_state('networkidle')

            # Re-read every card now in the DOM (earlier cards are still present)
            items = page.query_selector_all('.product-card')
            all_items = [item.inner_text() for item in items]

            # Stop once a scroll adds nothing new
            if len(all_items) == previous_count:
                break
            previous_count = len(all_items)
        
        browser.close()
        return all_items

Performance note: Playwright on Raspberry Pi is slower than on a desktop; expect roughly 3–5 seconds per page instead of under 1 second. Keep parallel browser instances to a maximum of 2 on the 4GB Pi 5 to avoid RAM exhaustion. Blocking images and fonts (as shown above) can cut page load time by roughly 40–60%.
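
One way to enforce that cap is an asyncio semaphore. The sketch below is deliberately generic: run_limited is a hypothetical helper, and each job stands in for a coroutine function that would open a page through Playwright's async API (not shown here). It assumes nothing about the scraping code beyond that:

```python
import asyncio


async def run_limited(jobs, limit=2):
    """Run zero-argument coroutine functions with at most `limit` in flight.

    Results come back in the same order as `jobs`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:  # Blocks while `limit` jobs are already running
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

A call such as asyncio.run(run_limited(jobs, limit=2)) then keeps at most two browser pages busy at a time, however many URLs are queued.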

Recommended: Raspberry Pi 5 Model 16GB RAM — if you need to run 5–10 parallel Playwright browser instances for high-throughput scraping, the 16GB model prevents OOM (Out of Memory) kills that interrupt long-running scraping jobs.

Storing Scraped Data: SQLite and CSV

For most Pi scraping projects, SQLite is the ideal storage backend: it’s built into Python, requires no server, handles concurrent reads well, and a single database file is easy to back up or transfer. The sqlite-utils library makes working with it even easier:

import sqlite_utils
from datetime import datetime

db = sqlite_utils.Database('/home/pi/scraper_data.db')

# Auto-create table from dict (no schema needed upfront)
db['products'].insert({
    'scraped_at': datetime.now().isoformat(),
    'name': 'Raspberry Pi 5 4GB',
    'price_inr': 7200.0,
    'url': 'https://example.com/product/rpi5-4gb',
    'in_stock': True,
}, alter=True)  # alter=True adds new columns automatically

# Query with pandas for analysis
import pandas as pd
df = pd.read_sql('SELECT * FROM products ORDER BY scraped_at DESC LIMIT 100',
                 db.conn)
print(df.describe())

# Export to CSV for sharing
df.to_csv('/home/pi/price_report.csv', index=False)

For price tracking specifically, always store historical data rather than updating in place. The pattern above inserts a new row on every scrape run, letting you track price changes over time with SQL queries like:

-- Find products whose price fell more than 10% within the last 7 days
SELECT DISTINCT name,
       first_price,
       latest_price,
       round((latest_price - first_price) / first_price * 100, 1) AS pct_change
FROM (
    SELECT name,
           first_value(price_inr) OVER (PARTITION BY url ORDER BY scraped_at) AS first_price,
           last_value(price_inr)  OVER (PARTITION BY url ORDER BY scraped_at
                                        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS latest_price
    FROM products
    -- scraped_at holds local-time ISO strings ('T' separator), so build the cutoff the same way
    WHERE scraped_at > strftime('%Y-%m-%dT%H:%M:%S', 'now', 'localtime', '-7 days')
      AND price_inr IS NOT NULL
)
WHERE (latest_price - first_price) / first_price * 100 < -10;
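
The same first-versus-latest comparison can also be done in Python, which is handy when the result feeds an alert. This is a sketch under stated assumptions: it reads the products table defined earlier, compares across the whole stored history rather than only the last 7 days, and find_price_drops is a hypothetical helper name:

```python
import sqlite3
from itertools import groupby


def find_price_drops(conn, threshold_pct=10.0):
    """Return (url, first_price, latest_price, pct_change) for each URL whose
    latest price is more than `threshold_pct` percent below its first price."""
    rows = conn.execute(
        'SELECT url, price_inr FROM products '
        'WHERE price_inr IS NOT NULL ORDER BY url, scraped_at'
    ).fetchall()
    drops = []
    # Rows are sorted by URL, so groupby yields one chronological run per product
    for url, group in groupby(rows, key=lambda row: row[0]):
        prices = [price for _, price in group]
        first, latest = prices[0], prices[-1]
        if first <= 0:
            continue  # Avoid dividing by zero on bad data
        pct_change = round((latest - first) / first * 100, 1)
        if pct_change < -threshold_pct:
            drops.append((url, first, latest, pct_change))
    return drops
```

Calling find_price_drops(sqlite3.connect('/home/pi/scraper_data.db')) after each scrape run returns an empty list when nothing has dropped past the threshold.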

Scheduling: Cron vs Systemd Timers

For recurring scraper runs, you have two main options on Raspberry Pi OS:

Cron (Simple, Classic)

Add a cron job with crontab -e:

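# Run scraper every 6 hours (log to the home directory; the pi user cannot write to /var/log)
0 */6 * * * /home/pi/scrapers_env/bin/python /home/pi/scraper.py >> /home/pi/scraper_cron.log 2>&1

# Run at 8 AM and 8 PM India time (IST = UTC+5:30) on a Pi whose clock is set to UTC.
# If the Pi's timezone is already Asia/Kolkata, use "0 8,20 * * *" instead.
30 2,14 * * * /home/pi/scrapers_env/bin/python /home/pi/scraper.py >> /home/pi/scraper_cron.log 2>&1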
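
The minute and hour fields in the IST schedule come from subtracting the 5 hour 30 minute offset. The arithmetic is easy to check with a few lines of Python. The ist_to_utc_cron helper below is purely illustrative and only matters when the Pi's clock runs on UTC:

```python
from datetime import datetime, timedelta, timezone

# India has no daylight saving time, so a fixed offset is sufficient
IST = timezone(timedelta(hours=5, minutes=30))


def ist_to_utc_cron(hour, minute=0):
    """Convert an IST wall-clock time to the (minute, hour) cron fields
    for a machine whose clock runs on UTC."""
    local = datetime(2024, 1, 1, hour, minute, tzinfo=IST)
    utc = local.astimezone(timezone.utc)
    return utc.minute, utc.hour


print(ist_to_utc_cron(8))   # (30, 2)  -> "30 2 * * *"
print(ist_to_utc_cron(20))  # (30, 14) -> "30 14 * * *"
```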

Systemd Timer (More Robust)

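Systemd timers are better suited than cron to production use: they log to journald, can trigger a notification unit on failure (via OnFailure=, for example to send an email), catch up on runs missed while the Pi was off (with Persistent=true), and allow dependency ordering.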

Create /etc/systemd/system/scraper.service:

[Unit]
Description=Product Price Scraper
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
User=pi
Environment=VIRTUAL_ENV=/home/pi/scrapers_env
ExecStart=/home/pi/scrapers_env/bin/python /home/pi/scraper.py
StandardOutput=append:/var/log/scraper.log
StandardError=append:/var/log/scraper_errors.log
TimeoutStartSec=3600

Create /etc/systemd/system/scraper.timer:

[Unit]
Description=Run scraper every 6 hours

[Timer]
OnCalendar=*-*-* 00,06,12,18:00:00
# Run missed jobs after reboot (systemd does not allow trailing comments on a setting line)
Persistent=true

[Install]
WantedBy=timers.target

Reload systemd, then enable and start the timer:

sudo systemctl daemon-reload
sudo systemctl enable --now scraper.timer
systemctl list-timers  # Verify it's scheduled

Anti-Detection: Headers, Delays, and Proxies

Most sites have anti-bot measures ranging from simple User-Agent checks to sophisticated fingerprinting. Here are practical techniques to keep your scraper running:

1. Realistic Request Headers

Beyond User-Agent, send the full browser header set that a real Chromium would send: Accept, Accept-Language, Accept-Encoding, Sec-Fetch-* headers. Tools like curl_cffi impersonate specific browser TLS fingerprints:

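pip install curl_cffi   # Run in the shell, inside the virtual environment

# Then, in Python:
from curl_cffi import requests
url = 'https://example.com/'  # Placeholder target
resp = requests.get(url, impersonate='chrome120')  # Impersonates Chrome 120 TLS fingerprint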

2. Polite Random Delays

Never send requests at a fixed rate. Use random delays drawn from a realistic distribution:

import time, random
# Gaussian-distributed delay, mean 3s, std 1.5s, min 1s
delay = max(1.0, random.gauss(3.0, 1.5))
time.sleep(delay)

3. Respect robots.txt

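from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()  # Fetch and parse robots.txt once, then reuse rp

def allowed_by_robots(url, user_agent='*'):
    """Return True if robots.txt permits fetching this URL."""
    if not rp.can_fetch(user_agent, url):
        print(f'robots.txt disallows: {url}')
        return False
    return True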

4. Rotating Proxies (for blocked IPs)

If your Pi’s IP gets temporarily blocked (HTTP 429 or 403 responses), residential proxy rotation via services like ScraperAPI or BrightData is the most reliable solution. For free rotation, using Tor is an option but Tor IPs are often pre-blocked by e-commerce sites.
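
Whichever provider you choose, requests takes proxies as a mapping passed through its proxies argument. The sketch below rotates round-robin through a list of proxy URLs. ProxyRotator is a hypothetical helper, and the proxy addresses in the usage note are placeholders, not real endpoints:

```python
from itertools import cycle


class ProxyRotator:
    """Hand out proxies round-robin in the mapping format requests expects."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError('at least one proxy URL is required')
        self._pool = cycle(proxy_urls)

    def next_proxies(self):
        proxy = next(self._pool)
        # Route both plain and TLS traffic through the same proxy endpoint
        return {'http': proxy, 'https': proxy}
```

With rotator = ProxyRotator(['http://user:pass@proxy1.example.com:8000', 'http://user:pass@proxy2.example.com:8000']), each request can then be sent as SESSION.get(url, proxies=rotator.next_proxies(), timeout=15).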

Recommended: Raspberry Pi 5 Model 2GB RAM — for lightweight requests-based scrapers (no Playwright), the Pi 5 2GB is more than sufficient. You can run 10+ concurrent scraping threads on 2GB RAM with the requests + BeautifulSoup stack.

Recommended: 18650 Battery Holder Development Board V3 for Raspberry Pi — add a UPS module to your scraping Pi so power cuts don’t interrupt long-running jobs or corrupt your SQLite database mid-write.

Frequently Asked Questions

Is web scraping legal in India?

Web scraping is generally legal in India for publicly accessible data, provided you: respect the site’s robots.txt, don’t bypass authentication or CAPTCHA systems in a way that violates the unauthorised-access provisions of the Information Technology Act, 2000 (Sections 43 and 66), don’t scrape copyrighted content for republication, and don’t use scraped data for anti-competitive purposes. Scraping your own data, public government data, or general publicly accessible information for personal or research use is generally considered low-risk. Always review a site’s Terms of Service, because some explicitly prohibit automated access.

How fast can a Raspberry Pi scrape websites?

With requests + BeautifulSoup and 10 concurrent threads, a Pi 5 can process 50–200 pages per minute depending on site response times and your delay settings. With Playwright (headless Chromium), expect 10–30 pages per minute. Network latency is usually the bottleneck, not Pi CPU — a fast internet connection improves throughput more than upgrading Pi RAM.
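
A thread pool is the simplest way to get those concurrent threads. The following is a minimal sketch: scrape_concurrently is a hypothetical wrapper, and scrape_one stands for any per-URL function such as the scrape_product_page function from the static scraping example:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_concurrently(urls, scrape_one, max_workers=10):
    """Apply scrape_one(url) across a thread pool; return {url: result}.

    URLs whose scrape raises are recorded with a None result instead of
    aborting the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                results[url] = None
    return results
```

Write the collected results to SQLite from the main thread afterwards, since a sqlite3 connection should not be shared across worker threads by default. Keep per-request delays inside scrape_one so that ten threads do not turn into ten times the load on one site.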

How do I handle CAPTCHAs in my scraper?

CAPTCHA options, from easiest to hardest: (1) Find the underlying API endpoint and skip the page that serves the CAPTCHA. (2) Use a CAPTCHA-solving service such as 2captcha or Anti-Captcha (about $2–3 per 1,000 solves). (3) Use Playwright with slow, human-like interaction patterns so that challenges are triggered less often. (4) Solve CAPTCHAs manually, using a browser extension that routes each challenge through a queue on your Raspberry Pi. Never try to implement your own CAPTCHA solver; it’s not worth the effort.

Can I run the scraper while the Raspberry Pi is sleeping?

The Raspberry Pi doesn’t have a traditional sleep/wake cycle like a laptop — it’s either powered on or powered off. You can reduce power consumption while the scraper is idle by underclocking the CPU via config.txt. For intermittent use, consider a relay or smart plug to power the Pi on when needed, triggered by a scheduled automation.

How do I get notified when the scraper finds a price drop or stock alert?

After storing the scraped data, add a notification step that compares the new price against the previous reading. Send alerts via the Telegram Bot API (free, instant push to your phone), email via SMTP with Python’s smtplib, or ntfy (push notifications through the hosted ntfy.sh service or a self-hosted server). A basic Telegram bot integration takes about 20 lines of Python.
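
As a rough sketch of the Telegram route, the code below compares two readings and posts a message through the Bot API's sendMessage method using only the standard library. The function names and the 5% default threshold are illustrative choices, and the bot token and chat ID are values you would obtain from your own bot:

```python
import json
import urllib.request


def price_dropped(previous, current, threshold_pct=5.0):
    """True when current is at least threshold_pct percent below previous."""
    if previous is None or current is None or previous <= 0:
        return False
    return (previous - current) / previous * 100 >= threshold_pct


def build_telegram_request(bot_token, chat_id, text):
    """Build (but do not send) a Bot API sendMessage request."""
    url = f'https://api.telegram.org/bot{bot_token}/sendMessage'
    body = json.dumps({'chat_id': chat_id, 'text': text}).encode('utf-8')
    return urllib.request.Request(
        url, data=body, headers={'Content-Type': 'application/json'}
    )


def send_telegram_alert(bot_token, chat_id, text):
    """Send the alert; returns True if Telegram reports success."""
    request = build_telegram_request(bot_token, chat_id, text)
    with urllib.request.urlopen(request, timeout=15) as resp:
        return json.load(resp).get('ok', False)
```

In the scraper loop this could be wired up as: if price_dropped(old_price, new_price): send_telegram_alert(TOKEN, CHAT_ID, f'{name} is now ₹{new_price}'), where TOKEN and CHAT_ID are your own credentials loaded from a config file or environment variables.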
