A Raspberry Pi running a Python web scraper is one of the most cost-effective ways to automate data collection. Whether you’re monitoring competitor prices, tracking job postings, archiving news articles, or watching stock levels on e-commerce sites, a Pi can run your scraper 24/7 on a few watts of power, store results locally, and alert you when something interesting happens — all without paying for cloud compute.
This tutorial covers the full stack: static site scraping with BeautifulSoup, dynamic JavaScript-rendered sites with Playwright, scheduling with cron and systemd timers, local data storage, and practical anti-detection techniques to keep your scraper running reliably.
Choosing the Right Scraping Tools
Web scraping tools fall into two main categories based on the type of site you’re targeting:
Static Sites (HTML delivered server-side)
Sites where the content is fully present in the HTML response — government data portals, Wikipedia, most news sites, job boards with server-side rendering. For these, use:
- requests — HTTP library for fetching pages
- BeautifulSoup4 — HTML parser with an intuitive API
- lxml — faster parser backend for BeautifulSoup
Dynamic Sites (JavaScript-rendered content)
Sites like modern e-commerce platforms, social media, or any site where content loads after the initial HTML via AJAX calls. You need a real browser (or browser emulator) that executes JavaScript:
- Playwright — Microsoft’s browser automation library; supports Chromium, Firefox, WebKit; excellent ARM64 support
- Selenium — older alternative; works but Playwright is generally preferred for new projects
For many dynamic sites, you can skip the browser entirely by inspecting the site’s network requests in browser DevTools, finding the underlying API endpoint, and calling it directly with requests. This is significantly faster than running a full browser.
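As a rough sketch of that approach, the snippet below flattens a JSON payload of the kind such an endpoint might return. The endpoint URL, the `items`, `title`, `price`, and `in_stock` field names, and the `extract_products` helper are all hypothetical; copy the real ones from the request you find in DevTools.

```python
import json

def extract_products(payload):
    """Flatten a (hypothetical) JSON API response into simple dicts."""
    products = []
    for item in payload.get("items", []):
        price = item.get("price")
        products.append({
            "name": item.get("title"),
            "price_inr": float(price) if price is not None else None,
            "in_stock": bool(item.get("in_stock", False)),
        })
    return products

# Sample payload, made up for illustration. In practice it would come from:
#   resp = requests.get("https://example.com/api/products?page=1", timeout=15)
#   payload = resp.json()
sample = json.loads(
    '{"items": [{"title": "DHT22 sensor", "price": "249.00", "in_stock": true}]}'
)
rows = extract_products(sample)
print(rows)
```

Because the response is already structured, there are no CSS selectors to maintain, which tends to make this route less brittle than parsing HTML.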
Setting Up Your Raspberry Pi for Scraping
Start with a fresh Raspberry Pi OS Lite (64-bit) installation on your Pi 5. The Lite version skips the desktop environment, saving RAM and CPU cycles for your scrapers. Enable SSH via Raspberry Pi Imager’s Advanced Options so you can manage it headlessly.
After first boot, install the core Python dependencies:
sudo apt update && sudo apt install -y python3-pip python3-venv chromium-browser chromium-chromedriver
# Create a virtual environment for your scraper project
cd ~
python3 -m venv scrapers_env
source scrapers_env/bin/activate
# Install Python packages
pip install requests beautifulsoup4 lxml playwright sqlite-utils pandas
Install Playwright browsers (uses bundled Chromium compatible with ARM64):
playwright install chromium
playwright install-deps chromium
Verify everything works:
python3 -c "from playwright.sync_api import sync_playwright; print('Playwright OK')"
python3 -c "import bs4; print('BS4 OK')"
Static Scraping with Requests and BeautifulSoup
Let’s build a real example: scraping electronics component prices from a product category page. This pattern applies to most e-commerce and catalogue sites.
import requests
from bs4 import BeautifulSoup
import sqlite3
from datetime import datetime
import time
import random
import logging

# /var/log is root-owned: create this file once with sudo and chown it to pi,
# or point the handler at a path under /home/pi.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
    handlers=[
        logging.FileHandler('/var/log/scraper.log'),
        logging.StreamHandler()
    ]
)

SESSION = requests.Session()
SESSION.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-IN,en;q=0.9',
    # requests only decodes Brotli when the brotli package is installed,
    # so 'br' is left out here to avoid receiving undecodable responses
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'keep-alive',
})

def init_db(db_path='/home/pi/scraper_data.db'):
    conn = sqlite3.connect(db_path)
    conn.execute('''
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            scraped_at TEXT,
            url TEXT,
            name TEXT,
            price_inr REAL,
            availability TEXT,
            sku TEXT
        )
    ''')
    conn.commit()
    return conn

def scrape_product_page(url):
    try:
        resp = SESSION.get(url, timeout=15)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, 'lxml')
        # These selectors are examples; adjust them to your target site
        name = soup.select_one('h1.product-title, h1.entry-title')
        price = soup.select_one(
            '.price .woocommerce-Price-amount bdi, '
            '.product-price .amount'
        )
        availability = soup.select_one('.stock')
        sku_elem = soup.select_one('.sku')
        return {
            'url': url,
            'name': name.get_text(strip=True) if name else None,
            'price_inr': float(price.get_text(strip=True).replace('₹', '').replace(',', '')) if price else None,
            'availability': availability.get_text(strip=True) if availability else 'Unknown',
            'sku': sku_elem.get_text(strip=True) if sku_elem else None,
        }
    except Exception as e:
        logging.error(f'Failed to scrape {url}: {e}')
        return None

def scrape_category_urls(category_url):
    """Extract all product URLs from a category page, handling pagination."""
    urls = []
    page = 1
    while True:
        paged_url = f'{category_url}page/{page}/' if page > 1 else category_url
        resp = SESSION.get(paged_url, timeout=15)
        if resp.status_code == 404:
            break
        soup = BeautifulSoup(resp.text, 'lxml')
        product_links = soup.select('a.woocommerce-loop-product__link')
        if not product_links:
            break
        urls.extend([a['href'] for a in product_links])
        page += 1
        time.sleep(random.uniform(1.5, 3.5))  # Polite delay
    return list(set(urls))  # Deduplicate

def run_scraper(category_url):
    conn = init_db()
    logging.info(f'Starting scrape of {category_url}')
    product_urls = scrape_category_urls(category_url)
    logging.info(f'Found {len(product_urls)} products')
    for url in product_urls:
        data = scrape_product_page(url)
        if data:
            data['scraped_at'] = datetime.now().isoformat()
            conn.execute(
                'INSERT INTO products (scraped_at,url,name,price_inr,availability,sku) '
                'VALUES (:scraped_at,:url,:name,:price_inr,:availability,:sku)',
                data
            )
            conn.commit()
            logging.info(f'Saved: {data["name"]} - ₹{data["price_inr"]}')
        time.sleep(random.uniform(2, 5))  # Polite inter-request delay
    conn.close()
    logging.info('Scrape complete')

if __name__ == '__main__':
    run_scraper('https://example-electronics-site.in/product-category/sensors/')
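The one-line price conversion in scrape_product_page assumes the text contains only a rupee sign, commas, and digits. A slightly more tolerant parser might look like the sketch below. `parse_price_inr` is a hypothetical helper, not part of the scraper above, and it assumes prices use commas as group separators and a dot for decimals.

```python
import re

def parse_price_inr(text):
    """Return the first number found in a price string as a float, or None."""
    if not text:
        return None
    # A digit, then digits or commas (Indian or Western grouping), then optional decimals
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if not match:
        return None
    return float(match.group(0).replace(",", ""))

print(parse_price_inr("₹1,24,999.00"))
print(parse_price_inr("Rs. 7,200"))
print(parse_price_inr("Out of stock"))
```

Returning None instead of raising lets the scraper store the row and flag the missing price later, rather than losing the whole product to a ValueError.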
Dynamic Sites with Playwright on Raspberry Pi
For sites that load content via JavaScript, or have anti-bot measures that block plain HTTP requests, Playwright launches a real Chromium browser (headless on the Pi) and controls it programmatically:
from playwright.sync_api import sync_playwright
import sqlite3
import time

def scrape_with_playwright(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                '--no-sandbox',
                '--disable-dev-shm-usage',  # Important for Pi (limited /dev/shm)
                '--disable-gpu',
                '--disable-extensions',
                '--disable-background-networking',
            ]
        )
        context = browser.new_context(
            user_agent='Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36',
            viewport={'width': 1280, 'height': 720},
            locale='en-IN',
            timezone_id='Asia/Kolkata',
        )
        page = context.new_page()
        # Block images, video, and fonts to speed up loading
        page.route('**/*.{png,jpg,jpeg,gif,webp,svg,mp4,woff,woff2}',
                   lambda route: route.abort())
        page.goto(url, wait_until='networkidle', timeout=30000)
        # Wait for specific element if needed
        page.wait_for_selector('.product-price', timeout=10000)
        # Extract data using page.evaluate() for complex JS interactions
        product_data = page.evaluate('''
            () => ({
                name: document.querySelector('h1.product-title')?.innerText,
                price: document.querySelector('.product-price .amount')?.innerText,
                stock: document.querySelector('.stock')?.innerText,
            })
        ''')
        # Or take a screenshot for debugging
        # page.screenshot(path='/tmp/debug_screenshot.png')
        browser.close()
        return product_data

# For infinite scroll pages:
def scrape_infinite_scroll(url, max_scrolls=10):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, args=['--no-sandbox'])
        page = browser.new_page()
        page.goto(url)
        all_items = []
        for _ in range(max_scrolls):
            # Scroll to the bottom to trigger lazy loading
            page.keyboard.press('End')
            page.wait_for_timeout(2000)
            # Click a "Load More" button if the site uses one
            load_more = page.query_selector('button.load-more')
            if load_more:
                load_more.click()
                page.wait_for_load_state('networkidle')
            # Re-read every card now in the DOM (earlier cards stay on the page)
            items = page.query_selector_all('.product-card')
            texts = [item.inner_text() for item in items]
            # Stop once a scroll adds nothing new
            if len(texts) == len(all_items):
                break
            all_items = texts
        browser.close()
        return all_items
Performance note: Playwright on Raspberry Pi is slower than on a desktop. Expect roughly 3–5 seconds per page instead of under 1 second. Keep parallel browser instances to a maximum of 2 on the 4GB Pi 5 to avoid exhausting RAM. Blocking images and fonts (as in the first Playwright example) can cut page load time by around 40–60%, depending on the site.
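One way to enforce that two-browser cap is a thread pool with two workers, sketched below. `run_limited` is a hypothetical helper, and `fake_scrape` stands in for a function such as scrape_with_playwright so the example runs without a browser. The sketch assumes each call starts its own sync_playwright() instance, as scrape_with_playwright does, since Playwright's sync objects should not be shared across threads.

```python
from concurrent.futures import ThreadPoolExecutor

def run_limited(urls, scrape, max_workers=2):
    """Run scrape(url) for every URL with at most max_workers running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with urls
        return list(pool.map(scrape, urls))

def fake_scrape(url):
    # Stand-in for a real scraping function; returns a dummy record
    return {"url": url, "ok": True}

results = run_limited(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fake_scrape,
)
print(len(results))
```

With a real scraper you would pass scrape_with_playwright in place of fake_scrape and leave max_workers at 2.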
Storing Scraped Data: SQLite and CSV
For most Pi scraping projects, SQLite is the ideal storage backend: it’s built into Python, requires no server, handles concurrent reads well, and a single database file is easy to back up or transfer. The sqlite-utils library makes working with it even easier:
import sqlite_utils
from datetime import datetime

db = sqlite_utils.Database('/home/pi/scraper_data.db')

# Auto-create table from dict (no schema needed upfront)
db['products'].insert({
    'scraped_at': datetime.now().isoformat(),
    'name': 'Raspberry Pi 5 4GB',
    'price_inr': 7200.0,
    'url': 'https://example.com/product/rpi5-4gb',
    'in_stock': True,
}, alter=True)  # alter=True adds new columns automatically

# Query with pandas for analysis
import pandas as pd
df = pd.read_sql('SELECT * FROM products ORDER BY scraped_at DESC LIMIT 100',
                 db.conn)
print(df.describe())

# Export to CSV for sharing
df.to_csv('/home/pi/price_report.csv', index=False)
For price tracking specifically, always store historical data rather than updating in place. The pattern above inserts a new row on every scrape run, letting you track price changes over time with SQL queries like:
-- Find products whose price dropped more than 10% within the last 7 days
-- (window functions need SQLite 3.25 or newer)
SELECT DISTINCT name,
       first_price,
       latest_price,
       round((latest_price - first_price) / first_price * 100, 1) AS pct_change
FROM (
    SELECT name,
           first_value(price_inr) OVER (PARTITION BY url ORDER BY scraped_at) AS first_price,
           last_value(price_inr) OVER (PARTITION BY url ORDER BY scraped_at
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS latest_price
    FROM products
    -- scraped_at holds local-time ISO strings, so normalise both sides before comparing
    WHERE datetime(scraped_at) > datetime('now', 'localtime', '-7 days')
)
WHERE (latest_price - first_price) / first_price * 100 < -10;
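To see why append-only history is useful, here is a small self-contained sketch that computes the same first-versus-latest comparison in plain Python against an in-memory database. The sample rows are made up, `price_changes` is a hypothetical helper, and the 7-day filter is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (scraped_at TEXT, url TEXT, name TEXT, price_inr REAL)"
)
# Made-up sample history: one product drops 20%, the other is unchanged
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01T08:00:00", "https://example.com/p/1", "Sensor A", 500.0),
        ("2024-01-03T08:00:00", "https://example.com/p/1", "Sensor A", 400.0),
        ("2024-01-01T08:00:00", "https://example.com/p/2", "Sensor B", 300.0),
        ("2024-01-03T08:00:00", "https://example.com/p/2", "Sensor B", 300.0),
    ],
)

def price_changes(conn):
    """Return {url: percent change between the oldest and newest stored price}."""
    history = {}
    rows = conn.execute(
        "SELECT url, price_inr FROM products "
        "WHERE price_inr IS NOT NULL ORDER BY url, scraped_at"
    )
    for url, price in rows:
        history.setdefault(url, []).append(price)
    return {
        url: round((prices[-1] - prices[0]) / prices[0] * 100, 1)
        for url, prices in history.items()
        if prices[0]  # skip zero prices to avoid dividing by zero
    }

changes = price_changes(conn)
drops = {url: pct for url, pct in changes.items() if pct < -10}
print(drops)
```

Had the scraper updated rows in place, only the 400.0 and 300.0 readings would exist and no change could be computed.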
Scheduling: Cron vs Systemd Timers
For recurring scraper runs, you have two main options on Raspberry Pi OS:
Cron (Simple, Classic)
Add a cron job with crontab -e:
# Run scraper every 6 hours
0 */6 * * * /home/pi/scrapers_env/bin/python /home/pi/scraper.py >> /var/log/scraper_cron.log 2>&1
# Run at 8 AM and 8 PM India time (IST = UTC+5:30); the UTC times below assume the Pi's system timezone is UTC
30 2,14 * * * /home/pi/scrapers_env/bin/python /home/pi/scraper.py >> /var/log/scraper_cron.log 2>&1
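The UTC values in that second entry come from subtracting 5 hours 30 minutes from the IST time. A minimal sketch of the conversion, using a hypothetical `ist_to_utc_cron` helper:

```python
from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))  # fixed offset; India has no DST

def ist_to_utc_cron(hour, minute=0):
    """Return (minute, hour) cron fields in UTC for a given IST wall-clock time."""
    local = datetime(2024, 1, 1, hour, minute, tzinfo=IST)  # the date is arbitrary
    utc = local.astimezone(timezone.utc)
    return utc.minute, utc.hour

print(ist_to_utc_cron(8))   # 8 AM IST
print(ist_to_utc_cron(20))  # 8 PM IST
```

If the Pi's timezone is already set to Asia/Kolkata, no conversion is needed and the entry can use 0 8,20 directly, because cron schedules against the system's local time.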
Systemd Timer (More Robust)
Systemd timers are more robust than cron for production use: they log to journald, can trigger a notification unit on failure (via OnFailure=), can catch up on runs missed while the Pi was off (with Persistent=true), and allow dependency ordering.
Create /etc/systemd/system/scraper.service:
[Unit]
Description=Product Price Scraper
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
User=pi
Environment=VIRTUAL_ENV=/home/pi/scrapers_env
ExecStart=/home/pi/scrapers_env/bin/python /home/pi/scraper.py
StandardOutput=append:/var/log/scraper.log
StandardError=append:/var/log/scraper_errors.log
TimeoutStartSec=3600
Create /etc/systemd/system/scraper.timer:
[Unit]
Description=Run scraper every 6 hours
[Timer]
OnCalendar=*-*-* 00,06,12,18:00:00
# Run missed jobs after reboot (systemd does not allow trailing comments on a setting line)
Persistent=true
[Install]
WantedBy=timers.target
sudo systemctl daemon-reload
sudo systemctl enable --now scraper.timer
systemctl list-timers # Verify it's scheduled
Anti-Detection: Headers, Delays, and Proxies
Most sites have anti-bot measures ranging from simple User-Agent checks to sophisticated fingerprinting. Here are practical techniques to keep your scraper running:
1. Realistic Request Headers
Beyond User-Agent, send the full browser header set that a real Chromium would send: Accept, Accept-Language, Accept-Encoding, Sec-Fetch-* headers. Tools like curl_cffi impersonate specific browser TLS fingerprints:
pip install curl_cffi
from curl_cffi import requests
resp = requests.get(url, impersonate='chrome120') # Impersonates Chrome 120 TLS fingerprint
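If you stay with plain requests, the fuller header set might look like the sketch below. The values reflect what Chromium typically sends on a top-level navigation. `BROWSER_HEADERS` and `with_browser_headers` are hypothetical names, and the exact headers vary by browser version, so compare against a real request in DevTools.

```python
# Headers a Chromium browser typically sends when the user opens a page directly.
BROWSER_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-IN,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

def with_browser_headers(extra=None):
    """Return a fresh header dict, optionally overriding or adding entries."""
    headers = dict(BROWSER_HEADERS)  # copy so the shared template is never mutated
    headers.update(extra or {})
    return headers

headers = with_browser_headers({"Accept-Language": "en-GB,en;q=0.9"})
print(sorted(headers))
```

Applying it to the session from the static scraper would be SESSION.headers.update(with_browser_headers()).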
2. Polite Random Delays
Never send requests at a fixed rate. Use random delays drawn from a realistic distribution:
import time, random
# Gaussian-distributed delay, mean 3s, std 1.5s, min 1s
delay = max(1.0, random.gauss(3.0, 1.5))
time.sleep(delay)
3. Respect robots.txt
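from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

def allowed_by_robots(url, user_agent='*'):
    """Return True if robots.txt permits fetching this URL."""
    if not rp.can_fetch(user_agent, url):
        print(f'robots.txt disallows: {url}')
        return False
    return True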
4. Rotating Proxies (for blocked IPs)
If your Pi’s IP gets temporarily blocked (HTTP 429 or 403 responses), rotating residential proxies through a service such as ScraperAPI or Bright Data is the most reliable fix. Tor is a free alternative, but e-commerce sites often block Tor exit IPs in advance.
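A simple round-robin rotation can be sketched as follows. The proxy URLs are placeholders and `next_proxies` is a hypothetical helper; commercial services usually document their own gateway address and credentials, which may make manual rotation unnecessary.

```python
from itertools import cycle

# Hypothetical proxy endpoints; substitute the ones your provider gives you
PROXY_URLS = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]
_proxy_pool = cycle(PROXY_URLS)

def next_proxies():
    """Return a requests-style proxies dict using the next proxy in rotation."""
    proxy = next(_proxy_pool)
    return {"http": proxy, "https": proxy}

first = next_proxies()
second = next_proxies()
third = next_proxies()  # wraps around to the first proxy again
print(first["https"], second["https"], third["https"])
```

With requests, the dict is passed per call, for example SESSION.get(url, proxies=next_proxies(), timeout=15).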
Frequently Asked Questions
Is web scraping legal in India?
Web scraping is generally legal in India for publicly accessible data, provided you respect the site’s robots.txt, don’t bypass authentication or CAPTCHA systems in a way that violates the unauthorised-access provisions of the IT Act, don’t scrape copyrighted content for republication, and don’t use scraped data for anti-competitive purposes. Scraping your own data, public government data, or general publicly accessible information for personal or research use generally falls within legal limits. Always review a site’s Terms of Service, since some explicitly prohibit automated access.
How fast can a Raspberry Pi scrape websites?
With requests + BeautifulSoup and 10 concurrent threads, a Pi 5 can process 50–200 pages per minute depending on site response times and your delay settings. With Playwright (headless Chromium), expect 10–30 pages per minute. Network latency is usually the bottleneck, not Pi CPU — a fast internet connection improves throughput more than upgrading Pi RAM.
How do I handle CAPTCHAs in my scraper?
CAPTCHA options, from easiest to hardest: (1) Find the underlying API endpoint and skip the CAPTCHA-protected page entirely. (2) Use a CAPTCHA solving service like 2captcha or Anti-Captcha (about $2–3 per 1,000 solves). (3) Use Playwright with slow, human-like interaction patterns to avoid triggering CAPTCHA challenges in the first place. (4) Solve CAPTCHAs manually, using a browser extension that forwards challenges queued on the Raspberry Pi to you. Writing your own CAPTCHA solver is rarely worth the effort.
Can I run the scraper while the Raspberry Pi is sleeping?
The Raspberry Pi doesn’t have a traditional sleep/wake cycle like a laptop — it’s either powered on or powered off. You can reduce power consumption while the scraper is idle by underclocking the CPU via config.txt. For intermittent use, consider a relay or smart plug to power the Pi on when needed, triggered by a scheduled automation.
How do I get notified when the scraper finds a price drop or stock alert?
After storing the scraped data, add a notification step that compares the new price against the previous reading. Send alerts via the Telegram Bot API (free, instant push to your phone), email via SMTP with Python’s smtplib, or ntfy (push notifications through the hosted ntfy.sh service or a self-hosted server). A basic Telegram bot integration takes roughly 20 lines of Python.
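A minimal sketch of that notification step, using only the standard library, is shown below. `price_drop_message` and `send_telegram` are hypothetical helpers, the 10% threshold is an assumption, and the bot token and chat ID must come from your own Telegram bot. Only the comparison runs here; the send function is defined but not called.

```python
import json
import urllib.parse
import urllib.request

def price_drop_message(name, previous, current, threshold_pct=10.0):
    """Return an alert string if the price fell by at least threshold_pct, else None."""
    if not previous or current is None:
        return None
    change = (current - previous) / previous * 100
    if change > -threshold_pct:
        return None
    return f"{name}: ₹{previous:.0f} -> ₹{current:.0f} ({change:.1f}%)"

def send_telegram(token, chat_id, text):
    """POST a message through the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    with urllib.request.urlopen(url, data=data, timeout=15) as resp:
        return json.load(resp).get("ok", False)

msg = price_drop_message("Raspberry Pi 5 4GB", 7200.0, 6300.0)
print(msg)
```

In the scraper you would call send_telegram(token, chat_id, msg) whenever msg is not None, right after the new row is committed.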