Offline TinyML Person Detection on ESP32-CAM

Running TinyML person detection offline on an ESP32-CAM brings AI vision to a ₹500 device with no cloud dependency. TensorFlow Lite Micro runs a quantised MobileNet model directly on the ESP32’s CPU, detecting people at the edge at a few frames per second. This tutorial covers model deployment, inference code, and practical applications for Indian makers interested in edge AI.

What is TinyML?
Pre-trained Person Detection Model
Arduino Setup for TFLite Micro
Inference Code
Interpreting Output and Triggering Actions
Optimisation Tips
Frequently Asked Questions

What is TinyML?

TinyML (Tiny Machine Learning) refers to machine learning models optimised to run on microcontrollers and edge devices with limited memory and compute. Key characteristics:

Models are quantised to INT8 or INT4 to reduce size (from MBs to KBs)
No internet connection required — inference runs locally
Low power consumption (milliwatts vs watts for server-based AI)
Latency is predictable — no network variability
Privacy: no images sent to cloud

The ESP32-CAM with 4MB PSRAM can run models up to ~500KB in size — enough for simple object detection and classification tasks.
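
To make the quantisation point above concrete, here is a minimal sketch of the affine INT8 mapping that quantised models typically use. The names QuantParams, quantize and dequantize are hypothetical helpers written for this illustration, and the scale and zero point in the usage note are example values, not figures read from the person detection model.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Affine INT8 quantisation parameters (illustrative; a real model stores
// its own values per tensor).
struct QuantParams {
  float scale;         // real-value step represented by one integer step
  int32_t zero_point;  // integer that represents the real value 0.0
};

// Real -> INT8: q = round(x / scale) + zero_point, clamped to -128..127.
inline int8_t quantize(float x, QuantParams p) {
  assert(p.scale > 0.0f);
  long q = std::lround(x / p.scale) + p.zero_point;
  q = std::max(-128L, std::min(127L, q));
  return static_cast<int8_t>(q);
}

// INT8 -> real: x = (q - zero_point) * scale.
inline float dequantize(int8_t q, QuantParams p) {
  return (static_cast<int>(q) - p.zero_point) * p.scale;
}
```

With an assumed scale of 1/256 and zero point of -128, the real value 0.5 maps to the integer 0 and back again. Each value then occupies one byte instead of the four bytes of a float, which is where most of the size reduction comes from.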

Pre-trained Person Detection Model

Google’s open-source person detection model for microcontrollers is the starting point. It’s included in the TensorFlow Lite for Microcontrollers examples repository:

Input: 96×96 greyscale image
Output: Two scores — person / no person
Model size: ~300KB (INT8 quantised)
Accuracy: ~89% on test set (real-world lower in variable conditions)
Inference time on ESP32: ~200–400ms per frame

Arduino Setup for TFLite Micro

// Install these from the Arduino IDE:
// 1. A TensorFlow Lite Micro port for ESP32 from Library Manager,
//    e.g. EloquentTinyML or TensorFlowLite_ESP32
// 2. ESP32 board support by Espressif (Boards Manager)

// Board settings for ESP32-CAM:
// Board: AI Thinker ESP32-CAM
// CPU Freq: 240 MHz (max for inference speed)
// Upload Speed: 115200

// Memory requirements:
// Model: ~300KB in flash
// Tensor arena: ~100KB in RAM (PSRAM recommended)
// The model fits easily in the ESP32-CAM's 4MB flash;
// the tensor arena comes out of RAM, not flash

Inference Code

#include "esp_camera.h"
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "person_detect_model_data.h"  // From TFLite examples

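// AI Thinker pin definitions (same as before)
#define PWDN_GPIO_NUM 32
// ... (same camera config as RTSP example)
#ifndef LED_GPIO_NUM
#define LED_GPIO_NUM 4  // Assumption: onboard flash LED on AI Thinker boards
#endif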

constexpr int kTensorArenaSize = 100 * 1024;
uint8_t tensor_arena[kTensorArenaSize];

tflite::AllOpsResolver resolver;
const tflite::Model* model;
tflite::MicroInterpreter* interpreter;
TfLiteTensor* input;

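void setup() {
  Serial.begin(115200);
  pinMode(LED_GPIO_NUM, OUTPUT);

  // Camera init (same config, but GRAYSCALE, 96x96)
  camera_config_t config = {};  // zero-initialise before filling in fields
  // ... fill in pin definitions ...
  config.pixel_format = PIXFORMAT_GRAYSCALE;
  config.frame_size = FRAMESIZE_96X96;
  config.fb_count = 1;
  if (esp_camera_init(&config) != ESP_OK) {
    Serial.println("Camera init failed");
    while (true) { delay(1000); }  // halt: nothing to run inference on
  }

  // Load TFLite model
  model = tflite::GetModel(g_person_detect_model_data);
  // Note: some TFLite Micro versions also take an error reporter argument here
  interpreter = new tflite::MicroInterpreter(
    model, resolver, tensor_arena, kTensorArenaSize
  );
  if (interpreter->AllocateTensors() != kTfLiteOk) {
    Serial.println("AllocateTensors failed; try a larger tensor arena");
    while (true) { delay(1000); }  // halt: interpreter is unusable
  }
  input = interpreter->input(0);

  Serial.println("TinyML Person Detection Ready");
}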

void loop() {
  camera_fb_t *fb = esp_camera_fb_get();
  if (!fb) { Serial.println("Camera capture failed"); return; }
  
  // Copy 96x96 grayscale frame to model input,
  // shifting each pixel from 0..255 to -128..127
  for (int i = 0; i < 96 * 96; i++) {
    input->data.int8[i] = (int8_t)((int)fb->buf[i] - 128);
  }
  esp_camera_fb_return(fb);
  
  // Run inference
  TfLiteStatus status = interpreter->Invoke();
  if (status != kTfLiteOk) {
    Serial.println("Inference failed");
    return;
  }
  
  // Get output scores
  TfLiteTensor* output = interpreter->output(0);
  int8_t no_person_score = output->data.int8[0];
  int8_t person_score = output->data.int8[1];
  
  Serial.printf("Person: %d, No Person: %d\n",
                person_score, no_person_score);
  
  if (person_score > no_person_score) {
    Serial.println("PERSON DETECTED!");
    // Trigger GPIO, LED, buzzer, relay, etc.
    digitalWrite(LED_GPIO_NUM, HIGH);
  } else {
    digitalWrite(LED_GPIO_NUM, LOW);
  }
  
  delay(100);
}
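
The pixel copy in loop() can be pulled out into a small helper that also checks the frame length, which guards against a camera configured for a different resolution. This is a sketch only: copy_frame_to_input and the k-prefixed constants are hypothetical names, and the fixed shift of 128 assumes the model's input uses a zero point of -128 with one integer step per grey level, as the code above already does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kImageWidth = 96;
constexpr int kImageHeight = 96;
constexpr std::size_t kInputBytes = kImageWidth * kImageHeight;  // 9216

// Copies an unsigned 8-bit greyscale frame into a signed INT8 buffer,
// shifting each pixel from 0..255 to -128..127. Returns false (and copies
// nothing) when the frame length is not the expected 96x96 bytes.
inline bool copy_frame_to_input(const uint8_t* frame, std::size_t frame_len,
                                int8_t* input) {
  assert(frame != nullptr && input != nullptr);
  if (frame_len != kInputBytes) {
    return false;
  }
  for (std::size_t i = 0; i < kInputBytes; ++i) {
    // Widen to int before subtracting so the intent is explicit.
    input[i] = static_cast<int8_t>(static_cast<int>(frame[i]) - 128);
  }
  return true;
}
```

In the Arduino sketch this would be called as copy_frame_to_input(fb->buf, fb->len, input->data.int8) before returning the frame buffer, skipping inference when it returns false.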

Interpreting Output and Triggering Actions

The model outputs two INT8 scores, each in the range -128 to 127. A person score above 100 indicates high confidence. Typical thresholds:

person_score > 100: High confidence person detected → trigger security alert, unlock door, turn on light
person_score > 50: Moderate confidence → log detection, increment counter
person_score ≤ 0: No person (scores from 1 to 50 are ambiguous and best ignored)
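
The thresholds above can be sketched as a small classifier, together with a conversion from the raw INT8 score to an approximate probability. Everything here is illustrative: Confidence, score_to_probability and classify are hypothetical names, the default scale of 1/256 and zero point of -128 are assumptions (on a real tensor, read output->params.scale and output->params.zero_point instead), and treating every score of 50 or below as "no detection" is one possible way to handle the range the list leaves open.

```cpp
#include <cassert>
#include <cstdint>

enum class Confidence { kNone, kModerate, kHigh };

// Converts a raw INT8 score to an approximate probability.
// The default scale and zero point are assumptions, not values read
// from the model.
inline float score_to_probability(int8_t score, float scale = 1.0f / 256.0f,
                                  int zero_point = -128) {
  assert(scale > 0.0f);
  return (static_cast<int>(score) - zero_point) * scale;
}

// Maps the person score onto the tiers from the threshold list.
// Assumption: anything at or below 50 counts as no detection.
inline Confidence classify(int8_t person_score) {
  if (person_score > 100) return Confidence::kHigh;
  if (person_score > 50) return Confidence::kModerate;
  return Confidence::kNone;
}
```

Under those assumed parameters a score of 100 corresponds to roughly 0.89 and a score of 0 to exactly 0.5, which gives a feel for what the raw thresholds mean.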

Optimisation Tips

Set the CPU to 240 MHz (maximum) with setCpuFrequencyMhz(240)
Allocate the tensor arena in PSRAM: uint8_t* tensor_arena = (uint8_t*)heap_caps_malloc(kTensorArenaSize, MALLOC_CAP_SPIRAM)
Run inference on every 3rd frame, since person positions change slowly
Consider Edge Impulse (edgeimpulse.com) for training custom models and deploying to ESP32; it offers a free tier and Arduino library export
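
The "every 3rd frame" tip needs only a counter. The following is a minimal sketch; FrameSkipper is a hypothetical helper, not part of any library.

```cpp
#include <cassert>
#include <cstdint>

// Returns true on the first call and then on every Nth call after it,
// so inference can be skipped on the frames in between.
class FrameSkipper {
 public:
  explicit FrameSkipper(uint32_t every_n) : every_n_(every_n) {
    assert(every_n > 0);
  }

  bool should_run() {
    const bool run = (count_ == 0);
    count_ = (count_ + 1) % every_n_;
    return run;
  }

 private:
  uint32_t every_n_;
  uint32_t count_ = 0;
};
```

In loop(), a global FrameSkipper skipper(3) could gate the call to interpreter->Invoke(): when skipper.should_run() is false, return the frame buffer and keep the previous result.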

Frequently Asked Questions

Can I train my own custom TinyML model for ESP32-CAM?

Yes. Edge Impulse is a popular platform for this. Collect images, annotate them in the web interface, train a model, and export an Arduino library. The whole process, from data collection to deployment, can take as little as a few hours. Edge Impulse’s free tier supports models that fit on ESP32. Popular custom models: face detection, specific product detection, gesture recognition, defect detection for Indian manufacturing QC.

How accurate is the person detection model on ESP32-CAM?

In good lighting with a frontal view of a person, accuracy is typically 80–90%. In challenging conditions (backlit, partial view, unusual angles, Indian sarees/traditional clothing), accuracy drops to 60–75%. For security applications, combine person detection as a first filter with a higher-quality secondary check (Telegram alert + human review) rather than using it as a sole access control mechanism.
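
One simple way to make the detector a more reliable first filter is to require several consecutive positive frames before raising an alert, so a single misclassified frame does not trigger a notification. This is a sketch of that idea only; DetectionFilter is a hypothetical helper and the number of required frames is something to tune for your own setup.

```cpp
#include <algorithm>
#include <cassert>

// Reports a detection only after `required_hits` consecutive frames
// have each contained a person; any negative frame resets the streak.
class DetectionFilter {
 public:
  explicit DetectionFilter(int required_hits)
      : required_hits_(required_hits) {
    assert(required_hits > 0);
  }

  // Feed one per-frame result; returns true while the streak is long enough.
  bool update(bool person_in_frame) {
    streak_ = person_in_frame ? std::min(streak_ + 1, required_hits_) : 0;
    return streak_ >= required_hits_;
  }

 private:
  int required_hits_;
  int streak_ = 0;
};
```

The per-frame result fed to update() would be the comparison person_score > no_person_score from the inference code, and the secondary check (for example the Telegram alert for human review) would run only when update() returns true.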

What other TinyML models can run on ESP32-CAM?

Models that fit in roughly 300KB when INT8-quantised can run on the ESP32-CAM. Vision tasks that work at 96×96 include gesture recognition (hand gestures), digit recognition, and simple fruit/object classification. Keyword spotting also fits, but it needs a microphone rather than the camera. Models requiring higher resolution (96×96 is very low) or more processing power are better suited to a Raspberry Pi with TensorFlow Lite or a Jetson Nano with CUDA acceleration.
