Build an ESP32 offline voice assistant with keyword detection that works without internet connectivity, perfect for Indian homes where privacy matters and connectivity varies. This tutorial shows how to detect wake words like “Hey Zbotic” directly on the ESP32 using Espressif’s ESP-SR library and TensorFlow Lite Micro, then trigger home automation actions using MQTT over WiFi.
Table of Contents
- Why Offline Keyword Detection for India?
- Hardware Requirements and Wiring
- Getting Started with ESP-SR Framework
- TensorFlow Lite Micro on ESP32
- Training a Custom Wake Word
- Triggering MQTT Actions After Detection
- Performance Tips for Indian Accents
- Frequently Asked Questions
Why Offline Keyword Detection for India?
Cloud-based voice assistants send audio to US/EU data centres, adding 200-400ms latency on Indian broadband. An offline ESP32 wake-word detector responds in under 100ms, works during internet outages, costs under Rs 1,500 total, and keeps audio entirely within your home network. After fine-tuning on Indian accent samples, accuracy rivals commercial cloud systems.
Recommended: UNO WiFi R3 (ATmega328P + ESP8266)
The UNO WiFi R3 provides the WiFi backbone for your voice assistant MQTT command dispatch. Pair it with an ESP32 for voice detection and use this board as the home controller hub.
Hardware Requirements and Wiring
Recommended hardware (India prices 2025):
- ESP32-S3-DevKitC-1 (Rs 400-600) – more RAM than original ESP32
- INMP441 I2S MEMS microphone (Rs 150) – better noise rejection than MAX9814
- MAX98357A I2S amplifier + 3W speaker (Rs 200) for audio feedback
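INMP441 -> ESP32
VDD -> 3.3V
GND -> GND
SD -> GPIO 32 (I2S Data)
SCK -> GPIO 14 (I2S Clock)
WS -> GPIO 15 (I2S Word Select)
L/R -> GND (left channel)
Note: these pin numbers suit the original ESP32. On the ESP32-S3-DevKitC-1, GPIO 26-32 are normally reserved for the module's flash, so move SD to a free GPIO and use the same pin in your I2S configuration.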
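The INMP441 sends 24-bit samples inside 32-bit I2S slots, while the models in this tutorial work on 16-bit audio. The snippet below is a minimal sketch of that conversion, assuming the I2S driver hands you raw 32-bit slot words with the sample in the upper 24 bits. `inmp441ToPcm16` and `convertBlock` are hypothetical helper names, not part of any library.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: keep the top 16 bits of one 32-bit I2S slot word.
// Assumes the 24-bit sample sits in bits 31..8 and the low byte is padding.
inline int16_t inmp441ToPcm16(int32_t slot) {
    // Right shift of a negative value is arithmetic on ESP32 toolchains,
    // so the sign of the sample is preserved.
    return static_cast<int16_t>(slot >> 16);
}

// Convert a block of slot words into 16-bit PCM, preserving order.
inline void convertBlock(const int32_t* slots, int16_t* pcm, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        pcm[i] = inmp441ToPcm16(slots[i]);
    }
}
```

If the microphone sounds too quiet after conversion, a smaller shift (for example 14 instead of 16) acts as a simple digital gain, at the cost of clipping on loud input.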
Getting Started with ESP-SR Framework
Espressif’s ESP-SR provides pre-trained wake word models optimised for ESP32-S3:
git clone --recursive https://github.com/espressif/esp-idf.git
cd esp-idf && ./install.sh esp32s3 && . ./export.sh
git clone https://github.com/espressif/esp-sr.git
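# If this directory is missing from your checkout, look for the
# wake_word_detection example in Espressif's esp-skainet repository instead
cd esp-sr/examples/wake_word_detection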
idf.py set-target esp32s3
idf.py menuconfig # Enable WakeNet7, select wn7_hiesp
idf.py build flash monitor
Say “Hi ESP” and the console prints WAKE_UP detected. WakeNet7 achieves <2% false positive rate in typical Indian home noise environments (ceiling fan, TV at moderate volume).
TensorFlow Lite Micro on ESP32
For Arduino-based workflows, use TFLite Micro (install via Library Manager: TensorFlowLite_ESP32):
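#include <TensorFlowLite_ESP32.h>
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_error_reporter.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "model_data.h" // .tflite model as C array
const int kArenaSize = 60 * 1024;
alignas(16) uint8_t tensor_arena[kArenaSize];
tflite::MicroErrorReporter error_reporter;
tflite::AllOpsResolver resolver;
const tflite::Model* model = tflite::GetModel(g_model_data);
// Some library versions omit the trailing error-reporter argument
tflite::MicroInterpreter interpreter(model, resolver, tensor_arena, kArenaSize, &error_reporter);
void setup() {
  interpreter.AllocateTensors(); // reserve tensor memory before any inference
}
void loop() {
  // Fill interpreter.input(0) with the MFCC features of a 1-second 16kHz audio window
  if (interpreter.Invoke() != kTfLiteOk) return;
  // Output: probabilities per keyword class (float model assumed)
  TfLiteTensor* output = interpreter.output(0);
  float wake_prob = output->data.f[0]; // "Hey Zbotic"
  float lights_prob = output->data.f[1]; // "Lights on"
}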
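The per-class probabilities still need a decision rule before anything is triggered. One possible rule, sketched below, picks the highest-scoring class and rejects it when it falls under a threshold. `pickKeyword` is a hypothetical helper, and the threshold value is something you would tune on your own recordings.

```cpp
#include <cstddef>

// Hypothetical helper: returns the index of the highest-scoring class,
// or -1 when even the best class is below the threshold.
inline int pickKeyword(const float* probs, size_t count, float threshold) {
    if (count == 0) return -1;
    size_t best = 0;
    for (size_t i = 1; i < count; ++i) {
        if (probs[i] > probs[best]) best = i;
    }
    return probs[best] >= threshold ? static_cast<int>(best) : -1;
}
```

With the two-class model above, `pickKeyword(output->data.f, 2, 0.8f)` would return 0 for "Hey Zbotic", 1 for "Lights on", and -1 when neither is confident enough.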
Recommended: Mega WiFi R3 (ATmega2560 + ESP8266)
The Mega WiFi R3 handles complex voice automation setups with 54 I/O pins. Connect relays for lights and fans, RGB LEDs for visual feedback, and use the built-in ESP8266 for MQTT command forwarding.
Training a Custom Wake Word
Use Edge Impulse (free tier at edgeimpulse.com) for custom wake word training:
- Record 50-100 utterances at 16kHz, 16-bit mono WAV. Include background noise (TV, ceiling fan).
- Upload to Edge Impulse, extract MFCC features, train a small neural net (<50KB model).
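- Export as TFLite and convert it to a C array:
xxd -i model.tflite > model_data.h
- xxd names the array model_tflite after the input file; rename it to g_model_data so it matches the sketch above.
- Indian accent tip: Record from multiple speakers, mixing Hindi-medium and English-medium accents. Aim for <5% false positive rate in typical home noise.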
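The MFCC step slices each 1-second, 16kHz window into overlapping frames, and that framing fixes the size of the model's input tensor. The worked sketch below shows the arithmetic, assuming 30ms frames, a 20ms stride and 13 coefficients per frame. Those three values are illustrative assumptions; use whatever your Edge Impulse project is configured with.

```cpp
#include <cstddef>

// Number of full frames that fit in a window
// (0 if the window is shorter than a single frame).
constexpr size_t frameCount(size_t windowSamples, size_t frameSamples,
                            size_t strideSamples) {
    return windowSamples < frameSamples
               ? 0
               : 1 + (windowSamples - frameSamples) / strideSamples;
}

constexpr size_t kSampleRate = 16000;                // Hz, as recorded above
constexpr size_t kWindow = kSampleRate;              // 1-second window
constexpr size_t kFrame = kSampleRate * 30 / 1000;   // assumed 30ms frame: 480 samples
constexpr size_t kStride = kSampleRate * 20 / 1000;  // assumed 20ms stride: 320 samples
constexpr size_t kCoeffs = 13;                       // assumed coefficients per frame

constexpr size_t kFrames = frameCount(kWindow, kFrame, kStride);  // 49 frames
constexpr size_t kFeatures = kFrames * kCoeffs;                   // 637 input values
```

Under these assumptions the model receives 637 values per inference, which is the number of elements your feature buffer must hold before calling the interpreter.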
Triggering MQTT Actions After Detection
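#include <WiFi.h>
#include <PubSubClient.h>
WiFiClient net;
PubSubClient mqtt(net); // call mqtt.setServer() with your broker address in setup() once WiFi is up
// Retry until the broker accepts the connection; "esp32-voice" is a placeholder client ID
void reconnect() {
  while (!mqtt.connected()) {
    if (!mqtt.connect("esp32-voice")) delay(1000);
  }
}
// Translate a recognised command label into MQTT messages
void onWakeWord(const char* command) {
  if (!mqtt.connected()) reconnect();
  if (strcmp(command, "LIGHTS_ON") == 0) {
    mqtt.publish("home/livingroom/light", "ON");
  } else if (strcmp(command, "FAN_OFF") == 0) {
    mqtt.publish("home/bedroom/fan", "OFF");
  } else if (strcmp(command, "GOOD_NIGHT") == 0) {
    mqtt.publish("home/all/lights", "OFF");
    mqtt.publish("home/all/ac", "OFF");
  }
}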
In Home Assistant, create MQTT-triggered automations responding to these topics. This gives a complete offline voice pipeline: microphone to ESP32 to MQTT to Home Assistant to relays.
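As the command list grows, the if/else chain can be replaced by a lookup table. The sketch below is one way to do that, using the same commands and topics as the handler above. `MqttAction`, `kActions` and `dispatch` are hypothetical names, and the publish step is passed in as a callable so the mapping logic stays independent of the MQTT client.

```cpp
#include <cstddef>
#include <cstring>

// One row per message to send; a command may appear in several rows.
struct MqttAction {
    const char* command;  // label produced by the recogniser
    const char* topic;
    const char* payload;
};

// Same mappings as the handler above, expressed as data.
static const MqttAction kActions[] = {
    {"LIGHTS_ON",  "home/livingroom/light", "ON"},
    {"FAN_OFF",    "home/bedroom/fan",      "OFF"},
    {"GOOD_NIGHT", "home/all/lights",       "OFF"},
    {"GOOD_NIGHT", "home/all/ac",           "OFF"},
};
constexpr size_t kActionCount = sizeof(kActions) / sizeof(kActions[0]);

// Calls publish(topic, payload) for every row matching the command and
// returns how many messages were sent (0 means the command is unknown).
template <typename Publish>
size_t dispatch(const char* command, Publish publish) {
    size_t sent = 0;
    for (size_t i = 0; i < kActionCount; ++i) {
        if (std::strcmp(command, kActions[i].command) == 0) {
            publish(kActions[i].topic, kActions[i].payload);
            ++sent;
        }
    }
    return sent;
}
```

On the device, the callable would simply wrap the client, for example `dispatch(command, [](const char* t, const char* p) { mqtt.publish(t, p); });`. Adding a new voice command then means adding a row rather than another branch.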
Performance Tips for Indian Accents
- Background noise training: Indian homes have distinct noise profiles: ceiling fans (300-500Hz hum), pressure cooker whistles, Bollywood music. Include these as negative training samples.
- Placement: Mount the microphone at ear height (~1.2m), away from AC vents and the mixer-grinder.
- False positives: Common Hindi words (“abhi”, “bhi”) can trigger English wake words. Add a 500ms confirmation window requiring two consecutive detections above 80% confidence.
- Latency: WakeNet7 on ESP32-S3 detects in ~80ms after utterance end, well below the 300ms human perception threshold.
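The confirmation window from the false-positives tip can be sketched as a small state machine: the first confident detection only arms it, and a second consecutive one within 500ms fires. `WakeConfirmer` is a hypothetical class written for illustration, with the 80% and 500ms values from the tip as defaults.

```cpp
#include <cstdint>

// Hypothetical helper: fires only when two consecutive inference results
// clear the threshold within the confirmation window.
class WakeConfirmer {
public:
    explicit WakeConfirmer(float threshold = 0.8f, uint32_t windowMs = 500)
        : threshold_(threshold), windowMs_(windowMs) {}

    // Feed one inference result with its timestamp (e.g. from millis()).
    // Returns true exactly when a second qualifying hit confirms the first.
    bool update(float prob, uint32_t nowMs) {
        if (prob < threshold_) {  // a miss breaks the streak
            pending_ = false;
            return false;
        }
        // Unsigned subtraction stays correct when the millisecond counter wraps.
        if (pending_ && nowMs - firstHitMs_ <= windowMs_) {
            pending_ = false;     // confirmed; the next hit starts a new streak
            return true;
        }
        pending_ = true;          // first hit, or the earlier one has expired
        firstHitMs_ = nowMs;
        return false;
    }

private:
    float threshold_;
    uint32_t windowMs_;
    bool pending_ = false;
    uint32_t firstHitMs_ = 0;
};
```

Call `update()` once per inference with the wake-word probability, and run the wake action only when it returns true. A single stray match on a word like “abhi” then arms the confirmer but does not trigger anything on its own.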
Recommended: 12V 1-Channel Relay Module (RS485/Modbus)
Complete your voice assistant build with this relay module. After keyword detection, the ESP32 sends MQTT commands that trigger this relay to physically control lights, fans, and geysers.
Frequently Asked Questions
- Does ESP32 have enough RAM for voice AI?
- ESP32-S3 boards with external PSRAM (up to 8MB on N8R8 module variants) have enough room for WakeNet7 (320KB model) and audio buffers. The original ESP32 (520KB SRAM) works only with small TFLite models under 50KB.
- Can I use Hindi commands?
- Yes, but you need Hindi training data. ESP-SR WakeNet supports English and Mandarin natively; Hindi requires custom Edge Impulse training with 100+ Hindi command samples from Indian speakers.
- What is the recognition range?
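- With INMP441, reliable detection up to ~3 metres in a quiet room, ~1.5 metres with ceiling fan running. For larger rooms, note that the MAX4466 is an analog microphone preamp and cannot boost the INMP441's digital I2S output; using it means switching to an analog electret microphone read through the ADC.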
- Can this work without WiFi?
- The keyword detection is fully offline. For MQTT-less control, connect relays directly to ESP32 GPIO pins and drive them from within the detection callback.
- How do I add custom commands beyond wake words?
- ESP-SR’s MultiNet (command word recognition) runs after wake detection and supports up to 200 commands. Train commands like “turn on light” or “set temperature” and map them to MQTT topics.