Voice-Controlled Autonomous Raspberry Pi Robot

Project Type: Autonomous + Voice Controlled Raspberry Pi Robot
Use Case: Obstacle Avoidance, Voice Interaction, Image Understanding, Mobile/Web/Bluetooth Control
Tagline: “Building the Bots of Tomorrow.”


Key Features

Voice Control

  • Robot listens for commands like:
    • forward, backward, left, right, stop
    • scan → scans surroundings with servo-mounted ultrasonic sensor
    • photo → captures an image and describes it in Hindi or any other local language
    • auto / manual → switch modes
    • game → plays Rock-Paper-Scissors with the user

The laptop runs the AI backend, using Whisper for voice-to-text and LLaVA or LLaMA3 for understanding commands and describing images.
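
A minimal sketch of what the Whisper half of that backend could look like, assuming the openai-whisper package and a /api/ai/transcribe route (both are illustrative additions, not part of the original backend; the Pi code later in this post uses Google's recognizer on-device instead):

# Hypothetical Whisper endpoint on the laptop (route, model size and port are assumptions)
from flask import Flask, request, jsonify
import tempfile
import whisper

app = Flask(__name__)
model = whisper.load_model("base")  # "base" is an assumption; larger models are slower but more accurate

@app.route("/api/ai/transcribe", methods=["POST"])
def transcribe():
    audio = request.files.get("audio")
    if not audio:
        return jsonify({"error": "No audio uploaded"}), 400
    # Save the upload to a temp file, then let Whisper transcribe it
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        audio.save(tmp.name)
        result = model.transcribe(tmp.name)
    return jsonify({"text": result["text"].strip()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8081)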


AI Image Analysis (LLaVA or similar model)

  • Robot captures an image → sends it to the Flask API on the laptop
  • The API uses Ollama with a vision model (like LLaVA or Gemma) to describe the scene
  • Robot speaks the result in Hindi or any other local language
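
Before wiring up the robot, the image endpoint can be sanity-checked from any machine on the network. A quick sketch (the laptop IP and port are taken from the robot code below and will differ on your setup):

import requests

# Post a local test photo to the laptop's image-caption API and print the reply
with open("test.jpg", "rb") as f:
    res = requests.post("http://192.168.215.29:8080/api/ai/image", files={"image": f})
print(res.json().get("caption", "No description found"))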

Autonomous Obstacle Avoidance

  • Ultrasonic Sensor on a Servo motor scans left/center/right
  • Robot chooses the clearest path and moves (see the sketch after this list)
  • Captures image of obstacle and describes it aloud
  • Can be toggled with voice or a physical button
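
The auto-mode loop shown later simply turns left when it meets an obstacle; a sketch of the "choose the clearest path" behaviour could look like this (it reuses servo, ultrasonic and the movement functions from the robot code below; the angles and 0.3 s settle time are assumptions):

def choose_clearest_path():
    # Sweep left / centre / right and head toward the direction with the most free space
    readings = {}
    for label, angle in (("left", 150), ("center", 90), ("right", 30)):
        servo.angle = angle
        time.sleep(0.3)                               # let the servo settle before reading
        readings[label] = ultrasonic.distance * 100   # distance in cm
    servo.angle = 90                                  # re-centre the sensor
    best = max(readings, key=readings.get)
    if best == "left":
        turn_left()
    elif best == "right":
        turn_right()
    move_forward(0.5)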

Web Control Panel

  • Flask web server on the Pi with a simple control UI (a minimal sketch follows this list)
  • Buttons: Forward, Backward, Left, Right, Stop, Scan, Dance, Auto, Manual, Photo
  • Mobile-friendly interface for easy control
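
A minimal sketch of that control server (the route name, port and button mapping are assumptions; the movement functions are the ones defined in the robot code below, and the HTML page itself is omitted):

from flask import Flask

app = Flask(__name__)

# Map button names from the web UI to the robot's functions
COMMANDS = {
    "forward": move_forward,
    "backward": move_backward,
    "left": turn_left,
    "right": turn_right,
    "stop": stop,
    "scan": scan_surroundings,
    "dance": robo_dance,
}

@app.route("/control/<cmd>")
def control(cmd):
    action = COMMANDS.get(cmd)
    if action is None:
        return f"Unknown command: {cmd}", 404
    action()
    return f"OK: {cmd}"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)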

Mobile / Bluetooth Control (Coming Soon or Optional)

  • Option to add Bluetooth controller (HC-05 or phone)
  • Can control basic movements via a joystick or Bluetooth terminal app (see the sketch below)
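
A possible starting point once an HC-05 (or a phone) is paired over RFCOMM, using pyserial (a sketch only; /dev/rfcomm0, the baud rate and the single-letter protocol are assumptions):

import serial  # pyserial

# Read single-character commands sent from a Bluetooth terminal or joystick app
bt = serial.Serial("/dev/rfcomm0", 9600, timeout=1)
while True:
    cmd = bt.read(1).decode(errors="ignore").upper()
    if cmd == "F":
        move_forward()
    elif cmd == "B":
        move_backward()
    elif cmd == "L":
        turn_left()
    elif cmd == "R":
        turn_right()
    elif cmd == "S":
        stop()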

Fun Features

  • robo_dance() – a predefined dance sequence
  • rock_paper_scissors() – simple game using voice prompts

Hardware Components

  • Raspberry Pi 5: Main brain (runs Flask, motor control, etc.)
  • Ultrasonic Sensor (HC-SR04): For obstacle detection
  • Servo Motor: Rotates the ultrasonic sensor for scanning
  • 4x DC Gear Motors: For movement (connected via L298N)
  • L298N Motor Driver: Controls the motors
  • Pi Camera 2: Takes photos for AI processing
  • Speaker: For voice output (any local language TTS)
  • Button: To toggle auto/manual mode
  • Power Supply: Battery or power bank
  • Wi-Fi: For communication with the laptop
  • Laptop: Runs Ollama models and Flask APIs

Software Architecture

            ┌────────────────────┐
            │     Voice Input    │
            │  (Hey Robo, etc.)  │
            └─────────┬──────────┘
                      ↓
      ┌─────────────────────────────┐
      │  Laptop with Whisper + AI   │
      │   Flask API (Text/Image)    │
      └──────┬──────────────┬───────┘
             ↓              ↓
       Voice command    Image caption
      (e.g., "forward") (e.g., "यह एक दरवाजा है।" / "This is a door.")
             ↓              ↓
      ┌─────────────────────────────┐
      │    Raspberry Pi (Robot)     │
      │  Flask server + GPIO logic  │
      └─────────────────────────────┘

Technologies Used

  • Python (Flask, GPIO Zero, threading)
  • Spring Boot (optional Java backend on the laptop; the laptop code shown below uses Flask)
  • Ollama + LLaVA / LLaMA3 / Gemma
  • Whisper (voice to text)
  • Google Translate / gTTS (Hindi TTS)
  • HTML/CSS/JS for control panel

Raspberry Pi Setup Code

  • gpiozero, picamera2, pygame, speech_recognition, requests, gtts, pydub, flask
  • arecord for audio recording
  • ffmpeg for audio conversion (MP3 to WAV if needed)
import os
import random
import requests
import time
from gpiozero import Motor, DistanceSensor, AngularServo
from picamera2 import Picamera2
from gtts import gTTS
import pygame
import speech_recognition as sr

# --- GPIO Setup ---
motor_left = Motor(forward=5, backward=6)
motor_right = Motor(forward=13, backward=19)
ultrasonic = DistanceSensor(echo=27, trigger=17)
servo = AngularServo(18, min_angle=0, max_angle=180)

# --- Camera ---
picam = Picamera2()
picam.configure(picam.create_still_configuration())

# --- AI Backend ---
VOICE_API = "http://192.168.215.29:8080/api/ai/voice"
IMAGE_API = "http://192.168.215.29:8080/api/ai/image"

# --- Global State ---
automatic_mode = False

# --- Movement Functions ---
def move_forward(t=1): 
    motor_left.forward()
    motor_right.forward()
    time.sleep(t)
    stop()

def move_backward(t=1): 
    motor_left.backward()
    motor_right.backward()
    time.sleep(t)
    stop()

def turn_left(): 
    motor_left.backward()
    motor_right.forward()
    time.sleep(0.5)
    stop()

def turn_right(): 
    motor_left.forward()
    motor_right.backward()
    time.sleep(0.5)
    stop()

def stop(): 
    motor_left.stop()
    motor_right.stop()

# --- Image Functions ---
def capture_image():
    filename = "/tmp/obstacle.jpg"
    picam.start()
    time.sleep(1)
    picam.capture_file(filename)
    picam.stop()
    return filename

def send_image_to_laptop(filepath):
    with open(filepath, "rb") as f:
        files = {"image": f}
        res = requests.post(IMAGE_API, files=files)
        return res.json().get("caption", "No description found")

# --- Speech Functions ---
def speak(text, lang="en"):
    tts = gTTS(text, lang=lang)
    filename = "/tmp/speak.mp3"
    tts.save(filename)
    pygame.mixer.init()
    pygame.mixer.music.load(filename)
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        time.sleep(0.1)  # wait for playback to finish without busy-waiting

# --- Sensor Functions ---
def scan_surroundings():
    angles = [0, 45, 90, 135, 180]
    for angle in angles:
        servo.angle = angle
        time.sleep(0.3)
        dist = ultrasonic.distance * 100
        print(f"Angle {angle}°: {dist:.2f} cm")
    servo.angle = 90

# --- Autonomous Mode Loop ---
def auto_mode_loop():
    global automatic_mode
    servo.angle = 90
    while automatic_mode:
        dist = ultrasonic.distance * 100
        print(f"Distance: {dist:.2f} cm")
        if dist < 25:
            stop()
            image = capture_image()
            caption = send_image_to_laptop(image)
            speak(f"Stop! Obstacle ahead: {caption}")
            turn_left()
        else:
            move_forward(0.5)
        time.sleep(0.1)

# --- Voice Recording & AI Interaction ---
def record_and_send_voice():
    os.system("arecord -D plughw:1,0 -f cd -t wav -d 4 -r 16000 /tmp/voice.wav")
    rec = sr.Recognizer()
    with sr.AudioFile("/tmp/voice.wav") as source:
        audio = rec.record(source)
    try:
        text = rec.recognize_google(audio, language="en-US")
        print(f"Recognized: {text}")
        res = requests.post(VOICE_API, json={"text": text})
        reply = res.json().get("reply", "")
        print(f"AI says: {reply}")
        speak(reply, lang="en")
        handle_ai_command(reply)
    except Exception as e:
        print("Error:", e)
        speak("Sorry, I didn't understand.")

# --- Command Handling ---
def handle_ai_command(command):
    global automatic_mode
    if "forward" in command:
        move_forward()
    elif "backward" in command:
        move_backward()
    elif "left" in command:
        turn_left()
    elif "right" in command:
        turn_right()
    elif "stop" in command:
        stop()
    elif "scan" in command:
        scan_surroundings()
    elif "auto" in command:
        automatic_mode = True
        auto_mode_loop()
    elif "manual" in command:
        automatic_mode = False
        stop()

# --- Extras ---
def robo_dance():
    for _ in range(2):
        turn_left()
        turn_right()
    speak("I'm dancing!", lang="en")

def play_rock_paper_scissors():
    # Announce a random choice; recognizing the user's reply is left as an enhancement
    choice = random.choice(["rock", "paper", "scissors"])
    speak(f"Rock, paper, or scissors! I choose: {choice}", lang="en")

Laptop Setup Code

Project Structure

ai_backend/
├── main.py
├── voice_processor.py
├── image_processor.py
├── requirements.txt
└── static/

Flask App

requirements.txt

flask
requests
gtts
pydub

main.py

from flask import Flask, request, jsonify
from voice_processor import handle_voice_command
from image_processor import handle_image_caption

app = Flask(__name__)

@app.route("/api/ai/voice", methods=["POST"])
def voice():
    text = request.json.get("text")
    result = handle_voice_command(text)
    return jsonify({"reply": result})

@app.route("/api/ai/image", methods=["POST"])
def image():
    file = request.files.get("image")
    if not file:
        return jsonify({"error": "No image uploaded"}), 400
    caption = handle_image_caption(file)
    return jsonify({"caption": caption})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
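
Running python main.py starts the backend on port 8080. A quick way to exercise the voice route (a sketch; replace localhost with the laptop's IP when calling from the Pi):

import requests

# Send a sample command to the backend and print the model's reply
res = requests.post("http://localhost:8080/api/ai/voice", json={"text": "move forward"})
print(res.json().get("reply"))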

voice_processor.py

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def handle_voice_command(text):
    prompt = f"You are a robot controller. Respond to this command: '{text}'"
    payload = {
        "model": "borch/llama3_speed_chat",
        "prompt": prompt,
        "stream": False
    }
    res = requests.post(OLLAMA_URL, json=payload)
    return res.json().get("response", "").strip()

image_processor.py

import requests
import base64
from gtts import gTTS
import os

OLLAMA_URL = "http://localhost:11434/api/generate"

def handle_image_caption(file):
    image_data = base64.b64encode(file.read()).decode("utf-8")
    prompt = "Describe this image."

    payload = {
        "model": "llava",
        "prompt": prompt,
        "images": [image_data],
        "stream": False
    }

    res = requests.post(OLLAMA_URL, json=payload)
    hindi_caption = res.json().get("response", "").strip()

    # Speak the caption on the laptop as well (the Pi also speaks the reply it receives)
    tts = gTTS(hindi_caption, lang="hi")
    filename = "/tmp/output.mp3"
    tts.save(filename)
    os.system(f"mpg123 {filename} &")  # Or another player

    return hindi_caption

Future Enhancements

  • Add GPS for location tracking
  • Add camera streaming to web panel
  • Add lidar or IR sensors for better navigation
  • Add NLP model on Pi for local processing
  • Offline Hindi TTS for better reliability
  • Bluetooth/Joystick support
