Voice-Controlled Autonomous Raspberry Pi Robot

Project Type: Autonomous + Voice Controlled Raspberry Pi Robot
Use Case: Obstacle Avoidance, Voice Interaction, Image Understanding, Mobile/Web/Bluetooth Control
Tagline: “Building the Bots of Tomorrow.”


Key Features

Voice Control

  • Robot listens for commands like:
    • forward, backward, left, right, stop
    • scan → scans surroundings with servo-mounted ultrasonic sensor
    • photo → captures an image and describes it in Hindi or any other local language
    • auto / manual → switch modes
    • game → plays Rock-Paper-Scissors with the user

The laptop runs the AI backend, using Whisper for voice-to-text and LLaVA or LLaMA3 for understanding commands and describing images.
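
A minimal sketch of what the Whisper half of that backend could look like, assuming the openai-whisper package and a /api/ai/transcribe route (both are illustrative additions, not part of the original backend; the Pi code later in this post uses Google's recognizer on-device instead):

# Hypothetical Whisper endpoint on the laptop (route, model size and port are assumptions)
from flask import Flask, request, jsonify
import tempfile
import whisper

app = Flask(__name__)
model = whisper.load_model("base")  # "base" is an assumption; larger models are slower but more accurate

@app.route("/api/ai/transcribe", methods=["POST"])
def transcribe():
    audio = request.files.get("audio")
    if not audio:
        return jsonify({"error": "No audio uploaded"}), 400
    # Save the upload to a temp file, then let Whisper transcribe it
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        audio.save(tmp.name)
        result = model.transcribe(tmp.name)
    return jsonify({"text": result["text"].strip()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8081)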


AI Image Analysis (LLaVA or similar model)

  • Robot captures an image → sends it to the Flask API on the laptop
  • The API uses Ollama with a vision model (like LLaVA or Gemma) to describe the scene
  • Robot speaks the result in Hindi or any other local language
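
Before wiring up the robot, the image endpoint can be sanity-checked from any machine on the network. A quick sketch (the laptop IP and port are taken from the robot code below and will differ on your setup):

import requests

# Post a local test photo to the laptop's image-caption API and print the reply
with open("test.jpg", "rb") as f:
    res = requests.post("http://192.168.215.29:8080/api/ai/image", files={"image": f})
print(res.json().get("caption", "No description found"))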

Autonomous Obstacle Avoidance

  • Ultrasonic Sensor on a Servo motor scans left/center/right
  • Robot chooses the clearest path and moves (see the sketch after this list)
  • Captures image of obstacle and describes it aloud
  • Can be toggled with voice or a physical button
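
The auto-mode loop shown later simply turns left when it meets an obstacle; a sketch of the "choose the clearest path" behaviour could look like this (it reuses servo, ultrasonic and the movement functions from the robot code below; the angles and 0.3 s settle time are assumptions):

def choose_clearest_path():
    # Sweep left / centre / right and head toward the direction with the most free space
    readings = {}
    for label, angle in (("left", 150), ("center", 90), ("right", 30)):
        servo.angle = angle
        time.sleep(0.3)                               # let the servo settle before reading
        readings[label] = ultrasonic.distance * 100   # distance in cm
    servo.angle = 90                                  # re-centre the sensor
    best = max(readings, key=readings.get)
    if best == "left":
        turn_left()
    elif best == "right":
        turn_right()
    move_forward(0.5)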

Web Control Panel

  • Flask web server on the Pi with a simple control UI (a minimal sketch follows this list)
  • Buttons: Forward, Backward, Left, Right, Stop, Scan, Dance, Auto, Manual, Photo
  • Mobile-friendly interface for easy control
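
A minimal sketch of that control server (the route name, port and button mapping are assumptions; the movement functions are the ones defined in the robot code below, and the HTML page itself is omitted):

from flask import Flask

app = Flask(__name__)

# Map button names from the web UI to the robot's functions
COMMANDS = {
    "forward": move_forward,
    "backward": move_backward,
    "left": turn_left,
    "right": turn_right,
    "stop": stop,
    "scan": scan_surroundings,
    "dance": robo_dance,
}

@app.route("/control/<cmd>")
def control(cmd):
    action = COMMANDS.get(cmd)
    if action is None:
        return f"Unknown command: {cmd}", 404
    action()
    return f"OK: {cmd}"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)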

Mobile / Bluetooth Control (Coming Soon or Optional)

  • Option to add Bluetooth controller (HC-05 or phone)
  • Can control basic movements via a joystick or Bluetooth terminal app (see the sketch below)
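
A possible starting point once an HC-05 (or a phone) is paired over RFCOMM, using pyserial (a sketch only; /dev/rfcomm0, the baud rate and the single-letter protocol are assumptions):

import serial  # pyserial

# Read single-character commands sent from a Bluetooth terminal or joystick app
bt = serial.Serial("/dev/rfcomm0", 9600, timeout=1)
while True:
    cmd = bt.read(1).decode(errors="ignore").upper()
    if cmd == "F":
        move_forward()
    elif cmd == "B":
        move_backward()
    elif cmd == "L":
        turn_left()
    elif cmd == "R":
        turn_right()
    elif cmd == "S":
        stop()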

Fun Features

  • robo_dance() – a predefined dance sequence
  • rock_paper_scissors() – simple game using voice prompts

Hardware Components

  • Raspberry Pi 5: Main brain (runs Flask, motor control, etc.)
  • Ultrasonic Sensor (HC-SR04): For obstacle detection
  • Servo Motor: Rotates the ultrasonic sensor for scanning
  • 4x DC Gear Motors: For movement (connected via L298N)
  • L298N Motor Driver: Controls the motors
  • Pi Camera 2: Takes photos for AI processing
  • Speaker: For voice output (any local language TTS)
  • Button: To toggle auto/manual mode
  • Power Supply: Battery or power bank
  • Wi-Fi: For communication with the laptop
  • Laptop: Runs Ollama models and Flask APIs

Software Architecture

            ┌────────────────────┐
            │     Voice Input    │
            │  (Hey Robo, etc.)  │
            └─────────┬──────────┘
                      ↓
      ┌─────────────────────────────┐
      │  Laptop with Whisper + AI   │
      │   Flask API (Text/Image)    │
      └──────┬──────────────┬───────┘
             ↓              ↓
       Voice command    Image caption
      (e.g., "forward") (e.g., "यह एक दरवाजा है।" / "This is a door.")
             ↓              ↓
      ┌─────────────────────────────┐
      │    Raspberry Pi (Robot)     │
      │  Flask server + GPIO logic  │
      └─────────────────────────────┘

Technologies Used

  • Python (Flask, GPIO Zero, threading)
  • Spring Boot (optional Java backend on the laptop; the laptop code shown below uses Flask)
  • Ollama + LLaVA / LLaMA3 / Gemma
  • Whisper (voice to text)
  • Google Translate / gTTS (Hindi TTS)
  • HTML/CSS/JS for control panel

Raspberry Pi Setup Code

  • gpiozero, picamera2, pygame, speech_recognition, requests, gtts, pydub, flask
  • arecord for audio recording
  • ffmpeg for audio conversion (MP3 to WAV if needed)
import os
import random
import requests
import time
from gpiozero import Motor, DistanceSensor, AngularServo
from picamera2 import Picamera2
from gtts import gTTS
import pygame
import speech_recognition as sr

# --- GPIO Setup ---
motor_left = Motor(forward=5, backward=6)
motor_right = Motor(forward=13, backward=19)
ultrasonic = DistanceSensor(echo=27, trigger=17)
servo = AngularServo(18, min_angle=0, max_angle=180)

# --- Camera ---
picam = Picamera2()
picam.configure(picam.create_still_configuration())

# --- AI Backend ---
VOICE_API = "http://192.168.215.29:8080/api/ai/voice"
IMAGE_API = "http://192.168.215.29:8080/api/ai/image"

# --- Global State ---
automatic_mode = False

# --- Movement Functions ---
def move_forward(t=1): 
    motor_left.forward()
    motor_right.forward()
    time.sleep(t)
    stop()

def move_backward(t=1): 
    motor_left.backward()
    motor_right.backward()
    time.sleep(t)
    stop()

def turn_left(): 
    motor_left.backward()
    motor_right.forward()
    time.sleep(0.5)
    stop()

def turn_right(): 
    motor_left.forward()
    motor_right.backward()
    time.sleep(0.5)
    stop()

def stop(): 
    motor_left.stop()
    motor_right.stop()

# --- Image Functions ---
def capture_image():
    filename = "/tmp/obstacle.jpg"
    picam.start()
    time.sleep(1)
    picam.capture_file(filename)
    picam.stop()
    return filename

def send_image_to_laptop(filepath):
    with open(filepath, "rb") as f:
        files = {"image": f}
        res = requests.post(IMAGE_API, files=files)
        return res.json().get("caption", "No description found")

# --- Speech Functions ---
def speak(text, lang="en"):
    tts = gTTS(text, lang=lang)
    filename = "/tmp/speak.mp3"
    tts.save(filename)
    pygame.mixer.init()
    pygame.mixer.music.load(filename)
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        time.sleep(0.1)  # wait for playback to finish without busy-waiting

# --- Sensor Functions ---
def scan_surroundings():
    angles = [0, 45, 90, 135, 180]
    for angle in angles:
        servo.angle = angle
        time.sleep(0.3)
        dist = ultrasonic.distance * 100
        print(f"Angle {angle}°: {dist:.2f} cm")
    servo.angle = 90

# --- Autonomous Mode Loop ---
def auto_mode_loop():
    global automatic_mode
    servo.angle = 90
    while automatic_mode:
        dist = ultrasonic.distance * 100
        print(f"Distance: {dist:.2f} cm")
        if dist < 25:
            stop()
            image = capture_image()
            caption = send_image_to_laptop(image)
            speak(f"Stop! Obstacle ahead: {caption}")
            turn_left()
        else:
            move_forward(0.5)
        time.sleep(0.1)

# --- Voice Recording & AI Interaction ---
def record_and_send_voice():
    os.system("arecord -D plughw:1,0 -f cd -t wav -d 4 -r 16000 /tmp/voice.wav")
    rec = sr.Recognizer()
    with sr.AudioFile("/tmp/voice.wav") as source:
        audio = rec.record(source)
    try:
        text = rec.recognize_google(audio, language="en-US")
        print(f"Recognized: {text}")
        res = requests.post(VOICE_API, json={"text": text})
        reply = res.json().get("reply", "")
        print(f"AI says: {reply}")
        speak(reply, lang="en")
        handle_ai_command(reply)
    except Exception as e:
        print("Error:", e)
        speak("Sorry, I didn't understand.")

# --- Command Handling ---
def handle_ai_command(command):
    global automatic_mode
    if "forward" in command:
        move_forward()
    elif "backward" in command:
        move_backward()
    elif "left" in command:
        turn_left()
    elif "right" in command:
        turn_right()
    elif "stop" in command:
        stop()
    elif "scan" in command:
        scan_surroundings()
    elif "auto" in command:
        automatic_mode = True
        auto_mode_loop()
    elif "manual" in command:
        automatic_mode = False
        stop()

# --- Extras ---
def robo_dance():
    for _ in range(2):
        turn_left()
        turn_right()
    speak("I'm dancing!", lang="en")

def play_rock_paper_scissors():
    # Announce a random choice; recognizing the user's reply is left as an enhancement
    choice = random.choice(["rock", "paper", "scissors"])
    speak(f"Rock, paper, or scissors! I choose: {choice}", lang="en")

Laptop Setup Code

Project Structure

ai_backend/
├── main.py
├── voice_processor.py
├── image_processor.py
├── requirements.txt
└── static/

Flask App

requirements.txt

flask
requests
gtts
pydub

main.py

from flask import Flask, request, jsonify
from voice_processor import handle_voice_command
from image_processor import handle_image_caption

app = Flask(__name__)

@app.route("/api/ai/voice", methods=["POST"])
def voice():
    text = request.json.get("text")
    result = handle_voice_command(text)
    return jsonify({"reply": result})

@app.route("/api/ai/image", methods=["POST"])
def image():
    file = request.files.get("image")
    if not file:
        return jsonify({"error": "No image uploaded"}), 400
    caption = handle_image_caption(file)
    return jsonify({"caption": caption})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
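
Running python main.py starts the backend on port 8080. A quick way to exercise the voice route (a sketch; replace localhost with the laptop's IP when calling from the Pi):

import requests

# Send a sample command to the backend and print the model's reply
res = requests.post("http://localhost:8080/api/ai/voice", json={"text": "move forward"})
print(res.json().get("reply"))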

voice_processor.py

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def handle_voice_command(text):
    prompt = f"You are a robot controller. Respond to this command: '{text}'"
    payload = {
        "model": "borch/llama3_speed_chat",
        "prompt": prompt,
        "stream": False
    }
    res = requests.post(OLLAMA_URL, json=payload)
    return res.json().get("response", "").strip()

image_processor.py

import requests
import base64
from gtts import gTTS
import os

OLLAMA_URL = "http://localhost:11434/api/generate"

def handle_image_caption(file):
    image_data = base64.b64encode(file.read()).decode("utf-8")
    prompt = "Describe this image."

    payload = {
        "model": "llava",
        "prompt": prompt,
        "images": [image_data],
        "stream": False
    }

    res = requests.post(OLLAMA_URL, json=payload)
    hindi_caption = res.json().get("response", "").strip()

    # Speak the caption on the laptop as well (the Pi also speaks the reply it receives)
    tts = gTTS(hindi_caption, lang="hi")
    filename = "/tmp/output.mp3"
    tts.save(filename)
    os.system(f"mpg123 {filename} &")  # Or another player

    return hindi_caption

Future Enhancements

  • Add GPS for location tracking
  • Add camera streaming to web panel
  • Add lidar or IR sensors for better navigation
  • Add NLP model on Pi for local processing
  • Offline Hindi TTS for better reliability
  • Bluetooth/Joystick support
