Overview
We aim to improve the way we interact with lamps by creating an interactive reading and tracking lamp. The idea is inspired by Pixar's animated lamp and integrates image-to-text conversion, text-to-speech, object detection, mood detection, and stepper motor control using an Arduino.
Our current setup is an integrated system combining computer vision, Arduino programming, and mechanical design. The heart of the system is a web camera that captures images for both object tracking and document reading. The computer vision pipeline, powered by YOLOv4-tiny, processes these images to detect objects and text. Upon detection, the system calculates the object's position and orientation and translates this data into commands for the stepper motor, which adjusts the lamp's position to follow the object or focus on a document for reading.
The text extracted from documents is converted to speech, providing an audible reading to the user. This feature is particularly useful for multitasking, allowing users to listen to documents while engaged in other activities. The system is controlled via a microcontroller, which manages the inputs from the camera and outputs to the motor and speaker.
A cooler upgrade we added is the ability to detect and recognize a user's mood. This feature allows the lamp to interact in a personalized manner, such as changing the LED color based on the detected mood or signaling a recognized mood with a unique sound pattern.
Related work
- The design of face recognition and tracking for human-robot interaction
- Detection and Recognition of Face Using Deep Learning
- Real-time emotion detection using Python
- https://makezine.com/projects/laser-cut-pixar-luxo-lamp/
Our project was inspired by studies related to affective robots, face recognition, and motion tracking techniques. The design concept is founded on Pixar's animated lamp character, as we explore the interaction between humans and robots in various contexts and across different types of human emotion.
Milestone 1 & 2 Summary
Our initial idea featured a movable robot that read text from documents as it moved from left to right. However, after several conversations we decided to improve on this, since reading characters line by line was slower and needed more processing resources than capturing the full document and processing everything at once. We also wanted to include an IR remote control to give users more playback control, although we did not carry this into the next stage of our design.

By the second milestone of our project, we had successfully integrated image processing capabilities, building on our initial idea of employing a remote control. This advancement was pivotal, allowing for the immediate conversion of visual data into a readable format. Complementing this, we incorporated a text-to-speech (TTS) system, which audibly articulated the extracted text, adding a significant layer of interactivity to our project. Furthermore, the implementation of a single stepper motor, initially constrained to horizontal movements, was a critical step towards achieving responsive tracking, essential for aligning the camera's focus as directed by the user through the remote control.

We used a power supply, as shown in the image below, ensuring precise control over the electronic components' voltage and current. Although bulkier than desired, this choice was essential for maintaining system stability and reliability. The incorporation of the stepper motor was a foray into mechanical control, vital for the project's tracking aspect. At this stage, our focus was on mastering horizontal movement control, setting a foundation for more complex directional control in the future.
Milestone 3
Hardware
Initially, our lamp's form consisted of a cylindrical 3D structure, shown below, that connected the base of the lamp to the stepper motor. This connection, however, only allowed movement in the horizontal direction (left to right and right to left), which we wanted to improve upon in subsequent milestones.
After multiple iterations and different designs, including the one shown below where a screw held the base of the lamp to a tilt platform, we realized we needed to account for the weight of the lamp's head and the attached camera to achieve accurate movements.
Finally, we decided on a setup that encompasses a laser-cut enclosure in which the lamp's structure sits. The lamp structure rests on a 3D-printed platform that rotates with the help of a stepper motor, allowing the lamp to move horizontally from left to right. On top of this platform is a 3D-printed hinge that houses another stepper motor and controls the tilt movements. The lamp and the connected web camera are set up on this platform, held up by the hinge to allow tilt motion. These components are connected to the Arduino and motor drivers, which are placed inside the laser-cut box to hide floating wires and other loose components while leaving an opening for the RGB LED.
Design Workflow
The diagram above captures how data flows through our project. The input is the web camera, which collects a stream of images. A Python script processes this stream to determine either what a document contains or how far a detected person is from the camera's center, in both the vertical and horizontal directions. The script sends this computed information to the Arduino via serial communication. For the object tracker and mood detector, the Arduino parses the transmission to extract the mood, the angles (vertical and horizontal deviation), and the directions. The angles are used to calculate how many steps each stepper motor should move, and, given some predefined colors, the red, green, and blue values of the RGB LED are set to indicate the detected mood. A sketch of this hand-off is shown below.
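As a concrete illustration of that hand-off, the sketch below packs the mood code, deviation angles, and directions into a single serial message. The message layout, port name, and baud rate are assumptions made for illustration, not the project's verbatim protocol.

```python
# Illustrative Python -> Arduino hand-off. The message layout
# "<mood>,<h_dir>,<h_angle>,<v_dir>,<v_angle>\n", the port name, and the
# baud rate are assumptions; the real protocol may differ.
import serial

arduino = serial.Serial("/dev/ttyACM0", 9600, timeout=1)  # assumed port and baud rate

def send_update(mood_code, h_dir, h_angle, v_dir, v_angle):
    """Send one combined mood/tracking update as a single newline-terminated line."""
    message = f"{mood_code},{h_dir},{h_angle:.1f},{v_dir},{v_angle:.1f}\n"
    arduino.write(message.encode("utf-8"))

# Example: a happy person detected 12.5 degrees left of and 4.0 degrees above center.
send_update("h", "L", 12.5, "U", 4.0)
```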
Software
The processing part of our code, which cleans the image stream, measures the deviation of a person from the web camera's midpoint, and reads the text in a document, is written in Python. We use OCR with OpenCV to extract the characters from the image taken with the web camera. After extracting these characters, we use the Google Text-to-Speech library (gTTS) to create an audio version of the extracted text, and the Pygame library to read this speech out immediately, instead of opening an external audio player on the computer as in our earlier milestone. Together, these libraries give us the flexibility to implement the document-reading portion of our task.

The mood detection and tracking functionalities are achieved through multiple libraries and models, namely DeepFace, YOLOv4, and OpenCV. The DeepFace library provides functions that analyze facial expressions captured by the webcam to identify the dominant emotion; it uses deep learning models to provide accurate emotion analysis that can be used in interactive applications like ours. We detect expressions such as happiness, sadness, anger, and neutrality. This mood information is then sent to the Arduino as a single-character code ('h' for happy, 's' for sad, and so on) through serial communication.
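The document-reading path described above can be sketched as follows. pytesseract is assumed here as the OCR engine operating on the OpenCV frame (the write-up does not name the exact OCR call), and the output file name is illustrative.

```python
# Document-reading sketch: OpenCV frame -> OCR -> gTTS -> Pygame playback.
# pytesseract is an assumed OCR engine; "document.mp3" is an illustrative file name.
import cv2
import pytesseract
import pygame
from gtts import gTTS

def read_document_aloud(camera_index=0):
    cap = cv2.VideoCapture(camera_index)
    ok, frame = cap.read()                                # capture one image of the document
    cap.release()
    if not ok:
        return

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)        # simple pre-processing for OCR
    text = pytesseract.image_to_string(gray).strip()      # extract the characters
    if not text:
        return

    gTTS(text=text, lang="en").save("document.mp3")       # synthesize speech from the text

    pygame.mixer.init()                                   # play immediately, no external player
    pygame.mixer.music.load("document.mp3")
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        pygame.time.wait(100)
```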
For object detection and tracking, we use YOLOv4 together with OpenCV. YOLOv4 provides a fast, real-time object detection system that identifies and locates people in the video stream. Once a person is detected, we draw a bounding box around them, and OpenCV helps track the movement and calculate the person's deviation from the center of the webcam's field of view. The script then calculates the horizontal and vertical deviations of the person from a reference point, typically the center of the camera's field of view, and converts these deviations into angles. For the horizontal angle, the script determines whether the person is to the left or right of center and calculates the angle of deviation accordingly; similarly, for the vertical angle, it assesses whether the person is above or below the central horizontal axis.
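The detection and deviation-to-angle step can be sketched with OpenCV's DNN module as below. The weight and config file names, the thresholds, and the camera field-of-view values are assumptions; the project's actual values may differ.

```python
# Person detection with YOLOv4(-tiny) via OpenCV's DNN module, followed by the
# pixel-to-angle conversion. File names, thresholds, and field-of-view values
# are assumptions for illustration.
import cv2
import numpy as np

net = cv2.dnn.readNet("yolov4-tiny.weights", "yolov4-tiny.cfg")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

H_FOV_DEG, V_FOV_DEG = 60.0, 40.0   # assumed horizontal/vertical field of view of the webcam

def person_deviation(frame):
    """Return signed (h_angle, v_angle) of the first detected person, or None."""
    classes, scores, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)
    for class_id, box in zip(np.array(classes).flatten(), boxes):
        if int(class_id) != 0:                        # COCO class 0 is "person"
            continue
        x, y, w, h = box
        cx, cy = x + w / 2, y + h / 2                 # center of the bounding box
        frame_h, frame_w = frame.shape[:2]
        dx = cx - frame_w / 2                         # pixel deviation from the frame center
        dy = cy - frame_h / 2
        h_angle = dx / frame_w * H_FOV_DEG            # positive: person is to the right
        v_angle = -dy / frame_h * V_FOV_DEG           # positive: person is above center
        return h_angle, v_angle
    return None
```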
To bring physical interaction into our project, we integrate the Arduino using serial communication. Based on the mood and position data processed in Python, we can control hardware components such as the stepper motors and the RGB LED. This integration bridges the gap between the digital and physical realms, allowing for a wide range of creative and practical applications.
Once the Arduino receives this mood data, it triggers a corresponding response through the RGB LED. Each mood is associated with a specific color, allowing the system to visually communicate the detected emotional state. For instance, happiness is represented by a bright yellow, while sadness is represented by red. This visual representation of mood not only adds an element of interactivity but also helps make the technology more intuitive and user-friendly. The mood detection system is not limited to visual feedback but extends to auditory signals as well: the Python script, using libraries like Pygame, plays specific audio files or sounds corresponding to the detected mood. For example, a cheerful sound or melody might be played when happiness is detected, or a softer, more subdued tone for sadness.
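Putting the two feedback channels together, the sketch below runs DeepFace on a frame, sends a one-character mood code to the Arduino (which maps it to an LED color), and plays a matching sound from Python. The full code table beyond the 'h'/'s' examples above and the sound file names are assumptions.

```python
# Mood feedback sketch: DeepFace emotion -> single-character code over serial
# (the Arduino maps it to an RGB color) plus a matching sound played from Python.
# The full code table and the sound file names are illustrative assumptions.
import pygame
from deepface import DeepFace

MOOD_CODES = {"happy": "h", "sad": "s", "angry": "a", "neutral": "n"}
MOOD_SOUNDS = {"h": "cheerful.wav", "s": "subdued.wav"}   # assumed audio files

def report_mood(frame, arduino):
    """Analyze one frame, notify the Arduino, and play an audio cue."""
    result = DeepFace.analyze(frame, actions=["emotion"], enforce_detection=False)
    if isinstance(result, list):          # newer DeepFace versions return a list of results
        result = result[0]
    code = MOOD_CODES.get(result["dominant_emotion"])
    if code is None:
        return
    arduino.write(code.encode())          # Arduino picks the LED color for this mood
    sound = MOOD_SOUNDS.get(code)
    if sound:
        pygame.mixer.init()
        pygame.mixer.Sound(sound).play()
```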
This audio feedback, in conjunction with the Arduino-controlled visual cues, creates a rich, multi-sensory user experience. The combination of sound and light based on emotional analysis allows for a more engaging and empathetic interaction with the user.
Once these angles are calculated, they are sent to the Arduino via serial communication in a structured format, along with the direction of movement required (left/right for horizontal, up/down for vertical). The Arduino, upon receiving this data, translates the angles into a specific number of steps for the stepper motors to execute. The motors are connected to mechanisms that adjust the orientation of the lamp head and camera, aligning them with the subject's position. The Arduino controls each motor's direction of rotation and the number of steps to take, enabling precise movements. The horizontal angle information controls one motor, while the vertical angle information guides the other. This dual-axis control yields a two-dimensional tracking system capable of following the subject's movements across the camera's plane.
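The per-axis angle-to-steps arithmetic the Arduino applies can be sketched in Python as below; the 1.8 degrees-per-step motor resolution and full-step (no microstepping) drive are assumptions about the hardware.

```python
# Sketch of the per-axis angle-to-steps conversion performed on the Arduino side.
# The 1.8 deg/step resolution and full-step (no microstepping) drive are assumptions.
DEGREES_PER_STEP = 1.8
MICROSTEPS = 1

def angle_to_steps(angle_deg):
    """Convert a deviation angle into a whole number of stepper steps."""
    return round(abs(angle_deg) * MICROSTEPS / DEGREES_PER_STEP)

def step_commands(h_angle, v_angle):
    """Return (steps, direction) for the pan motor and the tilt motor."""
    pan = (angle_to_steps(h_angle), "right" if h_angle > 0 else "left")
    tilt = (angle_to_steps(v_angle), "up" if v_angle > 0 else "down")
    return pan, tilt

# Example: 12.5 degrees right and 4.0 degrees up -> roughly 7 and 2 steps respectively.
print(step_commands(12.5, 4.0))
```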
Comments