Real-Time Person Tracking and Duration Monitoring in Restaurants
- Pushkar Nandgaonkar
Introduction
In the hospitality industry, especially in busy restaurants and cafes, understanding how long customers stay inside the premises can offer valuable insights into customer behavior, service efficiency, and space utilization. Manually tracking each customer’s presence is not only impractical but also prone to errors and inconsistencies.
This is where AI-powered real-time person tracking and duration monitoring systems come into play. By using computer vision techniques, we can automatically detect, track, and calculate the time each individual spends inside a restaurant, all in real time.
In this blog, we will explore how to implement a person tracking system using YOLOv8 and OpenCV, and track how long each person has been present inside the restaurant with the help of Python.

Problem Statement
Understanding how long each customer spends in a restaurant is critical for optimizing operations, improving customer service, and enhancing overall efficiency. However, most restaurants rely on manual observation or estimates, which often result in inaccurate data and missed opportunities for insight.
Key challenges restaurants face:
Lack of accurate customer presence data: Most restaurants don’t track how long customers stay, making it difficult to analyze peak hours, table turnover, and service time.
Difficulty monitoring multiple guests simultaneously: In a dynamic environment, it’s nearly impossible for staff to manually track all customers entering, leaving, or staying for extended periods.
Limited staff attention during busy hours: During rush hours, staff are focused on service, leaving little room for manual tracking or note-taking regarding customer duration.
No automated documentation for analytics or reporting: Without automated tools, restaurants struggle to gather historical data on guest duration for making informed business decisions.
To solve these challenges, a real-time person tracking system powered by AI can automatically detect individuals and monitor how long they stay in the restaurant—providing both real-time visibility and useful analytics.
How AI-Powered Person Tracking Systems Work
Modern AI-based tracking systems leverage computer vision and deep learning to automatically detect and monitor people within a given space—such as a restaurant. These systems use powerful object detection models like YOLO (You Only Look Once) to not only identify individuals but also track their movement and calculate how long they stay in the premises.
Using AI, these systems can provide:
Person Detection: Identifying individuals in video frames using real-time object detection models like YOLOv8.
Unique ID Assignment: Assigning each detected person a unique ID to continuously track their movement across frames.
Duration Monitoring: Calculating and updating how long each individual has been present in the restaurant.
Real-Time Visualization: Displaying bounding boxes and timestamps over each person, showing their ID and duration live on the video feed.
Data for Analysis: Collecting customer dwell time information that can be used to optimize seating, staffing, and service flow.
Implementation Benefits for Restaurants
Improved Customer Experience
By tracking how long each person stays in the restaurant, managers can identify peak hours, average dwell times, and seating efficiency—ultimately helping optimize wait times and improve customer satisfaction.
Operational Efficiency
The system automates the monitoring process, allowing staff to focus on service rather than manually observing tables. Real-time data on occupancy can help manage seating and staffing more effectively.
Data-Driven Decision Making
Dwell time data provides actionable insights into customer behavior patterns, helping with decisions like layout optimization, staffing schedules, or promotional targeting based on peak hours.
Compliance & Security Monitoring
Tracking individuals helps ensure no unauthorized persons linger in restricted zones. It also supports contact tracing or occupancy control during health-related scenarios.
Cost Optimization
Better understanding of customer flow and time spent enables restaurants to maximize table turnover, avoid bottlenecks, and reduce operational inefficiencies—contributing directly to profitability.
Employee Accountability & Training
By tracking staff movement and duration at specific locations (like kitchens or counters), management can identify workflow gaps, improve task assignments, and highlight training needs.
Practical Applications
Real-Time Person Tracking and Duration Monitoring is increasingly adopted in:
Quick Service Restaurants (QSRs) to analyze dine-in patterns and optimize layout for higher turnover
Fine Dining Establishments to maintain personalized service by tracking guest presence durations
Cafeterias and Food Courts for managing crowd density and flow during peak hours
Workplace Canteens to prevent overcrowding and manage staggered break schedules
Retail-Integrated Dining Spaces to study consumer behavior across zones
Model Used
For this project, we used the pre-trained YOLOv8s model (yolov8s.pt) from the Ultralytics library. YOLOv8s is a fast and efficient object detection model that accurately identifies people (class 0) in real time with minimal computational resources.
We did not train the model on a custom dataset. Instead, we utilized YOLOv8’s built-in ability to detect people to track individuals as they move through different areas of the restaurant—such as entrances, dining areas, and service counters.
This approach avoids the complexity of dataset labeling and model retraining, making it perfect for quick deployment and real-world testing without heavy infrastructure.
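As a quick sanity check before building the full pipeline, the pre-trained weights can be loaded and run on a single frame with detections restricted to the person class. The sketch below uses "restaurant_frame.jpg" as a placeholder path for any frame grabbed from your camera:

from ultralytics import YOLO

# Load the pre-trained YOLOv8s weights (downloaded automatically if not already present)
model = YOLO("yolov8s.pt")

# Detect only the 'person' class (ID 0) with a 0.4 confidence threshold
# "restaurant_frame.jpg" is a placeholder path for any saved camera frame
results = model("restaurant_frame.jpg", classes=[0], conf=0.4, verbose=False)

for box in results[0].boxes:
    x1, y1, x2, y2 = map(int, box.xyxy[0])
    print(f"Person at ({x1}, {y1})-({x2}, {y2}), confidence {float(box.conf[0]):.2f}")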
How It Works
The system continuously processes video frames from CCTV or IP cameras in real-time.
Person detection is performed using YOLOv8 to identify all individuals present in each frame.
Each detected person is assigned a unique ID using a centroid tracking algorithm.
The system monitors and updates the duration each person remains within a predefined zone (e.g., waiting area, dining space, counter); a simple zone-membership check is sketched at the end of this section.
Duration metrics are displayed live and stored for report generation and behavioral analysis.
This simple yet powerful system enables restaurant managers to:
Monitor staff and guest movement
Track time spent at service zones
Understand customer dwell patterns
Ensure employee presence at key stations
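Zone-level dwell times are not part of the implementation below (see the limitations section), but a zone check could be layered on top by defining a polygon per area of interest and testing each tracked centroid against it with OpenCV’s pointPolygonTest. The zone coordinates in this sketch are made-up placeholders:

import cv2
import numpy as np

# Hypothetical zone polygons in pixel coordinates (placeholder values)
ZONES = {
    "dining_area": np.array([[100, 200], [600, 200], [600, 700], [100, 700]], dtype=np.int32),
    "counter": np.array([[650, 150], [900, 150], [900, 400], [650, 400]], dtype=np.int32),
}

def zone_of(centroid):
    # Return the name of the zone containing the centroid, or None if outside all zones
    for name, polygon in ZONES.items():
        # pointPolygonTest returns >= 0 when the point lies inside or on the polygon edge
        if cv2.pointPolygonTest(polygon, (float(centroid[0]), float(centroid[1])), False) >= 0:
            return name
    return None

print(zone_of((350, 450)))  # -> "dining_area" for this placeholder layout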
Implementation
YOLOv8-Based Person Detection with Centroid Tracking
This implementation combines real-time person detection using YOLOv8 with a custom CentroidTracker for persistent identity assignment and duration tracking of individuals in a video stream. It is tailored for surveillance, monitoring, and analytics in dynamic environments such as restaurants or public places.
CentroidTracker Implementation
We designed a custom tracking algorithm that assigns IDs to detected people and maintains their identities across frames using centroid-based distance matching.
import cv2
import datetime
import numpy as np
from collections import OrderedDict
from scipy.spatial import distance as dist
from ultralytics import YOLO


class CentroidTracker:
    def __init__(self, maxDisappeared=50, maxDistance=50):
        self.nextObjectID = 0
        self.objects = OrderedDict()      # objectID -> centroid
        self.disappeared = OrderedDict()  # objectID -> consecutive frames without a match
        self.bbox = OrderedDict()         # objectID -> latest bounding box
        self.maxDisappeared = maxDisappeared
        self.maxDistance = maxDistance

    def register(self, centroid, inputRect):
        # Assign the next available ID to a newly detected person
        self.objects[self.nextObjectID] = centroid
        self.bbox[self.nextObjectID] = inputRect
        self.disappeared[self.nextObjectID] = 0
        self.nextObjectID += 1

    def deregister(self, objectID):
        # Drop an object that has been missing for too many frames
        del self.objects[objectID]
        del self.disappeared[objectID]
        del self.bbox[objectID]

    def update(self, rects):
        # No detections in this frame: age every tracked object and drop stale ones
        if len(rects) == 0:
            for objectID in list(self.disappeared.keys()):
                self.disappeared[objectID] += 1
                if self.disappeared[objectID] > self.maxDisappeared:
                    self.deregister(objectID)
            return self.bbox

        # Compute the centroid of every incoming bounding box
        inputCentroids = np.zeros((len(rects), 2), dtype="int")
        inputRects = []
        for (i, (startX, startY, endX, endY)) in enumerate(rects):
            cX = int((startX + endX) / 2.0)
            cY = int((startY + endY) / 2.0)
            inputCentroids[i] = (cX, cY)
            inputRects.append(rects[i])

        # Nothing tracked yet: register every detection as a new object
        if len(self.objects) == 0:
            for i in range(0, len(inputCentroids)):
                self.register(inputCentroids[i], inputRects[i])
        else:
            objectIDs = list(self.objects.keys())
            objectCentroids = list(self.objects.values())

            # Pairwise distances between existing centroids and new detections
            D = dist.cdist(np.array(objectCentroids), inputCentroids)
            rows = D.min(axis=1).argsort()
            cols = D.argmin(axis=1)[rows]

            usedRows = set()
            usedCols = set()
            # Greedily match each existing object to its nearest unclaimed detection
            for (row, col) in zip(rows, cols):
                if row in usedRows or col in usedCols:
                    continue
                if D[row, col] > self.maxDistance:
                    continue
                objectID = objectIDs[row]
                self.objects[objectID] = inputCentroids[col]
                self.bbox[objectID] = inputRects[col]
                self.disappeared[objectID] = 0
                usedRows.add(row)
                usedCols.add(col)

            unusedRows = set(range(0, D.shape[0])).difference(usedRows)
            unusedCols = set(range(0, D.shape[1])).difference(usedCols)

            if D.shape[0] >= D.shape[1]:
                # More tracked objects than detections: age the unmatched objects
                for row in unusedRows:
                    objectID = objectIDs[row]
                    self.disappeared[objectID] += 1
                    if self.disappeared[objectID] > self.maxDisappeared:
                        self.deregister(objectID)
            else:
                # More detections than tracked objects: register the new ones
                for col in unusedCols:
                    self.register(inputCentroids[col], inputRects[col])

        return self.bbox
Step-by-step Logic:
Initialization:
objects: Maps each ID to the centroid of the detected person.
disappeared: Tracks how many frames a person has not been detected.
bbox: Stores the bounding box for each object.
maxDisappeared: Maximum number of consecutive frames an object may go undetected before it is dropped.
maxDistance: Maximum centroid distance (in pixels) for associating a new detection with an existing ID.
Register:
Assign a new ID to a detected person and store their centroid and bounding box.
Deregister:
Remove an object if it's been missing for too long.
Update:
If no detections, increment disappeared count for existing objects.
Else, compute new centroids and match them with existing ones using Euclidean distance.
Match the closest objects using a greedy approach.
Handle unmatched objects by registering new ones or marking old ones as disappeared.
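To see the tracker in isolation, it can be fed hand-written bounding boxes for a couple of frames; the boxes below are made-up values rather than real detections:

tracker = CentroidTracker(maxDisappeared=40, maxDistance=50)

# Frame 1: two people detected (made-up boxes in (x1, y1, x2, y2) form) -> IDs 0 and 1 registered
objects = tracker.update([(100, 100, 180, 300), (400, 120, 480, 320)])
print(objects)

# Frame 2: both people move slightly; their centroids stay within maxDistance, so the IDs are kept
objects = tracker.update([(110, 105, 190, 305), (395, 118, 475, 318)])
print(objects)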
YOLOv8 Integration and Video Inference Pipeline
The inference system uses YOLOv8 to detect humans in each video frame, and then applies the CentroidTracker to persist IDs across frames.
def main():
    # Load YOLOv8 model
    model = YOLO('models/yolov8s.pt')  # Path to your YOLOv8 model

    # Video input/output
    cap = cv2.VideoCapture('input_data/restaurant.mp4')  # Change path as needed
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps_input = cap.get(cv2.CAP_PROP_FPS)
    out = cv2.VideoWriter('output_yolo_tracking.mp4',
                          cv2.VideoWriter_fourcc(*'mp4v'),  # 'avc1' or 'mp4v' are good options
                          fps_input, (width, height))

    tracker = CentroidTracker(maxDisappeared=40, maxDistance=50)
    object_id_list = []
    dtime = dict()           # last time each ID was seen
    appearing_time = dict()  # accumulated visible time per ID, in seconds

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # Run YOLOv8 detection
        results = model(frame, verbose=False)
        detections = results[0].boxes

        rects = []
        for box in detections:
            cls_id = int(box.cls[0])
            conf = float(box.conf[0])
            if cls_id == 0 and conf > 0.4:  # Only 'person' class with confidence > 0.4
                x1, y1, x2, y2 = map(int, box.xyxy[0])
                rects.append((x1, y1, x2, y2))

        objects = tracker.update(rects)

        for objectId, bbox in objects.items():
            x1, y1, x2, y2 = map(int, bbox)  # Bounding box coordinates as plain ints
            if objectId not in object_id_list:
                # First time this ID is seen: start its timer
                object_id_list.append(objectId)
                curr_time = datetime.datetime.now()
                dtime[objectId] = curr_time
                appearing_time[objectId] = 0
            else:
                # Accumulate wall-clock time elapsed since the previous frame
                curr_time = datetime.datetime.now()
                time_diff = curr_time - dtime[objectId]
                dtime[objectId] = curr_time
                appearing_time[objectId] += time_diff.total_seconds()

            formatted_time = str(datetime.timedelta(seconds=int(appearing_time[objectId])))
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 2)
            cv2.putText(frame, f"ID:{objectId} | {formatted_time}", (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

        # Write frame to output video and show it live
        out.write(frame)
        cv2.imshow("Person Tracking and Duration Monitoring", frame)
        if cv2.waitKey(1) == 27:  # Esc key to quit
            break

    cap.release()
    out.release()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    main()
Workflow Overview:
Model Loading:
Loads the pretrained YOLOv8 model from disk.
Example:
model = YOLO('models/yolov8s.pt')
Video Setup:
Reads input video from file (a live-camera variant is sketched after this list).
Initializes video writer for saving output.
Retrieves frame width, height, and FPS for consistency.
Tracking Initialization:
Instantiates CentroidTracker with tuned parameters.
Initializes dictionaries to store object appearance time.
Frame-by-Frame Processing:
For every frame:
Run YOLOv8 detection (model(frame)).
Filter results to only keep class person (class ID 0) with confidence > 0.4.
Extract bounding box coordinates and format them for tracking.
Update the tracker with the current frame’s bounding boxes.
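The walkthrough above reads from a recorded file; for the CCTV/IP-camera setup mentioned earlier, the only change is the capture source. The RTSP URL below is a placeholder, and a fallback FPS is used because live streams often report 0 for CAP_PROP_FPS:

# Webcam: device index 0; IP camera: an RTSP URL (placeholder address and credentials)
cap = cv2.VideoCapture(0)
# cap = cv2.VideoCapture("rtsp://user:password@192.168.1.10:554/stream1")

fps_input = cap.get(cv2.CAP_PROP_FPS)
if not fps_input or fps_input <= 0:
    fps_input = 25.0  # live sources often report 0 FPS, so fall back to a sensible default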
Key Components in Tracker Update Logic
def update(self, rects):
    if len(rects) == 0:
        for objectID in list(self.disappeared.keys()):
            self.disappeared[objectID] += 1
            if self.disappeared[objectID] > self.maxDisappeared:
                self.deregister(objectID)
        return self.bbox

    inputCentroids = np.zeros((len(rects), 2), dtype="int")
    inputRects = []
    for (i, (startX, startY, endX, endY)) in enumerate(rects):
        cX = int((startX + endX) / 2.0)
        cY = int((startY + endY) / 2.0)
        inputCentroids[i] = (cX, cY)
        inputRects.append(rects[i])

    if len(self.objects) == 0:
        for i in range(0, len(inputCentroids)):
            self.register(inputCentroids[i], inputRects[i])
    else:
        objectIDs = list(self.objects.keys())
        objectCentroids = list(self.objects.values())

        D = dist.cdist(np.array(objectCentroids), inputCentroids)
        rows = D.min(axis=1).argsort()
        cols = D.argmin(axis=1)[rows]

        usedRows = set()
        usedCols = set()
        for (row, col) in zip(rows, cols):
            if row in usedRows or col in usedCols:
                continue
            if D[row, col] > self.maxDistance:
                continue
            objectID = objectIDs[row]
            self.objects[objectID] = inputCentroids[col]
            self.bbox[objectID] = inputRects[col]
            self.disappeared[objectID] = 0
            usedRows.add(row)
            usedCols.add(col)

        unusedRows = set(range(0, D.shape[0])).difference(usedRows)
        unusedCols = set(range(0, D.shape[1])).difference(usedCols)

        if D.shape[0] >= D.shape[1]:
            for row in unusedRows:
                objectID = objectIDs[row]
                self.disappeared[objectID] += 1
                if self.disappeared[objectID] > self.maxDisappeared:
                    self.deregister(objectID)
        else:
            for col in unusedCols:
                self.register(inputCentroids[col], inputRects[col])

    return self.bbox
Centroid Calculation:
For each bounding box: compute (centerX, centerY)
Object Matching:
Use scipy.spatial.distance.cdist to compute pairwise distances between existing and new centroids.
Match based on minimal distances while staying within maxDistance.
Object Maintenance:
Objects not matched increment their disappeared counter.
If counter exceeds threshold → deregister.
Unmatched new detections → register as new objects.
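As a small numeric illustration of the matching step (with made-up centroids), suppose two people were tracked in the previous frame and two new detections arrive:

import numpy as np
from scipy.spatial import distance as dist

# Centroids from the previous frame (IDs 0 and 1) and from the current frame
objectCentroids = np.array([[140, 200], [440, 220]])
inputCentroids = np.array([[150, 205], [435, 218]])

D = dist.cdist(objectCentroids, inputCentroids)
print(np.round(D, 1))
# [[ 11.2 295.5]
#  [290.4   5.4]]

rows = D.min(axis=1).argsort()  # existing objects ordered by their closest match -> [1, 0]
cols = D.argmin(axis=1)[rows]   # their nearest new centroids -> [1, 0]
# ID 1 pairs with the second detection (5.4 px) and ID 0 with the first (11.2 px);
# both distances are within maxDistance, so the same IDs carry over to the new frame.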
Detection Filtering Logic
for box in detections:
    cls_id = int(box.cls[0])
    conf = float(box.conf[0])
    if cls_id == 0 and conf > 0.4:
        ...
Class Check: Keeps only detections of the person class (cls_id == 0)
Confidence Threshold: Discards weak detections by requiring conf > 0.4
Example Output Use-Case
Object Tracking with Appearance Duration:
Track how long each person has been visible in the video.
Can be extended to detect dwell time, customer engagement, or security anomalies.
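The per-ID durations accumulated in appearing_time can also be written to a CSV at the end of a run for the reporting mentioned above. A minimal sketch, assuming the dictionary from the main loop is in scope:

import csv
import datetime

def export_dwell_times(appearing_time, path="dwell_times.csv"):
    # appearing_time maps object_id -> seconds visible, built in the main loop above
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["person_id", "dwell_seconds", "dwell_hms"])
        for object_id, seconds in appearing_time.items():
            hms = str(datetime.timedelta(seconds=int(seconds)))
            writer.writerow([object_id, round(seconds, 1), hms])

# Example: call export_dwell_times(appearing_time) after the video loop finishes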
Result
Full code is available at the following link:
Limitations of the System
While the system is effective for real-time person detection and duration monitoring, it comes with several limitations that impact its robustness and scalability in real-world environments:
1. No Re-identification Across Frames
The system lacks person Re-ID (Re-identification) capabilities. When a person exits and re-enters the frame, they are assigned a new ID, which breaks continuity in duration tracking.
2. Tracking Failures in Crowded or Overlapping Scenes
In densely populated areas or when people overlap or occlude each other, the centroid tracker may incorrectly assign or switch IDs.
3. No Zone-Based Analytics
The system does not distinguish between different zones (e.g., dining vs. kitchen), limiting insights into area-specific behavior and staff-vs-customer analysis.
4. Detection Confidence Dependency
It only tracks people detected with a confidence score above 0.4. In low-light or blurry conditions, people may go undetected, leading to inaccurate time measurements.
5. Static Camera Assumption
The tracking algorithm assumes a fixed camera. Any camera movement (e.g., vibration, repositioning) can break tracking accuracy.
6. Limited Action Understanding
The system detects presence and duration, but cannot identify specific actions (e.g., sitting, eating, working), which restricts its utility for behavior analytics.
7. Single-Camera Limitation
The system is designed for single-camera use. It does not support multi-camera environments or hand-off tracking across different camera views.
8. Hardware Dependence
Running YOLOv8 in real-time requires a GPU-enabled system. On CPU-only machines, the system may lag or drop frames, affecting detection and tracking accuracy.
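If a GPU is available, inference can be pinned to it explicitly; on CPU-only machines a smaller variant such as yolov8n.pt trades some accuracy for speed. A minimal sketch:

import torch
from ultralytics import YOLO

device = 0 if torch.cuda.is_available() else "cpu"

# The smaller 'n' variant is a reasonable fallback when no GPU is present
model = YOLO("yolov8s.pt" if device == 0 else "yolov8n.pt")

# Ultralytics accepts the device per prediction call; 'frame' is a frame from the capture loop above
results = model(frame, device=device, verbose=False)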
Get Help When You Need It
Implementing a real-time person tracking and duration monitoring system can be challenging—especially when you're working with live video feeds, integrating tracking algorithms, or optimizing performance for a production environment. Whether it's model fine-tuning, accurate zone calibration, or connecting the system with your restaurant’s analytics dashboard, it's perfectly normal to need expert support along the way.
If you're looking to deploy a real-time person tracking solution in a restaurant setting and need personalized guidance, CodersArts can help. They specialize in assisting both students and enterprises with end-to-end computer vision solutions tailored for the food service industry.
For students, CodersArts offers project support with:
YOLO-based person detection
Real-time object tracking implementation
Duration monitoring logic
Academic documentation and research guidance
For businesses, their services include:
End-to-end solution design and system integration
Cloud-based deployment for multi-location monitoring
Custom alerts and reporting dashboards
Staff behavior analytics and operational insights
CodersArts can help you accelerate your project from prototype to production.
Visit www.codersarts.com or email contact@codersarts.com to get started.
