Hand Segmentation Made Simple with YOLO and Python
- ganesh90
Introduction
In industries such as augmented reality, robotics, gaming, and gesture-based control, understanding and interpreting hand movements is critical. Accurately detecting and segmenting hands from video streams enables machines to recognize gestures, track hand positions, and enhance human-computer interaction.
However, detecting hands in real time poses challenges due to varying lighting conditions, occlusions, different skin tones, and cluttered backgrounds. Manual annotation or tracking is labor intensive, error prone, and does not scale well with large data or live camera feeds.
In this blog, we will explore a computer vision based approach to segment hands. We will implement a solution using the YOLOv8 segmentation model and the EgoHands dataset.

Problem Statement
Applications that rely on hand tracking, such as augmented reality and virtual assistants, require precise segmentation of the hand region to function effectively. Simple bounding box detection is not sufficient for fine grained understanding of hand movements and shapes.
Key challenges in hand segmentation include:
Difficulty in detecting hands against complex and dynamic backgrounds
Lack of annotated datasets for pixel-level hand segmentation
Inability to run segmentation models in real time for live applications
Performance degradation due to occlusions and variable lighting
How Hand Segmentation Systems Work
Hand segmentation systems use deep learning techniques to detect and isolate hand regions from images and video frames. These systems are trained on datasets that include pixel-level annotations of hand regions. One of the most efficient models for real time object detection and segmentation is YOLO, short for “You Only Look Once.”
Key capabilities of these systems include:
Hand segmentation: Detecting hand boundaries at the pixel level
Real time performance: Processing video frames quickly for interactive use
Multiple hand detection: Identifying several hands in the same frame
Quantitative feedback: Estimating hand area and position over time
Once trained, the model can be deployed on edge devices or integrated with AR/VR systems for gesture recognition and control.
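To make this concrete, the short sketch below runs an off-the-shelf YOLOv8 segmentation checkpoint on a single frame using the Ultralytics Python API. The image path is a placeholder, and the pretrained COCO checkpoint segments generic objects rather than hands; training on hand data comes later in this post.

from ultralytics import YOLO

# Minimal sketch: run a pretrained YOLOv8 segmentation checkpoint on one frame
model = YOLO("yolov8n-seg.pt")   # small pretrained segmentation model
results = model("frame.jpg")     # "frame.jpg" is a placeholder image path

for result in results:
    if result.masks is not None:
        print(f"Found {len(result.masks.data)} segmented objects in the frame")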
Benefits of Using YOLOv8 for Hand Segmentation
Precision: YOLOv8 segmentation outputs detailed hand masks rather than just bounding boxes, allowing for fine control in applications.
Speed: The model is optimized for high speed inference, making it suitable for real time applications.
Scalability: Once trained, the model can be applied to large datasets or live video feeds without additional annotation effort.
Deployment Ready: The model can be used in cross-platform environments, including mobile and embedded devices.
Visualization and Feedback: Real time overlay of segmentation masks helps in debugging and provides instant feedback during operation.
Applications in Real World Environments
Hand segmentation using deep learning can be applied in a variety of real world use cases:
Augmented reality (AR): Overlaying digital content on hand gestures
Robotics: Enabling robots to detect and respond to human hand motions
Gaming and gesture control: Interpreting hand signs to control gameplay
Sign language translation: Recognizing and translating sign language gestures
Virtual training environments: Enhancing simulations with gesture tracking
Dataset Overview: EgoHands
The EgoHands dataset serves as our foundation, providing a rich collection of egocentric hand images captured from first person perspectives. This dataset uniquely captures the natural variations in hand appearance, pose, and interaction scenarios that occur in real world usage.
Dataset Link: https://vision.soic.indiana.edu/projects/egohands/
Size: 1.3 GB
Dataset Characteristics:
4,800 labeled images across multiple video sequences
Pixel level annotations with precise polygon boundaries
Diverse scenarios including object manipulation, gesture performance, and natural hand movements
Varying lighting conditions from indoor and outdoor environments
Multiple hand configurations including single hands, both hands, and hand object interactions
The egocentric perspective makes this dataset particularly valuable for applications like AR, VR, and wearable computing, where the camera viewpoint matches the user perspective.
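Before converting anything, it helps to peek at the raw annotations. The sketch below loads one of the dataset's per-video polygons.mat files with SciPy; the directory and video name are examples of the dataset's layout, so adjust the path to wherever you extract the archive.

import scipy.io as sio

# Each labeled video folder ships a polygons.mat with hand outlines for its labeled frames
mat = sio.loadmat("egohands/_LABELLED_SAMPLES/CARDS_COURTYARD_B_T/polygons.mat")
polygons = mat["polygons"]       # indexed as polygons[0, frame_idx]
print(polygons.shape)

frame0 = polygons[0, 0]          # hand polygons for the first labeled frame
for poly in frame0:
    # Each entry is an (N, 2) array of x, y pixel coordinates, or empty if that hand is absent
    print(poly.shape)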
Implementation
Dataset Preprocessing: From EgoHands to YOLO Format
The EgoHands dataset contains polygon-based annotations. We need to convert these polygons into YOLOv8 segmentation format.
import numpy as np
import scipy.io as sio

def _load_polygons_yolo_format(self, polygons_path, frame_idx, image_shape):
    # Load the .mat file which stores hand polygons per frame
    mat_data = sio.loadmat(polygons_path)
    polygons = mat_data['polygons']
    frame_data = polygons[0, frame_idx]

    yolo_annotations = []
    for poly_array in frame_data:
        if isinstance(poly_array, np.ndarray) and poly_array.size > 0:
            # Normalize polygon coordinates to [0, 1] for YOLO
            normalized_poly = poly_array.copy()
            normalized_poly[:, 0] /= image_shape[1]  # x normalized by width
            normalized_poly[:, 1] /= image_shape[0]  # y normalized by height
            normalized_poly = np.clip(normalized_poly, 0, 1)  # ensure values stay within bounds

            # YOLO segmentation format: class_id x1 y1 x2 y2 ...
            yolo_line = [0] + normalized_poly.flatten().tolist()
            yolo_annotations.append(yolo_line)
    return yolo_annotations
Step by step:
Load the data:
Opens a MATLAB file (.mat) that contains hand polygons for different video frames
Gets the polygon data for one specific frame
Process each hand shape:
Takes the pixel coordinates that outline each hand
Normalizes coordinates: Converts from actual pixel positions to percentages
X-coordinates: divides by image width (so 0-1 range)
Y-coordinates: divides by image height (so 0-1 range)
Safety check: Makes sure all values stay between 0 and 1
Format for YOLO:
Flattens the coordinate pairs into one long list
Adds a "0" at the beginning (tells YOLO this is class 0, probably "hand")
Creates the final format: [0, x1, y1, x2, y2, x3, y3, ...]
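As a concrete illustration, here is the same normalization applied to a toy three-point polygon, assuming a 1280x720 frame (the resolution of the EgoHands videos):

import numpy as np

# Toy example: normalize a three-point polygon for a 1280x720 frame
poly = np.array([[640.0, 360.0], [700.0, 360.0], [700.0, 420.0]])  # x, y in pixels
height, width = 720, 1280

norm = poly.copy()
norm[:, 0] /= width    # x becomes a fraction of image width
norm[:, 1] /= height   # y becomes a fraction of image height

yolo_line = [0] + np.clip(norm, 0, 1).flatten().tolist()
print(yolo_line)
# [0, 0.5, 0.5, 0.546875, 0.5, 0.546875, 0.5833333333333334]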
Preparing the Dataset Directory
We organize the data into the format expected by YOLO: images/train, labels/train, etc.
image_resized = sample['image'].resize((640, 640), Image.LANCZOS)
image_resized.save(os.path.join(output_dir, 'images', split_name, f"{unique_name}.jpg"))

# Write annotation in YOLO segmentation format
with open(os.path.join(output_dir, 'labels', split_name, f"{unique_name}.txt"), 'w') as f:
    for annotation in sample['yolo_annotations']:
        # Convert annotation list to a formatted string
        line = ' '.join([str(annotation[0])] + [f"{coord:.6f}" for coord in annotation[1:]])
        f.write(line + '\n')
Step by step:
Resize the image:
Takes the original image and resizes it to 640x640 pixels
Uses LANCZOS resampling (a high-quality resampling filter)
Saves it as a .jpg file in the correct folder (like images/train/ or images/val/)
Create the label file:
For each image, creates a matching .txt file with the same name
Goes in the labels/ folder (YOLO's required structure)
Format the annotations:
Takes each hand outline (polygon coordinates)
Writes them as: 0 0.123456 0.234567 0.345678 0.456789...
First number (0) = class ID (hand)
Following numbers = normalized x,y coordinates with 6 decimal precision
YOLO folder structure created:
output_dir/
├── images/
│   ├── train/
│   │   └── image001.jpg
│   └── val/
│       └── image002.jpg
└── labels/
    ├── train/
    │   └── image001.txt
    └── val/
        └── image002.txt
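YOLO also needs a small dataset YAML that points at these folders and names the single class. Here is a minimal version, written from Python for convenience; the paths assume the output_dir layout shown above.

import os

# Minimal dataset config for single-class hand segmentation
data_yaml = """path: output_dir
train: images/train
val: images/val
names:
  0: hand
"""

with open(os.path.join("output_dir", "data.yaml"), "w") as f:
    f.write(data_yaml)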
Training YOLOv8 for Segmentation
Training is wrapped in a class that handles model loading, argument passing, and training.
def train(self, data_yaml_path, epochs=100, imgsz=640, batch_size=16, ...):
    # Define YOLO training configuration
    train_args = {
        'data': data_yaml_path,
        'epochs': epochs,
        'imgsz': imgsz,
        'batch': batch_size,
        'device': self.device,
        'project': 'egohands_yolo',
        'name': 'hand_segmentation',
        'save': True,
        'plots': True
    }
    # Start training the model
    self.training_results = self.model.train(**train_args)
How it works:
Sets up training configuration:
data: Points to the YAML file that tells YOLO where to find images and labels
epochs: Will show the model all training images 100 times
imgsz: Resizes all images to 640x640 pixels during training
batch_size: Processes 16 images at once (faster than one-by-one)
device: Uses GPU for speed
project/name: Creates organized folders to save results
save/plots: Automatically saves the best model and creates progress charts
Starts the training:
YOLO takes over and begins the learning process
Shows model images → model guesses where hands are → checks against correct answers → adjusts model → repeat
Saves progress and creates performance graphs automatically
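For reference, the same configuration can be passed straight to the Ultralytics API without a wrapper class. A minimal sketch, assuming the data.yaml written earlier and a single GPU:

from ultralytics import YOLO

# Fine-tune a pretrained segmentation checkpoint on the converted EgoHands dataset
model = YOLO("yolov8n-seg.pt")
model.train(
    data="output_dir/data.yaml",  # dataset config from the previous section
    epochs=100,
    imgsz=640,
    batch=16,
    device=0,                     # GPU index; use "cpu" if no GPU is available
    project="egohands_yolo",
    name="hand_segmentation",
    save=True,
    plots=True,
)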
Evaluating the Model and Visualizing Predictions
We test the trained YOLO model on validation images, frames it never saw during training, to get an honest measure of performance.
from ultralytics import YOLO

# Evaluate trained model on validation data
model = YOLO(model_path)
results = model.val(data=data_yaml_path, device=device, plots=True, save_json=True)
Step by step:
Load the trained model:
Takes your saved model file and loads it into memory
Ready to start testing
Run validation test:
data: Uses the same YAML file that points to your validation images
device: Runs on GPU for faster testing
plots=True: Automatically creates charts and graphs showing performance
save_json=True: Saves detailed results in a JSON file for later analysis
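Beyond the aggregate metrics, it is worth eyeballing a few predictions. A minimal sketch, assuming the best weights were saved under the project and name folders used during training:

import cv2
from ultralytics import YOLO

# Load the best checkpoint saved during training (path follows the project/name settings above)
model = YOLO("egohands_yolo/hand_segmentation/weights/best.pt")

# Run inference on one validation image and save the mask overlay for inspection
results = model("output_dir/images/val/image002.jpg")
annotated = results[0].plot()    # image array with masks and boxes drawn
cv2.imwrite("prediction_overlay.jpg", annotated)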
Results


Full code is available at:
The YOLOv8n segmentation model demonstrates a precision of 0.985, a recall of 0.967, an mAP at 0.5 of 0.990, and an mAP at 0.5 to 0.95 of 0.818. While these metrics indicate strong performance on general hand segmentation, results are not necessarily uniform across all conditions: the model performs better on some image and video types than on others.
It occasionally misclassifies other objects as hands. The model's performance may improve with hyperparameter tuning and by incorporating additional datasets such as EgoYouTubeHands, which offers diverse egocentric hand views, HandOverFace, which provides examples of hands occluding faces, and GTEA, which captures hand-object interactions during everyday activities.
Get Help When You Need It
Developing a hand segmentation system that functions in real time and performs reliably across diverse conditions can be challenging. Whether you are addressing issues with adapting YOLOv8 for segmentation tasks, or deploying your solution in augmented reality, robotics, or gesture recognition applications, expert guidance can significantly accelerate progress.
If you are working on a hand segmentation project and require personalized assistance, CodersArts provides expert support for both students and enterprises implementing computer vision technologies.
For Students
CodersArts supports students engaged in academic or research projects by offering help with:
Converting datasets such as EgoHands into YOLO-compatible formats
Debugging model training pipelines and preprocessing scripts
Improving segmentation accuracy and minimizing false positives
Annotating pixel-level masks and preparing training data
Evaluating model performance and visualizing results
For Enterprises
For companies building real-time hand tracking or gesture-based systems, CodersArts delivers production-ready solutions, including:
System architecture consultation for real-time hand segmentation
Edge deployment and optimization for AR/VR or embedded platforms
Custom training with industry-specific data, such as medical or industrial environments
Integration into interactive systems, robotics platforms, or gaming interfaces
Enhancing segmentation robustness under occlusions and variable lighting conditions
Visit www.codersarts.com or email contact@codersarts.com to access the support you need to build, optimize, and scale high-performance hand segmentation systems.
