
Hand Segmentation Made Simple with YOLO and Python

Introduction

In industries such as augmented reality, robotics, gaming, and gesture-based control, understanding and interpreting hand movements is critical. Accurately detecting and segmenting hands from video streams enables machines to recognize gestures, track hand positions, and enhance human-computer interaction.


However, detecting hands in real time poses challenges due to varying lighting conditions, occlusions, different skin tones, and cluttered backgrounds. Manual annotation or tracking is labor-intensive, error-prone, and does not scale well to large datasets or live camera feeds.


In this blog, we will explore a computer vision based approach to segment hands. We will implement a solution using the YOLOv8 segmentation model and the EgoHands dataset.

Problem Statement

Applications that rely on hand tracking, such as augmented reality and virtual assistants, require precise segmentation of the hand region to function effectively. Simple bounding box detection is not sufficient for fine-grained understanding of hand movements and shapes.


Key challenges in hand segmentation include:

  • Difficulty in detecting hands against complex and dynamic backgrounds

  • Lack of annotated datasets for pixel-level hand segmentation

  • Inability to run segmentation models in real time for live applications

  • Performance degradation due to occlusions and variable lighting



How Hand Segmentation Systems Work

Hand segmentation systems use deep learning techniques to detect and isolate hand regions from images and video frames. These systems are trained on datasets that include pixel-level annotations of hand regions. One of the most efficient models for real time object detection and segmentation is YOLO, short for “You Only Look Once.”


Key capabilities of these systems include:

  • Hand segmentation: Detecting hand boundaries at the pixel level

  • Real time performance: Processing video frames quickly for interactive use

  • Multiple hand detection: Identifying several hands in the same frame

  • Quantitative feedback: Estimating hand area and position over time


Once trained, the model can be deployed on edge devices or integrated with AR/VR systems for gesture recognition and control.
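
As an illustration, here is a minimal sketch of that deployment step using the Ultralytics API. It assumes a webcam at index 0 and a trained hand-segmentation checkpoint; the weights path below is where Ultralytics typically saves the best model for the training run configured later in this post.

import cv2
from ultralytics import YOLO

# Assumed path: default location of the best weights from the training run shown later
model = YOLO("egohands_yolo/hand_segmentation/weights/best.pt")

cap = cv2.VideoCapture(0)  # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    # Run segmentation on the current frame
    results = model(frame, verbose=False)

    # Overlay the predicted masks and boxes for instant visual feedback
    annotated = results[0].plot()
    cv2.imshow("Hand segmentation", annotated)

    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()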



Benefits of Using YOLOv8 for Hand Segmentation

  1. Precision: YOLOv8 segmentation outputs detailed hand masks rather than just bounding boxes, allowing for fine control in applications.

  2. Speed: The model is optimized for high speed inference, making it suitable for real time applications.

  3. Scalability: Once trained, the model can be applied to large datasets or live video feeds without additional annotation effort.

  4. Deployment Ready: The model can be used in cross-platform environments, including mobile and embedded devices.

  5. Visualization and Feedback: Real time overlay of segmentation masks helps in debugging and provides instant feedback during operation.



Applications in Real World Environments

Hand segmentation using deep learning can be applied in a variety of real world use cases:

  • Augmented reality (AR): Overlaying digital content on hand gestures

  • Robotics: Enabling robots to detect and respond to human hand motions

  • Gaming and gesture control: Interpreting hand signs to control gameplay

  • Sign language translation: Recognizing and translating sign language gestures

  • Virtual training environments: Enhancing simulations with gesture tracking



Dataset Overview: EgoHands

The EgoHands dataset serves as our foundation, providing a rich collection of egocentric hand images captured from first person perspectives. This dataset uniquely captures the natural variations in hand appearance, pose, and interaction scenarios that occur in real world usage.

Size: 1.3 GB


Dataset Characteristics:

  • 4,800 labeled images across multiple video sequences

  • Pixel-level annotations with precise polygon boundaries

  • Diverse scenarios including object manipulation, gesture performance, and natural hand movements

  • Varying lighting conditions from indoor and outdoor environments

  • Multiple hand configurations including single hands, both hands, and hand-object interactions


The egocentric perspective makes this dataset particularly valuable for applications like AR, VR, and wearable computing, where the camera viewpoint matches the user's perspective.



Implementation

Dataset Preprocessing: From EgoHands to YOLO Format

The EgoHands dataset contains polygon-based annotations. We need to convert these polygons into YOLOv8 segmentation format.

import numpy as np
import scipy.io as sio

def _load_polygons_yolo_format(self, polygons_path, frame_idx, image_shape):
    # Load the .mat file which stores polygons per frame
    mat_data = sio.loadmat(polygons_path)
    polygons = mat_data['polygons']
    frame_data = polygons[0, frame_idx]
    yolo_annotations = []

    for poly_array in frame_data:
        if isinstance(poly_array, np.ndarray) and poly_array.size > 0:
            # Normalize polygon coordinates to [0, 1] for YOLO
            normalized_poly = poly_array.copy()
            normalized_poly[:, 0] /= image_shape[1]  # x normalized by width
            normalized_poly[:, 1] /= image_shape[0]  # y normalized by height
            normalized_poly = np.clip(normalized_poly, 0, 1)  # Ensure within bounds

            # YOLO format: class_id x1 y1 x2 y2 ...
            yolo_line = [0] + normalized_poly.flatten().tolist()
            yolo_annotations.append(yolo_line)

    return yolo_annotations

Step by step:

  1. Load the data:

    • Opens a MATLAB file (.mat) that contains hand polygons for different video frames

    • Gets the polygon data for one specific frame

  2. Process each hand shape:

    • Takes the pixel coordinates that outline each hand

    • Normalizes coordinates: Converts actual pixel positions to fractions of the image size (a worked example follows this list)

      • X-coordinates: divides by image width (so 0-1 range)

      • Y-coordinates: divides by image height (so 0-1 range)

    • Safety check: Makes sure all values stay between 0 and 1

  3. Format for YOLO:

    • Flattens the coordinate pairs into one long list

    • Adds a "0" at the beginning (tells YOLO this is class 0, the "hand" class)

    • Creates the final format: [0, x1, y1, x2, y2, x3, y3, ...]
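
To make the normalization concrete, here is a small self-contained example with a made-up three-point polygon on a 1280x720 frame, showing how pixel coordinates become a YOLO segmentation line:

import numpy as np

# Hypothetical hand polygon in pixel coordinates (x, y) on a 1280x720 frame
poly = np.array([[640.0, 360.0], [700.0, 420.0], [600.0, 450.0]])
image_shape = (720, 1280)  # (height, width)

normalized = poly.copy()
normalized[:, 0] /= image_shape[1]  # x / width
normalized[:, 1] /= image_shape[0]  # y / height
normalized = np.clip(normalized, 0, 1)

# Class id 0 (hand) followed by the flattened x1 y1 x2 y2 ... values
yolo_line = [0] + normalized.flatten().tolist()
print(yolo_line)
# [0, 0.5, 0.5, 0.546875, 0.5833..., 0.46875, 0.625]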


Preparing the Dataset Directory

We organize the data into the format expected by YOLO: images/train, labels/train, etc.

import os
from PIL import Image

# 'sample', 'output_dir', 'split_name', and 'unique_name' are defined in the surrounding preprocessing loop
image_resized = sample['image'].resize((640, 640), Image.LANCZOS)
image_resized.save(os.path.join(output_dir, 'images', split_name, f"{unique_name}.jpg"))

# Write annotation in YOLO segmentation format
with open(os.path.join(output_dir, 'labels', split_name, f"{unique_name}.txt"), 'w') as f:
    for annotation in sample['yolo_annotations']:
        # Convert annotation list to a string: class id followed by normalized coordinates
        line = ' '.join([str(annotation[0])] + [f"{coord:.6f}" for coord in annotation[1:]])
        f.write(line + '\n')

Step by step:

  1. Resize the image:

    • Takes the original image and resizes it to 640x640 pixels

    • Uses LANCZOS resampling (resizing method)

    • Saves it as a .jpg file in the correct folder (like images/train/ or images/val/)

  2. Create the label file:

    • For each image, creates a matching .txt file with the same name

    • Goes in the labels/ folder (YOLO's required structure)

  3. Format the annotations:

    • Takes each hand outline (polygon coordinates)

    • Writes them as: 0 0.123456 0.234567 0.345678 0.456789...

    • First number (0) = class ID (hand)

    • Following numbers = normalized x,y coordinates with 6 decimal precision

YOLO folder structure created:

output_dir/
├── images/
│   ├── train/
│   │   └── image001.jpg
│   └── val/
│       └── image002.jpg
└── labels/
    ├── train/
    │   └── image001.txt
    └── val/
        └── image002.txt
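
YOLO also needs a dataset YAML file that points to these folders and names the classes; the training and evaluation steps below refer to it as data_yaml_path. Here is a minimal sketch of generating it, assuming the output_dir variable from the preprocessing code and a single "hand" class:

import os

# Assumes output_dir is the dataset root created above
data_yaml = f"""path: {os.path.abspath(output_dir)}
train: images/train
val: images/val

names:
  0: hand
"""

data_yaml_path = os.path.join(output_dir, "data.yaml")
with open(data_yaml_path, "w") as f:
    f.write(data_yaml)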

Training YOLOv8 for Segmentation

Training is wrapped in a class that handles model loading, argument configuration, and the training run itself; an equivalent standalone call is sketched after the walkthrough below.

def train(self, data_yaml_path, epochs=100, imgsz=640, batch_size=16, ...):
    # Define YOLO training configuration
    train_args = {
        'data': data_yaml_path,
        'epochs': epochs,
        'imgsz': imgsz,
        'batch': batch_size,
        'device': self.device,
        'project': 'egohands_yolo',
        'name': 'hand_segmentation',
        'save': True,
        'plots': True
    }

    # Start training the model
    self.training_results = self.model.train(**train_args)

How it works:

  1. Sets up training configuration:

    • data: Points to the YAML file that tells YOLO where to find images and labels

    • epochs: Will show the model all training images 100 times

    • imgsz: Resizes all images to 640x640 pixels during training

    • batch_size: Processes 16 images at once (faster than one-by-one)

    • device: Uses GPU for speed

    • project/name: Creates organized folders to save results

    • save/plots: Automatically saves the best model and creates progress charts

  2. Starts the training:

    • YOLO takes over and begins the learning process

    • Shows model images → model guesses where hands are → checks against correct answers → adjusts model → repeat

    • Saves progress and creates performance graphs automatically


Evaluating the Model and Visualizing Predictions

We test the trained YOLO model on validation images (images it never saw during training) to get an honest measure of performance; a sketch for inspecting individual predictions follows the walkthrough below.

from ultralytics import YOLO

# Evaluate the trained model ('model_path' points to the saved best weights) on the validation data
model = YOLO(model_path)
results = model.val(data=data_yaml_path, device=device, plots=True, save_json=True)

Step by step:

  1. Load the trained model:

    • Takes your saved model file and loads it into memory

    • Ready to start testing

  2. Run validation test:

    • data: Uses the same YAML file that points to your validation images

    • device: Runs on GPU for faster testing

    • plots=True: Automatically creates charts and graphs showing performance

    • save_json=True: Saves detailed results in a JSON file for later analysis
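
Beyond aggregate metrics, it helps to inspect individual predictions. Below is a minimal sketch of running the trained model on a single validation image, saving the mask overlay, and estimating how much of the frame each detected hand covers; the image path is a placeholder for any file in your validation split.

import cv2
from ultralytics import YOLO

model = YOLO("egohands_yolo/hand_segmentation/weights/best.pt")

# Placeholder path: any image from the validation split
results = model("images/val/image002.jpg")
result = results[0]

# Save the overlay of predicted masks and boxes
cv2.imwrite("prediction_overlay.jpg", result.plot())

# Estimate the fraction of the frame covered by each detected hand
if result.masks is not None:
    for i, mask in enumerate(result.masks.data):
        coverage = float(mask.sum()) / mask.numel()
        print(f"Hand {i}: {coverage:.1%} of the frame")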



Results

Figure: original images (left) and predicted segmentation results (right).

Full code is available at:


The YOLOv8n segmentation model achieves a precision of 0.985, a recall of 0.967, an mAP@0.5 of 0.990, and an mAP@0.5:0.95 of 0.818. While these metrics indicate strong performance, the model does not perform uniformly across all conditions; it does better on some image and video types than others.


It also occasionally misclassifies other objects as hands. The model's performance may improve with hyperparameter tuning and by incorporating additional datasets such as EgoYouTubeHands, which offers diverse egocentric hand views; HandOverFace, which provides examples of hands occluding faces; and GTEA, which captures hand-object interactions during everyday activities.



Get Help When You Need It

Developing a hand segmentation system that runs in real time and performs reliably across diverse conditions can be challenging. Whether you are adapting YOLOv8 for segmentation tasks or deploying your solution in augmented reality, robotics, or gesture recognition applications, expert guidance can significantly accelerate progress.


If you are working on a hand segmentation project and require personalized assistance, CodersArts provides expert support for both students and enterprises implementing computer vision technologies.


For Students

CodersArts supports students engaged in academic or research projects by offering help with:

  • Converting datasets such as EgoHands into YOLO-compatible formats

  • Debugging model training pipelines and preprocessing scripts

  • Improving segmentation accuracy and minimizing false positives

  • Annotating pixel-level masks and preparing training data

  • Evaluating model performance and visualizing results


For Enterprises

For companies building real-time hand tracking or gesture-based systems, CodersArts delivers production-ready solutions, including:

  • System architecture consultation for real-time hand segmentation

  • Edge deployment and optimization for AR/VR or embedded platforms

  • Custom training with industry-specific data, such as medical or industrial environments

  • Integration into interactive systems, robotics platforms, or gaming interfaces

  • Enhancing segmentation robustness under occlusions and variable lighting conditions


Visit www.codersarts.com or email contact@codersarts.com to access the support you need to build, optimize, and scale high-performance hand segmentation systems.




