
Hand Segmentation Made Simple with YOLO and Python

Introduction

In industries such as augmented reality, robotics, gaming, and gesture-based control, understanding and interpreting hand movements is critical. Accurately detecting and segmenting hands from video streams enables machines to recognize gestures, track hand positions, and enhance human-computer interaction.


However, detecting hands in real time poses challenges due to varying lighting conditions, occlusions, different skin tones, and cluttered backgrounds. Manual annotation or tracking is labor-intensive, error-prone, and does not scale well to large datasets or live camera feeds.


In this blog, we will explore a computer vision based approach to segment hands. We will implement a solution using the YOLOv8 segmentation model and the EgoHands dataset.

Problem Statement

Applications that rely on hand tracking, such as augmented reality and virtual assistants, require precise segmentation of the hand region to function effectively. Simple bounding box detection is not sufficient for fine-grained understanding of hand movements and shapes.


Key challenges in hand segmentation include:

  • Difficulty in detecting hands against complex and dynamic backgrounds

  • Lack of annotated datasets for pixel-level hand segmentation

  • Inability to run segmentation models in real time for live applications

  • Performance degradation due to occlusions and variable lighting



How Hand Segmentation Systems Work

Hand segmentation systems use deep learning techniques to detect and isolate hand regions from images and video frames. These systems are trained on datasets that include pixel-level annotations of hand regions. One of the most efficient models for real time object detection and segmentation is YOLO, short for “You Only Look Once.”


Key capabilities of these systems include:

  • Hand segmentation: Detecting hand boundaries at the pixel level

  • Real time performance: Processing video frames quickly for interactive use

  • Multiple hand detection: Identifying several hands in the same frame

  • Quantitative feedback: Estimating hand area and position over time


Once trained, the model can be deployed on edge devices or integrated with AR/VR systems for gesture recognition and control.
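
As an illustration, here is a minimal sketch of that deployment step using the Ultralytics API. It assumes a webcam at index 0 and a trained hand-segmentation checkpoint; the weights path below is where Ultralytics typically saves the best model for the training run configured later in this post.

import cv2
from ultralytics import YOLO

# Assumed path: default location of the best weights from the training run shown later
model = YOLO("egohands_yolo/hand_segmentation/weights/best.pt")

cap = cv2.VideoCapture(0)  # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    # Run segmentation on the current frame
    results = model(frame, verbose=False)

    # Overlay the predicted masks and boxes for instant visual feedback
    annotated = results[0].plot()
    cv2.imshow("Hand segmentation", annotated)

    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()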



Benefits of Using YOLOv8 for Hand Segmentation

  1. Precision: YOLOv8 segmentation outputs detailed hand masks rather than just bounding boxes, allowing for fine control in applications.

  2. Speed: The model is optimized for high speed inference, making it suitable for real time applications.

  3. Scalability: Once trained, the model can be applied to large datasets or live video feeds without additional annotation effort.

  4. Deployment Ready: The model can be used in cross-platform environments, including mobile and embedded devices.

  5. Visualization and Feedback: Real time overlay of segmentation masks helps in debugging and provides instant feedback during operation.



Applications in Real World Environments

Hand segmentation using deep learning can be applied in a variety of real world use cases:

  • Augmented reality (AR): Overlaying digital content on hand gestures

  • Robotics: Enabling robots to detect and respond to human hand motions

  • Gaming and gesture control: Interpreting hand signs to control gameplay

  • Sign language translation: Recognizing and translating sign language gestures

  • Virtual training environments: Enhancing simulations with gesture tracking



Dataset Overview: EgoHands

The EgoHands dataset serves as our foundation, providing a rich collection of egocentric hand images captured from first person perspectives. This dataset uniquely captures the natural variations in hand appearance, pose, and interaction scenarios that occur in real world usage.

Size: 1.3 GB


Dataset Characteristics:

  • 4,800 labeled images across multiple video sequences

  • Pixel-level annotations with precise polygon boundaries

  • Diverse scenarios including object manipulation, gesture performance, and natural hand movements

  • Varying lighting conditions from indoor and outdoor environments

  • Multiple hand configurations including single hands, both hands, and hand-object interactions


The egocentric perspective makes this dataset particularly valuable for applications like AR, VR, and wearable computing, where the camera viewpoint matches the user's perspective.



Implementation

Dataset Preprocessing: From EgoHands to YOLO Format

The EgoHands dataset contains polygon-based annotations. We need to convert these polygons into YOLOv8 segmentation format.

import numpy as np
import scipy.io as sio

def _load_polygons_yolo_format(self, polygons_path, frame_idx, image_shape):
    # Load the .mat file which stores polygons per frame
    mat_data = sio.loadmat(polygons_path)
    polygons = mat_data['polygons']
    frame_data = polygons[0, frame_idx]
    yolo_annotations = []

    for poly_array in frame_data:
        if isinstance(poly_array, np.ndarray) and poly_array.size > 0:
            # Normalize polygon coordinates to [0, 1] for YOLO
            normalized_poly = poly_array.copy()
            normalized_poly[:, 0] /= image_shape[1]  # x normalized by width
            normalized_poly[:, 1] /= image_shape[0]  # y normalized by height
            normalized_poly = np.clip(normalized_poly, 0, 1)  # Ensure within bounds

            # YOLO format: class_id x1 y1 x2 y2 ...
            yolo_line = [0] + normalized_poly.flatten().tolist()
            yolo_annotations.append(yolo_line)

    return yolo_annotations

Step by step:

  1. Load the data:

    • Opens a MATLAB file (.mat) that contains hand polygons for different video frames

    • Gets the polygon data for one specific frame

  2. Process each hand shape:

    • Takes the pixel coordinates that outline each hand

    • Normalizes coordinates: Converts actual pixel positions to fractions of the image size (a worked example follows this list)

      • X-coordinates: divides by image width (so 0-1 range)

      • Y-coordinates: divides by image height (so 0-1 range)

    • Safety check: Makes sure all values stay between 0 and 1

  3. Format for YOLO:

    • Flattens the coordinate pairs into one long list

    • Adds a "0" at the beginning (tells YOLO this is class 0, the "hand" class)

    • Creates the final format: [0, x1, y1, x2, y2, x3, y3, ...]
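
To make the normalization concrete, here is a small self-contained example with a made-up three-point polygon on a 1280x720 frame, showing how pixel coordinates become a YOLO segmentation line:

import numpy as np

# Hypothetical hand polygon in pixel coordinates (x, y) on a 1280x720 frame
poly = np.array([[640.0, 360.0], [700.0, 420.0], [600.0, 450.0]])
image_shape = (720, 1280)  # (height, width)

normalized = poly.copy()
normalized[:, 0] /= image_shape[1]  # x / width
normalized[:, 1] /= image_shape[0]  # y / height
normalized = np.clip(normalized, 0, 1)

# Class id 0 (hand) followed by the flattened x1 y1 x2 y2 ... values
yolo_line = [0] + normalized.flatten().tolist()
print(yolo_line)
# [0, 0.5, 0.5, 0.546875, 0.5833..., 0.46875, 0.625]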


Preparing the Dataset Directory

We organize the data into the format expected by YOLO: images/train, labels/train, etc.

import os
from PIL import Image

# 'sample', 'output_dir', 'split_name', and 'unique_name' are defined in the surrounding preprocessing loop
image_resized = sample['image'].resize((640, 640), Image.LANCZOS)
image_resized.save(os.path.join(output_dir, 'images', split_name, f"{unique_name}.jpg"))

# Write annotation in YOLO segmentation format
with open(os.path.join(output_dir, 'labels', split_name, f"{unique_name}.txt"), 'w') as f:
    for annotation in sample['yolo_annotations']:
        # Convert annotation list to a string: class id followed by normalized coordinates
        line = ' '.join([str(annotation[0])] + [f"{coord:.6f}" for coord in annotation[1:]])
        f.write(line + '\n')

Step by step:

  1. Resize the image:

    • Takes the original image and resizes it to 640x640 pixels

    • Uses LANCZOS resampling (resizing method)

    • Saves it as a .jpg file in the correct folder (like images/train/ or images/val/)

  2. Create the label file:

    • For each image, creates a matching .txt file with the same name

    • Goes in the labels/ folder (YOLO's required structure)

  3. Format the annotations:

    • Takes each hand outline (polygon coordinates)

    • Writes them as: 0 0.123456 0.234567 0.345678 0.456789...

    • First number (0) = class ID (hand)

    • Following numbers = normalized x,y coordinates with 6 decimal precision

YOLO folder structure created:

output_dir/
├── images/
│   ├── train/
│   │   └── image001.jpg
│   └── val/
│       └── image002.jpg
└── labels/
    ├── train/
    │   └── image001.txt
    └── val/
        └── image002.txt
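
YOLO also needs a dataset YAML file that points to these folders and names the classes; the training and evaluation steps below refer to it as data_yaml_path. Here is a minimal sketch of generating it, assuming the output_dir variable from the preprocessing code and a single "hand" class:

import os

# Assumes output_dir is the dataset root created above
data_yaml = f"""path: {os.path.abspath(output_dir)}
train: images/train
val: images/val

names:
  0: hand
"""

data_yaml_path = os.path.join(output_dir, "data.yaml")
with open(data_yaml_path, "w") as f:
    f.write(data_yaml)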

Training YOLOv8 for Segmentation

Training is wrapped in a class that handles model loading, argument configuration, and the training run itself; an equivalent standalone call is sketched after the walkthrough below.

def train(self, data_yaml_path, epochs=100, imgsz=640, batch_size=16, ...):
    # Define YOLO training configuration
    train_args = {
        'data': data_yaml_path,
        'epochs': epochs,
        'imgsz': imgsz,
        'batch': batch_size,
        'device': self.device,
        'project': 'egohands_yolo',
        'name': 'hand_segmentation',
        'save': True,
        'plots': True
    }

    # Start training the model
    self.training_results = self.model.train(**train_args)

How it works:

  1. Sets up training configuration:

    • data: Points to the YAML file that tells YOLO where to find images and labels

    • epochs: Will show the model all training images 100 times

    • imgsz: Resizes all images to 640x640 pixels during training

    • batch_size: Processes 16 images at once (faster than one-by-one)

    • device: Uses GPU for speed

    • project/name: Creates organized folders to save results

    • save/plots: Automatically saves the best model and creates progress charts

  2. Starts the training:

    • YOLO takes over and begins the learning process

    • Shows model images → model guesses where hands are → checks against correct answers → adjusts model → repeat

    • Saves progress and creates performance graphs automatically


Evaluating the Model and Visualizing Predictions

We test the trained YOLO model on validation images (images it never saw during training) to get an honest measure of performance; a sketch for inspecting individual predictions follows the walkthrough below.

from ultralytics import YOLO

# Evaluate the trained model ('model_path' points to the saved best weights) on the validation data
model = YOLO(model_path)
results = model.val(data=data_yaml_path, device=device, plots=True, save_json=True)

Step by step:

  1. Load the trained model:

    • Takes your saved model file and loads it into memory

    • Ready to start testing

  2. Run validation test:

    • data: Uses the same YAML file that points to your validation images

    • device: Runs on GPU for faster testing

    • plots=True: Automatically creates charts and graphs showing performance

    • save_json=True: Saves detailed results in a JSON file for later analysis
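
Beyond aggregate metrics, it helps to inspect individual predictions. Below is a minimal sketch of running the trained model on a single validation image, saving the mask overlay, and estimating how much of the frame each detected hand covers; the image path is a placeholder for any file in your validation split.

import cv2
from ultralytics import YOLO

model = YOLO("egohands_yolo/hand_segmentation/weights/best.pt")

# Placeholder path: any image from the validation split
results = model("images/val/image002.jpg")
result = results[0]

# Save the overlay of predicted masks and boxes
cv2.imwrite("prediction_overlay.jpg", result.plot())

# Estimate the fraction of the frame covered by each detected hand
if result.masks is not None:
    for i, mask in enumerate(result.masks.data):
        coverage = float(mask.sum()) / mask.numel()
        print(f"Hand {i}: {coverage:.1%} of the frame")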



Results

Figure: original images (left) and predicted segmentation results (right).

Full code is available at:


The YOLOv8n segmentation model achieves a precision of 0.985, a recall of 0.967, an mAP@0.5 of 0.990, and an mAP@0.5:0.95 of 0.818. While these metrics indicate strong performance, the model does not perform uniformly across all conditions; it does better on some image and video types than others.


It also occasionally misclassifies other objects as hands. The model's performance may improve with hyperparameter tuning and by incorporating additional datasets such as EgoYouTubeHands, which offers diverse egocentric hand views; HandOverFace, which provides examples of hands occluding faces; and GTEA, which captures hand-object interactions during everyday activities.



Get Help When You Need It

Developing a hand segmentation system that runs in real time and performs reliably across diverse conditions can be challenging. Whether you are adapting YOLOv8 for segmentation tasks or deploying your solution in augmented reality, robotics, or gesture recognition applications, expert guidance can significantly accelerate progress.


If you are working on a hand segmentation project and require personalized assistance, CodersArts provides expert support for both students and enterprises implementing computer vision technologies.


For Students

CodersArts supports students engaged in academic or research projects by offering help with:

  • Converting datasets such as EgoHands into YOLO-compatible formats

  • Debugging model training pipelines and preprocessing scripts

  • Improving segmentation accuracy and minimizing false positives

  • Annotating pixel-level masks and preparing training data

  • Evaluating model performance and visualizing results


For Enterprises

For companies building real-time hand tracking or gesture-based systems, CodersArts delivers production-ready solutions, including:

  • System architecture consultation for real-time hand segmentation

  • Edge deployment and optimization for AR/VR or embedded platforms

  • Custom training with industry-specific data, such as medical or industrial environments

  • Integration into interactive systems, robotics platforms, or gaming interfaces

  • Enhancing segmentation robustness under occlusions and variable lighting conditions


Visit www.codersarts.com or email contact@codersarts.com to access the support you need to build, optimize, and scale high-performance hand segmentation systems.




