How it works
Our real-time detection system uses MediaPipe for hand tracking and custom-trained neural networks (MLP/LSTM) to recognize ASL hand gestures, translating them into letters with high accuracy.
Real-Time ASL Recognition Using Neural Networks
ECS 170 - Introduction to Artificial Intelligence
This web application demonstrates real-time American Sign Language (ASL) alphabet recognition using deep learning and computer vision. The system captures video from your webcam, detects hand landmarks using MediaPipe, and classifies the hand gesture as an ASL alphabet letter using custom-trained neural networks.
The entire inference pipeline runs directly in your browser using TensorFlow.js, requiring no server-side processing. This enables real-time predictions without any backend server, making deployment simple and accessible.
System Architecture
Figure 1: End-to-end inference pipeline running entirely in the browser
AI Methodologies & Techniques
Hand Landmark Detection
We use Google's MediaPipe Hands to detect 21 3D landmarks on each hand in real-time. Each landmark represents a joint or fingertip position (x, y coordinates normalized to 0-1).
Landmarks: Wrist (1), Thumb (4), Index (4), Middle (4), Ring (4), Pinky (4) = 21 points

Feature Engineering
Raw landmark coordinates are transformed into a normalized feature vector:
1. Translation invariance: all coordinates are made relative to the wrist position
2. Scale invariance: coordinates are divided by the maximum absolute coordinate value
3. 2D only: only X and Y coordinates are used (no Z depth) for stability
Final feature vector: 21 landmarks × 2 coordinates = 42 features
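The three normalization steps above can be sketched as a small Python function. The function name is ours; the logic follows the preprocessing convention of the hand-gesture-recognition-using-mediapipe repository referenced below:

```python
def preprocess_landmarks(landmarks):
    """Convert 21 MediaPipe (x, y) landmarks into a 42-value feature vector.

    Steps mirror the pipeline described above:
      1. translate so the wrist (landmark 0) is the origin,
      2. scale by the maximum absolute coordinate,
      3. flatten to [x0, y0, x1, y1, ...].
    """
    wrist_x, wrist_y = landmarks[0]
    # 1. Translation invariance: coordinates relative to the wrist.
    rel = [(x - wrist_x, y - wrist_y) for x, y in landmarks]
    # 2. Scale invariance: divide by the largest absolute coordinate.
    max_abs = max(max(abs(x), abs(y)) for x, y in rel) or 1.0
    # 3. Flatten 21 (x, y) pairs into a 42-element vector.
    return [v / max_abs for pair in rel for v in pair]
```

After this step every feature lies in [-1, 1] and the wrist contributes (0, 0), so the vector is independent of where the hand sits in the frame and how large it appears.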
Neural Network Models
MLP Model
• Input: 42 features
• Hidden: 128 → 64 → 32 neurons
• Activation: ReLU + Dropout (0.3, 0.3)
• Output: 25 classes (no J or Z)
Trained on custom-collected hand gestures
Kaggle MLP Model
• Input: 42 features
• Hidden: 256 → 128 → 64 neurons
• Activation: ReLU + BatchNorm + Dropout
• Output: 28 classes (A-Z + special tokens)
Trained on the Kaggle ASL Alphabet dataset
MLP Model Training (Manual Data Collection)
The MLP Model was trained following the hand-gesture-recognition-using-mediapipe workflow [4], which supports custom data collection and model training:
1. Data Collection Process
- Run the app.py script with webcam enabled
- Press 'k' to enter keypoint logging mode
- Press 0-9 keys to assign class IDs to hand poses
- Keypoints are saved to keypoint.csv with class labels
- Each sample contains 42 normalized coordinates (21 landmarks × 2)
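The logging step above can be sketched in Python. The function name is ours; the row layout (class ID first, then the 42 coordinates) follows the keypoint.csv convention described above:

```python
import csv

def log_keypoint(class_id, features, path="keypoint.csv"):
    """Append one training sample: a class ID followed by 42 coordinates.

    A minimal sketch of the keypoint-logging step; the real app.py also
    handles webcam capture, key presses, and landmark preprocessing.
    """
    assert len(features) == 42, "expected 21 landmarks x 2 coordinates"
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([class_id, *features])
```

Each press of a digit key appends one labeled row, so the CSV grows into the training set consumed by the notebook that fits the MLP.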
2. MLP Architecture
The model uses a deeper architecture for improved performance:
Total parameters: ~16,700 (optimized for real-time inference)
3. Training Configuration
- Optimizer: Adam
- Loss: Sparse categorical cross-entropy
- Train/test split: 75/25
- Epochs: up to 1000 with early stopping (patience = 20)
- Batch size: 128
- Final accuracy: ~96% on the test set
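The early-stopping rule in the configuration above can be sketched as a small framework-free class (in practice this is Keras's EarlyStopping callback; the class and method names here are ours):

```python
class EarlyStopping:
    """Stop training when validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")  # best validation loss seen so far
        self.wait = 0             # epochs since the last improvement

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience
```

With patience = 20, the "up to 1000 epochs" cap is rarely reached: training halts as soon as validation loss plateaus for 20 consecutive epochs.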
4. Model Export for Web
- Trained model saved as .hdf5 (Keras format)
- Converted to TFLite with quantization for efficiency
- Converted to TensorFlow.js format (GraphModel)
- Deployed as static files for browser inference
Kaggle MLP Model Training
The Kaggle MLP Model was trained on the ASL Alphabet dataset from Kaggle:
- 87,000+ images across 29 classes (A-Z, space, delete, nothing)
- MediaPipe extracted keypoints from each image
- 80/10/10 train/validation/test split with stratification
- Deeper architecture (256→128→64) with BatchNorm for stability
- Early stopping and learning rate scheduling to prevent overfitting
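The stratified 80/10/10 split mentioned above can be sketched without any framework (real code would more likely use sklearn's train_test_split with stratify; the function name and seed here are ours):

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split sample indices into train/val/test, preserving per-class ratios."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # shuffle within each class before slicing
        n_train = int(len(idxs) * fractions[0])
        n_val = int(len(idxs) * fractions[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test
```

Stratifying matters here because the Kaggle classes are large but not identical in size; a naive random split could leave a rare class underrepresented in the validation set.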
Model Comparison
| Feature | MLP Model | Kaggle MLP Model |
|---|---|---|
| Training Data | Custom collected | Kaggle ASL dataset |
| Classes | 25 (no J, Z) | 28 (A-Z + del, space) |
| Parameters | ~16,700 | ~50,000+ |
| Model Format | GraphModel (TFLite) | LayersModel (Keras) |
| Smoothing | None (raw output) | Consecutive-frame voting |
| Best For | Fast response, simple gestures | Stable predictions, full alphabet |
Challenges & Solutions
Challenge: Prediction Flickering
Raw model predictions changed rapidly frame-to-frame, causing the displayed letter to flicker even when holding a steady pose.
Solution: Implemented temporal smoothing with consecutive-frame voting for the Kaggle model. The MLP model outputs raw predictions for faster response.
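A minimal sketch of the consecutive-frame voting described above (the class name and the window length n are our assumptions; the project may tune these differently):

```python
class ConsecutiveVoter:
    """Only accept a new letter after it is predicted `n` frames in a row."""

    def __init__(self, n=3):
        self.n = n
        self.candidate = None  # letter currently being voted on
        self.count = 0         # consecutive frames it has appeared
        self.current = None    # last accepted (displayed) letter

    def update(self, prediction):
        """Feed one per-frame prediction; return the smoothed letter."""
        if prediction == self.candidate:
            self.count += 1
        else:
            # Any different prediction restarts the vote.
            self.candidate, self.count = prediction, 1
        if self.count >= self.n:
            self.current = prediction
        return self.current
```

A single-frame misclassification resets the candidate's count but leaves the displayed letter unchanged, which is exactly what suppresses the flicker.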
Challenge: Training vs. Inference Mismatch
The model trained on static Kaggle images performed poorly on live webcam input. The Z-depth coordinate from MediaPipe was particularly unstable.
Solution: Switched to 2D-only features (X,Y without Z), matching the approach used in proven hand gesture recognition systems. This improved stability significantly.
Challenge: Browser Deployment
Running ML models in the browser while maintaining real-time performance required careful optimization.
Solution: Used TensorFlow.js for model inference and MediaPipe's WebAssembly-accelerated hand detection. Models were converted from Keras/TFLite to TFJS format.
Team Contributions
Frontend Development, UI/UX Design, Deployment, Model Training
Data Preprocessing, Feature Engineering, Model Architecture
Backend Integration, Testing
Model Training, Data Collection
Dataset Preparation, Research
Data Processing, Testing
Documentation, Research
Data Augmentation, Model Export
Quality Assurance, Documentation
References
- [1] Zhang, F., et al. "MediaPipe Hands: On-device Real-time Hand Tracking." CVPR Workshop 2020.
- [2] Kaggle ASL Alphabet Dataset. https://www.kaggle.com/datasets/grassknoted/asl-alphabet
- [3] TensorFlow.js Documentation. https://www.tensorflow.org/js
- [4] hand-gesture-recognition-using-mediapipe. https://github.com/Kazuhito00/hand-gesture-recognition-using-mediapipe