How it works
Our real-time detection system uses MediaPipe for hand tracking and custom-trained neural networks (MLP/LSTM) to recognize ASL hand gestures, translating them into letters with high accuracy.
Real-Time ASL Recognition Using Neural Networks
ECS 170 - Introduction to Artificial Intelligence
This web application demonstrates real-time American Sign Language (ASL) alphabet recognition using deep learning and computer vision. The system captures video from your webcam, detects hand landmarks using MediaPipe, and classifies the hand gesture as an ASL alphabet letter using custom-trained neural networks.
The entire inference pipeline runs directly in your browser using TensorFlow.js, requiring no server-side processing. This enables real-time predictions without any backend server, making deployment simple and accessible.
System Architecture
Figure 1: End-to-end inference pipeline running entirely in the browser
AI Methodologies & Techniques
Hand Landmark Detection
We use Google's MediaPipe Hands to detect 21 3D landmarks on each hand in real-time. Each landmark represents a joint or fingertip position (x, y coordinates normalized to 0-1).
Landmarks: Wrist (1), Thumb (4), Index (4), Middle (4), Ring (4), Pinky (4) = 21 points

Feature Engineering
Raw landmark coordinates are transformed into a normalized feature vector:
1. Translation invariance: all coordinates are made relative to the wrist position
2. Scale invariance: coordinates are divided by the maximum absolute coordinate value
3. 2D only: only X and Y coordinates are used (no Z depth) for stability
Final feature vector: 21 landmarks × 2 coordinates = 42 features
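The three normalization steps above can be sketched as a small Python function. The function name is ours; the logic follows the preprocessing convention of the hand-gesture-recognition-using-mediapipe repository referenced below:

```python
def preprocess_landmarks(landmarks):
    """Convert 21 MediaPipe (x, y) landmarks into a 42-value feature vector.

    Steps mirror the pipeline described above:
      1. translate so the wrist (landmark 0) is the origin,
      2. scale by the maximum absolute coordinate,
      3. flatten to [x0, y0, x1, y1, ...].
    """
    wrist_x, wrist_y = landmarks[0]
    # 1. Translation invariance: coordinates relative to the wrist.
    rel = [(x - wrist_x, y - wrist_y) for x, y in landmarks]
    # 2. Scale invariance: divide by the largest absolute coordinate.
    max_abs = max(max(abs(x), abs(y)) for x, y in rel) or 1.0
    # 3. Flatten 21 (x, y) pairs into a 42-element vector.
    return [v / max_abs for pair in rel for v in pair]
```

After this step every feature lies in [-1, 1] and the wrist contributes (0, 0), so the vector is independent of where the hand sits in the frame and how large it appears.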
Neural Network Models
MLP Model
• Input: 42 features
• Hidden: 128 → 64 → 32 neurons
• Activation: ReLU + Dropout (0.3, 0.3)
• Output: 25 classes (no J or Z)
Trained on custom-collected hand gestures
Kaggle MLP Model
• Input: 42 features
• Hidden: 256 → 128 → 64 neurons
• Activation: ReLU + BatchNorm + Dropout
• Output: 28 classes (A-Z + special tokens)
Trained on the Kaggle ASL Alphabet dataset
MLP Model Training (Manual Data Collection)
The MLP Model was trained following the hand-gesture-recognition-using-mediapipe workflow [4], which supports custom data collection and model training:
1. Data Collection Process
- Run the app.py script with webcam enabled
- Press 'k' to enter keypoint logging mode
- Press 0-9 keys to assign class IDs to hand poses
- Keypoints are saved to keypoint.csv with class labels
- Each sample contains 42 normalized coordinates (21 landmarks × 2)
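The logging step above can be sketched in Python. The function name is ours; the row layout (class ID first, then the 42 coordinates) follows the keypoint.csv convention described above:

```python
import csv

def log_keypoint(class_id, features, path="keypoint.csv"):
    """Append one training sample: a class ID followed by 42 coordinates.

    A minimal sketch of the keypoint-logging step; the real app.py also
    handles webcam capture, key presses, and landmark preprocessing.
    """
    assert len(features) == 42, "expected 21 landmarks x 2 coordinates"
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([class_id, *features])
```

Each press of a digit key appends one labeled row, so the CSV grows into the training set consumed by the notebook that fits the MLP.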
2. MLP Architecture
The model uses a deeper architecture for improved performance:
Total parameters: ~16,700 (optimized for real-time inference)
3. Training Configuration
- Optimizer: Adam
- Loss: Sparse categorical cross-entropy
- Train/test split: 75/25
- Epochs: up to 1000 with early stopping (patience = 20)
- Batch size: 128
- Final accuracy: ~96% on the test set
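The early-stopping rule in the configuration above can be sketched as a small framework-free class (in practice this is Keras's EarlyStopping callback; the class and method names here are ours):

```python
class EarlyStopping:
    """Stop training when validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")  # best validation loss seen so far
        self.wait = 0             # epochs since the last improvement

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience
```

With patience = 20, the "up to 1000 epochs" cap is rarely reached: training halts as soon as validation loss plateaus for 20 consecutive epochs.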
4. Model Export for Web
- Trained model saved as .hdf5 (Keras format)
- Converted to TFLite with quantization for efficiency
- Converted to TensorFlow.js format (GraphModel)
- Deployed as static files for browser inference
Kaggle MLP Model Training
The Kaggle MLP Model was trained on the ASL Alphabet dataset from Kaggle:
- 87,000+ images across 29 classes (A-Z, space, delete, nothing)
- MediaPipe extracted keypoints from each image
- 80/10/10 train/validation/test split with stratification
- Deeper architecture (256→128→64) with BatchNorm for stability
- Early stopping and learning rate scheduling to prevent overfitting
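The stratified 80/10/10 split mentioned above can be sketched without any framework (real code would more likely use sklearn's train_test_split with stratify; the function name and seed here are ours):

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split sample indices into train/val/test, preserving per-class ratios."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # shuffle within each class before slicing
        n_train = int(len(idxs) * fractions[0])
        n_val = int(len(idxs) * fractions[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test
```

Stratifying matters here because the Kaggle classes are large but not identical in size; a naive random split could leave a rare class underrepresented in the validation set.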
Model Comparison
| Feature | MLP Model | Kaggle MLP Model |
|---|---|---|
| Training Data | Custom collected | Kaggle ASL dataset |
| Classes | 25 (no J, Z) | 28 (A-Z + del, space) |
| Parameters | ~16,700 | ~50,000+ |
| Model Format | GraphModel (TFLite) | LayersModel (Keras) |
| Smoothing | None (raw output) | Consecutive-frame voting |
| Best For | Fast response, simple gestures | Stable predictions, full alphabet |
Challenges & Solutions
Challenge: Prediction Flickering
Raw model predictions changed rapidly frame-to-frame, causing the displayed letter to flicker even when holding a steady pose.
Solution: Implemented temporal smoothing with consecutive-frame voting for the Kaggle model. The MLP model outputs raw predictions for faster response.
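A minimal sketch of the consecutive-frame voting described above (the class name and the window length n are our assumptions; the project may tune these differently):

```python
class ConsecutiveVoter:
    """Only accept a new letter after it is predicted `n` frames in a row."""

    def __init__(self, n=3):
        self.n = n
        self.candidate = None  # letter currently being voted on
        self.count = 0         # consecutive frames it has appeared
        self.current = None    # last accepted (displayed) letter

    def update(self, prediction):
        """Feed one per-frame prediction; return the smoothed letter."""
        if prediction == self.candidate:
            self.count += 1
        else:
            # Any different prediction restarts the vote.
            self.candidate, self.count = prediction, 1
        if self.count >= self.n:
            self.current = prediction
        return self.current
```

A single-frame misclassification resets the candidate's count but leaves the displayed letter unchanged, which is exactly what suppresses the flicker.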
Challenge: Training vs. Inference Mismatch
The model trained on static Kaggle images performed poorly on live webcam input. The Z-depth coordinate from MediaPipe was particularly unstable.
Solution: Switched to 2D-only features (X,Y without Z), matching the approach used in proven hand gesture recognition systems. This improved stability significantly.
Challenge: Browser Deployment
Running ML models in the browser while maintaining real-time performance required careful optimization.
Solution: Used TensorFlow.js for model inference and MediaPipe's WebAssembly-accelerated hand detection. Models were converted from Keras/TFLite to TFJS format.
Team Contributions
Frontend Development, UI/UX Design, Deployment, Model Training
Data Preprocessing, Feature Engineering, Model Architecture
Backend Integration, Testing
Model Training, Data Collection
Dataset Preparation, Research
Data Processing, Testing
Documentation, Research
Data Augmentation, Model Export
Quality Assurance, Documentation
References
- [1] Zhang, F., et al. "MediaPipe Hands: On-device Real-time Hand Tracking." CVPR Workshop 2020.
- [2] Kaggle ASL Alphabet Dataset. https://www.kaggle.com/datasets/grassknoted/asl-alphabet
- [3] TensorFlow.js Documentation. https://www.tensorflow.org/js
- [4] hand-gesture-recognition-using-mediapipe. https://github.com/Kazuhito00/hand-gesture-recognition-using-mediapipe