I am building a small computer vision project (MoodAI Player) where a webcam detects facial emotions and triggers music + LED responses on a Raspberry Pi.
I am planning to train a YOLOv11 model, with data annotated in Roboflow, using 6 classes:
happy, sad, angry, neutral, fearful, sleepy
Dataset size: ~600 images (100 per class)
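For reference, a dataset like this would typically be described to the trainer with a YAML file along these lines (a sketch only; the paths are placeholders, and Roboflow's YOLO export generates an equivalent file for you):

```yaml
# Hypothetical data.yaml for the 6-class MoodAI dataset.
# Paths below are assumptions, not the actual project layout.
path: datasets/moodai
train: train/images
val: valid/images

names:
  0: happy
  1: sad
  2: angry
  3: neutral
  4: fearful
  5: sleepy
```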
❓ Main Issue (Data Collection)
I think my image collection strategy might be the weak point, so I want advice specifically on how training images should be captured.
Current approach:
Mostly frontal faces
Clean background
Good lighting
Face fills ~60–80% of frame
Some variation (angles, lighting, different people)
50 images with glasses / 50 without
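One way to answer the clean-vs-noisy question for yourself is to synthesize real-world degradations (lighting shifts, noise, occlusions) on copies of the clean images and see how validation accuracy changes. A minimal NumPy sketch, assuming face crops as HxWx3 uint8 arrays; the `degrade` function and its parameter ranges are my own illustration, not anything from Roboflow:

```python
import numpy as np

rng = np.random.default_rng(0)

def degrade(img: np.ndarray) -> np.ndarray:
    """Apply a random brightness shift, sensor noise, and a box occlusion
    (simulating hands/hair/shadows) to an HxWx3 uint8 image."""
    out = img.astype(np.float32)
    out += rng.uniform(-40, 40)                 # global brightness shift
    out += rng.normal(0, 8, out.shape)          # per-pixel sensor noise
    h, w = out.shape[:2]
    # random occluding rectangle covering roughly 1-15% of the crop
    oh = rng.integers(h // 8, h // 3)
    ow = rng.integers(w // 8, w // 3)
    y = rng.integers(0, h - oh)
    x = rng.integers(0, w - ow)
    out[y:y + oh, x:x + ow] = rng.integers(0, 255)
    return np.clip(out, 0, 255).astype(np.uint8)

clean = np.full((96, 96, 3), 128, dtype=np.uint8)  # stand-in for a face crop
noisy = degrade(clean)
print(noisy.shape, noisy.dtype)
```

Roboflow's built-in augmentations cover similar ground; a manual sketch like this is mainly useful for building a deliberately "hard" validation set that your augmentations never touch.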
🤔 Questions
Should I keep images clean or include more real-world noise (background clutter, different distances, etc.)?
How much variation is useful?
Side angles?
Low light / shadows?
Occlusions (hair, hands, etc.)?
Is 100 images per class too small, even with augmentation in Roboflow?
Any common mistakes in emotion datasets I should avoid?
(especially confusing classes like sleepy vs neutral)
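On common mistakes: one frequent pitfall in small emotion datasets is identity leakage, where the same person appears in both the train and validation splits, so the model partly memorizes faces rather than expressions. A stdlib-only sketch of a person-wise split; the `<person>_<emotion>_<n>.jpg` filename convention is my assumption:

```python
import random
from collections import defaultdict

def split_by_person(filenames, val_fraction=0.2, seed=42):
    """Split filenames so no person appears in both train and val.
    Assumes a '<person>_<emotion>_<n>.jpg' naming convention (hypothetical)."""
    by_person = defaultdict(list)
    for name in filenames:
        by_person[name.split("_")[0]].append(name)
    people = sorted(by_person)
    random.Random(seed).shuffle(people)
    n_val = max(1, int(len(people) * val_fraction))
    val_people = set(people[:n_val])
    train = [f for p, fs in by_person.items() if p not in val_people for f in fs]
    val = [f for p, fs in by_person.items() if p in val_people for f in fs]
    return train, val

# e.g. 10 people x 3 images each
files = [f"p{i}_happy_{j}.jpg" for i in range(10) for j in range(3)]
train, val = split_by_person(files)
print(len(train), len(val))  # → 24 6
```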
🎯 Goal
This is a university project, so I'm not aiming for perfect accuracy, just a model that works reasonably well in real time.
