Running Haar Cascades from Scratch in Rust
I owned one of those digital cameras from the early 2010s, and I always wondered how it could draw an orange bounding box around my face in real time on low-power hardware. It turns out they use an algorithm described by Paul Viola and Michael Jones in 2001, wherein you slide a window across the image and evaluate a cascade of Haar-like features at each position, looking for matches.
I implemented this approach to face detection in Rust. I didn't want to train my own Haar cascades, so I just took the models in OpenCV's GitHub repository and wrote some Python code to parse the XML and output the relevant information into a neater JSON format for my program to consume.
import json
import xml.etree.ElementTree as ET

# tree = ET.parse('haarcascade_frontalface_alt.xml')
tree = ET.parse('haarcascade_frontalface_default.xml')
root = tree.getroot()[0][1]
data = []
for stage in root:
    data.append({'threshold': float(stage[1].text), 'trees': []})
    for tree in stage[0]:
        features, threshold, lt, gt = tree[0]
        rect, tilted = features
        data[-1]['trees'].append({
            'threshold': float(threshold.text),
            'lt': float(lt.text),
            'gt': float(gt.text),
            # tilted.text is the string '0' or '1'; bool() of any non-empty
            # string is True, so compare against '1' instead
            'tilted': tilted.text.strip() == '1',
            'rect': [],
        })
        for r in rect:
            # each rect reads e.g. '6 4 12 9 -1.'; [:-1] drops the trailing dot
            x, y, w, h, weight = map(int, r.text[:-1].split())
            data[-1]['trees'][-1]['rect'].append({
                'x': x, 'y': y,
                'w': w, 'h': h,
                'weight': weight,
            })
with open('data.json', 'w') as fp:
    json.dump(data, fp)
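For reference, the emitted data.json has roughly the following shape. The numeric values here are illustrative, not taken from the real cascade:

```json
[
  {
    "threshold": 0.82,
    "trees": [
      {
        "threshold": 0.004,
        "lt": 0.03,
        "gt": 0.84,
        "tilted": false,
        "rect": [
          { "x": 6, "y": 4, "w": 12, "h": 9, "weight": -1 },
          { "x": 6, "y": 7, "w": 12, "h": 3, "weight": 3 }
        ]
      }
    ]
  }
]
```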
I use serde to deserialize the JSON into Rust structures. The v4l crate provides a stream of 640x480 frames from my webcam in YUYV format, where Y is luminance and U and V carry color. Each pair of pixels is encoded as four bytes, Y U Y V. Since we operate on greyscale images anyway, when we later process the image buffer we skip the U and V bytes and retain only the luminance.
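As a sketch of that layout: every four bytes encode two pixels as Y0 U Y1 V, so taking every other byte starting at index 0 yields the greyscale plane.

```rust
// YUYV packs two pixels into four bytes: [Y0, U, Y1, V].
// Stepping by two over the buffer keeps only the luminance samples.
fn luma(yuyv: &[u8]) -> Vec<u8> {
    yuyv.iter().step_by(2).copied().collect()
}

fn main() {
    // two pixels with luminance 10 and 20, sharing chrominance 128
    assert_eq!(luma(&[10, 128, 20, 128]), vec![10, 20]);
    println!("ok");
}
```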
We also report (very roughly) how much time the detect function takes to run. It comes out to about 1–2 milliseconds per frame.
// imports for the v4l crate's mmap capture API
use std::fs::File;
use std::io;
use std::time::Instant;
use v4l::buffer::Type;
use v4l::prelude::*;
use v4l::video::Capture;
use v4l::FourCC;

fn main() -> io::Result<()> {
    let cascade: Vec<Stage> =
        serde_json::from_reader(File::open("data.json").unwrap()).unwrap();
    let dev = Device::new(0)?;
    let mut format = dev.format()?;
    format.width = 640;
    format.height = 480;
    format.fourcc = FourCC::new(b"YUYV");
    dev.set_format(&format)?;
    let mut stream = MmapStream::with_buffers(&dev, Type::VideoCapture, 4)?;
    stream.next()?; // warmup
    loop {
        let (buf, _meta) = stream.next()?;
        let start = Instant::now();
        if detect(&cascade, buf, format.width as usize, format.height as usize) {
            print!("detected! ");
        } else {
            print!("not detected! ");
        }
        println!("{:.6}", start.elapsed().as_secs_f32());
    }
}
Before looking at the detect function, let's look at the Rust structures we need. First, on the deserialized Rect we implement two functions: the first returns a shifted copy of the Rect, and the second returns the number of pixels by which two Rects overlap.
#[derive(Serialize, Deserialize, Copy, Clone)]
struct Rect {
    x: usize,
    y: usize,
    w: usize,
    h: usize,
    weight: i32,
}

impl Rect {
    fn offset(&self, x: usize, y: usize) -> Self {
        Rect {
            x: self.x + x,
            y: self.y + y,
            w: self.w,
            h: self.h,
            weight: self.weight,
        }
    }

    fn overlap(&self, other: Rect) -> usize {
        let (x1, y1) = (self.x.max(other.x), self.y.max(other.y));
        let (x2, y2) = (
            (self.x + self.w).min(other.x + other.w),
            (self.y + self.h).min(other.y + other.h),
        );
        if x1 >= x2 || y1 >= y2 {
            0
        } else {
            (x2 - x1) * (y2 - y1)
        }
    }
}
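The post only shows Rect, but matching the JSON fields and the way the detect function uses them, the remaining deserialized structures presumably look like this. It's a sketch: in the real program these would also derive serde's Deserialize, which is omitted here (and Rect's impl elided) so the block stands alone:

```rust
// Rect as above, minus the impl; in the real program all three structs
// derive serde::Deserialize so serde_json can build them from data.json.
struct Rect {
    x: usize,
    y: usize,
    w: usize,
    h: usize,
    weight: i32,
}

// One weak classifier: a thresholded sum over up to three weighted rects.
struct Tree {
    threshold: f32,
    lt: f32,
    gt: f32,
    tilted: bool,
    rect: Vec<Rect>,
}

// One stage of the cascade: the tree outputs are summed and compared
// against the stage threshold.
struct Stage {
    threshold: f32,
    trees: Vec<Tree>,
}

fn main() {
    let stage = Stage {
        threshold: 0.82,
        trees: vec![Tree {
            threshold: 0.004,
            lt: 0.03,
            gt: 0.84,
            tilted: false,
            rect: vec![Rect { x: 6, y: 4, w: 12, h: 9, weight: -1 }],
        }],
    };
    assert_eq!(stage.trees[0].rect[0].weight, -1);
    println!("ok");
}
```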
We need a shrink function to rescale the image buffer. The algorithm used here is nearest-neighbor interpolation, which is simple but works well enough with Haar cascades. The shrink function takes an image buffer of w*h elements, each representing a greyscale pixel, and returns an iterator over the rescaled image.
fn shrink(buf: Vec<u8>, w: usize, h: usize, neww: usize) -> impl Iterator<Item = u8> {
    let scale = neww as f32 / w as f32;
    let out_w = neww;
    let out_h = (scale * h as f32) as usize;
    let mut i = 0;
    std::iter::from_fn(move || {
        if i >= out_w * out_h {
            return None;
        }
        // integer row and column in the output image, mapped back
        // to the nearest source pixel
        let (y, x) = (i / out_w, i % out_w);
        let by = (y as f32 / scale) as usize;
        let bx = (x as f32 / scale) as usize;
        i += 1;
        Some(buf[by * w + bx])
    })
}
The integral image, at a one-time preprocessing cost, lets us take the sum over any Rect in constant time by storing the cumulative sum of the values before each position. This code is quite sensitive to off-by-one errors, which I had to debug.
struct Integral {
    data: Vec<Vec<f32>>,
}

impl Integral {
    fn new<I: Into<f32>>(buf: impl IntoIterator<Item = I>, w: usize, h: usize) -> Self {
        // padded with a zero row and column so sums need no bounds checks
        let mut data = vec![vec![0f32; w + 1]; h + 1];
        for (i, value) in buf.into_iter().enumerate() {
            let (y, x) = (i / w + 1, i % w + 1);
            let north = data[y - 1][x];
            let west = data[y][x - 1];
            let z = data[y - 1][x - 1]; // north-west
            data[y][x] = value.into() + north + west - z;
        }
        Integral { data }
    }

    fn sum(&self, r: &Rect) -> f32 {
        let (x, y) = (r.x, r.y);
        let z = self.data[y][x];
        let n = self.data[y][x + r.w];
        let w = self.data[y + r.h][x];
        (self.data[y + r.h][x + r.w] - n - w + z) * r.weight as f32
    }
}
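As a quick standalone sanity check of the recurrence and the four-corner sum (using a free function rather than the struct):

```rust
// Build a padded summed-area table: I[y][x] holds the sum of all pixels
// above and to the left, with a zero row and column for easy indexing.
fn integral(buf: &[u8], w: usize, h: usize) -> Vec<Vec<f32>> {
    let mut data = vec![vec![0f32; w + 1]; h + 1];
    for y in 1..=h {
        for x in 1..=w {
            data[y][x] = buf[(y - 1) * w + (x - 1)] as f32
                + data[y - 1][x] + data[y][x - 1] - data[y - 1][x - 1];
        }
    }
    data
}

fn main() {
    // 3x3 image of all ones: any 2x2 box should sum to 4
    let img = [1u8; 9];
    let i = integral(&img, 3, 3);
    // box with corner (1, 1) and size 2x2, via the four-corner formula
    let boxsum = i[3][3] - i[1][3] - i[3][1] + i[1][1];
    assert_eq!(boxsum, 4.0);
    println!("ok");
}
```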
In the detect function we loop over a list of scales at which to apply the Haar cascade. I find that 0.1 and 0.07 work best for my situation and my distance from the webcam. Note how we step by two over the raw buffer to discard the chrominance bytes. For each scale we make two copies of the image shrunk down to that scale, and construct both a regular integral image of the pixel values and an integral image of each pixel value squared; we need the squared version to calculate the variance.

Next, we instantiate a vector to store the matches in. For each 24x24 window position we do some arithmetic to compute the variance normalization factor (I'll explain how I derived this later). Then we begin the cascade: for each stage we sum the results of its trees and compare that sum to the stage threshold. If it is smaller, the window fails the cascade and we continue to the next position. If the window passes every stage, we note it down in the results vector.

At the end of each scale we count overlapping results, and if more than three windows overlap by at least half the window area, we return true: there is a face in the webcam image. Otherwise, false.
fn detect(cascade: &[Stage], buf: &[u8], w: usize, h: usize) -> bool {
    // keep only the luminance bytes from the YUYV buffer
    let buf: Vec<u8> = buf.iter().step_by(2).copied().collect();
    for scale in [0.1, 0.07] {
        let sw = (w as f32 * scale) as usize;
        let sh = (h as f32 * scale) as usize;
        let img1 = shrink(buf.clone(), w, h, sw);
        let img2 = shrink(buf.clone(), w, h, sw);
        let integral = Integral::new(img1, sw, sh);
        let integral_squared = Integral::new(img2.map(|x| x as f32).map(|x| x * x), sw, sh);
        let mut results: Vec<Rect> = Vec::new();
        for y in 0..(sh - 24) {
            for x in 0..(sw - 24) {
                let window = Rect { x, y, w: 24, h: 24, weight: 1 };
                let sqsum = integral_squared.sum(&window);
                let sum = integral.sum(&window);
                let area = (window.w * window.h) as f32;
                let nf = area * sqsum - sum * sum;
                let variance_nf = if nf > 0.0 { 1.0 / f32::sqrt(nf) } else { 1.0 };
                'fail: {
                    for stage in cascade {
                        let sum: f32 = stage
                            .trees
                            .iter()
                            .map(|tree| {
                                let feature_sum: f32 = tree
                                    .rect
                                    .iter()
                                    .map(|r| integral.sum(&r.offset(x, y)))
                                    .sum();
                                if feature_sum * variance_nf < tree.threshold {
                                    tree.lt
                                } else {
                                    tree.gt
                                }
                            })
                            .sum();
                        if sum < stage.threshold {
                            break 'fail;
                        }
                    }
                    results.push(window);
                }
            }
        }
        // a true face yields many mutually overlapping windows
        // (288 is half of the 24x24 window area)
        for rect in &results {
            if results.iter().filter(|&n| rect.overlap(*n) > 288).count() > 3 {
                return true;
            }
        }
    }
    false
}
This approach works surprisingly well. It is not rotation invariant, though, and is sensitive to tilts and similar movements. After finishing this code I read some research on alternative methods, which I would implement in the future were I to need face detection again.
Appendix
From a StackOverflow question and by digging through the OpenCV source code myself, I figured out how they compute the variance normalization factor. Here is the code that determines each tree's contribution within a stage:
const CascadeClassifierImpl::Data::Stump& stump = cascadeStumps[i];
double value = featureEvaluator(stump.featureIdx);
tmp += value < stump.threshold ? stump.left : stump.right;
featureEvaluator is a function pointer, except not really: it's an instance of the class HaarEvaluator with the call operator overloaded like so:
float operator()(int featureIdx) const
{ return optfeaturesPtr[featureIdx].calc(pwin) * varianceNormFactor; }
...
inline float HaarEvaluator::OptFeature :: calc( const int* ptr ) const
{
float ret = weight[0] * CALC_SUM_OFS(ofs[0], ptr) +
weight[1] * CALC_SUM_OFS(ofs[1], ptr);
if( weight[2] != 0.0f )
ret += weight[2] * CALC_SUM_OFS(ofs[2], ptr);
return ret;
}
varianceNormFactor, the value we're searching for, is set elsewhere with the following code:
pwin = &sbuf.at<int>(pt) + s.layer_ofs;
const int* pq = (const int*)(pwin + sqofs);
int valsum = CALC_SUM_OFS(nofs, pwin);
unsigned valsqsum = (unsigned)(CALC_SUM_OFS(nofs, pq));
double area = normrect.area();
double nf = area * valsqsum - (double)valsum * valsum;
if( nf > 0. )
{
    nf = std::sqrt(nf);
    varianceNormFactor = (float)(1./nf);
    return area*varianceNormFactor < 1e-1;
}
else
{
    varianceNormFactor = 1.f;
    return false;
}
nofs is an array of corners, which is passed to the macro CALC_SUM_OFS that calculates the sum from the integral image. I'm unsure why they multiply valsqsum by area; mathematically I would expect them to divide both valsqsum and valsum by area, yielding the expected values, then evaluate valsqsum - valsum*valsum for the variance.
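For what it's worth, one way to reconcile the two forms: multiplying by area instead of dividing only scales the result by a constant factor of area squared, since area*sqsum - sum*sum = area*area * (sqsum/area - (sum/area)*(sum/area)). So OpenCV's nf is area² times the variance, and 1/sqrt(nf) is 1/(area*sigma). A quick numeric check of that identity (the sample values and the helper name are mine):

```rust
// Compare OpenCV's un-normalized form of nf against the textbook variance.
fn nf_and_variance(xs: &[f64]) -> (f64, f64) {
    let area = xs.len() as f64;
    let sum: f64 = xs.iter().sum();
    let sqsum: f64 = xs.iter().map(|x| x * x).sum();
    let nf = area * sqsum - sum * sum;             // OpenCV: area * sqsum - sum^2
    let var = sqsum / area - (sum / area).powi(2); // textbook: E[x^2] - E[x]^2
    (nf, var)
}

fn main() {
    let xs = [3.0, 7.0, 1.0, 5.0];
    let (nf, var) = nf_and_variance(&xs);
    let area = xs.len() as f64;
    // nf is exactly area^2 times the variance
    assert!((nf - area * area * var).abs() < 1e-9);
    println!("ok");
}
```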