Lanczos Image Resizing
I wrote a document viewer to read my PDF files, but I wanted to view my CBZ files with it too. CBZ files are ZIP archives of images, used as a format to store comic books. The first problem I ran into was that resizing the images with SFML left aliasing artifacts and jaggies. I'd heard that Lanczos was one of the best algorithms for resizing images, so I implemented it:
#include <algorithm> // std::clamp
#include <cassert>
#include <cmath>     // sin, ceil, floor; on MSVC, M_PI also needs _USE_MATH_DEFINES
#include <cstdint>   // uint8_t
#include <cstdlib>   // malloc, free
#include <vector>

// Follows the definition on Wikipedia:
// https://en.wikipedia.org/wiki/Lanczos_resampling
double lanczos2(double x) {
    constexpr double a = 2;
    if (x == 0)
        return 1;
    else if (-a <= x && x < a)
        return a * sin(M_PI * x) * sin(M_PI * x / a) / (M_PI * M_PI * x * x);
    else
        return 0;
}
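For reference, this is the a = 2 case of the general Lanczos kernel, which is the sinc function windowed by a wider sinc:

L(x) = 1                                            if x = 0,
L(x) = a * sin(pi x) * sin(pi x / a) / (pi^2 x^2)   if -a <= x < a,
L(x) = 0                                            otherwise.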
sf::Image resize(sf::Image img, float zoom) {
    // The src & out buffers are w * h * 4 bytes: 4 channels, RGBA.
    auto [src_w, src_h] = img.getSize();
    const uint8_t* src_data = img.getPixelsPtr();
    unsigned int out_w = src_w * zoom;
    unsigned int out_h = src_h * zoom;
    uint8_t* out_data = (uint8_t*)malloc(out_w * out_h * 4);
    // For each output pixel we can map/scale it to a corresponding
    // position in the source image; we then take the (up to) five nearest
    // pixels to this position along each axis, compute their distances
    // from it, and apply the Lanczos function to those distances (it's
    // called Lanczos resampling, after all). The resulting numbers are
    // called "weights."
    //
    // This is a memoization/caching step -- evaluating lanczos2 for all
    // 25 tap pairs of every output pixel (x, y) is expensive, but rows
    // share their x-weights and columns their y-weights, so precomputing
    // one entry per output row and column gives huge savings/speedups.
    struct KernelEntry {
        int lo, hi;
        double weights[5];
    };
    // Compute the weights in the y-direction. (A vector, since variable-
    // length arrays aren't standard C++.)
    std::vector<KernelEntry> weights_ys(out_h);
    for (int y = 0; y < (int)out_h; ++y) {
        double srcY = (double)y / zoom;
        // The kernel support is [srcY - 2, srcY + 2]; a tap exactly on the
        // boundary gets weight 0, so at most 5 taps matter.
        int lo = ceil(srcY - 2);
        if (lo < 0)
            lo = 0;
        int hi = floor(srcY + 2);
        if (hi > (int)src_h - 1)
            hi = src_h - 1;
        assert(hi - lo < 5);
        weights_ys[y] = { lo, hi };
        for (int y_ = lo; y_ <= hi; ++y_) {
            weights_ys[y].weights[y_ - lo] = lanczos2(y_ - srcY);
        }
    }
    // Compute the weights in the x-direction, the same way.
    std::vector<KernelEntry> weights_xs(out_w);
    for (int x = 0; x < (int)out_w; ++x) {
        double srcX = (double)x / zoom;
        int lo = ceil(srcX - 2);
        if (lo < 0)
            lo = 0;
        int hi = floor(srcX + 2);
        if (hi > (int)src_w - 1)
            hi = src_w - 1;
        assert(hi - lo < 5);
        weights_xs[x] = { lo, hi };
        for (int x_ = lo; x_ <= hi; ++x_) {
            weights_xs[x].weights[x_ - lo] = lanczos2(x_ - srcX);
        }
    }
    // Loop through each output pixel. (This is one of the two OpenMP
    // pragmas mentioned below; the other is on the blur loop.)
    #pragma omp parallel for
    for (int y = 0; y < (int)out_h; ++y) {
        for (int x = 0; x < (int)out_w; ++x) {
            int out_i = (y * out_w + x) * 4;
            out_data[out_i + 3] = 255; // alpha = 255
            // Loop through each color channel independently (RGB).
            for (int channel = 0; channel < 3; ++channel) {
                double sum = 0;
                double total_weight = 0;
                // Accumulate the weighted source pixels under the kernel.
                for (int y_ = weights_ys[y].lo; y_ <= weights_ys[y].hi; ++y_) {
                    for (int x_ = weights_xs[x].lo; x_ <= weights_xs[x].hi; ++x_) {
                        double wy = weights_ys[y].weights[y_ - weights_ys[y].lo];
                        double wx = weights_xs[x].weights[x_ - weights_xs[x].lo];
                        double weight = wx * wy;
                        sum += src_data[(y_ * src_w + x_) * 4 + channel] * weight;
                        total_weight += weight;
                    }
                }
                out_data[out_i + channel] = std::clamp((int)(sum / total_weight), 0, 255);
            }
        }
    }
    // sf::Image copies the pixels, so the scratch buffer can be freed.
    sf::Image out({ out_w, out_h }, out_data);
    free(out_data);
    return out;
}
The core algorithm is simple: loop over each output pixel and map it to a position in the source image. Think of the nearest pixels to this position as square buckets of paint (I know, a cardinal sin, but nevertheless) and multiply each of their values by the appropriate weight: the Lanczos function applied to the distance between the mapped position and that pixel's integer coordinate.
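To make the mapping concrete, here's a tiny standalone sketch (not part of the viewer; the zoom level and output column are made-up values for illustration) that prints the taps and weights for one output column:

#include <cmath>
#include <cstdio>

double lanczos2(double x) {
    constexpr double a = 2;
    if (x == 0)
        return 1;
    else if (-a <= x && x < a)
        return a * sin(M_PI * x) * sin(M_PI * x / a) / (M_PI * M_PI * x * x);
    else
        return 0;
}

int main() {
    double zoom = 0.75;            // hypothetical zoom level
    int x = 10;                    // hypothetical output column
    double srcX = x / zoom;        // maps to 13.333... in the source
    int lo = (int)ceil(srcX - 2);  // first tap: 12
    int hi = (int)floor(srcX + 2); // last tap: 15
    for (int x_ = lo; x_ <= hi; ++x_)
        printf("tap %d: weight %f\n", x_, lanczos2(x_ - srcX));
}

The printed weights don't sum exactly to 1, which is why the main loop divides by total_weight at the end.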
The two loops in the middle compute the "kernel entries" and are, strictly speaking, unnecessary. But they precompute the mapping of output rows and columns to source positions, along with the corresponding (up to) five Lanczos weights each, and since every pixel in an output row shares the same x-weights and every pixel in a column the same y-weights, this saves a lot of time compared to evaluating the Lanczos function 25 times for every single output pixel.
I noticed slight jaggies in the final output image around text and the like, but prefiltering the image with a Gaussian blur before resampling fixed this. The visual results are much better in my opinion, although at the expense of a little sharpness. Here is the code for computing a 5x5 Gaussian kernel with a given sigma, which is chosen based on the zoom level:
double gaussian_kernel[5][5];

// Build a normalized 5x5 Gaussian kernel as the outer product of a 1-D
// Gaussian (sampled at offsets -2..2) with itself.
void init_gaussian_kernel(double sigma) {
    double a[5];
    for (int i = -2; i < 3; ++i) {
        a[i + 2] = exp(-.5 * i * i / (sigma * sigma));
    }
    double sum = 0;
    for (int y = 0; y < 5; ++y) {
        for (int x = 0; x < 5; ++x) {
            gaussian_kernel[y][x] = a[y] * a[x];
            sum += gaussian_kernel[y][x];
        }
    }
    // Normalize so the weights sum to 1.
    for (int y = 0; y < 5; ++y) {
        for (int x = 0; x < 5; ++x) {
            gaussian_kernel[y][x] /= sum;
        }
    }
}
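As a quick sanity check, this throwaway snippet (assuming the definitions above are in scope) prints the kernel for sigma = 1.0, which is what the resize code below picks at zoom = 0.5; the weights should be symmetric and sum to 1:

#include <cstdio>

int main() {
    init_gaussian_kernel(1.0);
    for (int y = 0; y < 5; ++y) {
        for (int x = 0; x < 5; ++x)
            printf("%.4f ", gaussian_kernel[y][x]);
        printf("\n");
    }
}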
And in the resize function, we add this near the beginning:
if (zoom < 1.0) { // if downsampling, apply a Gaussian blur first for better results
    init_gaussian_kernel(0.5 * (1.0 / zoom));
    uint8_t* filtered_src_data = (uint8_t*)malloc(src_w * src_h * 4);
    #pragma omp parallel for
    for (int y = 0; y < (int)src_h; ++y) {
        for (int x = 0; x < (int)src_w; ++x) {
            // Clamp the 5x5 window to the image borders.
            int y_lo = y - 2;
            if (y_lo < 0)
                y_lo = 0;
            int y_hi = y + 2;
            if (y_hi > (int)src_h - 1)
                y_hi = src_h - 1;
            int x_lo = x - 2;
            if (x_lo < 0)
                x_lo = 0;
            int x_hi = x + 2;
            if (x_hi > (int)src_w - 1)
                x_hi = src_w - 1;
            double r = 0, g = 0, b = 0, wsum = 0;
            for (int ky = y_lo; ky <= y_hi; ++ky) {
                for (int kx = x_lo; kx <= x_hi; ++kx) {
                    // Index the kernel relative to the window center, not its
                    // clamped corner, so the weights stay aligned at borders.
                    double w = gaussian_kernel[ky - y + 2][kx - x + 2];
                    r += (double)src_data[(ky * src_w + kx) * 4 + 0] * w;
                    g += (double)src_data[(ky * src_w + kx) * 4 + 1] * w;
                    b += (double)src_data[(ky * src_w + kx) * 4 + 2] * w;
                    wsum += w;
                }
            }
            // Renormalize by the clamped window's total weight so the
            // borders don't darken.
            filtered_src_data[(y * src_w + x) * 4 + 0] = (int)(r / wsum + 0.5);
            filtered_src_data[(y * src_w + x) * 4 + 1] = (int)(g / wsum + 0.5);
            filtered_src_data[(y * src_w + x) * 4 + 2] = (int)(b / wsum + 0.5);
            filtered_src_data[(y * src_w + x) * 4 + 3] = 255;
        }
    }
    src_data = filtered_src_data;
    // (This buffer should be freed once the resampling below is done.)
}
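An aside on a possible optimization (not what the viewer does): since the 2-D kernel is the outer product of a 1-D Gaussian with itself, the blur could instead run as two 1-D passes, 10 multiplies per pixel instead of 25. Here's a sketch of the horizontal pass, where kernel1d is the 5-tap 1-D Gaussian normalized to sum to 1; the vertical pass is the same with the roles of x and y swapped:

#include <cstdint>

void blur_horizontal(const uint8_t* src, uint8_t* dst,
                     int w, int h, const double kernel1d[5]) {
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            double acc[3] = { 0, 0, 0 };
            double wsum = 0;
            for (int k = -2; k <= 2; ++k) {
                int sx = x + k;
                if (sx < 0 || sx >= w)
                    continue; // handle the border by renormalizing below
                for (int c = 0; c < 3; ++c)
                    acc[c] += src[(y * w + sx) * 4 + c] * kernel1d[k + 2];
                wsum += kernel1d[k + 2];
            }
            for (int c = 0; c < 3; ++c)
                dst[(y * w + x) * 4 + c] = (uint8_t)(acc[c] / wsum + 0.5);
            dst[(y * w + x) * 4 + 3] = 255;
        }
    }
}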
Also, I added two lines of code leveraging OpenMP (the two #pragma omp parallel for lines shown above) to easily parallelize both the Gaussian blur and the resampling loops. This brought page render times down from around 600ms to only 100ms, including the time taken to decode the JPEG/PNG! 100ms is fast enough that there's no noticeable latency when flipping through pages.
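For reference, enabling OpenMP is just a compiler flag; with GCC, a build line like this works (viewer.cpp is a stand-in for your source file, and the SFML link flags may differ on your setup):

g++ -std=c++17 -O2 -fopenmp viewer.cpp -o viewer -lsfml-graphics -lsfml-window -lsfml-system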
It amazes me that the code needs to touch 6,774,840 pixels (2091 x 3240), 25 + 25 times each for the blur and the resample, for each of three color channels: at least a billion addition and multiplication operations. And my computer can complete them in less than 100ms?!? Sure, it's spread across multiple cores, but someone tell me I'm doing the math wrong. I hadn't fully appreciated how fast CPUs are until now.
It's truly a testament to the ingenious engineering at every level of the stack in modern computing systems that I can write such trash code, yet through the power of compiler optimizations, CPU black magic, and friendship, it renders 2091x3240 images in under 100ms. The embarrassingly parallel nature of these loops and the 16 cores of my laptop are really saving my ass.
It would be better to implement Lanczos resampling on the GPU, but I'm not familiar with GPU programming, and 100ms is fast enough.