Secret Performance Hack: Count Sort?
January 13, 2021
I was doing some recreational programming tonight on leetcode and I stumbled into an interesting performance case study.
The Problem
This is LeetCode Problem: 881 Boats to Save People.
Boats to Save People
The i-th person has weight people[i], and each boat can carry a maximum weight of limit.
Each boat carries at most 2 people at the same time, provided the sum of the weight of those people is at most limit.
Return the minimum number of boats to carry every given person. (It is guaranteed each person can be carried by a boat.)
Example 1:
Input: people = [1,2], limit = 3
Output: 1
Explanation: 1 boat (1, 2)
From LeetCode
My code solved this problem in a blazing 452ms and 21 MB of memory. Not too shabby!
class Solution:
    def numRescueBoats(self, people: List[int], limit: int) -> int:
        people.sort()
        start, end = 0, len(people) - 1
        boats = 0
        while start <= end:
            while start < end and people[end] > limit - people[start]:
                boats += 1
                end -= 1
            boats += 1
            start += 1
            end -= 1
        return boats
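As a quick sanity check of the greedy approach, here is a runnable version of the solution above with a made-up input (not one of LeetCode's test cases):

```python
from typing import List

class Solution:
    def numRescueBoats(self, people: List[int], limit: int) -> int:
        people.sort()
        start, end = 0, len(people) - 1
        boats = 0
        while start <= end:
            # Heaviest remaining person can't pair with the lightest: send them alone.
            while start < end and people[end] > limit - people[start]:
                boats += 1
                end -= 1
            # Pair the lightest and heaviest remaining people on one boat.
            boats += 1
            start += 1
            end -= 1
        return boats

# With limit 3, [3,2,2,1] needs boats (3), (2), and (2,1).
print(Solution().numRescueBoats([3, 2, 2, 1], 3))  # 3
```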
After completing the problem, I realized that this problem is perfectly suited for Count Sort.
Count Sort
Count sort is one of my favorite algorithms because, for the right data, it sorts in linear time. That's O(n)!
Count Sort is a special sorting algorithm that works on data that is guaranteed to fit into a fixed range of size w. The stable version of Count Sort is O(n+w) in both time and space complexity, where n is the number of things to be sorted. The non-stable version of Count Sort can be done with only O(w) additional space.
Count sort works by tallying elements into a fixed-length array C according to their values in one pass. Then count sort scans through C and emits the elements in sorted order.
Here is how I implemented it for this problem. In this example, people is the array I needed to sort.
count = [0]*(limit+1)
for i in people:
    count[i] += 1
p = 0
for i in range(len(count)):
    while count[i]:
        people[p] = i
        count[i] -= 1
        p += 1
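A quick way to convince yourself the in-place count sort is correct is to compare it against Python's sorted() on random data. This is a small self-check sketch, not part of the original solution; the function name is my own:

```python
import random

def count_sort_inplace(people, limit):
    # Tally each weight, then rewrite people in ascending order.
    count = [0] * (limit + 1)
    for i in people:
        count[i] += 1
    p = 0
    for i in range(len(count)):
        while count[i]:
            people[p] = i
            count[i] -= 1
            p += 1

limit = 30000
people = [random.randint(1, limit) for _ in range(1000)]
expected = sorted(people)
count_sort_inplace(people, limit)
print(people == expected)  # True
```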
I was thrilled to use my favorite sorting algorithm! Count Sort should be faster than Timsort (Python's built-in sorting algorithm), which has an average time complexity of O(n log n) and O(n) space complexity.
Stable Count Sort
My stable count sort solution solved the problem in 552ms and 21.1 MB of memory:
class Solution:
    def numRescueBoats(self, people: List[int], limit: int) -> int:
        # Count Sort
        count = [0]*(limit+1)
        for i in people:
            count[i] += 1
        total = 0
        for i in range(limit+1):
            count[i], total = total, count[i]+total
        n = [0]*len(people)
        for i in people:
            n[count[i]] = i
            count[i] += 1
        people = n
        start, end = 0, len(people)-1
        boats = 0
        while start <= end:
            if people[end] <= limit - people[start]:
                start += 1
            boats += 1
            end -= 1
        return boats
There was a slight increase in memory usage: 0.1 MB. This isn't unexpected, since Timsort uses O(n) space, whereas stable count sort uses O(n+w) space. But I was not expecting the runtime to increase by 100ms!
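The "stable" in stable count sort means that equal keys keep their original relative order, which only matters when elements carry more than just their key. A hypothetical sketch (my own helper, not from the solution) sorting (weight, name) pairs by weight shows the same prefix-sum trick used above preserving the order of ties:

```python
def stable_count_sort(items, key_max, key=lambda x: x):
    # Tally keys, then prefix-sum so count[k] holds the first
    # output index for key k, as in the solution above.
    count = [0] * (key_max + 1)
    for item in items:
        count[key(item)] += 1
    total = 0
    for k in range(key_max + 1):
        count[k], total = total, count[k] + total
    out = [None] * len(items)
    for item in items:  # a left-to-right pass keeps ties in input order
        out[count[key(item)]] = item
        count[key(item)] += 1
    return out

pairs = [(2, "a"), (1, "b"), (2, "c"), (1, "d")]
print(stable_count_sort(pairs, 2, key=lambda p: p[0]))
# [(1, 'b'), (1, 'd'), (2, 'a'), (2, 'c')]
```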
Let’s see how the non-stable count sort does.
Non-Stable Count Sort
The non-stable count sort solution solved the problem in 500ms and 20.1 MB of memory:
class Solution:
    def numRescueBoats(self, people: List[int], limit: int) -> int:
        # Count Sort
        count = [0]*(limit+1)
        for i in people:
            count[i] += 1
        p = 0
        for i in range(len(count)):
            while count[i]:
                people[p] = i
                count[i] -= 1
                p += 1
        start, end = 0, len(people)-1
        boats = 0
        while start <= end:
            if people[end] <= limit - people[start]:
                start += 1
            boats += 1
            end -= 1
        return boats
I was happy to see the decrease in memory use. I predicted O(w) memory usage because I modify the people array in place, so this 0.9 MB decrease was expected. The non-stable version also ran 50ms faster than the stable version, which was expected, since the non-stable version has fewer overhead operations.
But why did Count Sort not beat out Timsort?
Why would Timsort beat Count Sort?
Timsort is a hybrid stable sorting algorithm, derived from merge sort and insertion sort, designed to perform well on many kinds of real-world data.
Honestly, I don’t know the answer here. But I have some ideas:
Best Case
Timsort has a best-case performance of O(n). This could beat out Count Sort (also O(n)) if Timsort's actual performance is faster.
Timsort is very adaptive and works best with real-world data, according to Wikipedia. It could just be smarter than me and outperform count sort here.
Timsort is Written in C
The code for Timsort in CPython's built-in sort() implementation is actually written in C!
This would absolutely have a big performance impact over my implementation of count sort in Python.
Leetcode is a Poor Test Harness
LeetCode isn't really meant to measure these kinds of performance differences. There are pytest plugins I could use to measure this code over thousands of iterations to account for statistical variation.
The tests provided by LeetCode are also limited; the inputs are only so large.
1 <= people.length <= 50000
1 <= people[i] <= limit <= 30000
It could be the case that CPython's Timsort does outperform Count Sort at these small scales, but that Count Sort pulls ahead once inputs reach the millions or billions.
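A rough sketch of what a fairer measurement could look like outside LeetCode, using timeit from the standard library to average many runs. The input size and repeat count here are my own arbitrary choices, not LeetCode's:

```python
import random
import timeit

def count_sort(people, limit):
    # Non-stable count sort, as in the solution above.
    count = [0] * (limit + 1)
    for i in people:
        count[i] += 1
    p = 0
    for i in range(len(count)):
        while count[i]:
            people[p] = i
            count[i] -= 1
            p += 1

limit = 30000
data = [random.randint(1, limit) for _ in range(20000)]

# sorted() makes its own copy; copy data for count_sort so each run
# starts from the same unsorted input.
timsort_t = timeit.timeit(lambda: sorted(data), number=20)
count_t = timeit.timeit(lambda: count_sort(data[:], limit), number=20)
print(f"Timsort:    {timsort_t:.3f}s")
print(f"Count sort: {count_t:.3f}s")
```

On my machine-less sketch the absolute numbers don't matter; the point is that averaging over many iterations smooths out the noise a single LeetCode submission can't.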
Conclusion
I believe Timsort outperformed Count Sort for all the reasons I listed above and probably more reasons that I didn’t even think of!
Software performance is a super hard problem. There are layers and layers of abstraction to dig into, as well as low-level physical variance that creeps in when measuring performance.
This was a fun exercise to go through, and I hope you learned something; I definitely did.