I happened across a video about the Collatz Conjecture a couple months back and it seemed like a good way to continue learning Rust. The Conjecture is straightforward to implement, and optimizing it has led to some interesting places. I’m no mathematician, nor do I think I can solve the conjecture, but it was a good bit of fun to explore.
What is the Collatz Conjecture?
The Collatz Conjecture is an unsolved problem in mathematics centered on a deceptively simple algorithm known as 3x+1. Given a number, if it’s odd, multiply it by 3 and add 1; if it’s even, divide it by 2. In Python, that’s:
while True:
    if num % 2 == 0:
        num = num // 2
    else:
        num = num * 3 + 1
The Conjecture is that all positive integers converge at 1. Whether this is true is what’s unsolved.
What could I accomplish with a bit of Rust? Well, I can’t prove the conjecture unless I’ve gravely misunderstood Rust’s capabilities – testing all positive integers would be a feat. I could, however, disprove it. There are two ways to disprove it: find a number that grows infinitely with 3x+1, or find a cycle of numbers that doesn’t resolve to 1. The first is a pesky infinity, so that’s out. I figured I’d explore the second.
There is one known cycle of positive integers: 1 -> 4 -> 2 -> 1 -> .... That resolves to 1, but if there were another cycle which didn’t, that would disprove the Conjecture.
Unfortunately, Terence Tao has already shown almost all numbers will resolve. Further, all integers up through 2^68 have been tested. I couldn’t contribute to the mathematics, but that’s not surprising. I still wanted to try implementing it, and to see how much compute it’d take me to solve for all integers through 2^68.
Implementing 3x+1
I started with three implementations: recursive, non-recursive, and the “shortcut” method from the Wikipedia page. The implementations keep track of the number of steps it takes to reach 1, which was used for testing.
Recursive Implementation
Rust’s match syntax made this implementation especially clean:
pub fn recursive(num: usize) -> usize {
    fn _recurse(num: usize, count: usize) -> (usize, usize) {
        match num {
            1 => (num, count),
            num if num % 2 == 0 => _recurse(num / 2, count + 1),
            _ => _recurse(3 * num + 1, count + 1),
        }
    }
    _recurse(num, 0).1
}
I was curious if Rust has tail call optimization, so I used Compiler Explorer to get the assembly generated by this function. It confirmed that, yes, Rust does TCO:
example::recursive:
xor eax, eax
cmp rdi, 1
je .LBB0_3
.LBB0_1:
mov rcx, rdi
shr rcx
test dil, 1
lea rdi, [rdi + 2*rdi + 1]
cmove rdi, rcx
inc rax
cmp rdi, 1
jne .LBB0_1
.LBB0_3:
ret
It’s a bit hard to read, but if I ugly this up a little bit, the optimization goes away:
pub fn recursive(num: usize) -> usize {
    fn _recurse(num: usize, count: usize) -> (usize, usize) {
        let (foo, bar) = match num {
            1 => (num, count),
            x if x % 2 == 0 => _recurse(num / 2, count + 1),
            _ => _recurse(3 * num + 1, count + 1),
        };
        return (foo, bar + 1);
    }
    _recurse(num, 0).1
}
example::recursive:
xor esi, esi
jmp example::recursive::_recurse
example::recursive::_recurse:
mov rax, rsi
cmp rdi, 1
je .LBB1_2
push rax
mov rcx, rdi
shr rcx
test dil, 1
lea rdi, [rdi + 2*rdi + 1]
cmove rdi, rcx
inc rax
mov rsi, rax
call example::recursive::_recurse
add rsp, 8
.LBB1_2:
inc rax
ret
Compilers are finicky things, and one shouldn’t rely on tail call optimization, but it’s nice when it happens.
Simple and Shortcut Implementations
Both of these implementations closely mirror the Python snippet above. Again, Rust’s match syntax kept it succinct and readable:
pub fn simple(num: usize) -> usize {
    let (mut num, mut count) = (num, 0);
    while num != 1 {
        num = match num {
            num if num % 2 == 0 => num / 2,
            _ => 3 * num + 1,
        };
        count += 1;
    }
    count
}

pub fn shortcut(num: usize) -> usize {
    let (mut num, mut count) = (num, 0);
    while num != 1 {
        (num, count) = match num {
            num if num % 2 == 0 => (num / 2, count + 1),
            // When num is odd, 3 * num + 1 is always even, so we can
            // halve immediately; +2 accounts for the skipped step.
            _ => ((3 * num + 1) / 2, count + 2),
        };
    }
    count
}
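The step counts gave me an easy correctness check. 27 famously takes 111 steps to reach 1, so a test like this (a sketch, not necessarily the repo’s actual tests) keeps all three implementations honest:

#[test]
fn takes_111_steps_for_27() {
    // 27 has a well-known total stopping time of 111 steps, peaking at 9232.
    assert_eq!(simple(27), 111);
    assert_eq!(recursive(27), 111);
    // shortcut counts the skipped halvings, so the total matches.
    assert_eq!(shortcut(27), 111);
}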
Benchmarking Performance
The shortcut implementation should be the fastest, but how much faster is it? Is it fast enough to get to 2^68?
Rust has robust tooling for benchmarks. It was quick to set up, but the code is too long to post in full. You can find the details in the full implementation.
There are three tests which solve 3x+1 for 5,000 consecutive numbers. Each has a different starting number to better understand how the size of the number affects performance. Presumably, fewer steps are needed to solve for smaller numbers.
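Each benchmark looks roughly like this (a minimal sketch using the nightly bench harness; the name matches the output below, but the exact setup lives in the repo):

#![feature(test)]
extern crate test;

use test::Bencher;

// Solve each number in the small range per iteration; black_box keeps
// the optimizer from eliding the calls.
#[bench]
fn bench_simple_small(b: &mut Bencher) {
    b.iter(|| {
        for num in 1..5_000 {
            test::black_box(simple(num));
        }
    });
}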
The results (with some rearrangement for clarity):
# [1..5000]
test bench_recursive_small ... bench: 321,086 ns/iter (+/- 3,660)
test bench_simple_small ... bench: 324,699 ns/iter (+/- 1,148)
test bench_shortcut_small ... bench: 262,787 ns/iter (+/- 3,206)
# [1_000_000..1_005_000]
test bench_recursive_mid ... bench: 565,324 ns/iter (+/- 9,872)
test bench_simple_mid ... bench: 563,903 ns/iter (+/- 3,090)
test bench_shortcut_mid ... bench: 445,642 ns/iter (+/- 5,922)
# [1_000_000_000..1_000_005_000]
test bench_recursive_big ... bench: 925,948 ns/iter (+/- 6,393)
test bench_simple_big ... bench: 919,804 ns/iter (+/- 2,590)
test bench_shortcut_big ... bench: 730,428 ns/iter (+/- 15,508)
As expected, bigger numbers take longer. It was nice to see recursive and simple neck and neck thanks to the tail call optimization, but shortcut is ~21% faster. Not bad!
Did this get me close to 2^68? I did some back-of-the-envelope math: shortcut solved 5,000 “big” numbers in 730,428ns, a rate of ~6.8M solves per second. At that rate, it’d take 4.34e+13 seconds. That works out to… 1.3 million years. I had some work to do.
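Spelled out, as a quick sanity check on that estimate (using 2^68 ≈ 2.95 × 10^20):

2.95 × 10^20 numbers ÷ 6.8 × 10^6 solves/second ≈ 4.34 × 10^13 seconds
4.34 × 10^13 seconds ÷ 3.15 × 10^7 seconds/year ≈ 1.37 × 10^6 years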
The same math put recursive and simple at 1.7 million years. I had that going for me, at least.
Insight Into 3x+1
1.3 million years is a lot of compute power. I could spin up a massive cluster on AWS and try to brute force it, but that’s a lot of time and money to recreate what’s already been done. Instead, there were some insights I could use to make this faster.
The biggest question is whether we have to resolve down to 1. If we reach 2, we know that resolves to 1, same with 4. We can short-circuit the entire 1 -> 4 -> 2 -> 1 -> ... cycle. That’d save some work.
It’s also true that all powers of 2 trivially resolve. I could add a log2 check for every number.
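That check is cheap in Rust; a sketch of what it could look like at the top of a solver (my code, not necessarily what ended up in the repo):

// A power of two resolves in exactly trailing_zeros() steps,
// since 2^k needs exactly k halvings to reach 1.
if num.is_power_of_two() {
    return num.trailing_zeros() as usize;
}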
There were other rules I could add, but there’s one, much larger, insight to be had. I’ve been solving these numbers sequentially. For any number N, I already know that all numbers less than N resolve. Therefore, any number which goes below its initial value will resolve to 1.
It looks like this:
pub fn faster_shortcut(num: usize) -> usize {
    // Special case: can't get to < 1.
    if num == 1 {
        return 1;
    }
    let mut count = 0;
    let mut curr_num = num;
    // Solving sequentially, any value below the starting number is
    // already known to resolve, so stop as soon as we drop below it.
    while curr_num >= num {
        (curr_num, count) = match curr_num {
            curr_num if curr_num % 2 == 0 => (curr_num / 2, count + 1),
            _ => ((3 * curr_num + 1) / 2, count + 2),
        };
    }
    count
}
How does it stack up to shortcut?
# [1..5000]
test bench_shortcut_small ... bench: 262,787 ns/iter (+/- 3,206)
test bench_faster_shortcut_small ... bench: 17,159 ns/iter (+/- 166)
# [1_000_000..1_005_000]
test bench_shortcut_mid ... bench: 445,642 ns/iter (+/- 5,922)
test bench_faster_shortcut_mid ... bench: 16,208 ns/iter (+/- 1,298)
# [1_000_000_000..1_000_005_000]
test bench_shortcut_big ... bench: 730,428 ns/iter (+/- 15,508)
test bench_faster_shortcut_big ... bench: 15,525 ns/iter (+/- 396)
Now we’re talking! 98% faster is pretty good, but was it enough? 2% of 1,300,000 years is… 26,000 years. Still more work to be done.
I can do a bit more, like skipping even numbers (~13% faster; sketched below), but it was approaching the point of diminishing returns. Without an insight that’s 5 orders of magnitude faster, I needed to parallelize the workload. Fortunately, it’s embarrassingly parallel.
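For the curious, here’s roughly what that even-number skip looks like (a sketch; the function name and structure are my own, not the repo’s). When solving sequentially, every even N immediately halves to N/2 < N, which is already known to resolve, so only odd numbers need any work:

use std::cmp::max;

// Hypothetical driver that only feeds odd numbers to the solver.
pub fn solve_odds_only(start: usize, end: usize) -> usize {
    let mut max_steps = 0;
    let first_odd = start | 1; // bump an even start up to the next odd
    for num in (first_odd..end).step_by(2) {
        max_steps = max(max_steps, faster_shortcut(num));
    }
    max_steps
}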
More Power!
If I was ever going to get to 2^68, I wasn’t going to do it with a single core.
Fortunately, Rust has a good threadpool implementation. Getting data out is a bit fiddly, so I had to learn about channels. It looks something like this:
use std::sync::mpsc::Sender;
use threadpool::ThreadPool;

pub fn solve(
    start: usize,
    end: usize,
    output_channel: Sender<(usize, usize)>,
    threads: usize,
) {
    let pool = ThreadPool::new(threads);
    for num in start..end {
        let output_channel = output_channel.clone();
        pool.execute(move || {
            let result = shortcut(num);
            output_channel
                .send((num, result))
                .expect("channel broken!");
        });
    }
    pool.join();
}
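On the consuming side (a sketch of my own, not code from the repo), the Receiver half of the channel drains the results. solve joins the pool and drops every Sender clone before returning, so iterating the Receiver terminates cleanly:

use std::sync::mpsc::channel;

fn main() {
    let (tx, rx) = channel();
    // solve blocks until all jobs finish; results buffer in the channel.
    solve(1, 5_000, tx, 4);
    for (num, steps) in rx {
        println!("{num} resolved in {steps} steps");
    }
}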
That was easy. I’m sure with 4 threads it’ll be 4x faster.
test bench_solve_big ... bench: 56,306,136 ns/iter (+/- 5,612,649)
test bench_solve_mid ... bench: 63,223,966 ns/iter (+/- 2,560,124)
test bench_solve_small ... bench: 63,715,650 ns/iter (+/- 3,901,619)
Uh oh. Maybe more threads? I have 24 cores. Use them all!
test bench_solve_big ... bench: 97,167,581 ns/iter (+/- 3,491,744)
test bench_solve_mid ... bench: 97,583,864 ns/iter (+/- 3,722,710)
test bench_solve_small ... bench: 98,035,438 ns/iter (+/- 2,615,151)
That’s not good. So, what’s happening?
The core algorithm is very small. The disassembly above shows this is a small chunk of machine code that I’m running in a tight loop. The function only takes a few nanoseconds to compute 3x+1 for a given number.
Adding the threadpool changed all that. Now, for each number tested, it clones the output channel, adds the job to a queue, a worker takes it off the queue, does the work, and sends a response back via the channel. None of this is especially inefficient, but the formerly tight inner loop now has many more steps and some memory allocation. Even executing in parallel isn’t enough to overcome the added overhead. Maybe it could with more cores, but I wasn’t able to test that.
(For a more in-depth discussion of the topic, check out this paper. It makes a good argument that single-core performance can provide a good “ground truth” of performance when you start distributing computation.)
I decided to try batching to amortize the overhead. I wanted that tight inner loop running as much as possible in order to lessen the impact of the threadpool.
use std::cmp::{max, min};
use std::sync::mpsc::Sender;
use std::time::SystemTime;
use threadpool::ThreadPool;

pub fn solve(
    start: usize,
    end: usize,
    output_channel: Sender<BatchSummary>,
    batch_size: usize,
    threads: usize,
) {
    let pool = ThreadPool::new(threads);
    let mut batch_start = start;
    while batch_start < end {
        let batch_end = min(batch_start + batch_size, end);
        let output_channel = output_channel.clone();
        pool.execute(move || {
            let mut max_steps = 0;
            let start_time = SystemTime::now();
            for num in batch_start..batch_end {
                // Max steps is mildly interesting, but really I'm making sure
                // the compiler doesn't make this function call disappear.
                max_steps = max(max_steps, faster_shortcut(num));
            }
            // Send a completion summary to the output channel.
            output_channel
                .send(BatchSummary {
                    start: batch_start,
                    end: batch_end,
                    start_time,
                    end_time: SystemTime::now(),
                    max_steps,
                })
                .expect("channel broken!");
        });
        batch_start = batch_end;
    }
    pool.join();
}
I added BatchSummary, since individual results aren’t terribly interesting. It has basic information about each job so things like average execution time can be tracked.
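The struct itself is simple. Reconstructed from how it’s used above (the field types are my assumption; the real definition is in the repo):

use std::time::SystemTime;

pub struct BatchSummary {
    pub start: usize,
    pub end: usize,
    pub start_time: SystemTime,
    pub end_time: SystemTime,
    pub max_steps: usize,
}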
The benchmark solves 5,000 numbers. Let’s try batches of 1,000 numbers over 5 threads:
test bench_solve_big ... bench: 137,272 ns/iter (+/- 8,960)
test bench_solve_mid ... bench: 136,763 ns/iter (+/- 8,561)
test bench_solve_small ... bench: 136,933 ns/iter (+/- 9,923)
Better, but still less efficient than running on a single core. Bigger batches will amortize further, so I tried that. Experimentally, I found 10M to be a good batch size. Here’s 40M numbers with 4 threads (i.e. one batch per core):
# No threading (baseline)
test bench_faster_shortcut_small ... bench: 134,173,824 ns/iter (+/- 663,586)
# 1 thread
test bench_solve_small ... bench: 146,406,820 ns/iter (+/- 3,089,826)
# 4 threads
test bench_solve_small ... bench: 36,333,800 ns/iter (+/- 2,377,087)
Now we’re talking! As expected, four threads is four times faster than one. It’s also ~3.7 times as fast as the non-threaded version (context switching, allocations, and overhead eat up the missing fraction of a core).
I amortized away almost all of the overhead. Considering the variance of the threaded tests, it was working very well.
Going back to the napkin math, I still need 26,000 years of compute, but now I can use as many cores as I want. Looking at recent prices, I can get $149/core/year, which puts solving to 2^68 at a very affordable $3,874,000!
Conclusions and Next Steps
I couldn’t solve up to 2^68 in a cost- or time-effective manner, but there’s still a lot more to explore. This post is pretty long, though, so I’m going to stop here. This problem turned out to be a good testbed for learning Rust and playing with distributed computation. I’ve posted the code on GitHub for anyone who wants to see the whole thing.
In the future, I plan on writing about:
- Looking at alternate implementations
- Adding a CLI
- Running 3x+1 on a GPU
- Distributing all of this on GCP or AWS
Maybe I’ll even do some ground-breaking math along the way! (probably not)
In the meantime, all the code can be found in this repo.