A brief detour this week as I look into an alternate Collatz implementation. The first part of the series can be found here.
The Collatz Conjecture Wikipedia page has an alternate formulation titled “as an abstract machine that computes in base two”. I have a computer. It’s a non-abstract machine that computes in base two. Seemed like it could be fun.
Making the Abstract Concrete
The idea is pretty clever:
Repeated applications of the Collatz function can be represented as an abstract machine that handles strings of bits. The machine will perform the following three steps on any odd number until only one 1 remains:
- Append 1 to the (right) end of the number in binary (giving 2n + 1);
- Add this to the original number by binary addition (giving 2n + 1 + n = 3n + 1);
- Remove all trailing 0s (that is, repeatedly divide by 2 until the result is odd).
– [wikipedia]
This shouldn’t be too hard to implement with bitwise operators, but will it be faster than the “faster” method from part one?
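Taken literally, a single pass of those three steps on an odd number comes out to something like this (a quick sketch of my own for illustration, not the code I actually benchmarked):

pub fn one_pass(n: usize) -> usize {
    // Handles a single odd input; the real implementations below loop over many steps.
    debug_assert!(n % 2 == 1);
    let appended = (n << 1) | 1; // append a 1 to the right end: 2n + 1
    let mut added = appended + n; // binary addition: 2n + 1 + n = 3n + 1
    while added & 1 == 0 {
        added >>= 1; // remove trailing 0s: divide by 2 until odd
    }
    added
}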
The “faster” method had a few optimizations:
- Stopping if the generated number drops below the original number.
- Computing `(3x + 1) / 2` in one step rather than two (i.e. the “shortcut” method).
The benchmarks from part one ran in about 12µs (full details here). I wanted to see if this method would beat that.
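I won’t reproduce the part-one code here, but from those two optimizations it looks roughly like this (a reconstruction from memory, so names and details may not match exactly):

pub fn faster_shortcut(num: usize) -> usize {
    if num == 1 {
        return 1;
    }
    let mut count = 0;
    let mut curr_num = num;
    // Stop as soon as we drop below the starting number.
    while curr_num >= num {
        curr_num = if curr_num % 2 == 1 {
            (3 * curr_num + 1) / 2 // the "shortcut": two Collatz steps in one
        } else {
            curr_num / 2
        };
        count += 1;
    }
    count
}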
Things That Don’t Work
The interesting part of this method is that it can solve `3x+1` in fewer steps than other implementations. One could “skip” iterations by chopping off multiple trailing zeroes at once. Implementing the multiplication step with bitshifts is novel, but not nearly as promising.
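For example, 40 is `101000` in binary: three trailing zeroes means three separate halving steps collapse into a single shift, jumping straight from 40 to 5.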
This bit of optimization gets me no closer to solving the problem, but it is yak shaving at its finest. It’s not like I’m doing groundbreaking math. I’m just here to have fun with Rust.
The first implementation didn’t look half-bad:
pub fn bitwise(num: usize) -> usize {
    if num == 1 {
        return 1;
    }
    let mut count = 0;
    let mut curr_num = num;
    while curr_num >= num {
        curr_num = match curr_num & 1 {
            0 => curr_num >> 1,
            1 => ((curr_num << 1) + 1 + curr_num) >> 1,
            _ => panic!("at the disco"),
        };
        count += 1;
    }
    count
}
And it ran pretty okay…
test bench_bitwise_big ... bench: 13,000 ns/iter (+/- 210)
test bench_faster_shortcut_big ... bench: 12,406 ns/iter (+/- 192)
But, that’s a boring result. I wanted to know how far I could push it.
First, I jettisoned the bitwise multiplication step. It’s ungainly and slightly slower than the more straightforward version. Then, I tried to get iteration skips working. I wanted to bit shift by the number of trailing zeroes in a given number. My binary math is a bit rusty, but it turns out finding the least significant bit isn’t that hard.
For an unsigned type, `(~x + 1) & x` will set the least significant bit, and only that bit, to 1. This took me some work to understand, so let me explain. Here’s what it looks like with the number 12:
- 12 => `1100`
- ~12 => `0011`
- ~12 + 1 => `0100`
- `0100` AND `1100` => `0100`
By flipping the trailing zeroes and the least significant bit, something like `100` turns into `011`. Add 1 to that and it’s back to `100`, the same as the original number. The key is that those are the only bits that come back to their original values after the complement-and-add; every higher bit stays flipped. Therefore, the AND gives 0 on all of the other bits and we’ve isolated only the least significant set bit.
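A quick throwaway check ties this back to Rust, where bitwise NOT is spelled `!` rather than `~` (this little demo is mine and isn’t part of the benchmarked code):

fn main() {
    let x: usize = 12;                     // 0b1100
    let lsb = (!x + 1) & x;                // isolate the least significant set bit
    assert_eq!(lsb, 0b0100);
    assert_eq!(lsb, x & x.wrapping_neg()); // same trick, written as two's-complement negation
    println!("least significant bit of {x:#06b} is {lsb:#06b}");
}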
My version 2:
pub fn bitwise(num: usize) -> usize {
    if num == 1 {
        return 1;
    }
    let mut count = 0;
    let mut curr_num = num;
    while curr_num >= num {
        let least_significant_bit = (!curr_num + 1) & curr_num;
        curr_num = match least_significant_bit {
            1 => ((curr_num << 1) + 1 + curr_num) >> 1,
            2 => curr_num >> 1,
            4 => curr_num >> 2,
            8 => curr_num >> 3,
            16 => curr_num >> 4,
            32 => curr_num >> 5,
            64 => curr_num >> 6,
            128 => curr_num >> 7,
            256 => curr_num >> 8,
            _ => curr_num >> 9, // has to be at least this much
        };
        count += 1;
    }
    count
}
Not exactly pretty, but it could be cleaned up later. The real question was whether it was faster.
test bench_bitwise_big ... bench: 42,801 ns/iter (+/- 3,060)
test bench_faster_shortcut_big ... bench: 11,961 ns/iter (+/- 362)
Nope.
It turns out Rust already has `trailing_zeros`, which simplified the code a good bit:
pub fn bitwise(num: usize) -> usize {
    if num == 1 {
        return 1;
    }
    let mut count = 0;
    let mut curr_num = num;
    while curr_num >= num {
        curr_num = match curr_num.trailing_zeros() {
            0 => ((curr_num << 1) + 1 + curr_num) >> 1,
            _ => curr_num >> curr_num.trailing_zeros(),
        };
        count += 1;
    }
    count
}
And…
test bench_bitwise_big ... bench: 15,817 ns/iter (+/- 981)
test bench_faster_shortcut_big ... bench: 12,496 ns/iter (+/- 211)
Better, but not as good as the first version.
Something That Does Work
The final version, which was faster than `faster_shortcut`, combined several approaches:
pub fn bitwise(num: usize) -> usize {
    if num == 1 {
        return 1;
    }
    let mut count = 0;
    let mut curr_num = num;
    while curr_num >= num {
        curr_num = match curr_num % 2 {
            1 => (3 * curr_num + 1) / 2,
            0 => curr_num >> curr_num.trailing_zeros(),
            _ => panic!("at the disco"),
        };
        count += 1;
    }
    count
}
This implementation:
- Stops if `curr_num < num`
- Removes all trailing zeroes in one operation
- Retains the more readable version of the multiplication step
- Delays calling `.trailing_zeros()` until we need it
- Runs 30% faster!
test bench_bitwise_big ... bench: 9,426 ns/iter (+/- 178)
test bench_faster_shortcut_big ... bench: 12,254 ns/iter (+/- 328)
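For what it’s worth, these numbers all come from the standard nightly `#[bench]` harness. The benchmark itself is along these lines (the exact input range from part one is an assumption on my part):

#![feature(test)]
extern crate test;

#[cfg(test)]
mod benches {
    use super::*;
    use test::{black_box, Bencher};

    #[bench]
    fn bench_bitwise_big(b: &mut Bencher) {
        // Sum the step counts over a range of inputs so the calls can't be optimized away.
        b.iter(|| (1..10_000usize).map(|n| bitwise(black_box(n))).sum::<usize>());
    }
}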
Adding a CLI
One last yak to shave before I move on. I wanted a simple CLI so that I could run this more interactively. It has no real utility, but it was a good opportunity to learn argument parsing in Rust.
It wasn’t too hard to learn clap and create a simple interface for the project. The full thing is a bit wordy, but you can find it here. It also includes how I’m using channels to get status information back to the caller. The result looks pretty nice:
❯ ./collatz --help
collatz -- run the 3x+1 problem on some numbers or something
Usage: collatz [OPTIONS]
Options:
-s, --start <START>
Start running at N [default: 1]
-e, --end <END>
Where to end (0 runs forever) [default: 0]
-b, --batch-override <BATCH_OVERRIDE>
Override default batch size of `num_to_solve / (threads*2) `
-t, --threads <THREADS>
How many threads to use [default: 24]
-h, --help
Print help
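For the curious, a clap definition that produces roughly that interface can be sketched with the derive API (the real version also spins up threads and wires status channels back to main; this is just the argument-parsing part, with the field types assumed):

use clap::Parser;

/// run the 3x+1 problem on some numbers or something
#[derive(Parser, Debug)]
#[command(name = "collatz")]
struct Args {
    /// Start running at N
    #[arg(short, long, default_value_t = 1)]
    start: u64,

    /// Where to end (0 runs forever)
    #[arg(short, long, default_value_t = 0)]
    end: u64,

    /// Override default batch size of `num_to_solve / (threads*2)`
    #[arg(short, long)]
    batch_override: Option<u64>,

    /// How many threads to use
    #[arg(short, long, default_value_t = 24)]
    threads: usize,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}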
Now, I can build and distribute a binary if I did want to run it on a cloud provider. Not today, though. Today I wanted to watch it run forever. It looked something like this:
❯ ./collatz
Running with settings:
- start: 1
- end: ∞
- batch size: 200,000,000
- threads: 24
Starting infini-batch [1.00e0, 2.00e10]
-----
Total Solves Overall solves/s Batch Duration Batch solves/s Max steps to solve
2.00e8 8.839e7 2262.783ms 8.84e7 361
4.00e8 8.828e7 2268.234ms 8.82e7 361
... snip ...
1.98e10 8.814e7 1491.901ms 1.34e8 471
2.00e10 8.844e7 1488.477ms 1.34e8 471
-----
Total Solves: 20,000,000,000
Processing Time: 226.131884011s
Solves/s: 88,443,960
Starting infini-batch [2.00e10, 4.00e10]
-----
Total Solves Overall solves/s Batch Duration Batch solves/s Max steps to solve
2.00e8 8.784e7 2276.852ms 8.78e7 431
4.00e8 8.786e7 2276.083ms 8.79e7 466
... snip ...
1.98e10 8.763e7 1507.896ms 1.33e8 469
2.00e10 8.792e7 1545.609ms 1.29e8 469
-----
Total Solves: 20,000,000,000
Processing Time: 227.492126893s
Solves/s: 87,915,130
Starting infini-batch [4.00e10, 6.00e10]
-----
... so on and so forth ...
Solving 87 million numbers a second doesn’t get us any closer to doing useful math, but at least it looks pretty neat. While running against real numbers, I hit the upper limit of `u64`, so I had to swap out for 128-bit integers. They’re a bit slower, but nothing to be done about that.
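The swap itself is mostly mechanical; the final function just gets a wider integer type so the `3n + 1` step can’t overflow (a sketch of how that might look; the real code may differ in details like the return type):

pub fn bitwise(num: u128) -> usize {
    if num == 1 {
        return 1;
    }
    let mut count = 0;
    let mut curr_num = num;
    while curr_num >= num {
        curr_num = match curr_num % 2 {
            // 3n + 1 can exceed u64::MAX for large inputs, so do the math in 128 bits.
            1 => (3 * curr_num + 1) / 2,
            0 => curr_num >> curr_num.trailing_zeros(),
            _ => panic!("at the disco"),
        };
        count += 1;
    }
    count
}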
Moving On…
This was a fun little digression. There’s a whole section in the wiki page about other optimizations, but they’re above my pay grade. I’m no mathematician.
In the meantime, I still want to:
- Run this on a GPU
- Distribute it to many machines
I’m not honestly sure if or when I’ll get to that, though. Andrej Karpathy’s course on neural networks has been occupying all my free time. I plan on continuing down that rabbit hole as far as it’ll go. If I do anything interesting, I’ll try to write about it here.
PS: The code from this post can be found here.