A brief detour this week as I look into an alternate Collatz implementation. The first part of the series can be found here.

The Collatz Conjecture Wikipedia page has an alternate formulation titled “as an abstract machine that computes in base two”. I have a computer. It’s a non-abstract machine that computes in base two. Seemed like it could be fun.

Making the Abstract Concrete

The idea is pretty clever:

Repeated applications of the Collatz function can be represented as an abstract machine that handles strings of bits. The machine will perform the following three steps on any odd number until only one 1 remains:

  1. Append 1 to the (right) end of the number in binary (giving 2n + 1);
  2. Add this to the original number by binary addition (giving 2n + 1 + n = 3n + 1);
  3. Remove all trailing 0s (that is, repeatedly divide by 2 until the result is odd).

– [wikipedia]
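
To make that concrete, take n = 7, which is 111 in binary: appending a 1 gives 1111 (15, i.e. 2n + 1); adding the original 111 gives 10110 (22, i.e. 3n + 1); and stripping the trailing 0 gives 1011 (11), the next odd term. The machine then repeats from 11.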

This shouldn’t be too hard to implement with bitwise operators, but will it be faster than the “faster” method from part one?

The “faster” method had a few optimizations:

  1. Stopping if the generated number drops below the original number.
  2. Computing (3x + 1) / 2 in one step rather than two (i.e. the “shortcut” method).
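
For reference, here’s a sketch of what that method might look like, reconstructed from those two optimizations. It’s not the actual code from part one; the function name just mirrors the benchmark labels below.

pub fn faster_shortcut(num: usize) -> usize {
    if num == 1 {
        return 1;
    }
    let mut count = 0;
    let mut curr_num = num;
    // Stop as soon as the value drops below the starting number.
    while curr_num >= num {
        curr_num = if curr_num % 2 == 1 {
            // Shortcut: the odd step and the halving that always follows it, in one go.
            (3 * curr_num + 1) / 2
        } else {
            curr_num / 2
        };
        count += 1;
    }
    count
}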

The benchmarks from part one ran in about 12µs (full details here). I wanted to see if this method would beat that.

Things That Don’t Work

The interesting part of this method is that it can solve 3x+1 in fewer steps than other implementations. One could “skip” iterations by chopping off multiple trailing zeroes at once (for example, 96 = 0b1100000 collapses to 3 in a single shift instead of five separate halvings). Implementing the multiplication step with bitshifts is novel, but not nearly as promising.

This bit of optimization gets me no closer to solving the problem, but it is yak shaving at its finest. It’s not like I’m doing groundbreaking math. I’m just here to have fun with Rust.

The first implementation didn’t look half-bad:

pub fn bitwise(num: usize) -> usize {
    if num == 1 {
        return 1;
    }
    let mut count = 0;
    let mut curr_num = num;
    while curr_num >= num {
        curr_num = match curr_num & 1 {
            0 => curr_num >> 1,
            1 => ((curr_num << 1) + 1 + curr_num) >> 1,
            _ => panic!("at the disco"),
        };
        count += 1;
    }
    count
}

And it ran pretty okay…

test bench_bitwise_big           ... bench:      13,000 ns/iter (+/- 210)
test bench_faster_shortcut_big   ... bench:      12,406 ns/iter (+/- 192)

But, that’s a boring result. I wanted to know how far I could push it.

First, I jettisoned the bitwise multiplication step. It’s ungainly and slightly slower than the more straightforward version. Then, I tried to get iteration skips working. I wanted to bit shift by the number of trailing zeroes in a given number. My binary math is a bit rusty, but it turns out finding the least significant set bit isn’t that hard.

For an unsigned type, (~x + 1) & x isolates the least significant set bit: the result has a 1 in that position and 0 everywhere else. (In Rust, the bitwise complement is written !x rather than ~x.) This took me some work to understand, so let me explain. Here’s what it looks like with the number 12:

  • 12 => 1100
  • ~12 => 0011
  • ~12 + 1 => 0100
  • 0100 AND 1100 => 0100

The complement flips every bit, so the trailing zeroes and the lowest set bit – something like 100 – turn into 011. Adding 1 carries through those low bits, turning them back into 100, the same as the original number. Every bit above that stays flipped in the complement, so it differs from the original and the AND gives 0 there, leaving only the least significant set bit.
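
Here’s a quick, self-contained check of the trick; the lowest_set_bit helper is just mine for illustration. wrapping_neg computes the two’s-complement !x + 1 without the overflow a debug build would hit if x were 0:

// Isolate the lowest set bit: x & (-x) in two's complement.
fn lowest_set_bit(x: usize) -> usize {
    x & x.wrapping_neg()
}

fn main() {
    assert_eq!(lowest_set_bit(12), 4);  // 0b1100 -> 0b0100
    assert_eq!(lowest_set_bit(80), 16); // 0b1010000 -> 0b0010000
    assert_eq!(lowest_set_bit(7), 1);   // odd numbers always give 1
}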

My version 2:

pub fn bitwise(num: usize) -> usize {
    if num == 1 {
        return 1;
    }
    let mut count = 0;
    let mut curr_num = num;
    while curr_num >= num {
        let least_significant_bit = (!curr_num + 1) & curr_num;
        curr_num = match least_significant_bit {
            1 => ((curr_num << 1) + 1 + curr_num) >> 1,
            2 => curr_num >> 1,
            4 => curr_num >> 2,
            8 => curr_num >> 3,
            16 => curr_num >> 4,
            32 => curr_num >> 5,
            64 => curr_num >> 6,
            128 => curr_num >> 7,
            256 => curr_num >> 8,
            _ => curr_num >> 9, // has to be at least this much
        };
        count += 1;
    }
    count
}

Not exactly pretty, but it could be cleaned up later. The real question was whether it was faster.

test bench_bitwise_big           ... bench:      42,801 ns/iter (+/- 3,060)
test bench_faster_shortcut_big   ... bench:      11,961 ns/iter (+/- 362)

Nope.

It turns out Rust already has trailing_zeros, which simplified the code a good bit:

pub fn bitwise(num: usize) -> usize {
    if num == 1 {
        return 1;
    }
    let mut count = 0;
    let mut curr_num = num;
    while curr_num >= num {
        curr_num = match curr_num.trailing_zeros() {
            0 => ((curr_num << 1) + 1 + curr_num) >> 1,
            _ => curr_num >> curr_num.trailing_zeros(),
        };
        count += 1;
    }
    count
}

And…

test bench_bitwise_big           ... bench:      15,817 ns/iter (+/- 981)
test bench_faster_shortcut_big   ... bench:      12,496 ns/iter (+/- 211)

Better, but not as good as the first version.

Something That Does Work

The final version, which was faster than faster_shortcut, combined several approaches:

pub fn bitwise(num: usize) -> usize {
    if num == 1 {
        return 1;
    }
    let mut count = 0;
    let mut curr_num = num;
    while curr_num >= num {
        curr_num = match curr_num % 2 {
            1 => (3 * curr_num + 1) / 2,
            0 => curr_num >> curr_num.trailing_zeros(),
            _ => panic!("at the disco"),
        };
        count += 1;
    }
    count
}

This implementation:

  • Stops if curr_num < num
  • Removes all trailing zeroes in one operation
  • Retains the more readable version of the multiplication step
  • Delays calling .trailing_zeros() until we need it
  • Runs 30% faster!

test bench_bitwise_big           ... bench:       9,426 ns/iter (+/- 178)
test bench_faster_shortcut_big   ... bench:      12,254 ns/iter (+/- 328)

Adding a CLI

One last yak to shave before I move on. I wanted a simple CLI so that I could run this more interactively. It has no real utility, but it was a good opportunity to learn argument parsing in Rust.

It wasn’t too hard to learn clap and create a simple interface for the project. The full thing is a bit wordy, but you can find it here. It also includes how I’m using channels to get status information back to the caller. The result looks pretty nice:

❯ ./collatz --help
collatz -- run the 3x+1 problem on some numbers or something

Usage: collatz [OPTIONS]

Options:
  -s, --start <START>
          Start running at N [default: 1]
  -e, --end <END>
          Where to end (0 runs forever) [default: 0]
  -b, --batch-override <BATCH_OVERRIDE>
          Override default batch size of `num_to_solve / (threads*2) `
  -t, --threads <THREADS>
          How many threads to use [default: 24]
  -h, --help
          Print help

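That help text maps almost one-to-one onto clap’s derive API. Here’s roughly how those flags might be declared – a sketch, not the actual code from the repo, and the field types are my guess:

// Sketch of an argument struct using clap v4 with the "derive" feature.
use clap::Parser;

/// run the 3x+1 problem on some numbers or something
#[derive(Parser, Debug)]
struct Args {
    /// Start running at N
    #[arg(short, long, default_value_t = 1)]
    start: u128,

    /// Where to end (0 runs forever)
    #[arg(short, long, default_value_t = 0)]
    end: u128,

    /// Override default batch size of `num_to_solve / (threads*2)`
    #[arg(short, long)]
    batch_override: Option<u128>,

    /// How many threads to use
    #[arg(short, long, default_value_t = 24)]
    threads: usize,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}
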
Now, I can build and distribute a binary if I did want to run it on a cloud provider. Not today, though. Today I wanted to watch it run forever. It looked something like this:

❯ ./collatz
Running with settings:
  - start: 1
  - end: ∞
  - batch size: 200,000,000
  - threads: 24

Starting infini-batch [1.00e0, 2.00e10]
-----

Total Solves	Overall solves/s	Batch Duration	Batch solves/s	Max steps to solve
2.00e8		8.839e7			2262.783ms		8.84e7		361
4.00e8		8.828e7			2268.234ms		8.82e7		361
  ... snip ...
1.98e10		8.814e7			1491.901ms		1.34e8		471
2.00e10		8.844e7			1488.477ms		1.34e8		471

-----
Total Solves: 20,000,000,000
Processing Time: 226.131884011s
Solves/s: 88,443,960
Starting infini-batch [2.00e10, 4.00e10]
-----

Total Solves	Overall solves/s	Batch Duration	Batch solves/s	Max steps to solve
2.00e8		8.784e7			2276.852ms		8.78e7		431
4.00e8		8.786e7			2276.083ms		8.79e7		466
 ... snip ...
1.98e10		8.763e7			1507.896ms		1.33e8		469
2.00e10		8.792e7			1545.609ms		1.29e8		469

-----
Total Solves: 20,000,000,000
Processing Time: 227.492126893s
Solves/s: 87,915,130
Starting infini-batch [4.00e10, 6.00e10]
-----

... so on and so forth ...

Solving 87 million numbers a second doesn’t get us any closer to doing useful math, but at least it looks pretty neat. While running against real inputs, I hit the upper limit of u64, so I had to swap it out for 128-bit integers. They’re a bit slower, but nothing to be done about that.
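
The swap itself is mostly mechanical – the same shape as the final version above, just with the integer type widened. (On 64-bit targets, u128 arithmetic is lowered to pairs of 64-bit operations, which is where the slowdown comes from.) A sketch of what that looks like:

// Final bitwise version, widened to 128-bit integers.
pub fn bitwise(num: u128) -> u128 {
    if num == 1 {
        return 1;
    }
    let mut count = 0;
    let mut curr_num = num;
    while curr_num >= num {
        curr_num = match curr_num % 2 {
            1 => (3 * curr_num + 1) / 2,
            0 => curr_num >> curr_num.trailing_zeros(),
            _ => unreachable!(),
        };
        count += 1;
    }
    count
}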

Moving On…

This was a fun little digression. There’s a whole section in the wiki page about other optimizations, but they’re above my pay grade. I’m no mathematician.

In the meantime, I still want to:

  • Run this on a GPU
  • Distribute it to many machines

I’m not honestly sure if or when I’ll get to that, though. Andrej Karpathy’s course on neural networks has been occupying all my free time. I plan on continuing down that rabbit hole as far as it’ll go. If I do anything interesting, I’ll try to write about it here.

PS: The code from this post can be found here.