Now, finally, we come to our last task.
Okay.
There's not really any new idea here. We just take the two ideas from the previous section and integrate them. That's it.
So combining wheel factorization and symmetry reduction, basically, right?
Exactly. There's nothing conceptually tricky here if you've followed along thus far. It's simply a matter of the implementation growing still more complicated.
That makes sense.
So, drawing it all together?
Well, we should bring back our wheel factorization function, right? $$h(j)= \begin{cases} 0 & \text{if } 2 | j \textrm{ or } 3 | j \textrm{ or } 5 | j \textrm{ or } 7 | j \textrm{ or } 11 | j \textrm{ or } 13| j \textrm{ or } 17| j \textrm{ or } 19 | j \\ 1 & \text{otherwise} \end{cases} $$
And then we would need new count of divisor functions that incorporate both wheel factorization and counting with symmetry reduction... $$D'_{ 1}( n ) = \sum_{\substack{t=23 \\ h(t)=1}}^n 1$$ $$D'_{ 2 }( n ) = \sum_{\substack{t=23 \\ h(t)=1}}^{\lfloor n^\frac{1}{2} \rfloor} \sum_{\substack{u=t \\ h(u)=1}}^{\lfloor \frac{n}{t} \rfloor} f_2(t,u)$$ $$D'_{ 3 }( n ) = \sum_{\substack{t=23 \\ h(t)=1}}^{\lfloor n^\frac{1}{3} \rfloor} \sum_{\substack{u=t \\ h(u)=1}}^{\lfloor (\frac{n}{t})^\frac{1}{2} \rfloor} \sum_{\substack{v=u \\ h(v)=1}}^{\lfloor \frac{n}{t \cdot u} \rfloor} f_3(t,u,v)$$
And then we get our count of primes from those counts of divisor functions, just as we did from our previous wheel factorization examples, like so. $$\Pi(n) = \sum_{k=1} \frac{(-1)^{k+1}}{k} \cdot D'_k(n)$$ $$\sum_{j=1} \frac{\mu(j)}{j} \cdot \Pi(n^\frac{1}{j}) = \pi(n) - \pi(19)$$
That's it, right?
That's it. No real new ideas here, just the extra burden of integrating these techniques together. A JavaScript implementation might look like this...
Once again, I'm including some unrolling and inlining optimizations in D_2
which I will pass over without comment, which, needless to say, complicates the code a bit
and makes timing comparisons with previous implementations looser.
Still, if you run this, what do you see?
Wow - the speed increase here is pretty extreme. So the original, totally unoptimized version computed $\pi(10^5)$ in around 1.5 seconds. Then, both previous optimized versions computed $\pi(10^5)$ in around a millisecond, and they both computed $\pi(10^9)$ in around a second.
And now?
Both the optimizations seem to be working perfectly together. This implementation computes $\pi(10^9)$ in around 3 milliseconds, at least on my middle of the road computer. And meanwhile, it computes $\pi(10^{13})$ in around 1 second.
That sounds about right.
And that's none too shabby for 115 lines of single-threaded JavaScript that don't allocate any run-time memory other than the minimal stack space for the recursive D_k
and the fixed size pools of the arrays used by wheelTranslation
and wheelOffsets
to make the wheel efficient.
This talk of memory usage finally lets me resolve the provocative claim made at the very beginning of this discussion.
About computing the count of primes quickly in 64KB of RAM, you mean?
Exactly. The above code, which excludes numbers divisible by the first 8 primes, uses about 40 megs of fixed size caches. For how fast the code is, and on modern machines, this is basically trivial.
Sure.
And this approach is, as I mentioned, very sensitive to wheel sizes. Nevertheless, if we want to exclude only numbers divisible by the first 6 primes when counting, memory usage drops drastically, to around 64KB.
But it's slower?
Right. At least for these tests, it runs between 2 and 3 times slower than the larger wheel. Still, for a prime counting algorithm that is practically not using any memory at all, it's still pretty astonishingly fast. You can run that implementation, whose code can be found here, in the table below.
Alternatively, if you'd like to explore these memory claims in more fine-grained details outside of JavaScript, this link contains a Windows x64 15k command line executable of this .cpp code, computing the same algorithm and keeping rigorous track of how much memory it uses along the way in a very resource inexpensive command line context.
Okay. Now let's broaden our view a bit.