So, why do you think I've spent so much time elaborating this approach the way I have?
You mean, given that it's not as fast as the current fastest approaches?
Right.
Well, it's still clearly quite fast. So that's not nothing.
True.
And if we're comparing algorithms, it certainly uses shockingly little memory compared to any of these other approaches, and that memory bound essentially never grows. That seems quite surprising. How big of a deal is that?
I'll be honest, I'm not sure. But yes, that is a major difference.
Really, though, it does seem like, on something like a moral level, the approach here is almost like a sibling of Legendre's original approach to counting primes.
That feels right to me. It's why I stressed, at the beginning of this discussion, that the other major prime counting approaches have had considerable development applied to them. They've benefited from a lot of smart eyeballs. That has not been the case here, at least as far as I know at the time of writing - I only have two eyes, and almost everything I've written here is my own effort combined, of course, with copious amounts of reading.
And thus, by talking through one possible development of this approach at great length, you are hoping to draw more eyes to it, right?
Precisely.
Okay. Well, do you have anything useful to say on that front, if someone were interested in picking up that baton?
I do. Here are a few areas that seem plausibly noteworthy to me.
If you go back to the final version of the code we evolved, the function D_2 is where almost all execution time is spent. That function looked like this:
function D_2(n, a, p, r) {
    const nroot = Math.floor(Math.sqrt(n) + EPSILON);
    // Initial terms for this (p, r) pair, built from the starting value of a.
    let t = UnsievedInRange(a, Math.floor(n / a)) * p / r
          + UnsievedInRange(a, nroot) * p / 2
          + p / (r * (r + 1));
    let wheelEntry = NumToWheelEntry(a);
    a += wheelOffsets[wheelEntry];
    wheelEntry = IncrementWheelEntry(wheelEntry);
    let u = 0;
    // Hot loop: walk a across the wheel-unsieved values up to sqrt(n),
    // accumulating the count of unsieved numbers in each range from a to n / a.
    while (a <= nroot) {
        u += UnsievedInRange(a, Math.floor(n / a));
        a += wheelOffsets[wheelEntry];
        wheelEntry = IncrementWheelEntry(wheelEntry);
    }
    return t + u * p;
}
But actually we can be much more specific than that.
The real culprit for almost all execution time is the following inner loop:
while (a <= nroot) {
    u += UnsievedInRange(a, Math.floor(n / a));
    a += wheelOffsets[wheelEntry];
    wheelEntry = IncrementWheelEntry(wheelEntry);
}
/*
The two simple helper functions referenced in the function here are defined like so:

function UnsievedInRange(rangeStart, rangeEnd) {
    if (++rangeEnd < wheelCyclePeriod)
        return wheelTranslation[rangeEnd] - wheelTranslation[rangeStart] - 1;
    const b = rangeEnd % wheelCyclePeriod;
    if (rangeStart < wheelCyclePeriod)
        return (rangeEnd - b) * wheelEntryRatio + wheelTranslation[b] - wheelTranslation[rangeStart] - 1;
    const a = rangeStart % wheelCyclePeriod;
    return (rangeEnd - b - rangeStart + a) * wheelEntryRatio + wheelTranslation[b] - wheelTranslation[a] - 1;
}

function IncrementWheelEntry(wheelEntry) {
    if (wheelEntry + 1 >= wheelCycleEntries)
        return 0;
    return wheelEntry + 1;
}
*/
So when we run the code, you're saying almost all time is just spent in those 5 lines.
Right. So to improve this algorithm, finding better ways to handle this loop could potentially result in big improvements.
Looking more closely at the surrounding code, the loop does seem to depend entirely on just the two input values n and a.
Exactly. It seems as though it could be fruitful to explore ways, via caching and code reorganization or via clever identities, to rethink how computation is handled here and reduce the cost of this loop. But so far I haven't had much luck.
I see.
Alternatively, one possible lead would be to look at other approaches for the fast computation of the divisor summatory function, which D_2 is clearly a variant of.
This approach computes the divisor summatory function in very roughly $O(n^\frac{1}{3})$ time, as just one really impressive example.
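To make the flavor of that kind of reorganization concrete, here is a minimal sketch - not code from the algorithm itself - of the standard observation those fast divisor-sum methods start from: Math.floor(n / a) takes on only $O(\sqrt n)$ distinct values, so the loop's range of a values can be walked in blocks that share a single quotient. The visit callback is hypothetical, and the hard question of how to sum UnsievedInRange cheaply over a whole block is left open:

// Illustrative sketch only (not part of the algorithm's code): walk the values
// a = aStart .. nroot in blocks over which Math.floor(n / a) is constant.
// There are only O(sqrt(n)) such blocks.
function forEachQuotientBlock(n, aStart, nroot, visit) {
    let a = aStart;
    while (a <= nroot) {
        const q = Math.floor(n / a);
        const aLast = Math.min(nroot, Math.floor(n / q)); // last a with the same quotient q
        visit(a, aLast, q); // hypothetical callback that would handle the whole block at once
        a = aLast + 1;
    }
}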
Okay. So what are some other possibly useful approaches?
Well, here's another idea that might be fruitful. I already mentioned that the execution time of this approach is much, much more sensitive to wheel factorization than many other prime counting techniques.
Right - I just saw that running the algorithm while excluding the first eight primes was two or three times faster than running it excluding the first six.
Correct. Well, there is a reason I don't have versions of the algorithm that exclude more than the first 8 primes. As a practical matter, the way wheel factorization is implemented here is fairly fast and relatively simple, at the expense of using increasingly large amounts of memory as more primes are excluded. The memory grows like the product of all the excluded primes, so memory usage grows more or less like a factorial.
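To put numbers on that, here's a small illustrative snippet - not part of the algorithm itself - that computes the wheel's cycle length, i.e. the product of the excluded primes, for a few wheel sizes:

// The wheel's cycle length is the product of the excluded primes, and the
// wheel table sizes grow proportionally to it.
const somePrimes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29];
function wheelCycleLength(excludedCount) {
    return somePrimes.slice(0, excludedCount).reduce((product, p) => product * p, 1);
}
console.log(wheelCycleLength(6));  // 30030
console.log(wheelCycleLength(8));  // 9699690
console.log(wheelCycleLength(10)); // 6469693230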
Which gets big fast.
True. But, needless to say, that's not the only way one might try to iterate efficiently over numbers not divisible by the first $j$ primes.
What is complicated about that?
Well, part of what makes the current algorithm work quickly is that it can, given two numbers $m$ and $n$, efficiently determine how many numbers there are in the range between $m$ and $n$ that aren't divisible by the first $j$ primes.
That's what the code in the inner loop of D_2 that we just looked at relies on, in fact.
In this code, that calculation looks like this:
function UnsievedInRange(rangeStart, rangeEnd) {
    // Both endpoints fall inside the first wheel cycle: read the answer
    // straight out of the precomputed wheelTranslation table.
    if (++rangeEnd < wheelCyclePeriod)
        return wheelTranslation[rangeEnd] - wheelTranslation[rangeStart] - 1;
    const b = rangeEnd % wheelCyclePeriod;
    // Only the start falls inside the first cycle: count the whole cycles via
    // the density wheelEntryRatio and finish the partial cycle with the table.
    if (rangeStart < wheelCyclePeriod)
        return (rangeEnd - b) * wheelEntryRatio + wheelTranslation[b] - wheelTranslation[rangeStart] - 1;
    // General case: both endpoints reduced modulo the wheel cycle.
    const a = rangeStart % wheelCyclePeriod;
    return (rangeEnd - b - rangeStart + a) * wheelEntryRatio + wheelTranslation[b] - wheelTranslation[a] - 1;
}
Which does seem quick, yes.
In the more general case, this is computing the Legendre $\phi$ function for a range, which was mentioned earlier in the survey of other prime counting techniques. Computing it when excluding a small number of primes can be done efficiently with precomputed tables, as we're doing here. But it becomes much more non-trivial to compute as the number of excluded primes grows larger.
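For concreteness, here is a rough sketch of how tables like these could be built for a very small wheel excluding just 2, 3, and 5. The indexing conventions in the actual code may differ a bit, so treat this as orientation rather than a drop-in definition:

// Rough, illustrative construction of the wheel tables for the primes 2, 3, 5.
const wheelPrimes = [2, 3, 5];
const wheelCyclePeriod = wheelPrimes.reduce((product, p) => product * p, 1); // 30

// The residues mod 30 not divisible by 2, 3, or 5: 1, 7, 11, 13, 17, 19, 23, 29.
const wheelResidues = [];
for (let r = 0; r < wheelCyclePeriod; r++) {
    if (wheelPrimes.every(p => r % p !== 0)) wheelResidues.push(r);
}
const wheelCycleEntries = wheelResidues.length;               // 8
const wheelEntryRatio = wheelCycleEntries / wheelCyclePeriod; // 8 / 30

// wheelTranslation[x]: how many unsieved residues are <= x within one cycle.
const wheelTranslation = new Array(wheelCyclePeriod).fill(0);
let runningCount = 0;
for (let x = 0; x < wheelCyclePeriod; x++) {
    if (wheelResidues.includes(x)) runningCount++;
    wheelTranslation[x] = runningCount;
}

// wheelOffsets[i]: gap from the i-th unsieved residue to the next one,
// wrapping around the end of the cycle.
const wheelOffsets = wheelResidues.map((r, i) => {
    const next = (i + 1 < wheelCycleEntries)
        ? wheelResidues[i + 1]
        : wheelResidues[0] + wheelCyclePeriod;
    return next - r;
});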
I see.
At any rate, it might be interesting to explore versions of the broader algorithm that counted over ranges excluding more primes than the first 8, but still a relatively small number - say, the first 15 or 20. Alternatively, it might also be interesting to explore versions of this algorithm that counted over ranges excluding much larger primes - say, all primes up to $n^\frac{1}{4}$, $n^\frac{1}{3}$, or even $n^\frac{1}{2}$ - which would make the techniques used to compute the Legendre $\phi$ function much more important, and which would pull the algorithm down the road of the LMO approach.
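As a point of reference for that direction, the textbook recurrence for the Legendre $\phi$ function - where $\phi(x, a)$ counts the numbers up to $x$ not divisible by any of the first $a$ primes - is

$$ \phi(x, a) = \phi(x, a - 1) - \phi\!\left(\left\lfloor \frac{x}{p_a} \right\rfloor, a - 1\right), $$

and the range counts used here are, up to endpoint conventions, just differences of two such values. The recurrence branches rapidly as $a$ grows, which is part of why the LMO family of algorithms needs so much additional machinery to tame it.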
Anything else?
Well, one idea I find compelling is that the core identity used here for prime counting has a distinct family resemblance to Legendre's method of counting primes, the one that has commanded most attention for algorithmic development. As I mentioned a few sections ago, that resemblance is particularly emphasized when the two identities are written out like so:
$$ \begin{darray}{llllll} \pi(n) - \pi(\sqrt n) + 1 = &\sum_{j \le n} 1 &-\sum_{\substack{ p_1 \cdot j_2 \le n \\ p_1 \le \sqrt n}} 1 &+\sum_{\substack { p_1 \cdot p_2 \cdot j_3 \le n \\ p_1 < p_2 \le \sqrt n}} 1 &-\sum_{\substack { p_1 \cdot p_2 \cdot p_3 \cdot j_4 \le n \\ p_1 < p_2 < p_3 \le \sqrt n}} 1 &+ ... \\ \sum_{k=1}^{\lfloor \log_2 n \rfloor} \frac{\pi(n^\frac{1}{k})}{k} = &\sum_{\substack{1 < j \le n}} 1 &-\sum_{\substack{j_1 \cdot j_2 \le n \\ 1 < j_1 \le j_2}} \frac{f(j_1,j_2)}{2} &+\sum_{\substack{j_1 \cdot j_2 \cdot j_3 \le n \\ 1 < j_1 \le j_2 \le j_3}} \frac{f(j_1,j_2,j_3)}{3} &-\sum_{\substack{j_1 \cdot j_2 \cdot j_3 \cdot j_4 \le n \\ 1 < j_1 \le j_2 \le j_3 \le j_4}} \frac{f(j_1,j_2,j_3,j_4)}{4} &+ ... \\ \end{darray} $$

So here, the first identity is Legendre's identity, and the second is the one we've been using, with $f(j_1, j_2,...)$ the combinatorial function we discussed when covering symmetry reduction.
Right.
The styles of counting clearly have deep parallels. The primary difference between the two is that Legendre's identity largely counts over the primes, while the identity we use counts over natural numbers.
Which are easier to count over because natural numbers are much more predictable - but then, there are also a lot more natural numbers than primes, right? So it seems like there is a trade-off there...
Well, yes, although the difference isn't as great as you might think. To a rough approximation, the count of numbers less than some number $n$ that are prime is known to be about $\frac{n}{\log n}$, and $\log n$ grows very, very slowly. And of course, for purposes of counting, it's the small numbers that are most consequential for execution times. So when we use a wheel, we are reduced to counting over primes and over numbers whose prime factors are all at least somewhat larger.
At any rate, as I've mentioned a number of times, a great deal of attention has been spent finding clever ways to rearrange Legendre's identity to improve the possibility of computing it quickly. Although I am passingly familiar with the resulting algorithms, they are by no means simple. But a hope might be that, because of the close parallels between these two identities, perhaps some ideas that have worked for the first algorithm might have fruitful analogs for the identity we've been using.
As a final observation, this algorithm, in its current form, is trivially parallelizable.
Because it uses precomputed, fixed-size tables and otherwise doesn't allocate memory, right?
Right. There really isn't a lot of computation that is dependent on prior results. So at least in its current form, it might be interesting to see how fast this algorithm could be made to execute if ported over to GPUs. It's possible it could be relatively easy to do, and it's also possible (given how fast GPUs are at raw number crunching) that it might be a really good approach for certain ranges of input values, before the big-O factors rear their heads at very large inputs.
How confident are you about this?
Well, I haven't tried it myself, so I'm not sure. On the other hand, the core of this approach hinges on massive amounts of 64-bit integer division, which is not the strength of GPUs, of course. So ultimately going down this road (without any other algorithmic improvements first) might be diverting but is probably something of a dead end. At the very least, though, it might be novel to see how far a prime counting technique that uses essentially no memory could be pushed.
Okay. That sounds like some interesting possible directions. Is there anything else worth noting about this approach?
Sure. With some slight variation, it can be used to sum primes as well as count them.
Okay.
Also, I mentioned a few times in our discussion that I have another approach to the basic identity we've been working with. It does not make use of symmetry reduction, and it is, in general, more fleshed out and much less elegant. It does, however, execute in roughly $O(n^{\frac{2}{3}+\epsilon})$ time and roughly $O(n^{\frac{1}{3}+\epsilon})$ space. I've included a C++ implementation of it, as well as a bit of discussion, here. Like the approach we've already talked about, it has also barely received any attention, so it's possible it could be a good basis for something faster.
Interesting.
More broadly, everything we've worked through here is just the tip of the iceberg. Both of the approaches to counting primes here hinge on finding techniques for computing divisor summatory functions efficiently.
Yeah, that seems right.
Well, it turns out that vastly more ideas from multiplicative number theory can be expressed in terms of divisor summatory functions with the right lens. So as a practical matter, there is a lot of upside to finding more efficient ways to compute those particular functions. I have a discussion about this topic here.
Anything else?
That should be about it. Thanks for reading through all this, and I hope I managed to get my ideas across relatively clearly. I've included two last sections that cover the history of the development of these algorithms if you have any interest in such things, as well as a description of the other prime counting algorithm based on this identity.
P. S. I have a lot more incomplete animations, live code, and other number theory and math explorations like this that I'm trying to put out publicly. But I'm struggling to find a way to make this work, just in terms of time and resources. If you happen to be on friendly terms with any wealthy benefactors / tech moguls / foundation heads, or are yourself such a person, and this work is the sort of thing you'd appreciate supporting, let me know!
Nathan McKenzie
August 2025
And a big thanks to the following indispensable JavaScript libraries: KaTeX, rainbow, d3, function-plot, math, and vis