So originally you wanted to know if the approach to prime counting described here is the fastest way to do so.
Right. Or, well, I guess more generally, whether any approach based on the identities $$p_k(n,j)=\rec{\frac{1}{k}-p_{k+1}(\frac{n}{j},2) + p_k (n, j+1)}$$ $$\pi(n) = \sum_{k=1}^{\lfloor \log_2 n \rfloor} \frac{\mu(k)}{k} \cdot p_1(n^\frac{1}{k},2)$$ could be the fastest.
To answer that in a useful way, I'll need to provide some context first. And to do that, I need to summarize the other well known fast prime counting approaches.
Okay.
So, as far as I'm aware, there are two major lineages of prime counting algorithms out in the wild that have received a lot of attention and development from sharp mathematicians. As is often the case with such algorithms, they start with important core identities, and then smaller and smaller optimizations are found that either change the big-O by $\log$ factors or improve actual computational performance, all at the expense of vastly more complicated implementations.
Okay. So what are they?
The first major approach starts by taking the core idea of the Sieve of Eratosthenes and restating it in arithmetic. This was originally done by Adrien-Marie Legendre, I believe.
Restating it in arithmetic? What does that even mean?
Well, if you've ever encountered a description of the Sieve of Eratosthenes, the algorithm was probably described in words, as a kind of recipe, as algorithms often are. But it turns out, with some cleverness, the core idea of the Sieve of Eratosthenes can be captured via arithmetic instead. The French mathematician Legendre captured a version of these ideas in a certain recursive function that uses the inclusion-exclusion principle to count how many numbers remain in a range after sieving by a fixed set of primes.
I don't want to dwell on these ideas, because they're well covered in lots of other places. I'll just note that, given the following block of JavaScript code, as long as $n \le 10000$, you can use $g(n)$ to compute the count of primes up to $n$, minus the count of primes up to $\sqrt n$, plus one, and that value is clearly computed without factoring numbers one by one. Which is to say, primes are also counted here without identifying individual primes explicitly.
X_p=[0,0,1,1,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,
0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,
0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0]
f=(m,n,j=2)=>(n<j||m<j*j)?0:X_p[j]*(1-f(m,n/j,j+1))+f(m,n,j+1)
g=(n,j=1)=>n<j?0:1-f(n,n/j)+g(n,j+1)
Once again, this code seems unnecessarily dense.
Sure.
But you are welcome to press F12 and run it, which I think is handy.
And as an example, if you run that code and then type in g(100), you should see the output of $22$, which is indeed how many primes are $\le 100$, minus the count of primes no bigger than $10$, plus one. Or if you try g(1000), you should see $158$, which is the count of primes no bigger than $1000$ minus the count of primes no bigger than $\sqrt{1000}$, plus one.
At any rate, if it helps to see the idea in more typical math notation, it could be written like so. Suppose you fix some constant $m$. Then the following definitions can be used to count primes:
$$\chi_P(j) = \begin{cases}1 & \text{if } j \text{ is prime} \\ 0 & \text{otherwise}\end{cases}$$ $$f(n, j) = \begin{cases} \chi_P(j) \cdot \left(1 - f\left(\frac{n}{j}, j+1\right)\right) + f(n, j+1) & \text{if } n \geq j \text{ and } m \geq j^2 \\ 0 & \text{otherwise} \end{cases}$$ $$g(n, j) = \begin{cases} 1 - f\left(\frac{n}{j}, 2\right) + g(n, j+1) & \text{if } n \geq j \\ 0 & \text{otherwise} \end{cases}$$ $$\pi(m) = \pi(\sqrt m ) - 1+g(m,1)$$
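As a quick worked check of that last line, using the values from the code above: with $m = 100$ we have $g(100, 1) = 22$ and $\pi(\sqrt{100}) = \pi(10) = 4$, so $\pi(100) = 4 - 1 + 22 = 25$, which is correct.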
Look carefully and you should see that this approach only relies on primes up to $\sqrt m$ to count the primes up to $m$.
The array named X_p keeps track of whether each number $\le 100$ is prime, which is why this code theoretically (barring implementation issues) works for input up to $100^2$, which is $10000$.
Written that way, it seems like the definitions for $f(n,j)$ and $g(n,j)$ have a very striking visual similarity to the earlier definition for $p_k(n,j)$.
Well, I rewrote the usual definition of the Legendre phi function pretty substantially here - normally you'll find it denoted $\phi(n,j)$ - to emphasize exactly that fact. It's typically written quite a bit differently, which you'll immediately see if you look for other references for the Legendre phi function.
Animating this approach to prime counting likewise stresses how much this style of prime counting has in common with $p_k(n,j)$, I think. It also highlights the combinatorial idea at the heart of this approach.
You're just counting a different style of divisor, right? And employing a different combinatorial identity?
Right. And make sure to click the "Order by Δ" button, too, and watch the resulting animation. Exactly as with the main idea we've been working with, reordering Legendre's idea in this way gives it a much closer resemblance to trial division, and it might help give you a better sense of why this approach works to count primes. This alternate ordering keeps exactly the same divisors. It just builds the tree of divisors differently.
So this is exactly like the identity we've been working with in that regard.
Right. And you could count primes this way as well, but it would be slow in the way that trial division is, because it would rely on divisibility tests. In fact, if you perform trial division but, by keeping a cache of the primes up to $\sqrt n$ around, test only those primes instead of all whole numbers, it looks rather close to this approach. Once again, it's the fact that you can count these divisors on their own terms, without needing to perform any divisibility tests on the numbers they're factored from, that opens the door to much higher efficiency with the combinatorial idea here.
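To make that trial-division style concrete, here's a rough sketch of it - counting primes by testing each candidate only against a cached list of the primes up to $\sqrt n$. The name is just for illustration:
countByTrialDivision = (n) => {
  const limit = Math.floor(Math.sqrt(n));
  // build the cache of primes <= sqrt(n), itself by simple trial division
  const smallPrimes = [];
  for (let m = 2; m <= limit; m++) {
    if (smallPrimes.every(p => m % p !== 0)) smallPrimes.push(m);
  }
  // m is prime exactly when no cached prime p with p*p <= m divides it
  let count = 0;
  for (let m = 2; m <= n; m++) {
    if (smallPrimes.every(p => p * p > m || m % p !== 0)) count++;
  }
  return count;
};
// countByTrialDivision(100) gives 25, countByTrialDivision(1000) gives 168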
So this really is very similar to the approach we've been using, isn't it? Except this one requires all the primes up to $\sqrt n$, which is significant for questions of memory bounds...
Exactly. Anyway, to emphasize the visual similarity: with a bit of work, both of those expressions, Legendre's function and the weighted count of prime powers we started with, can also be rewritten like so
$$ \begin{darray}{llllll} \pi(n) - \pi(\sqrt n) + 1 = &\sum_{j \le n} 1 &-\sum_{\substack{ p_1 \cdot j_2 \le n \\ p_1 \le \sqrt n}} 1 &+\sum_{\substack { p_1 \cdot p_2 \cdot j_3 \le n \\ p_1 < p_2 \le \sqrt n}} 1 &-\sum_{\substack { p_1 \cdot p_2 \cdot p_3 \cdot j_4 \le n \\ p_1 < p_2 < p_3 \le \sqrt n}} 1 &+ ... \\ \sum_{k=1}^{\lfloor \log_2 n \rfloor} \frac{\pi(n^\frac{1}{k})}{k} = &\sum_{\substack{1 < j \le n}} 1 &-\sum_{\substack{j_1 \cdot j_2 \le n \\ 1 < j_1 \le j_2}} \frac{f(j_1,j_2)}{2} &+\sum_{\substack{j_1 \cdot j_2 \cdot j_3 \le n \\ 1 < j_1 \le j_2 \le j_3}} \frac{f(j_1,j_2,j_3)}{3} &-\sum_{\substack{j_1 \cdot j_2 \cdot j_3 \cdot j_4 \le n \\ 1 < j_1 \le j_2 \le j_3 \le j_4}} \frac{f(j_1,j_2,j_3,j_4)}{4} &+ ... \\ \end{darray} $$And here, $f(j_1, j_2,...)$ is the combinatorial function, essentially a multinomial coefficient, that we discussed when covering symmetry reduction. I think this form really does emphasize the conceptual similarities and stresses the combinatorial inclusion-exclusion principle each idea is harnessing. And remember, our fastest version of the second identity counts over numbers whose smallest prime factor is 23 or greater. So in practice, each $j_k$ term here is either a prime or a number with fairly large prime factors, making the two approaches even more similar in practice.
That makes sense.
Anyway, much like with $p_k(n,j)$ above, the definitions for $f(n,j)$ and $g(n,j)$ count primes, but they don't do so efficiently at all. A much more reasonable, basic implementation of this approach might look something like this:
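Say, a rough, unoptimized sketch along these lines - cache the primes up to $\sqrt n$ with a small sieve, then apply Legendre's $\phi(x, a)$ recurrence (the names here are just for illustration):
// Rough sketch: pi(n) = phi(n, a) + a - 1, where a = pi(sqrt(n)) and
// phi(x, a) counts the integers in [1, x] with no prime factor among the
// first a primes.
legendrePi = (n) => {
  if (n < 2) return 0;
  const limit = Math.floor(Math.sqrt(n));
  // Sieve of Eratosthenes to cache the primes up to sqrt(n)
  const composite = new Uint8Array(limit + 1);
  const primes = [];
  for (let i = 2; i <= limit; i++) {
    if (!composite[i]) {
      primes.push(i);
      for (let j = i * i; j <= limit; j += i) composite[j] = 1;
    }
  }
  // phi(x, a) = phi(x, a - 1) - phi(floor(x / p_a), a - 1)
  // (no memoization or other optimizations here - real implementations add
  // those - but it's fine for modest n)
  const phi = (x, a) => {
    if (x < 1) return 0;
    if (a === 0) return Math.floor(x);
    if (x < primes[a - 1]) return 1; // only the number 1 survives the sieving
    return phi(x, a - 1) - phi(Math.floor(x / primes[a - 1]), a - 1);
  };
  return phi(n, primes.length) + primes.length - 1;
};
// legendrePi(100) gives 25, legendrePi(1000) gives 168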
And so that code is quite similar to the approach we worked through, although this approach does require $O(\frac{\sqrt n}{\log n})$ memory to cache those primes. And it seems reasonably fast.
True. As I say, there is an obvious family resemblance between the two approaches.
Could we apply wheel factorization here for a huge speedup again?
Well, let's look at that briefly. To give a more apples-to-apples comparison, you could apply wheel factorization to this approach as well, to draw closer to our final approach. Code to do so can be found here, and timings for that approach using wheel factorization to exclude the first 6 primes look like this:
Interesting. But it seems to have much less impact, apparently?
Apparently. Which should be unsurprising, if you think about it. Here, we were already counting over primes prior to the wheel. So this addition just excludes a few specific primes from the counting.
Oh, right. In our previous context, the wheel threw out a huge percentage of numbers from our counting process. So that impact was vastly greater.
Right. With all that said, even disregarding this discussion, Legendre's approach has sums of the form $\sum_{\substack { p_1 \cdot p_2 \cdot j_3 \le n \\ p_1 < p_2 \le \sqrt n}} 1$, versus the equivalent $\sum_{\substack{j_1 \cdot j_2 \cdot j_3 \le n \\ 1 < j_1 \le j_2 \le j_3}} \frac{f(j_1,j_2,j_3)}{3}$ from the main approach here. The constraints on the two styles of counting variables differ in important ways. My sense is that the constraints on the variables in the main approach here are considerably more favorable in how they affect the running time of their sums... but this topic really warrants a more rigorous discussion. Instead of that, let's just move on.
So this form of Legendre's idea is really just the beginning. A huge amount of effort has been put in by very smart people to develop this basic combinatorial idea here into much faster ways to count primes.
So see here and here for early improvements to algorithms based on this basic idea. These ideas still kept the computational time of the approach in the rough range of $O(n)$, with more favorable $\log$ factors and reduced memory usage.
All that changed with the publication in 1985 of the Meissel, Lehmer, Lagarias, Miller, Odlyzko (whew) approach, which significantly evolves and speeds up this way of counting primes. This algorithm counts primes in, very roughly, $O(n^\frac{2}{3})$ time and $O(n^\frac{1}{3})$ space.
Wow, that's quite a jump in calculation speed.
It was, yes. The original paper can be found here. And improvements to it can be found here and here and here. Incidentally, many of these articles cover the history of this school of approaches much better than I could.
Extremely fast implementations of these approaches can be found in this software. To a very rough approximation, the fastest of these algorithms that have been used in the wild run in something like $O(n^\frac{2}{3})$ time and $O(n^\frac{1}{3})$ space, with some $\log$ factors that I'm omitting here.
At the time of this discussion, that software has been used to determine $\pi(10^{29})$, which is to say, the count of primes $\le 100,000,000,000,000,000,000,000,000,000$.
That must've been a huge computational effort.
Well, you can read more about what went into that here, but yeah, it sounds pretty extreme. From the announcement there, Kim Walisch says
So these are not trivial computational challenges, to say the least.
Anyway, this is usually referred to as the combinatorial approach to counting primes.
I see.
Let me note two more variations of this idea.
Go ahead.
The first goes by the cheerful name "Lucy's Algorithm + Fenwick Trees". You can find it here. The article where it is implemented suggests it runs in something like $O(n^{\frac{2}{3}+\epsilon})$ time and $O(n^{\frac{1}{3}+\epsilon})$ space, making it roughly comparable to the above techniques.
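If it helps to have something concrete to play with, the basic form of that idea - without the Fenwick tree layer the article adds on top - can be sketched roughly like this. It should run in something like $O(n^\frac{3}{4})$ time and $O(n^\frac{1}{2})$ space, and the name is just for illustration:
// Rough sketch of the basic sieve-over-quotients idea: S[v] starts as the
// count of integers in [2, v]; after processing each prime p <= sqrt(n) it
// holds the count of integers in [2, v] not removed by sieving with primes <= p.
lucyPi = (n) => {
  const r = Math.floor(Math.sqrt(n));
  // every distinct value of floor(n / i), in decreasing order
  const V = [];
  for (let i = 1; i <= r; i++) V.push(Math.floor(n / i));
  for (let v = V[V.length - 1] - 1; v >= 1; v--) V.push(v);
  const S = new Map();
  for (const v of V) S.set(v, v - 1);
  for (let p = 2; p <= r; p++) {
    if (S.get(p) > S.get(p - 1)) {        // p is prime
      const sp = S.get(p - 1);            // number of primes below p
      for (const v of V) {
        if (v < p * p) break;
        S.set(v, S.get(v) - (S.get(Math.floor(v / p)) - sp));
      }
    }
  }
  return S.get(n);
};
// lucyPi(1000) gives 168
As I understand it, the Fenwick tree in the linked article is there to speed up exactly those bookkeeping updates; the sketch above just performs them directly.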
Got it. And the other variation?
Well, interestingly, a new approach related to this general idea, from Hirsch, Kessler, and Mendlovic, was published not too long ago; it is said to run in $O(n^\frac{1}{2})$ time and $O(n^\frac{1}{2})$ space or, alternatively, in $O(n^\frac{8}{15})$ time and $O(n^\frac{1}{3})$ space. So that is, needless to say, really fascinating too, although as far as I know it hasn't been iterated on to the degree that these other approaches have. I honestly feel negligent here not saying more about this approach, but it was published after most of this dialogue was written, and I haven't yet spent the time and attention with it that I would like.
Okay. So that's the first major approach. What's the second?
The second major approach is... well, much less accessible, to be honest. It starts by applying an inverse Mellin transform to a certain expression involving the logarithm of the Riemann zeta function, which looks like this:
$$ \sum_{k=1}^{\lfloor \log_2 n \rfloor} \frac{\pi(n^\frac{1}{k})}{k} = \lim_{t \to \infty} \frac{1}{2 \pi i } \int_{2-i t}^{2 + i t} \log \zeta(s) \frac{n^s}{s} \, d s $$And then the core challenge of this approach is finding different ways to compute that contour integral, as well as finding ways to alter the inside of the integral to make evaluation more efficient while still counting primes correctly, all while dealing with precision issues... or at least, that's my general understanding of this approach. It's not one I've spent time implementing.
... I don't know where to begin. I don't think I can follow any of that.
That's okay. Complex analysis is beautiful and amazing, but it also can be quite daunting, and this won't make sense if you don't have a background in it. It's also a much rockier fit for computers; unlike the previous algorithm, I can't really simply demonstrate even the most basic idea here. There's no easy bit of highly dense JavaScript I can type here so you can play with this approach.
I guess not. That's too bad.
Still, if you're interested in more discussion about this approach, you can look here, here, and here.
Alternatively, Crandall and Pomerance's Prime Numbers: A Computational Perspective has good explanations of both this approach and the combinatorial approach from the previous section, late in the third chapter of their book.
On paper, these algorithms run, very roughly, in something like $O(n^\frac{1}{2})$ time and $O(n^\frac{1}{4})$ space, or $O(n^\frac{3}{5})$ time and $O(\epsilon)$ space. In practice, I believe they've often been very difficult to implement and introduce challenges in dealing with precision. And of course it's not unknown for constant factors to dwarf big-O concerns in some circumstances, too, if those factors are big enough.
Is it strange that the approach we cover here, based on $p_k(n,j)$, has nothing to do with these other two techniques?
Well, it might seem that way at first glance. But watch this. If I take the original expression for $p_k(n,j)$ we started with, I can generalize it by weighting each term with a factor of $j^{-s}$, where $s$ is a complex parameter. And then it looks like this:
$$p_k(n,j)=\rec{\rr{j^{-s}}\cdot(\frac{1}{k}-p_{k+1}(\frac{n}{j},2)) + p_k (n, j+1)}$$Now, what happens if $s$ is $0$?
I suppose we still have our weighted count of prime powers, just as we started with, right? Like this: $$p_1(n,2) = \sum_{k=1}^{\lfloor \log_2 n \rfloor} \frac{\pi(n^\frac{1}{k})}{k}$$
Exactly. And note that that's precisely the left hand side of the inverse Mellin transform we just looked at.
On the other hand, if the real part of $s$ happens to be greater than $1$, then the limit as $n$ goes to $\infty$ will converge. And in that case,
$$\lim_{n \to \infty} p_1(n,2) = \log \zeta(s)$$Which is the most important term on the right hand side of the inverse Mellin transform. So in that sense, this approach is related to the one I began this discussion with.
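If you want to check that numerically, here is a rough, naive sketch of the generalized recursion - written with an explicit loop over $j$ rather than tail recursion, just to keep the call stack shallow, and with an illustrative name - that you can paste into the console:
// p1(n, s) is the generalized p_1(n, 2): each term in the sum over j picks up
// a factor of j^(-s)
p1 = (n, s, k = 1) => {
  let total = 0;
  for (let j = 2; j <= n; j++) {
    total += Math.pow(j, -s) * (1 / k - p1(n / j, s, k + 1));
  }
  return total;
};
// With s = 0 this reproduces the weighted prime power count we started with:
// p1(100, 0) gives 25 + 4/2 + 2/3 + 2/4 + 1/5 + 1/6 = 28.5333...
// With s = 2, log(zeta(2)) = log(pi^2/6) = 0.4977..., and the truncated sum
// at n = 1000 should already agree to about three decimal places:
console.log(p1(1000, 2), Math.log(Math.PI * Math.PI / 6));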
Oh. Huh. It's almost like $p_k(n,j)$ combines aspects of both of the existing approaches. Its style is mechanically combinatorial, in a way quite reminiscent of Legendre's approach. But the identity it uses is clearly more conceptually related to the contour integral of $\log \zeta(s)$.
Well, while that glosses over important details, I suppose that's one way to look at it all, yes.
Anyway, returning to the subject of this contour integral, at least one paper referenced here says its variant was used to compute $\pi(10^{25})$, which is a huge number.
The approach here is often referred to as the analytic approach to counting primes.
Okay. That was a nice whirlwind survey of well-known prime counting algorithms. But it didn't answer my original question, the one you said was complicated... How fast can prime counting approaches based on the idea from the very start of this discussion be, starting from that simple recursive idea we labeled $p_k(n,j)$?
Well, the fastest algorithm asymptotically based on the approach I describe here that I know of runs in something like $O(n^{\frac{2}{3}+\epsilon})$ time and $O(n^{\frac{1}{3}+\epsilon})$ space. I'm omitting some log factors which do, of course, matter in the long run. So it is, on paper anyway, capable of being competitive with some of the algorithms detailed above.
But that doesn't seem all that complicated at all; you could've just said that to begin with.
There's a few asterisks here, though.
In what way?
Well, the other algorithms detailed just now, based on the Sieve of Eratosthenes or the analytic method, have had huge amounts of effort and energy poured into them over the years by a lot of smart people. So in practice they are much faster, at a purely constant-factor level. They have much more robust implementations and clever computational techniques applied to them.
So you're saying the algorithms detailed here haven't had anywhere near the same effort and attention put into them, right?
Well, yes, definitely. Or as far as I know, anyway.
More specifically, I worked out the details of this $O(n^{\frac{2}{3}+\epsilon})$ time and $O(n^{\frac{1}{3}+\epsilon})$ space approach to prime counting back in the early winter of 2011... but in practice, this was the endpoint of a gradual evolution of different algorithmic ideas that had really started back in 2004, with two main branches that don't actually overlap. My original branch from 2004, based especially on symmetry reduction, was much closer to the approach we evolved in code over this series of dialogues. At any rate, for this asymptotically faster 2011 branch - which seems much worse in its constant factors and uses much more memory - I wrote the ideas up and then, essentially, dumped them online with very little fanfare. I never quite figured out how to work through the process of submitting the algorithm officially to any journals, despite implementing it and knowing that it works. And that's the publishing history of this later approach.
Okay.
But the situation is actually a bit more complicated than that, still. In fact, it's largely why I've taken this long winded approach to writing through all this. I say I have an evolution of this style of prime counting that runs in something like $O(n^{\frac{2}{3}+\epsilon})$ time and $O(n^{\frac{1}{3}+\epsilon})$ space.
Sure.
That's true. But like most such algorithms, it's already quite tricky to implement and work with. It's possible and even likely that it could be sped up further, but it's already been developed enough that it's a pain to work with. If I may make something like an aesthetic value judgment, it's a clever and tricky approach, but I don't think its core is beautiful or elegant. I'll include an implementation and description of it at the end of these dialogues, but it wasn't the approach we evolved here, and it's not the approach I'm drawn to.
Aesthetically, you mean.
Right. Instead, I'm aesthetically more drawn to the approach we worked through here. The core of it is, I think, much more elegant and potentially interesting.
Okay... well, now that we've worked through that algorithm and then this survey of other prime counting approaches, can you try to contextualize it?
In its current form, this is what I can say. It has surprisingly good constant-factor performance and, as we saw, is comparatively easy to implement. And it uses a negligible amount of RAM at run time - and thus, in its current form, is relatively simple to parallelize. Our implementation was a single-threaded JavaScript prime counting function computing $\pi(10^{14})$ in about 5.5 seconds in the browser on my decent laptop from 2017. It used a fixed-size, read-only factor wheel cache (initialized at startup) that took 40 megs of RAM. And otherwise, the algorithm itself only used its call stack space as it executed, with a maximum recursive function depth of something like $\frac{\log n}{\log 23}$, which, as I say, is negligible.
Right. But what cracks have started showing?
Well, by this point, when $n=10^{14}$, increasing the value of $n$ by a factor of $10$ increases the execution time by a factor of 6.6, and that factor is inching upwards with each power of 10. I'm not great at rigorous algorithmic analysis. But an algorithm that executes in $O(n^\frac{2}{3})$ time should, very roughly, take about $10^\frac{2}{3} \approx 4.6$ times as long to execute when the input is increased by a factor of 10. So this algorithm, in its current form, isn't in that tier. It's slower than that. And theoretically, it is probably the case that this algorithm as written ultimately has something like $O(n)$ time and $O(\log n)$ space bounds. It just happens to have really surprisingly good constant-factor performance with a wheel applied. But that run time growth is obviously a problem as $n$ grows larger.
Ah, I see.
Right. This is why I'm stressing the fact that this approach, as currently written, doesn't rely in any significant way on memory allocation. Because all the other major prime counting approaches rely heavily on reorganizing computation and caching results to speed up computation in different ways. They also, generally, take advantage of extra sub-identities in certain contexts to find ways to minimize computation. It's these additions that make those algorithms both fast to execute and much more complicated to implement. The approach I detailed here hasn't gone down that road yet, which is to say, I think it's quite underdeveloped compared to what might be possible with it. I feel as though this approach is less a peer of the most highly developed descendants of LMO, and more a peer of Legendre's original approach, but one that essentially doesn't use any memory or rely on primes. Anyway, my hunch is that it is about one bright idea away from becoming a very, very fast way to count primes. But at this point, I've hit a brick wall and am coming up short. So I'm hoping someone else might find this idea interesting to pick up and run with.
Hmm. Okay, I guess I follow that.
Anyway, enough discussion. Let's see how our implementation actually compares to other live prime counting techniques.