Okay. So at the beginning of this discussion, you mentioned you had a second algorithm that ran with more favorable asymptotic timings.
That's true. I have indeed developed a different such prime counting algorithm, and it does run in something like $O(n^{\frac{2}{3}+\epsilon})$ time and $O(n^{\frac{1}{3}+\epsilon})$ space, omitting log factors.
That surely seems interesting, right? Why have you given it such short shrift here? Isn't that more noteworthy than the other algorithm?
Well, as a practical matter, the algorithm I'm about to detail is much more complicated and harder to write about in a relatively breezy (at least in the context of number theory) way. There are a lot of weeds to get down into.
But it also centers on a hunch on my part, or something like an aesthetic intuition, you might say. The approach here is probably interesting, and it plausibly warrants further attention and development. It's quite likely it could be further developed in exactly the same way that further attention to the LMO algorithm has resulted in various incremental gains. But it feels, to me, like there's something significantly more natural about the other approach I already covered in previous discussions, and I suppose I stand by my broad sense that it's about one insightful idea away from being a really astonishingly fast way to count primes. And in particular, the approach here doesn't rely on symmetry reduction, and it has a lot more in common with existing algorithmic evolutions discussed when we covered other prime counting techniques.
Fair enough, I guess. Well, can you at least give a general sense of the idea?
Sure. In an earlier discussion, we defined a strict divisor summatory function as $$D'_{ 1}( n ) = \sum_{t=2}^n 1 \,\,\,\,\,\,\,\,\,\,\,\,\,\, D'_{ 2 }( n ) = \sum_{t=2}^n \sum_{u=2}^{\lfloor \frac{n}{t} \rfloor} 1 \,\,\,\,\,\,\,\,\,\,\,\,\,\, D'_{ 3 }( n ) = \sum_{t=2}^n \sum_{u=2}^{\lfloor \frac{n}{t} \rfloor} \sum_{v=2}^{\lfloor \frac{n}{t \cdot u} \rfloor} 1$$ and so on. And we noted that we could count prime powers in turn as $$\Pi( n )= \sum_{k = 1}^{ \lfloor \frac{\log n}{\log 2} \rfloor } \frac{( -1 )^{ k+1 }}{k}\cdot D'_{ k }( n )$$ and then use Möbius inversion to get the count of primes.
Right. And so in turn we used symmetry reduction to compute $D_k'(n)$ more quickly.
Correct. Well, there are other combinatorial approaches to computing $D_k'(n)$ efficiently. So in the approach that follows, given some constant $a$ such that $1 < a < n$, and given that $d_k'(n) = D_k'(n) - D_k'(n-1)$, it is the case that
Okay. Not too bad.
Right. So we start with that. This idea is actually perhaps reminiscent of the combinatorial identities of Vaughan and Heath-Brown, in slightly different number theoretic contexts. Anyway, we then follow standard computational number theory practice and take advantage of redundancies of sums of the form $\sum_{j=2}^n f(\lfloor \frac{n}{j} \rfloor)$ (which is satisfied if $f(n)$ only changes at whole number values). Sums of that form can be split into two, much shorter sums of roughly $\sqrt n$ terms each,
And so if we apply that to our new equation for $D_k'(n)$, we have my final, fast equation for computing $D_k'(n)$:
$$\begin{aligned} D_k'(n) &= \sum_{j = {a+1}}^{\lfloor \sqrt{n} \rfloor} D_{k-1}'( \lfloor\frac{n}{j}\rfloor ) + \sum_{j = 1}^{\lfloor \frac{n}{\lfloor \sqrt{n} \rfloor } \rfloor -1} (\lfloor \frac{n}{j} \rfloor-\lfloor \frac{n}{j+1} \rfloor) \cdot D_{k-1}'(j) \\ &+ \sum_{j = 2}^a d_{k-1}'(j) D_1'( \lfloor\frac{n}{j}\rfloor ) \\ &+ \sum_{j = 2}^a \sum_{m = 1}^{k-2} d_m'(j) \left[ \sum_{s = { { \lfloor \frac{a}{j} \rfloor}+1}}^{\lfloor\sqrt{\frac{n}{j}}\rfloor} D_{k-m-1}'( \lfloor\frac{n}{js}\rfloor ) + \sum_{s=1}^{\lfloor\frac{n}{j} \cdot \frac{1}{ \lfloor\sqrt{\frac{n}{j}}\rfloor}\rfloor-1} (\lfloor\frac{n}{js}\rfloor - \lfloor\frac{n}{j(s+1)}\rfloor) \cdot D_{k-m-1}'(s) \right] \end{aligned}$$So that is getting thornier.
That's true. Still, if we set $a=n^\frac{1}{3}$, and if we treat those internal $d_k'(n)$ terms and $D_k'(n)$ terms as black boxes, we'll see that the above definition can be computed in our time and space bounds - the triple sum on the last line is the culprit for this, in fact, and that line is also what makes $a=n^\frac{1}{3}$ optimal. And, usefully, we'll also see that the sum relies on no $D_k'(n)$ terms larger than $D_k'(n^\frac{2}{3})$ and no $d_k'(n)$ terms larger than $d_k'(n^\frac{1}{3})$.
And so then we could use this faster definition to compute $\Pi(n)$ in terms of $D_k'(n)$ and thus count primes...
Exactly. But we do need to account for those internal $D_k'(n)$ terms, up to $D_k'(n^\frac{2}{3})$. So, the next trick to make it actually compute efficiently is to use a segmented Sieve of Eratosthenes to compute all values of $D_k'(n)$ up to $n^\frac{2}{3}$. The segmented sieve can already be done in the time and space bound.
Computing those values of $D_k'(n)$, if you already have a segmented sieve, isn't too hard. If you have the full prime factorization of any given whole number $n$, which a modified Sieve of Eratosthenes can give you, then the regular divisor function can be computed quickly as $$\displaystyle d_k(n) = \prod_{p^a |n} \frac{k \cdot (k + 1) \cdot ... \cdot (k+a-1)}{a!} $$ And then you can compute $$d_k'(n) = \sum_{j=0}^k (-1)^{k-j} \cdot \binom{k}{j} \cdot d_j(n)$$ and finally $$D_k'(n) = D_k'(n-1) + d_k'(n)$$
To make this algorithm work in its space bounds, you have to interleave this segmented sieving process with the gradual computation of the above fast definition for $D_k'(n)$, and all of this does require even more reorganization and bookkeeping. So that's a general high level overview of this idea.
I see. It does seem like $D_k'(n)$ likely has all sorts of combinatorial identities that could be used to speed up computation if you were sufficiently clever.
I think that's right; in fact, I'm hopeful that this approach could be fruitful in exactly that regard.
Anyway - skip this if you find such things a bit too self-indulgent - turning to a bit of personal history, here are a few more details about the evolution of this algorithm. A more formal write-up of this algorithm can be found in this older, more formal article I wrote up in late 2011, here in .pdf format. At the time of this conversation I've never submitted it to a journal. I wrote it up long enough ago that it was created in OpenOffice, due to my obliviousness of LaTeX at the time, to give a sense of how dusty this file is.
There is a working C++ implementation of the algorithm at the bottom of this web page. It is not, however, robust, and so it stops returning correct values for input somewhere below $10^{11}$. The implementation of the algorithm also has had almost no work put into optimizing it at a constant time level.
So it's really just a rough proof of concept, then.
That's fair, yes. Timings on my machine if I run this algorithm in its current form look something like this:
This algorithm bears more than a passing resemblance to a different, earlier algorithm for computing the Mertens function in $O(n^{\frac{2}{3}+\epsilon})$ time and $O(n^{\frac{1}{3}+\epsilon})$ space by Deleglise and Rivat, which might be less surprising if you know that the prime counting function $\Pi(n)$ and the Mertens function are very closely related. As a matter of fact, the Mertens function can be expressed, using the terminology above, as $ M( n )= \sum_{k = 0} {( -1 )^{ k }}\cdot D'_{ k }( n )$. This connection is worth keeping in mind, too, as it suggests techniques for computing the Mertens function, like this fascinating $O(n^{\frac{3}{5}+\epsilon})$ time, $O(n^{\frac{}{}+\epsilon})$ space approach from Helfgott and Thompson, might have productive analogues to computing the count of primes, too, with a bit of finagling.
Oh, interesting.
Actually, although I already had all the core identities for this approach worked out prior to reading Deleglise and Rivat's paper, specifically the above final fast identity for computing $D_k'(n)$, it was reading their paper in the early winter of 2011 that jostled me out of a thoughtless assumption and made me realize how to make this approach work.
So you were more engaged with other research by the that point, then.
To an extent, yes. As a bit of a historical paper trail, here are file system snapshots of the process of me developing my actual implementation of this algorithm.
After realizing how to make this approach work, I took to Mathoverflow on January 25, 2011 and asked the following leading question, which is really my only public paper trail from the time. Later that summer, I wrote up a giant, unfocused document, put it on my private website, and then did a drive-by announcement of that fact on Mersenne Forum on August 25, 2011 here, with a Wayback link of that here.
I might, at some point, attempt to describe the ideas in this algorithm in an easier-to-read fashion, but for now the dusty old .pdf article will have to do.
A proof of concept C++ implementation looks like this:
#include "stdio.h"
#include "stdlib.h"
#include "math.h"
#include "conio.h"
#include "time.h"
typedef long long u64;
static u64 mu[] = { 0, 1, -1, -1, 0, -1, 1, -1, 0, 0, 1, -1, 0, -1, 1, 1, 0, -1, 0, -1, 0, 1, 1, -1, 0, 0, 1, 0, 0, -1, -1, -1,
0, 1, 1, 1, 0, -1, 1, 1, 0, -1, -1, -1, 0, 0, 1, -1, 0, 0, 0, 1, 0, -1, 0, 1, 0, 1, 1, -1, 0, -1, 1, 0, 0, 1, -1, -1, 0, 1, -1,
-1, 0, -1, 1, 0, 0, 1, -1, -1, 0, 0 };
static u64* binomials; /* This is used as a doubly subscripted array, 128x128. Indexing is done manually.*/
static u64 nToTheThird;
static u64 logn;
static u64 numPrimes;
static u64* primes;
static u64* factorsMultiplied;
static u64* totalFactors;
static u64* factors; /* This is used as a doubly subscripted array, n^1/3 x ln n. Indexing is done manually.*/
static u64* numPrimeBases;
static u64* DPrime; /* This is used as a doubly subscripted array, n^1/3 x ln n. Indexing is done manually.*/
static u64 curBlockBase;
static double t;
static u64 nToTheHalf;
static u64 numDPowers;
static double* dPrime;
static u64 S1Val;
static u64 S1Mode;
static u64* S3Vals;
static u64* S3Modes;
static bool ended;
static u64 maxSieveValue;
static u64 ceilval;
static u64 n;
u64 binomial(double n, int k) {
double t = 1;
for (int i = 1; i <= k; i++) {
t *= (n - (k - i)) / i;
}
return u64(t + .1);
}
static u64 invpow(double n, double k) {
return (u64)(pow(n, 1.0 / k) + .00000001);
}
/* See http://www.icecreambreakfast.com/primecount/primecounting.html#ch5 for a description of
calculating d_k'(n) from a complete factorization of a number n.*/
static u64 d1(u64* a, u64 o, u64 k, u64 l) {
u64 t = 1;
for (u64 j = 0; j < l; j++)
t *= binomials[(a[o * logn + j] - 1 + k) * 128 + a[o * logn + j]];
return t;
}
/* See http://www.icecreambreakfast.com/primecount/primecounting.html#ch5 for a description of
calculating d_k'(n) from a complete factorization of a number n.*/
static u64 d2(u64* a, u64 o, u64 k, u64 l, u64 numfacts) {
if (numfacts < k) return 0;
u64 t = 0;
for (u64 j = 1; j <= k; j++)
t += ((k - j) % 2 == 1 ? -1 : 1) * binomials[k * 128 + j] * d1(a, o, j, l);
if (t < 0) {
int asdf = 9;
}
return (u64)t;
}
static void allocPools(u64 n) {
nToTheThird = (u64)pow(n, 1.0 / 3);
logn = (u64)(log(pow(n, 2.00001 / 3)) / log(2.0)) + 1;
factorsMultiplied = new u64[nToTheThird];
totalFactors = new u64[nToTheThird];
factors = new u64[nToTheThird * logn];
numPrimeBases = new u64[nToTheThird];
DPrime = new u64[(nToTheThird + 1) * logn];
binomials = new u64[128 * 128 + 128];
for (u64 j = 0; j < 128; j++) for (u64 k = 0; k <= j; k++)binomials[j * 128 + k] = binomial(j, k);
for (u64 j = 0; j < logn; j++) DPrime[j] = 0;
curBlockBase = 0;
t = n - 1;
nToTheHalf = (u64)pow(n, 1.0 / 2);
numDPowers = (u64)(log(pow(n, 2.00001 / 3)) / log(2.0)) + 1;
dPrime = new double[(nToTheThird + 1) * (numDPowers + 1)];
S1Val = 1;
S1Mode = 0;
S3Vals = new u64[nToTheThird + 1];
S3Modes = new u64[nToTheThird + 1];
ended = false;
maxSieveValue = (u64)(pow(n, 2.00001 / 3));
for (u64 j = 2; j < nToTheThird + 1; j++) {
S3Modes[j] = 0;
S3Vals[j] = 1;
}
}
static void deallocPools() {
delete factorsMultiplied;
delete totalFactors;
delete factors;
delete numPrimeBases;
delete DPrime;
delete binomials;
delete dPrime;
delete S3Vals;
delete S3Modes;
delete primes;
}
/* This finds all the primes less than n^1/3, which will be used for sieving and generating complete factorizations of numbers up to n^2/3*/
static void fillPrimes() {
u64* primesieve = new u64[nToTheThird + 1];
primes = new u64[nToTheThird + 1];
numPrimes = 0;
for (u64 j = 0; j <= nToTheThird; j++) primesieve[j] = 1;
for (u64 k = 2; k <= nToTheThird; k++) {
u64 cur = k;
if (primesieve[k] == 1) {
primes[numPrimes] = k;
numPrimes++;
while (cur <= nToTheThird) {
primesieve[cur] = 0;
cur += k;
}
}
}
delete primesieve;
}
/* This resets some state used for the sieving and factoring process.*/
static void clearPools() {
for (u64 j = 0; j < nToTheThird; j++) {
numPrimeBases[j] = -1;
factorsMultiplied[j] = 1;
totalFactors[j] = 0;
}
}
/* We can use sieving on our current n^1/3 sized block of numbers to
get their complete prime factorization signatures, with which we can then
quickly compute d_k' values.*/
static void factorRange() {
for (u64 j = 0; j < numPrimes; j++) {
// mark everything divided by each prime, adding a new entry.
u64 curPrime = primes[j];
if (curPrime * curPrime > curBlockBase + nToTheThird) break;
u64 curEntry = (curBlockBase % curPrime == 0) ? 0 : curPrime - (curBlockBase % curPrime);
while (curEntry < nToTheThird) {
if (curEntry + curBlockBase != 0) {
factorsMultiplied[curEntry] *= curPrime;
totalFactors[curEntry]++;
numPrimeBases[curEntry]++;
factors[curEntry * logn + numPrimeBases[curEntry]] = 1;
}
curEntry += curPrime;
}
// mark everything divided by each prime power
u64 cap = (u64)(log((double)(nToTheThird + curBlockBase)) / log((double)curPrime) + 1);
u64 curbase = curPrime;
for (u64 k = 2; k < cap; k++) {
curPrime *= curbase;
curEntry = (curBlockBase % curPrime == 0) ? 0 : curPrime - (curBlockBase % curPrime);
while (curEntry < nToTheThird) {
factorsMultiplied[curEntry] *= curbase;
totalFactors[curEntry]++;
if (curEntry + curBlockBase != 0)factors[curEntry * logn + numPrimeBases[curEntry]] = k;
curEntry += curPrime;
}
}
}
// account for prime factors > n^1/3
for (u64 j = 0; j < nToTheThird; j++) {
if (factorsMultiplied[j] < j + curBlockBase) {
numPrimeBases[j]++;
totalFactors[j]++;
factors[j * logn + numPrimeBases[j]] = 1;
}
}
}
/* By this point, we have already factored, through sieving, all the numbers in the current n^1/3 sized block we are looking at.
With a complete factorization, we can calculate d_k'(n) for a number.
Then, D_k'(n) = d_k'(n) + D_k'(n-1).*/
static void buildDivisorSums() {
for (u64 j = 1; j < nToTheThird + 1; j++) {
if (j + curBlockBase == 1 || j + curBlockBase == 2) continue;
for (u64 k = 0; k < logn; k++) {
DPrime[j * logn + k] = DPrime[(j - 1) * logn + k] + d2(factors, j - 1, k, numPrimeBases[j - 1] + 1, totalFactors[j - 1]);
}
}
for (u64 j = 0; j < logn; j++) DPrime[j] = DPrime[nToTheThird * logn + j];
}
/* This general algorithm relies on values of D_k' <= n^2/3 and d_k' <= n^1/3. This function calculates those values of d_k'.*/
static void find_dVals() {
curBlockBase = 1;
clearPools();
factorRange();
buildDivisorSums();
for (u64 j = 2; j <= nToTheThird; j++) {
for (u64 m = 1; m < numDPowers; m++) {
double s = 0;
for (u64 r = 1; r < numDPowers; r++)
s += pow(-1.0, (double)(r + m)) * (1.0 / (r + m + 1)) * (DPrime[j * logn + r] - DPrime[(j - 1) * logn + r]);
dPrime[j * (numDPowers + 1) + m] = s;
}
}
}
static void resetDPrimeVals() {
curBlockBase = 0;
for (u64 k = 0; k < nToTheThird + 1; k++)
for (u64 j = 0; j < logn; j++)
DPrime[k * logn + j] = 0;
}
/* This function is calculating the first two sums of http://www.icecreambreakfast.com/primecount/primecounting.html#4_4
It is written to rely on values of D_k' from smallest to greatest, to use the segmented sieve.*/
static void calcS1() {
if (S1Mode == 0) {
while (S1Val <= ceilval) {
u64 cnt = (n / S1Val - n / (S1Val + 1));
for (u64 m = 1; m < numDPowers; m++)
t += cnt * (m % 2 == 1 ? -1 : 1) * (1.0 / (m + 1)) * DPrime[(S1Val - curBlockBase + 1) * logn + m];
S1Val++;
if (S1Val >= n / nToTheHalf) {
S1Mode = 1;
S1Val = nToTheHalf;
break;
}
}
}
if (S1Mode == 1) {
while (n / S1Val <= ceilval) {
for (u64 m = 1; m < numDPowers; m++)
t += (m % 2 == 1 ? -1 : 1) * (1.0 / (m + 1)) * DPrime[(n / S1Val - curBlockBase + 1) * logn + m];
S1Val--;
if (S1Val < nToTheThird + 1) {
S1Mode = 2;
break;
}
}
}
}
/* This loop is calculating the 3rd term that runs from 2 to n^1/3 in http://www.icecreambreakfast.com/primecount/primecounting.html#4_4*/
static void calcS2() {
for (u64 j = 2; j <= nToTheThird; j++)
for (u64 k = 1; k < numDPowers; k++)
t += (n / j - 1) * pow(-1.0, (double)k) * (1.0 / (k + 1)) * (DPrime[j * logn + k] - DPrime[(j - 1) * logn + k]);
}
/* This loop is calculating the two double sums in http://www.icecreambreakfast.com/primecount/primecounting.html#4_4
It is written to rely on values of D_k' from smallest to greatest, to use the segmented sieve.*/
static void calcS3() {
for (u64 j = 2; j <= nToTheThird; j++) {
if (S3Modes[j] == 0) {
u64 endsq = (u64)(pow(n / j, .5));
u64 endVal = (n / j) / endsq;
while (S3Vals[j] <= ceilval) {
u64 cnt = (n / (j * S3Vals[j]) - n / (j * (S3Vals[j] + 1)));
for (u64 m = 1; m < numDPowers; m++)
t += cnt * DPrime[(S3Vals[j] - curBlockBase + 1) * logn + m] * dPrime[j * (numDPowers + 1) + m];
S3Vals[j]++;
if (S3Vals[j] >= endVal) {
S3Modes[j] = 1;
S3Vals[j] = endsq;
break;
}
}
}
if (S3Modes[j] == 1) {
while (n / (j * S3Vals[j]) <= ceilval) {
for (u64 m = 1; m < numDPowers; m++)
t += DPrime[(n / (j * S3Vals[j]) - curBlockBase + 1) * logn + m] * dPrime[j * (numDPowers + 1) + m];
S3Vals[j]--;
if (S3Vals[j] < nToTheThird / j + 1) {
S3Modes[j] = 2;
break;
}
}
}
}
}
/* This is the most important function here. How it works:
* first we allocate our n^1/3 ln n sized pools and other variables.
* Then we go ahead and sieve to get our primes up to n^1/3
* We also calculate, through one pass of sieving, values of d_k'(n) up to n^1/3
* Then we go ahead and calculate the loop S2 (check the description of the algorithm), which only requires
* values of d_k'(n) up to n^1/3, which we already have.
* Now we're ready for the main loop.
* We do the following roughly n^1/3 times.
* First we clear our sieving variables.
* Then we factor, entirely, all of the numbers in the current block sized n^1/3 that we're looking at.
* Using our factorization information, we calculate the values for d_k'(n) for the entire range we're looking,
* and then sum those together to have a rolling set of D_k'(n) values
* Now we have values for D_k'(n) for this block sized n^1/3
* First we see if any of the values of S1 that we need to compute are in this block. We can do this by
* (see the paper) walking through the two S1 loops backwards, which will use the D_k'(n)
* values in order from smallest to greatest
* We then do the same thing will all of the S3 values
* Once we have completed this loop, we will have calculated the prime power function for n.
*
* This loop is essentially calculating
* http://www.icecreambreakfast.com/primecount/primecounting.html#4_4
*/
static double calcPrimePowerCount(u64 nVal) {
n = nVal;
allocPools(n);
fillPrimes();
find_dVals();
calcS2();
resetDPrimeVals();
for (curBlockBase = 0; curBlockBase <= maxSieveValue; curBlockBase += nToTheThird) {
clearPools();
factorRange();
buildDivisorSums();
ceilval = curBlockBase + nToTheThird - 1;
if (ceilval > maxSieveValue) {
ceilval = maxSieveValue;
ended = true;
}
calcS1();
calcS3();
if (ended) break;
}
deallocPools();
return t;
}
static u64 countprimes(u64 num) {
double total = 0.0;
for (u64 i = 1; i < log((double)num) / log(2.0); i++) {
double val = calcPrimePowerCount(invpow(num, i)) / (double)i * mu[i];
total += val;
}
return total + .1;
}
int scaleNum = 10;
int main(int argc, char* argv[]) {
int oldClock = (int)clock();
int lastDif = 0;
printf(" Time\n");
printf(" Increase\n");
printf(" for x%d\n", scaleNum);
printf(" __ Input Number __ __ Output Number __ _ MSec _ _ Sec _ Input\n");
printf(" \n");
for (u64 i = scaleNum; i <= 1000000000000000000; i *= scaleNum) {
printf("%17I64d(10^%4.1f): ", i, log((double)i) / log(10.0));
u64 total = (u64)(countprimes(i) + .00001);
int newClock = (int)clock();
printf(" %20I64d %8d : %4d: x%f\n",
total, newClock - oldClock, (newClock - oldClock) / CLK_TCK,
(lastDif) ? (double)(newClock - oldClock) / (double)lastDif : 0.0);
lastDif = newClock - oldClock;
oldClock = newClock;
}
_getch();
return 0;
}
If you've made it this far, thanks for following along!