While being a big fan of Python I am also in the continuous pursuit of better performance.
Like it or not, but when seeking for better performance in many cases you have to put away high level framework/programming-languages and go back to basic; many times basic means good old C.
After having developed some different Python scripts to calculate the Pi-greco up to a given number of digits (later I will post the most advanced version I came up with) I decided it was time to make a step forward and write it in C.
Because I am still away from home this time the test machine instead of being the usual FX-8120 is my Lenovo Thinkpad equipped with a 2.1 GHz 2C4T 3 MB of L3 cache i3-2310M.
On Windows the most known C (it also has a C++ wrapper) library for arbitrary precision math is MPIR.
One of the most important “features” of the library is that it is written in both C and Assembly, they also provide different versions of the Assembly code, so you can use the one that is specifically written for the CPU architecture of the machine you will run your program on.
The full source code of the library can be downloaded, modified and compiled by the user to better suit his needs.
The only downside is that the super optimized Assembly code included in the latest stable release of the library ins’t really up to date, in fact there still is no support for Sandy Bridge and AMD Bulldozer CPUs.
Anyway I think no one should complain about it, MPIR is an open source project so we can wait for an official release with the added features or get our hands dirty and write the code we need by ourselves.
Since I am lazy I will wait for the official team to realease the updated Assembly; since then I can live with the current stuff.
Talking about “current stuff” I have noticed that AMD K10 library’s version is much faster than P4, C2D and Nehalem ones even if I am using an Intel Sandy Bridge CPU; and I am not talking about 2-3% faster but 15-20%.
Anyway, while the Python - I have the feeling that there still something left here and there, tho I don’t think I can make it as fast as the C implementation is - script (which uses GMPY2) requires 30+ seconds to compute one million digits the program written in C takes only 1.4 seconds to complete the same task.
Both programs use the same algorithm: Chudnovsky brothers formula

  • binary splitting to further speed up the computation.
    Here is the part of the code deputed to do the math:
//------------------------------------------------------------------//
// pi_chudnovsky_bs
// description: compute PI to the given precision
void pi_chudnovsky_bs(mp_bitcnt_t digits, char* cpuid)
{
	reset();
	double start = seconds();

	mpf_set_default_prec (prec);

	__int64 C3 = 262537412640768000;

	printf ("Computing DIGITS_PER_TERM: ");
	//double C3_OVER_24 = 10939058860032000;
	mpz_t C3_OVER_24_p;
	mpz_init_set_str (C3_OVER_24_p, "10939058860032000", 10);
	double DIGITS_PER_TERM = log10 ((double) 151931373056000);
	double end = seconds();
	double end_2 = end - start;
	printf ("%g seconds\n", end_2);

	printf ("Computing Binary Splitting parameters: ");
	unsigned long N = (digits / DIGITS_PER_TERM) + 1;
	end = seconds();
	end_2 = end - start;
	printf ("%g seconds\n", end_2);

	printf ("Computing Binary Splitting terms: ");

	mpz_t Pab;
	mpz_init (Pab);
	mpz_t Qab;
	mpz_init (Qab);
	mpz_t Tab;
	mpz_init (Tab);

	bs(0, N, C3_OVER_24_p, &Pab, &Qab, &Tab);

	end = seconds();
	end_2 = end - start;
	printf ("%g seconds\n", end_2);

	// one_squared = 10**(2*digits)
	printf ("Computing squaring: ");
	mpz_t one_squared;
	mpz_init_set_ui (one_squared, 10);
	mpz_pow_ui (one_squared, one_squared, (2 * digits));

	end = seconds();
	end_2 = end - start;
	printf ("%g seconds\n", end_2);

	// sqrtC = sqrt(10005 * one_squared)
	printf ("Computing square root: ");
	mpz_t sqrtC;
	mpz_init (sqrtC);
	mpz_mul_ui (sqrtC, one_squared, 10005);
	mpz_sqrt (sqrtC, sqrtC);

	end = seconds();
	end_2 = end - start;
	printf ("%g seconds\n", end_2);

	// pi = (Q*426880*sqrtC) // T
	printf ("Computing PI-Greco");
	mpz_t pi;
	mpz_init (pi);
	mpz_mul_ui (pi, Qab, 426880);
	mpz_mul (pi, pi, sqrtC);
	mpz_div (pi, pi, Tab);

	// count end time
	end = seconds();
	end_2 = end - start;
	printf ("\nCompleted in: %g seconds\n", end_2);
	//printf ("\nPI: ");
	//mpf_out_str (stdout, 10, 0, pi);

	// delete temorary variables
	mpz_clear (Pab);
	mpz_clear (Qab);
	mpz_clear (Tab);

	printf ("\n");
	writeResultToFile (pi, cpuid, digits);
	printf ("\n");
	writeInfoToFile (digits, cpuid, end);
}

//------------------------------------------------------------------//
// bs
// description: makes the binary splitting math
void bs (unsigned long a_i, unsigned long b_i, mpz_t C3_OVER_24,
                             mpz_t* Pab, mpz_t* Qab, mpz_t* Tab)
{
	mpf_set_default_prec (prec);

	mpz_t a;
	mpz_init_set_ui (a, a_i);
	mpz_t b;
	mpz_init_set_ui (b, b_i);

	if ((b_i - a_i) == 1)
	{
		if (a_i == 0)
		{
			// Pab = Qab = ONE
			mpz_set_ui(*Pab, 1);
			mpz_set_ui(*Qab, 1);
		} else {
			// Pab = (6*a-5)*(2*a-1)*(6*a-1);
			mpz_t p1;
			mpz_init (p1);
			mpz_mul_ui (p1, a, 6);
			mpz_sub_ui (p1, p1, 5);

			mpz_t p2;
			mpz_init (p2);
			mpz_mul_ui (p2, a, 2);
			mpz_sub_ui (p2, p2, 1);

			mpz_t p3;
			mpz_init (p3);
			mpz_mul_ui (p3, a, 6);
			mpz_sub_ui (p3, p3, 1);

			mpz_mul (*Pab, p1, p2);
			mpz_mul (*Pab, *Pab, p3);
			// delete temorary variables
			mpz_clear (p1);
			mpz_clear (p2);
			mpz_clear (p3);

			// Qab = a^3 * C3_OVER_24
			mpz_t a3;
			mpz_init (a3);
			mpz_pow_ui (a3, a, 3);

			mpz_mul (*Qab, a3, C3_OVER_24);
			// delete temorary variables
			mpz_clear (a3);
		}
		// Tab = Pab * (13591409 + 545140134*a)
		mpz_t t1;
		mpz_init (t1);
		mpz_mul_ui (t1, a, 545140134);
		mpz_add_ui (t1, t1, 13591409);
		mpz_mul (*Tab, *Pab, t1);
		// delete temorary variables
		mpz_clear (t1);

		if (a_i & 1)
			mpz_neg (*Tab, *Tab);
	} else {
		unsigned long m = (a_i + b_i) / 2;

		// divide et impera, first chunk
		mpz_t Pam;
		mpz_init (Pam);
		mpz_t Qam;
		mpz_init (Qam);
		mpz_t Tam;
		mpz_init (Tam);

		bs(a_i, m, C3_OVER_24, &Pam, &Qam, &Tam);

		// divide et impera, second chunk
		mpz_t Pmb;
		mpz_init (Pmb);
		mpz_t Qmb;
		mpz_init (Qmb);
		mpz_t Tmb;
		mpz_init (Tmb);

		bs(m, b_i, C3_OVER_24, &Pmb, &Qmb, &Tmb);

		mpz_mul (*Pab, Pam, Pmb);
		mpz_mul (*Qab, Qam, Qmb);

		mpz_init (*Tab);
		mpz_t T1;
		mpz_init (T1);
		mpz_t T2;
		mpz_init (T2);
		mpz_mul (T1, Qmb, Tam);
		mpz_mul (T2, Pam, Tmb);
		mpz_add (*Tab, T1, T2);

		// delete temorary variables
		mpz_clear (Pam);
		mpz_clear (Pmb);
		mpz_clear (Qam);
		mpz_clear (Qmb);
		mpz_clear (Tam);
		mpz_clear (Tmb);
		mpz_clear (T1);
		mpz_clear (T2);
	}

	mpz_clear (a);
	mpz_clear (b);
}

I know that at a first glance it will looks a big mess, I hope the comments I spread in the code could be useful to understand what the program really does.
If you are interested here is a link to the whole source code (already compiled K10’s MPIR library are included): source code.
Also here is the executable version (Win X64 AMD K10): exe.