Saturday, March 10, 2012

Counting subsequences in a stream of digits

While reading some posts on comp.lang.lisp the other day I came across an interesting problem that I have been trying to solve:

> Write a program that counts the number of times the subsequence "1, 2, 3"
> appears in the first 100,000 digits of pi. Note "subsequence" does not imply
> consecutive digits. For example, "1, 2, 3" appears in "3.141592653" twice.
> You may use Google or the link above to find the digits of pi. Please include
> the correct count when submitting your application.

> We had a candidate submit a solution in Python which ran in O(n) time.
> The algorithm was essentially correct, but unfortunately he used a
> numpy.array to hold some counters and suffered integer overflow. Had
> he used regular Python lists, he would have been fine.
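The candidate's code isn't reproduced in the post, but a linear-time, counter-based solution is possible; here is a sketch in C of one shape it might take (the function name and layout are my own):

```c
#include <assert.h>

/* Sketch of a linear-time count of the subsequence 1,2,3 in a digit
 * string.  Three running totals are kept: 1s seen, (1,2) pairs seen,
 * and completed (1,2,3) triples.  Each digit updates them in O(1),
 * so one pass over n digits is O(n).  The totals grow fast, which is
 * the overflow trap a fixed-width numpy array falls into; a 64-bit
 * type is enough at the digit counts discussed here. */
static unsigned long long count_123(const char *digits)
{
    unsigned long long ones = 0, pairs = 0, triples = 0;
    for (const char *p = digits; *p; p++) {
        if (*p == '1')
            ones++;            /* one more possible starting 1 */
        else if (*p == '2')
            pairs += ones;     /* this 2 extends every 1 so far */
        else if (*p == '3')
            triples += pairs;  /* this 3 completes every (1,2) pair */
    }
    return triples;
}
```

On the example from the problem statement, count_123("3141592653") returns 2.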

It is an interesting problem, but the most intriguing part is the assertion that the candidate turned in an algorithm which ran in O(n) time; that is, the run time scales at worst linearly with the number of digits. I guess I need to go back and revisit some algorithm texts, dust the rust off my understanding of asymptotic notation, and brush up on my algorithms. The solution I have come up with so far is O(n^2), though with a small constant factor (~1/100); plotting the run times with gnuplot puts the curve somewhere between n^2 and n log n.

My solution is a pretty simple one. First I make one pass to count the number of 1s (or whatever first digit I tell it to look for) so I can allocate storage for the counters; I could skip this and use a linked list instead, but the extra pass is linear time and not a big deal. Then I walk the digits again. When I scan a 1, I increment the count of 1s seen; when I scan a 2, I walk the list of counters (one per 1 seen so far), incrementing each; and when I scan a 3, I walk the same list and add each counter's value to the running tally of subsequences. It runs pretty quickly, going through 1,000,000 digits in about 34 seconds, and 3/10 of a second for the 100,000 digits stated in the original problem. Output is given below:

snits@cantor:~/proj/c/pi=>time ./seqpi pi-1000000.txt 100000
sequence 1, 2, 3: 100000 168843528006 102081958

real 0m0.345s
user 0m0.342s
sys 0m0.001s

snits@cantor:~/proj/c/pi=>time ./seqpi pi-1000000.txt
sequence 1, 2, 3: 1000000 167568126814676 10031551523

real 0m33.897s
user 0m33.739s
sys 0m0.004s

The 1st number is the number of digits scanned, the 2nd is the number of occurrences of the subsequence, and the last is a crude operation count of things like scanning a digit, incrementing a counter, or updating the tally.
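The walk described above, boiled down to a string-scanning sketch in C (names are mine; the real program reads its digits from a file rather than a string):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the quadratic method described above: one counter per 1
 * seen so far.  A 2 increments every counter (how many 2s follow
 * that 1); a 3 adds every counter's value to the tally (each (1,2)
 * pair it completes).  The inner walks over the counter list are
 * what push the total work toward O(n^2). */
static unsigned long long count_123_walk(const char *digits)
{
    size_t n = strlen(digits);
    unsigned long long *counters = calloc(n ? n : 1, sizeof *counters);
    size_t ones = 0;                 /* counters in use, one per 1 */
    unsigned long long tally = 0;
    for (const char *p = digits; *p; p++) {
        if (*p == '1') {
            counters[ones++] = 0;    /* fresh counter for this 1 */
        } else if (*p == '2') {
            for (size_t i = 0; i < ones; i++)
                counters[i]++;
        } else if (*p == '3') {
            for (size_t i = 0; i < ones; i++)
                tally += counters[i];
        }
    }
    free(counters);
    return tally;
}
```

On the "3.141592653" example it agrees with the expected count of 2.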

It is very interesting stuff, and one thing that fascinates me about all this is that a typical home computer has the power to do these things this quickly. This program and the prime sieve program in earlier posts are single-threaded applications, so they don't even take advantage of all the cores on the system, but they still do a lot of work very quickly. I remember in the 5th grade, Dr. Van Huesen brought his VIC-20 into class and had it run a BASIC program to count to 1 million, I believe. It ran for over a day, sitting there spitting out the numbers on the screen. It is amazing how far technology has progressed in the past 25 years.

Friday, March 9, 2012

Sieve of Eratosthenes revisited

I recently revisited this after it came up last year when my nephew was playing around with primes. This time I decided to implement a basic bit array myself in C. Without the overhead of calls to Boost's dynamic_bitset in C++, the runtime dropped even more.
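The actual program isn't reproduced here, but the bit-array idea can be sketched like this (my own names; one bit per integer, with none of the odd-only packing or other refinements a tuned sieve might use):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sieve of Eratosthenes over a hand-rolled bit array: one bit per
 * integer, set once the number is known composite.  Marking starts
 * at i*i because smaller multiples were already struck by smaller
 * primes.  Packing 64 flags per uint64_t word is what replaces the
 * C++ dynamic_bitset calls. */
static uint64_t count_primes(uint64_t limit)
{
    uint64_t *composite = calloc(limit / 64 + 1, sizeof *composite);
    uint64_t count = 0;
    for (uint64_t i = 2; i <= limit; i++) {
        if (composite[i >> 6] & (1ULL << (i & 63)))
            continue;                /* already marked composite */
        count++;
        if (i > limit / i)
            continue;                /* i*i exceeds limit (or would overflow) */
        for (uint64_t j = i * i; j <= limit; j += i)
            composite[j >> 6] |= 1ULL << (j & 63);
    }
    free(composite);
    return count;
}
```

As a sanity check, count_primes(100) gives the familiar 25.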

Back when I initially ran the C++ version, this was the output for 40 billion:

snits@perelman:~/proj/cpp=>time ./se 40000000000
1711955433 primes between 2 and 40000000000. Greatest prime is 39999999979

real 90m48.631s
user 90m40.850s
sys 0m3.589s

Here is the output from running the C version with my own bit array:

On my i7 system:

snits@cantor:~/proj/c=>time ./sieve 40000000000
1711955433 primes between 2 and 40000000000. Greatest prime is 39999999979

real 16m49.468s
user 16m23.482s
sys 0m6.471s

On my q6600 system:

snits@perelman:~/proj/c=>time ./sieve 40000000000
1711955433 primes between 2 and 40000000000. Greatest prime is 39999999979

real 24m57.106s
user 24m47.370s
sys 0m2.562s

That is a nice gain just from removing the overhead of the calls into the C++ dynamic_bitset class.