Monday, December 31, 2007

Naive fibonacci in PyPy

As this is my last post this year, I'll keep it short. I've been busy doing all sorts of things, so here's just a quick revisit of our old friend, the naive fibonacci algorithm, this time featuring PyPy. I downloaded version 1.0.0 and ran the by now famous naive fibonacci. To get to the point: it took 184 seconds. Yep, you read that right, about 3 minutes. The same thing took 27 seconds in CPython 2.5, or around 3 seconds using Psyco. Now, I haven't read much about PyPy, but as I understand it, it's supposed to be faster than the CPython implementation... And AFAIK, the guys that made Psyco abandoned that project to go work on bringing similar functionality to PyPy. Now, I hate to bitch (no, no I don't :) ), but this really seems kinda slow. What's going on here? It's probably the function call overhead that slows it down. One should look at the PyPy code and see what exactly they put on the stack for each call... The point being, I'll stick to CPython for a while longer, at least until the naive fib runs under 20 seconds :) ('Cos what's more important in life than running a fast fibonacci algorithm?)
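If it really is call overhead, one way to check would be to time a huge number of empty function calls under both CPython and PyPy. Here's a rough sketch of the kind of micro-benchmark I have in mind (noop is just a made-up name, and ten million is an arbitrary count):


from time import time

def noop():
    pass

start = time()
for i in xrange(10000000):   # ten million empty calls
    noop()
print "10M calls took %f seconds" % (time() - start)
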
Happy holidays, everyone!





Saturday, December 22, 2007

Python 3000 - how mutable is immutable?

I just downloaded Py3k alpha 2 and decided to play with it a bit. Well, besides print becoming a function, things seemed to be more or less the same as in 2.5. Then I decided to play with some of the more esoteric thingies in Python.
It has been long known that Python immutables aren't really immutable. For example, if you enter the following code in 2.5:


a = ([1], 2)
a[0] += [2]


you will get an error. However, printing a reveals that it has indeed been modified (a is now ([1, 2], 2)). Ok, so what happened to this little bug/hack/feature/gotcha/whatever-you-want-to-call-it in Py3k? Well, entering the code above raises the same error, and again it actually modifies a. Now, I don't know if this "feature" is documented anywhere, but I really don't see any point in it being part of the language (if someone can give me an example where it would be useful to have this kind of behavior, let me know). So, why is it still present in Py3k, when it's been known for a while now? I guess someone either thinks this is a good thing or they just forgot about it; either way---guys, please change this, it's annoying me :) (and what better motivation to change the language than a rant by some guy on the Internet that you never met or heard of?)
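To see why a ends up modified even though an error is raised, it helps to spell out what the augmented assignment actually does under the hood. Here's a minimal sketch of the equivalent steps (the name tmp is mine, purely for illustration):


tmp = a[0]    # fetch the list stored in the tuple's first slot
tmp += [2]    # list.__iadd__ mutates the list in place and returns it
a[0] = tmp    # storing the result back into the tuple slot raises TypeError,
              # but by this point the list has already been changed
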
But, the fun doesn't stop here kids! Try the following:


a = ({1:2}, 2)
a[0][2] = 3


This little piece of code executes without any problems, both in 2.5 and in Py3k. I wonder what the rationale is behind the fact that trying to change a list (in place) that is part of a tuple raises an error, while changing a dict happens silently. Ok, I do admit that doing


a = ([1], 2)
a[0].append(2)


also doesn't throw an error, but this only confuses me more. So, I'd like to know---are tuples mutable or immutable? Or better yet, how mutable are immutables?
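The way I make sense of it is this: a tuple is immutable only in the sense that its slots cannot be rebound; the objects those slots refer to are free to change if they themselves are mutable. A quick sketch (the prints below work both in 2.5 and, as function calls, in Py3k):


a = ([1], 2)
print(id(a[0]))    # the id of the list object sitting in the tuple's first slot
a[0].append(2)     # mutating that list in place is perfectly legal
print(id(a[0]))    # same id as before: the tuple still refers to the same object
print(a)           # ([1, 2], 2) -- the tuple "changed" without ever being touched
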
As a final thought, consider the following code:


a = ({1:2}, 2)
a[0][2] = a
print a


In 2.5, this code executes with no problems, and printing a yields
({1: 2, 2: ({...}, 2)}, 2)
thereby showing us that we have an infinite dictionary on our hands (or that's what I think these three dots stand for). However, doing the same thing in Py3k (i.e., running the above code and trying to print a---remember that in Py3k print is a function), we get

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __repr__ returned non-string (type bytes)

Whoa! WTF?? What does this mean? Whatever it means, it tells me nothing about what is actually happening here. As far as I can see, there is something wrong with the __repr__ function---it isn't cut out to handle this kind of object. So, the first thing that came to my mind was "Hey, cool, they won't print infinite dictionaries any more. They'll allow their construction, but when I try to print them, I'll get an error (a weird one, to say the least, but an error still). This kinda suxx, but ok, I guess I'll have to live with it." No. I was wrong. Try

print(a[0])

The infinite dict in the first position of the tuple prints with no problems. So, why is it a problem to print the tuple (which basically adds one number, two parentheses, and one comma)? No idea. Of course, trying to print a[0][2] (the nested tuple) yields the same error, printing a[0][2][0] goes without a problem, and so on. I guess that's why they call it an alpha version :) I sincerely hope that this won't be part of the language when a stable release of Py3k is out.






Saturday, December 15, 2007

OMG my iPod likes Barbara Streisand

A few days ago, I updated my iPod with some new songs (I was getting kinda bored listening to the same ones for about two years now). I put most of these new songs in my 80's playlist (about a dozen new songs altogether). So, the last few days I had my iPod on every time I went home from work, and I noticed one strange thing - every time I listened to it, one of the songs was "The way we were" by Barbara Streisand (yes, I occasionally listen to Barbara, so what of it?). What was strange about it is that the shuffle option was always on and that there were 95 other songs in the playlist. To give the exact numbers: I heard Barbara four days in a row, I listened to about 15 songs each day (the commute home takes about 45 minutes - that's 3 minutes per song), and, as I said, there were 96 songs in total to choose from.
This somehow seemed odd to me, not to say weird. That's why I decided to work out just how probable this strange behaviour actually is. So, back to basic statistics. Shuffling a playlist of 96 songs amounts to playing them in a random order, and I assume the iPod's shuffle doesn't repeat a song until all the others have been played. Under that assumption, the probability that one particular song shows up among the first 15 songs played is simply 15/96 = 0.15625, that is, around 16%. This means that I have roughly a 16% chance of hearing "The way we were" on any one of my trips home. However, I heard it four days in a row. Since hearing a particular song on one of my trips can be regarded as an experiment with two outcomes (I either hear it or I don't), and since I start a new shuffle every day (meaning that each experiment is independent of the previous one), we are talking about a binomial distribution. In such a setting, we can easily compute the probability that I hear Barbara four days in a row - it's simply 0.15625^4 = 5.96 * 10^-4. That is, there is roughly a 6 in 10000 chance that I will hear Barbara four days in a row if I put my iPod on shuffle. Yet, there it was. I heard it. Four days in a row. As my knowledge of statistics ends here, I will not try to do a hypothesis test or anything like that. If there is someone out there that is willing and able to do this, I would love to hear from them.
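For the skeptical, here is a minimal Python sketch of the same computation (the numbers are just the ones from above):


# probability that one particular song is among the first 15 of a 96-song shuffle
p_per_day = 15.0 / 96          # 0.15625
# four independent days in a row, i.e. four successes out of four trials
p_four_days = p_per_day ** 4   # roughly 5.96e-04
print "per day: %f, four days in a row: %g" % (p_per_day, p_four_days)
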
All this led me to do some research online. Well, it turns out I'm not the only one having this shuffle issue. Many users have already complained that the "random" playing isn't random at all and that their iPods have souls. Here are just two references - http://forums.ilounge.com/archive/index.php/t-4575.html http://www.blackwell-synergy.com/doi/pdf/10.1111/j.1540-4609.2007.00132.x?cookieSet=1. The second one is a really interesting paper with some more references to people with the same problem. So, it turns out that the machines have already started developing a "mind" of their own. Yes, our machine overlords are taking over the world, and they have started by picking their favourite music. I'm just glad to know my iPod loves Barbara Streisand - I think it'll make an ok master.






Saturday, December 8, 2007

Running fast code in Python

As I suspected, the numbers I got for the execution time of the Haskell fibonacci program (see my previous posts) are misleading, in the sense that I compiled the program without optimisation, as many casual users would do. That is why I feel I should redeem myself and give the numbers for the program with optimisation turned on. So, when compiling the program using

ghc -O2 fib.hs -o fib -no-recomp (thanks to Don Stewart)

execution time is 0.64s (mean time of five runs). This time is much closer to the one reported on the Haskell blog (0.48s). Ok, so my hat goes off to Haskell---it really is some 50 times faster than Python.
But, being a Python fan, I couldn't help but feel that something was wrong here. Is Haskell really that good? It's fast, easy to parallelize, purely functional, and it probably cooks for you after a hard day at work. So, why should I use Python?
Well, the first thing that comes to my mind is the fact that ghc generated a 925 kB exe file from a 1 kB source file that computes fibonacci numbers. In comparison, gcc (3.4.2) generated 16 kB for a fibonacci program that ran in 0.75s (compiled using -O2). While it is still impressive that Haskell was faster than C, I cannot help but think that it is just too much to have an exe file almost 1000 times larger than the source. While this is certainly a drawback of Haskell, it still wasn't enough to restore my faith in Python. Then another thing came to my mind---if Haskell can use optimisations, why can't Python? I mean, if I used -O2 in ghc, why can't I do something similar in Python?
Of course, running Python with -OO gave no real speedup---it shaved about one second off the execution time. Instead of 26s, I got 25s for the non-parallel case, and instead of 18s, I got 17s for the parallel case. But the great thing about Python is its packages. I remember a friend of mine telling me about a thing called Psyco, a package that optimises the execution of Python code, so I decided to give it a shot. After downloading and installing Psyco (http://psyco.sourceforge.net/), I just added two lines to my source (I give the full source here):


from time import time
import psyco
psyco.full()

def fib(n):
    if n == 0 or n == 1:
        return n
    else:
        return fib(n-1) + fib(n-2)

if __name__ == "__main__":
    start_time = time()
    for i in range(36):
        print "n=%d => %d" % (i, fib(i))

    print "Time elapsed: %f" % (time() - start_time)


So, the lines I added were import psyco and psyco.full(). I ran this thing and voila---the execution time was 1.93s (mean time of five runs). Wow! That was all I could think. Remember, this was the non-parallel case, so it went from 25 seconds to 1.9s. Or, in terms of Haskell speed, Haskell now beat Python by only about 3x. And with no extra effort (ok, adding two lines to the source) and no parallelization.
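By the way, if you don't want Psyco to compile everything, it also has a bind function that, if I remember the documentation correctly, lets you target just one function:


import psyco
psyco.bind(fib)   # compile only fib, instead of psyco.full() which compiles everything it can
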
Unfortunately, Psyco didn't help the parallel version---it still took 18s. But still, I think this is pretty amazing, and it definitely brought back my faith in Python. So, it's three times slower, but the syntax is much cleaner and it doesn't generate an executable a thousand times bigger than the source. I'm sticking to Python :)
(Note: some parts of this post are biased and downright offensive to Haskell fans (just kidding about the last one :) ). My intention is not to start a flame war or anything like that, I just felt that someone should stand up and take Python's side :). Seriously though, if you have any objections to my methodology, I would like to hear from you.)





Wednesday, December 5, 2007

Even xkcd is leaving Perl :)

For those of you out there that don't read xkcd:








Monday, December 3, 2007

Holy shmoly Haskell doesn't smoke Python away (that much)

Recently, there has been some talk about how Haskell smokes Python and Ruby away on the naive fibonacci algorithm. While it is obvious that Haskell will beat both, the code being compiled rather than interpreted, I decided to see whether I get the same results as reported on the Haskell hacking blog, and also to see how Python handles parallel processing. Note that I will refer to the Haskell hacking blog simply as the Haskell blog from now on.

Since I have no experience in Ruby, I will only compare Python and Haskell. So, here is the code I used to test the two.
Python:


from time import time

def fib(n):
    if n == 0 or n == 1:
        return n
    else:
        return fib(n-1) + fib(n-2)

if __name__ == "__main__":

    start_time = time()
    for i in range(36):
        print "n=%d => %d" % (i, fib(i))

    print "Time elapsed: %f" % (time() - start_time)


So, as we can see, this code is pretty much the same as that reported in the blog about Haskell.


Haskell:

import Control.Monad
import Text.Printf

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
main = forM_ [0..35] $ \i ->
    printf "n=%d => %d\n" i (fib i)


This is also the same as in the blog, the only difference being that I import what is needed to make it run (for some reason, the import statements aren't included in the Haskell blog). Since I have no idea how to measure time in Haskell (any hint on how to do that is appreciated), I used the ptime command for Windows (available at http://www.pc-tools.net/win32/ptime/).

So, here are the results of the test:

Python 2.5: 26.07 s
Haskell (GHC 6.8.1): 3.873 s (mean of 5 runs)

So, the results for Python are pretty much the same as reported in the Haskell blog (25.16 s is reported there), but the ones for Haskell are not; in fact, the Haskell blog reports 0.48 s, which is some eight times faster than what I got on my computer. What is the reason for this? I have no idea. I compiled the program using "ghc fib.hs"; maybe they used some compiler optimizations? Anyway, the Haskell advantage seems to be melting, although Haskell is still some 6-7 times faster. However, the Haskell blog proceeds to use parallel processing in Haskell, but uses no such thing in Python. So, I decided to see if Python can take advantage of my two cores to shave some time off those 26 seconds. I used the Parallel Python package (available at http://www.parallelpython.com/) to make use of my two cores. Here is the modified Python program:


from time import time
import pp

def fib(n):
    if n == 0 or n == 1:
        return n
    else:
        return fib(n-1) + fib(n-2)


if __name__ == "__main__":
    start_time = time()
    ppservers = ()
    job_server = pp.Server(ppservers=ppservers)
    print "Starting pp with", job_server.get_ncpus(), "workers"
    jobs = [(input, job_server.submit(fib, (input,), (), ())) for input in range(36)]
    for input, job in jobs:
        print "n=%d => %d" % (input, job())

    print "Time elapsed: %f" % (time() - start_time)


The result: 17.86 s
So, Python went from 26.07s to 17.86s, which is a 31.5% reduction in execution time compared to running on a single core. Unfortunately, I had no success in running the parallel Haskell code reported on the Haskell blog (for some reason, I always got a "`pseq` not in scope" error while compiling). However, I can use the numbers given on the Haskell blog, where Haskell went from 0.48s to 0.42s on two cores. This is a 12.5% reduction in execution time. Although I admit that it takes slightly more intervention from the programmer to make the parallelism in Python work, the gain in speed is also more significant. As a final note, let us compare 3.873s (Haskell) and 17.86s (parallel Python) - the ratio is around 4.6, let's say 5. So, Haskell doesn't smoke Python away THAT much.





Tuesday, November 20, 2007

Trie in Python

Here is my humble implementation of a trie in Python, just something to get me started.
Any comments on improving the code are welcome.


import re, types
from collections import defaultdict

def tokenize(s, removePunctuation = True):
    """Split a string into tokens, optionally throwing punctuation away."""
    if removePunctuation:
        p = re.compile('[\.,!?()\[\]{}:;"\'<>/ \n\r\t]')
        return [el for el in p.split(s) if el]
    else:
        t = []
        s = re.split('\s', s)
        p = re.compile('(\W)')
        for phrase in s:
            words = p.split(phrase)
            for word in words:
                t.append(word)
        return [el for el in t if el]

class Node(object):

    def __init__(self):
        self.freq = 0
        # maps a character to the index of the child node in Trie.nodes;
        # 0 (the root's index) doubles as "no such child"
        self.next = defaultdict(int)

class Trie(object):

    def __init__(self, inFile):

        self.nodes = []
        self.nodes.append(Node())   # node 0 is the root
        self.n = 1                  # number of nodes allocated so far
        self.numberOfWords = 0

        for line in file(inFile):
            words = tokenize(line)
            for w in words:
                currNode = 0        # start every word at the root
                for char in w:
                    if self.nodes[currNode].next[char] == 0:
                        # no edge for this character yet: create a new node
                        self.nodes.append(Node())
                        self.nodes[currNode].next[char] = self.n
                        currNode = self.n
                        self.n += 1
                    else:
                        currNode = self.nodes[currNode].next[char]
                # currNode now marks the node where the word ends
                self.nodes[currNode].freq += 1
        self.numberOfWords = len([node for node in self.nodes if node.freq != 0])

    def __getitem__(self, word):
        """Return the frequency of the given word."""

        if isinstance(word, types.StringTypes):
            currNode = 0
            for char in word:
                if self.nodes[currNode].next[char] == 0:
                    raise AttributeError("No such word: %s" % word)
                else:
                    currNode = self.nodes[currNode].next[char]
            return self.nodes[currNode].freq
        else:
            raise TypeError()

    def __len__(self):
        """Return the number of nodes in the trie."""

        return self.n
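
Here's a minimal usage sketch, assuming a plain-text file called words.txt sits in the current directory (the file name is just an example):


t = Trie("words.txt")      # build the trie from the text file
print len(t)               # number of nodes in the trie
print t.numberOfWords      # number of distinct words seen
print t["the"]             # frequency of "the" (raises AttributeError if the word isn't there)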