Daily rant

Running separate xsession in Ubuntu

2010-05-23T09:44:00.000-07:00

Every time I upgrade to a new version of Ubuntu, it breaks my xsession (which runs bare Xmonad). And every time I have to mess around for half an hour to find what needs to be done to get it back. All that needs to be done is to modify the file /usr/share/xsessions/xmonad.desktop to look like this:


[Desktop Entry]
Encoding=UTF-8
Name=XMonad
Comment=Lightweight tiling window manager
Exec=/etc/X11/Xsession
Icon=xmonad.png
Type=XSession

Twitter corpus v0.1

2010-03-13T05:05:00.000-08:00

I am pleased to announce that we have published the first version of our Twitter corpus. All the data was collected from Twitter's streaming API over a period of about two months (November 11th 2009 until February 1st 2010). You can download the corpus from our social media website (which we just set up recently). There is an accompanying paper which gives some statistics about the corpus. One thing that might interest all the 13-year-old girls out there is that it seems Justin Bieber > Nick Jonas (look at table 3 in the paper). I believe that the Twitter corpus will be of interest to anyone working in social media research and/or NLP. We do plan to release subsequent versions as we get more data (and we might release old data starting from April 2009, but more on this later).

Limits in stdint.h

2009-07-10T07:14:00.000-07:00

I've recently had problems with compiling my code that used the numeric limits defined in stdint.h, UINT64_MAX in particular. The error I was getting was


error: ‘UINT64_MAX’ was not declared in this scope

What I didn't know is that, when compiling C++ code, simply including stdint.h isn't enough. The macro __STDC_LIMIT_MACROS has to be defined before the point where stdint.h is first included, otherwise the limits are not defined. The best way of defining this macro, IMHO, is by compiling with -D__STDC_LIMIT_MACROS, instead of manually defining the macro somewhere in code. Hope this post will save someone a few minutes (hours?) if they run into similar problems.

Do public prediction markets really fail?

2009-06-20T03:11:00.000-07:00

A few days ago I came across this article explaining why public prediction markets fail. The article gives an example where three different PMs failed to pick a winner in American Idol (Betfair) or Britain's got talent (Hubdub and Intrade). While it was definitely an interesting read, I feel that the author didn't take some things into account.
First, the success of PMs depends on the notion of participants playing risk neutral strategies (see, for example, the Manski 2005 paper) -- when people play with fake money, this may well not hold (they will tend to risk more than usual because they have nothing to lose).
Also, PMs are said to be more accurate than other ways of aggregating opinions, such as polls or expert opinions. It would be nice if the author had compared these predictions with some predictions made by opinion polls or experts and shown how the predictions differ.
Then there's the thing with data from Hubdub. As noted in the article, Hubdub's market concerned with Britain's got talent failed to predict the true outcome, giving Susan Boyle a 78% chance of winning. What was not taken into account is that there were in fact more markets on Hubdub concerned with Britain's got talent. The one used in the article can be found here. However, here's another market that accurately predicted that "Susan Boyle OR Diversity" will win (Diversity won). Note the following: the market that got Susan Boyle wrong had $25k of activity, whereas the one that got it right had twice as much activity. So, if you were to make any decisions based on PM predictions, you would probably go for the one with more activity and you would be right. I don't know if Intrade or Betfair had more markets for the same event.
Anyway, I agree with the bottom line that the accuracy of PMs depends on how much information their participants have, and that, ultimately, they will fail some of the time. However, we should be careful about making statements like "public prediction markets fail", especially since there are so many examples when they don't (this might be a topic for another post). And even when they do fail, it's important to understand why.

Wasting just a little bit more time: a tale of dzen and reddit

2009-06-08T14:41:00.000-07:00

Are you tired of switching from the console to browser, and then pressing F5 to see what's new on reddit? So am I. There's got to be a better way of seeing new posts on reddit without leaving the comfort of the command line. And there is. If you run Linux with dzen, here's a possible solution. Simply use this script. It checks reddit and outputs the first ten titles in a way that can be used by dzen. It should be clear from the source how to make the script output more than ten titles or how to monitor something other than the main reddit (e.g., programming, pics, funny or other subreddits). Now all one needs to do is run this script every n seconds (depending on how often you expect new articles to arrive); you can use something like this. One last thing left to do is add a line to your .xsession that calls your checker script, e.g., I have the following line:


~/.dzen/checkReddit.sh | dzen2 -y 800 -ta c -l 10 -bg '#2c2c32' -fg 'grey70' -p -e 'entertitle=uncollapse;leavetitle=collapse' &

And here are the results (the reddit bar sits quietly at the bottom of your screen until you move your mouse over it, when you get the situation shown on the picture to the right):

Of course, once you see what's new you still have to switch to your browser if you want to actually follow the links, but this should at least save you the trouble of switching only to find out that there's nothing new. Note that there are many things left to be done here; this is just a little sample of how you can waste a few more minutes of your life.
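For those who would rather not read the shell script, here is a rough sketch of just the formatting step, written in Python. It is not the script linked above; format_for_dzen and its parameters are names I made up for illustration, and it assumes that dzen2, when started with -l, shows the first input line in the title bar and the following lines in the list that uncollapses on mouse-over. Fetching the titles from reddit is left out on purpose.

```python
def format_for_dzen(titles, limit=10, header="reddit"):
    """Return text for dzen2: one header line, then up to `limit` titles.

    Hypothetical helper: assumes dzen2 -l N puts the first line in the
    title bar and the remaining lines in the collapsible list.
    """
    # Collapse whitespace so every title stays on exactly one line,
    # and drop titles that are empty after stripping.
    cleaned = [" ".join(t.split()) for t in titles if t.strip()]
    shown = cleaned[:limit]
    lines = ["%s: %d new" % (header, len(shown))] + shown
    return "\n".join(lines) + "\n"


sample = ["First post", "Second\npost with a line break", "   "]
print(format_for_dzen(sample), end="")
```

The output of such a function can be piped into dzen2 exactly like the output of the checker script in the .xsession line above.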

Flickr

2009-04-23T14:15:00.001-07:00

This is a test post from Flickr, a fancy photo sharing thing.

2009-01-17T15:56:00.000-08:00

Firefox 3 download day - Mozilla just got into politics

2008-06-17T15:57:00.000-07:00

Well, Firefox 3 finally arrived and all is well in the world. I was just browsing the downloads map and was ready to call it a night when something caught my eye. Something was strange about that map... Ah, that's it, there seems to be one state missing on the map. For all of you that don't know which one is missing, let me give you a hint -- it's a small state just north of Albania, south of Serbia. You guessed it, it's Kosovo. Yes, Kosovo. The state that got its independence in February this year is not plotted as a separate country, but rather as a part of Serbia (of which it was a part until recently). Now, it may be considered overranting, but I wonder why Kosovo isn't listed as a separate country. After all, it is recognized by the US (the country Mozilla calls home), the UK, France, Italy, Germany... The list goes on. And it's not like the guys at Mozilla didn't have time to prepare; Kosovo got its independence in February, which gave them some four months to add it to the countries list. So, why can't I see how many people from Kosovo downloaded Firefox 3? And more importantly, who decides if a country is relevant enough to be added to a company's list? It seems that nowadays a country has to be recognized not only by other countries, but also by companies if it wishes to have a meaningful existence...

Saving e-mails to disk (from a gmail account)

2008-04-07T01:49:00.000-07:00

As I was trying to download all my spam messages to disk (I'm building a database of spam messages), I realised that this is no simple task. Well, one thing I could do is save the messages using Thunderbird or Outlook, but since I don't use either (I consider the Gmail web interface very nice and user-friendly) this one is out of the question. However, after a little browsing I discovered a wonderful Python package called libgmail. Long story short, here is the script I used to download all the messages from the spam folder:


#!/usr/bin/env python
'''
savemsg.py -- Download all messages from a specified folder
License: GPL 2.0
'''

import sys
from getpass import getpass
import libgmail

if __name__ == "__main__":
    try:
        name = sys.argv[1]
    except IndexError:
        name = raw_input("Gmail account name: ")
        
    pw = getpass("Password: ")

    ga = libgmail.GmailAccount(name, pw)

    print "\nPlease wait, logging in..."

    try:
        ga.login()
    except libgmail.GmailLoginFailure,e:
        print "\nLogin failed. (%s)" % e.message
        sys.exit(1)
    else:
        print "Login successful.\n"

    FOLDER_list = {'U_INBOX_SEARCH' : 'inbox',
                   'U_STARRED_SEARCH' : 'starred',
                   'U_ALL_SEARCH' : 'all',
                   'U_DRAFTS_SEARCH' : 'drafts' ,
                   'U_SENT_SEARCH' : 'sent',
                   'U_SPAM_SEARCH' : 'spam',
                   }

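    # ask for a folder name (without overwriting FOLDER_list) and make
    # sure it is one of the known folders
    folder_name = raw_input('Choose a folder (inbox, starred, all, drafts, sent, spam): ')
    if folder_name not in FOLDER_list.values():
        print "Unknown folder: %s" % folder_name
        sys.exit(1)
    folder = ga.getMessagesByFolder(folder_name)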

    for thread in folder:
        for msg in thread:
            print "Downloading message %s " % msg.id
            encIndexStart = msg.source.find('charset=')
            if encIndexStart != -1:
                encIndexEnd = (msg.source.find(' ', encIndexStart),\
                               msg.source.find('\n', encIndexStart),\
                               msg.source.find(';', encIndexStart),\
                               msg.source.find('"', encIndexStart+10))
                encIndexEnd = [ind for ind in encIndexEnd if ind != -1]
                encIndexEnd = min(encIndexEnd)
                enc = msg.source[encIndexStart + 8:encIndexEnd]
                enc = enc.replace('"', '').replace(';', '')
            else:
                enc = 'ascii'
            print "Detected encoding %s\n"  % enc
            try:
                f = open(msg.id + " " + msg.subject + ".txt", 'w')
            except (IOError, UnicodeError):
                # message subject contains characters forbidden by the os in the
                # file name, use just message id
                f = open(msg.id + ".txt", 'w')
            try:
                f.write(msg.source.decode(enc).encode('utf-8'))
            except:
                f.write(msg.source)
            f.close()
    print "\n\nDone."

One could use the script to download messages from any gmail folder. The encoding of the message is automatically recognized and the message is saved in UTF-8 to facilitate later processing. Of course, you have to have libgmail installed to run the script. It is also very easy to adapt the script to use it for any other purpose (I actually wrote this script by changing one of the demo scripts that come with libgmail).
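As a side note, the manual search for 'charset=' in the script can also be done with the standard library's email package. The following is only a sketch in Python 3 syntax (the script above is Python 2), and detect_charset is a name I picked for illustration, not something from libgmail. It assumes the raw message source is available as a string, as msg.source is in the script.

```python
from email import message_from_string


def detect_charset(raw_message, default="ascii"):
    """Return the charset declared in a Content-Type header, or `default`."""
    msg = message_from_string(raw_message)
    # In multipart mail the charset is declared on the individual parts,
    # so walk the message and use the first part that declares one.
    for part in msg.walk():
        charset = part.get_content_charset()
        if charset:
            return charset
    return default


sample = (
    "Subject: hello\n"
    'Content-Type: text/plain; charset="ISO-8859-2"\n'
    "\n"
    "body text\n"
)
print(detect_charset(sample))
```

Note that get_content_charset returns the name in lower case, which is fine for decoding since codec lookups are case-insensitive.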

Reddit alien in LaTeX

2008-03-04T04:04:00.000-08:00

What to do on a lazy Sunday afternoon? Well, there are probably a thousand useful things one could do, but I chose to do something completely useless. And here it is, the reddit alien drawn in LaTeX (using PSTricks) :)


\documentclass{memoir}

\usepackage{pstricks}
\usepackage{pst-all}

\begin{document}

\thispagestyle{empty}

%\psset{linecolor=red}
\psset{linewidth=1pt}
\begin{pspicture}(6,9)
%\psgrid[subgriddiv=1,griddots=10,gridlabels=10pt](0,0)(12,12)
\psframe(0,0)(6,9)
% head
\psellipse(3,6)(1.8,1)
% ears
\psarc{-}(1.5,6.5){0.4}{38}{230}
\psarc{-}(4.5,6.5){0.4}{-50}{142}
% eyes
\pscircle[fillstyle=solid,linestyle=none,fillcolor=orange](2.3,6.3){0.3}
\pscircle[fillstyle=solid,linestyle=none,fillcolor=orange](3.7,6.3){0.3}
% mouth
\psbezier(2.3,5.6)(2.5,5.3)(3.5,5.3)(3.7,5.6)
% the tentacle thingie
\psline(2.99,6.99)(3.6,8)
\psline(3.581,7.993)(4.1,7.8)
\pscircle(4.35,7.65){0.3}
% arms
\psarc{-}(2.8,4.3){1.1}{130}{231}
\psarc{-}(3.15,4.3){1.1}{-48}{50}
% body
\psbezier(2.1,5.15)(2.0,4.7)(1.8,3.3)(2.6,2.7)
\psbezier(3.85,5.15)(4.1,4.65)(4.1,3.3)(3.4,2.7)
% feet
\psline(1.65,2.7)(4.3,2.7)
\psarc{-}(2.1,2.685){0.45}{70}{180}
\psarc{-}(3.85,2.685){0.45}{0}{105}
\end{pspicture}

\end{document}

Note that I used the memoir document class, but you could also use article if you don't have memoir available (the results would be the same). The alien does not look exactly the same as the official reddit one, but who cares? And here is what you should get when you run this example through LaTeX:

And people say LaTeX suxx...

Deferred printing in LaTeX

2008-02-25T01:43:00.000-08:00

Ever written a textbook? Ever written any document that has questions/problems and answers to those problems? Usually when you do this, you want to have questions printed in one place (or questions in each section of the document), and answers printed at the end of the document (so your readers, who are usually students, would try to actually solve the problem rather than just look at the solution which is right under the problem).
Now, I don't know how most of you usually do this, but up until now the only way I knew to do this was the "brute force" approach---I would simply write the questions and then, at the end of the document, I would write the answers. The obvious drawback of this approach is that you have to look up each individual question when writing the answers (just to see what it was) and this can quickly become very boring. Or, if you have the questions/answers on paper, they are usually written in question/answer pairs, so you have to first type all the questions (ignoring the answers), and then go back to the beginning of the paper and type all the answers (ignoring the questions). Anyway, I hope you get the picture why I hate doing this.
Wouldn't it be great if I could somehow have the question and its answer in the same place in the source of the document (notice that we are not talking about WYSIWYG text processors here), but then defer the printing of answers in the output document (that is, print the answers at the end of the document)? The solution to my problem comes in the form of the LaTeX box mechanism. Just take a look at this minimal example:


\documentclass{article}

\newbox\answerscollect
\newcounter{problem}
\setcounter{problem}{0}
\renewcommand{\theproblem}{\textit{Problem \arabic{problem}.}\,}

\def\answer#1{\par\setbox\answerscollect=\vbox{\unvbox\answerscollect\vspace{5mm}\theproblem\,#1\par}}
\def\printanswers{\unvbox\answerscollect\setbox\answerscollect=\vbox{}}
\def\initbox{\setbox\answerscollect=\vbox{Answers:\par}}

\newcommand{\problem}[1]{\addtocounter{problem}{1}\par\theproblem\textit{#1}}

\begin{document}

\initbox

\section{Intro}
Some text before...

\problem{What is the capital of the US?}
\answer{Washington, D.~C.}

\problem{What is 2+2?}
\answer{4}

\section{Foo}

Some other text goes here...

\section{Answers}

\printanswers

\end{document}

Now I know it looks a bit complicated, but it really isn't. What we are doing is creating a box and storing all the answers in it, and then, when the time is right, we simply print all the contents of that box using the \printanswers command. This way, we can simply have the problem and its solution (answer) in the same place in the source of the document, but have the answers appear in a totally different place in the output pdf (or postscript or whatever). Now, I'm not going to go into details of the LaTeX code itself here. Those who know LaTeX will pretty much understand the code, and those who don't should first learn the basics before trying to get this. Anyway, the code can be used out-of-the-box; if you would like to change something but don't know how, you can contact me. Last, but not least, here is what the output pdf looks like:

UPDATE: As Evan noted, if you are going to use multiple files, don't use \include. Instead, use the \input command which does not create new .aux files.

Unifying expressions in Python

2008-01-31T01:09:00.000-08:00

Most of you who had some AI course in college have probably heard of unification of expressions. All of you who have ever programmed (and I use the term loosely) in Prolog know what this is. For all others, just a note that I'm not going to explain here what unification is and how it's done. If you don't know what it is, check out, for example, this WP article.
Here we can see the algorithm for finding the most general unifier of two expressions in pseudo-code (check out the bottom of the page). And here is my implementation of unifying expressions in Python. I'm not going to explain the code here---if you're interested, you can do this yourself. I'll just give a brief example of how you can use this. So, suppose that you have the following two expressions:
expr1 = P(f(a), g(y), f(w))
and
expr2 = P(x, g(f(x)), y),
where P is a predicate, g and f are functions, a is a constant, and x, y and w are variables (this is an actual assignment from a test we gave our students some two months ago). Let's solve this using Python:


import mgunifier as mg    # just to make the listing shorter

# declare predicates, functions, and variables
P = mg.predicate("P")
g = mg.function("g")
f = mg.function("f")
x = mg.variable("x")
y = mg.variable("y")
w = mg.variable("w")
a = mg.constant("a")

expr1 = [P, [f, a], [g, y], [f, w]]
expr2 = [P, x, [g, [f, x]], y]
uni = mg.mgUnifier(expr1, expr2)     # find the most general unifier
new1 = uni(expr1)
new2 = uni(expr2)
print "The most general unifier is ", uni
# check if the two expressions are the same after unification
print new1
print new2

That's all there is to it. If someone has an idea on how to make things simpler or improve the code in any way, I am always glad to hear from you. Now, those of you who have had some experience with Prolog probably know that, although unification is the most basic tool of this language, there's a little bug/wart when using it (this applies to the SWI Prolog implementation, I don't know how others handle it). Typing, for example,


?- X = f(X).

results in X = f(**), meaning that variable X is now f(f(f(f(...)))). This is, of course, wrong, because the left and the right side of the equal sign will never be the same (and if you don't believe me, check out the algorithm). Why exactly Prolog does this is beyond me. Anyway, try the following code:


import mgunifier as mg

P = mg.predicate("P")
g = mg.function("g")
f = mg.function("f")
x = mg.variable("x")
y = mg.variable("y")
w = mg.variable("w")
z = mg.variable("z")    # z is used in the expressions below
a = mg.constant("a")

first = [P, x, [g, y], z]
second = [P, [f, a], z, y]
uni = mg.mgUnifier(first, second)
print uni

Not only do you get an error, but you can also see exactly where the algorithm failed.
Now, what's the whole point of this post? To demonstrate some Python implementation of an algorithm no one cares about? Well, partially. I was looking through the web for something similar and found only this. As it didn't satisfy me, I wrote my own program that unifies expressions, so anyone looking for something like this doesn't have to. But, what I'd really like people to see (and read) is this (copy-paste from the source):


def mgUnifier(k1, k2):
    """Return the most general unifier of two expressions k1 and k2."""
    
    if not k1 or not k2: return supstitution()
    if isinstance(k1, (constant, variable, function, predicate)) or isinstance(k2, (constant, variable, function, predicate)):
        if k1 == k2:
            return supstitution()
        if isinstance(k1, variable):
            if k1 in k2:
                return error(`k1` + " in " + `k2`)
            else:
                return supstitution(k2, k1)
        if isinstance(k2, variable):
            if k2 in k1:
                return error(`k2` + " in " + `k1`)
            else:
                return supstitution(k1, k2)
        if not isinstance(k1, variable) and not isinstance(k2, variable):
            return error(`k1` + " and " + `k2` + " cannot be unified!")

    alpha = mgUnifier(head(k1), head(k2))
    if isinstance(alpha, error):
        return alpha
    k3 = alpha(tail(k1))
    k4 = alpha(tail(k2))
    beta = mgUnifier(k3, k4)
    if isinstance(beta, error):
        return beta
    return alpha(beta)

This beautifully (IMHO) demonstrates the power and clarity of Python. Compare this code to the algorithm pseudo-code---it's almost the same. This is what I love about Python---I can simply copy the algorithm's pseudo-code and then just implement the "magic" that happens in the background.
Like I said, the purpose of this program is primarily educational; the code could probably be better written, but I stuck with the implementation that was most similar to the algorithm. Keeping that in mind, I appreciate any feedback on the code.
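For readers who don't want to download mgunifier, here is a small self-contained sketch of the same idea, including the occurs check that makes X = f(X) fail. This is not the mgunifier code and it doesn't follow the head/tail formulation above. It is written in Python 3, and the representation is my own assumption for the sketch: variables are Var objects, constants are plain strings, and compound terms are tuples whose first element is the functor name. The names Var, walk, occurs, unify and resolve are all made up for illustration.

```python
class Var:
    """A logic variable, identified by object identity."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


def walk(term, subst):
    """Follow bindings until reaching an unbound variable or a non-variable."""
    while isinstance(term, Var) and term in subst:
        term = subst[term]
    return term


def occurs(var, term, subst):
    """Return True if `var` appears anywhere inside `term` under `subst`."""
    term = walk(term, subst)
    if term is var:
        return True
    if isinstance(term, tuple):
        return any(occurs(var, arg, subst) for arg in term[1:])
    return False


def unify(t1, t2, subst=None):
    """Return a substitution (dict) that unifies t1 and t2, or None."""
    subst = {} if subst is None else subst
    t1, t2 = walk(t1, subst), walk(t2, subst)
    if t1 is t2 or (not isinstance(t1, (Var, tuple)) and t1 == t2):
        return subst
    if isinstance(t1, Var):
        if occurs(t1, t2, subst):
            return None  # occurs check: x cannot be unified with f(x)
        return {**subst, t1: t2}
    if isinstance(t2, Var):
        return unify(t2, t1, subst)
    if isinstance(t1, tuple) and isinstance(t2, tuple):
        if t1[0] != t2[0] or len(t1) != len(t2):
            return None  # different functor or arity
        for a, b in zip(t1[1:], t2[1:]):
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None  # two different constants, or constant vs. compound term


def resolve(term, subst):
    """Apply `subst` to `term` until no bound variables remain."""
    term = walk(term, subst)
    if isinstance(term, tuple):
        return (term[0],) + tuple(resolve(arg, subst) for arg in term[1:])
    return term


x, y, w, z = Var("x"), Var("y"), Var("w"), Var("z")
expr1 = ("P", ("f", "a"), ("g", y), ("f", w))
expr2 = ("P", x, ("g", ("f", x)), y)
s = unify(expr1, expr2)
print(resolve(expr1, s) == resolve(expr2, s))
print(unify(x, ("f", x)))
```

With this representation the two expressions from the first example unify, while x against f(x), and the second pair of expressions from above, are rejected by the occurs check.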

Real-life Python

2008-01-21T01:15:00.000-08:00

I just came home all tired and the minute I saw the place it hit me---I hadn't
tidied up in ages. There were books everywhere, clothes all over the place, dirty
dishes... As I'm always tired when I come home from work, I never feel like cleaning
the place up. Somehow Python came up in my head as the solution to my problem (well,
not really a solution:) ). What if we could execute Python code in real life? I
mean, wouldn't it be great if we could do something like:


apartment.sort()
dishes.wash()

As I of course didn't feel like cleaning, I decided I'd better spend my time writing
about how I would use Python in real life. Besides cleaning the house, I think Python
could be of great help in the following problems as well.

Understanding women
Did you ever have a fight with your girlfriend/wife (it's a rhetorical question,
of course you did)? Did you ever wonder "What is going on in that mind of hers"?
Well, think about this one:


print girlfriend.__doc__

And if that didn't help, wouldn't it be great to be able to do something like


import dis
dis.dis(girlfriend.think)    # dis.dis prints the disassembly itself

This would not only help you understand her, but also enable you to predict how
she will react. That's right, no more "What did I do wrong"---just look at the
code and you will know.

The answer to life
For all those in doubt as to whether they are wasting their life going to church
every Sunday, here's an elegant Pythonic solution:


import time
if life.endswith(death):
    while alive:
        party()
        time.sleep(12*60*60)
    die()
else:
    import random
    religions = [christianity, islam, buddhism, hinduism]
    myReligion = random.choice(religions)
    for day in allDays:
        if day != sunday:
            be_humble()
        else:
            repent_sins(myReligion)
    #no more days, end of time is here
    go_to_heaven(myReligion)

What would you use Python for in real life?

How to insert watermark in LaTeX

2008-01-09T04:57:00.000-08:00

For my first post this year, I'll start off with something a bit different. This time we'll take a look at how to insert a watermark in a document using LaTeX. Note here that by watermark I mean some text that will appear on the background of each page of the document.
As with everything else in the wonderful world of LaTeX, there is more than one way to do this. I know of two, which I will write about here. The first one is fairly simple: just include the package draftcopy and voila. Here's a minimal example of how it works:


\documentclass{article}
\usepackage{draftcopy}

\title{Lorem ipsum}

\begin{document}
\maketitle

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

\end{document}

Running the previous example through LaTeX gives the following output:

Note that running pdflatex on this example will NOT insert the watermark. I use LaTeX->dvi->ps->pdf route to obtain the pdf.
Of course the draftcopy package gives a lot of options to customize the watermark, of which I believe \draftcopyName is the most useful. If you want the word ENTWURF to be printed instead of DRAFT, which is used by default, you would add the line "\draftcopyName{ENTWURF}{155}" in the preamble. The number 155 is the scaling factor for the font---play around with it to get the scaling you want.
For those that like to do it the hard way (like me:) ), here is a piece of code that does pretty much the same thing:


          \AddToShipoutPicture{%
            \setlength{\@tempdimb}{.5\paperwidth}%
            \setlength{\@tempdimc}{.5\paperheight}%
            \setlength{\unitlength}{1pt}%
            \put(\strip@pt\@tempdimb,\strip@pt\@tempdimc){%
        \makebox(600,-700){\rotatebox{45}{\textcolor[gray]{0.75}%
        {\fontsize{6cm}{6cm}\selectfont{DRAFT}}}}%
            }%
        }

To use this code, you have to include the following packages: graphicx, eso-pic, type1cm, and color (for \textcolor). Here is a minimal example:


\documentclass{article}

\usepackage{graphicx}
\usepackage{type1cm}
\usepackage{eso-pic}
\usepackage{color}

\makeatletter
\AddToShipoutPicture{%
            \setlength{\@tempdimb}{.5\paperwidth}%
            \setlength{\@tempdimc}{.5\paperheight}%
            \setlength{\unitlength}{1pt}%
            \put(\strip@pt\@tempdimb,\strip@pt\@tempdimc){%
        \makebox(0,0){\rotatebox{45}{\textcolor[gray]{0.75}%
        {\fontsize{6cm}{6cm}\selectfont{DRAFT}}}}%
            }%
}
\makeatother

\title{Lorem ipsum}

\begin{document}
\maketitle

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

\end{document}

Be sure not to forget the \makeatletter and \makeatother commands. Unlike with draftcopy, running pdflatex on this example WILL insert the watermark in the pdf. I find this code nice because I can directly tinker with the various options. For example, if I want to change the angle of the text, I'll change the number inside the \rotatebox, if I want to move the text around the page, I'll change the two numbers inside \makebox, if I want to change the grayscale, I'll change the number inside \textcolor, etc.
UPDATE: The above code, as Ivijay pointed out, puts the word DRAFT on every page of the document. If you want it to appear on only one, you should use \AddToShipoutPicture* instead of \AddToShipoutPicture. This command only adds the watermark to one page (the first one if you leave the watermark code in your preamble).

UPDATE: Neno wants to know how to use a picture instead of text. Well, it's just a matter of replacing the lines


\makebox(0,0){\rotatebox{45}{\textcolor[gray]{0.75}%
{\fontsize{6cm}{6cm}\selectfont{DRAFT}}}}%

with something like


\includegraphics{image.eps}

Of course, you might have to play around with adjusting the width and height of the picture, depending on its size. Also, you might want to change tempdimb and tempdimc to properly position the image.

UPDATE: hcr suggests using the draftwatermark package to do this sort of thing. I've just had a look at the package, and using it really seems simple. For most of the things I did here, there is a nice command that does that. However, some things are missing from the package. First, the color of the text is always grey (you can choose the grey scale, but can't have the watermark text in, say, red). Second, there does not seem to be a way to change the typeface of the font for the watermark text, and the text is always centered; you can't move it around. And finally, there is no way you could insert a picture as a watermark (which is what Neno wanted to do), and IMHO this could be a big issue for many people who may want their company's logo as a watermark. So anyway, draftwatermark is a great choice if you just want to insert a word (centered) in the background of each page. If you need anything more sophisticated than that, you will have to do something along the lines of what I described here.

UPDATE: Amy wants to know whether there is a way to put the watermark only on some pages, preferably using something like \begin{watermark} and \end{watermark}. Well, Amy, there is no easy solution to your problem. One thing you could do is use \AddToShipoutPicture*, which places the watermark only on one page, and then paste the code wherever you need it. Of course, this is ugly. I have managed to hack together a solution that seems to work, but the code is really ugly and I don't advise anyone to use it unless you really need to :). So, here goes:


\documentclass{article}

\usepackage{graphicx}
\usepackage{type1cm}
\usepackage{eso-pic}
\usepackage{color}
\usepackage{everypage}

\newenvironment{water}{\AddEverypageHook{\waterb}}{\AddThispageHook{\waterb}\AddEverypageHook{\watere}}

\makeatletter
\newcommand{\waterb}{
\AddToShipoutPicture*{%
            \setlength{\@tempdimb}{.5\paperwidth}%
            \setlength{\@tempdimc}{.5\paperheight}%
            \setlength{\unitlength}{1pt}%
            \put(\strip@pt\@tempdimb,\strip@pt\@tempdimc){%
        \makebox(0,0){\rotatebox{45}{\textcolor[gray]{0.75}%
        {\fontsize{6cm}{6cm}\selectfont{DRAFT}}}}%
            }%
}}
\makeatother

\makeatletter
\newcommand{\watere}{
\AddToShipoutPicture*{%
            \setlength{\@tempdimb}{.5\paperwidth}%
            \setlength{\@tempdimc}{.5\paperheight}%
            \setlength{\unitlength}{1pt}%
            \put(\strip@pt\@tempdimb,\strip@pt\@tempdimc){%
        \makebox(0,0){\rotatebox{45}{\textcolor[gray]{1}%
        {\fontsize{6cm}{6cm}\selectfont{DRAFT}}}}%
            }%
}}
\makeatother

\title{Lorem ipsum}

\begin{document}

page one

\newpage

\begin{water}
 First watermarked page
 \newpage
 Second watermarked page
 \newpage
 Third watermarked page
\end{water}

\newpage

This page should not be watermarked
\end{document}

Well, Amy, let me know if this works for you...

Naive fibonacci in pypy

2007-12-31T05:55:00.000-08:00

As this is my last post this year, I'll just keep it short. I've been busy doing all sorts of things, so here's just a fast revisit to our old friend, the naive fibonacci algorithm, this time featuring Pypy. I downloaded the 1.0.0 version and ran the by now famous naive fibonacci. To get to the point, it took 184 seconds. Yep, you read it right---3 minutes. The same thing that took 27 seconds in CPython 2.5, or around 3 seconds using Psyco. Now, I didn't read much about pypy, but as I understand it, it should be faster than the CPython implementation... And AFAIK, the guys that made psyco abandoned that project to go work on bringing similar functionality to pypy. Now, I hate to bitch (no, no I don't:) ), but this really seems kinda slow. What's going on here? It's probably the function call overhead that slows it down. One should look at the code of pypy and see what exactly they put on the stack each time... The point being, I'll stick to CPython for a while more, at least until the naive fib runs under 20 seconds :) ('Cos what's more important in life than running a fast fibonacci algorithm?)
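For anyone who missed the earlier posts, this is roughly what "naive fibonacci" means: the doubly recursive definition with no caching, which is slow almost entirely because of the sheer number of function calls. The sketch below is in Python 3 syntax, and the argument 20 is only a placeholder so that it finishes instantly; it is not the value used for the timings quoted above.

```python
import time


def fib(n):
    """Naive doubly recursive Fibonacci; the number of calls grows exponentially."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


def time_fib(n):
    """Return (fib(n), elapsed seconds) for a single run."""
    start = time.perf_counter()
    result = fib(n)
    return result, time.perf_counter() - start


# 20 is a placeholder argument, chosen only to keep the run short.
result, seconds = time_fib(20)
print("fib(20) = %d in %.3f s" % (result, seconds))
```

Since nothing here does any real work besides calling fib, the timing is mostly a measure of how expensive a function call is in a given interpreter.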
Happy holidays everyone

Python 3000 - how mutable is immutable?

2007-12-22T11:26:00.000-08:00

I just downloaded Py3k alpha 2 and decided to play with it a bit. Well, besides print becoming a function, things seemed to be more or less the same as in 2.5. Then I decided to play with the more esoteric thingies in Python.
It has been long known that Python immutables aren't really immutable. For example, if you enter the following code in 2.5:


a = ([1], 2)
a[0] += [2]

you will get an error. However, printing a reveals that it has indeed been modified (a is now ([1, 2], 2)). Ok, so what happened to this little bug/hack/feature/gotcha/whatever-you-want-to-call-it in Py3k? Well, entering the code above also raises the same error, and again it actually modifies a. Now, I don't know if this "feature" is documented anywhere, but I really don't see any point in it being a part of the language (if someone can give me an example where it would be useful to have this kind of behavior, let me know). So, why is it still present in Py3k, when it's been known for a while now? I guess someone either thinks this is a good thing or they just forgot about it; either way---guys, please change this, it's annoying me :) (and what better motivation to change the language than a rant by some guy on the Internet that you never met or heard of?)
But, the fun doesn't stop here kids! Try the following:


a = ({1:2}, 2)
a[0][2] = 3

This little piece of code gets executed without any problems, both in 2.5 and in Py3k. I wonder what the rationale is for raising an error when we try to change a list (in place) that is part of a tuple, while changing a dict happens silently. Ok, I do admit that doing


a = ([1], 2)
a[0].append(2)

also doesn't throw an error, but this only confuses me more. So, I'd like to know---are tuples mutable or immutable? Or better yet, how mutable are immutables?
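One way to read this: the tuple itself never changes, in the sense that it always holds the same two object references; what changes is the list or dict those references point to. A small sketch in Python 3 syntax to illustrate that reading (my own illustration, not a statement of the language designers' rationale):

```python
a = ([1], {1: 2})
ids_before = (id(a[0]), id(a[1]))

a[0].append(2)      # mutates the list object, not the tuple
a[1][2] = 3         # mutates the dict object, not the tuple

ids_after = (id(a[0]), id(a[1]))
same_objects = ids_before == ids_after   # the tuple still refers to the same objects

# Rebinding a slot is what the tuple actually forbids.
try:
    a[0] = [99]
    rebinding_allowed = True
except TypeError:
    rebinding_allowed = False

# A tuple with mutable contents also cannot be hashed (e.g. used as a dict key).
try:
    hash(a)
    hashable = True
except TypeError:
    hashable = False

print(a, same_objects, rebinding_allowed, hashable)
```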
As a final thought, consider the following code:


a = ({1:2}, 2)
a[0][2] = a
print a

In 2.5, this code will execute with no problems, and printing a will yield
({1: 2, 2: ({...}, 2)}, 2)
thereby showing us that we have an infinite dictionary on our hands (or that's what I think these three dots stand for). However, doing the same thing in Py3k (i.e., running the above code and trying to print a---remember that in Py3k print is a function), we get


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __repr__ returned non-string (type bytes)

Whoa! WTF?? What does this mean? Whatever it means, it tells me nothing about what is actually happening here. As far as I can see, there is something wrong with the __repr__ function---it isn't cut out to handle this kind of object. So, the first thing that came to my mind was "Hey, cool, they won't print infinite dictionaries any more. They'll allow their construction, but when I try to print them, I'll get an error (a weird one, to say the least, but an error still). This kinda suxx, but ok, I guess I'll have to live with it." No. I was wrong. Try


print(a[0])

The infinite dict in the first position of the tuple prints with no problems. So, why is it a problem to print the tuple (which basically has one number, two parentheses, and one comma extra)? No idea. Of course, trying to print a[0][2] yields an error, printing a[0][2][0] goes without a problem, and so on. I guess that's why they call it an alpha version :) I sincerely hope that this won't be a part of the language when a stable release of Py3k is out.
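For anyone who wants to check how their own interpreter copes with the self-referential tuple, here is a minimal sketch in Python 3 syntax. It only records whether repr succeeds and what type it returns. I am assuming a later Python 3 release here, where as far as I know both calls return ordinary strings with an ellipsis standing in for the recursive part; try_repr is a hypothetical helper name.

```python
a = ({1: 2}, 2)
a[0][2] = a          # the dict inside the tuple now refers back to the tuple

def try_repr(obj):
    """Return repr(obj), or the TypeError it raised, without crashing."""
    try:
        return repr(obj)
    except TypeError as exc:
        return exc

tuple_repr = try_repr(a)       # the case that failed in the alpha
dict_repr = try_repr(a[0])     # the case that worked in the alpha

print(type(tuple_repr).__name__, tuple_repr)
print(type(dict_repr).__name__, dict_repr)
```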

OMG my iPod likes Barbra Streisand

2007-12-15T10:11:00.000-08:00

A few days ago, I updated my iPod with some new songs (I was getting kinda bored listening to the same ones for about two years now). I put most of these new songs in my 80's playlist (about a dozen new songs altogether). So, the last few days I had my iPod on every time I went home from work and noticed one strange thing - every time I listened to it, one of the songs was "The way we were" by Barbra Streisand (yes, I occasionally listen to Barbra, so what of it?). What was strange about it is that the shuffle option was always on and that there were 95 other songs in the playlist. To give the exact numbers, I heard Barbra four days in a row, where I listened to about 15 songs each day (the commute home takes about 45 minutes - that's 3 minutes per song), and, as I said, there were 96 songs in total to choose from.
This somehow seemed odd to me, not to say weird. That's why I decided to see just what the probability of observing this strange behaviour is. So, back to basic statistics. The probability that any given song will be selected from a playlist of 96 songs is 1/96. However, as one song finishes, the next one is selected from the remaining 95 songs (I assume that this is how the iPod's shuffle works - it shouldn't select a song that was already played until all others are played). So the probability that a given song will be selected if two songs are played is 1/96 + 1/95. You get the picture of how the story goes from here... So, the probability that you heard a given song after playing 15 of them (in a playlist of 96 songs) equals 1/96 + ... + 1/82 = 0.1689, that is, around 17%. This means that I have a 17% chance of hearing "The way we were" on one of my trips home. However, I heard it four days in a row. Since hearing a particular song on one of my trips can be regarded as an experiment with two outcomes (I either heard it or not), and since I start a new shuffle every day (meaning that each experiment is independent of the previous one), we can say that we are talking about a binomial distribution. In such a setting, we can easily compute the probability that I hear Barbra four days in a row - it's simply 0.1689^4 = 8.145 * 10^-4. That is, there is an 8 in 10000 chance that I will hear Barbra four days in a row if I put my iPod on shuffle. Yet, there it was, I heard it. Four days in a row. As my knowledge of statistics ends here, I will not try to do a hypothesis test or anything like that. If there is someone out there who is willing and able to do this, I would love to hear from him/her.
All this led me to do some research online. Well, it turns out I'm not the only one having this shuffle issue. Many users have already complained about how the "random" playing isn't random at all and how their iPods have souls. Here are just two references - http://forums.ilounge.com/archive/index.php/t-4575.html http://www.blackwell-synergy.com/doi/pdf/10.1111/j.1540-4609.2007.00132.x?cookieSet=1. The second one is a really interesting paper with some more references to people with the same problem. So, it turns out that the machines have already started developing their own "mind". Yes, our machine overlords are taking over the world, and they have started with picking their favourite music. I'm just glad to know my iPod loves Barbra Streisand - I think it'll make an ok master.
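The arithmetic above is easy to re-check in a few lines of Python. The sketch below reproduces the sum 1/96 + ... + 1/82 and its fourth power. As a cross-check under the same assumption of a shuffle that never repeats a song, it also computes the chance that one particular song lands anywhere in the first 15 of 96 positions, which is 15/96. The two per-day figures differ slightly (about 0.169 versus about 0.156), but either way four days in a row comes out below one in a thousand. The constant and variable names are mine.

```python
from fractions import Fraction

SONGS = 96        # playlist size
PLAYED = 15       # songs heard per commute
DAYS = 4          # consecutive days

# The sum used in the post: 1/96 + 1/95 + ... + 1/82
summed = sum(Fraction(1, SONGS - k) for k in range(PLAYED))

# Cross-check: in a no-repeat shuffle, a given song is equally likely to be in
# any of the 96 positions, so it is among the first 15 with probability 15/96.
exact = Fraction(PLAYED, SONGS)

for label, p in (("summed", summed), ("15/96", exact)):
    print("%s: per day %.4f, %d days in a row %.3e"
          % (label, float(p), DAYS, float(p) ** DAYS))
```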

Running fast code in Python

2007-12-08T08:55:00.000-08:00

As I suspected, the numbers I got for the execution time of the Haskell Fibonacci program (see my previous posts) are wrong, in the sense that I compiled the program without optimisation, as most normal users will do in real applications. That is why I feel that I should redeem myself and give the numbers for the program with optimisation turned on. So, when compiling the program using

ghc -O2 fib.hs -o fib -no-recomp (thanks to Don Stewart)

execution time is 0.64s (mean time of five runs). This time is much closer to the one reported on the Haskell blog (0.48s). Ok, so my hat goes off to Haskell---it really is some 40 times faster than Python (26.07s versus 0.64s).
But, being a Python fan, I couldn't help but feel that something is wrong here. Is Haskell really that good? It's fast, easy to parallelize, pure functional and it probably cooks for you after a hard day at work. So, why should I use Python?
Well, the first thing that comes to my mind is the fact that ghc generated a 925kB exe file from a 1kB file that computes Fibonacci numbers. In comparison, gcc (3.4.2) generated 16 kB for a Fibonacci program that ran in 0.75s (compiled using -O2). While it is still impressive that Haskell was faster than C, I cannot help but think that it is just too much to have an exe file almost 1000 times larger than the source. While this is certainly a drawback of Haskell, it still wasn't enough to restore my faith in Python. Then, another thing came to my mind---if Haskell can use optimisations, why can't Python? I mean, if I used -O2 in ghc, why can't I do something similar in Python?
Of course, running Python with -OO gave no real speedup---it shook about one second off the execution time. Instead of 26s, I got 25s for the non-parallel case, and instead of 18s, I got 17s for the parallel case. But the great thing about Python is its packages. I remember a friend of mine telling me about a thing called Psyco. It was some kind of package that optimised execution of Python code, so I decided to give it a shot. After downloading and installing Psyco (http://psyco.sourceforge.net/), I just added two lines to my source (I give the full source here):


from time import time
import psyco
psyco.full()  # ask Psyco to compile as much of the program as it can

def fib(n):
    # Naive doubly recursive Fibonacci
    if n == 0 or n == 1:
        return n
    else:
        return fib(n-1) + fib(n-2)

if __name__ == "__main__":
    start_time = time()
    for i in range(36):
        print "n=%d => %d" % (i, fib(i))

    print "Time elapsed: %f" % (time() - start_time)

So, the lines I added were import psyco and psyco.full(). I ran this thing and voila---the time of execution was 1.93s (mean time of five runs). Wow! That was all I could think. Remember, this was the non-parallel case, so it went from 25 seconds to 1.9s. Or, in terms of Haskell speed, Haskell now beats Python by only 3x. And with no extra effort (ok, adding two lines to the source) and no parallelization.
Unfortunately, Psyco didn't help the parallel version---it still took 18s. But still, I think this is pretty amazing and it definitely brought back my faith in Python. So, it's three times slower, but the syntax is much cleaner and it doesn't generate a thousand times bigger executable. I'm sticking to Python :)
(Note: some parts of this post are biased and downright offensive to Haskell fans (just kidding about the last one:) ). My intention is not to start a flame war or anything like that, I just felt that someone should stand up and take Python's side:). Seriously though, if you have any objections to my methodology, I would like to hear from you.)
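Since the figures in this post are means of five runs, here is a rough sketch of how such a measurement could be scripted with the standard timeit module, in Python 3 syntax. It is not the harness used for the numbers above; mean_runtime is a hypothetical helper, and the workload is kept small so the demo finishes quickly (the post's loop goes up to n=35).

```python
import timeit

def fib(n):
    """Naive recursive Fibonacci, same algorithm as in the post."""
    if n == 0 or n == 1:
        return n
    return fib(n - 1) + fib(n - 2)

def mean_runtime(func, runs=5):
    """Call func() `runs` times and return the mean wall-clock time in seconds."""
    timings = timeit.repeat(func, repeat=runs, number=1)
    return sum(timings) / len(timings)

if __name__ == "__main__":
    # range(21) keeps the demo fast; use range(36) to match the post's workload.
    elapsed = mean_runtime(lambda: [fib(i) for i in range(21)])
    print("mean of 5 runs: %.4f s" % elapsed)
```

Averaging over several runs smooths out one-off hiccups such as disk caching or other processes waking up, which matters when the whole run takes only a second or two.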

Even xkcd is leaving Perl :)

2007-12-05T06:04:00.000-08:00

For those of you out there that don't read xkcd:

Holy shmoly Haskell doesn't smoke Python away (that much)

2007-12-03T06:13:00.000-08:00

Recently, there has been some talk about how Haskell smokes Python and Ruby away in the naive fibonacci algorithm. While it is obvious that Haskell will beat both due to the code being compiled and not interpreted, I decided to see if I got the same results as reported on the Haskell hacking blog, and also see how Python handles parallel processing. Note that I will refer to the Haskell hacking blog simply as Haskell blog from now on.

Since I have no experience in Ruby, I will only compare Python and Haskell. So, here is the code I used to test the two.
Python:


from time import time

def fib(n):
    if n == 0 or n == 1:
        return n
    else:
        return fib(n-1) + fib(n-2)

if __name__ == "__main__":
    start_time = time()
    for i in range(36):
        print "n=%d => %d" % (i, fib(i))

    print "Time elapsed: %f" % (time() - start_time)

So, as we can see, this code is pretty much the same as that reported in the blog about Haskell.


Haskell:

import Control.Monad
import Text.Printf

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
main = forM_ [0..35] $ \i ->
    printf "n=%d => %d\n" i (fib i)

This is also the same as in the blog, the only difference being that I import what is needed to make it run (for some reason, the import statements aren't included in the Haskell blog). Since I have no idea how to measure time in Haskell (any hint on how to do that is appreciated), I used the ptime command for Windows (available at http://www.pc-tools.net/win32/ptime/).

So, here are the results of the test:

Python 2.5: 26.07 s
Haskell (GHC 6.8.1): 3.873 s (mean of 5 runs)

So, the results for Python are pretty much the same as reported in the Haskell blog (25.16 s is reported there), but the ones for Haskell are not. In fact, the Haskell blog reports 0.48 s, which is some eight times faster than what I got on my computer. What is the reason for this? I have no idea. I compiled the program using "ghc fib.hs", maybe they used some code optimizations? Anyway, the Haskell advantage seems to be melting, although Haskell is still some 6-7 times faster. However, the Haskell blog proceeds to use parallel processing in Haskell, but uses no such thing in Python. So, I decided to see if Python can take advantage of my two cores to shake some time off these 26 seconds, using the Parallel Python package (available at http://www.parallelpython.com/). Here is the modified Python program:


from time import time
import pp

def fib(n):
    if n == 0 or n == 1:
        return n
    else:
        return fib(n-1) + fib(n-2)


if __name__ == "__main__":
    start_time = time()
    ppservers = ()  # no remote pp servers, jobs run on local workers only
    job_server = pp.Server(ppservers=ppservers)
    print "Starting pp with", job_server.get_ncpus(), "workers"
    # Submit one job per n; calling a job object later waits for its result
    jobs = [(n, job_server.submit(fib, (n,), (), ())) for n in range(36)]
    for n, job in jobs:
        print "n=%d => %d" % (n, job())

    print "Time elapsed: %f" % (time() - start_time)

The result: 17.86 s
So, Python went from 26.07s to 17.86s, which means it took 31.5% less time than on only one core. Unfortunately, I had no success in running the parallel Haskell code reported on the Haskell blog (for some reason, I always got a "`pseq` not in scope" error while compiling). However, I can use the numbers given on the Haskell blog, where Haskell went from 0.48s to 0.42s on two cores. This means a 12.5% reduction in execution time. Although I admit that it takes slightly more intervention from the programmer to make the parallelism in Python work, the gain in speed is also more significant. As a final note, let us compare 3.873s (Haskell) and 17.86s (parallel Python) - the ratio is around 4.6, let's say 5. So, Haskell doesn't smoke Python away THAT much.
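The percentages and the ratio quoted above can be re-derived from the raw timings; a small sketch in Python 3 syntax, with helper and variable names of my own choosing:

```python
def reduction(before, after):
    """Percentage by which execution time dropped going from `before` to `after`."""
    return 100.0 * (before - after) / before

python_serial, python_parallel = 26.07, 17.86   # seconds, measured in this post
haskell_serial, haskell_parallel = 0.48, 0.42   # seconds, as reported on the Haskell blog
haskell_measured = 3.873                        # seconds, my own GHC 6.8.1 run

python_gain = reduction(python_serial, python_parallel)
haskell_gain = reduction(haskell_serial, haskell_parallel)
ratio = python_parallel / haskell_measured

print("Python:  %.1f%% less time on two cores" % python_gain)
print("Haskell: %.1f%% less time on two cores" % haskell_gain)
print("parallel Python / Haskell: %.1fx" % ratio)
```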

Trie in Python

2007-11-20T03:15:00.000-08:00

Here is my humble implementation of a trie in Python, just something to get me started.
Any comments on improving the code are welcome.


import re, types
from collections import defaultdict

def tokenize(s, removePunctuation = True):
    if removePunctuation:
        # Split on punctuation and whitespace, dropping both
        p = re.compile(r'[\.,!?()\[\]{}:;"\'<>/ \n\r\t]')
        return [el for el in p.split(s) if el]
    else:
        # Split on whitespace, then keep punctuation marks as separate tokens
        t = []
        p = re.compile(r'(\W)')
        for phrase in re.split(r'\s', s):
            t.extend(p.split(phrase))
        return [el for el in t if el]

class Node(object):
   
    def __init__(self):
        self.freq = 0
        self.next = defaultdict(int)

class Trie(object):
   
    def __init__(self, inFile):

        self.nodes = []
        self.nodes.append(Node())
        self.n = 1
        self.numberOfWords = 0

        for line in open(inFile):
            words = tokenize(line)
            for w in words:
                currNode = 0
                for char in w:
                    if self.nodes[currNode].next[char] == 0:
                        self.nodes.append(Node())
                        self.nodes[currNode].next[char] = self.n
                        currNode = self.n
                        self.n += 1
                    else:
                        currNode = self.nodes[currNode].next[char]
                self.nodes[currNode].freq += 1
        self.numberOfWords = len([node for node in self.nodes if node.freq != 0])

    def __getitem__(self, word):
        """Return the frequency of the given word."""

        if isinstance(word, types.StringTypes):
            currNode = 0
            for char in word:
                # Use get() so that a failed lookup does not insert
                # an empty entry into the defaultdict.
                nextNode = self.nodes[currNode].next.get(char, 0)
                if nextNode == 0:
                    raise KeyError("No such word: %s" % word)
                currNode = nextNode
            return self.nodes[currNode].freq
        else:
            raise TypeError("word must be a string")

    def __len__(self):
        """Return the number of nodes in the trie."""

        return self.n