Monday, April 7, 2008

Saving e-mails to disk (from a Gmail account)

While trying to download all my spam messages to disk (I'm building a database of spam messages), I realised that this is no simple task. I could save the messages using Thunderbird or Outlook, but I use neither (I find the Gmail web interface very nice and user-friendly), so that was out of the question. After a little browsing, however, I discovered a wonderful Python package called libgmail. Long story short, here is the script I used to download all the messages from the spam folder:


#!/usr/bin/env python
'''
savemsg.py -- Download all messages from a specified folder
License: GPL 2.0
'''

import sys
from getpass import getpass
import libgmail

if __name__ == "__main__":
    # Take the account name from the command line, or ask for it
    try:
        name = sys.argv[1]
    except IndexError:
        name = raw_input("Gmail account name: ")

    pw = getpass("Password: ")

    ga = libgmail.GmailAccount(name, pw)

    print "\nPlease wait, logging in..."

    try:
        ga.login()
    except libgmail.GmailLoginFailure, e:
        print "\nLogin failed. (%s)" % e.message
        sys.exit(1)
    else:
        print "Login successful.\n"

    # Folder names offered to the user, keyed by the matching libgmail constant
    FOLDER_list = {'U_INBOX_SEARCH' : 'inbox',
                   'U_STARRED_SEARCH' : 'starred',
                   'U_ALL_SEARCH' : 'all',
                   'U_DRAFTS_SEARCH' : 'drafts',
                   'U_SENT_SEARCH' : 'sent',
                   'U_SPAM_SEARCH' : 'spam',
                   }

    # Keep the user's choice in its own variable instead of overwriting the dict
    folder_name = raw_input('Choose a folder (inbox, starred, all, drafts, sent, spam): ')
    if folder_name not in FOLDER_list.values():
        print "Unknown folder: %s" % folder_name
        sys.exit(1)
    folder = ga.getMessagesByFolder(folder_name)

    for thread in folder:
        for msg in thread:
            print "Downloading message %s " % msg.id
            # Look for the first charset declaration in the raw message source
            encIndexStart = msg.source.find('charset=')
            if encIndexStart != -1:
                # The value ends at the first space, newline, ';' or closing quote
                encIndexEnd = (msg.source.find(' ', encIndexStart),
                               msg.source.find('\n', encIndexStart),
                               msg.source.find(';', encIndexStart),
                               msg.source.find('"', encIndexStart + 10))
                encIndexEnd = [ind for ind in encIndexEnd if ind != -1]
                # No terminator found: the value runs to the end of the source
                encIndexEnd = min(encIndexEnd) if encIndexEnd else len(msg.source)
                enc = msg.source[encIndexStart + 8:encIndexEnd]
                # Drop quotes, semicolons and a trailing '\r' left by CRLF line endings
                enc = enc.replace('"', '').replace(';', '').strip()
            else:
                enc = 'ascii'
            print "Detected encoding %s\n" % enc
            try:
                f = open(msg.id + " " + msg.subject + ".txt", 'w')
            except Exception:
                # message subject contains characters forbidden by the os in the
                # file name, use just message id
                f = open(msg.id + ".txt", 'w')
            try:
                f.write(msg.source.decode(enc).encode('utf-8'))
            except Exception:
                # unknown or wrong encoding, save the raw source unchanged
                f.write(msg.source)
            f.close()
    print "\n\nDone."
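
When the subject contains characters that the operating system forbids in file names, the script above falls back to using the message id alone. A gentler alternative is to clean the subject first. The following is only a minimal sketch in Python 3 (unlike the Python 2 script above); the helper name safe_filename, the set of replaced characters and the 80-character limit are my own assumptions, not part of libgmail:

```python
import re

# Characters rejected in file names on Windows, plus control characters.
# The set includes '/', so it also covers Unix-like systems.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(msg_id, subject, max_subject_len=80):
    """Build '<id> <cleaned subject>.txt', or '<id>.txt' if nothing usable is left."""
    # Replace forbidden characters, cap the length, then trim spaces and dots
    cleaned = _FORBIDDEN.sub("_", subject)[:max_subject_len].strip(" .")
    if cleaned:
        return "%s %s.txt" % (msg_id, cleaned)
    return "%s.txt" % msg_id
```

For example, safe_filename("123", "Re: cheap/meds?") gives "123 Re_ cheap_meds_.txt", while an empty subject gives "123.txt".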


The script can download messages from any Gmail folder. The encoding of each message is detected automatically (from the first charset= declaration in its source), and the message is saved in UTF-8 to facilitate later processing; if the decoding fails, the raw source is written unchanged. Of course, libgmail must be installed to run the script. It is also very easy to adapt the script for other purposes (I actually wrote it by modifying one of the demo scripts that come with libgmail).
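
The encoding detection in the script is a plain string search for charset=. As a rough sketch of the same idea, here is how the charset could be read with the email module from the Python 3 standard library instead. This is an alternative technique, not what the script does, and the function names detect_charset and to_utf8 are made up for illustration. Like the script, it decodes the whole source with a single charset, which is a simplification for multipart messages:

```python
from email import message_from_bytes


def detect_charset(raw_source, fallback="ascii"):
    """Return the first charset declared in a Content-Type header, else the fallback."""
    msg = message_from_bytes(raw_source)
    # walk() yields the message itself first, then any nested parts
    for part in msg.walk():
        charset = part.get_content_charset()
        if charset:
            return charset
    return fallback


def to_utf8(raw_source):
    """Re-encode the raw message as UTF-8, or return it unchanged if decoding fails."""
    charset = detect_charset(raw_source)
    try:
        return raw_source.decode(charset).encode("utf-8")
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or bytes that do not match it: keep the raw bytes
        return raw_source
```

The output of to_utf8 can be written to a file opened in binary mode ('wb'), which mirrors the fallback in the script: a message that cannot be decoded is saved exactly as received.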