Monday, April 7, 2008

Saving e-mails to disk (from a Gmail account)

As I was trying to download all my spam messages to disk (I'm building a database of spam messages), I realised that this was no simple task. One option would be to save the messages using Thunderbird or Outlook, but since I don't use either (I find the Gmail web interface very nice and user-friendly), that was out of the question. However, after a little browsing I discovered a wonderful Python package called libgmail. Long story short, here is the script I used to download all the messages from the spam folder:


#!/usr/bin/env python
'''
savemsg.py -- Download all messages from a specified folder
License: GPL 2.0
'''

import sys
from getpass import getpass
import libgmail

if __name__ == "__main__":
    try:
        name = sys.argv[1]
    except IndexError:
        name = raw_input("Gmail account name: ")

    pw = getpass("Password: ")

    ga = libgmail.GmailAccount(name, pw)

    print "\nPlease wait, logging in..."

    try:
        ga.login()
    except libgmail.GmailLoginFailure, e:
        print "\nLogin failed. (%s)" % e.message
        sys.exit(1)
    else:
        print "Login successful.\n"

    # Folder names the user can choose from (the dict values).
    FOLDER_list = {'U_INBOX_SEARCH' : 'inbox',
                   'U_STARRED_SEARCH' : 'starred',
                   'U_ALL_SEARCH' : 'all',
                   'U_DRAFTS_SEARCH' : 'drafts',
                   'U_SENT_SEARCH' : 'sent',
                   'U_SPAM_SEARCH' : 'spam',
                   }

    # Use a separate name so the dict above is not overwritten.
    folder_name = raw_input('Choose a folder (inbox, starred, all, drafts, sent, spam): ')
    folder = ga.getMessagesByFolder(folder_name)

    for thread in folder:
        for msg in thread:
            print "Downloading message %s " % msg.id
            # Look for the first charset declaration in the raw source.
            encIndexStart = msg.source.find('charset=')
            if encIndexStart != -1:
                # The charset value ends at the nearest of these delimiters.
                encIndexEnd = (msg.source.find(' ', encIndexStart),
                               msg.source.find('\n', encIndexStart),
                               msg.source.find(';', encIndexStart),
                               msg.source.find('"', encIndexStart + 10))
                encIndexEnd = [ind for ind in encIndexEnd if ind != -1]
                # No delimiter found: the value runs to the end of the source.
                encIndexEnd = min(encIndexEnd) if encIndexEnd else len(msg.source)
                enc = msg.source[encIndexStart + 8:encIndexEnd]
                enc = enc.replace('"', '').replace(';', '').strip()
            else:
                enc = 'ascii'
            print "Detected encoding %s\n" % enc
            try:
                f = open(msg.id + " " + msg.subject + ".txt", 'w')
            except (IOError, UnicodeError):
                # message subject contains characters forbidden by the os in the
                # file name, use just message id
                f = open(msg.id + ".txt", 'w')
            try:
                f.write(msg.source.decode(enc).encode('utf-8'))
            except (UnicodeError, LookupError):
                # unknown or wrong encoding: save the raw source unchanged
                f.write(msg.source)
            f.close()
    print "\n\nDone."
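
The script falls back to the bare message id whenever the subject cannot be used as a file name, which loses the subject entirely. A gentler alternative would be to clean the subject first. The sketch below is only an illustration: safe_filename is a hypothetical helper (not part of libgmail), and the set of forbidden characters is an assumption based on the usual Windows rules plus the Unix path separator.

```python
import re

# Assumed set of characters to avoid: those rejected by Windows,
# control characters, and '/' (the path separator on Unix-like systems).
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(msg_id, subject, max_len=100):
    """Build '<id> <subject>.txt', replacing forbidden characters with '_'."""
    cleaned = _FORBIDDEN.sub("_", subject).strip(" .")
    cleaned = cleaned[:max_len]  # keep the name reasonably short
    if not cleaned:
        # Nothing usable left in the subject: fall back to the id alone.
        return msg_id + ".txt"
    return "%s %s.txt" % (msg_id, cleaned)
```

In the script, open(safe_filename(msg.id, msg.subject), 'w') could then replace the first open() call, with the id-only fallback kept as a last resort.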


The script can download messages from any Gmail folder. It detects the encoding of each message from the first charset= declaration in its source and saves the message as UTF-8 to facilitate later processing; if the conversion fails, the raw source is written unchanged. Of course, you need libgmail installed to run the script. It is also easy to adapt the script for other purposes (I actually wrote it by modifying one of the demo scripts that come with libgmail).
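
Searching the raw source for charset= works for simple messages, but the standard library's email package can do the same job by actually parsing the headers. The following is a rough sketch of that alternative, not what the script does: it is written for Python 3 (the script itself is Python 2), it assumes the raw message is available as bytes, and detect_charset and to_utf8 are names made up for this example.

```python
from email import message_from_bytes


def detect_charset(raw, default="ascii"):
    """Return the first charset declared by any part of a raw message."""
    msg = message_from_bytes(raw)
    # walk() visits the message itself and, for multipart mail, every part.
    for part in msg.walk():
        charset = part.get_content_charset()
        if charset:
            return charset
    return default


def to_utf8(raw, default="ascii"):
    """Decode the raw bytes with the declared charset and re-encode as UTF-8."""
    charset = detect_charset(raw, default)
    try:
        return raw.decode(charset).encode("utf-8")
    except (UnicodeError, LookupError):
        # Unknown or wrong charset: keep the bytes unchanged, as the script does.
        return raw
```

Like the script, this applies a single charset to the whole message source, so a multipart message that mixes encodings would still need per-part handling.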








3 comments:

Christian Jauvin said...

Hi,

I once tried to do something similar (with libgmail as well), and was quickly stopped by a "detected account anormal usage" message that froze my Gmail account (and my enthusiasm) for a couple of hours.

Filox said...

I've heard that people have had trouble with that. However, I must say that I've been using this script for a few days now (every day) and I still haven't had any problems with it. I'll write if something changes...

Christian Jauvin said...

In fact, I must add that I could run the script for a couple of hundred message threads (600 or 700, if I remember correctly) before being stopped. Obviously Gmail is not meant to be hammered by a running script like that, as it is not a real Web service (libgmail is not officially supported).