The Setup
On 9 July 2013, I became aware of a problem with my email. I use OS X on my primary machine, which is usually running the latest beta of everything. I run beta software so that you don’t have to.
My primary mail client on OS X is Mail, the one that’s included “in the box” with OS X. I like it because it stores each email in its own file, with an emlx extension.[1] This will be important later.
The problem presented itself when I was in one of my Inbox directories. I have three email accounts in Mail, all backed by Google Apps (because problems can happen, and I can access email away from home, etc.).
I noticed that the message preview wasn’t showing in some messages. When I’d click on that email, the body would be blank, with “No Sender” and “No Subject” in the header. So I told Mail to rebuild the mailbox.
It was then that I realised that the 13 000 emails in that Inbox had dropped to around 5 000.
I went through every directory, rebuilding each one. At the end of the process, I’d lost roughly half of my estimated 90 000 emails.
But it’s a Mac, and I’ve got Time Machine! So I restored to my most recent backup, that being 1:00am on the same day. I could bear to lose half a day.
When the restore was done, the same problem persisted: half of my email was still gone. This could only be a result of some corruption.
Using rsync and some creative scripting, I managed to restore directory-level backups from Time Machine, reaching back to 9 January 2013. During this process I noticed how the GUID changed across backups. This gave me my first inkling of what went wrong. Each time Mail is updated by an OS X beta or point release, the entire thing is reindexed. You’ve seen this if you’ve ever updated the operating system.
Between 22 and 29 May (Time Machine goes to weekly backups for older content to save space), the GUID changed, and the number of files dropped significantly.
Later I narrowed it down to something that happened on 26 May 2013 (from timestamps on those emlx files I mentioned earlier).
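A comparison like this can be scripted. The following is a minimal Python sketch, not the scripting used at the time: it counts emlx files per GUID directory inside one restored mbox directory, assuming the first directory level below the mbox is the GUID. The function name count_emlx_per_guid is made up for illustration.

```python
from collections import Counter
from pathlib import Path


def count_emlx_per_guid(mbox_dir):
    """Count .emlx files under each top-level sub-directory of an mbox directory.

    Assumption: the layout is <name>.mbox/<GUID>/Data/.../Messages/*.emlx,
    so the first path component below the mbox directory is the GUID.
    """
    counts = Counter()
    mbox = Path(mbox_dir)
    for emlx in mbox.rglob("*.emlx"):
        guid = emlx.relative_to(mbox).parts[0]
        counts[guid] += 1
    return dict(counts)
```

Running this against the same mbox directory in two different backups and comparing the results would show both a changed GUID and a drop in the file count.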
This is what I think happened:
- The 10.8.4 beta 12E55 was seeded on 24 May 2013. I was in Chicago on training until 25 May 2013, so I didn’t install it
- I recall also using a Google Apps plugin that migrated all of my Gmail content to a Google Apps account on 25 May
- Mail had been open all this time, so it had downloaded around 70 000 additional messages that I didn’t want
- I decided to delete the new messages, using a combination of Smart Folders, flags (also important), and a lack of patience
- After doing what I thought was a cleanup of those 70 000 new messages, I installed the new beta
- The beta rebuilt my mail directories with a new GUID, but *crashed* during this process.
From what I can make out, I lost my mail this way:
- Deleting 70 000 messages was tricky, and I didn’t give Mail enough time to perform the cleanup operation in the background, which caused orphaned files
- Upgrading to the new version caused a change to the GUID, meaning a new directory structure
- Only half of the messages were migrated to the new GUID before Mail crashed
- When it was opened again, the Index file (a SQLite database) was complete, but half of the files were garbage-collected because they were orphaned under an old GUID.
That it took me over a month to discover a problem speaks to the volume of email I keep. Yes, I’m a hoarder. No, I will never change.
Second Act
Ever since the near-loss in October of last year of every photograph I’d taken since moving to Canada (due to a combination of factors), I’ve used CrashPlan+, Time Machine, and Carbon Copy Cloner to keep copies of my data, including email. So as I noted previously, I had a directory-level copy of email going back to 9 January 2013.
There were just under half a million files in that directory. I don’t know if those of you reading this far have ever looked at a Mail store, but it’s not pretty.
Everything on my machine is stored in the default location: /Users/randolph/Library/Mail/V2.
Under there is the MailData directory, containing metadata about the mailboxes, accounts, settings and so on.
There is also a directory for Mailboxes, which maps to the “On My Mac” directories you see in Mail.
Finally, each mail account has a directory, for POP3 and IMAP accounts.
Irrespective of the mail account or Mailbox directory, each contains an mbox directory, which is treated in a special way by Mail, but is just another directory containing sub-directories and files.
For example, in my iCloud account’s inbox, I have this email file:
/Users/randolph/Library/Mail/V2/AosIMAP-[username]/INBOX.mbox/[GUID]/Data/0/1/2/Messages/210298.emlx
Attachments are saved in a similar fashion. Notice how the directory name matches the emlx filename:
/Users/randolph/Library/Mail/V2/AosIMAP-[username]/INBOX.mbox/[GUID]/Data/0/1/2/Attachments/1/210298/2/2.jpg
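To make the layout concrete, here is a small Python sketch that pulls the account, mailbox, GUID and message number out of a path like the ones above, and looks for attachment directories named after the message. It assumes the structure shown in these two examples. Because the exact nesting under Attachments is not something I would rely on, it searches recursively instead of hard-coding the levels. Both function names are hypothetical.

```python
from pathlib import Path


def describe_emlx_path(path):
    """Break an emlx path into the pieces described in the post."""
    p = Path(path)
    parts = p.parts
    # Use the innermost component ending in ".mbox"; the directory right
    # after it is assumed to be the GUID.
    mbox_i = max(i for i, part in enumerate(parts) if part.endswith(".mbox"))
    return {
        "account": parts[mbox_i - 1],
        "mailbox": parts[mbox_i],
        "guid": parts[mbox_i + 1],
        "message_number": p.stem,
    }


def find_attachment_dirs(emlx_path):
    """Find directories named after the message under the sibling Attachments directory."""
    p = Path(emlx_path)
    attachments_root = p.parent.parent / "Attachments"
    if not attachments_root.is_dir():
        return []
    return sorted(d for d in attachments_root.rglob(p.stem) if d.is_dir())
```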
From what I can see, the directory numbering order seems to be sequential, as it adds new directories (regardless of the level).
The half-million restored files, from each respective Time Machine backup, looked something like this. There were at least four GUID directories in each mbox directory, and as you can imagine, hundreds of thousands of duplicate files (I was expecting under 100 000 unique emails, so I had over 400 000 to get rid of).
Deduplication
The first trick of deduplication was to remove identical files. Since I hadn’t touched a large portion of my emails since January, I expected a significant number to be deleted. Over 50 000 files were identical to others, so those were summarily erased.
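One common way to find byte-for-byte identical files is to group them by a content hash. The sketch below does that in Python with SHA-256. It is an illustration of the idea, not the tool used here, and the function names are invented. It defaults to a dry run so nothing is deleted until you ask for it.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical_files(root):
    """Return lists of paths whose contents hash to the same value."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]


def remove_identical_files(root, dry_run=True):
    """Keep the first file of each identical group and remove the rest."""
    removed = []
    for paths in find_identical_files(root):
        for extra in paths[1:]:
            if not dry_run:
                extra.unlink()
            removed.append(extra)
    return removed
```

Calling remove_identical_files(root) first, and only then repeating it with dry_run=False, gives you a chance to inspect the list before anything is erased.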
Now I stood with a problem: over 400 000 files with no easy way to deduplicate.
I did some reading on the emlx format. Since OS X 10.4, Apple has made the format proprietary, but still based on Maildir. At the top of each file is a file size in ASCII, followed by the message itself in Maildir format, and the attachment in MIME format, followed by some XML metadata with flag information, UIDs, etc.
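As a rough illustration of that layout, this Python sketch splits the raw bytes of an emlx file into the message and the trailing metadata. It assumes the number on the first line is the length of the message in bytes, and that whatever follows the message is an XML property list. Both are assumptions based on the description above, so treat it as a sketch, not a reference parser. The name parse_emlx is hypothetical.

```python
import plistlib
from email import message_from_bytes


def parse_emlx(raw):
    """Split raw emlx bytes into (message, metadata).

    Assumptions: the first line holds the message length in bytes as ASCII,
    the message follows immediately, and the remainder is an XML plist.
    """
    first_line, _, rest = raw.partition(b"\n")
    length = int(first_line.strip())
    message = message_from_bytes(rest[:length])
    trailer = rest[length:].strip()
    metadata = plistlib.loads(trailer) if trailer else {}
    return message, metadata
```

With a file read via open(path, "rb").read(), message["Message-ID"] and message["Subject"] give the usual headers, and metadata is a plain dictionary of whatever the trailing XML contains.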
I realised I could safely delete every single file that was not an emlx file from my backup, and just store each message in one directory to make it much easier to programmatically determine duplicates.
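Flattening has one catch: the same file name (210298.emlx, say) can appear under several GUID directories, so copies need unique names. A minimal Python sketch of one way to handle that, prefixing each copy with a running number, is shown here. The function name and the naming scheme are made up for illustration, and the target directory should sit outside the source tree.

```python
import shutil
from pathlib import Path


def flatten_emlx(source_root, target_dir):
    """Copy every .emlx file under source_root into one flat directory.

    A zero-padded running number is prepended to each name so that files
    with the same name from different GUID directories do not collide.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() collects the full list before any copying starts.
    for index, emlx in enumerate(sorted(Path(source_root).rglob("*.emlx"))):
        destination = target / f"{index:06d}-{emlx.name}"
        shutil.copy2(emlx, destination)
        copied.append(destination)
    return copied
```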
I wrote a C# application that extracted the Message-ID field from the emlx file, and stored it in a class of type MailItem (which I created to keep Message-ID and FileInfo). I then iterated through a unique list of Message-IDs, and for each one, deleted any duplicate files.
In retrospect, I should have used the UID in the XML data at the end of the file, but I used Message-ID because it was closer to the top of the email, and thus quicker to locate.
For some reason, a lot of messages did not have a Message-ID, so these were not even considered for deletion.
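The original tool was written in C# and is not reproduced here. The same idea can be sketched in Python: read the Message-ID header from each emlx file in a flat directory, group files by that value, keep the first file of each group, and leave files without a Message-ID alone. The function names are hypothetical, and the sketch assumes the first line of each file is the byte count and the headers start on the second line.

```python
from collections import defaultdict
from email.parser import BytesHeaderParser
from pathlib import Path


def read_message_id(emlx_path):
    """Return the Message-ID header of an emlx file, or None if it has none."""
    with open(emlx_path, "rb") as handle:
        handle.readline()  # skip the leading byte-count line (assumed layout)
        headers = BytesHeaderParser().parse(handle)
    message_id = headers.get("Message-ID")
    return str(message_id).strip() if message_id else None


def group_by_message_id(directory):
    """Map each Message-ID to the files carrying it; files without one are skipped."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob("*.emlx")):
        message_id = read_message_id(path)
        if message_id is not None:
            groups[message_id].append(path)
    return groups


def delete_duplicates(directory, dry_run=True):
    """Keep the first file per Message-ID and delete the others."""
    deleted = []
    for paths in group_by_message_id(directory).values():
        for extra in paths[1:]:
            if not dry_run:
                extra.unlink()
            deleted.append(extra)
    return deleted
```

Which copy counts as the "first" is decided purely by file name order here. A more careful version might prefer the newest or largest copy.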
In another interesting discovery, messages created by PHP scripts (specifically my ncane.com site) were generating the same Message-ID. I evaluated these as non-essential and deleted them too.
After running the deduplication C# code, I was left with 124 000 files. This was still 30% more than I was expecting, but it was good enough to import back into Mail.
The Result
My first import attempt failed. I discovered that Mail doesn’t like more than 40 000 or so items in a mailbox. Bear in mind that the underlying directory structure creates multiple sub-directories using a sequence, so I’m not sure why this limitation exists. Either way, I had to manually create six new mbox directories in which to spread out the emails so that no directory would contain more than 10 000 emails.
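Spreading the files out by hand gets tedious, so here is a Python sketch of how the batching could be scripted. It only moves emlx files from one flat directory into numbered directories of at most batch_size files each. The directory naming ("Import-01" and so on) is invented for illustration, and the result is a set of plain directories, not something Mail will necessarily recognise without an import step.

```python
import shutil
from pathlib import Path


def split_into_batches(source_dir, target_root, batch_size=10_000, prefix="Import"):
    """Move .emlx files into numbered directories holding at most batch_size files."""
    files = sorted(Path(source_dir).glob("*.emlx"))
    batches = []
    for start in range(0, len(files), batch_size):
        # Hypothetical naming scheme: Import-01, Import-02, ...
        batch_dir = Path(target_root) / f"{prefix}-{start // batch_size + 1:02d}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        for emlx in files[start:start + batch_size]:
            shutil.move(str(emlx), str(batch_dir / emlx.name))
        batches.append(batch_dir)
    return batches
```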
During the import, Mail generated its directory structure as I described above, and extracted all of the attachments as expected. The whole process took about two hours to process 124 000 files.
Interestingly, and as I expected, some of the emails did not import. These would have been emails with duplicate UIDs. Mail is smart enough to ignore those. This is the value I should have used to deduplicate the files to begin with.
Lessons Learnt
Apple Mail is a great product, and is spectacular at managing large volumes of mail, because it stores each one in a self-contained emlx file. While this makes importing into other applications tricky, the Maildir format can be easily read with a text parser.
Apple Mail, however, is slower than I expected to move emails between different mailboxes. I believe that if I’d been more patient with cleaning up the additional 70 000 messages that came in as a result of merging two Google accounts, and then made a full backup at the time, the problem may not have been as damaging. It is this large number of orphaned files that I think caused the upgrade process to fail when I upgraded to build 12E55.
I now understand the format of emlx files better, and I am confident that Apple Mail is still the best tool going forward. I just need to be more patient moving things between mailboxes, and limit the number of emails to 10 000 per mailbox.
1. I was burnt in 2006 by a tragic hard drive failure, and at the time used Outlook exclusively. PST files containing all email prior to 2006 were lost.