Dovecot Mail Storage
Timo Sirainen

Me: Timo Sirainen
• Born 1979 in Finland
• First C64 BASIC programs around 1988
• Open source coding since about 1998
  – Irssi IRC client 1999-2004, still widely used
• Worked as programmer since 1999
• Went to university in 2006
• Dovecot project started in 2002
  – Working full time on it since about 2007
  – 2009: Rackspace, USA
  – 2010: SAPO, Portugal

Dovecot
• Open source IMAP/POP3 server
  – Only mail retrieval to clients, no mail sending
• First version released in 2002
• Mostly written by me
  – Except Sieve by Stephan Bosch
• High performance is an important goal
  – Disk I/O is typical bottleneck -> everything optimized to reduce it

Talk Overview
• Traditional mailbox formats
• Dovecot indexes
• Dovecot mailbox formats
• Full text search indexes
• Future ideas

mbox
• One file per mailbox
• Metadata in headers that are filtered out
  – X-UID, Status, X-Keywords, etc.
• Deleting requires moving data around
  – Fragile: corruption if crashes in the middle
  – Slow when deleting old messages
• May become fragmented with constant appends
• But non-fragmented file is fast to read

Maildir
• One file per message
  – Reading through all files can be slow
• Message flags in filename (name:2,<flags>)
  – Lots of renaming
  – Finding the current filename can be difficult
• Maildir is lockless? Not so much, Dovecot uses write/sync lock
  – Otherwise files can temporarily be lost during renames
    • Was the file really deleted or just renamed?
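
To make the "flags in filename" point concrete, here is a minimal Python sketch of splitting and rebuilding a Maildir filename. The helper names are hypothetical and this is not Dovecot's code; it only assumes the standard ":2," info suffix followed by flag letters kept in ASCII order.

```python
# Standard Maildir flag letters; Dovecot may add others (assumption: not handled here).
MAILDIR_FLAGS = {
    "D": "draft", "F": "flagged", "P": "passed",
    "R": "replied", "S": "seen", "T": "trashed",
}

def split_maildir_name(filename):
    """Return (base name, set of flag letters) for a Maildir filename."""
    base, sep, info = filename.partition(":2,")
    if not sep:
        # No info suffix yet, e.g. a file that still sits in new/.
        return filename, set()
    return base, set(info)

def with_flags(filename, flags):
    """Return the filename the message must be renamed to for these flags."""
    base, _ = split_maildir_name(filename)
    return base + ":2," + "".join(sorted(flags))
```

Every flag change produces a different name, so a reader that listed the directory a moment ago may no longer find the file under the name it saw. That is the renaming problem the slide describes.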

Dovecot Index Files
• Main index
  – List of messages
  – Message flags
  – Offsets to cache records
• Cache file
  – Message size, some headers, etc.
  – Keep only data that client actually uses
    • Different clients want different data for different amount of time

Dovecot Main Index
• In two files:
  – dovecot.index: Somewhat recent snapshot
  – dovecot.index.log: Recent changes
• All changes go through the log
• Readers read snapshot to memory and apply latest changes from log
  – Once opened, only need to read log updates
    • Very efficient with remote filesystems (NFS, cluster FSes)!
• Snapshot is updated “once in a while”
  – Tries to minimize disk I/O
  – Writes are usually more expensive than reads
• Log also useful for finding “what changed” events for IMAP clients
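
The snapshot-plus-log idea can be illustrated with a toy in-memory model. This is only a sketch: the record shapes, operation names and the IndexView class are invented for illustration and do not reflect the real dovecot.index or dovecot.index.log formats.

```python
class IndexView:
    """A reader's in-memory view: a snapshot plus the log entries applied so far."""

    def __init__(self, snapshot, log_offset):
        # snapshot: uid -> set of flags, as of log position log_offset
        self.flags = {uid: set(f) for uid, f in snapshot.items()}
        self.log_offset = log_offset

    def refresh(self, log):
        """Apply only the log entries not seen yet; return what changed."""
        changed = []
        for op, uid, flag in log[self.log_offset:]:
            if op == "append":
                self.flags[uid] = set()
            elif op == "add_flag":
                self.flags[uid].add(flag)
            elif op == "expunge":
                self.flags.pop(uid, None)
            changed.append((op, uid))
        self.log_offset = len(log)
        return changed
```

After the first load, each refresh reads only the tail of the log. The returned list is the kind of "what changed" information an IMAP server needs in order to notify a client.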

Dovecot Cache
• The main reason for Dovecot’s good performance
• Different IMAP clients want different data
  – Caching data that client doesn’t use wastes disk space and disk I/O
• Flexible format, allows adding any number of fields
  – Per-field caching decisions: “no”, “temporary”, “permanent”
• Cached fields never change (IMAP guarantees)
  – Data is added without locking -> duplicate data is possible
• Once in a while the file is recreated -> deleted and unwanted records are dropped

Locking
• Lock waits are bad
  – Higher user visible latency
  – Timeout failures during high load
• Dovecot v0.99 used traditional read/write index locks
  – Locking timeout problems
  – Redesigned v1.0 to do lockless reads

Lockless reads: rename()
• For:
  – Small files
  – Rarely changing files
  – If a large part of the file changes
• Writer
  – Lock
  – If file has changed, read+update internal state
  – Write the updated data to temp file
  – rename() over the original file
  – Unlock
• Reader
  – Just read the file.
[Diagram: #1, Temp file, rename(), #2]
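
The writer and reader steps above can be sketched in Python as follows. This is an illustration under assumptions, not Dovecot's implementation: the separate ".lock" file, the flock() call and the function names are all hypothetical choices made for the example.

```python
import fcntl
import os
import tempfile

def update_file(path, transform):
    """Writer: lock, re-read, write a temp file, rename() over the original."""
    lock_path = path + ".lock"  # hypothetical lock file name
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)          # writers serialize here
        try:
            try:
                with open(path, "rb") as f:       # pick up other writers' changes
                    current = f.read()
            except FileNotFoundError:
                current = b""
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
            with os.fdopen(fd, "wb") as f:
                f.write(transform(current))
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)                 # atomic rename() on POSIX
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)

def read_file(path):
    """Reader: no lock; it sees either the old or the new file in full."""
    with open(path, "rb") as f:
        return f.read()
```

The temp file is created in the same directory so that the rename stays on one filesystem. The whole file is rewritten on every update, which is why the slide limits this approach to small or rarely changing files.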

Lockless reads: Appends
• For append-only files with “size” header in each written record
[Record layout: Size | Data]
• Writer
  – Lock
  – Write data with size=0
  – Write size with each byte’s highest bit set to 1
  – Unlock
• Reader
  – Read one record at a time
  – Stop when seeing a size that isn’t fully written

Bits    Content
0-6     Bits 0-6 of size
7       Always 1
8-14    Bits 7-13 of size
15      Always 1
etc.
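
A short Python sketch of the size encoding in the table above. Several details are assumptions made for the example and are not confirmed by the slide: a 4-byte size field, the first byte carrying size bits 0-6, and the size counting only the data that follows the field.

```python
SIZE_BYTES = 4  # assumed width of the size field

def encode_size(size):
    """Pack size 7 bits per byte, with the highest bit of every byte set."""
    if not 0 <= size < 1 << (7 * SIZE_BYTES):
        raise ValueError("size does not fit in the size field")
    return bytes(0x80 | ((size >> (7 * i)) & 0x7F) for i in range(SIZE_BYTES))

def decode_size(field):
    """Return the size, or None if any byte is still unwritten (high bit clear)."""
    if len(field) < SIZE_BYTES:
        return None
    if any((b & 0x80) == 0 for b in field[:SIZE_BYTES]):
        return None
    return sum((b & 0x7F) << (7 * i) for i, b in enumerate(field[:SIZE_BYTES]))

def read_records(buf):
    """Reader: consume records until a size field is not fully written."""
    records, pos = [], 0
    while True:
        size = decode_size(buf[pos:pos + SIZE_BYTES])
        if size is None or pos + SIZE_BYTES + size > len(buf):
            return records
        start = pos + SIZE_BYTES
        records.append(buf[start:start + size])
        pos = start + size
```

Because a finished size field never contains a byte with the high bit clear, a reader can tell a placeholder of zeros, or a half-written field, apart from a complete one without taking the lock.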

Lockless writes in future?
• open(path, O_APPEND) usually provides atomic writes
  – Except with NFS
  – write() may also return fewer bytes than intended? (signal, out of space)
  – read() during a write may see incomplete data?
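
For reference, an O_APPEND write looks roughly like this in Python. The function name is hypothetical, and the sketch only detects a short write; what a lockless writer should do after one is exactly the open question on this slide.

```python
import os

def append_record(path, record):
    """Append one record with a single write(); report whether it was complete."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        written = os.write(fd, record)   # may be short (signal, out of space)
    finally:
        os.close(fd)
    return written == len(record)
```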

Single-dbox
• One file per message (u.<IMAP UID>)
• Files have immutable metadata section
  – GUID, POP3 UIDL, received date, etc.
• Advantages over Maildir:
  – Filenames don’t change
  – No IMAP UID <-> filename mapping required
• Flags stored only in Dovecot index files
  – Automatically creates dovecot.index.backup once in a while
  – When fixing corruption, tries very hard to preserve flags based on (corrupted) index and backup files

Multi-dbox
• Multiple messages in a single file (m.<id>)
  – File format same as with single-dbox
• Multiple files in a single mailbox
  – Files are about 2 MB (configurable)
    • Larger files -> less fragmentation, but deletion slower
    • Preallocation
  – Can be rotated every n days (for incremental backups)
  – Delayed (ioniced) nightly deletions (“doveadm purge”)
• Crash or power loss can’t corrupt or lose data
• Tries very hard to preserve as much data as possible in case of (filesystem) corruption.
  – Saves a backup of the original broken file

Benchmarks
• Realistic IMAP benchmarks are difficult to do
• Depends on clients and user behavior

Benchmarks
• Reading 10k messages via IMAP

SSD, OSX, HFS+      Uncached   Cached
mbox                2.9 s      1.6 s
Maildir             3.9 s      0.6 s
Single-dbox         3.9 s      0.6 s
Multi-dbox          1.5 s      0.4 s

HDD, Linux, ext4    Uncached   Cached
mbox                2.8 s      2.3 s
Maildir             8.0 s      0.9 s
Single-dbox         6.8 s      0.9 s
Multi-dbox          1.6 s      0.7 s

Benchmarks: # NFS ops
• Reading 10k messages via IMAP
• Above: uncached, below: cached
[Bar chart: number of NFS operations (Reads, Lookup, Access, Getattr) for mbox, Maildir, sdbox and mdbox; axis from 0 to 35000]

Benchmarks: # NFS ops
• Random IMAP commands sent with: imaptest logout=5 msgs=1000 delete=10 expunge=10 secs=60 seed=1
• L+A+G = lookup + access + getattr
[Bar chart: number of NFS operations (Read, Write, Readdir, L+A+G, Other) for mbox, Maildir, sdbox and mdbox; axis from 0 to 200000]

New dbox-only Features


Alternative Mail Storage
• Users rarely access their old mails
• Lower performance storage is cheaper -> Move old mails there
• dbox supports “alternative path” setting: If u.* or m.* file isn’t found from primary path, it’s looked up from alternative path
  – Files could even be moved with /bin/mv
    • But easier/safer with “doveadm altmove”
  – This would be difficult with Maildir because its filenames change
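
The lookup rule is simple enough to sketch in a few lines of Python. The function and parameter names are hypothetical; the sketch only shows the "primary first, then alternative" order described above.

```python
import os

def find_dbox_file(name, primary_dir, alt_dir):
    """Return the path of a u.* or m.* file, checking the primary path first."""
    for directory in (primary_dir, alt_dir):
        candidate = os.path.join(directory, name)
        if os.path.exists(candidate):
            return candidate
    return None  # in neither location
```

Since the file name never changes, moving a file between the two directories needs no index update, which is what makes a plain mv workable.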

Detached Mail Attachments
• MIME parts can be saved to external files
  – Only if they’re large enough (default: 128 kB)
  – Also can be filtered based on Content-Type, etc. headers
    • Avoid extra disk seek for downloading attachments that clients automatically display inline
• Supports saving base64 encoded MIME parts decoded (25% less disk space)
  – Only if re-encoding can be done to 100% original
• dbox-only
  – Metadata contains pointers to external parts
• Saving is done via simplified “filesystem API”
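
The "only if re-encoding can be done to 100% original" rule can be sketched as a round-trip check. This is a simplified illustration, not Dovecot's logic: it assumes CRLF line endings and takes the line length from the first line, and the function names are hypothetical.

```python
import base64

def reencode(raw, line_len):
    """Base64-encode raw bytes, wrapped at line_len characters with CRLF."""
    b64 = base64.b64encode(raw)
    return b"".join(b64[i:i + line_len] + b"\r\n"
                    for i in range(0, len(b64), line_len))

def decode_if_reproducible(encoded):
    """Return decoded bytes only if re-encoding gives back the exact input."""
    line_len = len(encoded.split(b"\r\n", 1)[0])
    if line_len == 0:
        return None
    try:
        raw = base64.b64decode(encoded.replace(b"\r\n", b""), validate=True)
    except ValueError:
        return None  # not clean base64, keep the part as it is
    return raw if reencode(raw, line_len) == encoded else None
```

A part that fails the check (odd line lengths, stray whitespace, a missing final line break) would have to be stored in its original encoded form, because the client must later receive byte-identical content.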

Single Instance Storage
• Storage’s internal deduplication
  – Could be enabled only for attachment storage
• Dovecot’s SIS
  – FS API backend
  – Based on file hashes and hard links
    • Hash is configurable (e.g. SHA256 + size)
  – Byte-by-byte verification after hash found
    a) Never, trust hash uniqueness (not implemented)
    b) Immediate comparison during saving
    c) Delayed (nightly) comparison and deduplication

Dovecot SIS
• Attachments saved to “HA/SH/HASH-GUID” under global attachment dir (e.g. /var/attachments/)
  – GUID guarantees filename uniqueness
  – e.g. file with hash “123456” is saved to 12/34/123456-GUID
  – “HA” and “SH” may be symlinks to other mounts
• SIS is done by hard linking HA/SH/hashes/HASH to HA/SH/HASH-GUID if it exists.
  – Basically: “ln hashes/123456 -guid”
  – No attempts to create cross-mount hard links
• Safe to move/backup/restore attachment files
  – But hashes/HASH is auto-deleted only when its link count drops from 2 to 1. External changes may leak it.
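
The hard-link scheme can be sketched in Python as below. This is a rough illustration under assumptions: plain SHA-256 as the hash, immediate byte-by-byte comparison (option b from the previous slide), and a hypothetical function name. Cleanup of hashes/HASH and cross-mount handling are left out.

```python
import hashlib
import os

def sis_save(root, data, guid):
    """Save data as HA/SH/HASH-GUID, hard linked to HA/SH/hashes/HASH if possible."""
    digest = hashlib.sha256(data).hexdigest()
    subdir = os.path.join(root, digest[:2], digest[2:4])
    hashes_dir = os.path.join(subdir, "hashes")
    os.makedirs(hashes_dir, exist_ok=True)
    target = os.path.join(subdir, f"{digest}-{guid}")
    canonical = os.path.join(hashes_dir, digest)

    if os.path.exists(canonical):
        with open(canonical, "rb") as f:
            identical = f.read() == data      # immediate byte-by-byte check
        if identical:
            os.link(canonical, target)        # share the existing copy
            return target

    with open(target, "wb") as f:             # first copy, or a hash collision
        f.write(data)
    if not os.path.exists(canonical):
        os.link(target, canonical)            # publish it for later saves
    return target
```

Saving the same attachment twice gives two HASH-GUID names plus the hashes/HASH name, all pointing at one inode, so the data is stored on disk once.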

Full Text Search Indexes
• Dovecot has abstract FTS API
• IMAP protocol says search is about “substring matching” (e.g. “ello” matches “hello”)
  – Almost no FTS engines support this
  – Few people seem to care about this anymore
• Currently supported FTS backends:
  – Squat: Dovecot’s own indexer, supports substring matching.
    • Currently index updating is too inefficient
  – Apache Solr

FTS: Solr
• Solr is a search engine server using Lucene
• Dovecot talks to Solr via HTTP
• Sharding via per-user fts_solr setting

Future
• FS API used for indexes and dbox
  – Support for key-value databases
  – Asynchronous disk I/O