To solve this problem I decided to index the 'reduced' data with elasticsearch, that is a search engine based on lucene. As a result, I can make quick searches upon millions of unique e-mail transactions and get a result in a matter of seconds.
Structuring the e-mail transactions
First of all, I have to clarify that each e-mail transaction is a unique operation in a single e-mail server. Therefore, sending an e-mail may involve several e-mail transactions because it is traversing several e-mail servers throughout our infrastructure. At the end, each of these unique transactions are represented by the following data structure:
- from: A string representing the sender's e-mail address.
- to: Array of recipients. Each element in the array is a dictionary with the following fields:
- recipent: one of the recipient's e-mail address
- relay: the next hop in the delivery chain. It can be a server or local delivery.
- status: A string that informs about the result of the operation.
Since a particular e-mail can be stuck in the queue and re-delivered later, we may see the same recipient appearing several times, but with different status messages: ie. deferred and successfully sent.
- msgid: unique message identifier that represents a particular sent e-mail. Not to confuse with an e-mail transaction that is unique within an e-mail server.
- client: The previous hop in the e-mail chain.
- date: String representing the time in which the transaction took place.
- host: the server that is executing the e-mail transaction.
Data retention policy
I have configured the index to automatically drop entries older than 90 days. In my opinion, we do not need to keep this data for a longer period of time, because the troubleshooting tends to involve transactions that took place in the previous days. Of course, In case it is needed, we still keep the raw logs for a longer period and they can be manually parsed or, simply, re-indexed.
Reducing and indexing the raw logs
The script elasticsearch.py parses the raw logs and indexes them by using the REST API.
There is a very dirty hack here, since many syslog servers do not send the year in the log messages. The script assumes we are indexing logs from the current year and makes a dirty trick on January 1st. In case you want to index a log file from the previous year you can use the -y flag to force this case. Yes, it is as dirty as I said :) Modern syslog servers like rsyslog or syslog-ng can be configured to output the year, but it may not be the case with older installations.
The script search.py connects to the elasticsearch server and sends the queries.
A query example would be the following one:
./search.py -f email@example.com -d '2011-12-15 00:00:00' -D '2011-12-17 00:00:00' -h smtp -r 10The above query searches all the e-mail transactions that match:
client: bob.desktop [10.10.10.10]
date: 2011-12-15 15:27:19
to: firstname.lastname@example.org' , relay='smtp.bar.com', status='Sent (Ok: queued as 5298B228B8B)'
client: bob.desktop [10.10.10.10]
date: 2011-12-15 14:39:25
to: email@example.com' , relay='smtp.bar.com', status='Sent (Ok: queued as 16F49228B8B)'
- The sender is firstname.lastname@example.org
- Between 2011-12-15 00:00:00 and 2011-12-17 00:00:00
- The transactions occur in the server called smtp
- Only output the 10 most recent transactions.
All the scripts can be found here.