One of the situations I have to face really often is investigate e-mail problems such: delays, e-mail not arriving, bounces, user account does not exist, etc.. or, simply, whether a user sent an e-mail or not.
If reading the logs created by Postfix or Sendmail is a pain 'per se', trying to understand what happened in a scenario with multiple intermediate servers is a nightmare. Therefore, doing log management is key to succeed (in this particular situation or any other that involves large quantities of data).
There is nothing fancy in my setup. I am just using a centralized syslog server to collect all the raw logs created by the e-mail servers. Then, we can trace the problem from one single place, that is good, but trying to understand what happened is still a pain. We still cannot see the forest for the trees.
I have come up with a Python script ( search_email_transactions.py ) that parses Postfix and Sendmail logs. It searches all the e-mail transactions that match the message id, the sender, the recipient and the date.
The output is self-explanatory:
# cat /var/log/all | ./search_email_transactions.py -l - -f ^email@example.com -i 4EDC80B6.firstname.lastname@example.org
date: Dec 5 09:28:42
to: email@example.com' , relay='smtp.bar.com[10.10.10.1]:25', status='sent (250 2.0.0 pB58ShMJ024096 Message accepted for delivery)'
In this example, it only outputs one unique transaction but in a scenario with multiple servers we should have a listing of all the transactions and all the servers involved, giving us the full picture.
Since it is time and CPU consuming, I am planning to modify this script to treat all the raw logs and dump all this information to a search engine like elasticsearch, making the troubleshooting faster while still having the raw logs in case I need to go deeper.