In the two previous posts I described how I made sense of my e-mail logs by parsing them and storing them in a search engine called elasticsearch. This transformation helped me quickly trace problems in a distributed environment with many servers. The alternative was to grep the log files manually, which is a waste of time. In this closing post, I would like to present the last tweak I made to the script and my conclusions after some months of testing.
The previous version read all the logs from a file (i.e., the syslog file) and stored the transactions in the elasticsearch index. This is handy when we have to start indexing from scratch, but it is quite inefficient for keeping the index up to date: when the script is executed from cron, it keeps reading and storing the same transactions again and again.
While I was writing this script, I started using graylog2 to keep a short-term cache of my logs, both to speed up searches and to use its nifty GUI :) Conveniently, graylog2 stores its data in elasticsearch, which let me keep using my scripts. As you may have noticed, this opens up new possibilities, such as querying the graylog2 index through the REST API, which allows searching by keywords and time frames.
As a result, I updated the script to query the graylog2 index and extract all the postfix/sendmail logs within a given time frame (60 minutes by default). This change greatly reduces the load when the script is executed from cron and, at the same time, allows running it at shorter intervals (e.g., every 15 minutes).
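To give an idea of what such a time-framed query could look like, here is a minimal sketch that only builds the request body. It is not the actual script: the field names `message` and `created_at` (assumed to hold epoch seconds) and the helper `build_query` are assumptions for illustration, and the exact query syntax depends on the elasticsearch version in use.

```python
from datetime import datetime, timedelta, timezone


def build_query(now, minutes=60, programs=("postfix", "sendmail")):
    """Build a search body for mail logs seen in the last `minutes`."""
    start = now - timedelta(minutes=minutes)
    # Keyword part: match any of the mail programs in the (assumed)
    # "message" field.
    keywords = " OR ".join(programs)
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "query_string": {
                            "default_field": "message",
                            "query": keywords,
                        }
                    },
                    # Time frame part: "created_at" is an assumed field
                    # name holding the message time as epoch seconds.
                    {
                        "range": {
                            "created_at": {
                                "gte": start.timestamp(),
                                "lte": now.timestamp(),
                            }
                        }
                    },
                ]
            }
        }
    }
```

The resulting dictionary would be serialized to JSON and sent to the `_search` endpoint of the index. A cron job running every 15 minutes could call `build_query(datetime.now(timezone.utc), minutes=15)` so that each run covers only the window since the previous one.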
By using available tools, we can greatly improve our capacity to resolve incidents, while, at the same time, doing it on a budget. The alternative would be spending hours grepping log files and trying to understand what happened, which is not an easy task.
This post uses e-mail logs as an example, but the same approach can be applied to other raw data. One option would be to parse the logs of the firewalls, netflow, IDS, HIDS, etc. running in our infrastructure and store them in a structured way, so we can run searches while investigating an ongoing incident. Doing so would help us correlate the alerts and determine to what extent an attacker has penetrated our network.
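As a small illustration of what "storing them in a structured way" could mean for a firewall, the sketch below turns a key=value log line into a dictionary that could then be indexed. The sample line is made up in the style of the netfilter LOG target, and the helper name `parse_firewall_line` is hypothetical; other devices will need their own parsing.

```python
import re

# KEY=value pairs; bare flags without "=" (e.g. "SYN") are ignored.
PAIR = re.compile(r"\b([A-Z]+)=(\S*)")


def parse_firewall_line(line):
    """Turn a key=value firewall log line into a dict ready for indexing."""
    fields = {key.lower(): value for key, value in PAIR.findall(line)}
    # Store ports as integers so they can be searched by range.
    for port in ("spt", "dpt"):
        if fields.get(port, "").isdigit():
            fields[port] = int(fields[port])
    return fields


# Made-up sample line, for illustration only.
sample = (
    "IN=eth0 OUT= SRC=192.0.2.10 DST=198.51.100.5 "
    "PROTO=TCP SPT=51234 DPT=22 SYN"
)
event = parse_firewall_line(sample)
```

Once every source is reduced to fields such as `src`, `dst` and `dpt`, a single search for an IP address can return the matching firewall, IDS and mail events together.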
This concept is similar to the way Sguil handles incidents happening in our network. I like the idea of being able to jump between IDS alerts, netflow and Pcap files so easily, because it gives lots of visibility to the analyst and helps to make an informed decision.
Before I finish, I would like to talk a bit about the drawbacks. Parsing the data and automating part of the work is really useful and greatly helps to solve incidents faster (it is hard to find somebody who disagrees), but the picture you have of what is happening is only as good as your regular expressions. The more complex the data is, the more pain you will suffer, and you may miss some parts of the picture. You have to be aware of this, and you may still find yourself checking the raw logs from time to time :)
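One way to keep an eye on this blind spot is to record the lines that no regular expression matched, so you at least know how much of the picture is missing. The sketch below is an assumption-laden illustration, not the script from this series: the pattern only covers one kind of postfix delivery line with a short hexadecimal queue ID, and the sample lines are made up.

```python
import re

# Illustrative pattern for a postfix delivery line; real logs need more
# patterns than this one.
DELIVERY = re.compile(
    r"postfix/\w+\[\d+\]: (?P<queue_id>[0-9A-F]+): "
    r"to=<(?P<to>[^>]*)>.*\bstatus=(?P<status>\w+)"
)


def parse_lines(lines):
    """Return (parsed events, lines that the pattern did not match)."""
    parsed, unmatched = [], []
    for line in lines:
        match = DELIVERY.search(line)
        if match:
            parsed.append(match.groupdict())
        else:
            unmatched.append(line)
    return parsed, unmatched


# Made-up sample lines, for illustration only.
lines = [
    "mx1 postfix/smtp[2211]: 4F9D2A0B3C: to=<bob@example.com>, "
    "relay=mail.example.com[192.0.2.1]:25, delay=0.4, status=sent (250 OK)",
    "mx1 postfix/qmgr[1999]: 4F9D2A0B3C: removed",
]
parsed, unmatched = parse_lines(lines)
# Fraction of lines that ended up as structured events.
coverage = len(parsed) / len(lines)
```

Reviewing the `unmatched` list from time to time shows which log formats the expressions do not cover yet, which is a more targeted version of going back to the raw logs.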