<h1 id="infrastructure-monitoring-and-the-journey-to-the-cloud">Infrastructure monitoring and the journey to the cloud</h1>
<p>At work we are in the process of moving from a datacenter-centric infrastructure to AWS. This is a great journey with many interesting challenges, and today we will discuss one of them: <strong>the monitoring of our infrastructure</strong>.</p>
<p>Given how busy we are, dedicating resources and money to prototyping and later migrating to a new monitoring platform is not a high priority unless we detect important gaps. <strong>“If it works, don’t touch it”, they say.</strong></p>
<p>Well, at the moment we have a fairly stable monitoring stack that meets our requirements, but it clearly doesn’t adapt well to the cloud, which is very dynamic by nature: the current stack relies on manual changes to configuration files and commits to the software repo.</p>
<h1 id="current-setup">Current setup</h1>
<p>We are using Munin for graphing; it also forwards alerts to our Icinga satellites, which run in different locations. The setup works well right now, but we have identified the following gaps:</p>
<ul>
<li>Very static configuration. We could still use the AWS SDK to rebuild the configuration file automatically, but that could potentially cause other issues, because Munin is quite sensitive to syntax errors.</li>
<li>Munin doesn’t scale well, partly because it’s written in Perl. The fact that it needs a full run every 5 minutes is a good indicator.</li>
<li>It also relies on lots of dependencies and scripts on the client. For instance, it cannot be used to monitor Docker containers or cloud-managed services like RDS.</li>
<li>A graph resolution of 5 minutes is not a big drama, but it’s not usable if we want to do application instrumentation.</li>
</ul>
<h1 id="searching-for-the-right-solution">Searching for the right solution</h1>
<h2 id="going-fully-managed">Going fully managed</h2>
<p>A simple solution is moving to a fully managed monitoring platform, in our case CloudWatch. This way we could get rid of the burden of setup and management.</p>
<p>In this scenario, we thought of preparing some tooling to help developers set up their own alerts and dashboards, mainly with the help of <strong>Terraform</strong> and the <strong>Python AWS SDK</strong>. A great solution, one may think, but we found some issues, <strong>the big monthly cost</strong> being one of them.</p>
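<p>As a taste of what such tooling automates, here is a minimal sketch that creates a single CPU alarm through the AWS CLI (the alarm name, instance id and SNS topic are made up; the real tooling would wrap this kind of call in Terraform or the Python SDK):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># alert when the average CPU of one instance stays above 80% for 10 minutes
$ aws cloudwatch put-metric-alarm \
    --alarm-name "high-cpu-i-0123456789abcdef0" \
    --namespace "AWS/EC2" \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --statistic Average --period 300 --evaluation-periods 2 \
    --threshold 80 --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:eu-central-1:123456789012:ops-alerts
</code></pre></div></div>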
<p>Monitoring a fleet of 100 EC2 instances, for example, would cost around 30 Euros/month, alerts included. But the costs explode as soon as we want metrics that don’t exist off the shelf, that is: disk utilization, memory usage, number of processes, Nginx metrics, application instrumentation, etc.</p>
<p>AWS charges you 0.30 Euros per custom metric, multiplied by the number of resources (EC2 instances). With 5 custom metrics, this would be:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>100 Ec2 instances x 5 metrics x 0.30 Eur/month = 150 Eur/month
</code></pre></div></div>
<p>It’s clear that paying thousands of Euros every year for the monitoring alone cannot be justified. We have to search for alternatives.</p>
<h2 id="improving-our-current-stack">Improving our current stack</h2>
<p>Searching a bit on Google, we found that Icinga has a plugin to import resources from <a href="https://github.com/Icinga/icingaweb2-module-aws">AWS</a>, that is: EC2 instances, load balancers, RDS databases and auto scaling groups.</p>
<p>Keeping Icinga would be a good solution because we don’t need to learn and set up a new platform and we know it’s reliable. Alternatively, we could also use something like Grafana to generate dashboards with the metrics.</p>
<p>After playing around with a prototype, I found several issues:</p>
<ul>
<li>I couldn’t find an easy way to filter which resources I want to monitor, something as simple as filtering by tag to only monitor the production resources. The capability may exist, because the importer configuration has a field for a filter expression, but I found no documentation. As a result, the full inventory is imported.</li>
<li>The plugin seems to be in an early stage of development and it’s maintained by the community.</li>
<li>Again, we cannot monitor things like containers. This will probably be implemented over time.</li>
<li>We cannot pull standard metrics from CloudWatch. If we want to monitor an RDS database, we have to use a Nagios plugin.</li>
<li>The import tasks need to be run periodically to keep the inventory current.</li>
<li>The import works by autogenerating the same configuration files that we now maintain in our software repos and then reloading the Icinga configuration. Sometimes the imports fail because of syntax errors (self-inflicted pain), causing the synchronization to fail. This would leave us blind if it’s not spotted.</li>
</ul>
<p>In general, it feels like this setup is not appropriate for us. It would be a good solution for an organization that uses cloud services but keeps a fairly static fleet and doesn’t use hosted services at all.</p>
<h2 id="what-is-the-industry-using">What is the industry using?</h2>
<p>We need to:</p>
<ul>
<li>Monitor some static resources, like long-running EC2 instances.</li>
<li>Collect metrics from short-lived EC2 instances in autoscaling groups.</li>
<li>Cover batch jobs that run for some hours before their resources are destroyed.</li>
<li>Monitor hosted services like RDS, and probably Kubernetes in the future.</li>
</ul>
<p>What is the industry using to monitor such a diverse environment? Everything seems to point to <a href="https://prometheus.io">Prometheus</a>.</p>
<p>Prometheus, like Munin, uses a pull model, but it scales better because it’s written in Go, and it can auto-discover all the AWS services we are planning to use. Finally, for the graphing, Grafana supports it out of the box, and the alerting capabilities are also fine for us (email, SMS, webhooks, etc.).</p>
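<p>To give an idea of what that auto-discovery looks like, here is a minimal sketch of an EC2 scrape job in prometheus.yml (the region, port and tag name are assumptions; it presumes a node_exporter listening on port 9100 on each instance):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scrape_configs:
  - job_name: 'ec2-nodes'
    ec2_sd_configs:
      - region: eu-central-1
        port: 9100
    relabel_configs:
      # keep only the instances tagged Environment=production
      - source_labels: [__meta_ec2_tag_Environment]
        regex: production
        action: keep
</code></pre></div></div>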
<p>Regarding the costs, it also seems to be a good solution: it’s open source, it doesn’t have big hardware requirements, and we don’t have to set up big database clusters. It can also easily run on a small EC2 instance or in Docker containers.</p>
<p>The only downside at the moment is the steep learning curve, because the platform is composed of many microservices, but it shouldn’t be a big issue. We will try it in the lab and in some staging environments but, after all the alternatives, it seems to be the right move.</p>

<h1 id="dns-over-tls-forwarding-with-unbound-and-quad9">DNS over TLS forwarding with Unbound and Quad9</h1>
<p>In my previous post I explained <a href="/2016/08/building-dns-sinkhole-in-freebsd-with.html">how to build a DNS sinkhole with Unbound</a> by downloading block lists from different sources. I also tried to use dnscrypt in the setup, but I had to disable it because the service provided was unreliable.</p>
<p>Yesterday Cloudflare <a href="https://blog.cloudflare.com/announcing-1111/">announced</a> that they were providing a “privacy-first consumer DNS service”, whatever that means.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Announcing 1.1.1.1: the fastest, privacy-first consumer DNS service - <a href="https://t.co/xiM3yllWHj">https://t.co/xiM3yllWHj</a> <a href="https://t.co/5keff8uuD2">pic.twitter.com/5keff8uuD2</a></p>— Cloudflare (@Cloudflare) <a href="https://twitter.com/Cloudflare/status/980430875258212352?ref_src=twsrc%5Etfw">April 1, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Since it’s Easter and I have more free time than usual, I thought it would be cool to have a look and update my DNS sinkhole at home.</p>
<p>While I was searching for information related to <strong>DNS over TLS</strong>, which is one of the main features provided by Cloudflare, I came across <strong>Quad9</strong>, which offers the same service. They have been <a href="https://arstechnica.com/information-technology/2017/11/new-quad9-dns-service-blocks-malicious-domains-for-everyone/">in the news a lot</a>, but I didn’t pay attention because the media outlets only reported it as an alternative to Google DNS and back then I was too busy.</p>
<p>In a nutshell, Quad9 is a <a href="https://quad9.net/about/">sinkhole that blocks DNS requests to malicious domains</a>, which is pretty much what I am doing at home with Unbound and a shell script, but <strong>with more resources</strong>. My blackhole has more than <strong>30K blacklisted domains</strong>, which is not bad at all :)</p>
<p>In the end, I decided to use the DNS over TLS resolvers from Quad9, but you can find the Cloudflare resolvers commented out in the configuration file. I will keep my own list of blocked domains for the time being, but I may kill it in the future, because my configuration fails every now and then when the domain names contain non-ASCII characters.</p>
<p>The minimum <a href="https://www.unbound.net/documentation/unbound.conf.html">configuration options</a> are:</p>
<ul>
<li><strong>ssl-upstream</strong> tells Unbound to use TLS to communicate with the upstream server.</li>
<li><strong>ip_address@port</strong> to define the upstream server.</li>
</ul>
<p>Additionally, I am using some configuration parameters that come in handy:</p>
<ul>
<li>
<p>minimal-responses: yes</p>
<p>Reduces the size of the response when possible to improve the performance a bit.</p>
</li>
<li>
<p>prefetch: yes</p>
<p>Fetches cache elements that are about to expire.</p>
</li>
<li>
<p>qname-minimisation: yes</p>
<p>Makes a best effort to send the minimum amount of information to the upstream servers, though it’s not super helpful.</p>
</li>
</ul>
<p>Notice that Unbound is <strong>not running daemonized</strong> because it’s being monitored by the <a href="https://cr.yp.to/daemontools.html">Daemontools supervisor</a>. That is also why the configuration and control files are not placed in the usual locations.
<br />
<br /></p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>server:
interface: 10.10.10.10
access-control: 127.0.0.0/8 allow
access-control: 10.10.10.0/24 allow
do-daemonize: no
logfile: ""
username: unbound
directory: /usr/local/var/service/unbound
chroot: /usr/local/var/service/unbound
pidfile: /usr/local/var/service/unbound/unbound.pid
verbosity: 1
minimal-responses: yes
prefetch: yes
qname-minimisation: yes
# we are doing DNS over TLS
ssl-upstream: yes
root-hints: /usr/local/var/service/unbound/config/root.hints
# my DNS zone at home
include: /usr/local/var/service/unbound/config/local.zone
# autogenerated every night to block malicious domains
include: /usr/local/var/service/unbound/config/blackhole.zone
forward-zone:
name: "."
forward-addr: 9.9.9.9@853 # quad9.net primary
forward-addr: 149.112.112.112@853 # quad9.net secondary
#forward-addr: 1.1.1.1@853 # cloudflare primary
#forward-addr: 1.0.0.1@853 # cloudflare secondary
remote-control:
control-enable: yes
control-interface: /usr/local/var/service/unbound/control.clt
control-use-cert: no
</code></pre></div></div>
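<p>Once the service is up, a quick sanity check never hurts. A sketch, assuming the main configuration file lives under the service directory shown above (drill ships with ldns on FreeBSD):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># validate the configuration before (re)starting the service
$ unbound-checkconf /usr/local/var/service/unbound/config/unbound.conf

# resolve a name through the forwarder; the answer comes back from
# 10.10.10.10, which in turn asks the Quad9 upstreams over TLS
$ drill www.example.com @10.10.10.10 A
</code></pre></div></div>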
<h1 id="fixing-opensc-after-updating-to-macos-sierra">Fixing OpenSC after updating to MacOS Sierra</h1>
<p>Sierra introduced restrictions in the ssh-agent (a new version of OpenSSH) by limiting the PKCS#11 libraries that can be loaded to a list of whitelisted directories. By now this is common knowledge, because I must be one of the last persons updating from <strong>El Capitan</strong> to <strong>Sierra</strong>. Yes, I am not an early adopter!</p>
<p>Until now I had an alias in my .bashrc that loaded the SSH key stored in my Yubikey into the ssh-agent. The alias was just fine, but the library now lives outside the trusted path, which is “/usr/lib:/usr/local/lib”.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">alias </span><span class="nv">load_key</span><span class="o">=</span><span class="s2">"ssh-add -s /Library/OpenSC/lib/pkcs11/opensc-pkcs11.so"</span>
<span class="nb">alias </span><span class="nv">unload_key</span><span class="o">=</span><span class="s2">"ssh-add -e /Library/OpenSC/lib/pkcs11/opensc-pkcs11.so"</span>
</code></pre></div></div>
<p>As always, the error message is anything but useful:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh-add <span class="nt">-s</span> /Library/OpenSC/lib/pkcs11/opensc-pkcs11.so
Enter passphrase <span class="k">for </span>PKCS#11:
Could not add card <span class="s2">"/Library/OpenSC/lib/pkcs11/opensc-pkcs11.so"</span>: agent refused operation
</code></pre></div></div>
<p>I had to run the ssh-agent in debug mode to understand what was happening (Google is your friend), and the output said: <strong>provider not whitelisted</strong>.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh-agent <span class="nt">-d</span> <span class="nt">-a</span> /tmp/agent.socket
<span class="nv">SSH_AUTH_SOCK</span><span class="o">=</span>/tmp/agent.socket<span class="p">;</span> <span class="nb">export </span>SSH_AUTH_SOCK<span class="p">;</span>
<span class="nb">echo </span>Agent pid 2918<span class="p">;</span>
debug2: fd 3 setting O_NONBLOCK
debug3: fd 4 is O_NONBLOCK
debug1: <span class="nb">type </span>20
refusing PKCS#11 add of <span class="s2">"/Library/OpenSC/lib/opensc-pkcs11.so"</span>: provider not whitelisted
debug1: XXX shrink: 3 < 4
</code></pre></div></div>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ SSH_AUTH_SOCK</span><span class="o">=</span><span class="s2">"/tmp/agent.socket"</span> ssh-add <span class="nt">-s</span> /Library/OpenSC/lib/pkcs11/opensc-pkcs11.so
Enter passphrase <span class="k">for </span>PKCS#11:
Could not add card <span class="s2">"/Library/OpenSC/lib/pkcs11/opensc-pkcs11.so"</span>: agent refused operation
</code></pre></div></div>
<p>Googling again, the first hit is a bug report that was opened last year:</p>
<p><a href="https://github.com/OpenSC/OpenSC/issues/1008">MacOS: cannot use /usr/local/lib/opensc-pkcs11.so (provider not whitelisted)</a></p>
<p>In the end, I had to point my aliases to the library located in the trusted path.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">alias </span><span class="nv">load_key</span><span class="o">=</span><span class="s2">"ssh-add -s /usr/local/lib/opensc-pkcs11.so"</span>
<span class="nb">alias </span><span class="nv">unload_key</span><span class="o">=</span><span class="s2">"ssh-add -e /usr/local/lib/opensc-pkcs11.so"</span>
</code></pre></div></div>
<p>See also the original post where I set up <a href="/2016/10/ssh-public-key-authentication-with.html">SSH public key authentication with security tokens</a>.</p>

<h1 id="backing-up-my-github-repos">Backing up my GitHub repos</h1>
<p>I am using <a href="http://code.dogmap.org/runwhen/">runwhen</a> together with <a href="https://cr.yp.to/daemontools.html">daemontools</a> to launch and monitor the backup. The run script used by the svc service executes runwhen commands to sleep until the next run (every hour) and then launches the backup script. The service runs in a dedicated <a href="https://www.freebsd.org/doc/handbook/jails.html">jail</a>.</p>
<p>The run script listed below uses some runwhen commands (rw-add, rw-match and rw-sleep) to wake up every hour, and <a href="https://cr.yp.to/daemontools/setuidgid.html">setuidgid</a> to run the service as an unprivileged user.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">exec </span>2>&1
<span class="nb">exec </span>setuidgid gitbackup <span class="se">\</span>
rw-add n d1S now1s <span class="se">\</span>
rw-match <span class="se">\$</span>now1s ,M<span class="o">=</span>00 wake <span class="se">\</span>
rw-sleep <span class="se">\$</span>wake <span class="se">\</span>
/home/gitbackup/update.sh
</code></pre></div></div>
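<p>For completeness, the companion daemontools log service can be as small as the sketch below (an assumption based on the /var/service/backups/log/main path that shows up later; multilog’s <strong>t</strong> flag prepends the tai64n timestamps that tai64nlocal decodes):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
# log/run: timestamp (t) the service output and rotate it under ./main
exec setuidgid gitbackup multilog t ./main
</code></pre></div></div>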
<p>The actual backup script iterates over all the git repos and fetches the changes.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">exec </span>2>&1
<span class="nb">cd</span> /usr/home/gitbackup/backup
<span class="nb">echo</span> <span class="s2">"===="</span>
date
<span class="nb">echo</span> <span class="s2">"===="</span>
<span class="k">for </span>repo <span class="k">in</span> <span class="sb">`</span><span class="nb">ls</span> <span class="nt">-d1</span> <span class="k">*</span>.git<span class="sb">`</span><span class="p">;</span> <span class="k">do
</span><span class="nb">cd</span> <span class="nv">$repo</span> <span class="o">&&</span> /usr/local/bin/git fetch <span class="nt">--all</span>
<span class="nb">cd</span> -
<span class="k">done
</span><span class="nb">echo</span> <span class="s2">"===="</span>
</code></pre></div></div>
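<p>The loop above only picks up directories ending in .git, so each repo was presumably seeded once as a bare mirror, roughly like this (an assumption; the repo is one of those appearing in the log below):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd /usr/home/gitbackup/backup
$ git clone --mirror https://github.com/xgarcias/ansible-daemontools
</code></pre></div></div>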
<p>Checking the output log:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /var/service/backups/log/main/current | tai64nlocal
2018-02-05 18:00:00.098641500 <span class="o">====</span>
2018-02-05 18:00:00.150083500 Mon Feb 5 18:00:00 CET 2018
2018-02-05 18:00:00.180056500 <span class="o">====</span>
2018-02-05 18:00:00.211689500 Fetching origin
2018-02-05 18:00:01.073738500 From https://github.com/xgarcias/ansible-cmdb-freebsd-template
2018-02-05 18:00:01.073743500 <span class="k">*</span> branch HEAD -> FETCH_HEAD
2018-02-05 18:00:01.091577500 Fetching origin
2018-02-05 18:00:02.185366500 From https://github.com/xgarcias/ansible-daemontools
2018-02-05 18:00:02.185371500 <span class="k">*</span> branch HEAD -> FETCH_HEAD
2018-02-05 18:00:02.203049500 Fetching origin
2018-02-05 18:00:04.180310500 From https://github.com/xgarcias/ansible-macbook
2018-02-05 18:00:04.180315500 <span class="k">*</span> branch HEAD -> FETCH_HEAD
2018-02-05 18:00:04.198104500 Fetching origin
2018-02-05 18:00:06.448429500 From https://github.com/xgarcias/daemontools-dyndns
2018-02-05 18:00:06.448434500 <span class="k">*</span> branch HEAD -> FETCH_HEAD
2018-02-05 18:00:06.466266500 Fetching origin
2018-02-05 18:00:08.299785500 From https://github.com/xgarcias/daemontools-poudriere
2018-02-05 18:00:08.299790500 <span class="k">*</span> branch HEAD -> FETCH_HEAD
2018-02-05 18:00:08.321755500 Fetching origin
2018-02-05 18:00:09.749956500 From https://github.com/xgarcias/daemontools-unbound-sinkhole
2018-02-05 18:00:09.749961500 <span class="k">*</span> branch HEAD -> FETCH_HEAD
2018-02-05 18:00:09.771744500 Fetching origin
2018-02-05 18:00:11.113934500 From https://github.com/xgarcias/elasticsearch-plugin-readonlyrest
2018-02-05 18:00:11.113939500 <span class="k">*</span> branch HEAD -> FETCH_HEAD
2018-02-05 18:00:11.135774500 Fetching origin
2018-02-05 18:00:12.703191500 From https://github.com/xgarcias/freebsd_local_ports
2018-02-05 18:00:12.703197500 <span class="k">*</span> branch HEAD -> FETCH_HEAD
2018-02-05 18:00:12.724967500 Fetching origin
2018-02-05 18:00:13.583204500 From https://github.com/xgarcias/xgarcias.github.io
2018-02-05 18:00:13.583209500 <span class="k">*</span> branch HEAD -> FETCH_HEAD
2018-02-05 18:00:13.601461500 <span class="o">====</span>
</code></pre></div></div>

<h1 id="querying-asn-ip-records-via-non-rate-limited-rest-api">Querying ASN/IP records via a non-rate-limited REST API</h1>
<p>You can query ASN/IP records via ARIN’s non-rate-limited, unauthenticated REST API.</p>
<p>More <a href="https://www.arin.net/resources/whoisrws/whois_api.html">Info</a></p>
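<p>A couple of quick examples with curl (the endpoint paths follow ARIN’s Whois-RWS documentation linked above; the IP and ASN are just examples):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># look up the network that contains an IP address, as JSON
$ curl -s -H 'Accept: application/json' https://whois.arin.net/rest/ip/8.8.8.8

# look up an AS number
$ curl -s -H 'Accept: application/json' https://whois.arin.net/rest/asn/15169
</code></pre></div></div>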
<p>Also, you can use <a href="https://twitter.com/DuckDuckGo">@DuckDuckGo</a> to get the same results with the <strong>!Arin</strong> and <strong>!Ripe</strong> bang searches.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Periodic reminder of ARIN's public, non rate-limited, unauthenticated REST API for ASN/IP/Network lookups<br /><br />docs: <a href="https://t.co/iMWUOFZcgr">https://t.co/iMWUOFZcgr</a> <a href="https://t.co/Yc7HQ69xrI">pic.twitter.com/Yc7HQ69xrI</a></p>— Andrew Morris (@Andrew___Morris) <a href="https://twitter.com/Andrew___Morris/status/957347178216935424?ref_src=twsrc%5Etfw">January 27, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">You can also use !Arin and !Ripe bang searches on <a href="https://twitter.com/DuckDuckGo?ref_src=twsrc%5Etfw">@DuckDuckGo</a> to quickly lookup IP information</p>— Greg Bray (@GBrayUT) <a href="https://twitter.com/GBrayUT/status/957370934167453696?ref_src=twsrc%5Etfw">January 27, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<h1 id="blocklist-for-browser-based-cryptominers">Blocklist for browser based cryptominers</h1>
<p>A <a href="https://github.com/ZeroDot1/CoinBlockerLists/">list of DNS records and IP addresses</a> to prevent cryptomining in the browser or other applications.</p>
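<p>Feeding such a list into my Unbound sinkhole is nearly a one-liner, along the lines of the sketch below (assuming the list has been downloaded as a plain file with one domain per line; the zone file name matches the include from my Unbound configuration above):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># turn a plain list of domains into unbound local-zone entries
$ awk '{printf "local-zone: \"%s\" refuse\n", $1}' domains.txt > blackhole.zone
</code></pre></div></div>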
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Blocklist for browser based cryptominers. Time to add it to my DNS sinkhole ;)<a href="https://t.co/OxUHsiV1J4">https://t.co/OxUHsiV1J4</a></p>— Xavier Garcia (@shellguardians) <a href="https://twitter.com/shellguardians/status/959338286291570688?ref_src=twsrc%5Etfw">February 2, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<h1 id="centrally-managed-bhyve-infrastructure">Centrally managed Bhyve infrastructure with Ansible, libvirt and pkg-ssh</h1>
<p>At work we’ve been using <a href="https://wiki.freebsd.org/bhyve">Bhyve</a> for a while to run non-critical systems. It is a really nice and stable hypervisor, even though we are using the earlier version available on FreeBSD 10.3. This means we lack Windows and VNC support, among other things, but it is not a big deal.</p>
<p>After some iterations in our internal tools, we realised that the installation process was too slow and we always repeated the same steps. Of course, any good sysadmin will scream “<strong>AUTOMATION!”</strong> and so did we. Therefore, we started looking for different ways to improve our deployments.</p>
<p>We had a look at existing frameworks that manage Bhyve, but none of them had a feature that we find really important: having a centralized repository of VM images. For instance, <a href="https://www.joyent.com/smartos">SmartOS</a> applies this method successfully by having a backend server that stores a catalog of VMs and Zones, meaning that new instances can be deployed in a minute at most. This is a game changer if you are really busy in your day-to-day operations.</p>
<p>Since we are not great programmers, we decided to leverage existing tools to achieve the same result, that is, <strong>having a centralised repository of Bhyve images in our data centers.</strong> The following building blocks are used:</p>
<ul>
<li>The ZFS snapshot of an existing VM. This will be our VM template.</li>
<li>A modified version of <a href="https://github.com/danrue/oneoff-pkg-create/">oneoff-pkg-create</a> to package the ZFS snapshots.</li>
<li><a href="https://www.freebsd.org/cgi/man.cgi?query=pkg-ssh&apropos=0&sektion=8&manpath=FreeBSD+10.3-RELEASE+and+Ports&arch=default&format=html">pkg-ssh</a> and <a href="https://www.freebsd.org/cgi/man.cgi?query=pkg-repo&apropos=0&sektion=8&manpath=FreeBSD+10.3-RELEASE+and+Ports&arch=default&format=html">pkg-repo</a> to host a local FreeBSD repo in a FreeBSD jail.</li>
<li><a href="http://libvirt.org/drvbhyve.html">libvirt</a> to manage our Bhyve VMs.</li>
<li>The ansible modules <a href="http://docs.ansible.com/ansible/virt_module.html">virt</a>, <a href="http://docs.ansible.com/ansible/virt_net_module.html">virt_net</a> and <a href="http://docs.ansible.com/ansible/virt_pool_module.html">virt_pool</a>.</li>
</ul>
<p>Workflow:</p>
<ul>
<li>We write a yml dictionary to define the parameters needed to create a new VM:
<ul>
<li>VM template (name of the pkg that will be installed in /bhyve/images)</li>
<li>VM name, cpu, memory, domain template, serial console, etc.</li>
</ul>
</li>
<li>This dictionary will be kept in the corresponding host_vars definition that configures our Bhyve host server.</li>
<li>The Ansible playbook:
<ul>
<li>installs the package named after the VM template (ZFS snapshot), e.g. pkg install <strong>FreeBSD-10.3-RELEASE-ZFS-20G-20170515</strong>.</li>
<li>uses <strong>cat</strong> and <strong>zfs receive</strong> to load the ZFS snapshot into a new volume (see the sketch after this list).</li>
<li>calls the libvirt modules to automatically configure and boot the VM.</li>
</ul>
</li>
<li>The Sysadmin logs in to the new VM and adjusts the hostname and network settings.</li>
<li>A separate Ansible playbook is run to configure the new VM as usual.</li>
</ul>
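<p>The manual equivalent of the install-and-restore steps looks roughly like this (a sketch: the dataset name and the path of the snapshot stream inside the package are assumptions, only the template package name comes from the example above):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># install the VM template package from the local repo
$ pkg install FreeBSD-10.3-RELEASE-ZFS-20G-20170515

# load the snapshot stream into a fresh ZFS volume for the new guest
$ cat /bhyve/images/FreeBSD-10.3-RELEASE-ZFS-20G-20170515.zfs | zfs receive zroot/vms/newguest
</code></pre></div></div>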
<p>Once automated, the installation process needs 2 minutes at most, compared with the 30 minutes needed to manually install a VM, and it allows us to deploy many guests in parallel.</p>
<p>Resources:</p>
<ul>
<li>Sample config for FreeBSD <a href="https://people.freebsd.org/~rodrigc/libvirt-bhyve/libvirt-bhyve.html">https://people.freebsd.org/~rodrigc/libvirt-bhyve/libvirt-bhyve.html</a></li>
<li>bhyve driver for libvirt <a href="http://libvirt.org/drvbhyve.html">http://libvirt.org/drvbhyve.html</a></li>
<li>virsh examples <a href="https://wiki.libvirt.org/page/VM_lifecycle#Creating_a_domain">https://wiki.libvirt.org/page/VM_lifecycle#Creating_a_domain</a></li>
<li>migrating VMs w/o shared storage <a href="https://hgj.hu/live-migrating-a-virtual-machine-with-libvirt-without-a-shared-storage/">https://hgj.hu/live-migrating-a-virtual-machine-with-libvirt-without-a-shared-storage/</a></li>
<li>xml reference <a href="http://libvirt.org/formatdomain.html">http://libvirt.org/formatdomain.html</a></li>
<li>Virtual networking <a href="https://wiki.libvirt.org/page/VirtualNetworking">https://wiki.libvirt.org/page/VirtualNetworking</a></li>
</ul>

<h1 id="openntpd-leap-seconds-and-other-horror-stories">OpenNTPD, leap seconds and other horror stories</h1>
<p>In case you missed it, there was a <a href="https://en.wikipedia.org/wiki/Leap_second">leap second</a> on December 31, 2016. I don’t know about you, but I’ve read many horror stories about things going terribly wrong after leap seconds and sysadmins in despair being paged at night. Well, today I am going to share one of those stories with you, and I hope it will be terrifying.</p>
<h1 id="horror-story">Horror story</h1>
<p>Like diligent sysadmins, we monitor the ntpd services on our servers (<a href="http://www.openntpd.org/">OpenNTPD</a> in our case) and we will be alerted if a noticeable clock offset happens. Of course, in the event of a leap second, all the servers should trigger an alert and the corresponding recovery. The leap second was inserted as 23:59:60 on December 31 and the servers slowly <strong>chewed</strong> the difference in around 3 hours.</p>
<p>But… Here comes the horror story. Some of the servers didn’t recover at all. The graphs showed that the offset was still around -900 ms (an extra second was introduced, therefore we were one second behind). In the end we had to restart OpenNTPD as a quick remediation.</p>
<p>Below you can find the status of one of the servers, for reference.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ntpctl -s all
4/4 peers valid, clock synced, stratum 3
peer
wt tl st next poll offset delay jitter
176.9.31.215 from pool de.pool.ntp.org
1 10 2 1474s 1502s -984.909ms 6.266ms 0.175ms
62.116.162.126 from pool de.pool.ntp.org
* 1 10 2 733s 1640s -984.824ms 1.105ms 0.126ms
78.46.79.68 from pool de.pool.ntp.org
1 10 3 888s 1509s -984.824ms 6.380ms 0.138ms
46.4.54.78 from pool de.pool.ntp.org
1 10 2 3087s 3098s 105.306ms 6.295ms 0.130ms
</code></pre></div></div>
<p>You may notice that one of the peers has a positive offset, and that doesn’t make any sense, because an extra second was introduced, as explained above. I hope you can smell the stink at this moment, because it is quite strong.</p>
<p>Well, digging in the logs I also found the following line:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ntpd[1438]: reply from 46.4.54.78: not synced (alarm), next query 3228s
</code></pre></div></div>
<p>Yes, OpenNTPD was unhappy with that peer and decided to stop the time synchronisation until the issue was solved. <strong>Notice that this is a really bad situation because we don’t control that peer at all</strong>. The only option was to restart OpenNTPD so it would pick new peers from the round-robin DNS record we had configured.</p>
<p>I decided to do a bit of research and went to OpenNTPD’s GitHub repo to read the source code, particularly <strong>src/usr.sbin/ntpd/client.c</strong>. Here, the NTP packet’s status is evaluated against a bit mask to analyse the LI bits (Leap Indicator):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if ((msg.status & LI_ALARM) == LI_ALARM || msg.stratum == 0 || msg.stratum > NTP_MAXSTRATUM)
</code></pre></div></div>
<p>The name LI_ALARM is self-explanatory. The bitmask comparison evaluates to true when both bits of the Leap Indicator are set to 1. From the <a href="https://tools.ietf.org/html/rfc5905">RFC</a>:</p>
<blockquote>
<p>LI Leap Indicator (leap): 2-bit integer warning of an impending leap second to be inserted or deleted in the last minute of the current month with values defined in Figure 9.
0 no warning
1 last minute of the day has 61 seconds
2 last minute of the day has 59 seconds
3 unknown (clock unsynchronized)</p>
</blockquote>
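<p>To make the bit layout concrete: the LI field occupies the top two bits of the first byte of an NTP packet, so a value of 3 means both bits are set, i.e. clock unsynchronized. A quick sanity check in shell (the byte value here is illustrative):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># example first byte: LI=3, VN=4, Mode=4 -> binary 11 100 100 -> 0xe4
$ status=0xe4
$ echo $(( (status >> 6) & 0x3 ))
3
</code></pre></div></div>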
<p>At this point, I can claim that the peer was totally broken, because it ran <strong>for hours</strong> (and it may still be broken at this point) <strong>with the clock unsynchronized</strong>, and it hit us in a chain reaction. Well, one may expect a minimum of quality, but these are the risks that you must accept if you use services run by others (unless you sign a Service Level Agreement on paper).</p>
<p>To understand how risky it can be, we can look at the page that describes how to <a href="http://www.pool.ntp.org/en/join.html">join an ntp pool</a>. Only a static IP address and a minimum amount of bandwidth are required, plus a couple of recommendations. Hence the many hobbyists running their own time servers.</p>
<p>ntp.org runs a monitoring system that can be queried <a href="http://www.pool.ntp.org/scores">online</a>. Servers with a score lower than 10 are automatically removed from the pool (mine had -100), and this is a good measure, but good luck if they are already active in your ntpd service. They will cause you trouble until you manually restart the service.</p>
<h1 id="lessons-learned">Lessons learned</h1>
<ul>
<li>Actively monitor the ntp service.</li>
<li>Monitor the general status: un/synchronized, stratum, number of valid peers, etc.</li>
<li>Monitor the offset. I do an average of all peers and then apply abs().</li>
<li>Plan carefully and search for a reliable ntp source.</li>
<li>Does your datacenter offer this service? Can you have an SLA?</li>
<li>Avoid the country/region pools at pool.ntp.org because they may be run by hobbyists and will cost you pain, even if ntp.org recommends you do so. Perhaps running the ntp servers provided by your OS vendor is safer.</li>
<li>Perhaps buy a <a href="https://en.wikipedia.org/wiki/DCF77">DCF77 receiver</a> to build your own Stratum 1 server, but you may need an external antenna if the datacenter walls are too thick.</li>
</ul>

<h1 id="ssh-public-key-authentication-with-security-tokens">SSH public key authentication with security tokens</h1>
<p>I’ve been using a Yubikey for two-factor authentication with HOTP for a long time, but this crypto hardware has many more functionalities, like storing certificates (RSA and ECC keys).</p>
<p>The use I will describe below allows us to do SSH public key authentication while keeping the private key stored in the device at all times. This gives an extra layer of security, because the key cannot be extracted and the device will be locked if the PIN is bruteforced.</p>
<p>Formally speaking, many of these crypto keys (commonly in the form of a USB device emulating a card reader) support the Personal Identity Verification (PIV) card interface, which allows ECC/RSA sign/decryption operations with the private key stored in the device (read the <a href="http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-78-4.pdf">NIST SP 800-78</a> document for more information). This hardware interface, together with the <a href="https://en.wikipedia.org/wiki/PKCS_11">PKCS#11</a> API, allows programs like ssh to perform cryptographic operations with the certificates stored in the device.</p>
<p>One weak point in this scenario is vendor trust, particularly when it comes to the random number generator implemented in the hardware, which can potentially create weak, easy-to-bruteforce keys. This can be minimized if we use the normal ssh tools to generate the ssh keys and then import them into the device. In my case, I followed this path.</p>
<p>Another downside is that NIST SP 800-78 only defines RSA keys up to 2048 bits. You must take this into consideration because the chip may support bigger keys (e.g. for OpenPGP cards), but the PIV interface is limited to 2048 bits unless NIST updates the standard.</p>
<p>Finally, either PKCS11 or <a href="https://github.com/OpenSC/OpenSC/wiki">OpenSC</a> (I don’t quite remember which) does not support ECC keys, so you are out of luck in that case.</p>
<h1 id="preparation">Preparation</h1>
<ul>
<li>A crypto device that is NIST SP 800-78 compliant, a Yubikey 4 in my case.</li>
<li>An RSA key pair created with ssh-keygen(1).</li>
<li>Install <a href="https://github.com/OpenSC/OpenSC/wiki">OpenSC</a> on your computer to get the PKCS11 library and management tools (see the quick check after this list). There are installers available for almost any platform: Windows, OSX, Linux, BSD, etc.</li>
</ul>
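<p>Before going further, it’s worth verifying that OpenSC can actually see the token. A sketch using the same module path that shows up later in this post (pkcs11-tool ships with OpenSC):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pkcs11-tool --module /Library/OpenSC/lib/pkcs11/opensc-pkcs11.so --list-slots
</code></pre></div></div>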
<h1 id="steps">Steps</h1>
<ul>
<li>
<p>Convert the RSA private key into pem format.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>openssl rsa <span class="nt">-in</span> ./id_rsa <span class="nt">-out</span> id_rsa.pem
</code></pre></div> </div>
</li>
<li>
<p>Load the private key into <a href="https://developers.yubico.com/PIV/Introduction/Certificate_slots.html">slot 9a</a> in the device. It will ask for the PIN, which you may have changed (look for ‘change-pin’ and ‘change-puk’ in <a href="https://www.yubico.com/wp-content/uploads/2016/05/Yubico_PIV_Tool_Command_Line_Guide_en.pdf">this document</a>). Notice that I’ve set the ‘pin-policy’ to once and the ‘touch-policy’ to never, effectively asking for the PIN only once, when I load the key into the ssh-agent, but you can change the behaviour to whatever fits you best (e.g. force a touch every time you want to log in via ssh).</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>yubico-piv-tool <span class="nt">-a</span> import-key <span class="nt">-s</span> 9a <span class="nt">--pin-policy</span><span class="o">=</span>once <span class="nt">--touch-policy</span><span class="o">=</span>never -i id_rsa.pem
</code></pre></div> </div>
</li>
<li>
<p>Transform the public key into a format that is understood by the device.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh-keygen <span class="nt">-e</span> <span class="nt">-f</span> ./id_rsa.pub <span class="nt">-m</span> PKCS8 <span class="o">></span> id_rsa.pub.pkcs8
</code></pre></div> </div>
</li>
<li>
<p>Use the public and private keys (the latter in the device) to generate a self-signed certificate, to be imported into the device later, with a 10-year expiration date (just in case). It will ask for your PIN again.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>yubico-piv-tool <span class="nt">-a</span> verify <span class="nt">-a</span> selfsign-certificate <span class="nt">--valid-days</span> 3650 -s 9a <span class="nt">-S</span> <span class="s2">"/CN=myname/O=ssh/"</span> <span class="nt">-i</span> id_rsa.pub.pkcs8 <span class="nt">-o</span> 9a-cert.pem
</code></pre></div> </div>
</li>
<li>
<p>Import the generated certificate.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>yubico-piv-tool <span class="nt">-a</span> verify <span class="nt">-a</span> import-certificate <span class="nt">-s</span> 9a <span class="nt">-i</span> 9a-cert.pem
</code></pre></div> </div>
</li>
</ul>
<h1 id="using-the-device-together-with-openssh">Using the device together with OpenSSH</h1>
<p>In case you don’t have the public key (a step I didn’t need because I generated the key on my PC), you can extract it with ssh-keygen. You have to point it to the PKCS#11 shared library, which is /Library/OpenSC/lib/pkcs11/opensc-pkcs11.so in the case of OSX.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh-keygen <span class="nt">-D</span> /Library/OpenSC/lib/pkcs11/opensc-pkcs11.so
ssh-rsa AAAAB....e1
</code></pre></div></div>
<p>Then you can tell ssh to interact with the device by pointing to this library instead of using a private key stored on your disk, but it is not very convenient because it will always ask for your PIN.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh <span class="nt">-I</span> /Library/OpenSC/lib/pkcs11/opensc-pkcs11.so myserver
Enter PIN <span class="k">for</span> <span class="s1">'PIV_II (PIV Card Holder pin)'</span>:
</code></pre></div></div>
<p>Loading the key into your ssh-agent is more convenient because it will only ask for the PIN once (following the pin-policy=once), and you can be sure nobody can abuse it because the device must be present at all times. <strong><em>Remember that the private key never leaves the device</em></strong>.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh-add <span class="nt">-s</span> /Library/OpenSC/lib/pkcs11/opensc-pkcs11.so
Enter passphrase <span class="k">for </span>PKCS#11:
Card added: /Library/OpenSC/lib/pkcs11/opensc-pkcs11.so
</code></pre></div></div>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bash-3.2<span class="nv">$ </span>ssh-add <span class="nt">-l</span>
2048 SHA256:random_hash_value /Library/OpenSC/lib/pkcs11/opensc-pkcs11.so <span class="o">(</span>RSA<span class="o">)</span>
</code></pre></div></div>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh-add <span class="nt">-e</span> /Library/OpenSC/lib/pkcs11/opensc-pkcs11.so
Card removed: /Library/OpenSC/lib/pkcs11/opensc-pkcs11.so
</code></pre></div></div>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh-add <span class="nt">-l</span>
The agent has no identities.
</code></pre></div></div>Xavier Garcia
Building a DNS sinkhole in FreeBSD with Unbound and Dnscrypt2016-08-25T18:30:00+00:002016-08-25T18:30:00+00:00/2016/08/building-dns-sinkhole-in-freebsd-with<p>There is already lots of literature regarding <a href="https://www.sans.org/reading-room/whitepapers/dns/dns-sinkhole-33523">DNS sinkholes</a> and it is a <a href="https://en.wikipedia.org/wiki/DNS_sinkhole">common term</a> in
Information Security. In my case, I wanted to give it a try on FreeBSD 10 but I didn’t want to make use of <a href="https://www.isc.org/">Bind</a> since it was removed from the base distribution in favor of <a href="https://www.unbound.net/">Unbound</a>.</p>
<p>The setup will have the following steps:</p>
<ul>
<li>Create a jail where the service will be configured (not covered
here because there are plenty of examples on the Internet)</li>
<li>Install Unbound</li>
<li>Basic Unbound configuration</li>
<li>Configure Unbound to block DNS queries</li>
<li>Choosing block lists available on the Internet</li>
<li>Updating the block lists</li>
<li>Bonus: use dnscrypt to avoid DNS spoofing</li>
<li>Final Unbound configuration file</li>
</ul>
<h2 id="configuring-our-dns-sinkhole">Configuring our DNS sinkhole</h2>
<h3 id="installing-unbound">Installing Unbound</h3>
<p>I ran my test on FreeBSD 10.1. Sadly, it ships Unbound 1.4.x, which
is quite old and lacks some nice features. In the end, I had to install
dns/unbound from the ports tree, which currently installs 1.5.9.</p>
<p>If you are using a more recent FreeBSD release (e.g. FreeBSD 10.3),
you will not need to install the port.</p>
<p>The only difference is that you will need to use <strong>local_unbound_enable="YES"</strong>
in /etc/rc.conf instead of <strong>unbound_enable="YES"</strong>, and the configuration file
will be located in <strong>/etc/unbound/unbound.conf</strong> instead of
<strong>/usr/local/etc/unbound/unbound.conf</strong>.</p>
<h3 id="basic-unbound-configuration">Basic Unbound configuration</h3>
<p>First, we have to download the root hints, which allow our DNS cache to
find the root name servers.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># fetch ftp://ftp.internic.net/domain/named.cache -o /usr/local/etc/unbound/root.hints
</code></pre></div></div>
<p>Then, we edit the unbound.conf.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>server:
interface: 10.10.10.10
#who can use our DNS cache
access-control: 10.10.10.0/24 allow
logfile: "/usr/local/etc/unbound/logs/unbound.log"
username: unbound
directory: /usr/local/etc/unbound
chroot: /usr/local/etc/unbound
pidfile: /usr/local/etc/unbound/unbound.pid
verbosity: 1
root-hints: /usr/local/etc/unbound/root.hints
#remote-control allows us to use the unbound-control
#utility to manage the service from the command line
remote-control:
control-enable: yes
control-interface: /usr/local/etc/unbound/local_unbound.ctl
control-use-cert: no
</code></pre></div></div>
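<p>Before going any further, it is a good habit to validate the file; <strong>unbound-checkconf</strong> will catch syntax errors before they bite you at service start:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># unbound-checkconf /usr/local/etc/unbound/unbound.conf
</code></pre></div></div>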
<p>Please notice that all files are located in <strong>/usr/local/etc/unbound/</strong>.
If you are not using the version provided by the ports tree, the base directory will be <strong>/var/unbound/</strong> instead.</p>
<p>The last step is to enable and start the service:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># sysrc unbound_enable="YES"
# service unbound start
</code></pre></div></div>
<p>With this setup, we have a basic DNS cache configured in our network.
Now you should be able to query the DNS server listening on 10.10.10.10:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># host www.google.com 10.10.10.10
Using domain server:
Name: 10.10.10.10
Address: 10.10.10.10#53
Aliases:
www.google.com has address 74.125.68.147
www.google.com has address 74.125.68.105
www.google.com has address 74.125.68.103
www.google.com has address 74.125.68.99
www.google.com has address 74.125.68.104
www.google.com has address 74.125.68.106
www.google.com has IPv6 address 2404:6800:4003:c02::6
</code></pre></div></div>
<h3 id="configure-unbound-to-block-dns-queries">Configure Unbound to block DNS queries</h3>
<p>The classic trick in DNS sinkholes is to define authoritative zones in
the DNS cache that return a fixed static IP address
(e.g. 127.0.0.2), so you can identify in the logs (or in network devices) when
somebody is trying to connect to a blocked domain.</p>
<p>In Unbound it is a bit more difficult, because it is only a basic DNS
cache service and lacks some features, but there are ways around it.</p>
<p><strong>unbound.conf(5)</strong> has the local-zone directive, which is used to
define local DNS zones, but we will “abuse” it by dropping all the
queries for these domains. For instance, if we want to drop all the DNS
queries asking for google.com (and its subdomains), we need to add the
directive:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>local-zone: "google.com" inform_deny.
</code></pre></div></div>
<p>This will silently drop the DNS query and write an entry in the
log file (<strong>/usr/local/etc/unbound/logs/unbound.log</strong> in our case). The
client will just see the query time out.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1472139065] unbound[28162:0] info: google.com. inform 10.10.10.3@31679 google.com. A IN
[1472139071] unbound[28162:0] info: google.com. inform 10.10.10.3@56551 google.com. A IN
</code></pre></div></div>
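<p>You can verify the behaviour from a client machine; the query will simply hang until it times out:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># host google.com 10.10.10.10
</code></pre></div></div>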
<p>To keep a tidy configuration, we will not add this big list of
<strong>local-zone</strong> directives to the main configuration file; instead, we will
include a separate file thanks to the <strong>include</strong> directive, which goes
in the <strong>server</strong> section.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>server:
....
include: /usr/local/etc/unbound/blackhole.zone
....
</code></pre></div></div>
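<p>The included file is then just a long list of <strong>local-zone</strong> directives, one per blocked domain. For example (the domains here are purely illustrative):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>local-zone: "malware-example1.com" inform_deny
local-zone: "malware-example2.net" inform_deny
</code></pre></div></div>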
<h3 id="choosing-block-lists-available-in-internet">Choosing block lists available in Internet</h3>
<p>I am using the lists at the following URLs, which should be considered safe,
with around 23 thousand domains listed in total.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://mirror1.malwaredomains.com/files/justdomains
https://zeustracker.abuse.ch/blocklist.php?download=domainblocklist
https://ransomwaretracker.abuse.ch/downloads/RW_DOMBL.txt
http://isc.sans.edu/feeds/suspiciousdomains_Low.txt
http://isc.sans.edu/feeds/suspiciousdomains_Medium.txt
http://isc.sans.edu/feeds/suspiciousdomains_High.txt
</code></pre></div></div>
<h3 id="updating-the-block-lists">Updating the block lists</h3>
<p>I’ve written a small shell script that downloads all the lists every
night and reloads the Unbound configuration (a sketch follows at the end of this section).</p>
<p>Please notice that reloading Unbound will also flush the DNS cache. A
good way to preserve it is:</p>
<p><strong>Dump the cache</strong></p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># unbound-control dump_cache > $cache_file
</code></pre></div></div>
<p>Then, download the lists with fetch(1) and regenerate
/usr/local/etc/unbound/blackhole.zone.</p>
<p><strong>Reload the configuration</strong></p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># unbound-control reload
</code></pre></div></div>
<p><strong>Load the cache dump back in Unbound</strong></p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># unbound-control load_cache < $cache_file
</code></pre></div></div>
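<p>Putting it all together, a minimal sketch of such an update script could look like this. The paths follow the setup above, the URL list is abbreviated, and the grep/awk transformation into <strong>local-zone</strong> lines is my own choice:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
# Rebuild the Unbound block list and reload the service
# without losing the DNS cache.

zone_file=/usr/local/etc/unbound/blackhole.zone
cache_file=/tmp/unbound_cache.$$
tmp_file=$(mktemp)

# Download every list (abbreviated here) into one temporary file
for url in \
    http://mirror1.malwaredomains.com/files/justdomains \
    http://isc.sans.edu/feeds/suspiciousdomains_High.txt
do
    fetch -q -o - "$url" >> "$tmp_file"
done

# One local-zone directive per non-comment, non-empty line
grep -v '^#' "$tmp_file" | awk 'NF { printf "local-zone: \"%s\" inform_deny\n", $1 }' | sort -u > "$zone_file"

# Dump the cache, reload the new configuration, restore the cache
unbound-control dump_cache > "$cache_file"
unbound-control reload
unbound-control load_cache < "$cache_file"

rm -f "$tmp_file" "$cache_file"
</code></pre></div></div>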
<h3 id="bonus-use-dnscrypt-to-avoid-dns-spoofing">Bonus: use dnscrypt to avoid DNS spoofing</h3>
<p><a href="https://dnscrypt.org/">Dnscrypt</a> can be used to avoid some common DNS
attacks by encrypting and signing the DNS queries. All traffic will go
encrypted using the port 443, both TCP and UDP.</p>
<p>Of course, other issues remain, like DNS spoofing at the server end and
possible logging by the resolver operator.</p>
<p>The client is available in the ports tree under dns/dnscrypt-proxy and it
is really easy to configure. We only need two parameters: the IP and port
where we want to listen and the server we want to connect to (aka the
resolver):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># sysrc dnscrypt_proxy_enable="YES"
# sysrc dnscrypt_proxy_flags="-a 10.10.10.10:5353"
# sysrc dnscrypt_proxy_resolver="dnscrypt.eu-nl"
# service dnscrypt_proxy start
</code></pre></div></div>
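<p>At this point you can check that dnscrypt-proxy resolves correctly by querying port 5353 directly; drill(1), which ships in the FreeBSD base system, works well for this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># drill -p 5353 www.google.com @10.10.10.10
</code></pre></div></div>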
<p>The final step is configuring Unbound to forward all DNS
queries to dnscrypt-proxy. This is done in the <strong>forward-zone</strong> section.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>forward-zone:
name: "."
forward-addr: 10.10.10.10@5353
</code></pre></div></div>
<h3 id="final-unbound-configuration-file">Final Unbound configuration file</h3>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>server:
interface: 10.10.10.10
access-control: 10.10.10.0/24 allow
logfile: "/usr/local/etc/unbound/logs/unbound.log"
username: unbound
directory: /usr/local/etc/unbound
chroot: /usr/local/etc/unbound
pidfile: /usr/local/etc/unbound/unbound.pid
verbosity: 1
root-hints: /usr/local/etc/unbound/root.hints
include: /usr/local/etc/unbound/blackhole.zone
remote-control:
control-enable: yes
control-interface: /usr/local/etc/unbound/local_unbound.ctl
control-use-cert: no
forward-zone:
name: "."
forward-addr: 10.10.10.10@5353
</code></pre></div></div>Xavier Garcia