Monday, May 15, 2017

Centrally managed Bhyve infrastructure with Ansible, libvirt and pkg-ssh

At work we've been using Bhyve for a while to run non-critical systems. It is a really nice and stable hypervisor, even though we are running the older version that ships with FreeBSD 10.3. This means we lack Windows and VNC support, among other things, but that is not a big deal.

After some iterations of our internal tools, we realised that the installation process was too slow and that we always repeated the same steps. Of course, any good sysadmin will scream "AUTOMATION!", and so did we. Therefore, we started looking at different ways to improve our deployments.

We had a look at existing frameworks that manage Bhyve, but none of them had a feature that we find really important: a centralised repository of VM images. For instance, SmartOS applies this method successfully by having a backend server that stores a catalog of VMs and Zones, meaning that new instances can be deployed in a minute at most. This is a game changer if you are really busy in your day-to-day operations.

Since we are not great programmers, we decided to leverage existing tools to achieve the same result: a centralised repository of Bhyve images in our data centers. The following building blocks are used:

  • We write a yml dictionary to define the parameters needed to create a new VM:
    • VM template (name of the pkg that will be installed in /bhyve/images)
    • VM name, cpu, memory, domain template, serial console, etc.
  • This dictionary will be kept in the corresponding host_vars definition that configures our Bhyve host server.
  • The Ansible playbook:
    • installs the package named after the VM template (a ZFS snapshot), e.g. pkg install FreeBSD-10.3-RELEASE-ZFS-20G-20170515.
    • uses cat and zfs receive to load the ZFS snapshot in a new volume.
    • calls the libvirt modules to automatically configure and boot the VM.
  • The sysadmin logs in to the new VM and adjusts the hostname and network settings.
  • A separate Ansible playbook is run to configure the new VM as usual.
Once automated, the installation process takes 2 minutes at most, compared with the 30 minutes needed to install a VM manually, and it also allows us to deploy many guests in parallel.
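For reference, the playbook's core steps roughly correspond to these manual commands. This is a hypothetical sketch: the dataset and domain paths below are made-up examples, not our actual layout.

```shell
# Hypothetical sketch of what the playbook automates; the ZFS dataset
# and libvirt domain names are made up for illustration.
TEMPLATE=FreeBSD-10.3-RELEASE-ZFS-20G-20170515
VM=vm01

# 1. Install the package that ships the VM template as a ZFS snapshot stream.
pkg install -y "$TEMPLATE"

# 2. Load the snapshot stream into a new volume for the guest.
cat "/bhyve/images/$TEMPLATE.zfs" | zfs receive "zroot/bhyve/$VM"

# 3. Define and boot the guest through libvirt.
virsh define "/bhyve/domains/$VM.xml"
virsh start "$VM"
```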


Tuesday, January 3, 2017

OpenNTPD, leap seconds and other horror stories

In case you missed it, there was a leap second on December 31, 2016. I don't know about you, but I've read many horror stories about things going terribly wrong after leap seconds and sysadmins in despair being paged at night. Well, today I am going to share one of those stories with you, and I hope it will be terrifying.

Horror story

Like diligent sysadmins, we monitor the ntpd services on our servers (OpenNTPD in our case) and we are alerted if a noticeable clock offset occurs. Of course, in the event of a leap second, all the servers should trigger an alert and the corresponding recovery. The leap second was inserted as 23:59:60 on December 31 and the servers slowly chewed through the difference in around 3 hours.

But... here comes the horror story. Some of the servers didn't recover at all. The graphs showed that the offset was still around -900 ms (an extra second was introduced, therefore we were one second behind). In the end we had to restart openntpd as a quick remediation.

Below you can find the status of one of the servers, for reference.

# ntpctl -s all

4/4 peers valid, clock synced, stratum 3

   wt tl st  next  poll          offset       delay      jitter from pool
    1 10  2 1474s 1502s      -984.909ms     6.266ms     0.175ms from pool
 *  1 10  2  733s 1640s      -984.824ms     1.105ms     0.126ms from pool
    1 10  3  888s 1509s      -984.824ms     6.380ms     0.138ms from pool
    1 10  2 3087s 3098s       105.306ms     6.295ms     0.130ms

You may notice that one of the peers has a positive offset, which doesn't make any sense because an extra second was introduced, as explained above. I hope you can smell the stink at this point, because it is quite strong.

Well, digging in the logs I also found the following line:

ntpd[1438]: reply from not synced (alarm), next query 3228s

Yes, openntpd was unhappy with that peer and decided to stop the time synchronisation until the issue was solved. Notice that this is a really bad situation because we don't control that peer at all. The only option was to restart openntpd, so that the Round Robin DNS record we had configured would resolve to different servers.
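A crude way to detect this situation automatically is to watch the logs for that message. A minimal sketch follows; the log path and the exact match string are assumptions based on the line shown above.

```shell
# Count openntpd peers reported as unsynchronized in the log file.
# The pattern matches the log line shown above; adjust LOGFILE to
# your syslog configuration.
LOGFILE=${LOGFILE:-/var/log/messages}

count_ntp_alarms() {
  grep -c 'not synced (alarm)' "$LOGFILE" 2>/dev/null
}

# Usage: page someone (or restart ntpd) when the counter is non-zero, e.g.
#   [ "$(count_ntp_alarms)" -gt 0 ] && service openntpd restart
```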

I decided to do a bit of research and went to openntpd's GitHub repo to read the source code, particularly src/usr.sbin/ntpd/client.c. Here, the NTP packet's status is evaluated against a bit mask to analyse the LI bits (Leap Indicator):

if ((msg.status & LI_ALARM) == LI_ALARM || msg.stratum == 0 ||
    msg.stratum > NTP_MAXSTRATUM) {

The name LI_ALARM is self-explanatory. The comparison evaluates to true when both bits of the Leap Indicator are set to 1. From the RFC:

LI Leap Indicator (leap): 2-bit integer warning of an impending leap second to be inserted or deleted in the last minute of the current month with values defined in Figure 9.

0     no warning
1     last minute of the day has 61 seconds
2     last minute of the day has 59 seconds
3     unknown (clock unsynchronized)
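In the packet, LI occupies the two most significant bits of the first byte, so the LI_ALARM check boils down to plain bit arithmetic. A quick illustration (the sample byte value is made up):

```shell
# Extract the 2-bit Leap Indicator from the first byte of an NTP packet.
# LI_ALARM corresponds to both bits set, i.e. a mask of 0xc0.
leap_indicator() {
  printf '%d\n' $(( ($1 & 0xc0) >> 6 ))
}

# 0xe3 encodes LI=3 (alarm), VN=4, mode=3: an unsynchronized sender.
leap_indicator 0xe3   # prints 3
```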

At this point, I can claim that the peer was totally broken, because it ran for hours (and it may still be broken now) with the clock unsynchronized, and it hit us in a chain reaction. Well, one may expect a minimum level of quality, but these are the risks you must accept if you use services run by others (unless you sign a Service Level Agreement on paper).

To understand how risky it can be, we can look at the page that describes how to join an ntp pool. Only a static IP address and a minimum bandwidth are required, plus a couple of recommendations. Hence the many hobbyists running their own time servers. The pool project runs a monitoring system that can be queried online. Servers with a score lower than 10 are automatically removed from the pool (mine had -100), and this is a good measure, but good luck if they are already active in your ntpd service: they will cause you trouble until you manually restart the service.

Lessons learned

  • Actively monitor the ntp service.
    • Monitor the general status: un/synchronized, stratum, number of valid peers, etc.
    • Monitor the offset. I average the offsets of all peers and then apply abs().
  • Plan carefully and search for a reliable ntp source.
    • Does your datacenter offer this service? Can you have an SLA?
    • Avoid the country/region pools, because they may be run by hobbyists and will cost you pain, even if the pool project recommends you use them. Perhaps using the ntp servers provided by your OS vendor is safer.
    • Perhaps buy a DCF77 receiver and build your own Stratum 1 server, but you may need an external antenna if the datacenter walls are too thick.
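As a sketch of the offset check, the snippet below averages the peer offsets from `ntpctl -s all` output (read on stdin) and applies abs(). It assumes the first column ending in "ms" on each line is the offset, as in the sample output earlier in this post; verify the format against your OpenNTPD version.

```shell
# Average the signed peer offsets from `ntpctl -s all`, then take abs().
# Assumption: the first field ending in "ms" on a line is the offset
# column, as in the output shown earlier in this post.
avg_abs_offset() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /ms$/) { sub(/ms$/, "", $i); sum += $i; n++; break }
  }
  END { if (n) { a = sum / n; if (a < 0) a = -a; printf "%.3f\n", a } }'
}

# Usage: alert when the result exceeds your threshold, e.g.
#   ntpctl -s all | avg_abs_offset
```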