Home

  • Backing up my GitHub repos

    I am using runwhen together with daemontools to launch and monitor the backup. The run script used by the daemontools service executes runwhen commands to sleep until the next run (every hour) and then launches the backup script. The service runs in a dedicated jail.

    The run script listed below uses some runwhen commands (rw-add, rw-match and rw-sleep) to wake up every hour, and setuidgid to run the service as an unprivileged user.

    #!/bin/sh
    
    exec 2>&1
    
    exec setuidgid gitbackup \
    rw-add   n d1S             now1s        \
    rw-match \$now1s ,M=00     wake         \
    rw-sleep \$wake                         \
    /home/gitbackup/update.sh
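
    For completeness, a possible companion log/run script for the service (an assumption on my side; the original setup does not show it) would use multilog to add the tai64n timestamps that appear in the log further down:

    #!/bin/sh
    # Hypothetical log/run under /var/service/backups/log/ (path taken from the
    # log output below); reusing the gitbackup user is just an assumption.
    # multilog's "t" flag prepends the tai64n timestamps, ./main is the log dir.
    exec setuidgid gitbackup multilog t ./main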
    

    The actual backup script, shown below, iterates over all the git repos and fetches the changes.

    #!/bin/sh
    
    exec 2>&1
    
    cd /usr/home/gitbackup/backup
    echo "===="
    date
    echo "===="
    
    # Iterate over every bare repo in the backup directory and fetch from
    # all of its remotes; the subshell returns us to the backup directory.
    for repo in *.git; do
            ( cd "$repo" && /usr/local/bin/git fetch --all )
    done
    echo "===="
    

    Checking the output log:

    $ cat /var/service/backups/log/main/current | tai64nlocal
    2018-02-05 18:00:00.098641500 ====
    2018-02-05 18:00:00.150083500 Mon Feb  5 18:00:00 CET 2018
    2018-02-05 18:00:00.180056500 ====
    2018-02-05 18:00:00.211689500 Fetching origin
    2018-02-05 18:00:01.073738500 From https://github.com/xgarcias/ansible-cmdb-freebsd-template
    2018-02-05 18:00:01.073743500  * branch            HEAD       -> FETCH_HEAD
    2018-02-05 18:00:01.091577500 Fetching origin
    2018-02-05 18:00:02.185366500 From https://github.com/xgarcias/ansible-daemontools
    2018-02-05 18:00:02.185371500  * branch            HEAD       -> FETCH_HEAD
    2018-02-05 18:00:02.203049500 Fetching origin
    2018-02-05 18:00:04.180310500 From https://github.com/xgarcias/ansible-macbook
    2018-02-05 18:00:04.180315500  * branch            HEAD       -> FETCH_HEAD
    2018-02-05 18:00:04.198104500 Fetching origin
    2018-02-05 18:00:06.448429500 From https://github.com/xgarcias/daemontools-dyndns
    2018-02-05 18:00:06.448434500  * branch            HEAD       -> FETCH_HEAD
    2018-02-05 18:00:06.466266500 Fetching origin
    2018-02-05 18:00:08.299785500 From https://github.com/xgarcias/daemontools-poudriere
    2018-02-05 18:00:08.299790500  * branch            HEAD       -> FETCH_HEAD
    2018-02-05 18:00:08.321755500 Fetching origin
    2018-02-05 18:00:09.749956500 From https://github.com/xgarcias/daemontools-unbound-sinkhole
    2018-02-05 18:00:09.749961500  * branch            HEAD       -> FETCH_HEAD
    2018-02-05 18:00:09.771744500 Fetching origin
    2018-02-05 18:00:11.113934500 From https://github.com/xgarcias/elasticsearch-plugin-readonlyrest
    2018-02-05 18:00:11.113939500  * branch            HEAD       -> FETCH_HEAD
    2018-02-05 18:00:11.135774500 Fetching origin
    2018-02-05 18:00:12.703191500 From https://github.com/xgarcias/freebsd_local_ports
    2018-02-05 18:00:12.703197500  * branch            HEAD       -> FETCH_HEAD
    2018-02-05 18:00:12.724967500 Fetching origin
    2018-02-05 18:00:13.583204500 From https://github.com/xgarcias/xgarcias.github.io
    2018-02-05 18:00:13.583209500  * branch            HEAD       -> FETCH_HEAD
    2018-02-05 18:00:13.601461500 ====
    
  • Querying ASN/IP records via a non rate-limited REST API

    Querying ASN/IP records via a non rate-limited, unauthenticated REST API.

    More Info

    Also, you can use @DuckDuckGo to get the same results with the !Arin and !Ripe bang searches.
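
    The entry above does not name the exact service, but as an illustrative sketch, something like the RIPEstat Data API (an assumption, not necessarily the API meant here) can be queried with plain fetch(1) or curl:

    #!/bin/sh
    # Illustration only: look up which ASN announces a given IP or prefix via
    # the RIPEstat Data API; the endpoint choice is an assumption.
    RESOURCE=${1:-8.8.8.8}
    fetch -qo - "https://stat.ripe.net/data/prefix-overview/data.json?resource=${RESOURCE}"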

  • Blocklist for browser based cryptominers

    List of DNS records and IP addresses to prevent cryptomining in the browser or other applications.
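
    As a rough usage sketch (file names and paths are assumptions, not part of the blocklist itself), the domain entries could be turned into Unbound local-zone rules and include:'d from the server: section of unbound.conf:

    #!/bin/sh
    # Assumed input: one blocklisted domain per line in cryptominers-domains.txt.
    # The output path is only an example.
    awk '{ printf "local-zone: \"%s\" refuse\n", $1 }' cryptominers-domains.txt \
        > /usr/local/etc/unbound/cryptominers.conf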

  • Centrally managed Bhyve infrastructure with Ansible, libvirt and pkg-ssh

    At work we’ve been using Bhyve for a while to run non-critical systems.  It is a really nice and stable hypervisor even though we are using an earlier version available on FreeBSD 10.3. This means we lack Windows and VNC support among other things, but it is not a big deal.

    After some iterations in our internal tools, we realised that the installation process was too slow and we always repeated the same steps. Of course,  any good sysadmin will scream “AUTOMATION!” and so did we. Therefore, we started looking for different ways to improve our deployments.

    We had a look at existing frameworks that manage Bhyve, but none of them had a feature that we find really important: having a centralized repository of VM images. For instance, SmartOS applies this method successfully by having a backend server that stores a catalog of VMs and Zones, meaning that new instances can be deployed in a minute at most. This is a game changer if you are really busy in your day-to-day operations.

    Since we are not great programmers, we decided to leverage existing tools to achieve the same result: a centralised repository of Bhyve images in our data centers. The main building blocks are Ansible, libvirt, pkg/pkg-ssh and ZFS.

    Workflow:

    • We write a yml dictionary to define the parameters needed to create a new VM:
      • VM template (name of the pkg that will be installed  in /bhyve/images)
      • VM name, cpu, memory, domain template, serial console, etc.
    • This dictionary will be kept in the corresponding host_vars definition that configures our Bhyve host server.
    • The Ansible playbook:
      • installs the package named after the VM template (a packaged ZFS snapshot), e.g. pkg install FreeBSD-10.3-RELEASE-ZFS-20G-20170515.
      • uses cat and zfs receive to load the ZFS snapshot into a new volume (see the sketch after this list).
      • calls the libvirt modules to automatically configure and boot the VM.
    • The sysadmin logs in to the new VM and adjusts the hostname and network settings.
    • Run a separate Ansible playbook to configure the new VM as usual.
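
    A minimal sketch of what the playbook does for a single guest, using hypothetical dataset and file names (the real playbook drives Ansible modules such as the libvirt ones mentioned above instead of shell commands):

    #!/bin/sh
    # All names below are examples: the template package matches the one cited
    # in the workflow, while the pool/dataset layout and XML path are assumptions.
    TEMPLATE=FreeBSD-10.3-RELEASE-ZFS-20G-20170515
    VMNAME=newguest

    # 1. Install the VM template package, which drops a ZFS send stream
    #    under /bhyve/images.
    pkg install -y "${TEMPLATE}"

    # 2. Load the snapshot into a new volume for the guest.
    cat "/bhyve/images/${TEMPLATE}.zfs" | zfs receive "zroot/bhyve/${VMNAME}"

    # 3. Define and boot the guest through libvirt's bhyve driver.
    virsh -c bhyve:///system define "/bhyve/xml/${VMNAME}.xml"
    virsh -c bhyve:///system start "${VMNAME}"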

    Once automated, the installation process takes 2 minutes at most, compared with the 30 minutes needed to install a VM manually, and it also allows us to deploy many guests in parallel.

  • OpenNTPD, leap seconds and other horror stories

    In case you missed it, there was a leap second on December 31, 2016. I don’t know about you, but I’ve read many horror stories about things going terribly wrong after leap seconds and sysadmins in despair being paged at night. Well, today I am going to share one of those stories with you, and I hope it will be terrifying.

    Horror story

    Like diligent sysadmins, we monitor the ntpd services on our servers (OpenNTPD in our case) and we are alerted if a noticeable clock offset appears. Of course, in the event of a leap second, all the servers should trigger an alert and the corresponding recovery. The leap second was inserted as 23:59:60 on December 31, and the servers slowly chewed through the difference in around 3 hours.

    But… here comes the horror story. Some of the servers didn’t recover at all. The graphs showed that the offset was still around -900 ms (an extra second was introduced, therefore we were one second behind). In the end we had to restart openntpd as a quick remediation.

    Below you can find the status of one of the servers, for reference.

    $ ntpctl -s all
    
    4/4 peers valid, clock synced, stratum 3
    peer
       wt tl st  next  poll          offset       delay      jitter
    176.9.31.215 from pool de.pool.ntp.org
        1 10  2 1474s 1502s      -984.909ms     6.266ms     0.175ms
    62.116.162.126 from pool de.pool.ntp.org
      *  1 10  2  733s 1640s      -984.824ms     1.105ms     0.126ms
    78.46.79.68 from pool de.pool.ntp.org
        1 10  3  888s 1509s      -984.824ms     6.380ms     0.138ms
    46.4.54.78 from pool de.pool.ntp.org
        1 10  2 3087s 3098s       105.306ms     6.295ms     0.130ms
    

    You may notice that one of the peers has a positive offset, which doesn’t make any sense because an extra second was inserted, as already explained above. I hope you can smell the stink at this point, because it is quite strong.

    Well, digging in the logs I also found the following line:

    ntpd[1438]: reply from 46.4.54.78: not synced (alarm), next query 3228s
    

    Yes, openntpd was unhappy with that peer and decided to stop the time synchronisation until the issue was solved. Notice that this is a really bad situation because we don’t control that peer at all. The only option was to restart openntpd, so that the round-robin DNS record we configured would resolve to a different set of peers.

    I decided to do a bit of research and went to OpenNTPD’s GitHub repo to read the source code, particularly src/usr.sbin/ntpd/client.c. Here, the NTP packet’s status is evaluated against a bit mask to analyse the LI (Leap Indicator) bits:

    if ((msg.status & LI_ALARM) == LI_ALARM || msg.stratum == 0 || msg.stratum > NTP_MAXSTRATUM)
    

    The name LI_ALARM is self-explanatory. The check evaluates to true when both Leap Indicator bits are set to 1, i.e. an LI value of 3. From the RFC:

    LI Leap Indicator (leap): 2-bit integer warning of an impending leap second to be inserted or deleted in the last minute of the current month, with values defined in Figure 9:

    • 0   no warning
    • 1   last minute of the day has 61 seconds
    • 2   last minute of the day has 59 seconds
    • 3   unknown (clock unsynchronized)

    At this point, I can claim that the peer was totally broken: it ran for hours (and may still be broken) with its clock unsynchronized, and it hit us in a chain reaction. Well, one may expect a minimum level of quality, but these are the risks you must accept if you use services run by others (unless you have a Service Level Agreement signed on paper).

    To understand how risky this can be, have a look at the page that describes how to join the NTP pool: only a static IP address and a minimum amount of bandwidth are required, plus a couple of recommendations. Hence the many hobbyists running their own time servers.

    ntp.org runs a monitoring system that can be queried online. Servers with a score lower than 10 are automatically removed from the pool (the one that hit us had -100). This is a good measure, but good luck if such a server is already active in your ntpd service: it will keep causing trouble until you manually restart the service.

    Lessons learned

    • Actively monitor the ntp service.
    • Monitor the general status: synchronized/unsynchronized, stratum, number of valid peers, etc.
    • Monitor the offset. I average the offsets of all peers and then apply abs() (a rough sketch follows this list).
    • Plan carefully and search for a reliable ntp source.
    • Does your datacenter offer this service? Can you have an SLA?
    • Avoid the country/region pools at pool.ntp.org because they may be run by hobbyists and will cost you pain, even if ntp.org recommends using them. Running against the NTP servers provided by your OS vendor is probably safer.
    • Perhaps buy a DCF77 receiver to build your own Stratum 1 server, though you may need an external antenna if the datacenter walls are too thick.
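
    A minimal sketch of the offset check mentioned in the list above, assuming the values are scraped from ntpctl -s all with awk (our actual check lives in the monitoring system, so the parsing and output here are illustrative):

    #!/bin/sh
    # Average the per-peer offsets reported by ntpctl and print the absolute
    # value; a real check would compare it against an alert threshold.
    ntpctl -s all | awk '
        /ms[ ]*$/ && NF >= 8 {
            # Peer detail lines end in "<offset>ms <delay>ms <jitter>ms";
            # the offset is the third field from the end.
            off = $(NF-2); sub("ms", "", off)
            sum += off; n++
        }
        END {
            if (n > 0) {
                avg = sum / n
                if (avg < 0) avg = -avg
                printf "average absolute offset: %.3f ms over %d peers\n", avg, n
            }
        }'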