Containerlab with BGP, keepalived, conntrackd and SNAT

Graph from Containerlab

Background story

I had a go at Containerlab after Nicolas Vibert posted a nice article about it last year (and those of you who have taken Isovalent labs will certainly recognize it). Containerlab relies on Docker to spin up a container infrastructure “as code” and has templates for a bunch of networking-related simulators.

I happily run my own home infrastructure, aka “homelab”, where I run BGP among other components, but there are moments when it’s rather nice to just spin up something to test and then forget it. And yes, forget for real. I have a bad habit of leaving virtual machines powered off for future reference, and a year or so later it can be a bit tougher to just delete them. Even if I’ve marked accounts with expiry dates and know (from looking at the logs) that the VM has been untouched ever since, I still have to boot it up and confirm that it can just be deleted. With containers in a simulation environment it’s a lot easier, as I would never keep anything of long-term importance there.

I’d seen how neatly Nicolas designed a BGP lab with Containerlab and was inspired to have a go myself one day. That day arrived when I was thinking about doing SNAT on egress traffic from a shared IP, with the help of keepalived and conntrackd on two nodes.

Although I’ve had my own kind of “Nagle moment” (see below), I’m really impressed with the potential of Containerlab and decided to try out other stuff as well.

Bryan Cantrill tweets about Nagle algorithm

The “Nagle moment”, sort of

I know for sure that my issue was nowhere near as complicated as Oxide’s, and it is not really related to Nagle’s algorithm at all, but for me it was in the same dimension. Once I had solved the first part of my mission, setting up a shared IP, I got the SNAT working together with conntrackd rather swiftly, which in turn awoke another idea: let the “other site” have a router acting as the GW out to the Internet.

To establish the SNAT from site A to site B, I decided to keep it simple and point the default route on site B back to site A. But this of course caused issues reaching the container acting as “Internet GW”, so I decided to just comment out the line that set the default route.

As commenting out one line was such a trivial thing and so easy to revert, I didn’t even bother to create a snapshot, backup or Git branch and just had a go. The GW on site B still wasn’t routing outbound traffic from site A, and I tried out some tiny changes in FRR before it was time for bed anyway.

Two evenings later I had spare time to proceed with my task, but I had forgotten where I got stuck, and to my horror I couldn’t even do SNAT anymore. That led to three late-evening sessions (4–6 hours, almost crying out of frustration) attempting to get FRR back to a state that allowed me to do SNAT again. I could not reach the destination anymore for some reason. Masquerading? No, of course not: that hides the source IP and replaces it with the router’s, so it had to be something with the routing.

I began to look at OSPF - but no, it shouldn’t be a requirement and I was certainly not doing that before, so why now? Route-maps? I had really allowed all traffic, but still, was I missing something? EVPN? BFD? Details with the BGP unnumbered peering? Next-hop? No, things I didn’t need to implement last time shouldn’t all of a sudden be necessary just to have the SNAT in place again.

My head was spinning and I couldn’t figure out what tiny changes I could possibly have made that made everything stop working out of nowhere. Comparing with old copies, looking at the terminal history, terminal scrollback, Git: everything in my current FRR config looked very similar to what I had configured at the time. Then I saw it, just a simple pound sign:

An embarrassing moment, and I felt very nauseous about that comment, and why wouldn’t I want to.. now, wait a minute.. so the night before I had slept lousily for going to bed without solving the issue, and now came one more night of bad sleep for knowing that I had indeed made just one tiny change and not even bothered to look at the stanza for spinning up the containers…

The start layout

The idea was that traffic from one site, connected through redundant BGP nodes, should travel over an unknown network (the Internet?) to redundant BGP nodes peering at the destination site, abstracting away the whole leaf-spine topology. Plain and simple. A node in site A connects (SNAT’ed to the shared IP) to a node in site B that echoes the source address.

Sites A and B were eventually renamed to int and ext for readability.

FRR configuration

A big fat warning about putting these configurations into production: the FRR instances trust their neighbors and don’t watch out for BGP poisoning. You have been warned.

The idea was to aim for simplicity and go with BGP unnumbered.

intbgp1:

cat << EOF > conf.d/intbgp1_frr.conf
!
frr defaults datacenter
hostname intbgp1
log syslog informational
service integrated-vtysh-config
ipv6 forwarding
!
interface lo
  ip address 10.0.0.2/32
!
router-id 10.0.0.2
!
router bgp 64502
  bgp bestpath as-path multipath-relax
  bgp bestpath compare-routerid
  no bgp ebgp-requires-policy
  no bgp network import-check

  neighbor intbgp peer-group
  neighbor intbgp remote-as internal
  neighbor eth2 interface peer-group intbgp
  neighbor intbgp update-source eth2

  neighbor extbgp peer-group
  neighbor extbgp remote-as external
  neighbor extbgp capability extended-nexthop
  neighbor eth1 interface peer-group extbgp
  neighbor extbgp update-source eth1
  !
  address-family ipv4 unicast
   network 10.0.0.2/32
   neighbor intbgp activate
   neighbor extbgp activate
   redistribute connected
  exit-address-family
  !
  address-family ipv6 unicast
   neighbor intbgp activate
   neighbor extbgp activate
   redistribute connected
  exit-address-family
!
line vty
!
end
EOF

intbgp2:

cat << EOF | patch -o conf.d/intbgp2_frr.conf -p0
--- conf.d/intbgp1_frr.conf   2023-10-05 11:48:27.734812969 +0200
+++ conf.d/intbgp2_frr.conf   2023-10-05 11:48:27.734812969 +0200
@@ -1,14 +1,14 @@
 !
 frr defaults datacenter
-hostname intbgp1
+hostname intbgp2
 log syslog informational
 service integrated-vtysh-config
 ipv6 forwarding
 !
 interface lo
-  ip address 10.0.0.2/32
+  ip address 10.0.0.3/32
 !
-router-id 10.0.0.2
+router-id 10.0.0.3
 !
 router bgp 64502
   bgp bestpath as-path multipath-relax
@@ -28,7 +28,7 @@
   neighbor extbgp update-source eth1
   !
   address-family ipv4 unicast
-   network 10.0.0.2/32
+   network 10.0.0.3/32
    neighbor intbgp activate
    neighbor extbgp activate
    redistribute connected
EOF
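
The patch above only swaps the hostname and the loopback address. As a rough alternative sketch, the same derivation could be done with sed. The derive_frr_conf name and its argument order are my own assumptions and not part of the lab; it only touches the hostname, router-id and /32 lines seen in the config above.

```shell
#!/bin/sh
# derive_frr_conf: hypothetical helper (not from the original lab).
# Reads an FRR config on stdin and writes a copy for a sibling node on stdout.
# Usage:
#   derive_frr_conf intbgp1 intbgp2 10.0.0.2 10.0.0.3 \
#       < conf.d/intbgp1_frr.conf > conf.d/intbgp2_frr.conf
derive_frr_conf() {
    old_host=$1 new_host=$2 old_ip=$3 new_ip=$4
    # Escape the dots so they only match literal dots in the address.
    old_ip_re=$(printf '%s\n' "$old_ip" | sed 's/\./\\./g')
    sed -e "s|^hostname ${old_host}\$|hostname ${new_host}|" \
        -e "s|^router-id ${old_ip_re}\$|router-id ${new_ip}|" \
        -e "s| ${old_ip_re}/32\$| ${new_ip}/32|"
}
```

The patch approach has the advantage of failing loudly when the base file drifts, while a sed pipeline like this silently passes unmatched lines through, so treat it as a convenience only.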

The extbgp nodes

Ideally conntrackd and keepalived should be installed on extbgp as well, but I wanted to explore alternatives. As the configuration below shows, FRR keeps a shared IP between the two nodes. I’m not sure, but I believe SNAT here would need either PBR that sends traffic out to an interface doing the SNAT/masquerading, or some kind of combination with conntrackd (although I fail to see how to trigger the state changes from FRR).

extbgp1:

cat << EOF > conf.d/extbgp1_frr.conf
!
frr defaults datacenter
hostname extbgp1
log syslog informational
service integrated-vtysh-config
ipv6 forwarding
!
interface lo
  ip address 10.0.0.4/32
  ip address 10.237.0.253/32
!
router-id 10.0.0.4
!
router bgp 64503
  bgp bestpath as-path multipath-relax
  bgp bestpath compare-routerid
  no bgp ebgp-requires-policy
  no bgp network import-check
  no bgp default ipv4-unicast

  neighbor extbgp peer-group
  neighbor extbgp remote-as internal
  neighbor eth2 interface peer-group extbgp
  neighbor extbgp update-source eth2

  neighbor intbgp peer-group
  neighbor intbgp remote-as external
  neighbor intbgp capability extended-nexthop
  neighbor eth1 interface peer-group intbgp
  neighbor intbgp update-source eth1
  !
  address-family ipv4 unicast
   network 10.0.0.4/32
   network 10.237.0.0/24
   neighbor intbgp activate
   neighbor extbgp activate
   redistribute connected
   network 10.237.0.253/32 route-map primary
  exit-address-family
  !
  address-family ipv6 unicast
   neighbor intbgp activate
   neighbor extbgp activate
   redistribute connected
  exit-address-family
!
route-map primary permit 10
  set community 64502:1
route-map secondary permit 10
  set community 64502:2
!
line vty
!
EOF

extbgp2:

cat << EOF | patch -o conf.d/extbgp2_frr.conf -p0
--- conf.d/extbgp1_frr.conf 2023-10-05 11:48:27.734812969 +0200
+++ conf.d/extbgp2_frr.conf 2023-10-05 11:48:27.734812969 +0200
@@ -1,15 +1,15 @@
 !
 frr defaults datacenter
-hostname extbgp1
+hostname extbgp2
 log syslog informational
 service integrated-vtysh-config
 ipv6 forwarding
 !
 interface lo
-  ip address 10.0.0.4/32
+  ip address 10.0.0.5/32
   ip address 10.237.0.253/32
 !
-router-id 10.0.0.4
+router-id 10.0.0.5
 !
 router bgp 64503
   bgp bestpath as-path multipath-relax
@@ -30,12 +30,12 @@
   neighbor intbgp update-source eth1
   !
   address-family ipv4 unicast
-   network 10.0.0.4/32
+   network 10.0.0.5/32
    network 10.237.0.0/24
    neighbor intbgp activate
    neighbor extbgp activate
    redistribute connected
-   network 10.237.0.253/32 route-map primary
+   network 10.237.0.253/32 route-map secondary
   exit-address-family
   !
   address-family ipv6 unicast
EOF

Keepalived for intbgp

An instance of keepalived is installed to keep the shared IP between the two routers. Inspiration on how to set up keepalived and conntrackd comes from https://satishdotpatel.github.io/ha-with-keepalived-and-conntrackd/.

intbgp1:

cat << EOF > conf.d/intbgp1_keepalived.conf
vrrp_sync_group G1 {
    group {
        EXT
        INT
    }
    notify_master "/etc/conntrackd/primary-backup.sh primary"
    notify_backup "/etc/conntrackd/primary-backup.sh backup"
    notify_fault "/etc/conntrackd/primary-backup.sh fault"
}

vrrp_instance INT {
    state MASTER
    interface eth3
    virtual_router_id 11
    priority 50
    advert_int 1
    unicast_src_ip 10.224.0.1
    unicast_peer {
        10.224.0.2
    }
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        10.227.0.254/24 dev eth4
    }
    nopreempt
    garp_master_delay 1
}
EOF

intbgp2:

cat << EOF | patch -o conf.d/intbgp2_keepalived.conf -p0
--- conf.d/intbgp1_keepalived.conf 2023-10-05 11:48:27.734812969 +0200
+++ conf.d/intbgp2_keepalived.conf 2023-10-05 11:48:27.734812969 +0200
@@ -9,14 +9,14 @@
 }

 vrrp_instance INT {
-    state MASTER
+    state BACKUP
     interface eth3
     virtual_router_id 11
-    priority 50
+    priority 25
     advert_int 1
-    unicast_src_ip 10.224.0.1
+    unicast_src_ip 10.224.0.2
     unicast_peer {
-        10.224.0.2
+        10.224.0.1
     }
     authentication {
         auth_type PASS
EOF
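
To see which of the two routers currently holds the shared IP, it is enough to look for 10.227.0.254 on eth4. The has_vip function below is a hypothetical helper of mine, not part of the lab. It reads the output of `ip -o -4 addr show` on stdin, so it can be fed from whichever node you want to inspect.

```shell
#!/bin/sh
# has_vip: hypothetical helper (not from the original lab).
# Reads `ip -o -4 addr show` output on stdin and succeeds (exit 0)
# if the given address is configured on one of the listed interfaces.
# Usage: ip -o -4 addr show dev eth4 | has_vip 10.227.0.254
has_vip() {
    # Escape the dots so they only match literal dots in the address.
    vip_re=$(printf '%s\n' "$1" | sed 's/\./\\./g')
    grep -q "inet ${vip_re}/"
}
```

Assuming Containerlab's default container naming (clab-<lab name>-<node>), something like `docker exec clab-k8s-intbgp1 ip -o -4 addr show dev eth4 | has_vip 10.227.0.254 && echo master` should tell whether intbgp1 is the current holder.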

The primary-backup.sh script referred to in keepalived.conf (the unmodified example script from the conntrackd examples directory):

cat << 'EOF' > conf.d/primary-backup.sh
#!/bin/sh
#
# (C) 2006-2011 by Pablo Neira Ayuso <pablo@netfilter.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# Description:
#
# This is the script for primary-backup setups for keepalived
# (http://www.keepalived.org). You may adapt it to make it work with other
# high-availability managers.
#
# Do not forget to include the required modifications to your keepalived.conf
# file to invoke this script during keepalived's state transitions.
#
# Contributions to improve this script are welcome :).
#

CONNTRACKD_BIN=/usr/sbin/conntrackd
CONNTRACKD_LOCK=/var/lock/conntrack.lock
CONNTRACKD_CONFIG=/etc/conntrackd/conntrackd.conf

case "$1" in
  primary)
    #
    # commit the external cache into the kernel table
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -c
    if [ $? -eq 1 ]
    then
        logger "ERROR: failed to invoke conntrackd -c"
    fi

    #
    # flush the internal and the external caches
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -f
    if [ $? -eq 1 ]
    then
     logger "ERROR: failed to invoke conntrackd -f"
    fi

    #
    # resynchronize my internal cache to the kernel table
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -R
    if [ $? -eq 1 ]
    then
     logger "ERROR: failed to invoke conntrackd -R"
    fi

    #
    # send a bulk update to backups
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -B
    if [ $? -eq 1 ]
    then
        logger "ERROR: failed to invoke conntrackd -B"
    fi
    ;;
  backup)
    #
    # is conntrackd running? request some statistics to check it
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -s
    if [ $? -eq 1 ]
    then
        #
        # something's wrong, do we have a lock file?
        #
        if [ -f $CONNTRACKD_LOCK ]
        then
            logger "WARNING: conntrackd was not cleanly stopped."
            logger "If you suspect that it has crashed:"
            logger "1) Enable coredumps"
            logger "2) Try to reproduce the problem"
            logger "3) Post the coredump to netfilter-devel@vger.kernel.org"
            rm -f $CONNTRACKD_LOCK
        fi
        $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -d
        if [ $? -eq 1 ]
        then
            logger "ERROR: cannot launch conntrackd"
            exit 1
        fi
    fi
    #
    # shorten kernel conntrack timers to remove the zombie entries.
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -t
    if [ $? -eq 1 ]
    then
     logger "ERROR: failed to invoke conntrackd -t"
    fi

    #
    # request resynchronization with master firewall replica (if any)
    # Note: this does nothing in the alarm approach.
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -n
    if [ $? -eq 1 ]
    then
     logger "ERROR: failed to invoke conntrackd -n"
    fi
    ;;
  fault)
    #
    # shorten kernel conntrack timers to remove the zombie entries.
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -t
    if [ $? -eq 1 ]
    then
     logger "ERROR: failed to invoke conntrackd -t"
    fi
    ;;
  *)
    logger "ERROR: unknown state transition"
    echo "Usage: primary-backup.sh {primary|backup|fault}"
    exit 1
    ;;
esac

exit 0
EOF
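
A note on writing a script like this through a heredoc: the script is full of shell expansions ($1, $?, $CONNTRACKD_BIN), and whether they survive depends on how the heredoc delimiter is quoted. The two functions below are a minimal illustration with made-up names (NAME, unquoted, quoted), not part of the lab.

```shell
#!/bin/sh
NAME=expanded

# Unquoted delimiter: the shell expands $NAME while writing the heredoc.
unquoted() {
cat << EOF
value: $NAME
EOF
}

# Quoted delimiter: the body is passed through literally.
quoted() {
cat << 'EOF'
value: $NAME
EOF
}

unquoted   # prints "value: expanded"
quoted     # prints "value: $NAME"
```

For a file that should contain the dollar signs as written, such as a shell script or a config with a password hash, the quoted form is the one to reach for.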

Conntrackd for intbgp

To keep the netfilter connection tracking state in sync between both routers, conntrackd was put in place with a corresponding configuration.

intbgp1:

cat << EOF > conf.d/intbgp1_conntrackd.conf
Sync {
    Mode FTFW {
        DisableExternalCache Off
        StartupResync on
    }

    UDP {
        IPv4_address 10.223.0.1
        IPv4_Destination_Address 10.223.0.2
        Port 3780
        Interface eth2
        Checksum on
    }
}

General {
    Systemd off
    HashSize 8192
    HashLimit 65535
    LogFile on
    Syslog off
    LockFile /var/lock/conntrack.lock
    UNIX {
        Path /var/run/conntrackd.ctl
        Backlog 20
    }
    SocketBufferSize 262142
    SocketBufferSizeMaxGrown 655355
    NetlinkBufferSize 262142
    NetlinkBufferSizeMaxGrowth 655355
    Filter From Userspace {
        Protocol Accept {
            TCP
            #UDP
            #ICMP # This requires a Linux kernel >= 2.6.31
        }
        Address Ignore {
            IPv4_address 127.0.0.1 # loopback
            IPv4_address 10.0.0.0/24
            IPv4_address 172.20.20.0/24
            IPv4_address 172.18.0.0/16
            IPv4_address 10.227.0.0/24
            IPv4_address 10.223.0.0/24
            IPv4_address 10.224.0.0/24
            IPv4_address 10.179.0.0/24
        }
    }
}
EOF

intbgp2:

cat << EOF | patch -o conf.d/intbgp2_conntrackd.conf -p0
--- conf.d/intbgp1_conntrackd.conf 2023-10-05 11:48:27.734812969 +0200
+++ conf.d/intbgp2_conntrackd.conf 2023-10-05 11:48:27.734812969 +0200
@@ -5,8 +5,8 @@
     }

     UDP {
-        IPv4_address 10.223.0.1
-        IPv4_Destination_Address 10.223.0.2
+        IPv4_address 10.223.0.2
+        IPv4_Destination_Address 10.223.0.1
         Port 3780
         Interface eth2
         Checksum on
@@ -38,11 +38,8 @@
             IPv4_address 127.0.0.1 # loopback
             IPv4_address 10.0.0.0/24
             IPv4_address 172.20.20.0/24
-            IPv4_address 172.18.0.0/16
             IPv4_address 10.227.0.0/24
             IPv4_address 10.223.0.0/24
-            IPv4_address 10.224.0.0/24
-            IPv4_address 10.179.0.0/24
         }
     }
 }
EOF
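
A quick way to judge whether the sync works is to look at the connection table on both routers and pick out the flows that were SNAT’ed to the shared IP. In `conntrack -L` output those are the entries whose reply tuple (the second dst= field) points at 10.227.0.254. The snat_flows function is a hypothetical helper of mine that filters such lines from stdin.

```shell
#!/bin/sh
# snat_flows: hypothetical helper (not from the original lab).
# Reads `conntrack -L` style lines on stdin and prints those whose reply
# tuple is addressed to the given SNAT address (the second dst= field).
# Usage: conntrack -L -p tcp 2>/dev/null | snat_flows 10.227.0.254
snat_flows() {
    awk -v vip="$1" '{
        n = 0
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^dst=/) {
                n++
                # Only the second dst= belongs to the reply direction.
                if (n == 2 && $i == ("dst=" vip)) { print; break }
            }
        }
    }'
}
```

On the backup router the synced flows live in conntrackd's external cache until a failover commits them to the kernel, so there `conntrackd -e` is the place to look rather than `conntrack -L`.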

The three virtual switches

The switch configs are more or less in a pristine state. The relevant parts that are changed from the original are the ports/interfaces with corresponding descriptions and a few VLANs, just because.

intsw0:

cat << 'EOF' > conf.d/intsw0.cfg
vlan internal order descending range 3000 4094
!
hostname intsw0
!
spanning-tree mode none
!
no aaa root
!
username autoadmin privilege 15 role network-admin secret sha512 $6$C0MXmP2mKEqqv5u2$vv6OA.aXVYSE.N99fAJiCWSoalO1yybi1pCFTshfmj2u5USI4Y.dgjBqolaxjW2do.kpd0eGg4JsLGmZSN78F0
!
vrf instance mgmtVrf
!
ip routing
ip routing vrf mgmtVrf
!
ipv6 unicast-routing
ipv6 unicast-routing vrf mgmtVrf
!
vlan 10
  name servers
!
vlan 20
  name clients
!
vlan 30
  name bgp-keepalived
!
interface Loopback0
  description C: cEOS1-Loopback0
  ip address 1.1.1.1/32
  ipv6 address 2001:db8::1:1:1:1/128
!
interface ethernet1
  description L: cEOS2-Eth1
  no switchport
  load-interval 30
  ip address 10.10.10.0/31
  ipv6 address 2001:db8:100::0/127
  ip ospf area 0
  ipv6 ospf 1 area 0
!
interface ethernet2
  description L: intbgp1-eth2
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 20
!
interface ethernet3
  description L: intbgp1-eth3
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet4
  description L: intbgp1-eth4
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet5
  description L: intbgp2-eth2
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 20
!
interface ethernet6
  description L: intbgp2-eth3
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet7
  description L: intbgp2-eth4
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet8
  description L: inthost0-cplane
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet9
  description L: inthost1-worker
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet10
  description L: inthost2-worker2
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet11
  description L: inthost3-worker3
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet12
  description L: inthost4
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface Management0
  description L: Mgmt Interface
  vrf mgmtVrf
  ip address 10.10.10.2/24
  ipv6 address 2001:10:10:10::2/64
!
interface vlan 10
  description H: Servers vlan
  load-interval 30
  ip address 10.227.0.1/24
!
interface vlan 20
  description H: Servers vlan
  load-interval 30
  ip address 10.223.0.1/24
!
router ospf 1
  router-id 1.1.1.3
  redistribute connected
  redistribute static
  log-adjacency-changes details
  bfd default
!
ipv6 router ospf 1
  router-id 1.1.1.3
  redistribute static
  redistribute connected
  log-adjacency-changes details
  bfd default
!
router bfd
   interval 500 min-rx 500 multiplier 3 default
!
management api http-commands
   no shutdown
!
management api gnmi
   transport grpc default
!
management api netconf
   transport ssh default
!
EOF

peersw0:

cat << 'EOF' > conf.d/peersw0.cfg
vlan internal order descending range 3000 4094
!
hostname peersw0
!
spanning-tree mode none
!
no aaa root
!
username autoadmin privilege 15 role network-admin secret sha512 $6$C0MXmP2mKEqqv5u2$vv6OA.aXVYSE.N99fAJiCWSoalO1yybi1pCFTshfmj2u5USI4Y.dgjBqolaxjW2do.kpd0eGg4JsLGmZSN78F0
!
vrf instance mgmtVrf
!
ip routing
ip routing vrf mgmtVrf
!
ipv6 unicast-routing
ipv6 unicast-routing vrf mgmtVrf
!
vlan 40
  name bgp-peers
!
interface Loopback0
  description C: cEOS2-Loopback0
  ip address 1.1.1.1/32
  ipv6 address 2001:db8::1:1:1:1/128
!
interface ethernet1
  description L: cEOS2-Eth1
  no switchport
  load-interval 30
  ip address 10.10.10.0/31
  ipv6 address 2001:db8:100::0/127
  ip ospf area 0
  ipv6 ospf 1 area 0
!
interface ethernet2
  description L: intbgp1
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 40
!
interface ethernet3
  description L: intbgp2
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 40
!
interface ethernet4
  description L: extbgp1
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 40
!
interface ethernet5
  description L: extbgp2
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 40
!
interface Management0
  description L: Mgmt Interface
  vrf mgmtVrf
  ip address 10.10.10.3/24
!
interface vlan 40
  description H: BGP peering vlan
  load-interval 30
  ip address 10.0.0.1/24
!
router ospf 1
  router-id 1.1.1.2
  redistribute connected
  redistribute static
  log-adjacency-changes details
  bfd default
!
ipv6 router ospf 1
  router-id 1.1.1.2
  redistribute static
  redistribute connected
  log-adjacency-changes details
  bfd default
!
router bfd
   interval 500 min-rx 500 multiplier 3 default
!
management api http-commands
   no shutdown
!
management api gnmi
   transport grpc default
!
management api netconf
   transport ssh default
!
EOF

extsw0:

cat << 'EOF' > conf.d/extsw0.cfg
vlan internal order descending range 3000 4094
!
hostname extsw0
!
spanning-tree mode none
!
no aaa root
!
username autoadmin privilege 15 role network-admin secret sha512 $6$C0MXmP2mKEqqv5u2$vv6OA.aXVYSE.N99fAJiCWSoalO1yybi1pCFTshfmj2u5USI4Y.dgjBqolaxjW2do.kpd0eGg4JsLGmZSN78F0
!
vrf instance mgmtVrf
!
ip routing
ip routing vrf mgmtVrf
!
ipv6 unicast-routing
ipv6 unicast-routing vrf mgmtVrf
!
vlan 10
  name servers
!
vlan 20
  name clients
!
interface Loopback0
  description C: cEOS2-Loopback0
  ip address 1.1.1.1/32
!
interface ethernet1
  description L: cEOS3-Eth1
  no switchport
  load-interval 30
  ip address 10.10.10.0/31
  ipv6 address 2001:db8:100::0/127
  ip ospf area 0
  ipv6 ospf 1 area 0
!
interface ethernet2
  description L: extbgp1-eth2
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet3
  description L: extbgp1-eth3
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet4
  description L: extbgp2-eth2
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet5
  description L: extbgp2-eth3
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet6
  description L: exthost0-cplane
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet7
  description L: exthost1-worker
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet8
  description L: exthost2-worker2
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet9
  description L: exthost3-worker3
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet10
  description L: exthost4
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface ethernet11
  description L: extgw0
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan 10
!
interface Management0
  description L: Mgmt Interface
  vrf mgmtVrf
  ip address 10.10.10.4/24
!
interface vlan 10
  description H: Servers vlan
  load-interval 30
  ip address 10.237.0.1/24
!
interface vlan 20
  description H: Servers vlan
  load-interval 30
  ip address 10.233.0.1/24
!
router ospf 1
  router-id 1.1.1.1
  redistribute connected
  redistribute static
  log-adjacency-changes details
  bfd default
!
ipv6 router ospf 1
  router-id 1.1.1.1
  redistribute static
  redistribute connected
  log-adjacency-changes details
  bfd default
!
router bfd
   interval 500 min-rx 500 multiplier 3 default
!
management api http-commands
   no shutdown
!
management api gnmi
   transport grpc default
!
management api netconf
   transport ssh default
!
EOF
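
Most of the three switch configs consist of the same access-port stanza repeated with a different port number, description and VLAN. As a sketch, such stanzas could be generated instead of typed. The access_port function is a hypothetical helper of mine and only mirrors the stanza layout used in the configs above.

```shell
#!/bin/sh
# access_port: hypothetical helper (not from the original lab).
# Prints one cEOS access-port stanza in the same shape as the configs above.
# Usage: access_port PORT_NUMBER DESCRIPTION VLAN
access_port() {
    cat << EOF
interface ethernet$1
  description L: $2
  load-interval 30
  switchport
  switchport mode access
  switchport access vlan $3
!
EOF
}

# Example: the stanza for the extgw0 port on extsw0.
access_port 11 extgw0 10
```

Here the unquoted heredoc delimiter is intentional, since the positional parameters are meant to be expanded.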

Kind

As I intend to run Kubernetes on Kind through Containerlab, I’ve prepared two clusters. The CNI of choice is Cilium (without kube-proxy), but this is a work in progress.

cluster one, aka “clab”:

cat << EOF > clab_cluster.yaml
kind: Cluster
name: clab-k8s
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: "10.0.0.0/16"
  serviceSubnet: "10.1.0.0/16"
  kubeProxyMode: "none"
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-ip: 10.227.0.2

- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-ip: 10.227.0.3
        node-labels: "pool=worker"

- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-ip: 10.227.0.4
        node-labels: "pool=worker"

- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-ip: 10.227.0.5
        node-labels: "pool=worker"
EOF

cluster two, aka “clab2”:

cat << EOF > clab2_cluster.yaml
kind: Cluster
name: clab2-k8s
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: "10.2.0.0/16"
  serviceSubnet: "10.3.0.0/16"
  kubeProxyMode: "none"
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-ip: 10.237.0.2

- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-ip: 10.237.0.3
        node-labels: "pool=worker"

- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-ip: 10.237.0.4
        node-labels: "pool=worker"

- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-ip: 10.237.0.5
        node-labels: "pool=worker"
EOF

Spin up the clusters:

for i in clab clab2
do
    kind create cluster --config "${i}_cluster.yaml"
done
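
The names of the node containers that Kind creates matter later on, because the Containerlab hosts attach to them by container name. Kind derives them from the cluster name: one control-plane container, and workers numbered from the second one onwards. The kind_node_names function below is a hypothetical helper of mine that just spells out that naming scheme for a single control-plane cluster.

```shell
#!/bin/sh
# kind_node_names: hypothetical helper (not from the original lab).
# Prints the container names Kind uses for a cluster with one control-plane
# node and WORKERS worker nodes.
# Usage: kind_node_names CLUSTER_NAME WORKERS
kind_node_names() {
    echo "$1-control-plane"
    i=1
    while [ "$i" -le "$2" ]; do
        if [ "$i" -eq 1 ]; then
            echo "$1-worker"        # the first worker has no number
        else
            echo "$1-worker$i"
        fi
        i=$((i + 1))
    done
}

kind_node_names clab-k8s 3
```

The real list can be compared against `kind get nodes --name clab-k8s` once the cluster is up.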

Containerlab

Then, at last, what this article is really about: the setup of Containerlab.

The layout was made to prepare for a future installation of Kind Kubernetes clusters (the {int,ext}host{0–3} nodes). Inspiration on how to set up the FRR configuration comes from https://www.sobyte.net/post/2022-09/containerlab-kind-cilium-bgp/. Before looking at that I had more configuration files to keep track of.

Prerequisites

Installation instructions are outlined at https://containerlab.dev/install/, but basically you need a Linux environment with a decent amount of memory (I have a VM with 12G), a recent docker.io and containerlab itself.
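
Before deploying, a quick check that the tools are actually on the PATH can save a confusing error later. The missing_tools function below is a hypothetical helper of mine, not part of the lab. It prints every command name it cannot find.

```shell
#!/bin/sh
# missing_tools: hypothetical helper (not from the original lab).
# Prints every given command name that is not found in PATH.
# Usage: missing_tools docker containerlab kind
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

missing_tools docker containerlab kind
```

An empty output means all three commands were found.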

For the Arista cEOS (the switches) I followed these instructions in order to fetch the image, but there are probably easier ways to do the switching, as I really only wanted interfaces and simple connectivity. A simple netshoot image would probably give an equal result.

cat << EOF > clab-k8s-conntrack-bgp.yml
name: k8s
topology:
  kinds:
    linux:
      cmd: bash
  nodes:
    intsw0:
      kind: ceos
      image: ceos:4.30.2F
      startup-config: ./conf.d/intsw0.cfg
    peersw0:
      kind: ceos
      image: ceos:4.30.2F
      startup-config: ./conf.d/peersw0.cfg
    extsw0:
      kind: ceos
      image: ceos:4.30.2F
      startup-config: ./conf.d/extsw0.cfg
    inthost0:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:clab-k8s-control-plane
      exec:
      - ip addr add 10.227.0.2/24 dev net0
      - ip route add 10.237.0.0/24 via 10.227.0.254
    inthost1:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:clab-k8s-worker
      exec:
      - ip addr add 10.227.0.3/24 dev net0
      - ip route add 10.237.0.0/24 via 10.227.0.254
    inthost2:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:clab-k8s-worker2
      exec:
      - ip addr add 10.227.0.4/24 dev net0
      - ip route add 10.237.0.0/24 via 10.227.0.254
    inthost3:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:clab-k8s-worker3
      exec:
      - ip addr add 10.227.0.5/24 dev net0
      - ip route add 10.237.0.0/24 via 10.227.0.254
    inthost4:
      kind: linux
      image: nicolaka/netshoot:latest
      exec:
      - ip addr add 10.227.0.6/24 dev net0
      - ip route add 10.237.0.0/24 via 10.227.0.254
    exthost0:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:clab2-k8s-control-plane
      exec:
      - ip addr add 10.237.0.2/24 dev net0
    exthost1:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:clab2-k8s-worker
      exec:
      - ip addr add 10.237.0.3/24 dev net0
    exthost2:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:clab2-k8s-worker2
      exec:
      - ip addr add 10.237.0.4/24 dev net0
    exthost3:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:clab2-k8s-worker3
      exec:
      - ip addr add 10.237.0.5/24 dev net0
    exthost4:
      kind: linux
      image: quay.io/solo-io/echo-server
      exec:
      - ip addr add 10.237.0.8/24 dev net0
      - ip route replace default via 10.237.0.253
    extgw0:
      kind: linux
      image: frrouting/frr:v8.2.2
      exec:
        - ip addr add 10.237.0.254/24 dev eth1
        - iptables-restore /etc/iptables.conf
      binds:
        - ./conf.d/extgw0_iptables.conf:/etc/iptables.conf
    intbgp1:
      kind: linux
      image: quay.io/frrouting/frr:8.5.3
      exec:
        - ip addr add 10.223.0.1/24 dev eth2
        - ip addr add 10.224.0.1/24 dev eth3
        - apk add openrc conntrack-tools conntrack-tools-openrc keepalived
        - sysctl -w net.ipv4.ip_nonlocal_bind=1
        - sysctl -w net.ipv4.ip_forward=1
        - sed -i -e 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
        - touch /etc/frr/vtysh.conf
        - /usr/sbin/conntrackd  -d -C  /etc/conntrackd/conntrackd.conf
        - /usr/sbin/keepalived -f /etc/keepalived/keepalived.conf
        - iptables -t nat -A POSTROUTING -s 10.227.0.0/24 -o eth1 -j SNAT --to-source 10.227.0.254
        - iptables -A FORWARD -m state --state RELATED -j ACCEPT
        - iptables -A FORWARD -i eth4 -m state --state ESTABLISHED -j ACCEPT
        - iptables -A FORWARD -i eth1 -m state --state ESTABLISHED -j ACCEPT
        - /usr/lib/frr/frrinit.sh start
      binds:
        - ./conf.d/intbgp1_conntrackd.conf:/etc/conntrackd/conntrackd.conf
        - ./conf.d/intbgp1_frr.conf:/etc/frr/frr.conf
        - ./conf.d/intbgp1_keepalived.conf:/etc/keepalived/keepalived.conf
        - ./conf.d/primary-backup.sh:/etc/conntrackd/primary-backup.sh
        - ./conf.d/intbgp_iptables.conf:/etc/iptables.conf
    intbgp2:
      kind: linux
      image: quay.io/frrouting/frr:8.5.3
      exec:
        - ip addr add 10.223.0.2/24 dev eth2
        - ip addr add 10.224.0.2/24 dev eth3
        - apk add openrc conntrack-tools conntrack-tools-openrc keepalived
        - sysctl -w net.ipv4.ip_nonlocal_bind=1
        - sysctl -w net.ipv4.ip_forward=1
        - sed -i -e 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
        - touch /etc/frr/vtysh.conf
        - /usr/sbin/conntrackd  -d -C  /etc/conntrackd/conntrackd.conf
        - /usr/sbin/keepalived -f /etc/keepalived/keepalived.conf
        - iptables -t nat -A POSTROUTING -s 10.227.0.0/24 -o eth1 -j SNAT --to-source 10.227.0.254
        - iptables -A FORWARD -m state --state RELATED -j ACCEPT
        - iptables -A FORWARD -i eth4 -m state --state ESTABLISHED -j ACCEPT
        - iptables -A FORWARD -i eth1 -m state --state ESTABLISHED -j ACCEPT
        - /usr/lib/frr/frrinit.sh start
      binds:
        - ./conf.d/intbgp2_conntrackd.conf:/etc/conntrackd/conntrackd.conf
        - ./conf.d/intbgp2_frr.conf:/etc/frr/frr.conf
        - ./conf.d/intbgp2_keepalived.conf:/etc/keepalived/keepalived.conf
        - ./conf.d/primary-backup.sh:/etc/conntrackd/primary-backup.sh
        - ./conf.d/intbgp_iptables.conf:/etc/iptables.conf
    extbgp1:
      kind: linux
      image: quay.io/frrouting/frr:8.5.3
      exec:
        - ip addr add 10.237.0.251/24 dev eth2
        - ip addr add 10.234.0.1/24 dev eth3
        - sysctl -w net.ipv4.ip_nonlocal_bind=1
        - sysctl -w net.ipv4.ip_forward=1
        - touch /etc/frr/vtysh.conf
        - sed -i -e 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
        - /usr/lib/frr/frrinit.sh start
      binds:
        - ./conf.d/extbgp1_frr.conf:/etc/frr/frr.conf
    extbgp2:
      kind: linux
      image: quay.io/frrouting/frr:8.5.3
      exec:
        - ip addr add 10.237.0.252/24 dev eth2
        - ip addr add 10.234.0.2/24 dev eth3
        - sysctl -w net.ipv4.ip_nonlocal_bind=1
        - sysctl -w net.ipv4.ip_forward=1
        - touch /etc/frr/vtysh.conf
        - sed -i -e 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
        - /usr/lib/frr/frrinit.sh start
      binds:
        - ./conf.d/extbgp2_frr.conf:/etc/frr/frr.conf
  links:
    - endpoints: ["intsw0:eth2","intbgp1:eth2"]
    - endpoints: ["intsw0:eth3","intbgp1:eth3"]
    - endpoints: ["intsw0:eth4","intbgp1:eth4"]
    - endpoints: ["intsw0:eth5","intbgp2:eth2"]
    - endpoints: ["intsw0:eth6","intbgp2:eth3"]
    - endpoints: ["intsw0:eth7","intbgp2:eth4"]
    - endpoints: ["intsw0:eth8","inthost0:net0"]
    - endpoints: ["intsw0:eth9","inthost1:net0"]
    - endpoints: ["intsw0:eth10","inthost2:net0"]
    - endpoints: ["intsw0:eth11","inthost3:net0"]
    - endpoints: ["intsw0:eth12","inthost4:net0"]
    - endpoints: ["peersw0:eth2","intbgp1:eth1"]
    - endpoints: ["peersw0:eth3","intbgp2:eth1"]
    - endpoints: ["peersw0:eth4","extbgp1:eth1"]
    - endpoints: ["peersw0:eth5","extbgp2:eth1"]
    - endpoints: ["extsw0:eth2","extbgp1:eth2"]
    - endpoints: ["extsw0:eth3","extbgp1:eth3"]
    - endpoints: ["extsw0:eth4","extbgp2:eth2"]
    - endpoints: ["extsw0:eth5","extbgp2:eth3"]
    - endpoints: ["extsw0:eth6","exthost0:net0"]
    - endpoints: ["extsw0:eth7","exthost1:net0"]
    - endpoints: ["extsw0:eth8","exthost2:net0"]
    - endpoints: ["extsw0:eth9","exthost3:net0"]
    - endpoints: ["extsw0:eth10","exthost4:net0"]
    - endpoints: ["extsw0:eth11","extgw0:eth1"]
EOF

Because the configuration above refers to Kind nodes, the Kubernetes clusters need to be set up before starting the deployment (or just comment out the parts referring to them):

sudo -E containerlab deploy -t clab-k8s-conntrack-bgp.yml
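
Before running the deploy command above, it can be handy to confirm that the Kind clusters really exist. The following is only a sketch: the function name missing_clusters and the cluster names in the usage comment are hypothetical placeholders, so substitute the names used in your own topology file. It relies on `kind get clusters` printing one cluster name per line.

```shell
#!/bin/sh
# Sketch: report which expected Kind clusters are not present yet.
# Reads the existing cluster names (one per line) on stdin and takes the
# wanted names as arguments; prints each wanted name that is missing.
missing_clusters() {
  existing=$(cat)
  for name in "$@"; do
    # -x matches the whole line, -F treats the name as a fixed string
    printf '%s\n' "$existing" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}

# Example use (cluster names are placeholders, not taken from the article):
#   kind get clusters | missing_clusters cluster-a cluster-b
```

If the function prints nothing, every listed cluster is already there and the deployment can start.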

The deployment takes a minute or two. Once everything is deployed and has gone as planned, outgoing traffic should be source-NATed to the shared IP:

$ docker exec -it clab-k8s-inthost3 \
> curl --connect-timeout 4 10.237.0.8:8080 | jq '{RemoteAddr}'
{
  "RemoteAddr": "10.227.0.254:37910"
}
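
The same check can be scripted instead of read by eye. This is a minimal sketch, assuming the shared address is 10.227.0.254 as in the SNAT rule of the topology; the helper names remote_ip and is_snatted are made up for illustration.

```shell
#!/bin/sh
# Sketch: decide whether a backend saw the shared SNAT address as the client.
# SHARED_IP is the --to-source address from the SNAT rule in the topology.
SHARED_IP="10.227.0.254"

# Strip the trailing ":port" from an "ip:port" string such as RemoteAddr.
remote_ip() {
  printf '%s\n' "${1%:*}"
}

# Succeed only if the given RemoteAddr value carries the shared IP.
is_snatted() {
  [ "$(remote_ip "$1")" = "$SHARED_IP" ]
}

# Example use against the lab (-it is dropped because no TTY is needed here):
#   addr=$(docker exec clab-k8s-inthost3 curl -s --connect-timeout 4 10.237.0.8:8080 | jq -r '.RemoteAddr')
#   is_snatted "$addr" && echo "SNAT OK" || echo "unexpected source: $addr"
```

The source port changes with every request, which is why only the address part is compared.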

Traceroute looks like this (from a host behind the intbgp routers to a host behind the extbgp routers):

docker exec -it clab-k8s-inthost4 traceroute 10.237.0.8
traceroute to 10.237.0.8 (10.237.0.8), 30 hops max, 46 byte packets
 1  10.227.0.254 (10.227.0.254)  1.775 ms  2.042 ms  1.462 ms
 2  10.0.0.4 (10.0.0.4)  2.973 ms  2.161 ms  1.197 ms
 3  10.237.0.8 (10.237.0.8)  2.314 ms  4.043 ms  3.303 ms
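
The first hop in the trace is the shared address 10.227.0.254, which only one of the two internal routers should hold at any time. A small sketch for finding out which one follows. It assumes the router containers follow the same clab-k8s-<node> naming as the hosts above, and the helper name holds_vip is hypothetical.

```shell
#!/bin/sh
# Sketch: detect whether a router currently has the shared IP configured.
VIP="10.227.0.254"

# Reads `ip -4 addr show` output on stdin; succeeds if the VIP is listed.
# The trailing "/" keeps longer addresses with the same prefix from matching.
holds_vip() {
  grep -qF "inet ${VIP}/"
}

# Example use against the lab (container names are an assumption):
#   for r in intbgp1 intbgp2; do
#     docker exec "clab-k8s-$r" ip -4 addr show | holds_vip && echo "$r holds $VIP"
#   done
```

Exactly one router should be reported; seeing both or neither would point at a keepalived problem.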

Then, as that went smoothly, I began trying out how to let two Kubernetes clusters spin up with Kind (I have to admit that this is my first attempt with Kind, as my bhyve environments have served me rather well), install Cilium, let Cilium peer with the (GoBGP) BGP Control Plane, implement ClusterMesh... well, let Containerlab go for a real spin. But that seems to be a good reason to write another article.

I’ve published the files at my GitHub repo as well:

https://github.com/tnorlin/containerlab-snat-demo

Refs:
https://www.sobyte.net/post/2022-09/containerlab-kind-cilium-bgp/
https://www.linode.com/docs/products/compute/compute-instances/guides/failover-bgp-frr/
https://www.brianlinkletter.com/2021/05/use-containerlab-to-emulate-open-source-routers/
https://satishdotpatel.github.io/ha-with-keepalived-and-conntrackd/