
On my server (Ubuntu 18.04 LTS) I am running a KVM virtual machine that had been running fine for over a year, but recently - probably because of some update - the virtual machine started losing connectivity to the network whenever the host is rebooted. I somehow managed to restore connectivity the last two times it happened, but this time I just can't get it to work anymore.

I have read a lot of tutorials and other webpages and it feels like I have tried everything more than once, but - obviously - I must be missing something. There are just too many variables involved at the same time and many of them probably influence each other. So with this question, I'd like to find the best troubleshooting strategy that will allow me (and others) to effectively narrow down the source of connectivity issues as far as possible. More specifically, I'm talking about connectivity issues on KVM virtual machines that are connected via a bridge on Ubuntu 18.04.

I realize that this question has become extremely long, so let me clarify that you can answer it without reading further than this point.

Under the headings below, I mention my most important areas of uncertainty that need to be navigated when troubleshooting network issues, but there is no need to discuss these in any detail in the answers. Take them as possible starting points.

If you prefer to take the config of a specific machine as your point of departure, scroll down to the bottom where I provide such details (under the My Example heading).

netplan

One problem with troubleshooting this on 18.04 is that Ubuntu switched to netplan, which renders a lot of currently available advice obsolete.

The switch to netplan is also a source of confusion in itself because, from what I understand, using netplan means that all network configuration is done in /etc/netplan/*.yaml and no longer in /etc/network/interfaces. Yet when I comment out all content in /etc/network/interfaces, it seems to get written back somehow (possibly by myself via Virtual Machine Manager on the Gnome desktop).

It looks like I'm not the only one frustrated with netplan, and some recommend switching back to ifupdown, but in order to limit the scope of this question, let's stay within netplan and try to fix things without switching back.

NetworkManager vs systemd-networkd

Another difficulty is that there is at least one relevant difference between Ubuntu 18.04 Server and Ubuntu 18.04 Desktop: Server uses systemd-networkd and Desktop uses NetworkManager, which entails different troubleshooting paths. To make things worse: what if you originally installed the server edition but later added the Gnome desktop? (I don't recall what I did, but chances are that that is what I did, because my /etc/netplan/01-netcfg.yaml says renderer: networkd while NetworkManager also seems to be running by default.)
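
For what it's worth, these are the commands I now use to check which of the two daemons is actually in charge of an interface (just a rough sketch; "unmanaged" in the respective output means that tool is leaving the device alone):

systemctl is-active systemd-networkd NetworkManager
nmcli device status      # NetworkManager's view of each device
networkctl list          # systemd-networkd's view of each device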

Fix host or vm?

My third area of uncertainty is whether I should be fixing stuff on the host or on the virtual machine (or when a change on the host also requires a change on the guest). I have so far not paid much attention to the vm, given that it was working fine and I never updated it (except for automatic Ubuntu security updates). But the last time I managed to fix the connectivity problem, I did so by duplicating its hard drive and creating a new (in my eyes identical) vm with it. I guess this confirms that the config inside the vm was fine (since the vm's hardware is configured on the host), but it nevertheless told me that I probably need to pay more attention to the vm's hardware configuration.

Reboot needed to apply changes?

When trying out different fixes, I am also often unsure whether a change takes effect immediately, whether it requires something like netplan apply or a service restart, or whether only a full reboot (of the host, the vm, or both) will reliably apply it.
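
For reference, these are the commands I have been cycling through to apply changes without a full reboot (whether they are always sufficient is exactly what I am unsure about):

sudo netplan try        # applies the config and rolls back unless confirmed
sudo netplan apply
sudo systemctl restart systemd-networkd    # when networkd is the renderer
sudo systemctl restart NetworkManager      # when NetworkManager is the renderer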

Conflicting UIs?

Finally, I realized that it may matter through which UI I make certain changes, because they may be writing these changes to different places (a small grep sketch for finding such overlapping definitions follows this list). I currently have the following interfaces for configuring my vm:

  • command line (mostly used)
  • Virtual Machine Manager (GUI)
  • Wok / Kimchi (web-interface)
  • I also have Webmin with Cloudmin running, but my vm is not showing up there, so I'm not currently using it.
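
This is nothing authoritative, but grepping the usual configuration locations for the bridge name at least shows which of these tools wrote what:

grep -rn "br0" /etc/netplan/ /etc/network/ /etc/NetworkManager/system-connections/ 2>/dev/null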

My example

Although my idea here is to find a somewhat generic troubleshooting strategy, I suppose it is always a good idea to start from a concrete example. So here are some details about my current setup (I will add more if requested in comments):

This is my current /etc/netplan/01-netcfg.yaml (and I have no other yaml files in that directory):

network:
  version: 2
  renderer: NetworkManager
  ethernets:
    enp0s31f6:
      dhcp4: no
  bridges:
    br0:
      interfaces: [enp0s31f6]
      dhcp4: yes
      dhcp6: yes
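
As a sanity check of this file, I run netplan's generator in debug mode; it parses the YAML and writes the backend configuration into /run, so you can see exactly what gets handed to networkd or NetworkManager:

sudo netplan --debug generate
ls /run/systemd/network/ /run/NetworkManager/system-connections/ 2>/dev/null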

The only reason I'm using NetworkManager is that I had been trying so hard with systemd-networkd, without success, that I thought I'd give NetworkManager a chance (but my hunch is that I should be sticking with systemd-networkd). So, accordingly, I set managed=true in my /etc/NetworkManager/NetworkManager.conf, which now looks like this:

[main]
plugins=ifupdown,keyfile

[ifupdown]
managed=true

[device]
wifi.scan-rand-mac-address=no

virsh net-list --all gives me this:

 Name                 State      Autostart     Persistent
----------------------------------------------------------
 br0                  active     yes           yes
 bridged              inactive   yes           yes
 default              active     yes           yes

The bridge I'm trying to use with my vm is br0.
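
For completeness, this is roughly what I would expect virsh net-dumpxml br0 to print for a libvirt network that merely points at an existing host bridge (my hedged understanding of forward mode 'bridge', not verbatim output from my machine):

<network>
  <name>br0</name>
  <forward mode='bridge'/>
  <bridge name='br0'/>
</network>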


Here is the config of br0:

[screenshot of the br0 configuration in Virtual Machine Manager]

The second bridge ("bridged") was an attempt to start over, simply create a new bridge, and connect the vm to that, but adding the bridge had no effect, probably because Virtual Machine Manager seems to write it into /etc/network/interfaces rather than into a yaml file in /etc/netplan/.

Here is my /etc/network/interfaces:

##auto lo br0
##iface lo inet loopback

##auto br1
##iface br1 inet dhcp
bridge_ports enp0s31f6
bridge_stp on
bridge_fd 0.0

##iface br0 inet dhcp
bridge_ports enp0s31f6

auto br0
iface br0 inet dhcp
bridge_ports enp0s31f6
bridge_stp on
bridge_fd 0.0

auto br-kvm
iface br-kvm inet dhcp
bridge_ports enp0s31f6
bridge_stp on
bridge_fd 0.0

Note how I commented out everything (to make sure that this file was not somehow affecting my config), only to have the two stanzas at the bottom added back, as mentioned above.
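
A related sanity check (my assumption being that /etc/network/interfaces only matters if the ifupdown package is actually installed):

dpkg -l ifupdown bridge-utils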

ifconfig gives me a long list of bridges (most named something like br-a5ffb2301edc) and I have no idea where they come from (I guess I unknowingly created them in my countless hours of testing). I won't paste them all here, only br0 and the actual ethernet interface:

br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.4  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::4e52:62ff:fe09:7e59  prefixlen 64  scopeid 0x20<link>
        ether 4c:52:62:09:7e:59  txqueuelen 1000  (Ethernet)
        RX packets 806319  bytes 84505505 (84.5 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 307846  bytes 845321927 (845.3 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp0s31f6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 4c:52:62:09:7e:59  txqueuelen 1000  (Ethernet)
        RX packets 817196  bytes 101316866 (101.3 MB)
        RX errors 0  dropped 13  overruns 0  frame 0
        TX packets 821152  bytes 876709681 (876.7 MB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
        device interrupt 16  memory 0xef000000-ef020000
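
As for the mystery br-* bridges, this is how I list them, and how I delete the ones I am confident are unused (br-a5ffb2301edc just being the example name from above):

ip -br link show type bridge           # compact list of all bridge devices
sudo ip link delete br-a5ffb2301edc    # remove a bridge I know is unused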

This is how I have been testing network connectivity on my vm:

$ ping 8.8.8.8
connect: Network is unreachable
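
Since "Network is unreachable" means the vm has no matching route, I also check inside the vm whether it received an address and a default route at all (ens3 being the interface name from the netplan file below):

ip addr show ens3      # did the interface get an IPv4 address?
ip route               # is there a default route?
sudo dhclient -v ens3  # manually retry DHCP and watch the exchange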

Edit: Here is the content of the vm's /etc/netplan/50-cloud-init.yaml:

network:
    version: 2
#    renderer: networkd
    ethernets:
        ens3:
            addresses: []
            dhcp4: true
            dhcp6: false
            optional: true

I cannot recall why I - months ago - commented out the renderer line (nor do I know which default renderer is assumed now), but this exact config has worked.

I can also mention that it occurred to me that cloud-init might be messing things up for me (on the host), so I checked /var/log/cloud-init-output.log to see whether it was doing anything:

Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~18.04.1 running 'modules:config' at Fri, 21 Feb 2020 02:24:08 +0000. Up 50.91 seconds.
Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~18.04.1 running 'modules:final' at Fri, 21 Feb 2020 02:24:15 +0000. Up 56.59 seconds.
Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~18.04.1 finished at Fri, 21 Feb 2020 02:24:15 +0000. Datasource DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net].  Up 56.76 seconds
Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~18.04.1 running 'init-local' at Fri, 21 Feb 2020 02:59:28 +0000. Up 10.48 seconds.
Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~18.04.1 running 'init' at Fri, 21 Feb 2020 03:04:29 +0000. Up 311.21 seconds.
ci-info: +++++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++++
ci-info: +-----------+-------+------------------------------+---------------+--------+-------------------+
ci-info: |   Device  |   Up  |           Address            |      Mask     | Scope  |     Hw-Address    |
ci-info: +-----------+-------+------------------------------+---------------+--------+-------------------+
ci-info: |   br-kvm  | False |              .               |       .       |   .    | f2:7a:46:82:f9:e0 |
ci-info: |    br0    |  True |         192.168.1.4          | 255.255.255.0 | global | 4c:52:62:09:7e:59 |
ci-info: |    br0    |  True | fe80::4e52:62ff:fe09:7e59/64 |       .       |  link  | 4c:52:62:09:7e:59 |
ci-info: | enp0s31f6 |  True |              .               |       .       |   .    | 4c:52:62:09:7e:59 |
ci-info: |     lo    |  True |          127.0.0.1           |   255.0.0.0   |  host  |         .         |
ci-info: |     lo    |  True |           ::1/128            |       .       |  host  |         .         |
ci-info: +-----------+-------+------------------------------+---------------+--------+-------------------+
ci-info: +++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++
ci-info: +-------+-------------+-------------+---------------+-----------+-------+
ci-info: | Route | Destination |   Gateway   |    Genmask    | Interface | Flags |
ci-info: +-------+-------------+-------------+---------------+-----------+-------+
ci-info: |   0   |   0.0.0.0   | 192.168.1.1 |    0.0.0.0    |    br0    |   UG  |
ci-info: |   1   | 169.254.0.0 |   0.0.0.0   |  255.255.0.0  |    br0    |   U   |
ci-info: |   2   | 192.168.1.0 |   0.0.0.0   | 255.255.255.0 |    br0    |   U   |
ci-info: +-------+-------------+-------------+---------------+-----------+-------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: |   1   |  fe80::/64  |    ::   |    br0    |   U   |
ci-info: |   3   |    local    |    ::   |    br0    |   U   |
ci-info: |   4   |   ff00::/8  |    ::   |    br0    |   U   |
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~18.04.1 running 'modules:config' at Fri, 21 Feb 2020 03:04:33 +0000. Up 315.26 seconds.
Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~18.04.1 running 'modules:final' at Fri, 21 Feb 2020 03:04:39 +0000. Up 321.85 seconds.
Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~18.04.1 finished at Fri, 21 Feb 2020 03:04:40 +0000. Datasource DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net].  Up 322.15 seconds

Seeing that it was active, I disabled it with sudo touch /etc/cloud/cloud-init.disabled. But my connectivity problem is still not solved.
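
(I later read that cloud-init's network handling can also be disabled more surgically, which I mention in case the blanket disable above has side effects; this is my understanding of the documented mechanism, not something I have verified on this host.)

# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
network: {config: disabled}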

Edit 2: Something else I checked (based on this post) is whether the network interface of my virtual machine is still associated with my bridge. To get the name of the interface, I did virsh domiflist LMS (with LMS being the hostname of my vm) and got this:

Interface  Type       Source     Model       MAC
-------------------------------------------------------
vnet0      bridge     br0        virtio      52:54:00:f0:0e:f8

It already says br0 there under Source, but I'm not sure what exactly that means, so I double-checked using brctl show br0, which confirmed that vnet0 is associated with br0:

bridge name  bridge id            STP enabled     interfaces
br0          8000.4c5262097e59    yes             enp0s31f6
                                                  vnet0

I was so hoping to find vnet0 missing so that I could fix it, but unfortunately, that was not the problem either.
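
What I can still do from here is watch the vm's DHCP traffic on the host side of its tap device, to see whether requests leave the vm at all and whether any replies come back:

sudo tcpdump -nni vnet0 port 67 or port 68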

  • This is not an answer to the question but I thought I'd mention it nonetheless: the only way I was able to get the VM connected to the network was to disable netplan and switch back to /etc/network/interfaces as described here: https://askubuntu.com/a/1052023/775670 But I neither understand why it didn't work nor how I could systematically troubleshoot this, so I'm still curious about answers. – Christoph Mar 06 '20 at 22:34

1 Answer


As you have mentioned, this is a long and difficult question, so the first thing I would do is attempt to simplify your configuration.

There is no mention of iptables in your post, but it could be causing your problems. You can review your current rules with iptables -vnL; iptables -t nat -vnL. Alternatively, you can tell the kernel to bypass iptables for bridged traffic:

sysctl net.bridge.bridge-nf-call-iptables=0 net.bridge.bridge-nf-call-ip6tables=0 net.bridge.bridge-nf-call-arptables=0
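
If bypassing bridge-netfilter turns out to help, you can make it persist across reboots with a drop-in like the following (note: these keys only exist once the br_netfilter module is loaded):

# /etc/sysctl.d/99-bridge-nf.conf
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-arptables = 0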

Personally, I hate the extra layer of abstraction that is netplan, and since bridging can be done directly with networkd, I would get rid of netplan.io and NetworkManager and do it all with networkd. Your network obviously has a DHCP server, so you won't need networkd's DHCPServer configuration or dnsmasq. The best wiki for networkd is https://wiki.archlinux.org/index.php/Systemd-networkd - have a full read of it, because there are a few tricks to it, but once you understand them, you can transfer that knowledge to every other major distro.
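
A rough sketch of such a setup, reusing the interface name from your post (file names are arbitrary; only the .netdev/.network suffixes and the /etc/systemd/network/ location matter):

# /etc/systemd/network/10-br0.netdev
[NetDev]
Name=br0
Kind=bridge

# /etc/systemd/network/20-uplink.network
[Match]
Name=enp0s31f6

[Network]
Bridge=br0

# /etc/systemd/network/30-br0.network
[Match]
Name=br0

[Network]
DHCP=ipv4

After creating those files, run systemctl enable --now systemd-networkd and restart the service (or reboot) to pick up changes.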

Troubleshooting networkd isn't that bad once you get the hang of it either:

journalctl -xe | grep networkd

or for full debugging:

mkdir /etc/systemd/system/systemd-networkd.service.d
echo -e "[Service]\nEnvironment=SYSTEMD_LOG_LEVEL=debug" >> /etc/systemd/system/systemd-networkd.service.d/override.conf
systemctl daemon-reload && systemctl restart systemd-networkd

From there, you can troubleshoot with tcpdump -nni br0 to ensure that your VMs are actually sending and receiving traffic, which may not be the case if the virtio driver isn't working well in the guest. The e1000 driver seems to work well everywhere.
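
To try e1000 instead of virtio (using your LMS domain as the example), edit the guest's definition and restart it:

virsh edit LMS
# change the NIC's model element:
#   <model type='virtio'/>  ->  <model type='e1000'/>
virsh destroy LMS && virsh start LMS   # destroy = hard power-off, not deletion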
