
For at least a week, my Ubuntu 18.04 machine has intermittently lost internet access, even though the GUI still shows the Wi-Fi icon as normal.

Interestingly, dig @8.8.8.8 google.com works, but ping google.com does not. Websites in the browser do not load either.
(I intend to update this question with more detailed descriptions of what "does not work" means next time I see the error messages.)

When this happens, usually a dhclient -r wlp0s20f3 will not fix it, but a sudo dhclient wlp0s20f3 will temporarily fix it.

Sometimes that outputs RTNETLINK answers: File exists, and in that case it seems that I (sometimes?) need to use the GUI to turn the Wi-Fi off and on again. Doing the same with ifdown/ifup or sudo ifconfig wlp0s20f3 down/up does not work reliably, but using the GUI does.
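The GUI toggle can probably be reproduced from a terminal with nmcli, which talks to the same NetworkManager daemon. A minimal sketch, assuming NetworkManager manages the Wi-Fi device; toggle_wifi is a hypothetical helper name and the two-second pause is an arbitrary choice. Whether this recovers from the RTNETLINK state the way the GUI does is untested:

```shell
#!/bin/sh
# toggle_wifi is a hypothetical helper name, not part of nmcli.
# It switches the Wi-Fi radio off and on again via NetworkManager.
toggle_wifi() {
    pause=${1:-2}    # seconds to wait between off and on (assumed value)
    nmcli radio wifi off || return 1
    sleep "$pause"
    nmcli radio wifi on
}

# Usage on the affected machine:
#   toggle_wifi && ip route
```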

How can I fix this so that I no longer have to recover from this state manually?

The attempts below list what I've tried and some additional, possibly useful, information. I believe Observation 7 is the most insightful so far, so please scroll down :)

Attempt 1

I found somewhere the suggestion to modify /etc/network/interfaces to look like this:

# interfaces(5) file used by ifup(8) and ifdown(8)
auto lo
iface lo inet loopback

# adding this in the hopes that it will help me avoid
# that issue where I have to run
# sudo dhclient wlp... every time.
auto wlp0s20f3
iface wlp0s20f3 inet dhcp
auto enp0s31f6
iface enp0s31f6 inet dhcp

but that did not seem to help, so I removed those changes again after a reboot.

Attempt 2

This issue seems common [1, 2, 3], but none of the answers explain much. This answer suggests it could be related to /etc/resolv.conf, and this answer talks about checking whether there is a default route.

Indeed, I had no default route (one time) before restarting the wifi. One time the following worked:

# down interface and delete dhcp leases, then up it again
sudo ifdown wlp0s20f3 ; sudo ifconfig wlp0s20f3 down ; sudo rm /var/lib/dhcp/dhclient.* ; sudo ifup wlp0s20f3 ;

# view routes
ip route
# still broken
# try this:
sudo ifconfig wlp0s20f3 down
sudo ifconfig wlp0s20f3 up
ip route
# now it works???

but next time it did not:

generic@motorbrot:~$ echo "bad:" && ip route
bad:
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown 
generic@motorbrot:~$ echo "bad:" && ip route
bad:
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown 
generic@motorbrot:~$ ping 1.1.1.1 -
ping: -: Name or service not known
generic@motorbrot:~$ ping 1.1.1.1 
connect: Network is unreachable
generic@motorbrot:~$ dig @8.8.8.8 google.com
^C
generic@motorbrot:~$ echo "after down:" && ip route
after down:
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown 
generic@motorbrot:~$ echo "after up:" && ip route
after up:
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.0.0/24 dev wlp0s20f3 proto kernel scope link src 192.168.0.37 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown 
generic@motorbrot:~$ echo "after down-rm-up:" && ip route
after down-rm-up:
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.0.0/24 dev wlp0s20f3 proto kernel scope link src 192.168.0.37 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown 
generic@motorbrot:~$ echo "after gui turnoff turnon:" && ip route
after gui turnoff turnon:
default via 192.168.0.1 dev wlp0s20f3 proto dhcp metric 600 
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.0.0/24 dev wlp0s20f3 proto kernel scope link src 192.168.0.37 metric 600 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown

Notice that the final, working, ip route shows the route that initially was not there. So something somehow changed.
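A quick way to tell the two states apart in a script is to look for a line starting with default in the ip route output. A minimal sketch; has_default_route is a hypothetical helper name:

```shell
#!/bin/sh
# Reads `ip route` output on stdin and succeeds (exit 0) only if
# at least one line describes a default route.
has_default_route() {
    grep -q '^default '
}

# Usage on the affected machine:
#   ip route | has_default_route || echo "no default route"
```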

Attempt 3

My /etc/resolv.conf also looks shady every now and then:

# this was the state of the /etc/resolv.conf
# file at the time when my network was currently working after a
# wifi-off-wifi-on action in the gui, but generally had issues
# after some time when I reconnected to a wifi...

domain v.cablecom.net
search v.cablecom.net
nameserver 62.2.17.61
nameserver 62.2.24.158

But I have my own DNS resolver, dnscrypt-proxy, running on localhost, so it should rather be something like

nameserver 127.0.0.1
options edns0

According to my notes, I have had this issue before. This answer suggests adding dns=none to /etc/NetworkManager/NetworkManager.conf, but back then that had no effect until I followed Chris Moore's comment and also ran sudo service network-manager restart.

However, at the current moment, dns=none is set as such in my NetworkManager.conf:

[main]
plugins=ifupdown,keyfile
# Added 30.07.2020 by LucidBrot to avoid /etc/resolv.conf being overwritten and hence breaking the DNS resolving.
dns=none

[ifupdown]
managed=false

[device]
wifi.scan-rand-mac-address=no

I can try to perform the sudo service network-manager restart once more, but I would be surprised if it actually helped.

It is also worth pointing out that my /etc/resolv.conf is a symlink. According to Red Hat, this should also keep NetworkManager from modifying that file. But the file evidently changed anyway, because I kept track of what I had set its contents to.

I do not know what to try next, and I would like to understand what happened, and why, in addition to how to fix it.

generic@motorbrot:/etc$ ls -la | grep resolv
drwxr-xr-x   3 root root        3 Mai  7  2020 resolvconf
lrwxrwxrwx   1 root root       25 Mär 31 10:21 resolv.conf -> /etc/resolv.conf.localdns
-rw-r--r--   1 root root      737 Jul 29  2020 resolv.conf.backup
-rw-r--r--   1 root root       74 Jul 30  2020 resolv.conf.backup2
-rw-r--r--   1 root root      364 Mär 31 10:17 resolv.conf.backup3
-rw-r--r--   1 root root       89 Apr  5 00:06 resolv.conf.localdns
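To see at a glance which file /etc/resolv.conf currently resolves to and which nameservers it lists, something like the following may help. A sketch only; list_nameservers is a hypothetical helper name:

```shell
#!/bin/sh
# Print the nameserver addresses listed in a resolv.conf-style file.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "$1"
}

# Usage on the affected machine:
#   readlink -f /etc/resolv.conf      # final target of the symlink chain
#   list_nameservers /etc/resolv.conf
```

With the local dnscrypt-proxy setup described above, the only address listed should be 127.0.0.1; anything else suggests the file was rewritten.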

Observation 3

It happened again, so I turned the wifi off and on again. Still not working. At this point I ran the following commands:

generic@motorbrot:~$ ip route
default via 192.168.43.68 dev wlp0s20f3 proto dhcp metric 600 
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.43.0/24 dev wlp0s20f3 proto kernel scope link src 192.168.43.143 metric 600 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown 
generic@motorbrot:~$ sudo dhclient wlp0s20f3 
[sudo] password for generic: 
generic@motorbrot:~$ ip route
default via 192.168.43.68 dev wlp0s20f3 
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.43.0/24 dev wlp0s20f3 proto kernel scope link src 192.168.43.143 
192.168.43.0/24 dev wlp0s20f3 proto kernel scope link src 192.168.43.143 metric 600 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown 

We can see that sudo dhclient wlp0s20f3 changed two things: it dropped proto dhcp metric 600 from the default route, and it added a second 192.168.43.0/24 route without a metric. After that, the internet connection works.
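Comparing two route listings by eye is error-prone, so saving each ip route output to a file and diffing the two may be easier. A sketch under the assumption that both snapshots were saved beforehand; route_changes is a hypothetical helper name:

```shell
#!/bin/sh
# Compare two saved `ip route` snapshots (before, after).
# Prints "+ route" for lines only in the second file and
# "- route" for lines only in the first. Trailing whitespace is ignored.
# Assumes the first snapshot is not empty.
route_changes() {
    awk '
        { sub(/[ \t]+$/, "") }
        NR == FNR { before[$0] = 1; next }
        { after[$0] = 1; if (!($0 in before)) print "+ " $0 }
        END { for (r in before) if (!(r in after)) print "- " r }
    ' "$1" "$2"
}

# Usage on the affected machine:
#   ip route > /tmp/routes.before
#   sudo dhclient wlp0s20f3
#   ip route > /tmp/routes.after
#   route_changes /tmp/routes.before /tmp/routes.after
```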

NetworkManager or systemd-networkd

A comment suggests there might be different config methods conflicting. I believe I am using NetworkManager, and I believe this output supports that belief:

generic@motorbrot:~$ systemctl list-unit-files | grep networkd
networkd-dispatcher.service                                            enabled  
systemd-networkd-wait-online.service                                   disabled 
systemd-networkd.service                                               disabled 
systemd-networkd.socket                                                disabled 
generic@motorbrot:~$ systemctl list-unit-files | grep NetworkManager
NetworkManager-dispatcher.service                                      enabled  
NetworkManager-wait-online.service                                     enabled  
NetworkManager.service     
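Asking systemd directly for each unit's state may be more reliable than grepping the unit-file list, because it also shows whether a disabled unit is nevertheless running. A sketch; report_units is a hypothetical helper name:

```shell
#!/bin/sh
# Print "<unit>: <active state>, <enablement state>" for each unit given.
report_units() {
    for unit in "$@"; do
        # Both subcommands exit non-zero for inactive or disabled units,
        # so failures are ignored and only the printed word is kept.
        active=$(systemctl is-active "$unit" 2>/dev/null) || true
        enabled=$(systemctl is-enabled "$unit" 2>/dev/null) || true
        printf '%s: %s, %s\n' "$unit" "${active:-unknown}" "${enabled:-unknown}"
    done
}

# Usage on the affected machine:
#   report_units NetworkManager.service systemd-networkd.service
```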

Observation 4

Right now I had the problem that the gui thought I was connected, but even dig @8.8.8.8 google.com did not work. So I suspect I have multiple issues at once.

There was no default route at that time. I used the gui to turn wifi off and on again and now the connection worked again, with a default route present:

# before restarting wifi:
generic@motorbrot:~$ ip route
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown

# after restarting wifi:
generic@motorbrot:~$ ip route
default via 192.168.0.1 dev wlp0s20f3 proto dhcp metric 600
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.0.0/24 dev wlp0s20f3 proto kernel scope link src 192.168.0.37 metric 600
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown

I found some answers [5, 6] mentioning /etc/NetworkManager/NetworkManager.conf when searching again for the problem of a missing default route. On my laptop, it contains managed=false. It seems like this should be true instead, so I changed it for now. However, these answers seem themselves unsure whether this should be managed=true or managed=false...

[main]
plugins=ifupdown,keyfile
# Added 30.07.2020 by LucidBrot to avoid /etc/resolv.conf being overwritten and hence breaking the DNS resolving.
dns=none

[ifupdown]
managed=true

[device]
wifi.scan-rand-mac-address=no

The answers are saying that requires a service network-manager restart, which I'm doing now. I did a systemctl restart NetworkManager and fascinatingly, my default route is now gone, but the internet connection is still working. An empty line in my routes disappeared.

generic@motorbrot:~$ systemctl status NetworkManager
● NetworkManager.service - Network Manager
   Loaded: loaded (/lib/systemd/system/NetworkManager.service; enabled; vendor p
   Active: active (running) since Tue 2022-04-05 00:12:28 CEST; 1 weeks 0 days a
     Docs: man:NetworkManager(8)
 Main PID: 16747 (NetworkManager)
    Tasks: 4 (limit: 4915)
   CGroup: /system.slice/NetworkManager.service
           ├─16747 /usr/sbin/NetworkManager --no-daemon
           └─32449 /sbin/dhclient -d -q -sf /usr/lib/NetworkManager/nm-dhcp-help
generic@motorbrot:~$ ip route
default via 192.168.0.1 dev wlp0s20f3 proto dhcp metric 600 
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.0.0/24 dev wlp0s20f3 proto kernel scope link src 192.168.0.37 metric 600 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown 
generic@motorbrot:~$ systemctl restart NetworkManager
generic@motorbrot:~$ ip route
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown 

This did not stop the missing default route issue from happening, though. That issue is temporarily fixed by turning the Wi-Fi off and on again in the GUI, but not by sudo dhclient wlp0s20f3.

Since it seemed to have no observable effect, I have soon changed this back to managed=false.

Observation 5

I think my suspicion is confirmed. After this change I now had a default route on my hotspot but still some issues.

  • websites not loading, domains not resolving with ping
  • Telegram worked
  • dig @8.8.8.8 google.com resolving correctly
  • dig google.com not resolving

So it would have to be an issue with either my local dns resolver or some other networking issue.
The routes looked this way:

generic@motorbrot:~$ ip route
default via 192.168.43.143 dev wlp0s20f3 proto dhcp metric 600 
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.43.0/24 dev wlp0s20f3 proto kernel scope link src 192.168.43.144 metric 600 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown

generic@motorbrot:~$ ping google.com
^C
generic@motorbrot:~$ dig google.com

; <<>> DiG 9.11.3-1ubuntu1.17-Ubuntu <<>> google.com
;; global options: +cmd
;; connection timed out; no servers could be reached
generic@motorbrot:~$ dig @8.8.8.8 google.com

; <<>> DiG 9.11.3-1ubuntu1.17-Ubuntu <<>> @8.8.8.8 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17464
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;google.com. IN A

;; ANSWER SECTION:
google.com. 59 IN A 142.250.203.110

;; Query time: 44 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Wed Apr 13 09:01:30 CEST 2022
;; MSG SIZE rcvd: 55

To get my local DoH temporarily working again, sudo dhclient -r wlp0s20f3 did the trick once again.
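The pattern in this observation (dig @8.8.8.8 answers, plain dig does not) can be checked in one go. A rough sketch; dns_triage is a hypothetical helper name, the wording of the messages is mine, and the two-second timeout is an arbitrary choice:

```shell
#!/bin/sh
# Ask the configured resolver first, then 8.8.8.8, and report which answers.
# Lines starting with ";" are dig status messages (e.g. timeouts), not answers.
dns_triage() {
    name=${1:-google.com}
    if dig +short +time=2 +tries=1 "$name" | grep -q '^[^;]'; then
        echo "default resolver answers"
    elif dig +short +time=2 +tries=1 @8.8.8.8 "$name" | grep -q '^[^;]'; then
        echo "only 8.8.8.8 answers: suspect /etc/resolv.conf or the local resolver"
    else
        echo "no DNS answers: suspect routing or connectivity"
    fi
}

# Usage on the affected machine:
#   dns_triage google.com
```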

Observation 6

systemctl status systemd-resolved revealed that it was loaded, disabled, and active (running).

It should indeed be disabled, because I am using dnscrypt-proxy as a local stub and don't need systemd-resolved. But it should not be running... I don't know why it was, but I have stopped it again now.

I have now also deleted my /etc/network/interfaces file, since this answer indicates that I do not want it. It would be used by ifupdown but I am using network-manager.

Observation 7

Following this answer, I have set up auditing for the file my /etc/resolv.conf symlink is pointing towards.

sudo apt install auditd
sudo systemctl status auditd
# shows it is running and enabled
# Set up a rule to watch the file
# and use an arbitrary key for later grepping it:
sudo auditctl -w /etc/resolv.conf.localdns -p wa -k lb_dhclient_issue
# list rules
sudo auditctl -l
# to remove the watch, use the same command but with -W instead of -w and match each other field in the rule.
# i.e.
# sudo auditctl -W /etc/resolv.conf.localdns -p wa -k lb_dhclient_issue

Very soon after, I already see activity on that file:

sudo ausearch -f /etc/resolv.conf.localdns --format text
At 13:47:15 25.04.2022 generic, acting as root, successfully renamed /etc/resolv.conf.localdns.dhclient-new.13892 to /etc/resolv.conf.localdns using /bin/mv
At 13:49:39 25.04.2022 generic, acting as root, successfully renamed /etc/resolv.conf.localdns.dhclient-new.15462 to /etc/resolv.conf.localdns using /bin/mv
At 13:53:08 25.04.2022 generic, acting as root, successfully renamed /etc/resolv.conf.localdns.dhclient-new.17715 to /etc/resolv.conf.localdns using /bin/mv
At 13:56:52 25.04.2022 generic, acting as root, successfully renamed /etc/resolv.conf.localdns.dhclient-new.20232 to /etc/resolv.conf.localdns using /bin/mv
At 13:59:51 25.04.2022 generic, acting as root, successfully renamed /etc/resolv.conf.localdns.dhclient-new.22822 to /etc/resolv.conf.localdns using /bin/mv

Roughly every three minutes, some process under my username (generic) acts as root to move a file to /etc/resolv.conf.localdns. And the source is /etc/resolv.conf.localdns.dhclient-new.22822, which indicates that dhclient is the culprit.
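To summarise which programs touch the file without reading every line, the text-format output can be grouped by its last field, which is the executable path in the lines above. A sketch; count_writers is a hypothetical helper name:

```shell
#!/bin/sh
# Reads `ausearch --format text` lines on stdin and prints, for each
# executable named after "using", how many events it was responsible for.
count_writers() {
    awk '/ using / { n[$NF]++ } END { for (p in n) print n[p], p }'
}

# Usage on the affected machine:
#   sudo ausearch -f /etc/resolv.conf.localdns --format text | count_writers
```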

I guess I could use chattr +i /etc/resolv.conf to make it un-editable, but that seems like a dirty approach. For now, I am doing that, and it seems to successfully prevent dhclient from changing the file, but I would like to understand what went wrong and how to avoid the same issue in the future, ideally with a cleaner fix.
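A commonly suggested alternative to chattr is a dhclient enter hook that redefines make_resolv_conf as a no-op. dhclient-script(8) describes make_resolv_conf as the shell function that writes /etc/resolv.conf and notes that it can be redefined in an enter hook. A minimal sketch, assuming the stock dhclient-script is what runs here; the file name keep-resolv-conf is made up, and if another hook in the same directory also redefines the function, the order in which hooks are sourced may matter:

```shell
#!/bin/sh
# Hypothetical file: /etc/dhcp/dhclient-enter-hooks.d/keep-resolv-conf
# Enter hooks are sourced by dhclient-script before it acts on a lease,
# so this definition replaces the function that rewrites /etc/resolv.conf.
make_resolv_conf() {
    :   # do nothing: leave /etc/resolv.conf as it is
}
```

This would only affect dhclient instances that run dhclient-script. The dhclient started by NetworkManager in the status output above appears to be given a different script via -sf, so the hook would apply to a manually started sudo dhclient wlp0s20f3 rather than to NetworkManager's own client.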

Also, I don't really understand why manually running dhclient helped me. I guess that was the problem with the missing default route, which has not been appearing anymore in a while now.

lucidbrot
  • At first, up to "how to fix this", it just looks like you have a DNS problem... If you put google.com's IP (and it is not 8.8.8.8, btw) or yahoo.com's in a browser, you may not get the page to load, but you should get a response from the server indicating that it is Google (or Yahoo). But the fact that you can "dig" proves that, which means you do have an IP during these times. But are you sure you're getting that IP from your router? Are you sure you are on your Wi-Fi's access point all the time and not your neighbor's? ...getting the correct/expected info from the DHCP server. I see 2 different gateways – WU-TANG Apr 10 '22 at 02:56
  • What is your default gateway's IP? "/etc/network/interfaces", "NetworkManager", "dhclient", "my own dnscrypt-proxy". It seems you have tinkered with a bunch of conflicting configs/methods, and it's hard to follow what's actually controlling your connectivity. I would start by seeing if this problem happens in a live CD/USB session. Then you'll know if it's a config or an external problem. Then I would plan on resetting (removing all) those configs back to basics. Maybe use just NetworkManager and nothing else (Ubuntu's default DNS) until you identify the problem/fix. – WU-TANG Apr 10 '22 at 03:21
  • @WU-TANG Thanks a lot for your inputs! Yes, I have tinkered with several things, yes that makes it hard to follow, but they worked well for about a year. It happens in several wifi networks at several locations, so I do believe that my config is to blame and not an external factor. My default gateway (which means, the ip of my router, right?) at home should be 192.168.0.1 but some of the excerpts here are from my phone hotspot or the university network instead. – lucidbrot Apr 10 '22 at 12:00
  • Btw, of the things you listed in your comment, I only consciously did the dnscrypt-proxy setup on purpose (for a local DoH resolver stub). The rest I only started to touch now because things were not working anymore, all of a sudden. – lucidbrot Apr 10 '22 at 12:02
  • Your guess that different conflicting config methods are in play sounds reasonable. Do you happen to know a good resource on which ones conflict and how to deal with them? I have read this but don't feel much wiser – lucidbrot Apr 10 '22 at 12:09

1 Answer


After making the /etc/resolv.conf file immutable using chattr +i /etc/resolv.conf, dhclient stopped modifying my file because it failed to do so, but it did not stop trying to. That was visible in the auditd logs.

However, at some point today I tried to fix some other problems and also performed

  • an apt upgrade and apt autoremove that also added and removed some kernel headers
  • a reboot to windows, where I used lenovo vantage to update a big number of drivers and the BIOS

Although a normal reboot had not helped at all so far, the combination of those things seems to have stopped dhclient from trying. My audit rules now only report my manual attempts to change the file, no longer any failures by dhclient. The last dhclient failure happened before those two steps.

So it seems like the issue was likely introduced by a kernel upgrade, and fixed by another one.


Edit, 2 May 2022: This is no longer true. This morning, the issue was not present. Now it has happened again, without any reboot in between.

My initial workaround of using chattr to make the file immutable was no longer present (maybe I had removed it again once the audit showed that the dhclient stopped trying) and my symlink from /etc/resolv.conf to /etc/resolv.conf.localdns was gone. The file contained wrong values for the current network (based on the ISP of the network I was at before). Manually fixing the file and setting immutability again fixed it again ... for now.

It seems that Cisco Anyconnect is also meddling in this affair! After setting up the audit logs as explained in the question, I now see this when I use it to connect:

At 18:19:09 02.05.2022 system, acting as root, unsuccessfully opened-file /etc/resolv.conf using /opt/cisco/anyconnect/bin/vpnagentd
At 18:19:09 02.05.2022 system, acting as root, unsuccessfully renamed /etc/resolv.conf.vpnbackup using /opt/cisco/anyconnect/bin/vpnagentd
At 18:19:09 02.05.2022 system, acting as root, unsuccessfully changed-file-ownership-of /etc/resolv.conf to root using /opt/cisco/anyconnect/bin/vpnagentd
At 18:19:09 02.05.2022 system, acting as root, unsuccessfully renamed /etc/resolv.conf.vpnbackup using /opt/cisco/anyconnect/bin/vpnagentd
At 18:19:10 02.05.2022 system, acting as root, unsuccessfully changed-file-ownership-of /etc/resolv.conf to root using /opt/cisco/anyconnect/bin/vpnagentd
At 18:19:10 02.05.2022 system, acting as root, unsuccessfully renamed /etc/resolv.conf.vpnbackup using /opt/cisco/anyconnect/bin/vpnagentd
At 18:19:10 02.05.2022 system, acting as root, unsuccessfully changed-file-ownership-of /etc/resolv.conf to root using /opt/cisco/anyconnect/bin/vpnagentd

So it is possible that Cisco Anyconnect sometimes renames resolv.conf to /etc/resolv.conf.vpnbackup and then, for some reason, does not restore it after losing the connection... My current "fix" with chattr means that I cannot connect to the VPN. It seems this is a known problem.

lucidbrot