This repository has been archived by the owner on Oct 11, 2023. It is now read-only.

Rancher fails to pull system dockers in cloud-config.yml during boot #2882

Closed
bf8392 opened this issue Aug 25, 2019 · 30 comments

Comments

@bf8392

bf8392 commented Aug 25, 2019

RancherOS Version: (ros os version)
1.5.4

Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.)
Virtual Box

After several boots, the VM fails to connect and pull the system Docker images. I run into problems especially when I alter user-config.yml under /var/lib/rancher/conf/cloud-config.d. Is there a solution? Or is the file not meant to be edited with vi? Is there another way to alter the cloud-config?

@niusmallnan
Contributor

Can you show me more details? Your steps and cloud-config.

@bf8392
Author

bf8392 commented Aug 26, 2019

Steps were:

  1. Booted RancherOS
  2. Wrote the cloud-config with vi
  3. Installed RancherOS with the ros install -c cloud-config -d command
  4. Waited for the installation to finish
  5. Booted and SSHed into the new RancherOS (everything works smoothly up to here; the config seems to be fully working: additional services are installed, SSH keys work, the partition is expanded, etc.)
  6. Then I tried to alter the file /var/lib/rancher/conf/cloud-config.d/user-config.yml with vi to add new settings to the cloud-config
  7. Saved the file
  8. Validated the file
  9. When validation was OK (nothing is displayed), I rebooted the machine
  10. After the reboot, the console displays that it can't connect to docker/index/v2.io because the network is unreachable

It doesn't matter if I change the adapter for VirtualBox; it's always the same. I can't regain a connection to the new installation.
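
For steps 6 to 9, one way to lower the risk of rebooting into a half-edited file is to edit a copy and install it only once validation passes. The following is a minimal sketch, not an official procedure: the helper name install_if_valid is hypothetical, and it assumes that the validation command (on RancherOS, ros config validate -i followed by the file) exits with a non-zero status for an invalid file, which should be confirmed on the version in use.

```shell
#!/bin/sh
# install_if_valid SRC DST VALIDATOR [ARGS...]
# Hypothetical helper (not part of RancherOS): run VALIDATOR with SRC
# appended as its last argument, and copy SRC over DST only if the
# validator exits with status 0.
install_if_valid() {
  src=$1
  dst=$2
  shift 2
  if "$@" "$src"; then
    cp "$src" "$dst"
  else
    echo "validation failed, leaving $dst untouched" >&2
    return 1
  fi
}

# Intended use on RancherOS (paths taken from this thread, run as root):
#   cp /var/lib/rancher/conf/cloud-config.d/user-config.yml /tmp/user-config.yml
#   vi /tmp/user-config.yml
#   install_if_valid /tmp/user-config.yml \
#     /var/lib/rancher/conf/cloud-config.d/user-config.yml \
#     ros config validate -i
```

The validator is passed in as a parameter so that the copy step never runs on a file the validator rejected. The RancherOS documentation also describes ros config set and ros config merge -i as ways to change the configuration without editing the file by hand.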

@bf8392
Author

bf8392 commented Aug 26, 2019

Here is my cloud-config:

#cloud-config

# set ssh-key
ssh_authorized_keys:
  - ssh-rsa AAAASOMEKEY123

# set hostname
hostname: YourRancher

# THE RANCHER KEY
rancher:

  # resize the device-partition
  resize_device: /dev/sda

  # setup networking
  network:
    dns:
      nameservers:
        - 1.1.1.1
        - 1.0.0.1

  # set system services
  services_include:
    kernel-headers-system-docker: true
    kernel-headers: true
    kernel-extras: true

  # setup custom system-services
  services:
    fail2ban:
      image: crazymax/fail2ban
      container_name: fail2ban
      net: "host"
      cap_add:
        - NET_ADMIN
        - NET_RAW
      volumes:
        - /custom_services_bd/fail2ban/data/:/data
        - /custom_services_bd/fail2ban/log/:/var/log:ro
      restart: always
      labels:
        io.rancher.os.scope: system

@bf8392
Author

bf8392 commented Aug 26, 2019

After booting, it pulls Docker images without problems, and SSH into the machine works. But during boot, no network seems to be available. It seems that RancherOS isn't waiting until the network is up on boots after the first one. Also, the IP address is not displayed in the console.

@bf8392
Author

bf8392 commented Aug 28, 2019

I tried creating a new virtual machine several times and always ran into the same issue: no network on boot to pull custom system services. When I add a script that waits for the network before continuing the boot, the boot hangs and never gets any further. The issue also occurs when I don't alter the config. On first boot, everything seems to work fine, but as soon as I reboot, the issue is there...

@bf8392
Author

bf8392 commented Aug 29, 2019

Anyone? I tried everything and nothing helps =( please help! I love the approach of Rancher/OS and want to use it in all my server-environments...

@Jason-ZW

Jason-ZW commented Aug 30, 2019

@bf8392 I will give it a try. After that, I will give you feedback ASAP.

@Jason-ZW

Jason-ZW commented Aug 30, 2019

Same problem here. On the first reboot no errors occur, but after changing user-config.yml and rebooting, the problem appears.

@bf8392
Author

bf8392 commented Aug 30, 2019

Yes, thank you. I also tried to configure the network like this:

rancher:

  # resize the device-partition
  resize_device: /dev/sda

  # setup networking
  network:
    interfaces:
      eth*:
        dhcp: true
        mtu: 1500
    dns:
      nameservers:
        - 1.1.1.1
        - 1.0.0.1

That doesn't help either. I have tried everything now. The wait-for-the-network workaround described in #2653 (comment) leads to a boot halt: RancherOS reports that it has started, but nothing happens anymore, and nothing is accessible once this code is in place.

@bf8392
Author

bf8392 commented Sep 2, 2019

Any news on this? I still have had no luck trying to configure it...

@Jason-ZW

Jason-ZW commented Sep 3, 2019

Is the DHCP server responding slowly? You'd better check the DHCP server. ROS runs dhcpcd with the --timeout 10 flag, so maybe your lease took more than 10 seconds to be issued.

dhcpcd -MA4 --nohook resolv.conf --timeout 10

You can try changing the default ROS DHCP timeout to 0 (a setting of 0 seconds causes dhcpcd to wait forever to get a lease), for example:

rancher:
  network:
    dhcp_timeout: 0

@bf8392
Author

bf8392 commented Sep 3, 2019

OK, thanks, I'll try it tomorrow :-)

@bf8392
Author

bf8392 commented Sep 4, 2019

That doesn't help :-( I tried it multiple times, but I don't think it's related to the speed of the DHCP server, as the IP address is displayed in the console. I attached the console output. I also tried putting your setting under the interfaces key, which doesn't help either. I also tried it with another router, with no luck.
[Attached image: Rancher Startup]

@Jason-ZW

Jason-ZW commented Sep 5, 2019

Are these two DNS servers (1.0.0.1, 1.1.1.1) correct?

@bf8392
Author

bf8392 commented Sep 5, 2019

Yes: https://1.1.1.1/de/

It also happens when I don't alter the DNS settings.

@niusmallnan
Contributor

There is a race condition here: DHCP may take some time to initialize the networking, and your custom service tries to start before it is ready.

We can try adding some logic to rc.local to ensure that the custom service gets started.

write_files:
  - path: /etc/rc.local
    permissions: "0755"
    owner: root
    content: |
      #!/bin/bash
      ...
      ...
      ## wait dns      
      while ! nslookup docker.io >/dev/null 2>/dev/null; do
        echo "wait for nameserver init"
        sleep 1  
      done
      ros s up <service>     

This approach has been proven to be effective.
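
One caveat with an unbounded loop like the one above: if DNS never becomes available, rc.local never returns. A bounded variant can be sketched as follows. This is a sketch under assumptions, not a tested RancherOS fix: the helper name wait_for, the attempt count, and the delay are illustrative and do not come from RancherOS.

```shell
#!/bin/sh
# wait_for MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of RancherOS): retry COMMAND until it
# succeeds, giving up after MAX_ATTEMPTS tries.
# Returns 0 on success, 1 if every attempt failed.
wait_for() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@" >/dev/null 2>&1; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Intended use inside /etc/rc.local (assumed values: 60 tries, 1 s apart):
#   if wait_for 60 1 nslookup docker.io; then
#     ros s up <service>
#   else
#     echo "nameserver not ready, skipping service start"
#   fi
```

With a cap on the number of attempts, a machine without working DNS still finishes rc.local, at the cost of the service not being started on that boot.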

@bf8392
Author

bf8392 commented Sep 9, 2019

Nope...error persists...also tried it with ros service up network.

@bf8392
Author

bf8392 commented Sep 10, 2019

Can someone reproduce/fix the issue? I tried multiple approaches waiting for Network...but they either result in boot-halt or don't take effect.

@niusmallnan
Contributor

I cannot reproduce this issue.
I even wonder if there is a problem with your networking.

Can you run this script to collect some diagnostic information?
https://github.com/rancher/os/blob/master/scripts/tools/collect_rancheros_info.sh

@bf8392
Author

bf8392 commented Sep 11, 2019

rancheros_export.zip

Done...

@niusmallnan
Contributor

I checked your cloud-config, there is a problem in this part.

# write config files to rancher
write_files:
  - path: /etc/rc.local
    permissions: "0755"
    owner: root
    content: |
      #!/bin/bash
      ## wait dns      
      while ! nslookup docker.io >/dev/null 2>/dev/null; do
        echo "wait for nameserver init"
        sleep 1  
      done
      ros up network 

What did you expect ros up network to do?

You should use ros s up fail2ban.

@bf8392
Author

bf8392 commented Sep 12, 2019

I corrected it, but the error persists. I thought the network was not up fast enough to pull the dockers; that's why I tried ros up network. As I said, everything works on first boot/install, but when I change the cloud-config YAML, this connection error comes up. The network works perfectly fine by the time the Rancher logo comes up in the console. The IP is also displayed, and I can connect via PuTTY using the hostname. @Jason-ZW has the same error...maybe he can reproduce it...

@bf8392
Author

bf8392 commented Sep 17, 2019

Has anybody found a solution? I really want to use this OS because it would be perfect for my servers...

@niusmallnan
Contributor

Which error do you refer to?
This?
ros-sysinit: error .... Error response from daemon... dial udp xxx... connect: network is unreachable ..

If you use the workaround I mentioned, you can ignore that error, because the system will activate the services defined in your rc.local script again once the network is available.

@Jason-ZW

Jason-ZW commented Sep 18, 2019

@bf8392 The solution worked for me. The error messages will still appear, but the system will activate the services you defined in your rc.local script.

# write config files to rancher
write_files:
  - path: /etc/rc.local
    permissions: "0755"
    owner: root
    content: |
      #!/bin/bash
      ## wait dns      
      while ! nslookup docker.io >/dev/null 2>/dev/null; do
        echo "wait for nameserver init"
        sleep 1  
      done
      ros s up <your service> 

Before using the workaround:

[image]

After using the workaround:

[image]

@bf8392
Author

bf8392 commented Sep 19, 2019

Yes, but the problem I have here is that you call Docker separately to get the images you need. For my use case to work, I need the OS to do that completely automatically at startup, so it should work as follows:

  1. Pull the dockers and system dockers defined in cloud-init (if this doesn't work, no system-docker or user-docker container can be updated automatically by Watchtower on startup, as I want)
  2. Start the (custom) service [like fail2ban].

All this must work without the user having to intervene, which would make for a system with almost no maintenance. I thought RancherOS was intended to follow exactly this scheme: pull the dockers in cloud-init automatically. So the error is a problem when I add new custom system services to cloud-init, as they don't get pulled automatically by the system. This means that once the system is installed, I can't add custom system services via the cloud-init file located at /var/lib/rancher/conf/cloud-config.d/user-config.yml. That leaves cloud-init good for installation but useless for configuring or altering the system, as the custom services never apply...

@bf8392
Author

bf8392 commented Sep 19, 2019

Sorry, completely my fault! Your workaround works now! I was so focused on the console message that I completely misunderstood @Jason-ZW's point. The following leads to the solution:

  1. Apply the workaround described in Rancher fails to pull system dockers in cloud-config.yml during boot #2882 (comment). Special thanks to @niusmallnan for working it out with me.

  2. Add the custom system service you need. The system will pull it automatically at startup, after the error message described above, and will report the pull in the console.

  3. Enjoy!

Thanks to all the members in the discussion for working this out with me! I really enjoyed it, and I appreciate your work and patience!

As an enhancement, I would suggest preventing this behaviour out of the box, as it can keep users from adopting this system. It's the best system for Docker I've seen so far, so I hope it spreads ;-).

Kind regards

@bf8392 bf8392 closed this as completed Sep 19, 2019
@niusmallnan
Contributor

Cool.
We will try to fix it in the next release.

@bf8392
Author

bf8392 commented Oct 1, 2019

Update:

The script does not work on the Pi 3. It fails to boot because of ros s up.

System message:

kernel panic: wait for docker timeout. The script leads to a complete boot halt, as ros s up leads to an infinite wait-for-docker.

@kingsd041
Contributor

Tested with RancherOS v1.5.5-rc1.
See #2902 (comment). I rebooted 20 times in a row without this issue recurring.
