Issue 57: Networking intermittent using Docker
Status:  Fixed
Owner:  briandorsey@google.com
Closed:  Dec 2013


Reported by da...@genomebridge.org, Dec 14, 2013
What steps will reproduce the problem?
1. Use the Debian backports kernel image to create a node. Make sure canForwardIP is on and net.ipv4.ip_forward=1.
2. Install Docker
3. Start a shell in a Docker container (sudo docker run -t -i base bash)
4. Run several commands in that container that use networking (apt-get update; apt-get install python). These commands succeed only about 3 times out of 10.
5. Other networked commands such as gsutil are also intermittent.
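
The forwarding prerequisite from step 1 can be checked on the instance before Docker is started. This is a minimal sketch, assuming a Linux host that exposes the sysctl at /proc/sys/net/ipv4/ip_forward; the function names are hypothetical and not part of the original report.

```python
def parse_ip_forward(text):
    """Interpret the contents of the ip_forward sysctl file."""
    return text.strip() == "1"


def ip_forward_enabled(path="/proc/sys/net/ipv4/ip_forward"):
    """Return True if the kernel reports net.ipv4.ip_forward=1.

    The default path is the standard Linux procfs location
    (assumption: procfs is mounted at /proc).
    """
    with open(path) as f:
        return parse_ip_forward(f.read())
```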

What is the expected output? What do you see instead?
Expected output is that networking works consistently.

What version of the product are you using? On what operating system?
Using Debian 7 as supplied by Google with Backport kernel.

This has been confirmed by more than just me:
https://groups.google.com/d/msg/docker-user/N085Aq3oX5Y/Ee5S-CGP2soJ
Dec 15, 2013
#2 da...@genomebridge.org
A bit deeper: DNS resolves, but we are not able to exchange data. We can ping, but with something as simple as netcat we can open a connection yet never receive data from the remote host. I don't know whether the requests aren't getting out or the responses aren't getting in. As per the GCE instructions, the instance was created with --can_ip_forward=true.

On a GCE instance with docker:
dbernick@docker-playground:~$ sudo docker run -t -i base bash
root@0ced2f7de3ad:/# nc -vv archive.ubuntu.com 80
Connection to archive.ubuntu.com 80 port [tcp/http] succeeded!
GET /ubuntu/
(hangs)

On the GCE instance itself:
dbernick@docker-playground:~$ nc -vv archive.ubuntu.com 80
DNS fwd/rev mismatch: archive.ubuntu.com != danava.canonical.com
DNS fwd/rev mismatch: archive.ubuntu.com != cursa.canonical.com
DNS fwd/rev mismatch: archive.ubuntu.com != zaurac.canonical.com
DNS fwd/rev mismatch: archive.ubuntu.com != obake.canonical.com
DNS fwd/rev mismatch: archive.ubuntu.com != urayuli.canonical.com
DNS fwd/rev mismatch: archive.ubuntu.com != sudice.canonical.com
DNS fwd/rev mismatch: archive.ubuntu.com != ragana.canonical.com
DNS fwd/rev mismatch: archive.ubuntu.com != orobas.canonical.com
archive.ubuntu.com [91.189.92.156] 80 (http) open
GET /ubuntu/
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /ubuntu</title>
 </head>
 <body>
<h1>Index of /ubuntu</h1>
<table><tr><th><img src="/icons/blank.gif" alt="[ICO]"></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th></tr><tr><th colspan="4"><hr></th></tr>
<tr><td valign="top"><img src="/icons/back.gif" alt="[DIR]"></td><td><a href="/">Parent Directory</a></td><td>&nbsp;</td><td align="right">  - </td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="dists/">dists/</a></td><td align="right">12-Dec-2013 16:33  </td><td align="right">  - </td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="indices/">indices/</a></td><td align="right">18-Oct-2013 15:05  </td><td align="right">  - </td></tr>
<tr><td valign="top"><img src="/icons/compressed.gif" alt="[   ]"></td><td><a href="ls-lR.gz">ls-lR.gz</a></td><td align="right">16-Dec-2013 01:36  </td><td align="right"> 11M</td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="pool/">pool/</a></td><td align="right">27-Feb-2010 06:30  </td><td align="right">  - </td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="project/">project/</a></td><td align="right">28-Jun-2013 11:52  </td><td align="right">  - </td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="ubuntu/">ubuntu/</a></td><td align="right">16-Dec-2013 01:47  </td><td align="right">  - </td></tr>
<tr><th colspan="4"><hr></th></tr>
</table>
<address>Apache/2.2.22 (Ubuntu) Server at archive.ubuntu.com Port 80</address>
</body></html>
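
The netcat comparison above can be scripted so that "connects but never receives data" is told apart from "cannot connect at all". This is a rough sketch using only the Python standard library; `probe_http`, its return labels, and the default 5-second timeout are illustrative choices, not something from this thread. It only waits for the first bytes of a response, so a transfer that stalls partway through would still be reported as "ok".

```python
import socket


def probe_http(host, port=80, timeout=5.0):
    """Classify a TCP endpoint as 'no-connect', 'no-data', or 'ok'.

    'no-connect': the TCP connection could not be established.
    'no-data':    connected, but nothing arrived before the timeout
                  (or the peer closed without sending anything).
    'ok':         at least one byte of a response was received.
    Socket errors other than a timeout after connecting propagate.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "no-connect"
    with sock:
        request = b"GET / HTTP/1.0\r\nHost: " + host.encode("ascii") + b"\r\n\r\n"
        try:
            sock.sendall(request)
            data = sock.recv(4096)
        except socket.timeout:
            return "no-data"
    return "ok" if data else "no-data"
```

Running `probe_http("archive.ubuntu.com")` both inside the container and on the instance itself would mirror the two netcat sessions shown above.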
Dec 16, 2013
#3 da...@genomebridge.org
The above Docker instance was started in the exact way that Proppy at Google (Johan Euphrosine) recommends. (http://docs.docker.io/en/master/installation/google/)
Dec 16, 2013
Project Member #4 briandorsey@google.com
David, 

Rather than network, I think you may be running into disk I/O waits due to a small Persistent Disk.

How big is the Persistent Disk you're running from? As of the GA launch, PD I/O scales with disk size. Larger disks are faster. Details here: https://cloud.google.com/developers/articles/compute-engine-disks-price-performance-and-persistence   Can you attach a larger PD (we generally recommend ~500GB) and store the docker files there as a test?

Also, I'll run through the examples and propose doc updates to default to a larger PD.



Status: Accepted
Owner: briandorsey@google.com
Dec 16, 2013
#5 da...@genomebridge.org
We've run the Docker node with /var/lib/docker mounted on a 2 TB disk and still have the same issue. I'm going to run this again just to be 100% sure.
Dec 16, 2013
#6 da...@genomebridge.org
Confirmed. /var/lib/docker symlinks to /docker which is a 2TB Persistent disk (that we use ephemerally). 

sudo docker run -t -i base bash
root@8ec72ac4fcbe:/# apt-get update
Ign http://archive.ubuntu.com quantal InRelease
Hit http://archive.ubuntu.com quantal Release.gpg
Hit http://archive.ubuntu.com quantal Release
Hit http://archive.ubuntu.com quantal/main amd64 Packages
40% [Waiting for headers]
(hangs here -- never finishes). 

Have you been able to make Docker work with consistent networking in GCE?
Dec 16, 2013
#7 da...@genomebridge.org
Our actual boot disk is only 10G (as is standard), but our /var/lib/docker is 2 TB.
Dec 16, 2013
Project Member #8 briandorsey@google.com
I'm starting to work through reproducing this and troubleshooting, but it seems like you're ahead of me. :) 

One thing I plan to try is using a larger boot disk. If you have time, please give it a try as well. Steps below:

 You can manually create a larger one from an image using gcutil:
$ gcutil adddisk --source_image=backports-debian-7 --size_gb=500 docker-test-big

Then add the instance with:
$ gcutil addinstance --disk=docker-test-big,boot docker-test-big

(The boot partition itself will still only be 10GB, so that's all the OS will be able to see, which should be fine for a test. Instructions for repartitioning the root partition here: https://developers.google.com/compute/docs/disks?hl=en#repartitionrootpd)
Dec 16, 2013
#9 da...@genomebridge.org
I followed the above steps. The test disk is 500G, but only 10G is in the root partition; the rest is unpartitioned. I then followed the rest of Proppy's steps. It still hangs.

dbernick@docker-test-big:~$ sudo docker run -t -i base bash
Unable to find image 'base' (tag: latest) locally
Pulling repository base
b750fe79269d: Download complete 
27cf78414709: Download complete 
root@89a0be719f63:/# apt-get update
Ign http://archive.ubuntu.com quantal InRelease
Hit http://archive.ubuntu.com quantal Release.gpg
Hit http://archive.ubuntu.com quantal Release
Hit http://archive.ubuntu.com quantal/main amd64 Packages
40% [Waiting for headers]
(hang)
Dec 16, 2013
Project Member #10 briandorsey@google.com
Hrm... I was just able to run 'apt-get update' successfully using the stock directions (10GB boot disk):

$ sudo docker run -t -i base bash
Unable to find image 'base' (tag: latest) locally
Pulling repository base
b750fe79269d: Download complete
27cf78414709: Download complete
root@09b6f86e289a:/# apt-get update
Ign http://archive.ubuntu.com quantal InRelease
Hit http://archive.ubuntu.com quantal Release.gpg
Hit http://archive.ubuntu.com quantal Release
Hit http://archive.ubuntu.com quantal/main amd64 Packages
Get:1 http://archive.ubuntu.com quantal/universe amd64 Packages [5274 kB]
Get:2 http://archive.ubuntu.com quantal/multiverse amd64 Packages [131 kB]
Get:3 http://archive.ubuntu.com quantal/main Translation-en [660 kB]
Get:4 http://archive.ubuntu.com quantal/multiverse Translation-en [100 kB]
Get:5 http://archive.ubuntu.com quantal/universe Translation-en [3648 kB]
Fetched 9813 kB in 15s (646 kB/s)
Reading package lists... Done

Not sure what is different in my environment compared to yours. :/  

I'm using an n1-standard-1 in us-central1-b. 


Dec 16, 2013
#11 da...@genomebridge.org
It works sometimes, but it's intermittent. It usually fails. That's the issue. I just ran it this way. I'm using an n1-standard-1 in us-central1-a. Should I try b?

docker run -t -i tianon/debian /bin/bash
Unable to find image 'tianon/debian' (tag: latest) locally
Pulling repository tianon/debian
0510dba62421: Download complete 
4f9975c87b56: Download complete 
f2d4b32a0e66: Download complete 
05b866649fa8: Download complete 
f815021ef20d: Download complete 
0ff04e2946d2: Download complete 
b1f77df5b54f: Download complete 
764a25351209: Download complete 
a1390ca6935c: Download complete 
511136ea3c5a: Download complete 
3ead6dd57737: Download complete 
6660520c5eda: Download complete 
root@0b31034f1017:/# apt-get update
Get:1 http://ftp.us.debian.org wheezy Release.gpg [1672 B]
Get:2 http://ftp.us.debian.org wheezy-updates Release.gpg [836 B]   
Get:3 http://ftp.us.debian.org wheezy Release [168 kB]              
Get:4 http://security.debian.org wheezy/updates Release.gpg [836 B]
Get:5 http://ftp.us.debian.org wheezy-updates Release [124 kB]          
Get:6 http://ftp.us.debian.org wheezy/main amd64 Packages [5848 kB]
Get:7 http://ftp.us.debian.org wheezy-updates/main amd64 Packages [2905 B]
100% [Waiting for headers]    1021 kB/s 0s
100% [Waiting for headers]
(hangs)

Dec 16, 2013
#12 da...@genomebridge.org
I just tried it on us-central1-b and it all worked. Can you see if it always works for you in us-central1-a? If it does, then I imagine it's a zone issue.
Dec 16, 2013
Project Member #13 briandorsey@google.com
I see a hang in us-central1-a.  I'll reply back here when I have more info. 
Dec 16, 2013
#14 da...@genomebridge.org
Yay! Neither of us is crazy!

We can absolutely use central1-b if all of our quotas are moved to central1-b.
Dec 16, 2013
Project Member #15 briandorsey@google.com
Intermittent errors strike again. I am no longer convinced that we have a root cause. My example working and broken VMs are now both working. I am working on creating a boot script which can test for this programmatically, rather than interactively. Then we can start running batches of test runs and use stats to narrow things down. 
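
One way to turn the interactive check into a batch test, as described above, is to run the same command repeatedly under a timeout and tally the outcomes. This is only a sketch, not the boot script mentioned in this comment; `run_trials` and its defaults are hypothetical.

```python
import subprocess


def run_trials(cmd, trials=10, timeout=60):
    """Run cmd repeatedly and count successes, failures, and hangs.

    cmd is an argument list as accepted by subprocess.run.
    A run that exceeds `timeout` seconds is killed and counted as 'hung'.
    """
    counts = {"ok": 0, "failed": 0, "hung": 0}
    for _ in range(trials):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            counts["hung"] += 1
            continue
        counts["ok" if result.returncode == 0 else "failed"] += 1
    return counts
```

For example, `run_trials(["sudo", "docker", "run", "base", "apt-get", "update"], trials=20, timeout=120)` would report how many of 20 runs finish, fail, or hang. The timeout only stops the client process that was launched, so a hung container may be left running and need separate cleanup.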
Dec 16, 2013
#16 da...@genomebridge.org
So it's NOT safe to use us-central1-b? Have you seen the error there or only in us-central1-a?
Dec 18, 2013
Project Member #17 briandorsey@google.com
We've found the root cause.

Basically, Docker assumes an MTU of 1500, whereas the eth0 NIC on GCE instances has an MTU of 1460, and that mismatch is hitting a bug in the network virtualization in us-central1-a.

For the short term, a workaround is to add "ifconfig eth0 mtu 1460" to the script that you run inside your Docker container, before it generates any network traffic. Another workaround is  docker run -lxc-conf="lxc.network.mtu = 1460".  We've validated with a run of 10 successful apt-get updates that this workaround appears to correct things.

Longer term, we will fix the bug in the virtualization.
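
The arithmetic behind the mismatch can be made explicit. This is a sketch assuming IPv4 with 20-byte IP and TCP headers (no options) and the Linux sysfs layout for reading an interface MTU; the helper names are hypothetical.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
TCP_HEADER = 20   # bytes, TCP header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Read an interface's MTU from Linux sysfs, e.g. read_mtu("eth0")."""
    with open(f"{sysfs_root}/{iface}/mtu") as f:
        return int(f.read().strip())


def max_tcp_payload(mtu):
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def mtu_overshoot(container_mtu, host_mtu):
    """Bytes by which a full-size container packet exceeds the host MTU."""
    return max(0, container_mtu - host_mtu)
```

With the values from this issue, `mtu_overshoot(1500, 1460)` is 40: a full-size packet from a container at the default MTU is 40 bytes larger than the host's eth0 accepts. On Linux with iputils, `ping -M do -s 1432 <host>` (1432 being `max_ping_payload(1460)`) sends an echo request that exactly fills a 1460-byte MTU with fragmentation prohibited, which can help confirm the path MTU by hand.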
Dec 26, 2013
Project Member #18 briandorsey@google.com
The docker team has added a command line flag to set the MTU for all containers:

docker -d -mtu 1460

You can now set this once, and not need the ifconfig or lxc-conf steps for each container.
Status: Fixed