
net: DialTimeout causes thread exhaustion #5625

Closed
gopherbot opened this issue Jun 3, 2013 · 18 comments
Milestone: Go1.2

Comments

@gopherbot

by paul@adeven.com:

there seems to be a severe bug in the latest Go http client. we had production crashes
of a web-crawler-type program and had to revert to Go 1.0.3.
this issue seems to be pretty serious, since it basically breaks the whole http client.

What steps will reproduce the problem?
If possible, include a link to a program on play.golang.org.
1. http://play.golang.org/p/mF5AV2AzbB

What is the expected output?
nothing

What do you see instead?
http://pastebin.com/CRwUMhaM

Which compiler are you using (5g, 6g, 8g, gccgo)?
go run (version 1.1)

Which operating system are you using?
tested under Mac OS and Linux; in both cases 1.0.3 works just fine

Which version are you using?  (run 'go version')
1.1

Attachments:

  1. crash.log (466386 bytes)
@minux
Member

minux commented Jun 3, 2013

Comment 1:

please don't ignore the error from http.Get.
for one thing, your program shouldn't try to close the body when err != nil; the response is nil in that case.

Status changed to WaitingForReply.
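
For reference, a minimal sketch of the handling minux describes; the fetch helper and the URL are illustrative, not from the thread:

package main

import (
    "io"
    "io/ioutil"
    "log"
    "net/http"
)

// fetch checks the error before touching the response and closes the
// body only when the request actually succeeded.
func fetch(url string) error {
    resp, err := http.Get(url)
    if err != nil {
        return err // resp is nil on error, so there is no body to close
    }
    defer resp.Body.Close()
    // Drain the body so the underlying connection can be reused.
    _, err = io.Copy(ioutil.Discard, resp.Body)
    return err
}

func main() {
    if err := fetch("http://example.com/"); err != nil {
        log.Println("get failed:", err)
    }
}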

@gopherbot
Author

Comment 2 by paul@adeven.com:

you're right. if i catch the errors this actually runs. the problem is that we did find
a bug in production, but it seems to be a lot harder to pin down.
this was just my first try to "capture" the problem.
the actual production dump is attached. we make a lot of calls to a lot of different
customer http APIs.
some of them are https, and some have really long timeouts and connection interruptions.
we had no problem with 1.0.3, but 1.1 does crash (and yes, in production we do not
ignore errors; this was just me trying to write a quick demo hack)

Attachments:

  1. paul.log (1114381 bytes)

@gopherbot
Author

Comment 3 by paul@adeven.com:

i believe this could be related to
https://code.google.com/p/go/source/detail?r=d4e1ec84876c
i will attempt to extract the issue from our production env and deliver something
reproducible.

@gopherbot
Author

Comment 4 by jgrahamc:

I am interested in following this, as we have been unable to move to Go 1.1 because of a
problem where the RSS of a long-running Go program grows until the OOM killer stops the
process. We have been trying to isolate a small test case.

@gopherbot
Author

Comment 5 by paul@adeven.com:

pretty much the same thing... it works normally for a while, then suddenly stops making
any calls, and some moments later consumes huge amounts of RAM and gets killed.
we saw that the fewer http.Client instances we used, the faster the bug would occur.
so far we are still trying to make a reproducible case. we can reliably crash it in
production, but not with a test so far.

@davecheney
Contributor

Comment 6:

@paul. The reason your code crashed is that there were over 500 threads doing DNS lookups.
Because Go uses the host's DNS resolver by default on Linux, each of these lookups
consumes a thread: cgo calls always run on an operating system thread, not in a goroutine.
This exhausted the number of threads allowed by the ulimit of the user you were
running as. I would *not* recommend raising this limit as a stopgap; that will only
make things worse, as the system resolver will start failing once you have more than
1024 FDs open in your program.
Is the code for this custom transport available?
        /usr/lib/go/src/pkg/net/dial.go:138 +0xa5
github.com/adeven/adjust_backend/callback_worker.dialWithTimeout(0x76e010, 0x3, 0xc263588b80, 0x14, 0x0, ...)
        /home/callback_worker/app/releases/20130525213259/callback_worker/run/.go/src/github.com/adeven/adjust_backend/callback_worker/callback_consumer.go:96 +0x196
net/http.(*Transport).dial(0xc2002d4f80, 0x76e010, 0x3, 0xc263588b80, 0x14, ...)
        /usr/lib/go/src/pkg/net/http/transport.go:382 +0x87
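
One caller-side stopgap, not suggested in the thread: bound the number of in-flight dials with a semaphore so the lookup threads can never exceed a fixed cap. A sketch, assuming a cap of 100 is acceptable:

package main

import (
    "net"
    "time"
)

// dialSem caps the number of concurrent dials, and therefore the number
// of OS threads tied up in cgo DNS lookups. 100 is an arbitrary
// illustrative value, well under a typical 1024 thread/FD ulimit.
var dialSem = make(chan struct{}, 100)

func boundedDial(network, address string) (net.Conn, error) {
    dialSem <- struct{}{}        // acquire a slot; blocks while 100 dials are in flight
    defer func() { <-dialSem }() // release the slot when the dial returns
    return net.DialTimeout(network, address, 5*time.Second)
}

func main() {
    if conn, err := boundedDial("tcp", "example.com:80"); err == nil {
        conn.Close()
    }
}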

@gopherbot
Author

Comment 7 by paul@adeven.com:

hi, the dial is simply:

func dialWithTimeout(network, address string) (connection net.Conn, err error) {
    connection, err = net.DialTimeout(network, address, requestTimeout)
    return
}

where the timeout is 5s. we iterated through a lot of different versions of this code
because we thought the problem was in our code. in the end even plain http.Get crashed,
so we downgraded again.
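
For context, a dial function like this is presumably wired into the client via http.Transport's Dial field; a self-contained sketch (the URL is illustrative):

package main

import (
    "net"
    "net/http"
    "time"
)

const requestTimeout = 5 * time.Second // the 5s value mentioned above

// dialWithTimeout matches the signature expected by http.Transport.Dial.
func dialWithTimeout(network, address string) (net.Conn, error) {
    return net.DialTimeout(network, address, requestTimeout)
}

func main() {
    client := &http.Client{
        Transport: &http.Transport{Dial: dialWithTimeout},
    }
    resp, err := client.Get("http://example.com/")
    if err != nil {
        return
    }
    resp.Body.Close()
}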

@davecheney
Contributor

Comment 8:

How many dials per second are you doing? Can you resolve and cache `address` once, then
reuse it?

@gopherbot
Author

Comment 10 by paul@adeven.com:

around 100-300/sec. theoretically the address should be cacheable, as we usually query
<10 hostnames.

@davecheney
Contributor

Comment 11:

I think you are going to have to cache the address (see the sketch after this comment).
If you read the comment inside net/lookup.go, there is no throttling, and each lookup is
paired with a timeout thread, so if there is even a tiny blip in your DNS server, or a
name is slow to resolve, you'll blow your thread limit quickly.
Alternatively you could:
a. compile with cgo disabled (this is my recommendation), or
b. switch to github.com/miekg/dns for DNS resolution, then pass a *net.TCPAddr to
net.DialTCP.
Marking as accepted, but leaving priority as triage for now; I do not know the correct
priority to apply to this.

Status changed to Accepted.
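
A minimal sketch of the caching approach, using the standard resolver rather than miekg/dns; cachedDial is a hypothetical helper, and the cache has no TTL handling, which a real implementation would need:

package main

import (
    "net"
    "sync"
    "time"
)

var (
    mu    sync.Mutex
    cache = map[string]*net.TCPAddr{} // "host:port" -> resolved address
)

// cachedDial resolves each address at most once and dials the cached
// result afterwards, so the resolver is hit once per hostname rather
// than once per request.
func cachedDial(address string, timeout time.Duration) (net.Conn, error) {
    mu.Lock()
    addr, ok := cache[address]
    mu.Unlock()
    if !ok {
        var err error
        addr, err = net.ResolveTCPAddr("tcp", address)
        if err != nil {
            return nil, err
        }
        mu.Lock()
        cache[address] = addr
        mu.Unlock()
    }
    // Dialing an "ip:port" string performs no DNS lookup.
    return net.DialTimeout("tcp", addr.String(), timeout)
}

func main() {
    if conn, err := cachedDial("example.com:80", 5*time.Second); err == nil {
        conn.Close()
    }
}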

@gopherbot
Author

Comment 12 by paul@adeven.com:

ok, this may be a stupid question, but how does one disable cgo?
and does it have any other downsides?

@davecheney
Contributor

Comment 13:

CGO_ENABLED=0 ./all.bash
will build a version of Go with cgo disabled.
The drawback is that the parts of the standard library that use cgo will be disabled, or
will work with reduced or altered functionality. For example, the net package will switch
to a native Go DNS resolver, which shouldn't be a problem unless you need more esoteric
resolvers (e.g. nsswitch). The os/user package will also stop working.
Obviously you will not be able to use packages that require cgo.
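
(For what it's worth: on later Go releases the switch can also be applied per build, e.g. CGO_ENABLED=0 go build, without rebuilding the whole toolchain.)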

@gopherbot
Author

Comment 14 by paul@adeven.com:

ok, thanks a lot so far; we'll see which solution is best for us.
but why exactly does Go 1.0.3 work like a charm while 1.1 has this issue?

@davecheney
Contributor

Comment 15:

Looking at the implementation in 1.0.3, I cannot see why you did not see the same
problem. Possibly there is some additional blocking in 1.0.x which slowed your accept or
dial rate down. You could try hitting your program with SIGQUIT under the same load and
counting the number of goroutines waiting in [syscall].
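
(For reference: a Go program dumps all goroutine stacks to stderr when it receives SIGQUIT, e.g. via kill -QUIT <pid>, so counting the lines containing [syscall] in the captured dump gives that number.)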

@rsc
Contributor

rsc commented Jun 4, 2013

Comment 16:

We should at least have a cache of in-flight lookups, so that 100 simultaneous dials of
one host name don't do the work 100 times. That's easy and (assuming we forget the answer
once they all get it) doesn't pose any consistency problems. It just merges simultaneous work.
The rest is issue #4056.

Labels changed: added priority-later, go1.2, removed priority-triage.
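
A minimal sketch of that merging; the idea is the same one later packaged as golang.org/x/sync/singleflight, and lookupHost here is a hypothetical stand-in, not the net package's implementation:

package main

import (
    "fmt"
    "net"
    "sync"
)

// call tracks one in-flight lookup; duplicate callers wait on wg and
// then read the shared result.
type call struct {
    wg    sync.WaitGroup
    addrs []string
    err   error
}

var (
    mu       sync.Mutex
    inflight = map[string]*call{}
)

// lookupHost merges simultaneous lookups of the same host: the first
// caller does the work, later callers wait for its answer. The entry
// is deleted once the answer arrives, so nothing is cached and no
// consistency question arises, exactly the property described above.
func lookupHost(host string) ([]string, error) {
    mu.Lock()
    if c, ok := inflight[host]; ok {
        mu.Unlock()
        c.wg.Wait() // someone else is already asking; share their answer
        return c.addrs, c.err
    }
    c := new(call)
    c.wg.Add(1)
    inflight[host] = c
    mu.Unlock()

    c.addrs, c.err = net.LookupHost(host)

    mu.Lock()
    delete(inflight, host) // forget the answer: merge work, don't cache
    mu.Unlock()
    c.wg.Done()

    return c.addrs, c.err
}

func main() {
    addrs, err := lookupHost("example.com")
    fmt.Println(addrs, err)
}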

@bradfitz
Contributor

Comment 17:

This issue was updated by revision 61d3b2d.

R=golang-dev, iant, cespare, rsc, dave, rogpeppe, remyoudompheng
CC=golang-dev
https://golang.org/cl/10079043

@rsc
Contributor

rsc commented Jul 30, 2013

Comment 18:

Labels changed: added feature.

@rsc
Contributor

rsc commented Aug 17, 2013

Comment 19:

This issue was closed by revision 1d3efd6.

Status changed to Fixed.

@rsc rsc added this to the Go1.2 milestone Apr 14, 2015
@rsc rsc removed the go1.2 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 24, 2016