Issue 90675: Git hanging a lot more with Gerrit move.
5 people starred this issue and may be notified of changes.
 
Project Member Reported by sosa@chromium.org, Jul 27, 2011
We got bit by this again on the CrOS waterfall.  I understand that Gerrit/JGit do not perform very well under heavy load via SSH (as opposed to the HTTP mirrors), but is there any way we can improve the infrastructure so that these failures don't end up as Git hangs?  As it stands, automated tooling has to run all git-over-ssh commands in wrappers with timeouts + retries (not just the latter).
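For what it's worth, such a wrapper can be small. A minimal sketch (the function name, retry count, and per-attempt limit are illustrative, not from any existing tool), using coreutils `timeout`:

```shell
#!/bin/sh
# Retry a command under a hard per-attempt timeout.
# Usage: run_with_retries <attempts> <seconds-per-attempt> <command...>
run_with_retries() {
  attempts=$1; limit=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    timeout "$limit" "$@" && return 0
    i=$((i + 1))
    echo "attempt $i/$attempts failed; retrying" >&2
  done
  return 1
}

# e.g. retry a flaky fetch up to 3 times, 600s per attempt:
# run_with_retries 3 600 git fetch --quiet cros
```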

Jul 29, 2011
#1 cmp@chromium.org
(No comment was entered for this change.)
Status: Assigned
Owner: nsylv...@chromium.org
Labels: Pri-1
Jul 29, 2011
#2 nsylv...@google.com
ran  git config --global gc.auto 0 on all servers

Also created the new cron jobs that will gc --auto all repos during the night.

Let me know if you still see the error. 
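For reference, the per-repo equivalent of that change looks like the following (sketched against a throwaway bare repo; the servers used `--global`, and the cron line is only a guess at the schedule):

```shell
# Disable automatic gc so a fetch/clone never triggers a repack inline.
repo=$(mktemp -d)/mirror.git
git init --quiet --bare "$repo"
git --git-dir="$repo" config gc.auto 0

# Hypothetical nightly cron entry on each backend, gc'ing every bare repo:
# 0 3 * * * for r in /srv/git/*.git; do git --git-dir="$r" gc --auto --quiet; done
```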
Aug 1, 2011
#3 cmp@chromium.org
Chris/Chris, have you seen any hanging bots in repo commands since Nicolas gc'd the repos?
Cc: sosa@chromium.org cmasone@chromium.org
Aug 1, 2011
#4 cmasone@chromium.org
you'll have to ask sheriffs from the last several days -- kliegs and stevenjb would be two of them.  I'd CC them, but I'm not allowed.
Aug 1, 2011
#5 cmp@chromium.org
+kliegs and stevenjb

Jonathan/Steven, have you seen any hanging bots in repo commands since Nicolas gc'd the repos on Friday?
Cc: kliegs@chromium.org stevenjb@chromium.org
Aug 1, 2011
#6 kliegs@chromium.org
We've got the day off in Waterloo so haven't really done much syncing.

I'll check with people tomorrow and get you feedback.
Aug 1, 2011
#7 cmp@chromium.org
(No comment was entered for this change.)
Labels: -Restrict-View-InfraTeam -Restrict-EditIssue-InfraTeam
Aug 1, 2011
#9 cmp@chromium.org
The gitgc.sh script we're using was running with --auto which was not triggering a full GC.  We dropped the --auto on all Git read-only backends and ran it by hand on one repo (chromium/src.git) which took 1m10s to complete.  Afterwards, all of the old and new pack files were replaced by one pack file.

I'm running gitgc.sh by hand on the first Git backend, and the rest of the backends will pick it up in a cronjob run.
Aug 1, 2011
#10 cmp@chromium.org
Some of the object files in one of the non-GC'd repos dated back to June 2.  So we may be a lot less likely to hit this border condition after forcing the GC nightly.

Nicolas, I recall you saying you changed the global Git config on these systems to disable auto-repacking.  Did you try anything else to disable GC-on-clone?
Aug 1, 2011
#11 nsylv...@google.com
cmp: no, i haven't tried anything else.

I think git gc --auto might not have been working... let's see what the nightly "git gc" does this time.
Aug 1, 2011
#12 cmp@chromium.org
Okay, David James pointed out another hang.  This time we got in touch with the Git maintainers and also looked for a stack trace.  What was interesting is that we confirmed this time that the Git process had no active local socket, so while the GCs were something of a slowdown, they were a red herring here.

The Git maintainers told us that there is a race condition in the smart HTTP protocol.  If someone commits while someone else is updating, then the client will hit this condition and need to start again.  We are discussing disabling smart HTTP as a short-term workaround.  We'll keep talking with the Git maintainers to understand if/how they plan to fix this issue so we can use smart HTTP.

Another possibility is to detect this condition in repo and retry their fetch.
Cc: an...@chromium.org davidjames@chromium.org
Aug 1, 2011
#13 cmp@chromium.org
Anush, are the random hangs in the Repo/Git clients causing enough pain that we should investigate disabling smart HTTP for now?  The win is that the random hangs should stop, the cost is that while full Git clones should be the same speed, updates will be slower.
Aug 1, 2011
#14 an...@chromium.org
Yes, it seems to be happening often enough for people to notice.  Can we disable it per repository?  I think there may be "high contention" repos that we could disable this on (where every bot tries to commit to something and then pull from that repo, etc.).  But if there is no such option, yes, we can disable smart HTTP and take the hit during incremental updates, since we would take a slow and robust checkout over a fast/flaky one.

The other option is to open up git:// and not worry about smart HTTP.
Aug 2, 2011
#15 nsylv...@google.com
git:// would be a fairly significant change for the users. We hope to have a fix soon, and dealing with http is much easier on our side... so we just need a temporary solution while we wait for the fix.
Aug 2, 2011
#16 an...@chromium.org
Ok, fair enough.  If it's a few weeks that we will be slow for incremental checkouts, then that is ok.  Let's do that.
Aug 2, 2011
#17 cmp@chromium.org
I propose the following repos be special-cased in our Git backends to be served completely over HTTP (not through Git's special HTTP backend):

- external/Webkit.git (2GB)
- chromium/src.git (1.4GB)
- external/WebKit_trimmed.git (1GB)
- chromiumos/third_party/kernel.git (500MB)
- chromiumos/third_party/kernel-next.git (500MB)

These are the largest Git repos on gerrit.chromium.org, so I'm assuming they correlate closely with the repos where a push is most likely to occur during a fetch/clone.
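One operational note on serving a repo over plain ("dumb") HTTP, if we go this route: the server has to regenerate `info/refs` and the pack indexes after every update (`git update-server-info`, normally via the stock `post-update` hook), or clients can't discover refs.  A throwaway sketch:

```shell
# Create a bare repo and generate the files a dumb-HTTP client fetches.
repo=$(mktemp -d)/src.git
git init --quiet --bare "$repo"
git --git-dir="$repo" update-server-info
ls "$repo/info/refs"  # first file a dumb-HTTP clone requests

# On a real backend this should run on every push, e.g.:
# cp "$repo/hooks/post-update.sample" "$repo/hooks/post-update"
# chmod +x "$repo/hooks/post-update"
```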
Aug 2, 2011
#18 an...@chromium.org
sounds good to me. 

However I'm _guessing_ contention would occur in manifest-versions.git too. Every bot tries to push/pull from this repo and we see a commit about every 15 mins (and we may be pulling more than 300 times a day across all bots).

Also there aren't any real commits going to the Chromium repos, just mirrored pushes. Don't know if that matters.

Anyway, I think it would help to find which repos the problem happens on and then switch those to dumb HTTP. If not, we can slowly work our way through them one by one:

- manifest-versions.git (commits every 15 mins + 100s of pulls)
- kernel.git (commits every ~10 hrs or so)
- followed by the Chromium projects
- kernel-next.git (has been dead for a while and isn't in the manifest)


chromiumos-overlay.git is also a high-commit repo, and every bot tries to sync and commit to it.

It may help to narrow down which of the repos are the troublemakers, and we can also make some build-system-side changes.
Aug 4, 2011
#19 davidjames@google.com
Issue chrome-os-partner:5259 has been merged into this issue.
Cc: nsanders@google.com dparker@google.com tiedw...@gmail.com
Aug 4, 2011
#20 davidjames@chromium.org
 Issue chromium-os:18250  has been merged into this issue.
Cc: nsylv...@chromium.org cmp@chromium.org chrome-i...@google.com ellyjo...@chromium.org
Aug 4, 2011
#21 cmp@chromium.org
Current status is that I have a fix in hand.  The fix is partially verified; it requires careful bring-up across 3 hosts simultaneously and initial use against the playground repos.  Since the Apache servers need restarts, I'll do this soon outside of normal work hours (tonight is a no-go, I'll be on a flight).
Owner: cmp@chromium.org
Aug 4, 2011
#22 cmp@chromium.org
Correction: I have the *short-term fix* in hand: serve some repos directly over HTTP rather than over the Git HTTP backend.

The long-term fix still requires a patch to Git's HTTP backend on our servers.  We're talking with the Git maintainers about when that will get implemented.  No more info on that yet.
Aug 4, 2011
#23 an...@google.com
Could you tell us which repos you plan to disable smart http on? Just curious if we know the problematic ones (is it manifest-versions or chromiumos-overlay ? ) 
Aug 8, 2011
#24 cmp@chromium.org
We don't know the problematic ones.  We're guessing based on where the clients have been hanging, and this has tended to be in the chromium/src.git and kernel.git fetches, along with your manifest-versions and kernel-next.  These are the primary repos I plan to enable this fix for.  If we see more hangs, we'll disable smart HTTP for those repos.
Aug 8, 2011
#25 sosa@chromium.org
I would probably also do chromiumos/overlays/chromiumos-overlay.git.  It gets a lot of traffic from the buildbots.
Aug 8, 2011
#26 cmp@chromium.org
Ok.  Btw, the reason we don't know all of the failures is that when Git fails like this on the server side, we get an error in our Apache logs giving the SHA1 it failed on, but not the repo.  Since the repo has already been GC'd, the SHA1 is gone, and we have no way to trace the SHA1 back to the repo.

This could be fixed with:
(1) an Apache config change that logs the request URL alongside the error message
(2) a Git change to include the repo name in the error message

(1) is the fastest and easiest.  If someone knows a config trick to enable this, let us know.
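For (1), one possibility, assuming the servers could move to Apache 2.4 (this thread predates it; 2.2 has no `ErrorLogFormat`): emit a per-request log ID in both the error log and the access log, so an error line from git-http-backend can be joined back to the request URL.

```apacheconf
# Hypothetical httpd.conf fragment (Apache >= 2.4). %L is the request's
# log ID; it is available in ErrorLogFormat and in mod_log_config's
# LogFormat, so grepping both logs for the same ID recovers the repo URL.
ErrorLogFormat "[%{u}t] [%-m:%l] [pid %P] [id %L] %M"
LogFormat "%h %l %u %t \"%r\" %>s %b [id %L]" with_request_id
CustomLog "logs/access_log" with_request_id
```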
Aug 8, 2011
#27 joseph...@chromium.org
David has observed that the arm generic PFQ hung.
(https://code.google.com/p/chromium/issues/detail?id=90675)

** Start Stage LKGMCandidateSync - 19:36:30

INFO: PROGRAM(cbuildbot) -> RunCommand: ['repo', 'sync', '--quiet', '--jobs', '8'] in dir /b/cbuild

--------------------------------------------------------------------------------
started: Mon Aug  8 19:34:25 2011
ended: Mon Aug  8 20:59:01 2011
duration: 1 hrs, 24 mins, 36 secs

Aug 15, 2011
#28 cmp@chromium.org
Progress: I have an Apache config in-hand that implements the dumb HTTP protocol for specific repositories.  It's currently not enabled on our backends, so the hangs can still occur.  The workaround can be deployed when necessary to fix the most annoying/frequent hangs.

davidjames is looking into identifying when the hangs are happening and auto-retrying.  Some notification will be sent to stdio to allow us to capture these events so we know (a) when they occur and (b) when they're completely fixed.
Cc: -cmp@chromium.org -chrome-i...@google.com
Aug 16, 2011
#31 an...@chromium.org
I'm curious what we changed on the client/bot side to start seeing these errors more. We had the same version of gerrit/git running for a few months before we started noticing this problem.
Aug 16, 2011
#32 kliegs@google.com
I can't say this for certain, but for me at least it seemed to start occurring around the time we added the Chromium source code to Gerrit.  Checking out an additional ~2-4 GB (not sure of the exact amount) could add quite a bit of load to the infrastructure and networks.

It also feels like there are a lot more bots recently.  So adding the 2-4 GB/checkout plus many more checkouts could compound it.
Aug 18, 2011
#33 an...@chromium.org
We had the Chrome sources checked in for a while (the commit went in on Wed, 8 Jun 2011, and this problem started at the end of July).  So let's be careful about what we guess the problem is, since it could send a lot of people down the wrong route.  We added more bots and we also started to synchronize them more (manifest versions, etc.).
Aug 18, 2011
#35 an...@chromium.org
Chase/Nicolas: each of our builders runs with -j8, and at times we have 3 builders starting a checkout at the same time (more or less 24 jobs at once), which sometimes collides with yet more builders syncing.  So I'm curious: if we had 5 builders with -j8 and we were unlucky enough to hit the same cgit mirror, would it handle that ok?

Meanwhile I will push a CL to reduce repo sync from -j8 to -j4 to see if that helps.

Also, I will open an issue to get repo --mirror/--reference working for ChromiumOS checkouts so that we always do only an incremental sync, even for the full builders (and rely on the Git SHA for source correctness, which we already do).
Aug 18, 2011
#36 sosa@chromium.org
Anush, we already do incremental syncing for all builders.
Aug 18, 2011
#37 bugdroid1@chromium.org
Commit: 0c314d2df8b1bf8d3de5ca77b41ed8be49220fa5
 Email: anush@chromium.org

Reduce repo sync parallelism

BUG=chromium:90675
TEST=ad hoc

Change-Id: I1382d62a51d36b10a1c7518b50fc878a96544398
Reviewed-on: http://gerrit.chromium.org/gerrit/6223
Reviewed-by: Chris Sosa <sosa@chromium.org>
Reviewed-by: David James <davidjames@chromium.org>
Reviewed-by: Anush Elangovan <anush@google.com>
Tested-by: Anush Elangovan <anush@google.com>

M	buildbot/repository.py
Aug 18, 2011
#38 davidjames@chromium.org
FYI I don't think reducing parallelism in the bots will fix the issue or reduce the incidence of it.

GIT_HTTP_LOW_SPEED_TIME=30 might help, though; ferringb is looking into enabling that. It would cause git to exit whenever throughput stays below 1 byte per second for more than 30 seconds.
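Concretely, the knobs in question (values as in the pending change, not a recommendation):

```shell
# Abort an HTTP transfer when throughput stays below LOW_SPEED_LIMIT
# bytes/sec for LOW_SPEED_TIME consecutive seconds, instead of hanging.
export GIT_HTTP_LOW_SPEED_LIMIT=1   # bytes per second
export GIT_HTTP_LOW_SPEED_TIME=30   # seconds below the limit before aborting

# git fetch --quiet cros   # would now die fast on a stalled connection
```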
Aug 18, 2011
#39 bugdroid1@chromium.org
Commit: b65813e983ba0787685983ea09aab0489ebfd9ac
 Email: ferringb@chromium.org

push in git/http timeout settings for repo sync invocations

Roughly; if less than 1kb/s for a minute, terminate the git/http connection.
Bit heavy handed, but this is intended to make the hangs we've been having
crystal clear that they're occuring.

BUG=chromium:90675
TEST=ad hoc

Change-Id: I1e67b2f6fc4ae4713c6ccf239b75e43fe6133245
Reviewed-on: http://gerrit.chromium.org/gerrit/6241
Reviewed-by: Chris Sosa <sosa@chromium.org>
Reviewed-by: David James <davidjames@chromium.org>
Tested-by: Brian Harring <ferringb@chromium.org>

M	buildbot/repository.py
Aug 18, 2011
#40 davidjames@google.com
Looks like the above changes didn't work. Git still hangs. Example: http://chromeos-botmaster.mtv.corp.google.com:8026/builders/tegra2_aebl-binary/builds/1313

$ ps -eo pid,etime,cmd | grep git
 9725    01:19:55 git fetch --quiet cros
 9729    01:19:55 git-remote-http cros http://git.chromium.org/chromiumos/overlays/board-overlays.git
 9746    01:19:54 git fetch-pack --stateless-rpc --lock-pack --include-tag --thin --no-progress 

$ xargs --null --max-args=1 echo  < /proc/9746/environ  | grep GIT
GIT_HTTP_LOW_SPEED_TIME=30
GIT_HTTP_LOW_SPEED_LIMIT=1
GIT_DIR=/b/cbuild/arm-tegra2_aebl-private-bin/.repo/projects/src/overlays.git

It looks like the environment variables got passed in correctly, but git is still hanging. So we probably need to implement a timeout and kill git when it hangs.

After noting the above info, I killed the offending git process and the process continued fine.
Cc: ferri...@chromium.org
Aug 18, 2011
#41 an...@chromium.org
Looks like ToT git has a fix for the smart HTTP race:
http://git.kernel.org/?p=git/git.git;a=commit;h=2f5cb6aa1e2eb7b12df85d5ddc4d8bc79be47b3d
http://git.kernel.org/?p=git/git.git;a=commit;h=051e4005a37c4ebe8b8855e415e6c6969d68e1a3

http://chromeos-botmaster.mtv.corp.google.com:8026/builders/tegra2_aebl-binary/builds/1313
seems to be the wrong reference to the build, since I don't see the error in it. However, it would be interesting to see if there was a commit to board-overlays.git at the time the hang happened.

We could try the ToT git version to see if that works ok, but I am not sure how Chase/Nicolas feel about deploying it on our mirrors.

Meanwhile, we could also implement a feature in repo to kill the call to git fetch-pack if it times out.
Aug 18, 2011
#42 davidjames@google.com
That's the right link. Note that RunCommand grabs stdout and stderr. It looks like ferringb's patch has a bug, so the error output is not printed (the old code did in fact print errors when the command failed). So the only sign of the failure is that the 'repo sync' cmd is printed twice.
Aug 19, 2011
#43 ferri...@google.com
Just a general comment: the timeout/retry attempts are basically duct tape over each spot where the hang occurs, and we'll inevitably have to expand them to more spots.  It's not really tenable: we've got the http timeout in, but now we're looking at having to push retries in as well, for example (plus the timeout only seems to work part of the time).

The point is, we really should be sorting this out server side; what are the steps necessary to get a fix tested/deployed?  Must it be ToT git, or can we cherry-pick the fixes?
Aug 20, 2011
#44 davidjames@chromium.org
Caught another hang in progress. Here's the data.

Build: http://chromeos-botmaster.mtv.corp.google.com:8026/builders/TOT%20Pre-Flight%20Queue/builds/4822


$ hostname
chromeosbuild3

$ git --version
git version 1.7.6

$ ps -eo pid,cmd | grep git
 9850 git fetch --quiet cros
 9851 git-remote-http cros http://git.chromium.org/chromiumos/overlays/chromiumos-overlay.git
 9895 git fetch-pack --stateless-rpc --lock-pack --include-tag --thin --no-progress http://git.chromium.org/chromiumos/overlays/chromiumos-overlay.git/ refs/heads/master

$ lsof -p 9851 -p 9895  | grep pipe
git-remot 9851 chrome-bot    0r  FIFO    0,8       0t0    23648 pipe
git-remot 9851 chrome-bot    1w  FIFO    0,8       0t0    23649 pipe
git-remot 9851 chrome-bot    2w  FIFO    0,8       0t0    18830 pipe
git-remot 9851 chrome-bot    6w  FIFO    0,8       0t0    23807 pipe
git-remot 9851 chrome-bot    7r  FIFO    0,8       0t0    23808 pipe
git       9895 chrome-bot    0r  FIFO    0,8       0t0    23807 pipe
git       9895 chrome-bot    1w  FIFO    0,8       0t0    23808 pipe
git       9895 chrome-bot    2w  FIFO    0,8       0t0    18830 pipe

$ sudo strace -p 9895 -p 9851
Process 9895 attached - interrupt to quit
Process 9851 attached - interrupt to quit
[pid  9851] read(7,  <unfinished ...>
[pid  9895] read(0, 

This shows the two processes are stuck, each reading from the other. This makes me wonder if maybe it's a client-side hang rather than a server-side hang.

Core files available here (from the internal network only): http://www.corp.google.com/~davidjames/cores/
Aug 21, 2011
#45 an...@chromium.org
Looks like the two highly active repos are the ones we see the problem in (board-overlays.git and chromiumos-overlay.git).  Could we set up a test server with one client constantly committing to it and a couple of clients constantly pulling the sources?  That way we can recreate the setup in a controlled environment and see what helps w.r.t. changing the git version, etc.
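A rough local sketch of that harness (using a bare repo on disk as a stand-in; a faithful reproduction would serve it through git-http-backend over smart HTTP, since that's where the race lives):

```shell
# One writer commits and pushes; several readers clone in parallel.
srv=$(mktemp -d)/hot.git
git init --quiet --bare "$srv"

wt=$(mktemp -d)
git clone --quiet "$srv" "$wt/writer" 2>/dev/null
git -C "$wt/writer" -c user.name=bot -c user.email=bot@example.com \
    commit --quiet --allow-empty -m seed
git -C "$wt/writer" push --quiet origin HEAD:master

# In a real stress run these loops run continuously; 3 clones shown here.
for i in 1 2 3; do
  git clone --quiet "$srv" "$wt/reader$i" &
done
wait
ls -d "$wt"/reader*/.git
```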
Aug 22, 2011
#46 ferri...@google.com
What git version are we running server side?

Offhand, I seem to be able to replicate hangs locally now with 1.7.3, 1.7.6, and (still verifying this fully) 1.7.7 server side, with 1.7.3 client side.
Aug 22, 2011
#47 nsylv...@google.com
We should be using the latest version that shipped with Lucid, which I believe is 1.7.0.  We do have 1.7.3.3 installed side by side on some servers, but I'm not sure if we use it.

Chase is handling this, but he is OOO. He will be back at work on Friday.

The story as I heard it last:

1. We have the code to be able to fallback on slow-git for some heavy use repos.  (Might be too slow?). I don't think it's enabled yet.

2. There is a patch out, we'd need to build from source and apply the patch.  Chase was thinking about creating a ubuntu package with this version, so we can install it easily on our servers.

Aug 22, 2011
#49 an...@chromium.org
I think we should just avoid "dumb" HTTP and instead narrow down the reproduction so we can verify against the latest version, or feed a minimal reproduction to the git maintainers for fixing.
Aug 23, 2011
#51 ferri...@chromium.org
Upstream is still broken, although it's *very* close; they missed a null check.  Before, I could trigger the race on about 50% of clone invocations in my setup; after, running from 8d9185 with the null fix (http://article.gmane.org/gmane.comp.version-control.git/179986), I couldn't trigger any issues over an hour of 7 parallel clones.  That's roughly 5-7k clones without issue, so I'd call it a big step up.

As for falling back to dumb HTTP: why don't we keep that in reserve?  Realistically we'd have to force it across all repos to preclude the race, which is suboptimal.

I'd rather see us jump forward to ToT/1.7.7; the smart-HTTP server code has been changing, and 1.7.0/1.7.3 are pretty old (with known races/deadlocks since fixed).

Client side, there is still a potential deadlock if the connection is lost at the wrong time; we really can't do anything about that besides finding a fix and waiting for it to percolate out.
Aug 24, 2011
#52 davidjames@google.com
(No comment was entered for this change.)
Blocking: chromium-os:16326
Aug 29, 2011
#53 cmp@chromium.org
I was working on this initially and Brian (ferringb@) has been working on it lately.  The issue tracker won't let me re-assign it to him, so here's the latest update:

ferringb is testing a Git package with the fix on one of our git backends.  nsylvain just deployed the package and if it looks good, we'll put it in the pool, watch for errors, then deploy the package to the rest of our git backends.  That should take care of this issue.
Sep 20, 2011
#54 cmasone@chromium.org
How's the fixed git package looking?
Sep 20, 2011
#55 nsylv...@google.com
The package has been installed everywhere. 
I'll mark this issue as fixed.
Status: Fixed
Owner: nsylv...@chromium.org
Oct 13, 2012
#56 bugdroid1@chromium.org
This issue has been closed for some time. No one will pay attention to new comments.
If you are seeing this bug or have new data, please click New Issue to start a new bug.
Labels: Restrict-AddIssueComment-Commit
Blocking: -chromium-os:16326 chromium-os:16326
Jul 2, 2013
#57 benhe...@google.com
(No comment was entered for this change.)
Labels: -Build-Infrastructure Infra