Issue 390: Branches disappear and don't fetch/clone
Status:  Released
Owner:
Closed:  Mar 2012

Blocked on:
issue 394
Project Member Reported by sop@google.com, Jan 15, 2010
Affected Version: 2.1.1.1

Sometimes a branch disappears, and it cannot be fetched or
cloned anymore.  repo sync shows this as:

  $ repo sync
  Fetching projects: 100% (224/224), done.
  error: master in platform/bionic not found

I suspect what's happening is a background `git gc` job runs
and moves the branch into the $GIT_DIR/packed-refs file, but
JGit doesn't seem to be reloading the packed-refs data after
the git gc pass.  Since the branch is no longer loose JGit is
not reporting it to a client.
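
For readers unfamiliar with the two ref stores, here is a minimal Python sketch of a lookup that checks the loose ref file first and falls back to $GIT_DIR/packed-refs. It is not JGit's code; the function names and the demo() helper are made up for illustration, and it re-reads packed-refs on every lookup, which is the behavior the suspicion above says was missing.

```python
import os
import tempfile


def read_packed_refs(git_dir):
    """Parse $GIT_DIR/packed-refs into {ref_name: sha}."""
    refs = {}
    path = os.path.join(git_dir, "packed-refs")
    if not os.path.exists(path):
        return refs
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip the header comment, blank lines, and peeled-tag lines ("^sha").
            if not line or line.startswith("#") or line.startswith("^"):
                continue
            sha, name = line.split(" ", 1)
            refs[name] = sha
    return refs


def resolve_ref(git_dir, name):
    """Look for a loose ref first, then fall back to packed-refs."""
    loose = os.path.join(git_dir, *name.split("/"))
    if os.path.isfile(loose):
        with open(loose) as f:
            return f.read().strip()
    # Re-read packed-refs every time, so a ref packed by `git gc` stays visible.
    return read_packed_refs(git_dir).get(name)


def demo():
    """Simulate `git gc` moving a branch from a loose file into packed-refs."""
    sha = "a" * 40  # placeholder object id, not a real commit
    with tempfile.TemporaryDirectory() as git_dir:
        loose = os.path.join(git_dir, "refs", "heads", "master")
        os.makedirs(os.path.dirname(loose))
        with open(loose, "w") as f:
            f.write(sha + "\n")
        before = resolve_ref(git_dir, "refs/heads/master")
        # What a ref pack does: the loose file goes away, packed-refs gains a line.
        os.remove(loose)
        with open(os.path.join(git_dir, "packed-refs"), "w") as f:
            f.write("# pack-refs with: peeled\n%s refs/heads/master\n" % sha)
        after = resolve_ref(git_dir, "refs/heads/master")
    return before == sha, after == sha
```

A reader that caches the packed-refs contents and never reloads them would return nothing for the second lookup, which matches the symptom described here.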
Jan 15, 2010
Project Member #1 fredrik....@sonyericsson.com
Our experience is now the other way around: when people check in very big checkins 
and they are not repacked (the hourly script was disabled, for those of you who read 
about us doing that), the load on the server increases, and at some point 
this error shows up.

error: revision sw-integration in platform/<some-git> not found

Once we run git gc, the load goes down and the server serves git as normal again. 
A restart of the gerrit service may or may not be necessary; we have no good statistics on 
this yet. Last time we got into this, it was not necessary.
Jan 15, 2010
Project Member #2 fredrik....@sonyericsson.com
Addition: By "large commits" I mean to say that it is an external delivery, which 
means that the commit contains a lot of files, a lot of those files are updated, so 
the upload of the commit means a lot of new blob data and a new ref on the server.
Jan 19, 2010
#3 sop@google.com
Given what is happening in  issue 394 , we might actually be
looking at a different variant of  issue 394 .

If the object that a branch points to cannot be read from
disk, the branch just silently disappears, and no error is
logged to the server log file.  So  issue 394  can cause the
branch to vanish like we are seeing here.
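
As an illustration only, here is a small Python sketch of a ref advertisement that records a skipped branch instead of hiding it silently. The advertise_refs name and the object_exists callback are hypothetical stand-ins, not Gerrit or JGit APIs.

```python
import logging

log = logging.getLogger("advertise")


def advertise_refs(refs, object_exists):
    """Split refs into those whose target is readable and those that are not.

    refs: {ref_name: sha}
    object_exists: callable(sha) -> bool, a hypothetical stand-in for an
    object-database lookup.
    """
    visible, hidden = {}, []
    for name, sha in sorted(refs.items()):
        if object_exists(sha):
            visible[name] = sha
        else:
            # The point of the sketch: log the skip rather than drop it silently.
            log.warning("ref %s points at unreadable object %s", name, sha)
            hidden.append(name)
    return visible, hidden
```

With a warning like this in the server log, a vanished branch could be traced back to the unreadable object rather than showing up only as a client-side "not found".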
Blockedon: 394
Jan 30, 2010
#4 sop@google.com
(No comment was entered for this change.)
Labels: -Milestone-Next Milestone-2.1.2
Feb 21, 2010
#5 sop@google.com
 Issue 394  has been merged into this issue.
Mar 1, 2010
#6 jjhel...@gmail.com
My organization started seeing this today too, with similar symptoms as explained in 
 issue 394 :

fatal: protocol error: bad pack header

Has anyone been able to temporarily work-around this problem?
Mar 1, 2010
#7 jjhel...@gmail.com
Another update.  I just noticed this post:
http://groups.google.com/group/repo-discuss/browse_thread/thread/d137c9e55e55542

I dropped down to a shell and ran "git gc" on the problematic git repo as the gerrit2 
user and it fixed the problem.

Mar 2, 2010
#8 sop@google.com
Slipped to 2.1.3.  I want to get 2.1.2 out.
Labels: Milestone-2.1.3
Mar 2, 2010
#9 sop@google.com
(No comment was entered for this change.)
Labels: -Milestone-2.1.2
Mar 11, 2010
Project Member #10 ulrik.sj...@gmail.com
I have finally been able to recreate this problem!

1) Push a commit onto a git (the error is more likely to occur if the commit is big; mine
was 800 megs from /dev/urandom).
2) Let the replication to the replication-server finish 
3) Clone the project from the replication server (make sure you are the FIRST person
to clone after the replication is done).
4) Ctrl-C the clone
5) You are now the proud owner of a broken git (we heal it with 'git gc'). Cloning
the git again will give you something like this:

Initialized empty Git repository in /mnt/src/helloworld/helloworld/.git/
remote: Counting objects: 2765, done
remote: Compressing objects: 100% (2765/2765)
fatal: internal server error6/2765), 165.82 MiB | 11346 KiB/s   
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed


Mar 12, 2010
Project Member #11 ulrik.sj...@gmail.com
Btw, forgot to add that the push in step 1) is to refs/heads/master.
Apr 9, 2010
#12 ern...@gmail.com
Is this still reproducible?
Apr 26, 2010
#13 sop@google.com
Re comment #10, when the replication is running is that
going over the system SSH, writing the objects directly
into the repository behind Gerrit's back?

I think it's a red herring that ctrl-c'ing that first
clone causes things to break for all subsequent users.

And I doubt 800 MiB is actually needed to trigger this.
What's probably happening is, your 800 MiB push contained
enough *objects* that it was over the 100 object limit and
was retained as a pack file, rather than being exploded to
loose objects.  And the Gerrit server failed to figure out
that a new pack file was available on disk.
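
The 100 object limit can be sketched in Python as below, assuming the receiving side is C git with its documented default for transfer.unpackLimit / receive.unpackLimit. The function name is made up for illustration.

```python
DEFAULT_UNPACK_LIMIT = 100  # git's documented default for transfer.unpackLimit


def kept_as_pack(num_objects, unpack_limit=DEFAULT_UNPACK_LIMIT):
    """Return True if a received push is stored as a pack file.

    Below the limit, the objects are exploded into loose object files;
    at or above it, the incoming pack is kept on disk as a pack.
    """
    return num_objects >= unpack_limit
```

So what matters is the number of objects in the push, not its size in bytes: kept_as_pack(3) is False however large the three objects are, while kept_as_pack(100) is True.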
Apr 26, 2010
Project Member #14 fredrik....@sonyericsson.com
Hi Shawn / Comment #11
Yes, we're replicating over OpenSSH.

The 800MiB example was mentioned as the safest way to reproduce the bug. But this is 
certainly not the only possible scenario, we see it quite often when we push more 
than one object as well.
The test we did, IIRC, was to check in one large 800MiB binary. I'm not sure what 
that means to git internally.. I thought it meant only one huuuuuuuuuge blob-object 
rather than many? (and then tree and commit objects, obviously.. still not hundreds?)

I'd love to tell you more on the differences between when the sync completes versus 
when you ctrl-c it, but I was not around Ulrik and Ernst when they set about to 
reproduce it, and hence my answer is less useful than it could've been. They might 
add their own comments tomorrow morning, EU hours.
Hope it helps!
May 3, 2010
#15 ern...@gmail.com
Had any luck with this Shawn? Can you reproduce it if you follow the #10 steps?
May 3, 2010
#16 sop@google.com
Nope.  I spent about a day on it last week.  I wasn't able to reproduce by
following comment #10.  So I spent some time looking through this section
of code in JGit.  There is a possibly bad condition relating to a push into
Gerrit Code Review confusing a concurrent read.  I've posted patches for it
to JGit, and I see they got merged over the weekend.  I doubt they fix the
case described here though, because the push must occur over the Gerrit port
to trigger the condition.

My week this week is all messed up scheduling wise due to personal stuff that
I have going on right now.  But I plan to devote most of what I can this week
at work to looking at this problem more, maybe I'll have some flash of insight
if I stare at the code long enough.
May 4, 2010
#17 sop@google.com
Some notes from an IM session with an admin suffering from
this bug on their Gerrit server, against a Linux kernel repo:

Them> got again the false missing object exception, I do notice
    > one thing tho, almost all the time it's complaining about
    > the object that is the vanilla 2.6.33 commit (we initialize
    > all our branches to start from that)

Me  > ugh

Them> hi again, wtf, I just found out that we have disabled the
    > repack script sometime in March and are only running the
    > resync-all script every night so it would mean those problems
    > are not because of the external repacking

Me  > yikes
    > so the vanilla 2.6.33 commit went poof solely due to gerrit
    > adding new pack files during pushes.

Them> not sure why it did, but yeah, it seems Gerrit doesn't know
    > about it even tho it exists and works after a restart (neither
    > of the touch or "git gc" solve the issue, only restarting Gerrit
    > does so far)

Makes me start to suspect that the PackFile object which contains the
commit got marked as corrupt in memory, or it was simply omitted from
the PackList object somehow during a copy of the array.
Labels: Component-JGit
May 13, 2010
#18 sop@google.com
Slightly new theory:

JGit has an open bug [1] where pack files are accessed after their
file descriptor was closed.  These usually result in an IOException
being thrown back at the caller.

In many places within ObjectDirectory, JGit consumes an IOException
when accessing the pack file and removes the pack file from its list
of known packs.  Since the exception is not logged, we don't know if
this condition is triggering or not.

When the pack gets removed from the list of known packs, it is never
put back into the list because the objects/pack mtime doesn't change.

So if this read-after-close bug occurs at the right place, we won't
log it, but we'll close the pack and forget it ever exists.  Later on
when we can't access the object we log the missing object error, or
simply hide the branch from the client entirely.

[1] https://bugs.eclipse.org/bugs/show_bug.cgi?id=308945
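
A toy Python model of the theory above, not JGit's actual code: the PackListCache class, its methods, and the demo() helper are all hypothetical. It shows how a pack dropped after one transient I/O error stays forgotten until the objects/pack mtime changes. Unlike the behavior described above, the sketch records the error so the condition is at least visible.

```python
class PackListCache:
    """Toy cache of known pack files, rescanned only when a directory mtime changes."""

    def __init__(self, scan_dir, dir_mtime):
        self._scan_dir = scan_dir      # callable() -> list of pack names on disk
        self._dir_mtime = dir_mtime    # callable() -> mtime of objects/pack
        self._last_mtime = dir_mtime()
        self._packs = list(scan_dir())
        self.errors = []               # recorded instead of being swallowed

    def packs(self):
        # Rescan only when the directory's mtime has changed.
        now = self._dir_mtime()
        if now != self._last_mtime:
            self._last_mtime = now
            self._packs = list(self._scan_dir())
        return list(self._packs)

    def read(self, sha, read_from_pack):
        for pack in self.packs():
            try:
                return read_from_pack(pack, sha)
            except KeyError:
                continue  # object is not in this pack, try the next one
            except IOError as err:
                # Suspected failure mode: the pack is dropped from the list.
                self.errors.append((pack, str(err)))
                self._packs.remove(pack)
        return None  # caller sees a "missing object"


def demo():
    state = {"mtime": 1, "fail": True}

    def reader(pack, sha):
        if state["fail"]:
            raise IOError("read after close")
        return b"data"

    cache = PackListCache(lambda: ["pack-1"], lambda: state["mtime"])
    first = cache.read("abc", reader)    # transient error: pack is dropped
    state["fail"] = False
    second = cache.read("abc", reader)   # pack is healthy again, but forgotten
    state["mtime"] = 2                   # directory mtime changes
    third = cache.read("abc", reader)    # rescan finds the pack again
    return first, second, third
```

In this model the second read fails even though the pack on disk is fine, and only a change to the directory mtime (or recreating the cache, the equivalent of a restart) brings the object back.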
May 27, 2010
#19 sop@google.com
Fixed in Gerrit by change I50a1cd941fe9f0a7dd2a6a15d6bd56a36fc773a0
Status: Fixed
Labels: -Milestone-2.1.3 FixedIn-2.1.3
Jun 1, 2010
#20 sop@google.com
(No comment was entered for this change.)
Labels: -FixedIn-2.1.3 FixedIn-2.1.2.5
Jun 8, 2010
#21 jjhel...@gmail.com
We're hitting this daily now, even on 2.1.2.5.  

We're running the workaround script that touches the pack objects files.  I'll try to disable that to see if it helps.
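
The workaround script itself is not shown in this thread. As a guess at its general shape, the Python sketch below bumps the mtime of $GIT_DIR/objects/pack so that a server caching on that mtime would rescan the directory. The function name, the choice of touching the directory rather than individual pack files, and the demo() helper are all assumptions.

```python
import os
import tempfile
import time


def touch_pack_dir(git_dir, when=None):
    """Set the mtime of $GIT_DIR/objects/pack and return the new mtime."""
    pack_dir = os.path.join(git_dir, "objects", "pack")
    when = time.time() if when is None else when
    os.utime(pack_dir, (when, when))  # (atime, mtime)
    return os.stat(pack_dir).st_mtime


def demo():
    # Build a throwaway directory layout and touch it with a fixed timestamp.
    with tempfile.TemporaryDirectory() as git_dir:
        os.makedirs(os.path.join(git_dir, "objects", "pack"))
        return touch_pack_dir(git_dir, when=1_000_000_000)
```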

Jun 8, 2010
#22 jjhel...@gmail.com
My problem could well be  issue 585  too.  I'll provide the details in 858.
Mar 27, 2012
#23 sop@google.com
(No comment was entered for this change.)
Status: Released