Issue 394: protocol error: bad pack header
Status:  Duplicate
Merged:  issue 390
Owner: ----
Closed:  Feb 2010

Blocking:
issue 390
Project Member Reported by fredrik....@sonyericsson.com, Jan 19, 2010
Affected Version: Gerrit 2.1.1.1
Environment: Ubuntu 8.10 64 bit, 16GB RAM
Date: Around noon, the 18th.

The initial report from the users is this:
Initializing project platform/frameworks/base ...
warning: no common commits
fatal: internal server error
fatal: The remote end hung up unexpectedly
fatal: protocol error: bad pack header
error: Cannot fetch platform/frameworks/base

This is repeatable, several clients get the same output.

On the server, everything looks dandy. To be safe I run a gc:
platform/frameworks/base.git$ git gc
Counting objects: 146059, done.
Compressing objects: 100% (49024/49024), done.
Writing objects: 100% (146059/146059), done.
Total 146059 (delta 83409), reused 145995 (delta 83345)
platform/frameworks/base.git$ 

Latest commit according to git log --all
Date:   Sat Jan 16 16:10:45 2010 +0100

Once git gc has completed, the sync now works as expected!
But this doesn't last long, in the same sync I get the same output for 
another internal git means running git gc once more:
$ git gc
Counting objects: 94, done.
Compressing objects: 100% (28/28), done.
Writing objects: 100% (94/94), done.
Total 94 (delta 44), reused 94 (delta 44)

Lasest commit:
Date:   Fri Dec 18 18:48:37 2009 +0100

This should mean that a lot of gcs has run weekly and sometimes daily 
since then. And still we get a repack! 

After very short time a new batch of gits "turn bad". By now patience is 
over, so I restart the gerrit service.
But first some stats:

$ ps Tuxa | grep java
gerrit2   1039 76.7 52.1 9066324 8473312 pts/1 Sl   Jan15 3194:58 
GerritCodeReview -Xmx8g -Djavax.net.ssl.trustStore=/some/files -
Djavax.net.ssl.trustStorePassword=passwd -jar /srv/gerrit2/bin/gerrit.war 
daemon -d /srv/gerrit2

-> internal ps shows:

  Name               Max |Object Count        |  AvgGet  |Hit Ratio     |
                     Age |  Disk    Mem    Cnt|          |Disk Mem  Agg |
-------------------------+--------------------+----------+--------------+
  accounts           90d |                 218|          |           99%|
  accounts_byemail   90d |                    |          |              |
  diff               90d |                    |          |              |
  groups             90d |                  17|   6.5ms  |            0%|
  ldap_groups        1d  |                 216|          |           99%|
  ldap_usernames     90d |                    |          |              |
  projects           90d |                 321|          |           99%|
  sshkeys            90d |                 166|          |           98%|
  web_sessions       12h |                    |          |              |

JGit Buffer Cache:
  open files  :              15
  loaded      : 199.99 mb
  mem%        :   2%

JVM Heap:
  max         :   7.37 gb
  inuse       :   1.88 gb
  mem%        :  25%

Not sure if this gives anything of value, if I could understand the output 
from jstat better I'd show you some figures from that too. As it stands 
I'm not sure how to configure it to get any good out of it.

Jan 19, 2010
Project Member #1 fredrik....@sonyericsson.com
I was unable to set this to a blocker. I suspect Shawn can. I would appreciate that, 
as this causes as much havoc once it happens as his own filed issue 
https://code.google.com/p/gerrit/issues/detail?id=390
Jan 19, 2010
#2 sop@google.com
fredrik.luthander: I have added you as a project contributor, so
you can now edit issues, and can set the labels yourself.  Since
you asked, I'm also tagging this as Priority-Blocker for now.
Labels: -Priority-Minor Priority-Blocker
Jan 19, 2010
#3 sop@google.com
Is there anything in the error_log that matches this time period?
A failure on the server side like this that produces a corrupt
data stream to the client should have also triggered an exception
with a stack trace in the server's error_log.  Having that would
help me track down what is wrong here.

I also think your [core] section is perhaps misconfigured for
the Android project set.  We're running the following given an
8g heap size for our JVM:

  [container]
    heapLimit = 8g
  [core]
    packedGitOpenFiles = 4096
    packedGitLimit = 2g
    packedGitWindowSize = 16k

Status: Accepted
Jan 19, 2010
Project Member #4 fredrik....@sonyericsson.com
Hi Shawn!

Here's the first of the stacks from my logfile the 18th..

[2010-01-18 10:54:21,179] ERROR com.google.gerrit.sshd.BaseCommand : Internal server 
error (user first.last account 1000115) during git-upload-
pack '/platform/vendor/dir/repository.git'
com.google.gerrit.sshd.BaseCommand$Failure: fatal: 'platform/vendor/dir/repository': 
not a git archive
        at com.google.gerrit.sshd.commands.AbstractGitCommand.service
(AbstractGitCommand.java:58)
        at com.google.gerrit.sshd.commands.AbstractGitCommand.access$100
(AbstractGitCommand.java:30)
        at com.google.gerrit.sshd.commands.AbstractGitCommand$1.run
(AbstractGitCommand.java:46)
        at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:405)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301
(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run
(ScheduledThreadPoolExecutor.java:207)
        at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:310)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask
(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run
(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.eclipse.jgit.errors.RepositoryNotFoundException: repository not 
found: Cannot open repository platform/vendor/dir/repository
        at com.google.gerrit.server.git.LocalDiskRepositoryManager.openRepository
(LocalDiskRepositoryManager.java:97)
        at com.google.gerrit.sshd.commands.AbstractGitCommand.service
(AbstractGitCommand.java:56)
        ... 12 more
Caused by: org.eclipse.jgit.errors.RepositoryNotFoundException: repository not 
found: /srv/gerrit2/git/platform/vendor/dir/repository
        at org.eclipse.jgit.lib.RepositoryCache$FileKey.open
(RepositoryCache.java:298)
        at org.eclipse.jgit.lib.RepositoryCache.openRepository
(RepositoryCache.java:158)
        at org.eclipse.jgit.lib.RepositoryCache.open(RepositoryCache.java:99)
        at org.eclipse.jgit.lib.RepositoryCache.open(RepositoryCache.java:74)
        at com.google.gerrit.server.git.LocalDiskRepositoryManager.openRepository
(LocalDiskRepositoryManager.java:94)
        ... 13 more

Jan 19, 2010
#5 sop@google.com
That is a filesystem problem.

Gerrit is looking for the following:

  /srv/gerrit2/git/platform/vendor/dir/repository/objects
  /srv/gerrit2/git/platform/vendor/dir/repository/refs
  /srv/gerrit2/git/platform/vendor/dir/repository/HEAD

OR

  /srv/gerrit2/git/platform/vendor/dir/repository.git/objects
  /srv/gerrit2/git/platform/vendor/dir/repository.git/refs
  /srv/gerrit2/git/platform/vendor/dir/repository.git/HEAD

But we're missing at least one of these items; stat() is returning
an error code causing Gerrit to believe the one or more of the paths
does not exist.

Is it possible that your daemon process is losing access to the
filesystem due to a key expiring?  E.g. I just saw this myself
yesterday on a new Gerrit server I installed on top of an encrypted
filesystem.  The encrypted filesystem automatically timed itself out
and locked the key after X hours without an interactive login to the
host system.  That meant the daemon lost access to the filesystem and
we started to also get these exceptions.  But logging into the host
immediately resolved it as the filesystem unlocked.

However, this stack trace can't be the culprit of the originally
reported client error.  The original client error can only happen
if we've gotten past this point.  You emailed me your log file, in
that log I found two other errors that are the actual problems:

[2010-01-18 11:06:33,079] ERROR com.google.gerrit.sshd.BaseCommand : Internal server 
error (user X account Y) during git-upload-pack '/platform/frameworks/base.git'
java.lang.NullPointerException
        at 
org.eclipse.jgit.lib.PackWriter.writeWholeObjectDeflate(PackWriter.java:763)
        at org.eclipse.jgit.lib.PackWriter.writeObject(PackWriter.java:732)
        at org.eclipse.jgit.lib.PackWriter.writeObjects(PackWriter.java:692)
        at org.eclipse.jgit.lib.PackWriter.writePack(PackWriter.java:592)
        at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:471)
        at org.eclipse.jgit.transport.UploadPack.service(UploadPack.java:246)
        at org.eclipse.jgit.transport.UploadPack.upload(UploadPack.java:227)
        at com.google.gerrit.sshd.commands.Upload.runImpl(Upload.java:26)
        at 
com.google.gerrit.sshd.commands.AbstractGitCommand.service(AbstractGitCommand.java:62
)
        at 
com.google.gerrit.sshd.commands.AbstractGitCommand.access$100(AbstractGitCommand.java
:30)
        at 
com.google.gerrit.sshd.commands.AbstractGitCommand$1.run(AbstractGitCommand.java:46)
        at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:405)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Sched
uledThreadPoolExecutor.java:98)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThr
eadPoolExecutor.java:207)
        at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:310)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
[2010-01-18 11:06:42,587] ERROR com.google.gerrit.sshd.BaseCommand : Internal server 
error (user A account B) during git-upload-pack '/platform/frameworks/base.git'
org.eclipse.jgit.errors.MissingObjectException: Missing commit c0ffee....
        at org.eclipse.jgit.revwalk.RevObject.loadCanonical(RevObject.java:77)
        at org.eclipse.jgit.revwalk.RevCommit.parseHeaders(RevCommit.java:83)
        at org.eclipse.jgit.revwalk.PendingGenerator.next(PendingGenerator.java:138)
        at 
org.eclipse.jgit.revwalk.TopoSortGenerator.<init>(TopoSortGenerator.java:72)
        at org.eclipse.jgit.revwalk.StartGenerator.next(StartGenerator.java:145)
        at org.eclipse.jgit.revwalk.RevWalk.next(RevWalk.java:381)
        at org.eclipse.jgit.revwalk.ObjectWalk.next(ObjectWalk.java:208)
        at org.eclipse.jgit.lib.PackWriter.findObjectsToPack(PackWriter.java:856)
        at org.eclipse.jgit.lib.PackWriter.preparePack(PackWriter.java:494)
        at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:455)
        at org.eclipse.jgit.transport.UploadPack.service(UploadPack.java:246)
        at org.eclipse.jgit.transport.UploadPack.upload(UploadPack.java:227)
        at com.google.gerrit.sshd.commands.Upload.runImpl(Upload.java:26)
        at 
com.google.gerrit.sshd.commands.AbstractGitCommand.service(AbstractGitCommand.java:62
)
        at 
com.google.gerrit.sshd.commands.AbstractGitCommand.access$100(AbstractGitCommand.java
:30)
        at 
com.google.gerrit.sshd.commands.AbstractGitCommand$1.run(AbstractGitCommand.java:46)
        at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:405)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Sched
uledThreadPoolExecutor.java:98)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThr
eadPoolExecutor.java:207)
        at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:310)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)


Either of these exceptions would result in the originally reported
client error.  I'll have to look into these more, but I think they
happen because the repository was repacked and the underlying pack
we were trying to use is gone.  JGit should have detected this and
recovered itself, but it seems that isn't happening like it should.
Feb 21, 2010
#6 sop@google.com
I think this is the same as  issue 390 ... they both seem to have
the same underlying cause of packs disappearing, leaving the
server with objects being temporarily inaccessible.
Status: Duplicate
Mergedinto: 390