
bobo-browse - issue #24

MatchAllDocsQuery returns wrong answers after docs are deleted and inserted again.


Posted on Sep 18, 2009 by Grumpy Bird

What steps will reproduce the problem?

1. Index, for example, 3 docs (uid=1 field1=..., uid=2 field1=..., uid=3 field1=...);
2. flush the index to disk;
3. delete the docs;
4. flush the index to disk;
5. insert the docs again;
6. flush the index to disk;
7. search with MatchAllDocsQuery.
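For reference, a condensed sketch of these steps using plain IndexWriter calls (writer setup omitted; the full test in Comment #2 below does the same thing, and note from Comments #2 and #3 that how each "flush to disk" is performed turns out to matter):

  // Sketch only: mirrors steps 1-7 above with Lucene 2.x calls.
  String[] uids = {"1","2","3"};
  for (String uid : uids) {                        // 1. index 3 docs
    Document doc = new Document();
    doc.add(new Field("uid",uid,Store.YES,Index.NOT_ANALYZED_NO_NORMS));
    writer.addDocument(doc);
  }
  writer.commit();                                 // 2. flush to disk
  for (String uid : uids) {                        // 3. delete them
    writer.deleteDocuments(new Term("uid",uid));
  }
  writer.commit();                                 // 4. flush to disk
  for (String uid : uids) {                        // 5. insert them again
    Document doc = new Document();
    doc.add(new Field("uid",uid,Store.YES,Index.NOT_ANALYZED_NO_NORMS));
    writer.addDocument(doc);
  }
  writer.commit();                                 // 6. flush to disk
  // 7. search with a match-all query; all 3 docs should come back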

What is the expected output? What do you see instead?

  • expected: all 3 docs will be returned;
  • what I see: none of them are returned;

What version of the product are you using? On what operating system?

Version: trunk
OS: Debian

Please provide any additional information below.

This is caused by a bug in FastMatchAllDocsQuery.

After the steps above, you will have 3 index files on disk: _0.cfs, _0_1.del, and _1.cfs.

When the FastMatchAllDocsQuery instance is created by BoboIndexReader.getFastMatchAllDocsQuery, you get deletedDocs = [0, 1, 2] and maxDoc = 6.

Then FastMatchAllDocsQuery.FastMatchAllDocsWeight.scorer is called twice: once for _0.cfs and once for _1.cfs.

The first call is fine, but on the second call _deletedDocs is still [0, 1, 2] and _deletedIndex is still 0. _deletedDocs should be null (or _deletedIndex should be 3), because none of the 3 docs in _1.cfs is deleted.
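To make the failure concrete, here is a minimal sketch of the skip logic (a hypothetical reconstruction for illustration, not bobo's actual source; the field names follow the description above):

  // Hypothetical scorer sketch. For _1.cfs the weight reuses
  // _deletedDocs = [0, 1, 2] with _deletedIndex = 0, so the segment's own
  // docs 0..2 look deleted and nothing is returned. Starting at
  // _deletedIndex = 3, or passing _deletedDocs = null, restores the hits.
  class SketchScorer {
    private final int[] _deletedDocs;  // sorted deleted doc ids, may be null
    private int _deletedIndex;         // cursor into _deletedDocs
    private final int _maxDoc;         // docs in this segment
    private int _doc = -1;

    SketchScorer(int[] deletedDocs, int deletedIndex, int maxDoc) {
      _deletedDocs = deletedDocs;
      _deletedIndex = deletedIndex;
      _maxDoc = maxDoc;
    }

    int nextDoc() {
      while (++_doc < _maxDoc) {
        if (_deletedDocs != null
            && _deletedIndex < _deletedDocs.length
            && _deletedDocs[_deletedIndex] == _doc) {
          _deletedIndex++;             // current doc looks deleted; skip it
          continue;
        }
        return _doc;                   // first live doc
      }
      return Integer.MAX_VALUE;        // no more docs in this segment
    }
  }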

The attached patch is my workaround.

Attachments

Comment #1

Posted on Sep 18, 2009 by Grumpy Hippo

Thanks Lei!

Comment #2

Posted on Sep 18, 2009 by Grumpy Hippo

I wrote this test, and it passes with the current FastMatchAllDocsQuery implementation:

public void testFastMatchAllDocs() throws Exception {
  RAMDirectory idxDir = new RAMDirectory();
  Document doc;
  Field f;
  IndexWriter writer = new IndexWriter(idxDir,new StandardAnalyzer(),MaxFieldLength.UNLIMITED);
  doc = new Document();
  f = new Field("id","1",Store.YES,Index.NOT_ANALYZED_NO_NORMS);
  doc.add(f);
  writer.addDocument(doc);
  doc = new Document();
  f = new Field("id","2",Store.YES,Index.NOT_ANALYZED_NO_NORMS);
  doc.add(f);
  writer.addDocument(doc);
  doc = new Document();
  f = new Field("id","3",Store.YES,Index.NOT_ANALYZED_NO_NORMS);
  doc.add(f);
  writer.addDocument(doc);
  writer.commit();

  writer.deleteDocuments(new Term("id","1"));
  writer.deleteDocuments(new Term("id","2"));
  writer.deleteDocuments(new Term("id","3"));
  writer.commit();

  BoboIndexReader reader = BoboIndexReader.getInstance(IndexReader.open(idxDir));
  IndexSearcher searcher = new IndexSearcher(reader);

  TopDocs topDocs = searcher.search(reader.getFastMatchAllDocsQuery(), 100);
  assertEquals(0, topDocs.totalHits);
  reader.close();

  doc = new Document();
  f = new Field("id","1",Store.YES,Index.NOT_ANALYZED_NO_NORMS);
  doc.add(f);
  writer.addDocument(doc);
  doc = new Document();
  f = new Field("id","2",Store.YES,Index.NOT_ANALYZED_NO_NORMS);
  doc.add(f);
  writer.addDocument(doc);
  doc = new Document();
  f = new Field("id","3",Store.YES,Index.NOT_ANALYZED_NO_NORMS);
  doc.add(f);
  writer.addDocument(doc);
  writer.commit();

  reader = BoboIndexReader.getInstance(IndexReader.open(idxDir));
  searcher = new IndexSearcher(reader);

  topDocs = searcher.search(reader.getFastMatchAllDocsQuery(), 100);
  assertEquals(3, topDocs.totalHits);
  reader.close();
}

After changing writer.commit to writer.flush (a deprecated method), it does fail.
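That is, a sketch of the one-line change (flush() is the older counterpart of commit() and is deprecated in this API generation):

  writer.flush();   // instead of writer.commit(); with this the test fails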

But it fails even after the patch is applied.

Do you have a unit test that reproduces the problem?

Thanks

Comment #3

Posted on Sep 18, 2009 by Grumpy Bird

I do not have a unit test here.

The problem only occurs when there is more than one index file.

I index these docs through the ZoieSystem consumer. You can set the batch size to 3, so after 3 docs are consumed you will see the _0.cfs file on disk; after the 3 deletions, _0_1.del will be created; and after 3 more docs are consumed, _1.cfs will be there.

So there are two real index files there, _0.cfs and _1.cfs, and you will see the problem by issuing a MatchAllDocsQuery.
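For reference, a rough sketch of the Zoie side (hypothetical: the ZoieSystem constructor shape shown here is assumed, not checked against the zoie source; only the batch size of 3 comes from this comment):

  // Hypothetical setup -- argument order and types are assumptions.
  // With batchSize = 3, each consumed batch of 3 events becomes its own
  // flush: adds -> _0.cfs, deletes -> _0_1.del, more adds -> _1.cfs.
  ZoieSystem<BoboIndexReader,MyData> zoie = new ZoieSystem<BoboIndexReader,MyData>(
      idxDir,                      // index directory
      interpreter,                 // ZoieIndexableInterpreter for MyData
      decorator,                   // IndexReaderDecorator
      new StandardAnalyzer(),      // analyzer
      new DefaultSimilarity(),     // similarity
      3,                           // batch size
      300000,                      // batch delay (ms)
      true);                       // real-time indexing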

Comment #4

Posted on Sep 18, 2009 by Grumpy Bird

Also, these are my test index files.

Querying for shanghai or beijing will give you the right answer,

but querying for *:* (match-all) will return no results.

Attachments

Comment #5

Posted on Sep 18, 2009 by Grumpy Bird

Sorry, I meant: query for contents:shanghai or contents:china.

Comment #6

Posted on Sep 18, 2009 by Grumpy Hippo

Is this problem with MatchAllDocsQuery or FastMatchAllDocsQuery?

Can you build the index with Lucene 2.4 instead?

Lucene 2.9 had API changes that broke bobo.

Comment #7

Posted on Sep 18, 2009 by Grumpy Bird

With FastMatchAllDocsQuery. And I don't have time to build that in 2.4 now; I have to turn off my PC and leave for my train.

I will rebuild it after I am back.

Comment #8

Posted on Sep 21, 2009 by Grumpy Bird

Indexes built with Lucene 2.4.

Attachments

Comment #9

Posted on Sep 21, 2009 by Grumpy Hippo

After Lei's tests, we have determined this is related to Lucene 2.9 compatibility. The test code above (with RAMDirectory changed to FSDirectory) passes with Lucene 2.4 but fails with 2.9, whereas using MatchAllDocsQuery always passes.

Will leave this bug to be resolved with the Lucene 2.9 upgrade.

Comment #10

Posted on Oct 9, 2009 by Grumpy Bird

Fix a stupid bug in my previous patch.

Attachments

Comment #11

Posted on Oct 10, 2009 by Grumpy Bird

Patch for the BR_DEV_LUCENE_2.9 branch.

Attachments

Comment #12

Posted on Oct 24, 2009 by Grumpy Hippo

Thanks, Lei, for the patches! FastMatchAllDocsQuery was created because Lucene's default MatchAllDocsQuery had a bottleneck in the deleted-docs check.

That was fixed in Lucene 2.9, so the default MatchAllDocsQuery should now be used instead.

The class has now been removed, and getFastMatchAllDocsQuery is deprecated.
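For anyone migrating, a sketch of the call-site change (variable names illustrative):

  // before (deprecated; the class behind it has been removed):
  // Query q = boboReader.getFastMatchAllDocsQuery();
  // after: Lucene 2.9's stock match-all no longer has the bottleneck
  Query q = new MatchAllDocsQuery();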

Status: Fixed

Labels:
Type-Defect Priority-Medium