My favorites | Sign in
Project Logo
             
New issue | Search
for
| Advanced search | Search tips
Issue 218: Implement a 'ropes' datatype for CPython in C.
4 people starred this issue and may be notified of changes. Back to list
Status:  Completed
Owner:  the.good.doctor.is.in
Closed:  Jan 2008
Cc:  cfbolz, avassalotti
C
hard
pypy
Due-20080119.0300
ClaimedBy-iammisc


Sign in to add a comment
 
Reported by the.good.doctor.is.in, Dec 07, 2007
'ropes' represent strings using concatenation trees; see

   http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf

and

  http://morepypy.blogspot.com/2007/11/ropes-branch-merged.html

At a minimum implement string concatenation, slicing, repetition, and
search/replace.  Build this either as a straight C/C++ extension type,
or as a Python wrapper around a C module.  Provide automated tests and
a setup.py/distutils file.

For extra extra credit, implement a unicode version too (py 2.6;
'bytes' and 'str' for py 3k).

Completion:

Attach a .tar.gz or zip file to this issue and send an e-mail to the
mailing list.

Task duration: please complete this task within 5 days (120 hours) of claiming it.
Comment 1 by masterdriverz, Dec 07, 2007
I claim this task
Comment 2 by the.good.doctor.is.in, Dec 07, 2007
(No comment was entered for this change.)
Status: Claimed
Labels: ClaimedBy-masterdriverz Due-20071212.1600
Comment 3 by masterdriverz, Dec 07, 2007
I have actually been working on this before it was added as a task, so here is the
current progress I've made.
rope.c
16.6 KB Download
Comment 4 by cfbolz, Dec 07, 2007
You might want to look at the Python implementation that PyPy has for comparison:

https://codespeak.net/svn/pypy/dist/pypy/rlib/rope.py

I advice you to ask for help early, since this task is really hard.
Comment 5 by fastnix, Dec 07, 2007
You are lucked and faster then me. Good luck:)
Comment 6 by cfbolz, Dec 07, 2007
fastnix: you might be interested in Tasks 248 or 239.
Comment 7 by masterdriverz, Dec 07, 2007
For the interested (apparently a few), I've uploaded the git repo I'm working from to
http://dev.gentooexperimental.org/~masterdriverz/ropes.git
Comment 8 by avassalotti, Dec 07, 2007
Hi,

I started to read the code you posted and you seems to be on the right path. I saw a
few stylistic issues, but nothing important. I also see that your implementation is
still missing the rebalancing code, which I believe is the hardest part to get right.
Anyway, keep up the good work!
Comment 9 by masterdriverz, Dec 07, 2007
avassalotti:
Thanks, I will definitely refer to the PyPy implementation for that part. Is there a
coding style guide for the C elements of python? 
Comment 10 by cfbolz, Dec 07, 2007
Hi,

I can really only encourage you to have good unittest coverage, otherwise it will be
incredibly hard to cover all corner-cases. The PyPy code I linked to above contains a
rather well-tuned hash function that is composable: e.g. you can compute it from the
hash functions of the sub-ropes for concatenation-nodes.
Comment 11 by georg.brandl, Dec 07, 2007
The C coding style is documented in PEP 7. Note that old code uses tabs, while new
code uses 4 spaces, like for PEP 8.
Comment 12 by avassalotti, Dec 07, 2007
> Is there a coding style guide for the C elements of python?

Indeed. See PEP 7 [1].

But these aren't the type of stylistic issues I was referring to. Here some examples:

Use Py_ssize_t [3] for these:
	/* Length of entire rope */
	long len;
	/* Cached hash of rope, -1 means uncalculated */
	long hash;


This is unnecessary: 
	if (kwds && PyDict_Size(kwds)) {
		PyErr_SetString(PyExc_TypeError, "rope takes no keyword arguments");
		return -1;
	}


I think the "S" format spec (but I could be wrong) could be used here:
	if (!PyArg_ParseTuple(args, "O", &obj))
		return -1;
	if (!PyString_Check(obj))
		obj = PyObject_Str(obj);
	if (!obj)
		return -1;
	assert(PyString_Check(obj), -1);

That's a bit cryptic (just try to read it out loud), in my humble opinion:
		return !!strstr(PyString_AS_STRING(self->str), str);

Anyway, you shouldn't worry too much about these issues. These can be fixed easily
after you completed the task.

[1]: http://www.python.org/dev/peps/pep-0007/
[2]:
http://docs.python.org/dev/3.0/c-api/utilities.html#parsing-arguments-and-building-values
[3]: http://www.python.org/dev/peps/pep-0353/
Comment 13 by avassalotti, Dec 07, 2007
Just out of curiosity, is the cyclic garbage-collector really necessary? I don't see
how the strings in rope could form cyclic references.
Comment 14 by avassalotti, Dec 07, 2007
Oops, I just realized that I said something stupid in comment 11. Don't use
Py_ssize_t for the hash, keep it as long. While am at it, here is the standard
function for hashing strings in Python:

static long
string_hash(PyStringObject *a)
{
	register Py_ssize_t len;
	register unsigned char *p;
	register long x;

	if (a->ob_shash != -1)
		return a->ob_shash;
	len = Py_Size(a);
	p = (unsigned char *) a->ob_sval;
	x = *p << 7;
	while (--len >= 0)
		x = (1000003*x) ^ *p++;
	x ^= Py_Size(a);
	if (x == -1)
		x = -2;
	a->ob_shash = x;
	return x;
}

Comment 15 by cfbolz, Dec 07, 2007
It's an interesting design question whether ropes should have the same hash as Python
strings. In PyPy we decided against this, to make the hash much easier computable.
(see my last comment). The CPython implementation maybe should hash identical to
strings for better interoperability of the two types.
Comment 16 by masterdriverz, Dec 07, 2007
avassalotti:
Thanks for your comments, they're much appreciated :)

> Use Py_ssize_t [3] for these:

Thanks, changed a couple of these (including start and end for substr, but not hash :)

> This is unnecessary:

Well without it `Rope(s, foo="bar")` doesn't complain, even though foo has no effect,
which personally I don't like (also it doesn't replicate str's behaviour).

> I think the "S" format spec (but I could be wrong) could be used here:

Using "S" means Rope(["foo"]) won't work (whereas str(["foo"]) gives '["foo"]').

> That's a bit cryptic (just try to read it out loud), in my humble opinion:

Well it converts the pointer returned by strstr to a boolean (I thought !! was a
common idiom for this purpose?). I might add a comment explaining this, then again
the code may vanish altogether as it doesn't work for finding a string that overlaps
on a concat or repeat.

> Just out of curiosity, is the cyclic garbage-collector really necessary? I don't see
> how the strings in rope could form cyclic references.

Remember it's an acyclic graph - personally I thought it was easier to add the
traversal code than prove that there couldn't be a circular graph (and for that not
to change as the code developed).

> Oops, I just realized that I said something stupid in comment 11. Don't use
> Py_ssize_t for the hash, keep it as long.

No worries :)

> While am at it, here is the standard
> function for hashing strings in Python:

Thanks, this is nicer than the one I had before. However, as I can't apply the
initial and final XORs (they aren't composable, which is necessary as pointed out by
cfbolz), so it's not as strong as I'd like. Ah well, I doubt it's too much of a
performance hit for dicts.
Comment 17 by avassalotti, Dec 07, 2007

> > I think the "S" format spec (but I could be wrong) could be used here:
>
> Using "S" means Rope(["foo"]) won't work (whereas str(["foo"]) gives '["foo"]').

Ah, that's what I thought. I wasn't sure whether "S" called PyObject_Str implicitly.
I verified and it turns out it doesn't. So, keep "O".

> Well without it `Rope(s, foo="bar")` doesn't complain, even though foo has
> no effect, which personally I don't like (also it doesn't replicate str's
> behaviour).

What do you think about this, then?

diff --git a/rope.c b/rope.c
index 7e0095f..5fa01a5 100644
--- a/rope.c
+++ b/rope.c
@@ -487,18 +487,22 @@ static int
 rope_init(rope *self, PyObject *args, PyObject *kwds)
 {
        PyObject *obj;
+       static char *kwlist[] = {"source", 0};
 
-       if (kwds && PyDict_Size(kwds)) {
-               PyErr_SetString(PyExc_TypeError, "rope takes no keyword arguments");
-               return -1;
-       }
-       if (!PyArg_ParseTuple(args, "O", &obj))
+       if (!PyArg_ParseTupleAndKeywords(args, kwds,
+                                        "|O", kwlist, &obj))
                return -1;
 
-       if (!PyString_Check(obj))
+       if (obj == NULL) {
+               obj = PyString_FromString("");
+       }
+       else if (!PyString_Check(obj)) {
+               /* XXX The reference to the original object is lost here.
+                  This will cause a leak. */
                obj = PyObject_Str(obj);
-       if (!obj)
-               return -1;
+               if (!obj)
+                       return -1;
+       }
        assert(PyString_Check(obj), -1);
 
        Py_INCREF(obj);


Comment 18 by avassalotti, Dec 07, 2007
Oops. There's a small bug in the above diff.

@@ -486,19 +486,23 @@
 static int
 rope_init(rope *self, PyObject *args, PyObject *kwds)
 {
-       PyObject *obj;
+       PyObject *obj = NULL;

Comment 20 by avassalotti, Dec 07, 2007
+               /* XXX The reference to the original object is lost here.
+                  This will cause a leak. */

Forget what I said in this comment. It's just me being stupid again.

By the way, I think you should set a limit on the length a rope can have. That will
prevent the interpreter to crash when computing the repr of a rope if it exceeds the
maximum length a string can have. It should also prevent the length of the rope from
overflowing. For example:

  >>> from rope import rope
  >>> r = rope("hello")
  >>> r = r*(2**32-1)
  >>> len(r+r)
  -2147483648

You probably want to fix these bugs too:

  >>> r = rope('abc')
  >>> r + ''
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  RuntimeError: Assertion "0" failed in rope.c, line 274 (_rope_str)
  >>> (r+r) + ''
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  SystemError: ../Objects/stringobject.c:4106: bad argument to internal function
  >>> r[:]
  MemoryError
  >>> r += 'hello'
  >>> r
  [1]    10455 segmentation fault (core dumped)  python
  >>> r = rope('hello')
  >>> r[6:-1]
  [1]    10507 segmentation fault (core dumped)  python

Finally, here is some potential (micro) optimizations:

  >>> r = rope('')
  >>> r * 10**9            # just return an empty rope
  >>> r + rope('hello')    # just return rope('hello')

However, you shouldn't worry to much about these things for
now.
Comment 21 by masterdriverz, Dec 08, 2007
avassalotti: Thanks, these should all be fixed now, except for the
micro-optimizations (hopefully the regression tests work too)
Comment 22 by avassalotti, Dec 08, 2007
You should raise a TypeError, instead of converting implicitly the objects:

    >>> Rope('hello') + 'hello'
    Rope("hellohello")
    >>> Rope('hello') + 2
    Rope("hello2")

Comment 23 by avassalotti, Dec 08, 2007
And oh, some nitpicking, please keep your code within the ANSI C89 standard:

   static inline rope *
   gen_rope(void)
   {

The 'inline' keyword is from C99. (The GCC tries to inline functions declared
'static' anyway). 

Comment 24 by avassalotti, Dec 08, 2007
I was wandering how your code would look in ANSI C89 -- i.e., without the anonymous
struct/union. So, what is my conclusion? It is gross. :-) Although if you want to get
your rope implementation into the Python standard library, it would probably be a
good idea to accept slightly more verbose ANSI C89 version. Anyway, it is your
project; you choose. :) 
ansi-c89.patch
10.4 KB Download
Comment 25 by masterdriverz, Dec 09, 2007
avassalotti:
Yeah, I really like having anonymous structs/unions :( I committed your patch with a
few slight modifications (eg, renaming item to child for v.repeat, since it can have
the same name as v.substr.child as they're now in different scopes), and rope_concat
will now only implicitly convert strings.
Comment 26 by masterdriverz, Dec 09, 2007
Also, I added those micro-optimizations you mentioned earlier in the thread.
Comment 27 by masterdriverz, Dec 09, 2007
Ok, some questions as to further things to add:

How much caching should be done (if any more at all)? Ropes with len < 5?

Should I add a load more types for the string functions (upper, lower, title, join
etc come to mind), or should I return a rope with these things applied?
Comment 28 by cfbolz, Dec 09, 2007
> How much caching should be done (if any more at all)? Ropes with len < 5?

I would definitely use a prebuilt array of ropes of len == 1. Not sure caching is
worth it, I wouldn't bother at the moment, it's more an optimization to be done later.

> Should I add a load more types for the string functions (upper, lower, title, join
> etc come to mind), or should I return a rope with these things applied?

I wouldn't do upper, lower, title, I can't imagine them being particularly
performance relevant. Maybe join, but even that could be done later. In fact PyPy
doesn't even have a multiplication node (can be done easily with concat nodes) and a
slicing node (can be done by taking as much as possible from the existing tree and
adding some few nodes on the fringe). New node types have a cost, which is the need
to deal with them specifically in the rebalancing logic and in some other places.

I would really start working on the rebalancing now, it is the most complex piece of
code you will have to write and without it the whole rest is not really worth that much.

Comment 29 by avassalotti, Dec 09, 2007
> Yeah, I really like having anonymous structs/unions :( I committed
> your patch with a few slight modifications (eg, renaming item to
> child for v.repeat, since it can have the same name as
> v.substr.child as they're now in different scopes),

Ah, I see you changed 'char braid' back to 'int braid'. I realized
after posting the patch that the 'char' left a 3 bytes hole in the
structure. So, 'int' was a better choice. :-)

> and rope_concat will now only implicitly convert strings.

IMHO, you souldn't do any implicit conversion. As some people would
say, it is "unpythonic" (Python favors strong-typing).

> How much caching should be done (if any more at all)? Ropes with len < 5?

> Should I add a load more types for the string functions (upper,
> lower, title, join etc come to mind), or should I return a rope with
> these things applied?

Carl is right. I should start working on the rebalancing as soon as
possible. Adding more caching won't speed up things much; it may in
fact slow down your implementation (Personally, I find the caching of
repr already too much). And adding more rope types will just make the
rebalancing more complex. The real bottleneck is the height of
concatenation tree. For example

   >>> r = rope('a')
   >>> r * 100000
   Rope("aaaaaaaa...")
   >>> for _ in range(100000):
   ...    r += rope('a')
   ...
   >>> r
   [Memory-usage spikes up and the kernel kills the process]

So, you should gears up your optimizations toward mitigating
this -- e.g., rebalancing the tree, fusion of small nodes, etc. 


Comment 30 by cfbolz, Dec 12, 2007
how is the progress here? There were no changes in the git repo for a while. Do you
need help or advice?
Comment 31 by masterdriverz, Dec 12, 2007
Sorry, I've been busy at school recently (exams and the like). Some str functions and
some simple rebalancing pushed. I'm sure there must be more to rebalancing, but I
can't see what I'm missing. Should I recurse down the two sides of the rope being
rebalanced?
Comment 32 by cfbolz, Dec 12, 2007
there's quite a bit more to rebalancing, have you looked at the logic in the pypy
code? What you implemented so far works, but is rather simplistic, since you are
always flattening the string completely. The output of the rebalancing should be a
tree that represents the whole string but in which for every occuring concatenation
nodes, the left and the right tree have approximately the same depth.
Comment 33 by masterdriverz, Dec 12, 2007
Is that true of every node in the tree?
Comment 34 by cfbolz, Dec 13, 2007
yes. Of course you cannot make it completely true in some cases (e.g. if the tree
doesn't have a power of 2 of leaf nodes). you also might take a look at the paper
linked above:

http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf

It describes the rebalancing algorithm rather well. The PyPy code implements exactly
this algorithm, so with both together you should manage to implement it.
Comment 35 by georg.brandl, Dec 15, 2007
I think that, due to the hard nature of this task, it's fair to give a 3-day
extension. That will make the task expire today, and then you'll have another 3 days
grace period to complete it.
Labels: -Due-20071212.1600 Due-20071215.1600 near-expiration
Comment 36 by cfbolz, Dec 15, 2007
Hi Georg!

Thanks for writing this, I was about to ask :-)
Comment 37 by masterdriverz, Dec 18, 2007
Sorry for my inactivity, I've had exams and a conspicuous lack of internet 
connectivity. George, thanks for the extension, I defininitely underestimated the 
complexity of the rebalancing operation. I think I have it though, I'll upload my 
attempt tonight.
Comment 38 by cfbolz, Dec 22, 2007
Hi!

What's going on there? Are you still progressing? The deadline of this task is
completely over, so if I don't hear from you immediately I will have to reopen the task.
Comment 39 by masterdriverz, Dec 22, 2007
Sorry, I completely dropped the ball on this. I'm still lacking internet access from
my home machine, but I'll try and upload what I've done tomorrow.
Comment 41 by cfbolz, Dec 28, 2007
I am putting this task as open again.
Status: Open
Labels: -ClaimedBy-masterdriverz -Due-20071215.1600 -near-expiration
Comment 42 by iammisc, Dec 28, 2007
I claim this task.
Comment 43 by doug.hellmann, Dec 29, 2007
This task is due January 03, 2008 04:20:00 UTC
Labels: Due-20080103.0420 ClaimedBy-iammisc
Comment 44 by doug.hellmann, Dec 29, 2007
(No comment was entered for this change.)
Status: Claimed
Comment 45 by iammisc, Dec 29, 2007
I just wanted to give a little update on my progress. So far, I've written the basic
appending and repetition code. Also, I've written a rebalancer which rebalances the
entire tree. Converting to a string and getting the length via len() also works.
Tomorrow, I hope to get slices and indexing working. I've written some unit tests
that test the rebalancer, the appending function, and the string conversion which use
large computer generated "Lorum ipsem..." The tests all pass right now. I tried to
avoid any memory leaks in the code but running it through valgrind pops up a number
of memory leaks in the interpeter itself and it is a bit difficult to sort through
the ones that my module is causing. Is there any better way to do this?

I'll try to get a module that fulfills all the criteria in about two days. From there
I would like to add some more functionality so that it more closely resembles the
python string type.

Talk to you later.
Comment 46 by avassalotti, Dec 29, 2007
> I tried to avoid any memory leaks in the code but running it through
> valgrind pops up a number of memory leaks in the interpeter itself and
> it is a bit difficult to sort through the ones that my module is
> causing. Is there any better way to do this?

See Misc/README.valgrind in the source distribution of Python. By the way, why do you
worry about memory leaks? I don't think ropes needs a lot, if any, of manual memory
management. Are you trying to track reference leaks, instead?


Comment 47 by iammisc, Dec 30, 2007
Yes. I'm not 100% sure if my code does reference counting properly. I don't want to
have a rope kept in memory because i didn't do reference counting right and i
definitely don't want to have it deleted from right under me.
Comment 48 by avassalotti, Dec 30, 2007
Then, you probably want to get debug build of Python. Read the
GettingAndCompilingPython wiki page first. Then, run configure with the
--with-pydebug option. This will add some debugging code, reference tracker, into the
interpreter.

When you will run your debug build of the interpreter, you will see the total
reference count will after each expression you enter. if your code is leaking
references, try to call (or recreate) an object twice. If on the second try, the
reference count is incremented (or subtracted) then you probably want to check your
code. This trick works because Python, when run interactively, always keep a
reference to the result of the last expression entered, in the variable _. So, by
calling an object twice, assuming the call doesn't have any side-effects, the total
reference should remain equal.

However, as I said, this trick only works for expressions without side-effects. For
example, appending a same object to a list will increment the count every time you do
so. Also, you need to be aware of tricky cases -- such as the fact that Python
doesn't deallocate integers < 10, interned strings, etc. So, it can become unwieldy 
quickly.

That is where you want to have a thorough set of unit tests, and run it with the
Lib/test/regrtest.py script using the -R4:3 option. At this point, you want to work
directly in Python's source directory -- that is copy your module in Modules/ and
modify setup.py to make it compile your rope implementation (just like you were core
developer adding a new module). Finally, run make, copy your unit test in Lib/test
and run it with "./python -tt -E Lib/test/regrtest.py -R4:3 <your_test_name>". It
would surely be possible to keep your module from the core distribution (by writing
some script that imports test.regrtest and call its main function directly), but that
would be slightly more cumbersome.

If your code is leaking, regrtest's output will look like this:

alex:py3k% ./python -tt -E Lib/test/regrtest.py -R4:3 test_cmd_line
test_cmd_line
beginning 7 repetitions
1234567
.......
test_cmd_line leaked [0, 51, 0] references, sum=51
1 test OK.
[71090 refs]

Unfortunately, that only tells you whether your module is leaking or not. If it is,
then you will have to find the leak yourself. This can be hard if you never checked
your code for leaks before, since you don't have any reference point to find the
change that introduced the leak (By the way, it is a good idea to use a source
version control to keep the history of your changes -- preferably a distributed one,
like mercurial or bazaar, so that you don't have to go into the troubles of
configuring a central server).

Here is simple checklist to help you find leaks:  

  - In your destructor, have you decremented the reference, with Py_DECREF, of every
objects "owned"? (this apply only if you have PyObject pointers in your object's struct)
  - In methods, have you decremented the reference of objects you created? Even when
there's an error?
  - If you return an object that cames from the method's argument, did you increment
its reference?
  - Do you have any lost references -- i.e., you created an object assigned to a
variable that previously hold an object that you also created (in other words, not
from the method's argument)?

If you have any trouble finding a reference leak, please tell. Don't waste your time
on such details. I better like to see the rope implementation completed with a few
reference leaks, than an uncompleted one without leaks.
Comment 49 by iammisc, Dec 30, 2007
Thanks for the quick reply. I will be sure to get a debug build of python quickly(I
use gentoo so this shouldn't be too hard.)

Thanks for the checklist. I believe my code meets all of those criteria but I'll
still check it.
Comment 50 by iammisc, Dec 30, 2007
Can I use the ghop repository for version control? Do I already have write
privileges? Where should I put it under?
Comment 51 by doug.hellmann, Dec 30, 2007
Yes, you can use the ghop-python-2007 project svn repository.  You are already a
member of that project, so you should be able to create and update files there. 
Please create a directory for your project under /trunk/ghop-python-2007.  Either
give it a descriptive name, or just use the issue number.  Inside that directory you
can organize your files in whatever way makes the most sense to you.

If you have problems getting set up with svn, send email to the ghop-python list with
error messages, and other details and we'll try to help you out.
Comment 52 by iammisc, Dec 30, 2007
Okay I've uploaded my code. It is now under the ropes folder. Please take a look and
tell me what you think. It isn't complete yet and there are probably some cases where
I used rather ugly code or less than optimal algorithms. As of right now, I've tested
it on a 64-bit machine and it passes the unit tests I've given it. I know that the
unit tests don't cover all the functionality right now because that functionality is
incomplete, most notably hashing. I'm still testing the hashing code and as such,
I've used really ugly code that prints the hash out when I call len() on the rope.
This will change once I get hashing working. As of right now, hashing does not work
because of the limitations of C's integer data types. All of them overflow. I've
downloaded pypy and made the hashing code for ropes more verbose. The hashes are
exactly the same until the C code's integer overflows and I'm left with a negative
number. Are there any solutions to this?

I did some cleanup on the code itself today so there is not much improved
functionality from yesterday other than the ability to get the character at a
specific index.

Right now, I'm using a static array of character to represent a literal. This has the
limitation that a literal rope node can only have a certain amount of characters(64
at the moment). I did this because it makes everything much easier to handle in terms
of memory management. Is this ok or do you want me to dynamically allocate buffers?

Lastly, I will not be able to do much tomorrow because it's New Year's Eve and and
we're having a party. On New Year's itself I also probably won't be able to do much.
But I will be able to devote more time on the 2nd. 

I'll be waiting your comments on the code.
Comment 53 by avassalotti, Dec 31, 2007
Before I go through your code and review it, I would like to know how much you care
about this task. I know that the task is hard, and therefore I am willing to close my
eyes on a few issues. So, I can either review it very throughly and point out every
single issues. However if your goal is just to complete the task, you may find this
quite annoying. On the other hand, if your goal is to write a great code, then you
will probably love this. So for this time, I let you choose which review method you
like better.


Comment 54 by iammisc, Dec 31, 2007
No I would really like you to review it thoroughly. I'd rather write great code so
point out every issue and I'll fix it.
Comment 55 by avassalotti, Jan 01, 2008
First, the stylistic issues:

  - Please follow the coding conventions described in PEP-7
    (http://www.python.org/dev/peps/pep-0007/). Don't clump everything
    together; structure your code in paragraph, just like you would do
    for an essay.
  - Use shorter name for your variables in your main struct.
  - Don't use the Py prefix for public function, this is reserved for
    the interpreter core objects and functions.

I think your main rope data-structure for rope objects needs to be
worked out better. Boehm's paper (pages 7-8) highlights an elegant way
to represent ropes using only concatenation and closure nodes. It is
elegant since it allows you to define as many rope types you want,
without increasing the complexity of the rebalancing operations, while
keeping the main data-structure compact. I am including an
implementation of this idea filled with example how to use it. You are
free to use it, or not, for your implementation.

Your implementation seems to be fairly straightforward. However, I
believe many parts could (and should) be simplified. For example, I
think you are overly clever in rope_index(). (Your use of tail
recursion is surely neat, but not necessary since rope depth should
be bounded. You shouldn't use a fixed array for literals, because this
is making your rope object unnecessarily large (in term of memory).

Quick question, what is this for?

void rope_destroy(PyRopeObject* r)
{
  rope_traverse(r, _rope_dec_refcount);
}

Finally, I have found these bugs in your implementation:

>>> r + "world"
Rope('hello world')
>>> "world" + r
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot concatenate 'str' and 'ropes.Rope' objects

>>> "".join(r)
104
104000413
-1908071293
858739893
-1881877042
182784266
-403830251
-1457451538
-1708129860
-720966496
1731276356
before:1731276356
after:1731276356
1731276356
'hello world'
>>> len(r)
104
104000413
-1908071293
858739893
-1881877042
before:-1881877042
after:-1881877042
-1881877042
5

>>> r[0:3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence index must be integer, not 'slice'

>>> r = Rope('')
>>> r + r + r + r
[1]    10197 segmentation fault (core dumped)  python
>>> r = Rope('a')
>>> r = r + r + r
>>> str(r)
[1]    10203 segmentation fault (core dumped)  python

>>> r = Rope("abc")
>>> r = r * 1 
>>> len(r)
b: 0
masked_power: 1
97
97000389
-1156378894
before:-1156378894
after:-1156378894
[1]    10126 segmentation fault (core dumped)  python

>>> hash(r)
-1210995232
>>> hash(r*1)
-1210995056

>>> r*2
Rope('abcabc')
>>> s
Rope('abcabc')
>>> r*2 == s
False

>>> r = Rope("abc")
>>> s = Rope("abcabc")
>>> r+r == s
False

>>> r = Rope("abc")
>>> s = Rope("abcabc")
>>> hash(r+r)
-1211166160
>>> hash(s)
-1211166688

>>> while 1:
...   r = Rope("abc" * 2000)
... 
[memory leak]

>>> hash(r*1)
-1210225008
>>> hash(r+r+r)
-1210224832
>>> hash(r*1)
[1]    10164 segmentation fault (core dumped)  python

>>> r = Rope("")
>>> r[-1]
before:0
after:0
0
'\x00'

>>> r = Rope()
>>> r.__init__(r, "hello")
>>> r.__init__(r, "hello")
[1]    10217 segmentation fault (core dumped)  python

>>> r = Rope('')
>>> r.__init__(r, "hello")
[infinite loop]

>>> r = Rope('')
>>> Rope.__init__(r, "hello")
>>> Rope.__init__(r, "hello")
>>> Rope.__init__(r, "hello")
>>> r
Rope('hellohellohello')
[weird; and nasty for subclasses]


And one more thing, happy new year!
rope_example.c
6.5 KB Download
Comment 56 by avassalotti, Jan 01, 2008
Oh, I forgot to mention something. Here's an invocation of GNU indent, that will
format your code automatically (almost perfectly) to PEP-7 C coding convention. You
will just need to specify your typedefs, like PyRopeObject, with the -T flag.  

For Python 2.x, use this:
indent -ut -ts8 -i8 -l79 -bap -nbad -ncdb -ncs -sc -br -nce -cdw -cli0 \
  -nss -npcs -saf -sai -saw -nbc -di1 -brs -psl -lp -nlps -nbbo \
  -T PyObject <filename>

For Python 3.x, use this:
indent -i4 -l79 -bap -nbad -ncdb -ncs -sc -br -nce -cdw -cli0 \
  -nss -npcs -saf -sai -saw -nbc -di1 -brs -psl -lp -nlps -nbbo \
  -T PyObject <filename>

Comment 57 by iammisc, Jan 02, 2008
First of all, thanks for reviewing it so thoroughly, I'll be sure to look into all
the things you mentioned. However, I think you must have misunderstood me
avassalotti, you reported some things as bugs which I said(at least I thought I said)
aren't complete such as slicing or hashing. The only things implemented are
concatenation, and repetition.

> I think your main rope data-structure for rope objects needs to be
> worked out better. Boehm's paper (pages 7-8) highlights an elegant way
> to represent ropes using only concatenation and closure nodes. It is
> elegant since it allows you to define as many rope types you want,
> without increasing the complexity of the rebalancing operations, while
> keeping the main data-structure compact. I am including an
> implementation of this idea filled with example how to use it. You are
> free to use it, or not, for your implementation.

Thanks. I'll look into it.

> Quick question, what is this for?
> 
> void rope_destroy(PyRopeObject* r)
> {
>   rope_traverse(r, _rope_dec_refcount);
> }

This is an old function that I used when I first started the implementation to
decrement the reference count of all the nodes children. However, it's no longer
needed and I just forgot to take it out.

> >>> len(r)
[ disregard the output from len ]
> 5

I don't see how this is a bug. As far as I can tell r is equal to "hello" and that is
length 5 right?

> >>> r = Rope("abc")
> >>> r = r * 1 
> >>> len(r)
[disregard len's output ]
> [1]    10126 segmentation fault (core dumped)  python

This does not happen on my computer. This could be because I might have fixed the bug
after committing the copy you looked at in the subversion repo. However, I'll check
it on another computer. Also, the computer I'm developing from is 64-bit. Is your's
32-bit? This could be a problem if there's a bug in the code that makes an assumption
about bit size although I doubt it.

> >>> while 1:
> ...   r = Rope("abc" * 2000)
> ... 
> [memory leak]

I am unsure about what you mean by this or how I am supposed to fix it. How exactly
is the memory being leaked?

> >>> r = Rope('')
> >>> Rope.__init__(r, "hello")
> >>> Rope.__init__(r, "hello")
> >>> Rope.__init__(r, "hello")
> >>> r
> Rope('hellohellohello')
> [weird; and nasty for subclasses]

This is because I basically made the init method append its arguments to itself. I'll
be sure to fix this.

Those are the only bugs that I had something to say about. I'll look into the rest
that relate to the things I've implemented. And I'll start finishing up support for
slicing, comparisons, and search/replace.

> And one more thing, happy new year

You too.




0 bytes Download
Comment 58 by avassalotti, Jan 02, 2008
Ah, you are right. I missed your comment about hashing. Sorry, about that.

> I've used really ugly code that prints the hash out when I call len() on the rope.
> This will change once I get hashing working. As of right now, hashing does not work
> because of the limitations of C's integer data types. All of them overflow. I've
> downloaded pypy and made the hashing code for ropes more verbose. The hashes are
> exactly the same until the C code's integer overflows and I'm left with a negative
> number. Are there any solutions to this?

Overflow is expected for the hashing functions. You don't have to make your ropes
hash equals to PyPy's ropes. The only requirements regarding the hashing function are:
  1) It should generate evenly distributed hash keys on random input.
  2) Equal strings should have equal hashes.

> This does not happen on my computer. 

You removed the debugging printf in rope_length(), right? If so, you also removed the
 the call to rope_hash() that caused the crash. The problem is rope_hash()
dereferences a null pointer when given a MULTIPLY node.

	long x = rope_hash(rope->n_node.concat.left) +
	    rope_hash(rope->n_node.concat.right) * masked_power(1000003,
    [right is null for multiply nodes] -----^

However, I am not sure why "hash(r * 1)" (sometimes) works, though.

> > I am unsure> >>> while 1:
> > ...   r = Rope("abc" * 2000)
> > ... 
> > [memory leak]
>
> I am unsure about what you mean by this or how I am supposed to fix it. How
> exactly is the memory being leaked?

Ropes are never deallocated (This is almost certainly due to a reference-leak). To
make my example clearer:

  while 1:
    r = Rope("abc")
    del r


Comment 59 by iammisc, Jan 02, 2008
> Ropes are never deallocated (This is almost certainly due to a reference-leak). To
> make my example clearer
>
>   while 1:
>     r = Rope("abc")
>     del r

I don't understand? Ropes are never deallocated? Shouldn't they be deleted once they
are no longer used(e.g. once their reference count becomes 0). 

Maybe I'm not understanding you correctly, but what exactly should happen in the case
of your while loop? 
Comment 60 by avassalotti, Jan 02, 2008
Indeed, a rope should be deallocated when its reference count becomes 0. The problem
is, some yet unknown reason, the reference count never gets to 0.

In the while loop, the rope should get deallocated at each iteration. However, it
isn't the case right now, thus "the memory leaks."   
Comment 61 by iammisc, Jan 02, 2008
Ah, ok. The newer version should not have this problem. I have committed my changes
to the repository that should fix this problem as well as most of the other problems
you described. Please note however that there is still no hashing or comparison support.
Comment 62 by iammisc, Jan 02, 2008
I just committed a new version that fixed a nasty bug in the last version which
caused the ropes module to completely fail its unit tests due to allocating literals
dynamically.
Comment 63 by iammisc, Jan 02, 2008
The new version now has comparison support(between ropes) and iteration support. I'll
be working on hashing and search/replace tomorrow.
Comment 64 by avassalotti, Jan 02, 2008

I reviewed your code some more, I saw a few more things to fix:

  - Don't use random value for your unit tests (testSlicing). Use specific boundary
    values, prone to failures, instead. 
  - Strip the Lorem ipsum. Instead, generate long strings with repetition -- e.g.
    data1 = "XXX" * 200; data2 = "YYY" * 200; data3 = ...
  - Don't reinvent the wheel. Use the shared mixins for testing strings for
    your tests. Just import test.string_tests, and use what you need.
  - Remove Rope.append. Ropes should be considered as immutable.
  - Remove the "debugging" getters from the public interface of Rope
    (e.g. Rope.type). However, you can keep them for now, if they help you.
  - Declare 'static' all your functions. Combine 'python.c' and 'rope.c' if
    necessary.
  - Remove any implicit type conversion -- e.g. Rope + str should raise a TypeError.
  - In rope_repr(), escape any control characters. For example, this:
      >>> Rope('hel\n\tlo')
      Rope('hel
              lo')
    should be:
      >>> Rope('hel\n\tlo')
      Rope('hel\n\tlo')
    You probably can reuse the implementation of, or use, PyString_Repr() for this.
  - Please put all your debugging functions (e.g. print_rope()) at the same place
    and label them as such. Also, protect your debugging printf calls with
    if-statements. For example, "if (DEBUG) { printf(...); }".
  - Use Python's own memory allocator (PyMem_Malloc/PyMem_Free), instead of
    malloc/free. 
  - Remove rope_incref(). You don't need it, and it's causing (major) reference
    leaks. You just need to call Py_INCREF once on an existing node (not a brand
    new one) to a concatenation node.
  - Don't check the type of 'self', in your methods. This is done automatically be
    the interpreter. Remember you are implementing an extension type, not a whole
    new public API for Python.
  - Define a macro, Rope_Check, if you really need to do this (I don't think so):
    "self->ob_type != &ropes_type".
  - Fix the compiler warnings. I included a build log with the '-ansi -pedantic'
    flags turned on. You can ignore the warnings about 'long long' in longobject.h.
  - Mark fall-through case-clauses explicitly (e.g, ROPE_SLICE_NODE in
    rope_to_string()).

Now, I have some questions for you:

  - Shouldn't rope_move() deallocates its 'src'? Otherwise, why is it called
    rope_move, and not rope_copy? Also, why this function is needed at all?
  - Why rope_append() is so long and complex? Why do you walk down the whole
    concatenation tree (to get the right-most leaf) to append a node? Couldn't you
    just create a new concatenation node and set the right children to 'self' and
    the left children to the rope to be appended?

Comment 65 by avassalotti, Jan 02, 2008
Oops, I forgot to include the build log for the compiler warnings.
build_warnings.txt
5.1 KB Download
Comment 66 by iammisc, Jan 02, 2008
>  - Don't use random value for your unit tests (testSlicing). Use specific boundary
>    values, prone to failures, instead. 

Alright, I just used random values in the beginning because it made my code fail and
allowed me to make the slicing code better.
  
> - Strip the Lorem ipsum. Instead, generate long strings with repetition -- e.g.
>    data1 = "XXX" * 200; data2 = "YYY" * 200; data3 = ...

Ok.

>  - Don't reinvent the wheel. Use the shared mixins for testing strings for
>    your tests. Just import test.string_tests, and use what you need.

I'll be sure to do that.

>  - Remove Rope.append. Ropes should be considered as immutable.

Isn't that the same as +=, should I take that out too?

>  - Remove the "debugging" getters from the public interface of Rope
>    (e.g. Rope.type). However, you can keep them for now, if they help you.

I'll take them out once I am sure they're no longer needed. They really help me with
runtime debugging right now.

>  - Declare 'static' all your functions. Combine 'python.c' and 'rope.c' if
>    necessary.

I was originally going to make rope.c a ropes library in C but with the way it's gone
it no longer makes a difference so I'll change it.

>  - Remove any implicit type conversion -- e.g. Rope + str should raise a TypeError

This is another debugging remnant. I just didn't want to type Rope(' all the time.
But I'll change it.

>  - In rope_repr(), escape any control characters. For example, this:
>      >>> Rope('hel\n\tlo')
>      Rope('hel
>              lo')
>    should be:
>      >>> Rope('hel\n\tlo')
>      Rope('hel\n\tlo')
>    You probably can reuse the implementation of, or use, PyString_Repr() for this.

Ok.

>  - Please put all your debugging functions (e.g. print_rope()) at the same place
>    and label them as such. Also, protect your debugging printf calls with
>    if-statements. For example, "if (DEBUG) { printf(...); }".

Alright but don't you mean a #if? I guess a non brain-dead compiler could optimize
that out though.

>  - Use Python's own memory allocator (PyMem_Malloc/PyMem_Free), instead of
>    malloc/free. 

For some reason, when I first was doing the string conversion function, using
PyMem_Malloc seg faulted but now it works just fine so I'll change these over too.

>   - Remove rope_incref(). You don't need it, and it's causing (major) reference
>    leaks. You just need to call Py_INCREF once on an existing node (not a brand
>    new one) to a concatenation node.

I just realized this problem while going through the code in my head. It'll
definitely be taken out next commit. 

>   - Don't check the type of 'self', in your methods. This is done automatically be
>    the interpreter. Remember you are implementing an extension type, not a whole
>    new public API for Python.

Ok

>  - Fix the compiler warnings. I included a build log with the '-ansi -pedantic'
>    flags turned on. You can ignore the warnings about 'long long' in longobject.h.

I know about the compiler warnings but I wasn't concerned about them at the time
because I was more concentrated on getting everything to work. I will make sure to
fix these up tomorrow.

>  - Mark fall-through case-clauses explicitly (e.g, ROPE_SLICE_NODE in
>    rope_to_string()).

Alright I'll do this.

>  - Shouldn't rope_move() deallocates its 'src'? Otherwise, why is it called
>    rope_move, and not rope_copy? Also, why this function is needed at all?
>  - Why rope_append() is so long and complex? Why do you walk down the whole
>    concatenation tree (to get the right-most leaf) to append a node? Couldn't you
>    just create a new concatenation node and set the right children to 'self' and
>    the left children to the rope to be appended?

I'm answering these questions together because they have something to do with each
other. rope_move() was made for rope_append. You are right about simply creating a
concatenation node. I don't know what I was thinking when I wrote this function.

I'll also format the code according to PEP 7 tomorrow.
Comment 67 by avassalotti, Jan 02, 2008
> >  - Remove Rope.append. Ropes should be considered as immutable.
>
> Isn't that the same as +=, should I take that out too?

If you had specific code to make += in-place, yes. Python will
define x += y automatically as x = x + y, which is the desired behavior.

> > - Please put all your debugging functions (e.g. print_rope()) at
> >    the same place and label them as such. Also, protect your
> >    debugging printf calls with if-statements. For example,
> >    "if (DEBUG) { printf(...); }".
>
> Alright but don't you mean a #if? I guess a non brain-dead compiler
> could optimize that out though.

No, I really meant a if-statement. As you said, the compiler will
optimize it out automatically. The advantage, with an if-statement, is
the debugging code gets checked by the complier. Although, #ifdef is
sometime the only choice -- e.g. at the top-level.

Comment 68 by iammisc, Jan 03, 2008
Eeeks. I ran GNU indent like you said and it makes the code really ugly and almost
unreadable. Do you want me to commit the file like that? I can send you a copy of it
if you like.
Comment 69 by avassalotti, Jan 03, 2008
What do you mean by "ugly and almost unreadable"? Can you show an example? 
Comment 70 by iammisc, Jan 03, 2008
yeah. For an example of what I'm talking about look at line 223. The function call
has been split up into more than 10 lines
ropes.c
13.8 KB Download
Comment 71 by iammisc, Jan 03, 2008
One last question, how exactly do I use the test.string_tests module?
Comment 72 by avassalotti, Jan 03, 2008
Ah, that is what I thought. This means you need to rework your code to use less
indentation.

Here's a few tricks:
  - You can combine nested if-statements by combining their conditions. For example 
       if (cond1) {
           if (cond2) {
               /* do something */
           }
       }
     can be rewritten as:
       if (cond1 && cond2) {
           /* do something */
       }
  - You can also use the Linus Torvalds trick and rewrite this:
      for (i=0; i < 100; i++) {
          if (cond) {
              /* do something */
          }
      }
    as:
      for (i=0; i < 100; i++) {
          if (!cond)
              continue;
          /* do something */
      }
  - Don't add braces for your case-clauses.
  - Extract your code into separate functions.
  - Use better algorithms.
 
Comment 73 by avassalotti, Jan 03, 2008
Oh, it is really easy. You just subclass the test cases you need. You can use
test_str as a reference:

  http://svn.python.org/view/python/trunk/Lib/test/test_str.py?rev=55873&view=markup
Comment 74 by iammisc, Jan 03, 2008
But how do I make it use the ropes type?
Comment 75 by iammisc, Jan 03, 2008
Oh I see nevermind. The problem is the unit tests test for things that I've not
implemented or are not supposed to work. For example it concatenates a string to a
rope and you told me that should fail so the unittests all fail with E because some
sort of exception was thrown.
Comment 76 by iammisc, Jan 03, 2008
I committed a new revision just now that almost fixes all of the issues you
mentioned. I still have to remove some debugging functions(like print_rope()) but
I'll do that once I'm finished. I've also moved everything to one file, ropes.c and
I've renamed the unit tests file test.py to test_ropes.py because it will cause
problems once I'm using test.string_tests.
Comment 77 by avassalotti, Jan 04, 2008
I did a fair amount of refactoring on your code. I added some bells-and-whistles too
(like a hashing function than generate the same hashes as str). I am not done yet,
but I want to show what I got, so you don't do unnecessary work.

ropes.c
23.4 KB Download
Comment 78 by avassalotti, Jan 04, 2008
By the way, for your iterator, you probably want to generate a linked-list of the
leaves and make next() simply advances to the next node in the list.
Comment 79 by iammisc, Jan 04, 2008
Ok thanks avassalotti. The only thing I'm wondering about in your code is the absence
of the slice node type. Is this just something you haven't done or is it something
you want me to take out?

Also, did you commit your changes to the svn repo?
Comment 80 by iammisc, Jan 04, 2008
One more thing. I noticed, you no longer have the sq_contains method, is this
something you also want to take out?
Comment 81 by avassalotti, Jan 04, 2008
> The only thing I'm wondering about in your code is the absence
> of the slice node type. Is this just something you haven't done
> or is it something you want me to take out?

Honestly, don't know. There is a complexity trade-off for the lazily evaluated slice
nodes (for the repeat nodes as well). Personally, I think your implementation would
be much simpler without these lazy nodes. Carl what do you think?

> Also, did you commit your changes to the svn repo?

This repository is yours. I won't commit anything into it, unless I get your permission.
Comment 82 by avassalotti, Jan 04, 2008
> One more thing. I noticed, you no longer have the sq_contains method, is this
> something you also want to take out?

No. I removed temporarily to make the code not depend on the iterator. By the way, do
you that your implementation of sq_contains is quadratic (i.e., O(n**2))?

One last thing, does it bother you that I rewrote parts of your code to be more
idiomatic? I know that you are working hard enough already. So, I don't want to make
this task harder than it should be, by forcing you to review my code.
Comment 83 by iammisc, Jan 04, 2008
> One last thing, does it bother you that I rewrote parts of your code to be more
> idiomatic?

No, that's fine.

I'll rewrite the iterator to use a linked list of nodes instead of creating
subiterators like you said.
Comment 84 by cfbolz, Jan 04, 2008
> Honestly, don't know. There is a complexity trade-off for the lazily evaluated 
> slice nodes (for the repeat nodes as well). Personally, I think your
> implementation would be much simpler without these lazy nodes. Carl what do
> you think?

The following came out of experiments in PyPy. Should be similarly true for this
implementation: 

I think repeat nodes are not really worth it, because you can write a multiplication
algorithm that uses O(log(n)) time and space using concatenation nodes very easily. 

Something similar is true for slicing. If the slice is small, it is much quicker to
make a copy anyway. If you take a big slice of a rope that is somewhat well-balanced
 then you can share most nodes and use only a bit of additional storage (again
proportional to O(log(length)), I think, but haven't tried to prove it).

There are lots of advantages of not having new node-types: Less code-complexity, less
parameters you need to tune, etc.
Comment 85 by iammisc, Jan 04, 2008
I'll think I just will implement slice nodes by copying because the time advantages
of using a slice node is balanced out by the memory advantages of making a copy. This
is becuase, in the current implementation a very large rope is still kept in memory
even though the only reference to it is a slice which only takes one character. Also,
as cfbolz said, slice nodes add complexity. However, I think I'll keep repeat nodes
because they aren't too complex and they also allow very large repeats.

Just a quick question though, cfbolz. In the PyPy page on ropes you calculated the
hash of 'a'*sys.maxint. How did you this if PyPy just uses concatenation nodes and
not repeat nodes?
Comment 86 by cfbolz, Jan 04, 2008
Well, you can concatenate a node to itself. by doing this you can get very long
strings very quickly. Look at this:

http://codespeak.net/~cfbolz/rope_multiplication.png

It's a visualization of the node graph of 'a' * 100000.

This makes calculating hashes very fast too, because for a concatenation node you
need to calculate the hash of both children, which are the same, so you calculate it
only once.
Comment 87 by iammisc, Jan 04, 2008
Ah, I see.
Comment 88 by iammisc, Jan 04, 2008
Avassalotti, I am not sure what you mean when you say my implementation of
ropeobj_contains is quadratic. I iterate over self only once. In the worst case it
could be O(n**2) but it also could be O(n) or less. I am not sure how else to
implement this. Basically, what I am doing inside the nested loop is the same thing a
platform's strcmp() would do except using PyObject*'s. 
Comment 89 by iammisc, Jan 04, 2008
One more thing, do you want me to make ropeobj_compare independent of iterators?
Comment 90 by avassalotti, Jan 04, 2008
Please, ignore my previous comment about the complexity of your ropeobj_contains.
Just didn't understand what I was talking about.

What I meant is, you are using the naive string searching algorithm, which is O((n -
m + 1)*m) where n is the length of the "haystack" and m the length of the "needle".
My point is there exists better algorithms that perform in average linear time --
such as the one Python's str uses (a variant of Boyer-Moore-Horspool).

So perhaps, a better solution would be to convert the ropes to strings (with
rope_str) and use PySequence_Contains.
Comment 91 by avassalotti, Jan 04, 2008
> One more thing, do you want me to make ropeobj_compare independent of iterators?

Only if it makes the code simpler or more readable. 
Comment 92 by avassalotti, Jan 04, 2008
> So perhaps, a better solution would be to convert the ropes to strings (with
> rope_str) and use PySequence_Contains.

Nevermind. This would be actually worse than the naive approach.
Comment 93 by iammisc, Jan 04, 2008
Ah I see. I'll implement a better searching algorithm. It can't be the
Boyer-Moore-Horspool algorithm however because that would require searching backwards
which would make the code more complex. The problem is most of the efficient
searching algorithms require random access to the string but calling rope_index
continuously is more inefficient than just using the naive approach.
Comment 94 by iammisc, Jan 04, 2008
I've redone the iterators to make them more efficient by using a fixed size array.
I'll commit the changes to the repo. I've decided to use your version of ropes.c. The
only problem is that for some reason, my system does not seem to define Py_ssize_t.
Any ideas on why this might be?

Comment 95 by avassalotti, Jan 04, 2008
I think you should stick with the naive approach for now. I will lookup in the
literature if there is a way to use a property of search trees to make comparisons
and  searching more efficient. I will tell you about it if I find anything useful.

> The only problem is that for some reason, my system does not seem to define
> Py_ssize_t. Any ideas on why this might be?

Which version of Python do you use?
Comment 96 by iammisc, Jan 04, 2008
2.4
Comment 97 by avassalotti, Jan 04, 2008
Can you upgrade to 2.5? Otherwise, you could temporarily use this typedef:

#ifdef HAVE_SSIZE_T
typedef ssize_t Py_ssize_t;
#else
typedef long Py_ssize_t;
#endif

Comment 98 by cfbolz, Jan 04, 2008
I would stick to naive searching too for now. You can probably get really efficient
search algorithms on the trees (even much better than effsearch in some cases :-) )
but they are work and really not simple at all. Alexandre, if you happen to find a
nice algorithm in the literature I would be extremely interested to hear about it.
Comment 99 by iammisc, Jan 04, 2008
I'll get python 2.5 ASAP. I've committed the changes to the repo.
Comment 100 by avassalotti, Jan 04, 2008
I did some more work on the reviewed parts. The main changes are:
   - Added proper indexing support, with rope_as_mapping.
   - Added GC support. No more reference leaks!
   - Added rope_from_string, to create a rope from a C string (used in rope_new).
   - Improved the handling of REPEAT_NODE in _rope_str.
   - Fixed a nasty double DECREF in rope_repr. (Oh, you fixed this one already. :-))
   - Renamed rope_traverse to rope_char_iter
   - Added rope_traverse for GC support.

I will review the balancing code and the iterator, tomorrow.
ropes-reviewed.patch
20.6 KB Download
Comment 101 by iammisc, Jan 04, 2008
Balancing isn't done yet. You can review the iterator part. But I wanted to improve
the balancer so that it can merge adjacent literal nodes which are too small.

Thanks for the patch but the only thing that I'm confused about is rope_traverse. I
thought this was only needed for cyclic garbage collection, where the node can hold a
reference to itself. If a node referenced itself, the length of the string would be
infinite and hashes wouldn't compute correctly or would take forever to do so. Not to
mention, the impossibility of converting the rope to string. I really don't see why
this is needed. Could you post an example of code that has cyclic references?
Comment 102 by iammisc, Jan 05, 2008
Just to let you know that I won't be able to do much today as I have a lot of things
to do before school starts again. However, I'll try and finish the balancing code later.
Comment 103 by avassalotti, Jan 05, 2008
Fine, I won't review the balancing code yet.

And for the GC issue, you are right that a rope forms an acyclic graph. Just been
lazy and used the GC to clean up unreferenced objects. I think the INCREF in
rope_concat might be superfluous, but I am not sure yet.

Comment 104 by iammisc, Jan 05, 2008
Okay, I've implemented merging of adjacent nodes of small length in the balancing
code. I've also added a new debugging tool which implements a balance method so you
can test the balancing code. Now the new iterator code which uses a list has a bug
inside which I'll fix tomorrow along with the slicing code which I'm basically going
to port from the PyPy implementation. Note that because the iterators are not
working, comparisons also don't work and so two of the unit tests fail.
Comment 105 by iammisc, Jan 05, 2008
One more thing, do you think MIN_LITERAL_LENGTH is too small? I saw that in the
rope_to_string code you don't split the string; should this be the same way in the
balancing code? E.g. merge all adjacent literal nodes?
Comment 106 by georg.brandl, Jan 06, 2008
Extending until 01-08.
Labels: -Due-20080103.0420 Due-20080108.0420 expired
Comment 107 by iammisc, Jan 06, 2008
Slice support is complete. I've committed it to the svn repo.
Comment 108 by the.good.doctor.is.in, Jan 06, 2008
I hereby nominate this task as the task with the most comments!  Impressive work,
iammisc, and some good mentoring.
Comment 109 by avassalotti, Jan 06, 2008
> I hereby nominate this task as the task with the most comments!
> Impressive work, iammisc, and some good mentoring.

Thanks Titus!
Comment 110 by avassalotti, Jan 06, 2008
> Slice support is complete. I've committed it to the svn repo.

Could you move rope_slice into rope_subscript? There is already a stub for it. 

        ...
        else if (PySlice_Check(item)) {
                Py_INCREF(Py_NotImplemented);
                return Py_NotImplemented;
        }
        ...

You can get the indices of the slice object as follow:

        Py_ssize_t start, stop, step, slicelength;
        if (PySlice_GetIndicesEx((PySliceObject *)item,
                                 self->length,
                                 &start, &stop, &step, &slicelength) < 0) {
            return NULL;
        }

It is not super important, though. Also, the rope_balance seems to have some problem: 

>>> from ropes import Rope
[19942 refs]
>>> r = Rope("hello")
[19944 refs]
>>> r[:]
Rope('hello')
[19946 refs]
>>> r[2:4]
Rope('ll')
[19946 refs]
>>> r[2:4:2]
NotImplemented
[19946 refs]
>>> r[2:4]
Rope('ll')
[19946 refs]
>>> r += r
[19948 refs]
>>> r += r
[19950 refs]
>>> r += r
[19952 refs]
>>> r += r
[19954 refs]
>>> r += r
[19956 refs]
>>> r += r
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1210681152 (LWP 7362)]
0xb7f075f2 in rope_balance (r=0xb7d1c61c) at src/ropes.c:650
650                     if (node_list[i]->type != LITERAL_NODE)


P.S. Could you keep your changes consistent with PEP-7? I find it particularly
difficult to review C code that doesn't the follow PEP-7 coding conventions. 
Comment 111 by iammisc, Jan 07, 2008
I don't get this error on my computer but I do have an explanation for it. I think it
is caused by me not initializing i to 0 at the beginning of the loop. I have
committed the new version to the repo. This new version also fixes the formatting
changes, gets rid of rope_slice, and replaces it with rope_subscript although
stepping support is Not Implemented.
0 bytes Download
Comment 112 by avassalotti, Jan 07, 2008
Sorry, I didn't have the time review the code yet. Nevertheless, I found some bugs:

  - rope_repeat (and other methods that changes the length of the rope) doesn't check
for integer overflow.

>>> r = Rope('A') * int(2**31-1)
2147483647
>>> r = r * int(2**31-1)
>>> len(r)
1

  - rope_concat is leaky. I think this is related to the potentially superfluous
INCREF (see Comment 103). At least, the garbage collector does its work. :-)

>>> r = Rope('1234567890')
[19940 refs]
>>> for _ in range(20):
...   r = r + r
... 
[2018791 refs]
>>> del r
[19943 refs]

On the positive side, I really like the memory behavior of slicing. It was a good
idea to replace the slice nodes. 

>>> from ropes import Rope
[19763 refs]
>>> r = Rope('1234567890')
[19765 refs]
>>> for _ in range(20):
...   r = r + r
... 
[2018616 refs]
>>> r = r[:len(r)/2]
[1019192 refs]
>>> r = r[:len(r)/2]
[519480 refs]
>>> r = r[:len(r)/2]
[269624 refs]
>>> r = r[:len(r)/2]
[144696 refs]
>>> r = r[:130]
[19772 refs]
>>> r = r[:1]
[19770 refs]

However, concatenation will need to be tuned. The following example consumes more
than 800MB of memory:

>>> from ropes import Rope
>>> r = Rope('1234567890')
>>> for _ in range(23):
...   r = r + r
... 
>>> len(r)
639631360

While, a Python string of the same length consumes 610MB [1Bytes * 639631360 *
1MB/(2^30)Bytes = 610MB]. In theory, a rope could consumes much less memory. With a
minimal length of 128, my example could be represented using only with a 160Bytes
literal [min(10 * 2^x) > 128 = 160 (x = 4)] and 19 concatenation nodes [23 - 4]. Each
node has a 40Bytes overhead (on a 32bit machine). Therefore, the whole string could
be represented using a grand total of 960 bytes [160Bytes + (20-node *
40Bytes/node)]. But wait, something doesn't seems right to me; 160 * 2^19 = 83886080,
not 639631360. Hm.

>>> s = "1234567890"
>>> for _ in range(23):
...    s = s + s
... 
>>> len(s)
83886080

Ha ah! I found another bug. <wink> 

Nitpicking: It is bad practice to use GNU indent to reformat reviewed code. 
Comment 113 by iammisc, Jan 07, 2008
>  - rope_repeat (and other methods that changes the length of the rope) doesn't check
for integer overflow.

Alright, I'll fix it.

>  - rope_concat is leaky. I think this is related to the potentially superfluous
INCREF (see Comment 103). At least, the garbage collector does its work. :-)

It could be the INCREF but it also could be the balancing code. I'll check this out.

> On the positive side, I really like the memory behavior of slicing. It was a good
idea to replace the slice nodes. 

Thanks, I feel much more comfortable with the existing implementation too.

> However, concatenation will need to be tuned. The following example consumes more
than 800MB of memory:

Once again, I think this is related to a bad balancing method which needs to check if
adjacent nodes are the same. The balancing method makes a list of nodes. I was
actually thinking of always keeping the rope balanced. All concatenations are always
made to the right side so keeping it balanced all the time wouldn't be too hard.
Actually, I think it will make everything a whole lot easier. What do you think about
this?

> Nitpicking: It is bad practice to use GNU indent to reformat reviewed code. 

:-[ Sorry about that. It's just that I don't like reformatting code after I've
written it. I think I'll pay more attention when *writing* the code next time.


P.S. School starts tomorrow so I won't be able to go as fast as I have. You'll
probably have to wait about 24 hours for new code or comments.
Comment 114 by cfbolz, Jan 08, 2008
> Once again, I think this is related to a bad balancing method which
> needs to check if adjacent nodes are the same.

Why does it need to check that? Are you roughly following the algorithm
in the paper or in the PyPy implementation?

> The balancing method makes a list of nodes. I was actually thinking of
> always keeping the rope balanced. All concatenations are always made
> to the right side

Why are all concatenation always made to the right side? You can imagine
having a huge rope and concatenating something to the left side.

> so keeping it balanced all the time wouldn't be too hard.  Actually, I
> think it will make everything a whole lot easier.  What do you think
> about this?

No, that doesn't make sense to me. The idea of ropes is to make
concatenation extremely fast in most cases (creating one concat one is
much cheaper than copying two strings). In most cases the rope is not
really perfectly balanced but "balanced enough" so that you don't really
care. Only when the imbalance becomes too great you do a full rebalance.

Ah, another note: the rope_index function uses tail recursion (the last
thing the function does is call itself again) which means it can be
trivially rewritten to use iteration instead. This is actually quite
beneficial, because you safe a lot of function calls when the rope is
deep (a good compiler can theoretically do that itself, but you never
know). The result would look roughly like this (untested):

static char
rope_index(RopeObject *self, Py_ssize_t i)
{
        while (1) {
                assert(self && i < self->length);

                switch (self->type) {
                case LITERAL_NODE:
                        return self->v.literal[i];
                case CONCAT_NODE:
                        if (self->v.concat.left && i < self->v.concat.left->length) {
                                self = self->v.concat.left;
                                continue;
                        }
                        else if (self->v.concat.right) {
                                i = i - self->v.concat.right->length;
                                self = self->v.concat.right;
                                continue;
                        }
                        break;
                case REPEAT_NODE:
                        if (self->v.repeat.child) {
                                i = i % self->v.repeat.child->length;
                                self = self.v.repeat.child;
                                continue;
                        }
                        break;
                }
        }
        /* never reached */
        return -1;
}

Comment 115 by avassalotti, Jan 08, 2008
> > Nitpicking: It is bad practice to use GNU indent to reformat reviewed code. 
>
> :-[ Sorry about that. It's just that I don't like reformatting code after I've
written it. I think I'll pay more attention when *writing* the code next time.

No need to be sorry. I am really nitpicking. My comments about the formatting are 
just friendly nudges to make you learn that it is a good habit to keep the code 
clean while you write it. Just by curiosity, what editor do you use for writing your 
code?

> P.S. School starts tomorrow so I won't be able to go as fast as I have. You'll
> probably have to wait about 24 hours for new code or comments.

That is fine. I am thinking marking this task as completed soon, since I don't want 
to drag you over with new requirements. You been doing great work so far, keep it up.
Comment 116 by avassalotti, Jan 08, 2008

> Ah, another note: the rope_index function uses tail recursion 

Oh, it is me that wrote (actually rewrote) rope_index. Travis's initial version used 
a clever technique to reuse the function stack frame (with a goto). I thought this 
was unnecessary (since most modern C compilers do sibling call optimization) and 
rewrote it in a more intuitive style. 

> (the last thing the function does is call itself again) which means it can be
> trivially rewritten to use iteration instead. This is actually quite
> beneficial, because you safe a lot of function calls when the rope is
> deep 

The rope shouldn't have a deep greater than 32. So, the benefit is not noticable.

> (a good compiler can theoretically do that itself, but you never
> know). 

Yes, that is right. I know that GCC does since 3.4 (I am not sure about MSVC, 
though). Anyway, the hand-optimized version doesn't look bad, either.
Comment 117 by cfbolz, Jan 08, 2008
> Oh, it is me that wrote (actually rewrote) rope_index. Travis's initial version 
> used a clever technique to reuse the function stack frame (with a goto). I thought
> this was unnecessary (since most modern C compilers do sibling call optimization)
> and rewrote it in a more intuitive style. 

Oh, I see. I find the while loop pretty natural, but maybe you are right that it does
indeed not matter too much.
Comment 118 by iammisc, Jan 08, 2008
> No, that doesn't make sense to me. The idea of ropes is to make
> concatenation extremely fast in most cases (creating one concat one is
> much cheaper than copying two strings). In most cases the rope is not
> really perfectly balanced but "balanced enough" so that you don't really
> care. Only when the imbalance becomes too great you do a full rebalance.

Well you would basically rotate the tree to the right or left depending on if you're
appending a concatenation node to a literal or repetition node or vice versa. I'm not
quite sure on how to explain this. If you're concatenating a concatenation node to a
literal or repetition node and the concatenation node is balanced(which it should be
because the rope_concat function would have balanced it), then simply rotating the
tree to the right would balance it out. If you're doing it vice versa(e.g. appending
a literal node or repetition node to a balanced concatenation node), then rotating it
to the left would balance it out. If you're appending a literal node or repetition
node to another literal or repetition node, then you don't need any tree rotations.
If you're concatenating two concatenation nodes with the same depth, then the old
method works. The hardest part is concatenating two concatenation nodes with
different depths, in this case, I think you can do the appropriate amount of
rotations depending on the difference in depths between the two nodes. Rotating a
ropes is an O(1) operation and is pretty cheep. Also, this way is faster than my
balancing implementation which now that you mention it, doesn't actually use the
algorithm in the paper or the PyPy one, which is why I would like to change it. In
the way I described above, the cost of rebalancing is spread out so the rope
implementation doesn't seem to all of a sudden slow down because it is balancing a
rope. I don't think the cost of a simple rotation ( in most cases ) would
substantially slow down the code. I think on most modern computers, the effects would
hardly be noticeable(unless you're counting nanoseconds). 

> No need to be sorry. I am really nitpicking. My comments about the formatting are 
> just friendly nudges to make you learn that it is a good habit to keep the code 
> clean while you write it. Just by curiosity, what editor do you use for writing 
> your code?

I use emacs in its C mode with automatic indentation which is why the indents don't
come out right. I only just switched to emacs about two to three months ago. Is there
a way for me to get emacs to indent it according to PEP 7. Even better, is there an
emacs minor mode that will do it for me?

Please respond with your thoughts, especially on my idea on rebalancing. Do you think
it's overkill or do you think the performance losses are negligible? I think the only
real noticeable performance loss would be on nodes with the very different depths.
Anyway, tell me what you think.
Comment 119 by avassalotti, Jan 08, 2008
> I use emacs in its C mode with automatic indentation which is why the indents don't
come out right.

Euh, I use Emacs too and I don't have any problem with the automatic indentation. 
That is my mode I use for Python C code:

(defmacro def-styled-c-mode (name style &rest body)
  "Define styled C modes."
  `(defun ,name ()
     (interactive)
     (c-mode)
     (c-set-style ,style)
     ,@body))

(def-styled-c-mode python-c-mode "python"
  (setq indent-tabs-mode t
        tab-width 8
        c-basic-offset 8))

(def-styled-c-mode py3k-c-mode "python"
  (setq indent-tabs-mode nil
        tab-width 4
        c-basic-offset 4))


Comment 120 by iammisc, Jan 08, 2008
Thanks avassalotti, that works much better.
Comment 121 by avassalotti, Jan 08, 2008
Regarding the rebalancing, I am not sure yet what you're suggesting is appropriate 
for ropes. I will have to think about this a bit more.
Comment 122 by avassalotti, Jan 09, 2008
> Well you would basically rotate the tree to the right or left
> depending on if you're appending a concatenation node to a literal
> or repetition node or vice versa.

You can't append a concatenation node to a literal. Although, you can
*create* a new concatenation node with a literal and another
concatenation node.

> I'm not quite sure on how to explain this. If you're concatenating a
> concatenation node to a literal or repetition node and the
> concatenation node is balanced(which it should be because the
> rope_concat function would have balanced it), then simply rotating
> the tree to the right would balance it out. If you're doing it vice

"Rotating the tree to the right" doesn't anything to me, unless you
specify the pivot. (I am assuming that you would choose the previous
concatenation root as the pivot). Anyway, this idea won't work and here
is simple counter-example (with pretty pictures):

  http://peadrop.com/rope/balancing.html

> cheep. Also, this way is faster than my balancing implementation
> which now that you mention it, doesn't actually use the algorithm in
> the paper or the PyPy one, which is why I would like to change
> it. 

Why? The algorithm is the paper is elegant, fast, and simple to
implement, no? Do you have difficulties understanding it? If so, I
can help you; just ask.

Comment 123 by iammisc, Jan 09, 2008
I see now cfbolz, thanks for the page. I wasn't visualizing it right. I'll implement
the algorithm in the paper then.
Comment 124 by iammisc, Jan 09, 2008
Last question, how does the pypy implementation handle keeping the depth below max
depth? AFAICT, it just rebalances the rope.
Comment 125 by avassalotti, Jan 09, 2008
Well, a rope of depth 32 can have 2^32 leafs. Combine this with the minimum leaf
length of 128, this give you more space than you will ever need. And simply
rebalancing the rope will reduce significantly the depth of the rope. So, I don't
think you have to worry about this. 
Comment 126 by iammisc, Jan 09, 2008
Okay. The new implementation works well. However, it fails completely on large
ropes(I mean really large with depth 32). Also, literal merging also fails. I'm going
to commit the code to the repo anyway. I would suggest that you undefine
LITERAL_MERGING and try the rebalancing code. Please tell me what you think. I've had
very little time to do anything today because I had to do school stuff too.
Comment 127 by iammisc, Jan 10, 2008
I have a question. Should rebalancing be done in place? E.g. should calling the
balance() method on a rope return a new rope that is balanced or should it balance
the rope in place? Right now, the rope_balance method returns another rope but it
could have completely destroyed the nodes of the old rope. It doesn't matter in a
non-debug build where the only place it can be called is the rope_concat function and
internal functions that do it "right." However, in debug mode, a user can just call
the balance() method and expect to get a new rope. However, the balance method
destroys the nodes of the old rope so when you try to repr() the old rope, it will
fail. The way to fix this would be to assign the result of the balance() method back
to the old variable which is the behavior that the rope_concat functions emulate. For
example, say r is a rope. To balance it one should do:

  r=r.balance()

because 

 r.balance()

will make repr(r) fail.

I just wanted to know what you think about this. Should I leave it like this, because
balance() is only available in DEBUG mode and because anyone who is doing debugging
should already know this or should I make rope_balance balance the rope in place?
Comment 128 by cfbolz, Jan 11, 2008
r.balance() must not change r itself in any way! Consider the following situation:

r1 = <some rope>
r2 = <some other rope>
x = r1 + r2

r1.balance() (whether called explicitly or by some other operation does not matter).

If r1.balance() broke r1, then x would be broken as well. If r1.balance() would
happen inplace then this might destroy the balancedness of x. It is really best to
view ropes as something immutable.
Comment 129 by iammisc, Jan 11, 2008
ok cfbolz.
Comment 130 by iammisc, Jan 11, 2008
Ok, balancing is now much better. I've committed the new version to the svn
repository. Please take a look.
Comment 131 by avassalotti, Jan 11, 2008
Great to hear! I will review the new balancing code as soon as I get the time (i.e.,
probably tomorrow morning).
Comment 132 by gvanrossum, Jan 12, 2008
I've been asked to relay my experience with "ropes" in ABC. Well, ABC didn't actually
use ropes (I think) -- it used a balanced (B?) tree structure.  You can still find
the source online: http://homepages.cwi.nl/~steven/abc/implementations.html; the file
to start is probably abc/btr/i1tex.c.

In my experience there were two major issues with the implementation in ABC: due to
the tree balancing algorithm it was very complex, which made it hard to get the code
(and the code using it) to be absolutely correct; and, while it was carefully
designed to have asymptotically optimal performace for all operations on large
strings, the overhead for small strings (which abound in typical programs) was pretty
bad.

It was also expensive to get a string out of it in a form that a typical C system or
library call could use, although that wasn't much of an issue for ABC (which doesn't
have a lot of system interface code, unlike Python).
Comment 133 by avassalotti, Jan 12, 2008
Thanks again Guido for sharing your thoughts.

Travis, if you don't already know, ABC is the predecessor of Python.
(http://en.wikipedia.org/wiki/ABC_%28programming_language%29). Just like Cedar, the
programming environment described in Bohem's paper, ABC used heavily a rope-like
structure (a B-tree actually) for its strings.

Of course, ABC's implementation is quite different for yours, and probably won't be
much useful to you. Anyway, I thought you would find this little piece of history
interesting.


Comment 134 by avassalotti, Jan 12, 2008
The balancing operation is still quite leaky (both in memory and in references):

>>> from ropes import Rope
[20516 refs]
>>> a = Rope("a")
[20518 refs]
>>> b = Rope("b")
[20521 refs]
>>> for _ in range(100000):
...   a += b
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
KeyboardInterrupt
[58449746 refs]
>>> len(a)
61324
[memory usage: 980.3MB]


I think you should use an array of MAX_DEPTH elements for storing the forest, instead
of a list of pointers, during the balancing operation. This way you wouldn't need to
manage memory manually, thus reducing the risk of leaks. Also, why 
rope_concat_unchecked is needed at all? The forest elements during rebalancing
operation shouldn't exceed MAX_DEPTH, no?     
Comment 135 by avassalotti, Jan 12, 2008
Oh, I don't know if this will be any useful to you, but I wrote, earlier this week, a
simple implementation of rope in Python. I used it to improve my understanding the
rebalancing operation. Feel free to ignore if you already understand the balancing
algorithm, described in the paper, well.
rope.py
6.5 KB Download
Comment 136 by iammisc, Jan 12, 2008
> Travis, if you don't already know, ABC is the predecessor of Python.
> (http://en.wikipedia.org/wiki/ABC_%28programming_language%29). Just like Cedar, the
> programming environment described in Bohem's paper, ABC used heavily a rope-like
> structure (a B-tree actually) for its strings.

Yes. I looked up ABC on wikipedia already.

> The balancing operation is still quite leaky (both in memory and in references):

I'll look into it. I have an idea on how to reduce the memory usage(esp. in regards
to using a malloc()'ed array). See the end of my post. This could also be caused by
literal merging creating new literal nodes with the full length. Because new nodes
have been created, their are now more literals than just the original one literal.

> I think you should use an array of MAX_DEPTH elements for storing the forest, instead
> of a list of pointers, during the balancing operation.

That is what I was originally thinking but the balancing operation doesn't really
guarantee that it will make the rope's depth less than MAX_DEPTH. This is once again,
just something for handling worst case scenarios.

>  Also, why rope_concat_unchecked is needed at all? The forest elements during 
> rebalancing operation shouldn't exceed MAX_DEPTH, no?

As said before, the algorithm in the paper doesn't really make sure the rope depth
doesn't exceed MAX_DEPTH so it can cause rope_balance to be recursively called. This
is really something for a worst case scenario.

> Oh, I don't know if this will be any useful to you, but I wrote, earlier this week, a
> simple implementation of rope in Python. I used it to improve my understanding the
> rebalancing operation. Feel free to ignore if you already understand the balancing
> algorithm, described in the paper, well.

I'll definitely look into it. I think I understand the algorithm well, though.

Right now, I'm using an array of rope nodes because that was easier. However, I'm
thinking of replacing it with a singly-linked stack type linked list which would
lessen the memory needed. What do you think of this? This might take longer but it
wouldn't consume 980.3 MB(well hopefully).

I'm taking part in the open house at my school tomorrow though so I won't be able to
do much until maybe Monday. Most probably, I'll have some new version by Tuesday.
Comment 137 by avassalotti, Jan 12, 2008
> That is what I was originally thinking but the balancing operation
> doesn't really guarantee that it will make the rope's depth less
> than MAX_DEPTH. This is once again, just something for handling
> worst case scenarios.

See comment just below.

> As said before, the algorithm in the paper doesn't really make sure
> the rope depth doesn't exceed MAX_DEPTH so it can cause rope_balance
> to be recursively called. This is really something for a worst case
> scenario.

It does actually. See last paragraph of the fifth page of the paper.

And if you want to be sure that your code can handle (very) unlikely
worst case, where the tree has more than 2^MAX_DEPTH nodes and all
node are filled, then just add a check in _find_fib_slot that ensure
that the value returned is never greater or equal to MAX_DEPTH.

> Right now, I'm using an array of rope nodes because that was
> easier. However, I'm thinking of replacing it with a singly-linked
> stack type linked list which would lessen the memory needed. What do
> you think of this? This might take longer but it wouldn't consume
> 980.3 MB(well hopefully).

I am not sure if I understand what you are saying. Could you explain
better how your idea would reduce memory usage?

> I'm taking part in the open house at my school tomorrow though so I
> won't be able to do much until maybe Monday. Most probably, I'll have
> some new version by Tuesday.

Fine. Just to be sure I don't drag you along with new requirements, I
putting a "pencil-down" deadline on Saturday, January 19th, 2008 at
3:00UTC. I will take the code written at this date and do a final
evaluation. You will be welcome to continue working on your
implementation (and I will gladly continue to help you). However,
further work won't be considered part of the GHOP contest.
Labels: -Due-20080108.0420 -expired Due-20080119.0300
Comment 138 by iammisc, Jan 13, 2008
> It does actually. See last paragraph of the fifth page of the paper.

I saw this paragraph but I am unsure how it makes sure the resulting rope has a depth
less than the maximum depth. AFAICT, this paragraph only helps you determine the
depth of the final rope but not how to make the depth less. Are you saying to fail on
ropes larger than depth 32?

> And if you want to be sure that your code can handle (very) unlikely
> worst case, where the tree has more than 2^MAX_DEPTH nodes and all
> node are filled, then just add a check in _find_fib_slot that ensure
> that the value returned is never greater or equal to MAX_DEPTH.

Once again, are you saying to fail on ropes greater than MAX_DEPTH?

> I am not sure if I understand what you are saying. Could you explain
> better how your idea would reduce memory usage?

Yes. Right now, I malloc() a whole array which I fill with each node of the rope in
left to right order. However, this eats up a lot of memory. In some cases, the nodes
of the rope are the exact same. For example, if r is a rope and you keep on doing
r=r+r, then all the nodes are going to be the same. It makes little sense to make a
whole array just for putting the same exact nodes into. I was going to instead make a
singly linked list which would function as a stack. Basically, instead of having a
function _rope_get_iter_list, I would traverse the tree in the while loop and only
get the next node when I needed it instead of getting all of them before and putting
them into a massive array.

One more thing, I think literal merging takes up more memory than just not having
literal merging. This is because in some cases, the nodes being merged are the same.
What do you think? Should I just scrap literal merging altogether?

Comment 139 by cfbolz, Jan 13, 2008
Why does the array take so much space? In theory the array should never need to be
longer than 44 items long because the 44th fibonacci number is greater than
sys.maxint, the length of the longest string. Literal merging is definitely worth it,
you need to keep in mind that in practice strings don't get constructed by saying "a"
* 1000000 or by repeatedly concatenating the same string.
Comment 140 by avassalotti, Jan 13, 2008
Comment 138 by iammisc, Today (5 hours ago)
> Once again, are you saying to fail on ropes greater than MAX_DEPTH?

Yes. In fact, it should fail much earlier. The length of strings, in
Python, can't be greater than PY_SSIZE_T_MAX (2^32 / 2 - 1, on the typical
32-bit machine). Therefore, the length of a rope should not exceed this
value.

> Yes. Right now, I malloc() a whole array which I fill with each node
> of the rope in left to right order. However, this eats up a lot of
> memory.

Why do you allocate the array dynamically -- i.e. with malloc()?
Wouldn't it be easier (and faster) to use a static array?

> In some cases, the nodes of the rope are the exact same. For
> example, if r is a rope and you keep on doing r=r+r, then all the
> nodes are going to be the same.

"r=r+r" is unlikely in a real program. 

> It makes little sense to make a whole array just for putting the
> same exact nodes into. I was going to instead make a singly linked
> list which would function as a stack. Basically, instead of having a
> function _rope_get_iter_list, I would traverse the tree in the while
> loop and only get the next node when I needed it instead of getting
> all of them before and putting them into a massive array.

Hm, a massive array? Why do you store the leaves, at all, in an array?
You just need to do a left to right traversal of the tree and build up
the balanced tree incrementally.

> One more thing, I think literal merging takes up more memory than
> just not having literal merging. This is because in some cases, the
> nodes being merged are the same.

It should be quite the opposite. Literal merging should remove nodes,
thus reduce the memory usage (only slightly though). 

> What do you think? Should I just scrap literal merging altogether?

No. I think you should improve your code instead. When done right,
literal merging gives a nice speed boost too. Try it yourself. Run the
short example, in my simple rope implementation, with literal merging:

  % time python rope.py
  final length: 20000
  final depth: 20
  python rope.py  1.80s user 0.04s system 95% cpu 1.937 total

Then, change the min_leaf_length parameter of rebalance() to -1 to
disable literal merging and try to run the example again:

  alex:~% time python rope.py
  final length: 20000
  final depth: 15
  python rope.py  1279.52s user 19.93s system 91% cpu 23:46.99 total

It is a bit faster. :-)

Comment 141 by iammisc, Jan 13, 2008
> Hm, a massive array? Why do you store the leaves, at all, in an array?
> You just need to do a left to right traversal of the tree and build up
> the balanced tree incrementally.

Which is what my new idea would do.

> It should be quite the opposite. Literal merging should remove nodes,
> thus reduce the memory usage (only slightly though). 

In the ways you've tested the implementation it would take more wouldn't it?

For example you ran

>>> from ropes import Rope
[20516 refs]
>>> a = Rope("a")
[20518 refs]
>>> b = Rope("b")
[20521 refs]
>>> for _ in range(100000):
...   a += b

But in the current implementation, new nodes would be created for all the b nodes. So
I would have new nodes with 128 'b's inside even though normally you would only need
b. This is what I mean when I say literal merging wastes space.

> Yes. In fact, it should fail much earlier. The length of strings, in
> Python, can't be greater than PY_SSIZE_T_MAX (2^32 / 2 - 1, on the typical
> 32-bit machine). Therefore, the length of a rope should not exceed this
> value.

So should I put an if statement in rope_concat that will throw an exception?

> "r=r+r" is unlikely in a real program. 

Other things such as the code you ran to test the implementation would suffer from
the same problems, that I mentioned above.

I change MIN_LITERAL_LENGTH to a higher number and adopted a method of literal
merging similar to your implementation. The longer literals really made a difference.

Also, check the new version, it runs your test code fine and without using much memory.
Comment 142 by avassalotti, Jan 13, 2008
> > Hm, a massive array? Why do you store the leaves, at all, in an array?
> > You just need to do a left to right traversal of the tree and build up
> > the balanced tree incrementally.
> 
> Which is what my new idea would do.

But, you don't need a linked list for that, no? Just need a procedure similar to
rope_str, which, instead of copying characters, insert the rope leaves into the forest. 

> But in the current implementation, new nodes would be created for all the b nodes. So
> I would have new nodes with 128 'b's inside even though normally you would only need
> b. This is what I mean when I say literal merging wastes space.

Actually, no. You forgot that there is two components to the size of a rope, the size
of the literal and the node overhead (which is 40 bytes on a 32-bit machine; it's
greater on a 64-bit machine). Therefore, without literal merging the total size of
the rope, namely a, resulting from:

  a = Rope('a')
  b = Rope('b')
  for _ in range(100000):
     a = a + b

is equal to:

  = <literal size> + <node overhead>
  = (2 * 1 byte) + (100000 nodes * 40 bytes per node)
  = 4000002 bytes

With literal merging the resulting rope is equal to:

  = (128 * (100000 // 128) + 100000 % 128) + ((100000 // 128 + 1) * 40)
  = 100000 bytes + 31280 bytes
  = 131280 bytes

> So should I put an if statement in rope_concat that will throw an exception?

I think rope_from_type would be a better place to put a such check (since rope_repeat
need to be checked too).



Comment 143 by avassalotti, Jan 13, 2008
Here is a simplistic prototype of how I think the balancing should be done. It surely
has some bugs (I have neither tried, nor proved that it works). However, I think it
will give you some guidelines on how to improve the balancing code of your
implementation. 

typedef RopeObject *Forest[MAX_DEPTH]; 

static void
insert_in_forest(Forest forest, RopeObject *leaf)
{
	int i;
	RopeObject *x;

	i = _find_fib_slot(leaf->length);
	assert(i < MAX_DEPTH);
	for (;;) {
		if (forest[i] == 0) {
			forest[i] = leaf;
			break;
		}
		else {
			leaf = rope_concat(forest[i], leaf);
			forest[i] = 0;
			i = _find_fib_slot(leaf->length);
		}
	}
}

static void
_rope_balance(RopeObject *self, Forest forest)
{
	if (self->depth == 0) {
		insert_in_forest(forest, self);
	}
	else { /* Concatenation node */
		_rope_balance(self->v.concat.left, forest); 
		_rope_balance(self->v.concat.right, forest);
	}
}

static RopeObject *
rope_balance(RopeObject *self)
{
	int i;
	Forest forest;
	RopeObject *rope = NULL;
	
	for (i = 0; i < MAX_DEPTH; i++) {
		forest[i] = 0;
	}
	_rope_balance(self, forest);
	for (i = 0; i < MAX_DEPTH; i++) {
		if (rope == NULL)
			rope = forest[i];
		else
			rope = rope_concat(rope, forest[i]);
	}
	return rope;
}
Comment 144 by cfbolz, Jan 14, 2008
Hi Alexandre,

I think there are some problems with that code:

 - I think the static array should be a bit bigger, namely _find_fib_index(MAXINT)
which is 44 on 32 bit and 90 on 64 bit machines.

 - the insert_in_forest is wrong, because it needs to consider the possibility that
there are nodes somewhere in forest[:i], which need to be swept up

 - at the end of rope_balance there can be nodes in all of forest, which need to be
concatenated to get the result

Comment 145 by avassalotti, Jan 14, 2008
> I think there are some problems with that code:

Ah, I knew it. :-)

> - I think the static array should be a bit bigger, namely _find_fib_index(MAXINT)
> which is 44 on 32 bit and 90 on 64 bit machines.

Wouldn't it be simpler to cap the return value of _find_fib_index so that it can't 
be greater or equal to MAX_DEPTH (and perhaps increase the value of MAX_DEPTH)? I 
don't think strings greater than, let's say, 100MB are very common.

> - the insert_in_forest is wrong, because it needs to consider the possibility that
> there are nodes somewhere in forest[:i], which need to be swept up

Oh, that's true! Initially, I thought that wasn't nescessary, but now I see that it 
is (since the string could be reconstructed in disorder if that is not done). Thanks 
for pointing this out.

> - at the end of rope_balance there can be nodes in all of forest, which need to be
> concatenated to get the result

Hm. I am not sure what you mean there. 



Comment 146 by iammisc, Jan 14, 2008
I did not use Alexandre's code though so you shouldn't see those bugs in my
implementation. Anyway, I've basically made my implementation recursive like you
said. I've also made work_list a static array of MAX_DEPTH nodes. Something you may
want to know is that the state of the balancing operation is kept on a new struct
RopeBalanceState which really made making the function recursive much easier.
Personally, I think Alexandre's idea of making the function recursive is much better.
I think it goes faster now(I haven't really timed it) and it works better. I've also
removed all the warnings(Finally). I've committed the new version so you might want
to take a look.

Tell me if you find any bugs.
Comment 147 by avassalotti, Jan 14, 2008
I see that you opted for an approach closer to the PyPy's implementation. That is
fine, although, I am not sure if it is the most elegant one in C.

Here's a few things that struck me as odd:

In rope_balance:

   state.a = state.b = INT_MAX;

Don't use INT_MAX for that, use PY_SSIZE_T_MAX instead. 

In _rope_balance:

    if(cur->type==LITERAL_NODE && cur->length < MIN_LITERAL_LENGTH &&
state->string_length < MIN_LITERAL_LENGTH) {
        if(!state->string) {
            state->string = PyMem_Malloc(cur->length);
            memcpy(state->string, cur->v.literal, cur->length);
            state->string_length = cur->length;
        }
        else {
            state->string = PyMem_Realloc(state->string, state->string_length +
cur->length);
            memcpy(state->string + state->string_length, cur->v.literal, cur->length);
            state->string_length += cur->length;
        }
        return;
    }

I think you use a static buffer, instead of using malloc/realloc. This will change
the semantic for the literal merging algorithm a bit. However, that will avoid many
calls to malloc/realloc, and (maybe) speed up the balancing code.

In rope_balance:

	_rope_balance(r, &state, 1);
	if(state.string) {
		cur = rope_from_type(LITERAL_NODE, state.string_length);
		cur->v.literal = state.string;

		_rope_balance(cur, &state, 0);
	}

Why do you call _rope_balance again?

Finally, you need to check for overflow more throughly:

>> from ropes import Rope
[20523 refs]
>>> r = Rope('1234567890')
[20525 refs]
>>> s = Rope('1234567890')
[20527 refs]
>>> for _ in range(200):
...   r += s
...   s += r
... 

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1209956160 (LWP 8053)]
0xb7c82ca6 in rope_concat_unchecked (self=0x5704e7, other=0x8b00a9c) at src/ropes.c:358
358             Py_INCREF(self);


On the good side, memory usage usage is *much* better. Also the balancing operation
seems to be faster. Good work!
Comment 148 by iammisc, Jan 14, 2008
> I think you use a static buffer, instead of using malloc/realloc. This will change
> the semantic for the literal merging algorithm a bit. However, that will avoid many
> calls to malloc/realloc, and (maybe) speed up the balancing code.

It definitely would speed up the code but if all the nodes point to a static buffer
on the stack, their contents will be overwritten if you change one of them or if a
function decides to use the stack(which is every function).

A way to maybe speed up the code is to allocate the full 1024 bytes and then realloc
it if it isn't using the full buffer.

> In rope_balance:
> 
> 	_rope_balance(r, &state, 1);
> 	if(state.string) {
>		cur = rope_from_type(LITERAL_NODE, state.string_length);
>		cur->v.literal = state.string;
> 
>		_rope_balance(cur, &state, 0);
>	}
> 
> Why do you call _rope_balance again?

Because it doesn't work if you take it out? :P No really, it's because the merged
string might be incomplete so we have to add it to the list.

> Finally, you need to check for overflow more throughly:

Yeah I haven't done any checking in the concatenation functions.

> In rope_balance:
>
>    state.a = state.b = INT_MAX;
>
> Don't use INT_MAX for that, use PY_SSIZE_T_MAX instead. 

I'll fix it immediately.

> On the good side, memory usage usage is *much* better. Also the balancing operation
> seems to be faster. Good work!

Thanks.

Comment 149 by iammisc, Jan 15, 2008
The new version checks for overflow in rope_balance.
Comment 150 by iammisc, Jan 17, 2008
Avassalotti, have you checked the code yet?
Comment 151 by avassalotti, Jan 17, 2008
Not yet (I just arrived from short trip). But, I am looking at it now.
Comment 152 by avassalotti, Jan 17, 2008
> It definitely would speed up the code but if all the nodes point to a static buffer
> on the stack, their contents will be overwritten if you change one of them or if a
> function decides to use the stack(which is every function).

I think you misunderstood what I suggested, but that isn't important.

> A way to maybe speed up the code is to allocate the full 1024 bytes and then realloc
> it if it isn't using the full buffer

Yes, that would also work.

> The new version checks for overflow in rope_balance.

Looks fine (I haven't tested it, though). But, what about the overflow case I mentioned in comment 112?

Question: Why the REPEAT_NODE case in rope_slice is so complex? Couldn't you do something like the 
following (in pseudo-code)?

def slice(repeat(child, count), start, end):
    start = slice(child, start % len(child))
    mid = repeat(child, (end - start) // len(child))
    end = slice(child, end % len(child))
    return start + mid + end

One more thing, do you plan to use rope_get_iter_list_count and _rope_get_iter_list again?
Comment 153 by avassalotti, Jan 17, 2008
Whoops, there is a mistake in my last example. 

def slice(repeat(child, count), start, end):
    left = slice(child, start % len(child))
    mid = repeat(child, (end - start) // len(child))
    right = slice(child, end % len(child))
    return left + mid + right

Comment 154 by iammisc, Jan 18, 2008
> One more thing, do you plan to use rope_get_iter_list_count and _rope_get_iter_list
> again?

Yes RopeIter uses them.

> Question: Why the REPEAT_NODE case in rope_slice is so complex? Couldn't you do 
> something like the 
> following (in pseudo-code)?

That is basically what the current code does. Your pseudo code does not take into
account the fact that we might not need a mid node at all. 

I've committed a new version which handles overflows better and has some other minor
changes.
Comment 155 by avassalotti, Jan 19, 2008
> > One more thing, do you plan to use rope_get_iter_list_count and _rope_get_iter_list
> > again?
> 
> Yes RopeIter uses them.

Ah, I see. Could you combine the two procedures so that you only do a single traversal?  The prototype would become 
something like this:

int _rope_get_iter_list(RopeObject **node_list, Py_ssize_t *count, RopeObject *node)

> That is basically what the current code does. 

Well, except that you take a full page of not-so-easy-to-understand code to do it. So, could you try to refactor this part, 
please?

> Your pseudo code does not take into account the fact that we might not need a mid node at all. 

This kind of optimization should be done in rope_concat.

> I've committed a new version which handles overflows better and has some other minor
> changes.

Unfortunately, your code is relying on the signed overflow behavior, which is undefined in C. The problem is recent 
versions of GCC might optimize out your overflow checks. (See http://bugs.python.org/issue1621 for more detail). I 
suspect that this will be hard to fix everywhere though. So, you might want to leave as is for now.



Comment 156 by iammisc, Jan 19, 2008
> Ah, I see. Could you combine the two procedures so that you only do a single 
> traversal?  The prototype would become 
> something like this:

Ok.

> Well, except that you take a full page of not-so-easy-to-understand code to do it.  
> So, could you try to refactor this part, 
> please?

Alright I'll try to make it smaller. I don't understand your pseudo-code though, your
calls to rope_slice only have one argument, shouldn't there be both a start and end
argument?





Comment 157 by iammisc, Jan 19, 2008
Uh oh, One thing avassalotti about the _rope_get_iter_list* functions.

The reason they are split into two is because I need to allocate a buffer for the
node_list before. Do you want me to realloc this list?
Comment 158 by avassalotti, Jan 19, 2008
> Alright I'll try to make it smaller. I don't understand your pseudo-code though, your
> calls to rope_slice only have one argument, shouldn't there be both a start and end
> argument?

Oh, that is another bug. Here's the corrected version:

def slice(repeat(child, count), start, end):
    left = slice(child, 0, start % len(child))
    mid = repeat(child, (end - start) // len(child))
    right = slice(child, 0, end % len(child))
    return left + mid + right

> The reason they are split into two is because I need to allocate a buffer for the
> node_list before. Do you want me to realloc this list?

Hm. Why not? 
Comment 159 by iammisc, Jan 19, 2008
> Oh, that is another bug. Here's the corrected version:

Yeah but that still doesn't work because left might start on an offset other than 0.

> Hm. Why not? 

Why not what?

Oh I forgot to mention in the last comment, but I had already committed a new version
to the repo.
Comment 160 by avassalotti, Jan 20, 2008
> Yeah but that still doesn't work because left might start on an offset other than 0.

Ah, that was a typo. I meant "left = slice(child, start % len(child), len(child))".
However, the algorithm is still far from being correct. I posted it just to give you
an idea of how I would do it -- i.e., it wasn't intended to be implemented directly.

I found a few more bugs, while testing your code:

>>> Rope("") * 2
Traceback (most recent call last):
 ...
OverflowError: The string is too large!
>>> r = Rope("abcde") * int(2**31-1)
>>> r = Rope("abcdef") * int(2**31-1)
Traceback (most recent call last):
 ...
OverflowError: The rope is  too long!
>>> r = Rope("abcdefg") * int(2**31-1)
>>> r = Rope("")
>>> r[:]
Traceback (most recent call last):
 ...
ValueError: No sane value to slice!
>>> Rope('c') in Rope('abc')
True
Fatal Python error: ../Objects/listobject.c:532 object at 0xb7ce8a38 has negative ref
count -606348326

Anyway, here is my final evaluation for this task. You completed most
of the requirements for this task:

  concatenation    [done]
  indexing/slicing [done]
  repetition       [done]
  hashing          [done]
  iterator         [done]
  balancing        [done]
  comparison       [done]
  substring search [almost complete]
  search/replace   [not started]
  automated tests  [incomplete]

I can't say the substring search is complete due to the above bug. The
automated tests are incomplete. gcov reports:

  Lines executed:36.03% of 569
  Branches executed:28.23% of 581
  Taken at least once:16.01% of 581
  Calls executed:25.63% of 238

There is plenty of test cases, that were shown during the past
discussions, that could be included in the automated tests. Finally,
your code still don't conform to PEP-7 (I know that you think your C
style is the best, just as everyone else. <wink>)

Overall, you did some good work. I hope you enjoyed working on this
project (I did). Also, I remarked that your programming skills improved
along the way, which is great. Hopefully, I will see you again around Python core
development.

I am still available if you have any question -- just mail at
alexandre@peadrop.com.

        Good luck with your projects,
             Alexandre

Status: Completed
Comment 161 by iammisc, Jan 26, 2008
Sorry for the late reply.

Thanks for all the help you gave me throughout the whole project. I think I will
definitely continue this project. I really enjoyed this project.

> Hopefully, I will see you again around Python core
> development.

How would I get involved in this?
Comment 162 by avassalotti, Jan 26, 2008
> Thanks for all the help you gave me throughout the whole project. I think I will
> definitely continue this project. I really enjoyed this project.

Great to hear!

> How would I get involved in this?

First, you should read http://www.python.org/dev/intro/ if you want to learn about
Python development process. There is also a developer FAQ
http://www.python.org/dev/faq/, but that one is mostly a svn FAQ, at this point. 

Next, subscribe to python-dev (and python-3000 if you plan to help on Py3k) to get
the latest news about Python development.

Finally, start helping out. :-) Bug fixes and improvement to the unit tests are
always welcome and usually easy. The list of easy tasks, in the bug tracker, is a
good place to start too:

http://bugs.python.org/issue?@columns=title,id,activity,type,components,versions,severity,status&@sort=-activity&@group=priority&@filter=keywords,status&@pagesize=50&@startwith=0&keywords=6&status=1

At some point if you contribute on a regular basis, you should introduce yourself to
python-dev and ask for commit privileges.

Comment 163 by avassalotti, Jan 26, 2008
Oh, the long URL to the bug tracker got mangled, so here's a short version:

http://tinyurl.com/354fss
Sign in to add a comment

Hosted by Google Code