Export to GitHub

google-highly-open-participation-psf - issue #218

Implement a 'ropes' datatype for CPython in C.


Posted on Dec 7, 2007 by Helpful Giraffe

'ropes' represent strings using concatenation trees; see

http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf

and

http://morepypy.blogspot.com/2007/11/ropes-branch-merged.html

At a minimum implement string concatenation, slicing, repetition, and search/replace. Build this either as a straight C/C++ extension type, or as a Python wrapper around a C module. Provide automated tests and a setup.py/distutils file.

For extra extra credit, implement a unicode version too (py 2.6; 'bytes' and 'str' for py 3k).

Completion:

Attach a .tar.gz or zip file to this issue and send an e-mail to the mailing list.

Task duration: please complete this task within 5 days (120 hours) of claiming it.

Comment #1

Posted on Dec 7, 2007 by Grumpy Cat

I claim this task

Comment #2

Posted on Dec 7, 2007 by Helpful Giraffe

(No comment was entered for this change.)

Comment #3

Posted on Dec 7, 2007 by Grumpy Cat

I have actually been working on this before it was added as a task, so here is the current progress I've made.

Attachments

Comment #4

Posted on Dec 7, 2007 by Helpful Dog

You might want to look at the Python implementation that PyPy has for comparison:

https://codespeak.net/svn/pypy/dist/pypy/rlib/rope.py

I advice you to ask for help early, since this task is really hard.

Comment #5

Posted on Dec 7, 2007 by Happy Bird

You are lucked and faster then me. Good luck:)

Comment #6

Posted on Dec 7, 2007 by Helpful Dog

fastnix: you might be interested in Tasks 248 or 239.

Comment #7

Posted on Dec 7, 2007 by Grumpy Cat

For the interested (apparently a few), I've uploaded the git repo I'm working from to http://dev.gentooexperimental.org/~masterdriverz/ropes.git

Comment #8

Posted on Dec 7, 2007 by Massive Ox

Hi,

I started to read the code you posted and you seems to be on the right path. I saw a few stylistic issues, but nothing important. I also see that your implementation is still missing the rebalancing code, which I believe is the hardest part to get right. Anyway, keep up the good work!

Comment #9

Posted on Dec 7, 2007 by Grumpy Cat

avassalotti: Thanks, I will definitely refer to the PyPy implementation for that part. Is there a coding style guide for the C elements of python?

Comment #10

Posted on Dec 7, 2007 by Helpful Dog

Hi,

I can really only encourage you to have good unittest coverage, otherwise it will be incredibly hard to cover all corner-cases. The PyPy code I linked to above contains a rather well-tuned hash function that is composable: e.g. you can compute it from the hash functions of the sub-ropes for concatenation-nodes.

Comment #11

Posted on Dec 7, 2007 by Happy Bear

The C coding style is documented in PEP 7. Note that old code uses tabs, while new code uses 4 spaces, like for PEP 8.

Comment #12

Posted on Dec 7, 2007 by Massive Ox

Is there a coding style guide for the C elements of python?

Indeed. See PEP 7 1.

But these aren't the type of stylistic issues I was referring to. Here some examples:

Use Py_ssize_t 3 for these: /* Length of entire rope / long len; / Cached hash of rope, -1 means uncalculated */ long hash;

This is unnecessary: if (kwds && PyDict_Size(kwds)) { PyErr_SetString(PyExc_TypeError, "rope takes no keyword arguments"); return -1; }

I think the "S" format spec (but I could be wrong) could be used here: if (!PyArg_ParseTuple(args, "O", &obj)) return -1; if (!PyString_Check(obj)) obj = PyObject_Str(obj); if (!obj) return -1; assert(PyString_Check(obj), -1);

That's a bit cryptic (just try to read it out loud), in my humble opinion: return !!strstr(PyString_AS_STRING(self->str), str);

Anyway, you shouldn't worry too much about these issues. These can be fixed easily after you completed the task.

Comment #13

Posted on Dec 7, 2007 by Massive Ox

Just out of curiosity, is the cyclic garbage-collector really necessary? I don't see how the strings in rope could form cyclic references.

Comment #14

Posted on Dec 7, 2007 by Massive Ox

Oops, I just realized that I said something stupid in comment 11. Don't use Py_ssize_t for the hash, keep it as long. While am at it, here is the standard function for hashing strings in Python:

static long string_hash(PyStringObject *a) { register Py_ssize_t len; register unsigned char *p; register long x;

if (a->ob_shash != -1)
    return a->ob_shash;
len = Py_Size(a);
p = (unsigned char *) a->ob_sval;
x = *p << 7;
while (--len >= 0)
    x = (1000003*x) ^ *p++;
x ^= Py_Size(a);
if (x == -1)
    x = -2;
a->ob_shash = x;
return x;

}

Comment #15

Posted on Dec 7, 2007 by Helpful Dog

It's an interesting design question whether ropes should have the same hash as Python strings. In PyPy we decided against this, to make the hash much easier computable. (see my last comment). The CPython implementation maybe should hash identical to strings for better interoperability of the two types.

Comment #16

Posted on Dec 7, 2007 by Grumpy Cat

avassalotti: Thanks for your comments, they're much appreciated :)

Use Py_ssize_t [3] for these:

Thanks, changed a couple of these (including start and end for substr, but not hash :)

This is unnecessary:

Well without it Rope(s, foo="bar") doesn't complain, even though foo has no effect, which personally I don't like (also it doesn't replicate str's behaviour).

I think the "S" format spec (but I could be wrong) could be used here:

Using "S" means Rope(["foo"]) won't work (whereas str(["foo"]) gives '["foo"]').

That's a bit cryptic (just try to read it out loud), in my humble opinion:

Well it converts the pointer returned by strstr to a boolean (I thought !! was a common idiom for this purpose?). I might add a comment explaining this, then again the code may vanish altogether as it doesn't work for finding a string that overlaps on a concat or repeat.

Just out of curiosity, is the cyclic garbage-collector really necessary? I don't see how the strings in rope could form cyclic references.

Remember it's an acyclic graph - personally I thought it was easier to add the traversal code than prove that there couldn't be a circular graph (and for that not to change as the code developed).

Oops, I just realized that I said something stupid in comment 11. Don't use Py_ssize_t for the hash, keep it as long.

No worries :)

While am at it, here is the standard function for hashing strings in Python:

Thanks, this is nicer than the one I had before. However, as I can't apply the initial and final XORs (they aren't composable, which is necessary as pointed out by cfbolz), so it's not as strong as I'd like. Ah well, I doubt it's too much of a performance hit for dicts.

Comment #17

Posted on Dec 7, 2007 by Massive Ox

I think the "S" format spec (but I could be wrong) could be used here:

Using "S" means Rope(["foo"]) won't work (whereas str(["foo"]) gives '["foo"]').

Ah, that's what I thought. I wasn't sure whether "S" called PyObject_Str implicitly. I verified and it turns out it doesn't. So, keep "O".

Well without it Rope(s, foo="bar") doesn't complain, even though foo has no effect, which personally I don't like (also it doesn't replicate str's behaviour).

What do you think about this, then?

diff --git a/rope.c b/rope.c index 7e0095f..5fa01a5 100644 --- a/rope.c +++ b/rope.c @@ -487,18 +487,22 @@ static int rope_init(rope *self, PyObject *args, PyObject *kwds) { PyObject *obj; + static char *kwlist[] = {"source", 0};

  • if (kwds && PyDict_Size(kwds)) {
  • PyErr_SetString(PyExc_TypeError, "rope takes no keyword arguments");
  • return -1;
  • }
  • if (!PyArg_ParseTuple(args, "O", &obj))
  • if (!PyArg_ParseTupleAndKeywords(args, kwds,
  • "|O", kwlist, &obj)) return -1;

  • if (!PyString_Check(obj))

  • if (obj == NULL) {
  • obj = PyString_FromString("");
  • }
  • else if (!PyString_Check(obj)) {
  • /* XXX The reference to the original object is lost here.
  • This will cause a leak. */ obj = PyObject_Str(obj);
  • if (!obj)
  • return -1;
  • if (!obj)
  • return -1;
  • } assert(PyString_Check(obj), -1);

    Py_INCREF(obj);
    

Comment #18

Posted on Dec 7, 2007 by Massive Ox

Oops. There's a small bug in the above diff.

@@ -486,19 +486,23 @@ static int rope_init(rope *self, PyObject *args, PyObject *kwds) { - PyObject *obj; + PyObject *obj = NULL;

Comment #19

Posted on Dec 8, 2007 by Massive Ox

Comment deleted

Comment #20

Posted on Dec 8, 2007 by Massive Ox
  • /* XXX The reference to the original object is lost here.
  • This will cause a leak. */

Forget what I said in this comment. It's just me being stupid again.

By the way, I think you should set a limit on the length a rope can have. That will prevent the interpreter to crash when computing the repr of a rope if it exceeds the maximum length a string can have. It should also prevent the length of the rope from overflowing. For example:

from rope import rope r = rope("hello") r = r*(2**32-1) len(r+r) -2147483648

You probably want to fix these bugs too:

r = rope('abc') r + '' Traceback (most recent call last): File "", line 1, in RuntimeError: Assertion "0" failed in rope.c, line 274 (_rope_str) (r+r) + '' Traceback (most recent call last): File "", line 1, in SystemError: ../Objects/stringobject.c:4106: bad argument to internal function r[:] MemoryError r += 'hello' r [1] 10455 segmentation fault (core dumped) python r = rope('hello') r[6:-1] [1] 10507 segmentation fault (core dumped) python

Finally, here is some potential (micro) optimizations:

r = rope('') r * 10**9 # just return an empty rope r + rope('hello') # just return rope('hello')

However, you shouldn't worry to much about these things for now.

Comment #21

Posted on Dec 8, 2007 by Grumpy Cat

avassalotti: Thanks, these should all be fixed now, except for the micro-optimizations (hopefully the regression tests work too)

Comment #22

Posted on Dec 8, 2007 by Massive Ox

You should raise a TypeError, instead of converting implicitly the objects:

>>> Rope('hello') + 'hello'
Rope("hellohello")
>>> Rope('hello') + 2
Rope("hello2")

Comment #23

Posted on Dec 8, 2007 by Massive Ox

And oh, some nitpicking, please keep your code within the ANSI C89 standard:

static inline rope * gen_rope(void) {

The 'inline' keyword is from C99. (The GCC tries to inline functions declared 'static' anyway).

Comment #24

Posted on Dec 8, 2007 by Massive Ox

I was wandering how your code would look in ANSI C89 -- i.e., without the anonymous struct/union. So, what is my conclusion? It is gross. :-) Although if you want to get your rope implementation into the Python standard library, it would probably be a good idea to accept slightly more verbose ANSI C89 version. Anyway, it is your project; you choose. :)

Attachments

Comment #25

Posted on Dec 9, 2007 by Grumpy Cat

avassalotti: Yeah, I really like having anonymous structs/unions :( I committed your patch with a few slight modifications (eg, renaming item to child for v.repeat, since it can have the same name as v.substr.child as they're now in different scopes), and rope_concat will now only implicitly convert strings.

Comment #26

Posted on Dec 9, 2007 by Grumpy Cat

Also, I added those micro-optimizations you mentioned earlier in the thread.

Comment #27

Posted on Dec 9, 2007 by Grumpy Cat

Ok, some questions as to further things to add:

How much caching should be done (if any more at all)? Ropes with len < 5?

Should I add a load more types for the string functions (upper, lower, title, join etc come to mind), or should I return a rope with these things applied?

Comment #28

Posted on Dec 9, 2007 by Helpful Dog

How much caching should be done (if any more at all)? Ropes with len < 5?

I would definitely use a prebuilt array of ropes of len == 1. Not sure caching is worth it, I wouldn't bother at the moment, it's more an optimization to be done later.

Should I add a load more types for the string functions (upper, lower, title, join etc come to mind), or should I return a rope with these things applied?

I wouldn't do upper, lower, title, I can't imagine them being particularly performance relevant. Maybe join, but even that could be done later. In fact PyPy doesn't even have a multiplication node (can be done easily with concat nodes) and a slicing node (can be done by taking as much as possible from the existing tree and adding some few nodes on the fringe). New node types have a cost, which is the need to deal with them specifically in the rebalancing logic and in some other places.

I would really start working on the rebalancing now, it is the most complex piece of code you will have to write and without it the whole rest is not really worth that much.

Comment #29

Posted on Dec 9, 2007 by Massive Ox

Yeah, I really like having anonymous structs/unions :( I committed your patch with a few slight modifications (eg, renaming item to child for v.repeat, since it can have the same name as v.substr.child as they're now in different scopes),

Ah, I see you changed 'char braid' back to 'int braid'. I realized after posting the patch that the 'char' left a 3 bytes hole in the structure. So, 'int' was a better choice. :-)

and rope_concat will now only implicitly convert strings.

IMHO, you souldn't do any implicit conversion. As some people would say, it is "unpythonic" (Python favors strong-typing).

How much caching should be done (if any more at all)? Ropes with len < 5?

Should I add a load more types for the string functions (upper, lower, title, join etc come to mind), or should I return a rope with these things applied?

Carl is right. I should start working on the rebalancing as soon as possible. Adding more caching won't speed up things much; it may in fact slow down your implementation (Personally, I find the caching of repr already too much). And adding more rope types will just make the rebalancing more complex. The real bottleneck is the height of concatenation tree. For example

r = rope('a') r * 100000 Rope("aaaaaaaa...") for _ in range(100000): ... r += rope('a') ... r [Memory-usage spikes up and the kernel kills the process]

So, you should gears up your optimizations toward mitigating this -- e.g., rebalancing the tree, fusion of small nodes, etc.

Comment #30

Posted on Dec 12, 2007 by Helpful Dog

how is the progress here? There were no changes in the git repo for a while. Do you need help or advice?

Comment #31

Posted on Dec 12, 2007 by Grumpy Cat

Sorry, I've been busy at school recently (exams and the like). Some str functions and some simple rebalancing pushed. I'm sure there must be more to rebalancing, but I can't see what I'm missing. Should I recurse down the two sides of the rope being rebalanced?

Comment #32

Posted on Dec 12, 2007 by Helpful Dog

there's quite a bit more to rebalancing, have you looked at the logic in the pypy code? What you implemented so far works, but is rather simplistic, since you are always flattening the string completely. The output of the rebalancing should be a tree that represents the whole string but in which for every occuring concatenation nodes, the left and the right tree have approximately the same depth.

Comment #33

Posted on Dec 12, 2007 by Grumpy Cat

Is that true of every node in the tree?

Comment #34

Posted on Dec 13, 2007 by Helpful Dog

yes. Of course you cannot make it completely true in some cases (e.g. if the tree doesn't have a power of 2 of leaf nodes). you also might take a look at the paper linked above:

http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf

It describes the rebalancing algorithm rather well. The PyPy code implements exactly this algorithm, so with both together you should manage to implement it.

Comment #35

Posted on Dec 15, 2007 by Happy Bear

I think that, due to the hard nature of this task, it's fair to give a 3-day extension. That will make the task expire today, and then you'll have another 3 days grace period to complete it.

Comment #36

Posted on Dec 15, 2007 by Helpful Dog

Hi Georg!

Thanks for writing this, I was about to ask :-)

Comment #37

Posted on Dec 18, 2007 by Grumpy Cat

Sorry for my inactivity, I've had exams and a conspicuous lack of internet connectivity. George, thanks for the extension, I defininitely underestimated the complexity of the rebalancing operation. I think I have it though, I'll upload my attempt tonight.

Comment #38

Posted on Dec 22, 2007 by Helpful Dog

Hi!

What's going on there? Are you still progressing? The deadline of this task is completely over, so if I don't hear from you immediately I will have to reopen the task.

Comment #39

Posted on Dec 23, 2007 by Grumpy Cat

Sorry, I completely dropped the ball on this. I'm still lacking internet access from my home machine, but I'll try and upload what I've done tomorrow.

Comment #40

Posted on Dec 23, 2007 by Happy Horse

Comment deleted

Comment #41

Posted on Dec 28, 2007 by Helpful Dog

I am putting this task as open again.

Comment #42

Posted on Dec 29, 2007 by Happy Bird

I claim this task.

Comment #43

Posted on Dec 29, 2007 by Helpful Kangaroo

This task is due January 03, 2008 04:20:00 UTC

Comment #44

Posted on Dec 29, 2007 by Helpful Kangaroo

(No comment was entered for this change.)

Comment #45

Posted on Dec 30, 2007 by Happy Bird

I just wanted to give a little update on my progress. So far, I've written the basic appending and repetition code. Also, I've written a rebalancer which rebalances the entire tree. Converting to a string and getting the length via len() also works. Tomorrow, I hope to get slices and indexing working. I've written some unit tests that test the rebalancer, the appending function, and the string conversion which use large computer generated "Lorum ipsem..." The tests all pass right now. I tried to avoid any memory leaks in the code but running it through valgrind pops up a number of memory leaks in the interpeter itself and it is a bit difficult to sort through the ones that my module is causing. Is there any better way to do this?

I'll try to get a module that fulfills all the criteria in about two days. From there I would like to add some more functionality so that it more closely resembles the python string type.

Talk to you later.

Comment #46

Posted on Dec 30, 2007 by Massive Ox

I tried to avoid any memory leaks in the code but running it through valgrind pops up a number of memory leaks in the interpeter itself and it is a bit difficult to sort through the ones that my module is causing. Is there any better way to do this?

See Misc/README.valgrind in the source distribution of Python. By the way, why do you worry about memory leaks? I don't think ropes needs a lot, if any, of manual memory management. Are you trying to track reference leaks, instead?

Comment #47

Posted on Dec 30, 2007 by Happy Bird

Yes. I'm not 100% sure if my code does reference counting properly. I don't want to have a rope kept in memory because i didn't do reference counting right and i definitely don't want to have it deleted from right under me.

Comment #48

Posted on Dec 30, 2007 by Massive Ox

Then, you probably want to get debug build of Python. Read the GettingAndCompilingPython wiki page first. Then, run configure with the --with-pydebug option. This will add some debugging code, reference tracker, into the interpreter.

When you will run your debug build of the interpreter, you will see the total reference count will after each expression you enter. if your code is leaking references, try to call (or recreate) an object twice. If on the second try, the reference count is incremented (or subtracted) then you probably want to check your code. This trick works because Python, when run interactively, always keep a reference to the result of the last expression entered, in the variable _. So, by calling an object twice, assuming the call doesn't have any side-effects, the total reference should remain equal.

However, as I said, this trick only works for expressions without side-effects. For example, appending a same object to a list will increment the count every time you do so. Also, you need to be aware of tricky cases -- such as the fact that Python doesn't deallocate integers < 10, interned strings, etc. So, it can become unwieldy quickly.

That is where you want to have a thorough set of unit tests, and run it with the Lib/test/regrtest.py script using the -R4:3 option. At this point, you want to work directly in Python's source directory -- that is copy your module in Modules/ and modify setup.py to make it compile your rope implementation (just like you were core developer adding a new module). Finally, run make, copy your unit test in Lib/test and run it with "./python -tt -E Lib/test/regrtest.py -R4:3 ". It would surely be possible to keep your module from the core distribution (by writing some script that imports test.regrtest and call its main function directly), but that would be slightly more cumbersome.

If your code is leaking, regrtest's output will look like this:

alex:py3k% ./python -tt -E Lib/test/regrtest.py -R4:3 test_cmd_line test_cmd_line beginning 7 repetitions 1234567 ....... test_cmd_line leaked [0, 51, 0] references, sum=51 1 test OK. [71090 refs]

Unfortunately, that only tells you whether your module is leaking or not. If it is, then you will have to find the leak yourself. This can be hard if you never checked your code for leaks before, since you don't have any reference point to find the change that introduced the leak (By the way, it is a good idea to use a source version control to keep the history of your changes -- preferably a distributed one, like mercurial or bazaar, so that you don't have to go into the troubles of configuring a central server).

Here is simple checklist to help you find leaks:

  • In your destructor, have you decremented the reference, with Py_DECREF, of every objects "owned"? (this apply only if you have PyObject pointers in your object's struct)
  • In methods, have you decremented the reference of objects you created? Even when there's an error?
  • If you return an object that cames from the method's argument, did you increment its reference?
  • Do you have any lost references -- i.e., you created an object assigned to a variable that previously hold an object that you also created (in other words, not from the method's argument)?

If you have any trouble finding a reference leak, please tell. Don't waste your time on such details. I better like to see the rope implementation completed with a few reference leaks, than an uncompleted one without leaks.

Comment #49

Posted on Dec 30, 2007 by Happy Bird

Thanks for the quick reply. I will be sure to get a debug build of python quickly(I use gentoo so this shouldn't be too hard.)

Thanks for the checklist. I believe my code meets all of those criteria but I'll still check it.

Comment #50

Posted on Dec 30, 2007 by Happy Bird

Can I use the ghop repository for version control? Do I already have write privileges? Where should I put it under?

Comment #51

Posted on Dec 30, 2007 by Helpful Kangaroo

Yes, you can use the ghop-python-2007 project svn repository. You are already a member of that project, so you should be able to create and update files there. Please create a directory for your project under /trunk/ghop-python-2007. Either give it a descriptive name, or just use the issue number. Inside that directory you can organize your files in whatever way makes the most sense to you.

If you have problems getting set up with svn, send email to the ghop-python list with error messages, and other details and we'll try to help you out.

Comment #52

Posted on Dec 31, 2007 by Happy Bird

Okay I've uploaded my code. It is now under the ropes folder. Please take a look and tell me what you think. It isn't complete yet and there are probably some cases where I used rather ugly code or less than optimal algorithms. As of right now, I've tested it on a 64-bit machine and it passes the unit tests I've given it. I know that the unit tests don't cover all the functionality right now because that functionality is incomplete, most notably hashing. I'm still testing the hashing code and as such, I've used really ugly code that prints the hash out when I call len() on the rope. This will change once I get hashing working. As of right now, hashing does not work because of the limitations of C's integer data types. All of them overflow. I've downloaded pypy and made the hashing code for ropes more verbose. The hashes are exactly the same until the C code's integer overflows and I'm left with a negative number. Are there any solutions to this?

I did some cleanup on the code itself today so there is not much improved functionality from yesterday other than the ability to get the character at a specific index.

Right now, I'm using a static array of character to represent a literal. This has the limitation that a literal rope node can only have a certain amount of characters(64 at the moment). I did this because it makes everything much easier to handle in terms of memory management. Is this ok or do you want me to dynamically allocate buffers?

Lastly, I will not be able to do much tomorrow because it's New Year's Eve and and we're having a party. On New Year's itself I also probably won't be able to do much. But I will be able to devote more time on the 2nd.

I'll be waiting your comments on the code.

Comment #53

Posted on Dec 31, 2007 by Massive Ox

Before I go through your code and review it, I would like to know how much you care about this task. I know that the task is hard, and therefore I am willing to close my eyes on a few issues. So, I can either review it very throughly and point out every single issues. However if your goal is just to complete the task, you may find this quite annoying. On the other hand, if your goal is to write a great code, then you will probably love this. So for this time, I let you choose which review method you like better.

Comment #54

Posted on Dec 31, 2007 by Happy Bird

No I would really like you to review it thoroughly. I'd rather write great code so point out every issue and I'll fix it.

Comment #55

Posted on Jan 2, 2008 by Massive Ox

First, the stylistic issues:

  • Please follow the coding conventions described in PEP-7 (http://www.python.org/dev/peps/pep-0007/). Don't clump everything together; structure your code in paragraph, just like you would do for an essay.
  • Use shorter name for your variables in your main struct.
  • Don't use the Py prefix for public function, this is reserved for the interpreter core objects and functions.

I think your main rope data-structure for rope objects needs to be worked out better. Boehm's paper (pages 7-8) highlights an elegant way to represent ropes using only concatenation and closure nodes. It is elegant since it allows you to define as many rope types you want, without increasing the complexity of the rebalancing operations, while keeping the main data-structure compact. I am including an implementation of this idea filled with example how to use it. You are free to use it, or not, for your implementation.

Your implementation seems to be fairly straightforward. However, I believe many parts could (and should) be simplified. For example, I think you are overly clever in rope_index(). (Your use of tail recursion is surely neat, but not necessary since rope depth should be bounded. You shouldn't use a fixed array for literals, because this is making your rope object unnecessarily large (in term of memory).

Quick question, what is this for?

void rope_destroy(PyRopeObject* r) { rope_traverse(r, _rope_dec_refcount); }

Finally, I have found these bugs in your implementation:

r + "world" Rope('hello world') "world" + r Traceback (most recent call last): File "", line 1, in TypeError: cannot concatenate 'str' and 'ropes.Rope' objects

"".join(r) 104 104000413 -1908071293 858739893 -1881877042 182784266 -403830251 -1457451538 -1708129860 -720966496 1731276356 before:1731276356 after:1731276356 1731276356 'hello world' len(r) 104 104000413 -1908071293 858739893 -1881877042 before:-1881877042 after:-1881877042 -1881877042 5

r[0:3] Traceback (most recent call last): File "", line 1, in TypeError: sequence index must be integer, not 'slice'

r = Rope('') r + r + r + r [1] 10197 segmentation fault (core dumped) python r = Rope('a') r = r + r + r str(r) [1] 10203 segmentation fault (core dumped) python

r = Rope("abc") r = r * 1 len(r) b: 0 masked_power: 1 97 97000389 -1156378894 before:-1156378894 after:-1156378894 [1] 10126 segmentation fault (core dumped) python

hash(r) -1210995232 hash(r*1) -1210995056

r*2 Rope('abcabc') s Rope('abcabc') r*2 == s False

r = Rope("abc") s = Rope("abcabc") r+r == s False

r = Rope("abc") s = Rope("abcabc") hash(r+r) -1211166160 hash(s) -1211166688

while 1: ... r = Rope("abc" * 2000) ... [memory leak]

hash(r*1) -1210225008 hash(r+r+r) -1210224832 hash(r*1) [1] 10164 segmentation fault (core dumped) python

r = Rope("") r[-1] before:0 after:0 0 '\x00'

r = Rope() r.init(r, "hello") r.init(r, "hello") [1] 10217 segmentation fault (core dumped) python

r = Rope('') r.init(r, "hello") [infinite loop]

r = Rope('') Rope.init(r, "hello") Rope.init(r, "hello") Rope.init(r, "hello") r Rope('hellohellohello') [weird; and nasty for subclasses]

And one more thing, happy new year!

Attachments

Comment #56

Posted on Jan 2, 2008 by Massive Ox

Oh, I forgot to mention something. Here's an invocation of GNU indent, that will format your code automatically (almost perfectly) to PEP-7 C coding convention. You will just need to specify your typedefs, like PyRopeObject, with the -T flag.

For Python 2.x, use this: indent -ut -ts8 -i8 -l79 -bap -nbad -ncdb -ncs -sc -br -nce -cdw -cli0 \ -nss -npcs -saf -sai -saw -nbc -di1 -brs -psl -lp -nlps -nbbo \ -T PyObject

For Python 3.x, use this: indent -i4 -l79 -bap -nbad -ncdb -ncs -sc -br -nce -cdw -cli0 \ -nss -npcs -saf -sai -saw -nbc -di1 -brs -psl -lp -nlps -nbbo \ -T PyObject

Comment #57

Posted on Jan 2, 2008 by Happy Bird

First of all, thanks for reviewing it so thoroughly, I'll be sure to look into all the things you mentioned. However, I think you must have misunderstood me avassalotti, you reported some things as bugs which I said(at least I thought I said) aren't complete such as slicing or hashing. The only things implemented are concatenation, and repetition.

I think your main rope data-structure for rope objects needs to be worked out better. Boehm's paper (pages 7-8) highlights an elegant way to represent ropes using only concatenation and closure nodes. It is elegant since it allows you to define as many rope types you want, without increasing the complexity of the rebalancing operations, while keeping the main data-structure compact. I am including an implementation of this idea filled with example how to use it. You are free to use it, or not, for your implementation.

Thanks. I'll look into it.

Quick question, what is this for?

void rope_destroy(PyRopeObject* r) { rope_traverse(r, _rope_dec_refcount); }

This is an old function that I used when I first started the implementation to decrement the reference count of all the nodes children. However, it's no longer needed and I just forgot to take it out.

len(r) [ disregard the output from len ] 5

I don't see how this is a bug. As far as I can tell r is equal to "hello" and that is length 5 right?

r = Rope("abc") r = r * 1 len(r) [disregard len's output ] [1] 10126 segmentation fault (core dumped) python

This does not happen on my computer. This could be because I might have fixed the bug after committing the copy you looked at in the subversion repo. However, I'll check it on another computer. Also, the computer I'm developing from is 64-bit. Is your's 32-bit? This could be a problem if there's a bug in the code that makes an assumption about bit size although I doubt it.

while 1: ... r = Rope("abc" * 2000) ... [memory leak]

I am unsure about what you mean by this or how I am supposed to fix it. How exactly is the memory being leaked?

r = Rope('') Rope.init(r, "hello") Rope.init(r, "hello") Rope.init(r, "hello") r Rope('hellohellohello') [weird; and nasty for subclasses]

This is because I basically made the init method append its arguments to itself. I'll be sure to fix this.

Those are the only bugs that I had something to say about. I'll look into the rest that relate to the things I've implemented. And I'll start finishing up support for slicing, comparisons, and search/replace.

And one more thing, happy new year

You too.

Attachments

Comment #58

Posted on Jan 2, 2008 by Massive Ox

Ah, you are right. I missed your comment about hashing. Sorry, about that.

I've used really ugly code that prints the hash out when I call len() on the rope. This will change once I get hashing working. As of right now, hashing does not work because of the limitations of C's integer data types. All of them overflow. I've downloaded pypy and made the hashing code for ropes more verbose. The hashes are exactly the same until the C code's integer overflows and I'm left with a negative number. Are there any solutions to this?

Overflow is expected for the hashing functions. You don't have to make your ropes hash equals to PyPy's ropes. The only requirements regarding the hashing function are: 1) It should generate evenly distributed hash keys on random input. 2) Equal strings should have equal hashes.

This does not happen on my computer.

You removed the debugging printf in rope_length(), right? If so, you also removed the the call to rope_hash() that caused the crash. The problem is rope_hash() dereferences a null pointer when given a MULTIPLY node.

long x = rope_hash(rope->n_node.concat.left) +
    rope_hash(rope->n_node.concat.right) * masked_power(1000003,
[right is null for multiply nodes] -----^

However, I am not sure why "hash(r * 1)" (sometimes) works, though.

I am unsure> >>> while 1: ... r = Rope("abc" * 2000) ... [memory leak]

I am unsure about what you mean by this or how I am supposed to fix it. How exactly is the memory being leaked?

Ropes are never deallocated (This is almost certainly due to a reference-leak). To make my example clearer:

while 1: r = Rope("abc") del r

Comment #59

Posted on Jan 2, 2008 by Happy Bird

Ropes are never deallocated (This is almost certainly due to a reference-leak). To make my example clearer

while 1: r = Rope("abc") del r

I don't understand? Ropes are never deallocated? Shouldn't they be deleted once they are no longer used(e.g. once their reference count becomes 0).

Maybe I'm not understanding you correctly, but what exactly should happen in the case of your while loop?

Comment #60

Posted on Jan 2, 2008 by Massive Ox

Indeed, a rope should be deallocated when its reference count becomes 0. The problem is, some yet unknown reason, the reference count never gets to 0.

In the while loop, the rope should get deallocated at each iteration. However, it isn't the case right now, thus "the memory leaks."

Comment #61

Posted on Jan 2, 2008 by Happy Bird

Ah, ok. The newer version should not have this problem. I have committed my changes to the repository that should fix this problem as well as most of the other problems you described. Please note however that there is still no hashing or comparison support.

Comment #62

Posted on Jan 2, 2008 by Happy Bird

I just committed a new version that fixed a nasty bug in the last version which caused the ropes module to completely fail its unit tests due to allocating literals dynamically.

Comment #63

Posted on Jan 3, 2008 by Happy Bird

The new version now has comparison support(between ropes) and iteration support. I'll be working on hashing and search/replace tomorrow.

Comment #64

Posted on Jan 3, 2008 by Massive Ox

I reviewed your code some more, I saw a few more things to fix:

  • Don't use random value for your unit tests (testSlicing). Use specific boundary values, prone to failures, instead.
  • Strip the Lorem ipsum. Instead, generate long strings with repetition -- e.g. data1 = "XXX" * 200; data2 = "YYY" * 200; data3 = ...
  • Don't reinvent the wheel. Use the shared mixins for testing strings for your tests. Just import test.string_tests, and use what you need.
  • Remove Rope.append. Ropes should be considered as immutable.
  • Remove the "debugging" getters from the public interface of Rope (e.g. Rope.type). However, you can keep them for now, if they help you.
  • Declare 'static' all your functions. Combine 'python.c' and 'rope.c' if necessary.
  • Remove any implicit type conversion -- e.g. Rope + str should raise a TypeError.
  • In rope_repr(), escape any control characters. For example, this:

    Rope('hel\n\tlo') Rope('hel lo') should be: Rope('hel\n\tlo') Rope('hel\n\tlo') You probably can reuse the implementation of, or use, PyString_Repr() for this.

  • Please put all your debugging functions (e.g. print_rope()) at the same place and label them as such. Also, protect your debugging printf calls with if-statements. For example, "if (DEBUG) { printf(...); }".
  • Use Python's own memory allocator (PyMem_Malloc/PyMem_Free), instead of malloc/free.
  • Remove rope_incref(). You don't need it, and it's causing (major) reference leaks. You just need to call Py_INCREF once on an existing node (not a brand new one) to a concatenation node.
  • Don't check the type of 'self', in your methods. This is done automatically be the interpreter. Remember you are implementing an extension type, not a whole new public API for Python.
  • Define a macro, Rope_Check, if you really need to do this (I don't think so): "self->ob_type != &ropes_type".
  • Fix the compiler warnings. I included a build log with the '-ansi -pedantic' flags turned on. You can ignore the warnings about 'long long' in longobject.h.
  • Mark fall-through case-clauses explicitly (e.g, ROPE_SLICE_NODE in rope_to_string()).

Now, I have some questions for you:

  • Shouldn't rope_move() deallocates its 'src'? Otherwise, why is it called rope_move, and not rope_copy? Also, why this function is needed at all?
  • Why rope_append() is so long and complex? Why do you walk down the whole concatenation tree (to get the right-most leaf) to append a node? Couldn't you just create a new concatenation node and set the right children to 'self' and the left children to the rope to be appended?

Comment #65

Posted on Jan 3, 2008 by Massive Ox

Oops, I forgot to include the build log for the compiler warnings.

Attachments

Comment #66

Posted on Jan 3, 2008 by Happy Bird
  • Don't use random value for your unit tests (testSlicing). Use specific boundary values, prone to failures, instead.

Alright, I just used random values in the beginning because it made my code fail and allowed me to make the slicing code better.

  • Strip the Lorem ipsum. Instead, generate long strings with repetition -- e.g. data1 = "XXX" * 200; data2 = "YYY" * 200; data3 = ...

Ok.

  • Don't reinvent the wheel. Use the shared mixins for testing strings for your tests. Just import test.string_tests, and use what you need.

I'll be sure to do that.

  • Remove Rope.append. Ropes should be considered as immutable.

Isn't that the same as +=, should I take that out too?

  • Remove the "debugging" getters from the public interface of Rope (e.g. Rope.type). However, you can keep them for now, if they help you.

I'll take them out once I am sure they're no longer needed. They really help me with runtime debugging right now.

  • Declare 'static' all your functions. Combine 'python.c' and 'rope.c' if necessary.

I was originally going to make rope.c a ropes library in C but with the way it's gone it no longer makes a difference so I'll change it.

  • Remove any implicit type conversion -- e.g. Rope + str should raise a TypeError

This is another debugging remnant. I just didn't want to type Rope(' all the time. But I'll change it.

  • In rope_repr(), escape any control characters. For example, this:

    Rope('hel\n\tlo') Rope('hel lo') should be: Rope('hel\n\tlo') Rope('hel\n\tlo') You probably can reuse the implementation of, or use, PyString_Repr() for this.

Ok.

  • Please put all your debugging functions (e.g. print_rope()) at the same place and label them as such. Also, protect your debugging printf calls with if-statements. For example, "if (DEBUG) { printf(...); }".

Alright but don't you mean a #if? I guess a non brain-dead compiler could optimize that out though.

  • Use Python's own memory allocator (PyMem_Malloc/PyMem_Free), instead of malloc/free.

For some reason, when I first was doing the string conversion function, using PyMem_Malloc seg faulted but now it works just fine so I'll change these over too.

  • Remove rope_incref(). You don't need it, and it's causing (major) reference leaks. You just need to call Py_INCREF once on an existing node (not a brand new one) to a concatenation node.

I just realized this problem while going through the code in my head. It'll definitely be taken out next commit.

  • Don't check the type of 'self', in your methods. This is done automatically be the interpreter. Remember you are implementing an extension type, not a whole new public API for Python.

Ok

  • Fix the compiler warnings. I included a build log with the '-ansi -pedantic' flags turned on. You can ignore the warnings about 'long long' in longobject.h.

I know about the compiler warnings but I wasn't concerned about them at the time because I was more concentrated on getting everything to work. I will make sure to fix these up tomorrow.

  • Mark fall-through case-clauses explicitly (e.g, ROPE_SLICE_NODE in rope_to_string()).

Alright I'll do this.

  • Shouldn't rope_move() deallocates its 'src'? Otherwise, why is it called rope_move, and not rope_copy? Also, why this function is needed at all?
  • Why rope_append() is so long and complex? Why do you walk down the whole concatenation tree (to get the right-most leaf) to append a node? Couldn't you just create a new concatenation node and set the right children to 'self' and the left children to the rope to be appended?

I'm answering these questions together because they have something to do with each other. rope_move() was made for rope_append. You are right about simply creating a concatenation node. I don't know what I was thinking when I wrote this function.

I'll also format the code according to PEP 7 tomorrow.

Comment #67

Posted on Jan 3, 2008 by Massive Ox
  • Remove Rope.append. Ropes should be considered as immutable.

Isn't that the same as +=, should I take that out too?

If you had specific code to make += in-place, yes. Python will define x += y automatically as x = x + y, which is the desired behavior.

  • Please put all your debugging functions (e.g. print_rope()) at the same place and label them as such. Also, protect your debugging printf calls with if-statements. For example, "if (DEBUG) { printf(...); }".

Alright but don't you mean a #if? I guess a non brain-dead compiler could optimize that out though.

No, I really meant a if-statement. As you said, the compiler will optimize it out automatically. The advantage, with an if-statement, is the debugging code gets checked by the complier. Although, #ifdef is sometime the only choice -- e.g. at the top-level.

Comment #68

Posted on Jan 4, 2008 by Happy Bird

Eeeks. I ran GNU indent like you said and it makes the code really ugly and almost unreadable. Do you want me to commit the file like that? I can send you a copy of it if you like.

Comment #69

Posted on Jan 4, 2008 by Massive Ox

What do you mean by "ugly and almost unreadable"? Can you show an example?

Comment #70

Posted on Jan 4, 2008 by Happy Bird

yeah. For an example of what I'm talking about look at line 223. The function call has been split up into more than 10 lines

Attachments

Comment #71

Posted on Jan 4, 2008 by Happy Bird

One last question, how exactly do I use the test.string_tests module?

Comment #72

Posted on Jan 4, 2008 by Massive Ox

Ah, that is what I thought. This means you need to rework your code to use less indentation.

Here's a few tricks: - You can combine nested if-statements by combining their conditions. For example if (cond1) { if (cond2) { /* do something / } } can be rewritten as: if (cond1 && cond2) { / do something / } - You can also use the Linus Torvalds trick and rewrite this: for (i=0; i < 100; i++) { if (cond) { / do something / } } as: for (i=0; i < 100; i++) { if (!cond) continue; / do something */ } - Don't add braces for your case-clauses. - Extract your code into separate functions. - Use better algorithms.

Comment #73

Posted on Jan 4, 2008 by Massive Ox

Oh, it is really easy. You just subclass the test cases you need. You can use test_str as a reference:

http://svn.python.org/view/python/trunk/Lib/test/test_str.py?rev=55873&view=markup

Comment #74

Posted on Jan 4, 2008 by Happy Bird

But how do I make it use the ropes type?

Comment #75

Posted on Jan 4, 2008 by Happy Bird

Oh I see nevermind. The problem is the unit tests test for things that I've not implemented or are not supposed to work. For example it concatenates a string to a rope and you told me that should fail so the unittests all fail with E because some sort of exception was thrown.

Comment #76

Posted on Jan 4, 2008 by Happy Bird

I committed a new revision just now that almost fixes all of the issues you mentioned. I still have to remove some debugging functions(like print_rope()) but I'll do that once I'm finished. I've also moved everything to one file, ropes.c and I've renamed the unit tests file test.py to test_ropes.py because it will cause problems once I'm using test.string_tests.

Comment #77

Posted on Jan 4, 2008 by Massive Ox

I did a fair amount of refactoring on your code. I added some bells-and-whistles too (like a hashing function than generate the same hashes as str). I am not done yet, but I want to show what I got, so you don't do unnecessary work.

Attachments

Comment #78

Posted on Jan 4, 2008 by Massive Ox

By the way, for your iterator, you probably want to generate a linked-list of the leaves and make next() simply advances to the next node in the list.

Comment #79

Posted on Jan 4, 2008 by Happy Bird

Ok thanks avassalotti. The only thing I'm wondering about in your code is the absence of the slice node type. Is this just something you haven't done or is it something you want me to take out?

Also, did you commit your changes to the svn repo?

Comment #80

Posted on Jan 4, 2008 by Happy Bird

One more thing. I noticed, you no longer have the sq_contains method, is this something you also want to take out?

Comment #81

Posted on Jan 4, 2008 by Massive Ox

The only thing I'm wondering about in your code is the absence of the slice node type. Is this just something you haven't done or is it something you want me to take out?

Honestly, don't know. There is a complexity trade-off for the lazily evaluated slice nodes (for the repeat nodes as well). Personally, I think your implementation would be much simpler without these lazy nodes. Carl what do you think?

Also, did you commit your changes to the svn repo?

This repository is yours. I won't commit anything into it, unless I get your permission.

Comment #82

Posted on Jan 4, 2008 by Massive Ox

One more thing. I noticed, you no longer have the sq_contains method, is this something you also want to take out?

No. I removed temporarily to make the code not depend on the iterator. By the way, do you that your implementation of sq_contains is quadratic (i.e., O(n**2))?

One last thing, does it bother you that I rewrote parts of your code to be more idiomatic? I know that you are working hard enough already. So, I don't want to make this task harder than it should be, by forcing you to review my code.

Comment #83

Posted on Jan 4, 2008 by Happy Bird

One last thing, does it bother you that I rewrote parts of your code to be more idiomatic?

No, that's fine.

I'll rewrite the iterator to use a linked list of nodes instead of creating subiterators like you said.

Comment #84

Posted on Jan 4, 2008 by Helpful Dog

Honestly, don't know. There is a complexity trade-off for the lazily evaluated slice nodes (for the repeat nodes as well). Personally, I think your implementation would be much simpler without these lazy nodes. Carl what do you think?

The following came out of experiments in PyPy. Should be similarly true for this implementation:

I think repeat nodes are not really worth it, because you can write a multiplication algorithm that uses O(log(n)) time and space using concatenation nodes very easily.

Something similar is true for slicing. If the slice is small, it is much quicker to make a copy anyway. If you take a big slice of a rope that is somewhat well-balanced then you can share most nodes and use only a bit of additional storage (again proportional to O(log(length)), I think, but haven't tried to prove it).

There are lots of advantages of not having new node-types: Less code-complexity, less parameters you need to tune, etc.

Comment #85

Posted on Jan 4, 2008 by Happy Bird

I'll think I just will implement slice nodes by copying because the time advantages of using a slice node is balanced out by the memory advantages of making a copy. This is becuase, in the current implementation a very large rope is still kept in memory even though the only reference to it is a slice which only takes one character. Also, as cfbolz said, slice nodes add complexity. However, I think I'll keep repeat nodes because they aren't too complex and they also allow very large repeats.

Just a quick question though, cfbolz. In the PyPy page on ropes you calculated the hash of 'a'*sys.maxint. How did you this if PyPy just uses concatenation nodes and not repeat nodes?

Comment #86

Posted on Jan 4, 2008 by Helpful Dog

Well, you can concatenate a node to itself. by doing this you can get very long strings very quickly. Look at this:

http://codespeak.net/~cfbolz/rope_multiplication.png

It's a visualization of the node graph of 'a' * 100000.

This makes calculating hashes very fast too, because for a concatenation node you need to calculate the hash of both children, which are the same, so you calculate it only once.

Comment #87

Posted on Jan 4, 2008 by Happy Bird

Ah, I see.

Comment #88

Posted on Jan 4, 2008 by Happy Bird

Avassalotti, I am not sure what you mean when you say my implementation of ropeobj_contains is quadratic. I iterate over self only once. In the worst case it could be O(n**2) but it also could be O(n) or less. I am not sure how else to implement this. Basically, what I am doing inside the nested loop is the same thing a platform's strcmp() would do except using PyObject*'s.

Comment #89

Posted on Jan 4, 2008 by Happy Bird

One more thing, do you want me to make ropeobj_compare independent of iterators?

Comment #90

Posted on Jan 4, 2008 by Massive Ox

Please, ignore my previous comment about the complexity of your ropeobj_contains. Just didn't understand what I was talking about.

What I meant is, you are using the naive string searching algorithm, which is O((n - m + 1)*m) where n is the length of the "haystack" and m the length of the "needle". My point is there exists better algorithms that perform in average linear time -- such as the one Python's str uses (a variant of Boyer-Moore-Horspool).

So perhaps, a better solution would be to convert the ropes to strings (with rope_str) and use PySequence_Contains.

Comment #91

Posted on Jan 4, 2008 by Massive Ox

One more thing, do you want me to make ropeobj_compare independent of iterators?

Only if it makes the code simpler or more readable.

Comment #92

Posted on Jan 4, 2008 by Massive Ox

So perhaps, a better solution would be to convert the ropes to strings (with rope_str) and use PySequence_Contains.

Nevermind. This would be actually worse than the naive approach.

Comment #93

Posted on Jan 4, 2008 by Happy Bird

Ah I see. I'll implement a better searching algorithm. It can't be the Boyer-Moore-Horspool algorithm however because that would require searching backwards which would make the code more complex. The problem is most of the efficient searching algorithms require random access to the string but calling rope_index continuously is more inefficient than just using the naive approach.

Comment #94

Posted on Jan 4, 2008 by Happy Bird

I've redone the iterators to make them more efficient by using a fixed size array. I'll commit the changes to the repo. I've decided to use your version of ropes.c. The only problem is that for some reason, my system does not seem to define Py_ssize_t. Any ideas on why this might be?

Comment #95

Posted on Jan 5, 2008 by Massive Ox

I think you should stick with the naive approach for now. I will lookup in the literature if there is a way to use a property of search trees to make comparisons and searching more efficient. I will tell you about it if I find anything useful.

The only problem is that for some reason, my system does not seem to define Py_ssize_t. Any ideas on why this might be?

Which version of Python do you use?

Comment #96

Posted on Jan 5, 2008 by Happy Bird

2.4

Comment #97

Posted on Jan 5, 2008 by Massive Ox

Can you upgrade to 2.5? Otherwise, you could temporarily use this typedef:

ifdef HAVE_SSIZE_T

typedef ssize_t Py_ssize_t;

else

typedef long Py_ssize_t;

endif

Comment #98

Posted on Jan 5, 2008 by Helpful Dog

I would stick to naive searching too for now. You can probably get really efficient search algorithms on the trees (even much better than effsearch in some cases :-) ) but they are work and really not simple at all. Alexandre, if you happen to find a nice algorithm in the literature I would be extremely interested to hear about it.

Comment #99

Posted on Jan 5, 2008 by Happy Bird

I'll get python 2.5 ASAP. I've committed the changes to the repo.

Comment #100

Posted on Jan 5, 2008 by Massive Ox

I did some more work on the reviewed parts. The main changes are: - Added proper indexing support, with rope_as_mapping. - Added GC support. No more reference leaks! - Added rope_from_string, to create a rope from a C string (used in rope_new). - Improved the handling of REPEAT_NODE in _rope_str. - Fixed a nasty double DECREF in rope_repr. (Oh, you fixed this one already. :-)) - Renamed rope_traverse to rope_char_iter - Added rope_traverse for GC support.

I will review the balancing code and the iterator, tomorrow.

Attachments

Comment #101

Posted on Jan 5, 2008 by Happy Bird

Balancing isn't done yet. You can review the iterator part. But I wanted to improve the balancer so that it can merge adjacent literal nodes which are too small.

Thanks for the patch but the only thing that I'm confused about is rope_traverse. I thought this was only needed for cyclic garbage collection, where the node can hold a reference to itself. If a node referenced itself, the length of the string would be infinite and hashes wouldn't compute correctly or would take forever to do so. Not to mention, the impossibility of converting the rope to string. I really don't see why this is needed. Could you post an example of code that has cyclic references?

Comment #102

Posted on Jan 5, 2008 by Happy Bird

Just to let you know that I won't be able to do much today as I have a lot of things to do before school starts again. However, I'll try and finish the balancing code later.

Comment #103

Posted on Jan 5, 2008 by Massive Ox

Fine, I won't review the balancing code yet.

And for the GC issue, you are right that a rope forms an acyclic graph. Just been lazy and used the GC to clean up unreferenced objects. I think the INCREF in rope_concat might be superfluous, but I am not sure yet.

Comment #104

Posted on Jan 6, 2008 by Happy Bird

Okay, I've implemented merging of adjacent nodes of small length in the balancing code. I've also added a new debugging tool which implements a balance method so you can test the balancing code. Now the new iterator code which uses a list has a bug inside which I'll fix tomorrow along with the slicing code which I'm basically going to port from the PyPy implementation. Note that because the iterators are not working, comparisons also don't work and so two of the unit tests fail.

Comment #105

Posted on Jan 6, 2008 by Happy Bird

One more thing, do you think MIN_LITERAL_LENGTH is too small? I saw that in the rope_to_string code you don't split the string; should this be the same way in the balancing code? E.g. merge all adjacent literal nodes?

Comment #106

Posted on Jan 6, 2008 by Happy Bear

Extending until 01-08.

Comment #107

Posted on Jan 6, 2008 by Happy Bird

Slice support is complete. I've committed it to the svn repo.

Comment #108

Posted on Jan 6, 2008 by Helpful Giraffe

I hereby nominate this task as the task with the most comments! Impressive work, iammisc, and some good mentoring.

Comment #109

Posted on Jan 6, 2008 by Massive Ox

I hereby nominate this task as the task with the most comments! Impressive work, iammisc, and some good mentoring.

Thanks Titus!

Comment #110

Posted on Jan 6, 2008 by Massive Ox

Slice support is complete. I've committed it to the svn repo.

Could you move rope_slice into rope_subscript? There is already a stub for it.

    ...
    else if (PySlice_Check(item)) {
            Py_INCREF(Py_NotImplemented);
            return Py_NotImplemented;
    }
    ...

You can get the indices of the slice object as follow:

    Py_ssize_t start, stop, step, slicelength;
    if (PySlice_GetIndicesEx((PySliceObject *)item,
                             self->length,
                             &start, &stop, &step, &slicelength) < 0) {
        return NULL;
    }

It is not super important, though. Also, the rope_balance seems to have some problem:

from ropes import Rope [19942 refs] r = Rope("hello") [19944 refs] r[:] Rope('hello') [19946 refs] r[2:4] Rope('ll') [19946 refs] r[2:4:2] NotImplemented [19946 refs] r[2:4] Rope('ll') [19946 refs] r += r [19948 refs] r += r [19950 refs] r += r [19952 refs] r += r [19954 refs] r += r [19956 refs] r += r Program received signal SIGSEGV, Segmentation fault. [Switching to Thread -1210681152 (LWP 7362)] 0xb7f075f2 in rope_balance (r=0xb7d1c61c) at src/ropes.c:650 650 if (node_list[i]->type != LITERAL_NODE)

P.S. Could you keep your changes consistent with PEP-7? I find it particularly difficult to review C code that doesn't the follow PEP-7 coding conventions.

Comment #111

Posted on Jan 7, 2008 by Happy Bird

I don't get this error on my computer but I do have an explanation for it. I think it is caused by me not initializing i to 0 at the beginning of the loop. I have committed the new version to the repo. This new version also fixes the formatting changes, gets rid of rope_slice, and replaces it with rope_subscript although stepping support is Not Implemented.

Attachments

Comment #112

Posted on Jan 8, 2008 by Massive Ox

Sorry, I didn't have the time review the code yet. Nevertheless, I found some bugs:

  • rope_repeat (and other methods that changes the length of the rope) doesn't check for integer overflow.

r = Rope('A') * int(2**31-1) 2147483647 r = r * int(2**31-1) len(r) 1

  • rope_concat is leaky. I think this is related to the potentially superfluous INCREF (see Comment 103). At least, the garbage collector does its work. :-)

r = Rope('1234567890') [19940 refs] for _ in range(20): ... r = r + r ... [2018791 refs] del r [19943 refs]

On the positive side, I really like the memory behavior of slicing. It was a good idea to replace the slice nodes.

from ropes import Rope [19763 refs] r = Rope('1234567890') [19765 refs] for _ in range(20): ... r = r + r ... [2018616 refs] r = r[:len(r)/2] [1019192 refs] r = r[:len(r)/2] [519480 refs] r = r[:len(r)/2] [269624 refs] r = r[:len(r)/2] [144696 refs] r = r[:130] [19772 refs] r = r[:1] [19770 refs]

However, concatenation will need to be tuned. The following example consumes more than 800MB of memory:

from ropes import Rope r = Rope('1234567890') for _ in range(23): ... r = r + r ... len(r) 639631360

While, a Python string of the same length consumes 610MB [1Bytes * 639631360 * 1MB/(2^30)Bytes = 610MB]. In theory, a rope could consumes much less memory. With a minimal length of 128, my example could be represented using only with a 160Bytes literal [min(10 * 2^x) > 128 = 160 (x = 4)] and 19 concatenation nodes [23 - 4]. Each node has a 40Bytes overhead (on a 32bit machine). Therefore, the whole string could be represented using a grand total of 960 bytes [160Bytes + (20-node * 40Bytes/node)]. But wait, something doesn't seems right to me; 160 * 2^19 = 83886080, not 639631360. Hm.

s = "1234567890" for _ in range(23): ... s = s + s ... len(s) 83886080

Ha ah! I found another bug.

Nitpicking: It is bad practice to use GNU indent to reformat reviewed code.

Comment #113

Posted on Jan 8, 2008 by Happy Bird
  • rope_repeat (and other methods that changes the length of the rope) doesn't check for integer overflow.

Alright, I'll fix it.

  • rope_concat is leaky. I think this is related to the potentially superfluous INCREF (see Comment 103). At least, the garbage collector does its work. :-)

It could be the INCREF but it also could be the balancing code. I'll check this out.

On the positive side, I really like the memory behavior of slicing. It was a good idea to replace the slice nodes.

Thanks, I feel much more comfortable with the existing implementation too.

However, concatenation will need to be tuned. The following example consumes more than 800MB of memory:

Once again, I think this is related to a bad balancing method which needs to check if adjacent nodes are the same. The balancing method makes a list of nodes. I was actually thinking of always keeping the rope balanced. All concatenations are always made to the right side so keeping it balanced all the time wouldn't be too hard. Actually, I think it will make everything a whole lot easier. What do you think about this?

Nitpicking: It is bad practice to use GNU indent to reformat reviewed code.

:-[ Sorry about that. It's just that I don't like reformatting code after I've written it. I think I'll pay more attention when writing the code next time.

P.S. School starts tomorrow so I won't be able to go as fast as I have. You'll probably have to wait about 24 hours for new code or comments.

Comment #114

Posted on Jan 8, 2008 by Helpful Dog

Once again, I think this is related to a bad balancing method which needs to check if adjacent nodes are the same.

Why does it need to check that? Are you roughly following the algorithm in the paper or in the PyPy implementation?

The balancing method makes a list of nodes. I was actually thinking of always keeping the rope balanced. All concatenations are always made to the right side

Why are all concatenation always made to the right side? You can imagine having a huge rope and concatenating something to the left side.

so keeping it balanced all the time wouldn't be too hard. Actually, I think it will make everything a whole lot easier. What do you think about this?

No, that doesn't make sense to me. The idea of ropes is to make concatenation extremely fast in most cases (creating one concat one is much cheaper than copying two strings). In most cases the rope is not really perfectly balanced but "balanced enough" so that you don't really care. Only when the imbalance becomes too great you do a full rebalance.

Ah, another note: the rope_index function uses tail recursion (the last thing the function does is call itself again) which means it can be trivially rewritten to use iteration instead. This is actually quite beneficial, because you safe a lot of function calls when the rope is deep (a good compiler can theoretically do that itself, but you never know). The result would look roughly like this (untested):

static char rope_index(RopeObject *self, Py_ssize_t i) { while (1) { assert(self && i < self->length);

            switch (self->type) {
            case LITERAL_NODE:
                    return self->v.literal[i];
            case CONCAT_NODE:
                    if (self->v.concat.left && i < self->v.concat.left->length) {
                            self = self->v.concat.left;
                            continue;
                    }
                    else if (self->v.concat.right) {
                            i = i - self->v.concat.right->length;
                            self = self->v.concat.right;
                            continue;
                    }
                    break;
            case REPEAT_NODE:
                    if (self->v.repeat.child) {
                            i = i % self->v.repeat.child->length;
                            self = self.v.repeat.child;
                            continue;
                    }
                    break;
            }
    }
    /* never reached */
    return -1;

}

Comment #115

Posted on Jan 8, 2008 by Massive Ox

Nitpicking: It is bad practice to use GNU indent to reformat reviewed code.

:-[ Sorry about that. It's just that I don't like reformatting code after I've written it. I think I'll pay more attention when writing the code next time.

No need to be sorry. I am really nitpicking. My comments about the formatting are just friendly nudges to make you learn that it is a good habit to keep the code clean while you write it. Just by curiosity, what editor do you use for writing your code?

P.S. School starts tomorrow so I won't be able to go as fast as I have. You'll probably have to wait about 24 hours for new code or comments.

That is fine. I am thinking marking this task as completed soon, since I don't want to drag you over with new requirements. You been doing great work so far, keep it up.

Comment #116

Posted on Jan 8, 2008 by Massive Ox

Ah, another note: the rope_index function uses tail recursion

Oh, it is me that wrote (actually rewrote) rope_index. Travis's initial version used a clever technique to reuse the function stack frame (with a goto). I thought this was unnecessary (since most modern C compilers do sibling call optimization) and rewrote it in a more intuitive style.

(the last thing the function does is call itself again) which means it can be trivially rewritten to use iteration instead. This is actually quite beneficial, because you safe a lot of function calls when the rope is deep

The rope shouldn't have a deep greater than 32. So, the benefit is not noticable.

(a good compiler can theoretically do that itself, but you never know).

Yes, that is right. I know that GCC does since 3.4 (I am not sure about MSVC, though). Anyway, the hand-optimized version doesn't look bad, either.

Comment #117

Posted on Jan 8, 2008 by Helpful Dog

Oh, it is me that wrote (actually rewrote) rope_index. Travis's initial version used a clever technique to reuse the function stack frame (with a goto). I thought this was unnecessary (since most modern C compilers do sibling call optimization) and rewrote it in a more intuitive style.

Oh, I see. I find the while loop pretty natural, but maybe you are right that it does indeed not matter too much.

Comment #118

Posted on Jan 9, 2008 by Happy Bird

No, that doesn't make sense to me. The idea of ropes is to make concatenation extremely fast in most cases (creating one concat one is much cheaper than copying two strings). In most cases the rope is not really perfectly balanced but "balanced enough" so that you don't really care. Only when the imbalance becomes too great you do a full rebalance.

Well you would basically rotate the tree to the right or left depending on if you're appending a concatenation node to a literal or repetition node or vice versa. I'm not quite sure on how to explain this. If you're concatenating a concatenation node to a literal or repetition node and the concatenation node is balanced(which it should be because the rope_concat function would have balanced it), then simply rotating the tree to the right would balance it out. If you're doing it vice versa(e.g. appending a literal node or repetition node to a balanced concatenation node), then rotating it to the left would balance it out. If you're appending a literal node or repetition node to another literal or repetition node, then you don't need any tree rotations. If you're concatenating two concatenation nodes with the same depth, then the old method works. The hardest part is concatenating two concatenation nodes with different depths, in this case, I think you can do the appropriate amount of rotations depending on the difference in depths between the two nodes. Rotating a ropes is an O(1) operation and is pretty cheep. Also, this way is faster than my balancing implementation which now that you mention it, doesn't actually use the algorithm in the paper or the PyPy one, which is why I would like to change it. In the way I described above, the cost of rebalancing is spread out so the rope implementation doesn't seem to all of a sudden slow down because it is balancing a rope. I don't think the cost of a simple rotation ( in most cases ) would substantially slow down the code. I think on most modern computers, the effects would hardly be noticeable(unless you're counting nanoseconds).

No need to be sorry. I am really nitpicking. My comments about the formatting are just friendly nudges to make you learn that it is a good habit to keep the code clean while you write it. Just by curiosity, what editor do you use for writing your code?

I use emacs in its C mode with automatic indentation which is why the indents don't come out right. I only just switched to emacs about two to three months ago. Is there a way for me to get emacs to indent it according to PEP 7. Even better, is there an emacs minor mode that will do it for me?

Please respond with your thoughts, especially on my idea on rebalancing. Do you think it's overkill or do you think the performance losses are negligible? I think the only real noticeable performance loss would be on nodes with the very different depths. Anyway, tell me what you think.

Comment #119

Posted on Jan 9, 2008 by Massive Ox

I use emacs in its C mode with automatic indentation which is why the indents don't come out right.

Euh, I use Emacs too and I don't have any problem with the automatic indentation. That is my mode I use for Python C code:

(defmacro def-styled-c-mode (name style &rest body) "Define styled C modes." `(defun ,name () (interactive) (c-mode) (c-set-style ,style) ,@body))

(def-styled-c-mode python-c-mode "python" (setq indent-tabs-mode t tab-width 8 c-basic-offset 8))

(def-styled-c-mode py3k-c-mode "python" (setq indent-tabs-mode nil tab-width 4 c-basic-offset 4))

Comment #120

Posted on Jan 9, 2008 by Happy Bird

Thanks avassalotti, that works much better.

Comment #121

Posted on Jan 9, 2008 by Massive Ox

Regarding the rebalancing, I am not sure yet what you're suggesting is appropriate for ropes. I will have to think about this a bit more.

Comment #122

Posted on Jan 9, 2008 by Massive Ox

Well you would basically rotate the tree to the right or left depending on if you're appending a concatenation node to a literal or repetition node or vice versa.

You can't append a concatenation node to a literal. Although, you can create a new concatenation node with a literal and another concatenation node.

I'm not quite sure on how to explain this. If you're concatenating a concatenation node to a literal or repetition node and the concatenation node is balanced(which it should be because the rope_concat function would have balanced it), then simply rotating the tree to the right would balance it out. If you're doing it vice

"Rotating the tree to the right" doesn't anything to me, unless you specify the pivot. (I am assuming that you would choose the previous concatenation root as the pivot). Anyway, this idea won't work and here is simple counter-example (with pretty pictures):

http://peadrop.com/rope/balancing.html

cheep. Also, this way is faster than my balancing implementation which now that you mention it, doesn't actually use the algorithm in the paper or the PyPy one, which is why I would like to change it.

Why? The algorithm is the paper is elegant, fast, and simple to implement, no? Do you have difficulties understanding it? If so, I can help you; just ask.

Comment #123

Posted on Jan 9, 2008 by Happy Bird

I see now cfbolz, thanks for the page. I wasn't visualizing it right. I'll implement the algorithm in the paper then.

Comment #124

Posted on Jan 10, 2008 by Happy Bird

Last question, how does the pypy implementation handle keeping the depth below max depth? AFAICT, it just rebalances the rope.

Comment #125

Posted on Jan 10, 2008 by Massive Ox

Well, a rope of depth 32 can have 2^32 leafs. Combine this with the minimum leaf length of 128, this give you more space than you will ever need. And simply rebalancing the rope will reduce significantly the depth of the rope. So, I don't think you have to worry about this.

Comment #126

Posted on Jan 10, 2008 by Happy Bird

Okay. The new implementation works well. However, it fails completely on large ropes(I mean really large with depth 32). Also, literal merging also fails. I'm going to commit the code to the repo anyway. I would suggest that you undefine LITERAL_MERGING and try the rebalancing code. Please tell me what you think. I've had very little time to do anything today because I had to do school stuff too.

Comment #127

Posted on Jan 11, 2008 by Happy Bird

I have a question. Should rebalancing be done in place? E.g. should calling the balance() method on a rope return a new rope that is balanced or should it balance the rope in place? Right now, the rope_balance method returns another rope but it could have completely destroyed the nodes of the old rope. It doesn't matter in a non-debug build where the only place it can be called is the rope_concat function and internal functions that do it "right." However, in debug mode, a user can just call the balance() method and expect to get a new rope. However, the balance method destroys the nodes of the old rope so when you try to repr() the old rope, it will fail. The way to fix this would be to assign the result of the balance() method back to the old variable which is the behavior that the rope_concat functions emulate. For example, say r is a rope. To balance it one should do:

r=r.balance()

because

r.balance()

will make repr(r) fail.

I just wanted to know what you think about this. Should I leave it like this, because balance() is only available in DEBUG mode and because anyone who is doing debugging should already know this or should I make rope_balance balance the rope in place?

Comment #128

Posted on Jan 11, 2008 by Helpful Dog

r.balance() must not change r itself in any way! Consider the following situation:

r1 = r2 = x = r1 + r2

r1.balance() (whether called explicitly or by some other operation does not matter).

If r1.balance() broke r1, then x would be broken as well. If r1.balance() would happen inplace then this might destroy the balancedness of x. It is really best to view ropes as something immutable.

Comment #129

Posted on Jan 12, 2008 by Happy Bird

ok cfbolz.

Comment #130

Posted on Jan 12, 2008 by Happy Bird

Ok, balancing is now much better. I've committed the new version to the svn repository. Please take a look.

Comment #131

Posted on Jan 12, 2008 by Massive Ox

Great to hear! I will review the new balancing code as soon as I get the time (i.e., probably tomorrow morning).

Comment #132

Posted on Jan 12, 2008 by Quick Elephant

I've been asked to relay my experience with "ropes" in ABC. Well, ABC didn't actually use ropes (I think) -- it used a balanced (B?) tree structure. You can still find the source online: http://homepages.cwi.nl/~steven/abc/implementations.html; the file to start is probably abc/btr/i1tex.c.

In my experience there were two major issues with the implementation in ABC: due to the tree balancing algorithm it was very complex, which made it hard to get the code (and the code using it) to be absolutely correct; and, while it was carefully designed to have asymptotically optimal performace for all operations on large strings, the overhead for small strings (which abound in typical programs) was pretty bad.

It was also expensive to get a string out of it in a form that a typical C system or library call could use, although that wasn't much of an issue for ABC (which doesn't have a lot of system interface code, unlike Python).

Comment #133

Posted on Jan 13, 2008 by Massive Ox

Thanks again Guido for sharing your thoughts.

Travis, if you don't already know, ABC is the predecessor of Python. (http://en.wikipedia.org/wiki/ABC_%28programming_language%29). Just like Cedar, the programming environment described in Bohem's paper, ABC used heavily a rope-like structure (a B-tree actually) for its strings.

Of course, ABC's implementation is quite different for yours, and probably won't be much useful to you. Anyway, I thought you would find this little piece of history interesting.

Comment #134

Posted on Jan 13, 2008 by Massive Ox

The balancing operation is still quite leaky (both in memory and in references):

from ropes import Rope [20516 refs] a = Rope("a") [20518 refs] b = Rope("b") [20521 refs] for _ in range(100000): ... a += b ... Traceback (most recent call last): File "", line 2, in KeyboardInterrupt [58449746 refs] len(a) 61324 [memory usage: 980.3MB]

I think you should use an array of MAX_DEPTH elements for storing the forest, instead of a list of pointers, during the balancing operation. This way you wouldn't need to manage memory manually, thus reducing the risk of leaks. Also, why rope_concat_unchecked is needed at all? The forest elements during rebalancing operation shouldn't exceed MAX_DEPTH, no?

Comment #135

Posted on Jan 13, 2008 by Massive Ox

Oh, I don't know if this will be any useful to you, but I wrote, earlier this week, a simple implementation of rope in Python. I used it to improve my understanding the rebalancing operation. Feel free to ignore if you already understand the balancing algorithm, described in the paper, well.

Attachments

Comment #136

Posted on Jan 13, 2008 by Happy Bird

Travis, if you don't already know, ABC is the predecessor of Python. (http://en.wikipedia.org/wiki/ABC_%28programming_language%29). Just like Cedar, the programming environment described in Bohem's paper, ABC used heavily a rope-like structure (a B-tree actually) for its strings.

Yes. I looked up ABC on wikipedia already.

The balancing operation is still quite leaky (both in memory and in references):

I'll look into it. I have an idea on how to reduce the memory usage(esp. in regards to using a malloc()'ed array). See the end of my post. This could also be caused by literal merging creating new literal nodes with the full length. Because new nodes have been created, their are now more literals than just the original one literal.

I think you should use an array of MAX_DEPTH elements for storing the forest, instead of a list of pointers, during the balancing operation.

That is what I was originally thinking but the balancing operation doesn't really guarantee that it will make the rope's depth less than MAX_DEPTH. This is once again, just something for handling worst case scenarios.

Also, why rope_concat_unchecked is needed at all? The forest elements during rebalancing operation shouldn't exceed MAX_DEPTH, no?

As said before, the algorithm in the paper doesn't really make sure the rope depth doesn't exceed MAX_DEPTH so it can cause rope_balance to be recursively called. This is really something for a worst case scenario.

Oh, I don't know if this will be any useful to you, but I wrote, earlier this week, a simple implementation of rope in Python. I used it to improve my understanding the rebalancing operation. Feel free to ignore if you already understand the balancing algorithm, described in the paper, well.

I'll definitely look into it. I think I understand the algorithm well, though.

Right now, I'm using an array of rope nodes because that was easier. However, I'm thinking of replacing it with a singly-linked stack type linked list which would lessen the memory needed. What do you think of this? This might take longer but it wouldn't consume 980.3 MB(well hopefully).

I'm taking part in the open house at my school tomorrow though so I won't be able to do much until maybe Monday. Most probably, I'll have some new version by Tuesday.

Comment #137

Posted on Jan 13, 2008 by Massive Ox

That is what I was originally thinking but the balancing operation doesn't really guarantee that it will make the rope's depth less than MAX_DEPTH. This is once again, just something for handling worst case scenarios.

See comment just below.

As said before, the algorithm in the paper doesn't really make sure the rope depth doesn't exceed MAX_DEPTH so it can cause rope_balance to be recursively called. This is really something for a worst case scenario.

It does actually. See last paragraph of the fifth page of the paper.

And if you want to be sure that your code can handle (very) unlikely worst case, where the tree has more than 2^MAX_DEPTH nodes and all node are filled, then just add a check in _find_fib_slot that ensure that the value returned is never greater or equal to MAX_DEPTH.

Right now, I'm using an array of rope nodes because that was easier. However, I'm thinking of replacing it with a singly-linked stack type linked list which would lessen the memory needed. What do you think of this? This might take longer but it wouldn't consume 980.3 MB(well hopefully).

I am not sure if I understand what you are saying. Could you explain better how your idea would reduce memory usage?

I'm taking part in the open house at my school tomorrow though so I won't be able to do much until maybe Monday. Most probably, I'll have some new version by Tuesday.

Fine. Just to be sure I don't drag you along with new requirements, I putting a "pencil-down" deadline on Saturday, January 19th, 2008 at 3:00UTC. I will take the code written at this date and do a final evaluation. You will be welcome to continue working on your implementation (and I will gladly continue to help you). However, further work won't be considered part of the GHOP contest.

Comment #138

Posted on Jan 13, 2008 by Happy Bird

It does actually. See last paragraph of the fifth page of the paper.

I saw this paragraph but I am unsure how it makes sure the resulting rope has a depth less than the maximum depth. AFAICT, this paragraph only helps you determine the depth of the final rope but not how to make the depth less. Are you saying to fail on ropes larger than depth 32?

And if you want to be sure that your code can handle (very) unlikely worst case, where the tree has more than 2^MAX_DEPTH nodes and all node are filled, then just add a check in _find_fib_slot that ensure that the value returned is never greater or equal to MAX_DEPTH.

Once again, are you saying to fail on ropes greater than MAX_DEPTH?

I am not sure if I understand what you are saying. Could you explain better how your idea would reduce memory usage?

Yes. Right now, I malloc() a whole array which I fill with each node of the rope in left to right order. However, this eats up a lot of memory. In some cases, the nodes of the rope are the exact same. For example, if r is a rope and you keep on doing r=r+r, then all the nodes are going to be the same. It makes little sense to make a whole array just for putting the same exact nodes into. I was going to instead make a singly linked list which would function as a stack. Basically, instead of having a function _rope_get_iter_list, I would traverse the tree in the while loop and only get the next node when I needed it instead of getting all of them before and putting them into a massive array.

One more thing, I think literal merging takes up more memory than just not having literal merging. This is because in some cases, the nodes being merged are the same. What do you think? Should I just scrap literal merging altogether?

Comment #139

Posted on Jan 13, 2008 by Helpful Dog

Why does the array take so much space? In theory the array should never need to be longer than 44 items long because the 44th fibonacci number is greater than sys.maxint, the length of the longest string. Literal merging is definitely worth it, you need to keep in mind that in practice strings don't get constructed by saying "a" * 1000000 or by repeatedly concatenating the same string.

Comment #140

Posted on Jan 13, 2008 by Massive Ox

Comment 138 by iammisc, Today (5 hours ago)

Once again, are you saying to fail on ropes greater than MAX_DEPTH?

Yes. In fact, it should fail much earlier. The length of strings, in Python, can't be greater than PY_SSIZE_T_MAX (2^32 / 2 - 1, on the typical 32-bit machine). Therefore, the length of a rope should not exceed this value.

Yes. Right now, I malloc() a whole array which I fill with each node of the rope in left to right order. However, this eats up a lot of memory.

Why do you allocate the array dynamically -- i.e. with malloc()? Wouldn't it be easier (and faster) to use a static array?

In some cases, the nodes of the rope are the exact same. For example, if r is a rope and you keep on doing r=r+r, then all the nodes are going to be the same.

"r=r+r" is unlikely in a real program.

It makes little sense to make a whole array just for putting the same exact nodes into. I was going to instead make a singly linked list which would function as a stack. Basically, instead of having a function _rope_get_iter_list, I would traverse the tree in the while loop and only get the next node when I needed it instead of getting all of them before and putting them into a massive array.

Hm, a massive array? Why do you store the leaves, at all, in an array? You just need to do a left to right traversal of the tree and build up the balanced tree incrementally.

One more thing, I think literal merging takes up more memory than just not having literal merging. This is because in some cases, the nodes being merged are the same.

It should be quite the opposite. Literal merging should remove nodes, thus reduce the memory usage (only slightly though).

What do you think? Should I just scrap literal merging altogether?

No. I think you should improve your code instead. When done right, literal merging gives a nice speed boost too. Try it yourself. Run the short example, in my simple rope implementation, with literal merging:

% time python rope.py final length: 20000 final depth: 20 python rope.py 1.80s user 0.04s system 95% cpu 1.937 total

Then, change the min_leaf_length parameter of rebalance() to -1 to disable literal merging and try to run the example again:

alex:~% time python rope.py final length: 20000 final depth: 15 python rope.py 1279.52s user 19.93s system 91% cpu 23:46.99 total

It is a bit faster. :-)

Comment #141

Posted on Jan 14, 2008 by Happy Bird

Hm, a massive array? Why do you store the leaves, at all, in an array? You just need to do a left to right traversal of the tree and build up the balanced tree incrementally.

Which is what my new idea would do.

It should be quite the opposite. Literal merging should remove nodes, thus reduce the memory usage (only slightly though).

In the ways you've tested the implementation it would take more wouldn't it?

For example you ran

from ropes import Rope [20516 refs] a = Rope("a") [20518 refs] b = Rope("b") [20521 refs] for _ in range(100000): ... a += b

But in the current implementation, new nodes would be created for all the b nodes. So I would have new nodes with 128 'b's inside even though normally you would only need b. This is what I mean when I say literal merging wastes space.

Yes. In fact, it should fail much earlier. The length of strings, in Python, can't be greater than PY_SSIZE_T_MAX (2^32 / 2 - 1, on the typical 32-bit machine). Therefore, the length of a rope should not exceed this value.

So should I put an if statement in rope_concat that will throw an exception?

"r=r+r" is unlikely in a real program.

Other things such as the code you ran to test the implementation would suffer from the same problems, that I mentioned above.

I change MIN_LITERAL_LENGTH to a higher number and adopted a method of literal merging similar to your implementation. The longer literals really made a difference.

Also, check the new version, it runs your test code fine and without using much memory.

Comment #142

Posted on Jan 14, 2008 by Massive Ox

Hm, a massive array? Why do you store the leaves, at all, in an array? You just need to do a left to right traversal of the tree and build up the balanced tree incrementally.

Which is what my new idea would do.

But, you don't need a linked list for that, no? Just need a procedure similar to rope_str, which, instead of copying characters, insert the rope leaves into the forest.

But in the current implementation, new nodes would be created for all the b nodes. So I would have new nodes with 128 'b's inside even though normally you would only need b. This is what I mean when I say literal merging wastes space.

Actually, no. You forgot that there is two components to the size of a rope, the size of the literal and the node overhead (which is 40 bytes on a 32-bit machine; it's greater on a 64-bit machine). Therefore, without literal merging the total size of the rope, namely a, resulting from:

a = Rope('a') b = Rope('b') for _ in range(100000): a = a + b

is equal to:

= + = (2 * 1 byte) + (100000 nodes * 40 bytes per node) = 4000002 bytes

With literal merging the resulting rope is equal to:

= (128 * (100000 // 128) + 100000 % 128) + ((100000 // 128 + 1) * 40) = 100000 bytes + 31280 bytes = 131280 bytes

So should I put an if statement in rope_concat that will throw an exception?

I think rope_from_type would be a better place to put a such check (since rope_repeat need to be checked too).

Comment #143

Posted on Jan 14, 2008 by Massive Ox

Here is a simplistic prototype of how I think the balancing should be done. It surely has some bugs (I have neither tried, nor proved that it works). However, I think it will give you some guidelines on how to improve the balancing code of your implementation.

typedef RopeObject *Forest[MAX_DEPTH];

static void insert_in_forest(Forest forest, RopeObject *leaf) { int i; RopeObject *x;

i = _find_fib_slot(leaf->length);
assert(i < MAX_DEPTH);
for (;;) {
    if (forest[i] == 0) {
        forest[i] = leaf;
        break;
    }
    else {
        leaf = rope_concat(forest[i], leaf);
        forest[i] = 0;
        i = _find_fib_slot(leaf->length);
    }
}

}

static void _rope_balance(RopeObject self, Forest forest) { if (self->depth == 0) { insert_in_forest(forest, self); } else { / Concatenation node */ _rope_balance(self->v.concat.left, forest); _rope_balance(self->v.concat.right, forest); } }

static RopeObject * rope_balance(RopeObject *self) { int i; Forest forest; RopeObject *rope = NULL;

for (i = 0; i < MAX_DEPTH; i++) {
    forest[i] = 0;
}
_rope_balance(self, forest);
for (i = 0; i < MAX_DEPTH; i++) {
    if (rope == NULL)
        rope = forest[i];
    else
        rope = rope_concat(rope, forest[i]);
}
return rope;

}

Comment #144

Posted on Jan 14, 2008 by Helpful Dog

Hi Alexandre,

I think there are some problems with that code:

  • I think the static array should be a bit bigger, namely _find_fib_index(MAXINT) which is 44 on 32 bit and 90 on 64 bit machines.

  • the insert_in_forest is wrong, because it needs to consider the possibility that there are nodes somewhere in forest[:i], which need to be swept up

  • at the end of rope_balance there can be nodes in all of forest, which need to be concatenated to get the result

Comment #145

Posted on Jan 14, 2008 by Massive Ox

I think there are some problems with that code:

Ah, I knew it. :-)

  • I think the static array should be a bit bigger, namely _find_fib_index(MAXINT) which is 44 on 32 bit and 90 on 64 bit machines.

Wouldn't it be simpler to cap the return value of _find_fib_index so that it can't be greater or equal to MAX_DEPTH (and perhaps increase the value of MAX_DEPTH)? I don't think strings greater than, let's say, 100MB are very common.

  • the insert_in_forest is wrong, because it needs to consider the possibility that there are nodes somewhere in forest[:i], which need to be swept up

Oh, that's true! Initially, I thought that wasn't nescessary, but now I see that it is (since the string could be reconstructed in disorder if that is not done). Thanks for pointing this out.

  • at the end of rope_balance there can be nodes in all of forest, which need to be concatenated to get the result

Hm. I am not sure what you mean there.

Comment #146

Posted on Jan 15, 2008 by Happy Bird

I did not use Alexandre's code though so you shouldn't see those bugs in my implementation. Anyway, I've basically made my implementation recursive like you said. I've also made work_list a static array of MAX_DEPTH nodes. Something you may want to know is that the state of the balancing operation is kept on a new struct RopeBalanceState which really made making the function recursive much easier. Personally, I think Alexandre's idea of making the function recursive is much better. I think it goes faster now(I haven't really timed it) and it works better. I've also removed all the warnings(Finally). I've committed the new version so you might want to take a look.

Tell me if you find any bugs.

Comment #147

Posted on Jan 15, 2008 by Massive Ox

I see that you opted for an approach closer to the PyPy's implementation. That is fine, although, I am not sure if it is the most elegant one in C.

Here's a few things that struck me as odd:

In rope_balance:

state.a = state.b = INT_MAX;

Don't use INT_MAX for that, use PY_SSIZE_T_MAX instead.

In _rope_balance:

if(cur->type==LITERAL_NODE && cur->length < MIN_LITERAL_LENGTH &&

state->string_length < MIN_LITERAL_LENGTH) { if(!state->string) { state->string = PyMem_Malloc(cur->length); memcpy(state->string, cur->v.literal, cur->length); state->string_length = cur->length; } else { state->string = PyMem_Realloc(state->string, state->string_length + cur->length); memcpy(state->string + state->string_length, cur->v.literal, cur->length); state->string_length += cur->length; } return; }

I think you use a static buffer, instead of using malloc/realloc. This will change the semantic for the literal merging algorithm a bit. However, that will avoid many calls to malloc/realloc, and (maybe) speed up the balancing code.

In rope_balance:

_rope_balance(r, &state, 1);
if(state.string) {
    cur = rope_from_type(LITERAL_NODE, state.string_length);
    cur->v.literal = state.string;

    _rope_balance(cur, &state, 0);
}

Why do you call _rope_balance again?

Finally, you need to check for overflow more throughly:

from ropes import Rope [20523 refs]

r = Rope('1234567890') [20525 refs] s = Rope('1234567890') [20527 refs] for _ in range(200): ... r += s ... s += r ...

Program received signal SIGSEGV, Segmentation fault. [Switching to Thread -1209956160 (LWP 8053)] 0xb7c82ca6 in rope_concat_unchecked (self=0x5704e7, other=0x8b00a9c) at src/ropes.c:358 358 Py_INCREF(self);

On the good side, memory usage usage is much better. Also the balancing operation seems to be faster. Good work!

Comment #148

Posted on Jan 15, 2008 by Happy Bird

I think you use a static buffer, instead of using malloc/realloc. This will change the semantic for the literal merging algorithm a bit. However, that will avoid many calls to malloc/realloc, and (maybe) speed up the balancing code.

It definitely would speed up the code but if all the nodes point to a static buffer on the stack, their contents will be overwritten if you change one of them or if a function decides to use the stack(which is every function).

A way to maybe speed up the code is to allocate the full 1024 bytes and then realloc it if it isn't using the full buffer.

In rope_balance:

_rope_balance(r, &state, 1); if(state.string) { cur = rope_from_type(LITERAL_NODE, state.string_length); cur->v.literal = state.string;

  _rope_balance(cur, &state, 0);

}

Why do you call _rope_balance again?

Because it doesn't work if you take it out? :P No really, it's because the merged string might be incomplete so we have to add it to the list.

Finally, you need to check for overflow more throughly:

Yeah I haven't done any checking in the concatenation functions.

In rope_balance:

state.a = state.b = INT_MAX;

Don't use INT_MAX for that, use PY_SSIZE_T_MAX instead.

I'll fix it immediately.

On the good side, memory usage usage is much better. Also the balancing operation seems to be faster. Good work!

Thanks.

Comment #149

Posted on Jan 15, 2008 by Happy Bird

The new version checks for overflow in rope_balance.

Comment #150

Posted on Jan 18, 2008 by Happy Bird

Avassalotti, have you checked the code yet?

Comment #151

Posted on Jan 18, 2008 by Massive Ox

Not yet (I just arrived from short trip). But, I am looking at it now.

Comment #152

Posted on Jan 18, 2008 by Massive Ox

It definitely would speed up the code but if all the nodes point to a static buffer on the stack, their contents will be overwritten if you change one of them or if a function decides to use the stack(which is every function).

I think you misunderstood what I suggested, but that isn't important.

A way to maybe speed up the code is to allocate the full 1024 bytes and then realloc it if it isn't using the full buffer

Yes, that would also work.

The new version checks for overflow in rope_balance.

Looks fine (I haven't tested it, though). But, what about the overflow case I mentioned in comment 112?

Question: Why the REPEAT_NODE case in rope_slice is so complex? Couldn't you do something like the following (in pseudo-code)?

def slice(repeat(child, count), start, end): start = slice(child, start % len(child)) mid = repeat(child, (end - start) // len(child)) end = slice(child, end % len(child)) return start + mid + end

One more thing, do you plan to use rope_get_iter_list_count and _rope_get_iter_list again?

Comment #153

Posted on Jan 18, 2008 by Massive Ox

Whoops, there is a mistake in my last example.

def slice(repeat(child, count), start, end): left = slice(child, start % len(child)) mid = repeat(child, (end - start) // len(child)) right = slice(child, end % len(child)) return left + mid + right

Comment #154

Posted on Jan 18, 2008 by Happy Bird

One more thing, do you plan to use rope_get_iter_list_count and _rope_get_iter_list again?

Yes RopeIter uses them.

Question: Why the REPEAT_NODE case in rope_slice is so complex? Couldn't you do something like the following (in pseudo-code)?

That is basically what the current code does. Your pseudo code does not take into account the fact that we might not need a mid node at all.

I've committed a new version which handles overflows better and has some other minor changes.

Comment #155

Posted on Jan 19, 2008 by Massive Ox

One more thing, do you plan to use rope_get_iter_list_count and _rope_get_iter_list again?

Yes RopeIter uses them.

Ah, I see. Could you combine the two procedures so that you only do a single traversal? The prototype would become something like this:

int _rope_get_iter_list(RopeObject **node_list, Py_ssize_t *count, RopeObject *node)

That is basically what the current code does.

Well, except that you take a full page of not-so-easy-to-understand code to do it. So, could you try to refactor this part, please?

Your pseudo code does not take into account the fact that we might not need a mid node at all.

This kind of optimization should be done in rope_concat.

I've committed a new version which handles overflows better and has some other minor changes.

Unfortunately, your code is relying on the signed overflow behavior, which is undefined in C. The problem is recent versions of GCC might optimize out your overflow checks. (See http://bugs.python.org/issue1621 for more detail). I suspect that this will be hard to fix everywhere though. So, you might want to leave as is for now.

Comment #156

Posted on Jan 19, 2008 by Happy Bird

Ah, I see. Could you combine the two procedures so that you only do a single traversal? The prototype would become something like this:

Ok.

Well, except that you take a full page of not-so-easy-to-understand code to do it.
So, could you try to refactor this part, please?

Alright I'll try to make it smaller. I don't understand your pseudo-code though, your calls to rope_slice only have one argument, shouldn't there be both a start and end argument?

Comment #157

Posted on Jan 19, 2008 by Happy Bird

Uh oh, One thing avassalotti about the _rope_get_iter_list* functions.

The reason they are split into two is because I need to allocate a buffer for the node_list before. Do you want me to realloc this list?

Comment #158

Posted on Jan 19, 2008 by Massive Ox

Alright I'll try to make it smaller. I don't understand your pseudo-code though, your calls to rope_slice only have one argument, shouldn't there be both a start and end argument?

Oh, that is another bug. Here's the corrected version:

def slice(repeat(child, count), start, end): left = slice(child, 0, start % len(child)) mid = repeat(child, (end - start) // len(child)) right = slice(child, 0, end % len(child)) return left + mid + right

The reason they are split into two is because I need to allocate a buffer for the node_list before. Do you want me to realloc this list?

Hm. Why not?

Comment #159

Posted on Jan 20, 2008 by Happy Bird

Oh, that is another bug. Here's the corrected version:

Yeah but that still doesn't work because left might start on an offset other than 0.

Hm. Why not?

Why not what?

Oh I forgot to mention in the last comment, but I had already committed a new version to the repo.

Comment #160

Posted on Jan 21, 2008 by Massive Ox

Yeah but that still doesn't work because left might start on an offset other than 0.

Ah, that was a typo. I meant "left = slice(child, start % len(child), len(child))". However, the algorithm is still far from being correct. I posted it just to give you an idea of how I would do it -- i.e., it wasn't intended to be implemented directly.

I found a few more bugs, while testing your code:

Rope("") * 2 Traceback (most recent call last): ... OverflowError: The string is too large! r = Rope("abcde") * int(2**31-1) r = Rope("abcdef") * int(2**31-1) Traceback (most recent call last): ... OverflowError: The rope is too long! r = Rope("abcdefg") * int(2**31-1) r = Rope("") r[:] Traceback (most recent call last): ... ValueError: No sane value to slice! Rope('c') in Rope('abc') True Fatal Python error: ../Objects/listobject.c:532 object at 0xb7ce8a38 has negative ref count -606348326

Anyway, here is my final evaluation for this task. You completed most of the requirements for this task:

concatenation [done] indexing/slicing [done] repetition [done] hashing [done] iterator [done] balancing [done] comparison [done] substring search [almost complete] search/replace [not started] automated tests [incomplete]

I can't say the substring search is complete due to the above bug. The automated tests are incomplete. gcov reports:

Lines executed:36.03% of 569 Branches executed:28.23% of 581 Taken at least once:16.01% of 581 Calls executed:25.63% of 238

There is plenty of test cases, that were shown during the past discussions, that could be included in the automated tests. Finally, your code still don't conform to PEP-7 (I know that you think your C style is the best, just as everyone else. )

Overall, you did some good work. I hope you enjoyed working on this project (I did). Also, I remarked that your programming skills improved along the way, which is great. Hopefully, I will see you again around Python core development.

I am still available if you have any question -- just mail at alexandre@peadrop.com.

    Good luck with your projects,
         Alexandre

Comment #161

Posted on Jan 26, 2008 by Happy Bird

Sorry for the late reply.

Thanks for all the help you gave me throughout the whole project. I think I will definitely continue this project. I really enjoyed this project.

Hopefully, I will see you again around Python core development.

How would I get involved in this?

Comment #162

Posted on Jan 27, 2008 by Massive Ox

Thanks for all the help you gave me throughout the whole project. I think I will definitely continue this project. I really enjoyed this project.

Great to hear!

How would I get involved in this?

First, you should read http://www.python.org/dev/intro/ if you want to learn about Python development process. There is also a developer FAQ http://www.python.org/dev/faq/, but that one is mostly a svn FAQ, at this point.

Next, subscribe to python-dev (and python-3000 if you plan to help on Py3k) to get the latest news about Python development.

Finally, start helping out. :-) Bug fixes and improvement to the unit tests are always welcome and usually easy. The list of easy tasks, in the bug tracker, is a good place to start too:

http://bugs.python.org/issue?@columns=title,id,activity,type,components,versions,severity,status&@sort=-activity&@group=priority&@filter=keywords,status&@pagesize=50&@startwith=0&keywords=6&status=1

At some point if you contribute on a regular basis, you should introduce yourself to python-dev and ask for commit privileges.

Comment #163

Posted on Jan 27, 2008 by Massive Ox

Oh, the long URL to the bug tracker got mangled, so here's a short version:

http://tinyurl.com/354fss

Status: Completed

Labels:
C hard pypy Due-20080119.0300 ClaimedBy-iammisc