"Andrew C. Greenberg" werdna@gate.net wrote:
I am not certain I agree that mutable strings has the advantage of efficient use of memory. With mutable strings, the substring and slice operations require a copy. With immutable strings, it might suffice to generate a Decorator with offset and size data. Of course, much depends upon the particular applications, as keeping a pointer to a large string can be inefficient if all that needs to remain is the slice.
How are these operations implemented in NextStep?
By the way, some object memory's have a slice object, along with a garbage collector that knows how to deal with them.
Lex
About a year ago I implemented immutable strings in Java. But, I hear you say, immutable strings are part of the standard Java library. Yes, but not like this.
Cedar Mesa work at PARC included a data structure called a Rope (an "industrial strength" String). A Rope is an immutable string structured as a tree. The leaves are "flat strings", simple arrays of characters. The interior nodes are
(1) concat nodes, which represent a rope that is the concatenation of the left and right child ropes; or
(2) substring nodes, which represent a substring of a child rope, indicated by starting offset and length; or
(3) functional nodes, which compute the characters of their contents by executing code. For example, a functional node might represent a rope as a file name, length and offset within that file.
In an object-oriented language, (3) would of course be replaced by "any object that implements the core methods that define ropefullness".
Ropes appeared again in PARC's Portable Common Runtime (under the name "cords", because they were better than strings, but not as tough as ropes), as a sample application for Hans Boehm's garbage collector, and as part of SGI's C++ standard template library. All I did was re-implement Ropes in Java.
Ropes have a number of nice properties.
(1) They are immutable. We have already talked about some of the advantages of this, primarily the fact that immutable strings can be shared freely without the need to protect oneself by making a copy.
(2) Concatenation, Substring, and fetching the i th character can all be implemented in time proportional to the log of the size of the argument. In practice, we delay re-balancing in order to make concatenation take unit time, because concatenation is such an important operation.
(3) Smalltalk style iterators (do:, collect:, etc.) take time proportional to the length of the rope. Loops of the form
for i := 1 to length(str) do { code involving str[i] .... }
might take time n log n (where n is the length of the string), unless we do some careful caching. It is hard to get C and Java programmers to stop writing code like this. Fortunately, in Smalltalk this should not be a problem.
(4) String operations scale to long strings, with good performance.
(5) Strings stored in files and strings available over the network, for example, can be treated in exactly the same way as strings in main memory. Most of the string manipulation functions can be written once in an abstract superclass and reused.
(6) In Java, all strings are Unicode strings, but there is really no need to allocate the same number of bytes for all characters in a String. A long ascii string with a few 2-byte characters in the middle could be represented efficiently as the concatenation of 1-byte per character flat strings and a 2-byte character flat string.
The main advantage of immutablity is that once the programmer is _assured_ that a string is immutable, she is freed from concerns about whether it is necessary to synchronize updates, make copies, update caches, etc. If mutable strings exist, it of frequently necessary to copy them, just to guard against the possibility that they might change in the future.
One possibility would be to have a subclass "InitiallyMutableString" which can be destructively updated, and then "frozen". Once an InitiallyMutableString is frozen, it would become a String (immutable) and the update operations would fail. This could be implemented with becomes: or by setting a bit in the string.
I had been hoping to find time to translate my Java implementation of Ropes into Smalltalk. There are several "gotchas" in Java that prevent them working as one would like. In particular, since String is a class, not an interface, it is in general impossible to replace String by something better, without recompiling the whole world. It is not even possible to subclass String to introduce other variants, because String is a _final_ class. And then there is the problem with numeric indexing loops mentioned above.
In Smalltalk these problems don't arise. I had really looked forward, for example, to using file strings in the directory browser, instead of the current hack of reading only the first 5k of the file.
I have a student working on the Java implementation right now. My hope is to have her run some comparative performance measures on my version (cord.String) and java.lang.String, and see if it makes sense to lobby the Java world for a wholesale replacement. Once these measurements are done, I think that I could offer to do a Smalltalk port. Most of the code is straightforward, with the exception of tree balancing. However, there is some performance tuning that needs to be done on a per-platform basis. For example, concatenating "ab" and "cd" is most efficiently done by copying them into a new four character string, but concatenating a two 1k strings is best done by allocating a concat node. Up to what length should one copy? The answer depends of the relative costs of copying, and allocation, as well as the pattern of usage; it needs to be tuned on each system.
If someone thinks that "Smalltalk Ropes" are the right way forward, and can't wait to get started, I would be happy to provide them with the Java version.
Andrew
References: On the SGI C++ implementation http://www.sgi.com/Technology/STL/Rope.html and http://reality.sgi.com/boehm/ropeimpl.html.
Background and Performance data: Boehm, Atkinson, and Plass, "Ropes: An Alternative to Strings", Software-Practice and Experience 25, 12 (Dec 1995), pp. 1315-1330.
Ropes sound like a great solution for various text editing problems in Squeak. I would love to see them working.
It's the sort of idea that makes you think "I've thought of something like that before, except for *that* bit... cool".
About the performance tuning - another approach would be to do an actual performance test at runtime (first time a string is created after the image is opened...) to determine the relative costs of these operations in the particular configuration.
Daniel
Ropes sound like a great solution for various text editing problems in Squeak. I would love to see them working.
I've had one volunteer to take a look at my Java version and make a stab at the translation. No promises, but it may happen ...
Andrew
squeak-dev@lists.squeakfoundation.org