Friday, May 18, 2012
If strings are immutable in .NET, then why does Substring take O(n) time?
Given that strings are immutable in .NET, I'm wondering why they have been designed such that string.Substring() takes O(substring.Length) time, instead of O(1)?
UPDATE: I liked this question so much, I just blogged it. See http://blogs.msdn.com/b/ericlippert/archive/2011/07/19/strings-immutability-and-persistence.aspx
The short answer is: O(n) is O(1) if n does not grow large. Most people extract tiny substrings from tiny strings, so how the complexity grows asymptotically is completely irrelevant.
The long answer is:
An immutable data structure built such that operations on an instance permit re-use of the memory of the original with only a small amount (typically O(1) or O(lg n)) of copying or new allocation is called a "persistent" immutable data structure. Strings in .NET are immutable; your question is essentially "why are they not persistent"?
Because when you look at operations that are typically done on strings in .NET programs, it is in every relevant way hardly worse at all to simply make an entirely new string. The expense and difficulty of building a complex persistent data structure doesn't pay for itself.
People typically use "substring" to extract a short string -- say, ten or twenty characters -- out of a somewhat longer string -- maybe a couple hundred characters. You have a line of text in a comma-separated file and you want to extract the third field, which is a last name. The line will be maybe a couple hundred characters long, the name will be a couple dozen. String allocation and memory copying of fifty bytes is astonishingly fast on modern hardware. That making a new data structure that consists of a pointer to the middle of an existing string plus a length is also astonishingly fast is irrelevant; "fast enough" is by definition fast enough.
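For instance, the comma-separated scenario reads something like this in C# (the sample line and field positions are invented for illustration):

```csharp
// A typical line-of-business use of Substring: pull one short field
// out of a moderately sized line of text.
string line = "42,John,Smith,555-0123,Seattle";

int start = 0;
for (int field = 0; field < 2; field++)      // skip the first two fields
    start = line.IndexOf(',', start) + 1;
int end = line.IndexOf(',', start);

string lastName = line.Substring(start, end - start); // copies ~5 chars
Console.WriteLine(lastName); // Smith
```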
The substrings extracted are typically small in size and short in lifetime; the garbage collector is going to reclaim them soon, and they didn't take up much room on the heap in the first place. So using a persistent strategy that encourages reuse of most of the memory is also not a win; all you've done is made your garbage collector get slower because now it has to worry about handling interior pointers.
If the substring operations people typically did on strings were completely different, then it would make sense to go with a persistent approach. If people typically had million-character strings, and were extracting thousands of overlapping substrings with sizes in the hundred-thousand-character range, and those substrings lived a long time on the heap, then it would make perfect sense to go with a persistent substring approach; it would be wasteful and foolish not to. But most line-of-business programmers do not do anything even vaguely like those sorts of things. .NET is not a platform that is tailored for the needs of the Human Genome Project; DNA analysis programmers have to solve problems with those string usage characteristics every day, but odds are good that you do not. The few who do can build their own persistent data structures that closely match their usage scenarios.
For example, my team writes programs that do on-the-fly analysis of C# and VB code as you type it. Some of those code files are enormous and thus we cannot be doing O(n) string manipulation to extract substrings or insert or delete characters. We have built a bunch of persistent immutable data structures for representing edits to a text buffer that permit us to quickly and efficiently re-use the bulk of the existing string data and the existing lexical and syntactic analyses upon a typical edit. This was a hard problem to solve and its solution was narrowly tailored to the specific domain of C# and VB code editing. It would be unrealistic to expect the built-in string type to solve this problem for us.
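To make the "persistent" idea at the top of this answer concrete, here is a minimal sketch of what a slice-based substring type could look like. The StringSlice name and design are hypothetical; nothing like it ships in the framework:

```csharp
// Hypothetical "persistent" substring: instead of copying characters,
// it keeps a reference to the original string plus an offset and a
// length. Taking a substring of a slice is O(1).
public readonly struct StringSlice
{
    private readonly string _source; // the original, immutable string
    private readonly int _start;
    public int Length { get; }

    public StringSlice(string source, int start, int length)
    {
        _source = source;
        _start = start;
        Length = length;
    }

    // O(1): no characters are copied, only bookkeeping is adjusted.
    public StringSlice Substring(int start, int length) =>
        new StringSlice(_source, _start + start, length);

    public char this[int index] => _source[_start + index];

    // Materializing the text is still O(n), of course.
    public override string ToString() => _source.Substring(_start, Length);
}
```

Note that a slice keeps the entire original string reachable, which is exactly the garbage-collection drawback raised in the answers below.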
Precisely because Strings are immutable, .Substring must make a copy of at least a portion of the original string. Making a copy of n bytes should take O(n) time.
How do you think you would copy a bunch of bytes in constant time?
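Concretely, a copying substring has to do roughly the following. This is a simplified sketch, not the actual BCL implementation (which uses internal fast-allocation and block-copy paths):

```csharp
// Simplified sketch of what a copying Substring must do. The loop
// makes the O(n) cost explicit: every character of the result has to
// be touched once.
static string NaiveSubstring(string source, int start, int length)
{
    var buffer = new char[length];    // O(length) allocation
    for (int i = 0; i < length; i++)  // O(length) copy
        buffer[i] = source[start + i];
    return new string(buffer);        // copies the buffer again, in practice
}
```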
EDIT: Mehrdad suggests not copying the string at all, but keeping a reference to a piece of it.
Consider, in .NET, a multi-megabyte string on which someone calls .Substring(n, 4) (for some n in the middle of the string) to extract four characters.
Now the ENTIRE string cannot be garbage collected just because one reference is holding on to 4 characters? That seems like a ridiculous waste of space.
Further, tracking references to substrings (which may even be inside substrings), and trying to copy at optimal times to avoid defeating the GC (as described above), makes the concept a nightmare. It is far simpler, and more reliable, to copy on .SubString, and maintain the straightforward immutable model.
Java (as opposed to .NET) provides two ways of doing substring(): you can choose whether to keep just a reference into the original or to copy the substring to a new memory location.
The plain .substring(...) shares the char array used internally by the original String object; you can then copy it to a new array with new String(...) if needed (to avoid hindering garbage collection of the original).
I think this kind of flexibility is the best option for a developer.
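(Java itself later reversed course: since JDK 7u6, substring() copies as well.) Much later than this discussion, .NET exposed a comparable choice through ReadOnlyMemory<char> and ReadOnlySpan<char>. A minimal sketch, assuming a recent .NET runtime:

```csharp
string line = "42,John,Smith,555-0123,Seattle";

// O(1), no copy: a view into the original string's characters.
// Like Java's old substring, it keeps the whole original alive.
ReadOnlyMemory<char> view = line.AsMemory(8, 5);

// O(n): an independent copy; the original can be collected.
string copy = line.Substring(8, 5);

Console.WriteLine(view.Span.ToString()); // Smith
Console.WriteLine(copy);                 // Smith
```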
Keeping sub-strings as references into the original would have some obvious benefits, but you asked for the drawbacks:
- A reference to a (small) substring would block the entire string for GC.
- The GC would have to be able to find the mother string from the substring, which is potentially expensive.
- The start of the reference/pointer to the chars would not always be word-aligned.
- The current 'hack' of strings having a Length and a trailing '\0' would have to be abandoned.
- The savings would be rare.
Despite what Eric may claim, O(n) is always O(n). Big O is a statement about asymptotic complexity, that is, the relative growth of two functions. The very definition of O is:
f(x) = O(g(x)) iff there exist constants c and k such that f(x) < c·|g(x)| for all x > k.
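As a quick worked instance of that definition (my own illustration, not part of the original discussion), take the cost of copying an n-character substring:

```latex
% Cost model for copying an n-character substring:
%   a = per-character copy cost, b = fixed allocation overhead.
f(n) = a\,n + b
% Choose c = a + b + 1 and k = 0. Then for all integers n > k:
f(n) = a\,n + b \le (a + b)\,n < (a + b + 1)\,n = c\,|g(n)|,
\qquad g(n) = n,
% so f(n) = O(n). But when n is bounded (typical substrings are a
% couple dozen characters), f(n) is bounded by a constant: the
% "O(n) is O(1) if n does not grow large" point above.
```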
Many programmers conflate big O with algorithm performance, but they are very different things.
The most obvious way to get an O(1) substring is to have a separate concrete class for the type returned by Substring, and to program against a string interface (say, IString) instead of concrete types. This is a lot of extra complexity for minimal benefit; see Eric's answer. People generally expect strings to be simple concrete types.
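As a sketch of that idea (the IString name comes from this answer; the member set is my guess at a minimal surface), a view implementation behind this interface would look much like the StringSlice sketch near the top:

```csharp
// Hypothetical interface-based design: Substring returns an O(1) view
// rather than a copy; callers program against IString, not string.
public interface IString
{
    int Length { get; }
    char this[int index] { get; }
    IString Substring(int start, int length); // O(1) when returning a view
}
```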