Saturday, October 8, 2011

How SubString method works in Java - Memory Leak Fixed in JDK 1.7

Substring method from String class is one of most used method in Java, and it's also part of an interesting String interview question  e.g. How substring works in Java or sometime asked as how does substring creates memory leak in Java. In order to answer these questions, you knowledge of  implementation details is required. Recently one of my friend was drilled on substring method in Java during a Java interview, he was using substring() method from long time, and of course all of us has used this, but what surprises him was interviewer's obsession on Java substring, and deep dive till the implementation level. Though String is a special class in Java, and subject of many interview questions e.g. Why char array is better than String for storing password . In this case it was, substring method, which took center stage. Most of us rather just use substring(..), and than forgot. Not every Java programmer go into code, and see how exactly it's working. To get a feel of how his interview was let's start .

Update:  This issue was actually a bug http://bugs.sun.com/view_bug.do?bug_id=6294060,  which is fixed in substring implementation of Java 7. Now, Instead of sharing original character array, substring method creates a copy of it. In short, substring method only retains as much data, as it needed. Thanks to Yves Gillet for pointing this. As some of my readers pointed out, java.lang.String class has also grown into some change in Java 1.7 version and offset and count variable which is used to track positions are removed from String. This may save some bytes with each String instance, but not sharing original array makes substring perform linearly, as compared to constant time previously. Anyway, it's worth to remove any string related memory leak in Java. Having said that, if you have not yet upgraded your Server to Java 7 and still working on Java 1.6 updates, this is one thing, which is worth knowing.
 
Question starts with normal chit chat, and Interviewer ask,  "Have you used substring method in Java", and my friend proudly said Yes, lot many times, which brings a smile on interviewer's face. He says well, that’s good. Next question was Can you explain what does substring do? My friend got an opportunity to show off his talent, and how much he knows about Java API;  He said substring method is used to get parts of String in Java. It’s defined in java.lang.String class, and it's an overloaded method. One version of substring method takes just beginIndex, and returns part of String started from beginIndex till end, while other takes two parameters, beginIndex and endIndex, and returns part  of String starting from beginIndex to endIndex-1. He also stressed that every time you call  substring() method in Java,  it will return a new String because String is immutable in Java.

substring in Java , java substring method
Next question was, what will happen if beginIndex is equal to length in substring(int beginIndex), no it won't throw IndexOutOfBoundException instead it will return empty String. Same is the case when beginIndex and endIndex is equal, in case of second method. It will only throw StringIndexBoundException when beginIndex is negative, larger than endIndex or larger than length of String.

So far so good, my friend was happy and interview seems going good, until Interviewee asked him,  Do you know how substring works in Java? Most of Java developers fail here, because they don't know how exactly substring method works, until they have not seen the code of java.lang.String. If you look substring method inside String class, you will figure out that it calls String (int offset, int count, char value []) constructor to create new String object. What is interesting here is, value[], which is the same character array used to represent original string. So what's wrong with this?

In case If you have still not figured it out, If the original string is very long, and has array of size 1GB, no matter how small a substring is, it will hold 1GB array.  This will also stop original string to be garbage collected, in case if doesn't have any live reference. This is clear case of memory leak in Java, where memory is retained even if it's not required. That's how substring method creates memory leak.

How SubString in Java works

Obviously next question from interviewer would be,  how do you deal with this problem? Though you can not go, and change Java substring method, you can still make some work around, in case you are creating substring of significant longer String. Simple solution is to trim the string, and keep size of character array according to length of substring. Luckily java.lang.String has constructor to do this, as shown in below example.

// comma separated stock symbols from NYSE
String listOfStockSymbolsOnNYSE = getStockSymbolsForNYSE(); 

//calling String(string) constructor
String apple = new String( 
               listOfStockSymbolsOnNYSE.substring(appleStartIndex, appleEndIndex)
               );



If you look code on java.lang.String class, you will see that this constructor trim the array, if it’s bigger than String itself.

public String(String original) {
        ...

        if (originalValue.length > size) {
            // The array representing the String is bigger than the new
            // String itself.  Perhaps this constructor is being called
            // in order to trim the baggage, so make a copy of the array.
            int off = original.offset;
            v = Arrays.copyOfRange(originalValue, off, off+size);

        } else {

            // The array representing the String is the same
            // size as the String, so no point in making a copy.
            v = originalValue;
        }
    ...
 }
Another way to solve this problem is to call intern() method on substring, which will than fetch an existing string from pool or add it if necessary. Since the String in the pool is a real string it only take space as much it requires. It’s also worth noting that sub-strings are not internalized, when you call intern() method on original String. Most developer successfully answers first three questions, which is related to usage of substring, but they get stuck on last two, How substring creates memory leak or How substring works. It's not completely there fault, because what you know is that every time substring() returns new String which is not exactly true, since it’s backed by same character array.

This was the only interview question, which bothers my friend little otherwise, its standard service level company  Java interview in India. By the way, he got the call a day after ,even though he struggled little bit on How SubString method works in Java, and that was the reason he shared this interview experience with me.



Related Java tutorials

21 comments :

Anonymous said...

In the substring method of String class, the code is as follows:

return ((beginIndex == 0) && (endIndex == count)) ? this : new String(offset + beginIndex, endIndex - beginIndex, value);


The code is returning a new String object which contains the same backing array. As the array is an object reference, the two string objects will be pointing to the same backing array. No other instance of array is created. So irrespective of number of substrings created from that string, memory will be used for only one character array. The objects will be different but the backing array will be only one.

Anonymous said...

I understand how substring works in Java but what is point of keeping original character array inside substring object ? Since string is immutable you can not change a String once created and SubString are String.

Anonymous said...

Hi,

I read your article and i cant able to understand completely. Why the empty string is returned, when begin index and start index is same, i checked Java api i didn't get clear idea. Why they are doing like this. So i started searching in internet. I found this post.

http://www.coderanch.com/t/564948/java/java/does-substring-method-works

Can you give more explanation on this one?

Apoorv said...

Regarding the first comment related to "return ((beginIndex == 0) && (endIndex == count)) ? this : new String(offset + beginIndex, endIndex - beginIndex, value)". This is correct observation, and in the constructor used while creating a new string in above code of substring, only required array contents are copied and not the entire array. So the assumption that creating a small substring of a 1GB sized String will occupy 1GB space is wrong.

Anonymous said...

@Apoorv: No array contents are copied. Several Strings can point at the same character array with different lengths and offsets since this array will never change (which is the whole point of the String class being immutable). What the new String will contain is a pointer to the 1 GB array with a different length and offset.

What I assume the interviewer was aiming towards was what happens when the original String gets garbaged collected and there is only your tiny substring looking at 6 characters of this (unbelievably) large array.

I would hope that there is some internal mechanism that handles this in the JVM (there must be)! If I was asked this question I'd focus on the massive design flaw of a system that relies on substring on a character array of over a billion elements.

papajohn said...

I was asked exactly this question in a Google telephone interview. Of course, I failed the last part which assumed that I knew how substring is implemented internally.

When my reviewer mentioned this implementation, I told him that this is definitely implementation-dependant and that other JVMs should behave differently.

In the end, I passed that interview.

Dave Conrad said...

The String(String) constructor has changed between Java 6, 7, and (the early access version of) 8. That bit about stripping out the baggage is no longer in there.

What you need to do, if your small sub-string of a large string is in variable "name", is:

name = new String(name.toCharArray());

Anonymous said...

What I don't understand is how this 1G array is backed up by calling substring. The substring will call new String(value, beginIndex, subLen) and inside this constructor the following code exists: this.value = Arrays.copyOfRange(value, offset, offset+count). So our five characters are copied into a new array and the original 1G is not backed up.

Yves Gillet said...

Yea beware the implementation changed since java 7 at least, the substring no longer backs up the original char[] array, it creates a new copy of it. Please reflect this in you article.

Yves Gillet said...

what you are referring to the substring problem was in fact a bug:
http://bugs.sun.com/view_bug.do?bug_id=6294060

I hope interviewers are at least aware that some of their questions are not valid anymore :)

Javin @ Java Classloder Working said...

@Yves Gillet, you are correct mate, It seems, substring method is now free from memory leak. Will update this post. Thanks

Anonymous said...

Is it really true? "no it won't throw Index OutOfBoundException instead it will return empty String, same is the case when beginIndex and endIndex is same in case of second method."

Source - http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#substring(int)
"
substring

public String substring(int beginIndex)
Returns a new string that is a substring of this string. The substring begins with the character at the specified index and extends to the end of this string.
Examples:

"unhappy".substring(2) returns "happy"
"Harbison".substring(3) returns "bison"
"emptiness".substring(9) returns "" (an empty string)

Parameters:
beginIndex - the beginning index, inclusive.
Returns:
the specified substring.
Throws:
IndexOutOfBoundsException - if beginIndex is negative or larger than the length of this String object."

Javin @ BlockingQueue in Java said...

@Anonymous, What is your questions? isn't you answering your own question, if I understood correctly, in "emptiness".substring(9) returns "" (an empty string)

Anonymous said...

how to search for a given pattern in a string without using regex?
e.g.
search "aab" total no of occurances and the index at which this pattern in string "aabcbbacbaabaaaabbabaab"?

Anonymous said...

@javin, can you please move the update note at the top of the article? I fear that readers who will only scan this article will be misinformed.

- Michael

George said...

Indeed substring in Java will not create any more memory leak from Java Java 1.7.0_06, issue has been fixed by not allowing sharing of char array. They have even made some changes in java.lang.String class to remove count and offset int variable, which can further save some memory, but this comes with a cost, by compromising speed. Earlier substring was O(1) operation because of sharing same char array, but now it's linear. I still think it's a good move to avoid memory leak, given excessive use of String and SubString method in Java code.

Anonymous said...

Hey Javin,

I've been reading your articles and although they are very helpful. This article does have much content on the topic of sub-string and whatever it does have is quite confusing. Which actually goes for a lot of your articles.

On reading this i was under the impression that a brand new string gets created when substring method is called and because of memory leak, apart from the original string holding 1 gb of data, the substring too holds the same size of 1 gb in memory which is wrong. When substring is called no new string that gets created and there is only a reference to the old string.

Please explain the concepts more extensively and please don't take this in any bad way.

Thanks

Soumitra Pathak said...

Hi,

while debugging we can see that even after substring it had all the content in the backing array.
so if we don't want to keep the same we can use

String new2 = new String(new1.toCharArray());

this will copy the actual array to the backing array.
I am not sure about what will happen to String 'new1' with data 1 GB.

Thanks,
S

Anonymous said...

SubString will always create a new object, but in 6 it was still referring to the 1gb array, even after performing the sub string. In 7 we create a copy of new char array to hold the substring and leave the huge array for garbage collection if not needed.

Anonymous said...

U have mentioned above that " This will also stop original string to be garbage collected, in case if doesn't have any live reference.". can you please elaborate it more. I didn't understood it.

Anonymous said...

Hey guys, this is Uma. Think this is not quite correct: -.what will happen if beginIndex is equal to length in substring(int beginIndex),for the answer provided - empty string will be returned. Question ought to have been what will happen if beginIndex is equal to length of string from which you want a piece of. That is what the person who gave this example: "emptiness".substring(9) returns "" (an empty string) was trying to say.. Cheers! Nice questions and answers.

Post a Comment