2012-01-22

Weird .test() behaviour in JavaScript regex

In a previous post, I described how I implemented a search of words contained in an XML file by using a regular expression defined by the user. Well, I ran into a weird problem.



It seemed to work great, but by chance I noticed that it wasn't actually returning all the results. This is a simplified version of my XML:

<root> 
  <w>aunt</w> 
  <w>active volcano</w> 
  <w>Xamoschi</w> 
</root>

This was what I was using (more or less) to test the words for a match:

var wildcardified = userinput.replace(/\*/g, ".*?"); 
var srch = new RegExp(wildcardified, "gi"); 
 
//for loop cycles through the xml, and tests with this: 
if (srch.test(tag[i].firstChild.nodeValue)) { 
  //it's a match! 
} 

For the tiny XML file above, the results for various inputs were as follows:

  1. a* matched all three
  2. a*n matched aunt and active volcano
  3. a*t only matched aunt
  4. a*ti only matched active volcano

The problem? #3 should also match active volcano.

So I was tearing my hair out all night, trying to figure out what was going on. I had no doubt it was some stupid small mistake or oversight on my part; it always is. Except that this time it wasn't.

Well, thank the gods for Stack Overflow.

From what I now understand, this is what was happening:

The RegExp method .test() is based on the .exec() method. When you specify that you want a global search using the g flag, the regex remembers where its last match ended, and each call to .test() starts looking from that position instead of from the beginning, even if you hand it a different string. This is a little confusing, so let's look at an example:

//simply match the letter c anywhere in the string
var regex = /c/g; 

alert(regex.test("c"));  //alerts true
alert(regex.test("c"));  //alerts false
alert(regex.test("c"));  //alerts true

... which seems like a WTF moment if ever there was one. We are performing the same exact test on the same string, using the same regular expression object, regex. So what gives? Well, for starters, regex is an object. Objects can have properties we might not be aware of unless we are looking for them. In this particular case, the property is lastIndex, which holds the position in the string at which the next search will start. It is not a count of matches: after a successful match it points just past the end of that match, and after a failed one it goes back to 0.
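That stored position is also what lets .exec() walk through a string one match at a time, which is presumably why a global regex keeps any state at all. A minimal sketch of that use (matchPositions is a made-up helper name, not something built into the language):

```javascript
// Hypothetical helper: collect the start index of every match in text.
function matchPositions(regex, text) {
  // Without the g flag lastIndex never advances, so the loop below
  // would find the first match forever. Refuse to run in that case.
  if (!regex.global) {
    throw new Error("matchPositions needs a regex with the g flag");
  }

  var positions = [];
  var match;

  regex.lastIndex = 0; // always start from the beginning of text

  // exec() resumes at regex.lastIndex; when it runs out of matches it
  // returns null and sets lastIndex back to 0
  while ((match = regex.exec(text)) !== null) {
    positions.push(match.index);

    // a zero-length match leaves lastIndex where it was; nudge it forward
    if (match[0].length === 0) {
      regex.lastIndex++;
    }
  }
  return positions;
}

console.log(matchPositions(/c/g, "cccp")); // logs the positions 0, 1 and 2
```

Each pass through the loop is one call to .exec(), and .test() shares exactly the same bookkeeping.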

Going back to the three alerts with /c/g, what is actually happening is this:

//simply match the letter c anywhere in the string
var regex = /c/g; 

//regex has a property called .lastIndex: the index where the next search starts
//to start with, lastIndex is 0.
alert("lastIndex: " + regex.lastIndex + "\n: " + regex.test("c"));
/* alerts:
 *  lastIndex: 0
 *  true
 */

//having found a match at index 0, regex sets lastIndex to the position just after it

//lastIndex is now 1
alert("lastIndex: " + regex.lastIndex + "\n: " + regex.test("c"));
/* alerts:
 *  lastIndex: 1
 *  false
 */

//this time no match was found, so the lastIndex is set back to 0

//but now it returns true again, because it is starting at the beginning again
alert("lastIndex: " + regex.lastIndex + "\n: " + regex.test("c"));
/* alerts:
 *  lastIndex: 0
 *  true
 */

This becomes a little clearer if you change the string you are testing to "cccp":

//same simple regex
var regex = /c/g; 

//lastIndex is 0.
alert("lastIndex: " + regex.lastIndex + "\n: " + regex.test("cccp"));
/* alerts:
 *  lastIndex: 0
 *  true
 */

//lastIndex is now 1
alert("lastIndex: " + regex.lastIndex + "\n: " + regex.test("cccp"));
/* alerts:
 *  lastIndex: 1
 *  true
 */

//lastIndex is now 2
alert("lastIndex: " + regex.lastIndex + "\n: " + regex.test("cccp"));
/* alerts:
 *  lastIndex: 2
 *  true
 */

//lastIndex is now 3
alert("lastIndex: " + regex.lastIndex + "\n: " + regex.test("cccp"));
/* alerts:
 *  lastIndex: 3
 *  false
 */

//lastIndex is now reset to 0
alert("lastIndex: " + regex.lastIndex + "\n: " + regex.test("cccp"));
/* alerts:
 *  lastIndex: 0
 *  true
 */

The first three alerts all return true, because they each match one of the cs. The fourth alert shows false, because there are no cs after that. lastIndex is reset to zero, and so the fifth, sixth and seventh alerts will be true, the eighth will be false, and so on in a cycle ad infinitum.
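The search loop runs into the same thing, except that every call to .test() gets a different word while lastIndex carries over from the previous one. Here is a rough reconstruction for input #3 (a*t), with the words in a plain array instead of the XML. The traceTests helper and the array are made up for this illustration:

```javascript
// Hypothetical helper: run regex.test() on each word in turn and record
// the lastIndex seen just before the call, plus the result.
function traceTests(regex, words) {
  var log = [];
  for (var i = 0; i < words.length; i++) {
    var startedAt = regex.lastIndex; // where this test will start looking
    var matched = regex.test(words[i]);
    log.push({ word: words[i], startedAt: startedAt, matched: matched });
  }
  return log;
}

var words = ["aunt", "active volcano", "Xamoschi"];

// "a*t" after the wildcard replacement becomes a.*?t
var trace = traceTests(new RegExp("a.*?t", "gi"), words);

for (var i = 0; i < trace.length; i++) {
  console.log(trace[i].word + ": started at " + trace[i].startedAt +
              ", matched: " + trace[i].matched);
}
// aunt: started at 0, matched: true
// active volcano: started at 4, matched: false
// Xamoschi: started at 0, matched: false
```

The match on "aunt" leaves lastIndex at 4, so the test on "active volcano" skips the "act" at the start and only looks at "ve volcano", where there is no a followed by a t.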

This behaviour can be avoided very easily by simply not setting the g flag in the first place. However, if you really need it (as I did), a simple workaround is to manually reset lastIndex after each test. Here is my original for loop with the new workaround:

var wildcardified = userinput.replace(/\*/g, ".*?");
var srch = new RegExp(wildcardified, "gi");

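//for loop cycles through the xml, and tests with this:
if (srch.test(tag[i].firstChild.nodeValue)) {
  //a successful test leaves lastIndex just after the match, so reset it
  //(a failed test already sets it back to 0 by itself)
  srch.lastIndex = 0;
  //it's a match!
}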
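Two other ways to get the same effect, sketched on the assumption that the words are available as plain strings (both function names are made up). The first resets lastIndex before the test rather than after it, so it does not matter what state the regex was left in. The second uses the string method .search(), which ignores both the g flag and lastIndex.

```javascript
// Hypothetical helper: test from the start of text, whatever happened before.
function testFromStart(regex, text) {
  regex.lastIndex = 0;
  return regex.test(text);
}

// Hypothetical helper: search() always scans from index 0 and returns -1
// when there is no match; it leaves regex.lastIndex as it found it.
function containsMatch(regex, text) {
  return text.search(regex) !== -1;
}

var srch = new RegExp("a.*?t", "gi");
var words = ["aunt", "active volcano", "Xamoschi"];
var viaReset = [];
var viaSearch = [];

for (var i = 0; i < words.length; i++) {
  if (testFromStart(srch, words[i])) { viaReset.push(words[i]); }
}
for (var j = 0; j < words.length; j++) {
  if (containsMatch(srch, words[j])) { viaSearch.push(words[j]); }
}

console.log(viaReset);  // aunt and active volcano
console.log(viaSearch); // aunt and active volcano
```

Either way the same srch object can still be used elsewhere with its g flag intact.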




This is an expanded version of the question I asked on Stack Overflow here: http://stackoverflow.com/questions/8943488/javascript-wildcard-regex-search-gives-inconsistent-results

and the answer here: http://stackoverflow.com/questions/8911542/why-does-the-g-modifier-give-different-results-when-test-is-called-twice/8911576#8911576

