wildcard search using RegExp

So one of my current projects is the Vocablinator, a dictionary of the words used in the junior high school (JHS) English textbooks in Japan. Believe it or not, this is actually quite useful to some people.

Many users requested two things, which I tried to incorporate into the design:
  1. the ability to list all the words from a given range of pages
  2. the ability to search for a specific word
The words are stored in external XML files in tags like this:

<word book="1" chapter="3" page="42">awesomenitude</word>
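
Listing the words from a page range then comes down to comparing those attributes. A rough sketch, assuming the tags have already been read into plain objects (the entries array, its field names, the extra sample words, and the wordsInPageRange helper are all made up for illustration, not taken from the Vocablinator code):

```javascript
// Hypothetical data: each <word> tag flattened into a plain object.
var entries = [
  { book: 1, chapter: 3, page: 42, word: "awesomenitude" },
  { book: 1, chapter: 3, page: 44, word: "groovetastic" },
  { book: 2, chapter: 1, page: 42, word: "asterisk" }
];

// Returns the words of one book whose page lies in [firstPage, lastPage].
function wordsInPageRange(entries, book, firstPage, lastPage) {
  var found = [];
  for (var i = 0; i < entries.length; i++) {
    var e = entries[i];
    if (e.book === book && e.page >= firstPage && e.page <= lastPage) {
      found.push(e.word);
    }
  }
  return found;
}

wordsInPageRange(entries, 1, 40, 43); // ["awesomenitude"]
```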

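In the basic PHP (no-JavaScript) version of the page, the user's query is sent via a form to a PHP script, which spits back the results. When JavaScript is available, the form is hijaxed: the script intercepts the submission and handles the search in the browser instead.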

The XML basically never changes, so I made a cache system - the first time a user searches, the XML is downloaded and stored in a local JavaScript object for quicker access. Since the XML is then already present and parsed, subsequent searches are lightning fast. This allows the JavaScript version to be much slicker - results are returned every time a key is pressed in the search box (rather like Google's new instant search, but less annoying).
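
The cache amounts to "load once, reuse afterwards". A minimal sketch of that idea, with the download replaced by a stand-in function (makeCachedLoader and fakeDownload are hypothetical names, and a real download would be asynchronous rather than a plain function call):

```javascript
// Hypothetical helper: wraps a loader so its work is done only once.
function makeCachedLoader(load) {
  var cache = null; // stays null until the first search
  return function () {
    if (cache === null) {
      cache = load(); // first call: do the expensive work
    }
    return cache; // later calls: reuse the stored result
  };
}

// Stand-in for downloading and parsing the XML.
var downloads = 0;
function fakeDownload() {
  downloads++;
  return ["awesomenitude"];
}

var getWords = makeCachedLoader(fakeDownload);
getWords();
getWords();
// downloads is 1 here: the second call was served from the cache
```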

The one problem with the hijax version is: what to do with an empty search box? I had been using indexOf to test for matches, because it is simple and provided all the utility I needed. But under such a system, if a user enters something and then deletes it, the script will try to match a blank string - and a blank string is a match for everything. This means that the entire 2700-word database would come crashing into view.
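
The blank-string behaviour is easy to see in isolation. A small illustration (matchesByIndexOf is a hypothetical name for the kind of test described above, not the actual Vocablinator function):

```javascript
// Substring test of the kind indexOf makes easy.
function matchesByIndexOf(word, query) {
  return word.indexOf(query) !== -1;
}

matchesByIndexOf("awesomenitude", "some"); // true
matchesByIndexOf("awesomenitude", "xyz");  // false
// indexOf("") returns 0 for any string, so a blank query matches every word
matchesByIndexOf("awesomenitude", "");     // true
```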

Obviously this is not too hard to code around, but many users had actually requested exactly that: a way to list every word. I did want to include the option, but not force it upon people who didn't need it. For a while, I tinkered with the idea of a placeholder - a blank string would yield no results, and a special character such as # would function to return all words. But I think you'll agree, this is hardly intuitive.

Then it hit me - MS-DOS! Ahh, how I miss doing things by the command-line (says he, now that he doesn't actually have to anymore). I remembered the magic asterisk, and how much fun it was (seriously - this was what my life was like as a child) to use.

So I made a regular expression-powered search:

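//grab user input and store it in searchTerm
var searchTerm = document.getElementById("searchbox").value;

//replace any asterisks (*) with the regex wildcard (.*)
searchTerm = searchTerm.replace(/\*/g, ".*");

//make a new regular expression (case-insensitive; no "g" flag, because a
//global regex remembers lastIndex between test() calls and would skip matches)
var newSearchTerm = new RegExp(searchTerm, "i");

//get a nodeList of word tags from the XML, and search them
var words = xmlDoc.getElementsByTagName("word");
var results = []; //stand-in name for the list of matching results

//a blank search box yields no results; a lone asterisk lists everything
if (searchTerm !== "") {
  for (var i = 0; i < words.length; i++) {
    if (newSearchTerm.test(words[i].firstChild.nodeValue)) {
      //add it to the list of matching results
      results.push(words[i]);
    }
  }
}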

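A bonus of this is that the asterisk can be used for much more powerful searches: not just listing all the words, but, combined with the regex anchors ^ and $, all the words beginning with a (^a*), or ending with ly (*ly$), or whatever. Without an anchor, a* simply matches any word containing an a. Groovetastic.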
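
A variant worth sketching, under assumptions: if the code adds the anchors itself, a* on its own means "begins with a", and a blank query matches nothing. The version below also escapes the other regex metacharacters, so a stray ( in the search box cannot throw a syntax error. The names wildcardToRegExp and searchWords are hypothetical, and a plain array of words stands in for the XML nodes:

```javascript
// Hypothetical helper: turns a DOS-style wildcard query into an anchored,
// case-insensitive regular expression.
function wildcardToRegExp(query) {
  // Escape regex metacharacters so they are matched literally,
  // leaving the asterisk alone for the next step.
  var escaped = query.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  // Each asterisk becomes "any run of characters".
  var pattern = escaped.replace(/\*/g, ".*");
  // ^ and $ tie the pattern to the whole word. No "g" flag: with it,
  // test() would carry lastIndex over from one word to the next.
  return new RegExp("^" + pattern + "$", "i");
}

// Returns the words matching the query; a blank query matches nothing.
function searchWords(words, query) {
  if (query === "") {
    return [];
  }
  var re = wildcardToRegExp(query);
  var results = [];
  for (var i = 0; i < words.length; i++) {
    if (re.test(words[i])) {
      results.push(words[i]);
    }
  }
  return results;
}

var sample = ["apple", "quickly", "banana", "Awfully"];
searchWords(sample, "a*");  // ["apple", "Awfully"]
searchWords(sample, "*ly"); // ["quickly", "Awfully"]
searchWords(sample, "*");   // all four words
searchWords(sample, "");    // []
```

The trade-off is that a plain query such as apple now matches only that exact word; anyone wanting substring matches has to type *apple*.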

For anyone interested in testing JavaScript regular expressions, I wrote a little utility (that I am reasonably sure works perfectly - let me know if you find bugs!).
