Building a markup language: part two

Previously, I talked in general terms about Whirlcode 2, a mark-up language tailored for use on whirlpool.net.au and coded in JavaScript.

One of the most challenging parts of the code was ensuring reliable handling of excessively long strings of characters which, if allowed to pass unfiltered, can play havoc with HTML page layouts. Not difficult, you might think: just search for long words and litter them with breakable characters? Not so fast: you're dealing with text that contains mark-up.
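To make the problem concrete, here is a minimal sketch of the naive approach. The function name naiveBreak and the sample link are made up for illustration; none of this is part of Whirlcode 2.

```javascript
// Hypothetical naive approach: break every run of non-space characters
// after maxColumns characters, with no awareness of mark-up.
function naiveBreak(txt, maxColumns) {
    var longRun = new RegExp("(\\S{" + maxColumns + "})(?=\\S)", "g");
    return txt.replace(longRun, "$1<wbr>");
}

// Illustrative input: a short visible word wrapped in a long tag.
var html = '<a href="http://example.com/some/quite/long/path/index.html">link</a>';
var broken = naiveBreak(html, 30);

// The first <wbr> lands inside the href attribute, corrupting the link.
console.log(broken.indexOf("<wbr>") < broken.indexOf('">')); // true
```

The visible text in that sample is only four characters long, so nothing needed breaking in the first place.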

Here's what I came up with:

var inputString = txt;
var outputArray = [];
var maxColumns = 30;
var spaceMatchRegexp = /^((<[^>]+>|[^<\s]){0,30}\s+)+/;
var ptagMatchRegexp = /^<\/?(p|h1|h2|h3|ul|ol|li|hr|br)>/; // alternation needs a group; [...] is a character class
var entityMatchRegexp = /^&\w+?;/;
var charCount = 0;
var howMuchToGrab = 0;
while(inputString.length) {

    // Short-circuit where near/at end
    if(inputString.length < maxColumns) {
        outputArray.push(inputString);
        break;
    }

    var spaceMatches = inputString.match(spaceMatchRegexp);
    var ptagMatches = inputString.match(ptagMatchRegexp);

    // skip to next candidate area
    if(spaceMatches && spaceMatches[0]) {
        howMuchToGrab = spaceMatches[0].length;
        charCount = 0;
    // jump html tags
    } else if(inputString.charAt(0) == "<" && inputString.indexOf(">") > -1) {
        howMuchToGrab = inputString.indexOf(">") + 1; // include the closing ">" so it isn't counted as a printable character
        if (inputString.match(ptagMatchRegexp)) charCount = 0; // if it's a paragraph tag, reset the character count
    // jump entities
    } else if(inputString.match(entityMatchRegexp) ) { 
        howMuchToGrab = (inputString.indexOf(";") > -1) ? 1+inputString.indexOf(";") : inputString.length;
        charCount++;
    // normal character
    } else {
        howMuchToGrab = 1;
        charCount++;
    }

    // move chunk from input to output 
    outputArray.push(inputString.substring(0, howMuchToGrab));
    inputString = inputString.substring(howMuchToGrab);

    // if the limit is hit, add a word break
    if(charCount >= maxColumns) {
        outputArray.push("<wbr>");
        charCount = 2; // not sure why but testing bears this out
    }

}
txt = outputArray.join("");

The little gem in this code is the following regular expression: ^((<[^>]+>|[^<\s]){0,30}\s+)+. It knows how to skip across large tracts of acceptable content, while intelligently ignoring HTML tags. It's the optimisation that makes the rest of the code go fast.
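Here is a small sketch of what that regexp actually consumes. The helper name safePrefix and the sample strings are assumptions made for illustration; only the pattern itself comes from the code above.

```javascript
// Same pattern as spaceMatchRegexp above.
var spaceMatchRegexp = /^((<[^>]+>|[^<\s]){0,30}\s+)+/;

// Hypothetical helper: returns the leading text that can be copied
// straight to the output without any character-by-character counting.
function safePrefix(str) {
    var m = str.match(spaceMatchRegexp);
    return m ? m[0] : "";
}

var longWord = new Array(41).join("x"); // 40 x's, no spaces

// Ordinary words and tags are consumed up to the last whitespace
// before the long word.
var sample = "short words <b>with tags</b> here " + longWord;
console.log(safePrefix(sample)); // "short words <b>with tags</b> here "

// A long word at the very start gives the regexp nothing to skip.
console.log(safePrefix(longWord + " tail") === ""); // true

// A tag counts as a single unit however long it is.
var link = '<a href="http://example.com/some/quite/long/path/index.html">link</a> rest';
console.log(safePrefix(link) === link.slice(0, link.length - 4)); // true
```

In the last case only the trailing "rest" is left behind, because no whitespace follows it.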

When this regexp isn't matched, the code loops over each printable character, periodically dropping <wbr> tags as it goes.
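As a rough illustration of what "printable character" means on that slow path, the sketch below counts a tag as nothing and an entity as one character. The helper name printableLength is hypothetical, and the sketch deliberately ignores the reset on whitespace and block tags that the real loop performs.

```javascript
// Hypothetical helper: counts printable characters only.
// Tags contribute nothing; an entity such as &amp; counts as one character.
function printableLength(str) {
    var count = 0;
    while (str.length) {
        var tag = str.match(/^<[^>]+>/);
        var entity = str.match(/^&\w+?;/); // same entity pattern as the code above
        var grab = 1;
        if (tag) {
            grab = tag[0].length; // skip the whole tag, count nothing
        } else {
            if (entity) grab = entity[0].length; // whole entity, counted once
            count++;
        }
        str = str.substring(grab);
    }
    return count;
}

console.log(printableLength("<b>AT&amp;T</b>")); // 4
console.log(printableLength("plain")); // 5
```

Note that the entity pattern uses \w, so numeric entities such as &#160; are not recognised and would be counted character by character.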

Building a markup language: part one

User-generated content sites like Wikipedia and Whirlpool rely heavily on the ability of users to submit richly and contextually marked up content: everything from references to quotations, from links to lists. And this ability needs to be given to a large array of users who don't know HTML or, worse still, know too much. So we build syntax that gives users the ability to provide rich content within the parameters we set, without opening the doors to unwanted, nuisance or malicious mark-up.

Wikipedia's answer is wikitext, a comprehensive (though occasionally bewildering) syntax which makes it possible to have a generally consistent feel to most articles. It still requires authors to follow the guidelines, but a lot of heavy lifting and boilerplate is handled by the parser.

Whirlpool has a similar need. We've got thousands of new posts every day, hundreds of private messages, a job board and our own Wiki — all needing rich text handling of one sort or another. It's an important component to get right.

Up till now, Whirlpool's answer was Whirlcode. It's a fairly simple mark-up that is also difficult to trigger accidentally. Tags like [*bold*], [/italics/] and ["quotes"] aren't necessarily intuitive, but once observed are easily learned and retyped. However, over the past couple of weeks I've been hard at work on Whirlcode 2, a major rewrite of the parser that handles this mark-up. And yes, I've written it in JavaScript.

What features could a parser support? Here's a list of the ones I've specifically dealt with in the development of Whirlcode 2:

Check out this live-in-browser example:

I will be writing more about the specific algorithms developed as part of Whirlcode 2 in a future post.

ColdFusion and server side javascript

Whirlpool has its own forum mark-up language which we call Whirlcode; it's analogous to, but not the same as BBCode. In recent weeks I've been considering how to improve the Whirlcode syntax, particularly now that it's being used to mark up more complex documents such as entries in the Whirlpool wiki.

My first instinct was to see what's already out there, and the short list of candidates was Textile and Markdown. However, neither has a complete ColdFusion implementation, and from what I can tell the javascript implementations aren't exactly complete either; certainly not consistent with their server-side brethren.

I realised that more than anything else, I needed a mark-up language with 100% compatible server and client-side implementations, otherwise client-side previews could not be useful. And given the existing ‘investment’ in Whirlcode across the site, building my own ‘Whirlcode 2.0’ seemed the way to go. I took some ideas gleaned from Textile, combined them with traditional Whirlcode syntax, wrote a specification, and then built the first implementation in javascript.

My original plan was to port the javascript version to <cfscript>, but I was also thinking about Rhino, an implementation of javascript in Java. I wondered how easy it would be to get my javascript parsed server-side, within my ColdFusion code.

Rhino in ColdFusion isn't new territory. Barney Boisvert cracked the puzzle six months ago, though his code uses Rhino to power a whole development framework. All I needed was to proxy one javascript function as a ColdFusion function, a very different, much simpler task.

So I started with Rhino's examples, and got a class working within NetBeans that worked for a very simple javascript function. I then turned it into a Java CFX tag (CFX tags have many limitations, but compiling a class meant easier debugging).

Here is a snippet of the Java:

import org.mozilla.javascript.*;
...
Context context = ContextFactory.getGlobal().enterContext();
try {
   Scriptable scope = context.initStandardObjects();
   Object result1 = context.evaluateString(scope, javascript_as_string, "somelabel", 1, null);
   Object theFunction = scope.get(function_name_as_string, scope);
   Object functionArgs[] = arguments_as_array;
   Function f = (Function)theFunction;
   Object output = f.call(context, scope, scope, functionArgs); // pass the Context entered above
} finally {
   Context.exit();
}

Finally, I transcoded it into <cfscript>. The following is the simplest possible code for executing a javascript function within ColdFusion. I have stripped out the error handling and function wrapping to show the underlying concept as clearly as possible — a more featureful implementation is left as an exercise for the reader.

<cfscript>
   script = FileRead("path_to_your_script.js"); // placeholder path
   try {
      context = CreateObject("java", "org.mozilla.javascript.ContextFactory")
         .getGlobal().enterContext(); // Get the context object 
      try {
         scope = context.initStandardObjects(); // Prep the environment 
         context.evaluateString(scope, script, "somelabel", 1, javacast("null",0)); // Eval script 
         output = scope.get(function_name_as_string, scope)
            .call(context, scope, scope, arguments_as_array)
            .toString(); // Execute the function and return the result as a string
      }
      catch (any excpt) { }
      context.exit(); // Clean up
   }
   catch(any excpt) { }
</cfscript>

There are two alternative points where caching could be applied. Either cache the FileRead so you're not hitting the file system every time, or don't exit the context and cache it (and 'scope') somewhere.
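The first option is just memoisation of the file read. Here is a language-neutral sketch of the idea in javascript; makeCachedLoader and fakeFileRead are hypothetical names, and fakeFileRead stands in for a real file system call. In ColdFusion the cached value would presumably live in a shared scope instead.

```javascript
// Hypothetical helper: wraps any loader function with a per-key cache.
function makeCachedLoader(load) {
    var cache = {};
    return function (key) {
        if (!Object.prototype.hasOwnProperty.call(cache, key)) {
            cache[key] = load(key); // only reached the first time a key is seen
        }
        return cache[key];
    };
}

// Stand-in for an expensive read; a real version would hit the file system.
var reads = 0;
function fakeFileRead(path) {
    reads++;
    return "/* contents of " + path + " */";
}

var readScript = makeCachedLoader(fakeFileRead);
readScript("whirlcode.js");
readScript("whirlcode.js");
console.log(reads); // 1
```

A cache like this never notices when the script changes on disk, so it needs a restart or some way to clear an entry after each deployment.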

Before this will work, you'll probably need to update the copy of Rhino embedded in ColdFusion. (Does anyone know why it's there at all?) Download the latest version, and to ensure this version loads first, copy js.jar into \ColdFusion8\runtime\servers\lib\. Packages in this directory load before any other.

What makes a good site search?

Search is the dominant form of navigation on the internet; when it works, it's the fastest way to get where you want to go. Yet when it comes to site search, many websites really phone it in. As a result, users have started ignoring site search boxes entirely.

So, how do you do site search properly?

Before you begin, ask yourself if building or deploying anything is really necessary. Do you need something that Google or Yahoo's public site search features can't manage for you? Do you have domain-specific knowledge that you can apply to your specific situation?

Perhaps you've got information that Google can't, won't, or is not permitted to index. Perhaps your website is all about search, as is the case with online classifieds. Perhaps your needs don't align with the behaviours of the public search engines, for example, when your content has an extremely short shelf life.

Doing search right is all about achieving good results and communicating them well. There are many techniques you can use to tune your site search algorithm. Common ones include:

Okay, so now you have great search results that consistently find exactly what your users are looking for. Don't screw things up in the last innings by not conveying the content well on the results page.

How do you know if you've made any real-world improvements to the search quality? Measure it.

In a future article I'll talk about some practical implementations for many of these ideas.

A case study of performance optimisation

I run what could quite possibly be the most heavily trafficked ColdFusion/MySQL website out there, and while our web hosting benefactors have been gracious, we really don't have the computing power needed to competently handle the million or so page views we get each day. So I spend a lot of time looking for ways to do more with less, server-wise.

There's a lot you can do to identify bottlenecks and poorly performing code in your environment. Even after you've picked the low hanging fruit, there can be a lot of small wins that mean a great deal particularly when the server is running hot 18 hours a day. Some of the techniques I use include:

The case study

Back in January 2007, we upgraded to ColdFusion 7 and MySQL 5... and immediately started seeing a whole bunch of queries running in the MySQL process list which I didn't write. SHOW FULL COLUMNS FROM tablename? I wanted to know where these mysterious queries were coming from, as we were having performance problems with the new configuration and I was sure this was the cause... or at least the symptom.

Step one — make sure it's not my fault. Does it happen on a freshly installed copy of ColdFusion? I had one on my local computer, and it was causing the same symptom: whenever a query is run, SHOW FULL COLUMNS followed like a shadow. Similarly, a freshly installed copy of MySQL with a carefully recreated database schema didn't change anything.

Step two: try regressing versions. My gut assumption was that ColdFusion 7 was innocent, as I had recently upgraded a swath of machines to version 7 at work and didn't see these queries popping up on our MSSQL servers. After many hours of installing and reinstalling, it turned out that neither program was at fault on its own: it was the combination! Regress either one to its previous version and the problem went away.

Step three — identify the culprit. After tracing network activity between the servers, it quickly became apparent that these queries were being generated on ColdFusion's end... but how? As I was restoring the configuration for the umpteenth time, the answer hit me: Connector/J! I had been dutifully swapping versions of this jar every time I changed versions of MySQL. So I tried connecting to MySQL 5 with an older version of the connector, and magically, the rogue queries disappeared.

Now this presented an awkward situation. The previous connector did fix things, but it didn't support a number of MySQL 5 features I wanted to take advantage of. And I don't like being forced into avoiding upgrades; certainly not for something as trivial as this. So I downloaded the Java source code to see if I couldn't figure this one out for myself.

One download, unzip and search later, I had my answer. The query string existed in the getCollation() method of com.mysql.jdbc.Field, which in turn was exclusively called by isCaseSensitive() in com.mysql.jdbc.ResultSetMetaData. Immediately I made the logical connection to ColdFusion 7's new result attribute for <cfquery>, which returns such metadata as, you guessed it, case sensitivity.

It turns out ColdFusion was asking Connector/J for the metadata on every field, which in turn triggered a SHOW FULL COLUMNS query for every varchar and text column returned. Since I was using case insensitive collation exclusively throughout my database schema (and I wasn't using the result attribute anywhere), I was able to recompile the connector with the isCaseSensitive() method neutered. Problem solved.

I filed a bug report with MySQL, and it has been resolved as of version 5.0.7 of Connector/J.

email: simon at (domain)