Building a markup language: part two

Previously, I've talked in general terms about Whirlcode 2, which is a mark-up language tailored for use on whirlpool.net.au, and coded in JavaScript.

One of the most challenging parts of the code was ensuring reliable handling of excessively long strings of characters which, if allowed to pass unfiltered, can play havoc with HTML page layouts. Not difficult, you might think: just search for long words and litter them with breakable characters? Not so fast: you're dealing with text that contains mark-up.
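
To make the problem concrete, here is a minimal sketch of the naive approach. The function name `naiveBreak` and the sample link are made up for illustration; this is not code from Whirlcode 2. It inserts a `<wbr>` after every 30 non-space characters with no awareness of tags:

```javascript
// Hypothetical naive approach: break every run of maxColumns non-space
// characters, without checking whether we are inside a tag.
function naiveBreak(text, maxColumns) {
    var pattern = new RegExp("(\\S{" + maxColumns + "})(?=\\S)", "g");
    return text.replace(pattern, "$1<wbr>");
}

// Illustrative input: a link whose href is longer than 30 characters.
var sample = '<a href="http://example.com/a/fairly/long/path">link</a>';
var broken = naiveBreak(sample, 30);

// The <wbr> lands inside the href attribute value, corrupting the link.
console.log(broken);
```

The visible text here ("link") is only four characters long and needed no break at all, yet the mark-up has been damaged.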

Here's what I came up with:

var inputString = txt;
var outputArray = [];
var maxColumns = 30;
var spaceMatchRegexp = /^((<[^>]+>|[^<\s]){0,30}\s+)+/;
var ptagMatchRegexp = /^<\/?(p|h1|h2|h3|ul|ol|li|hr|br)>/; // a group, not a character class, so whole tag names are matched
var entityMatchRegexp = /^&\w+?;/;
var charCount = 0;
var howMuchToGrab = 0;
while(inputString.length) {

    // Short-circuit where near/at end
    if(inputString.length < maxColumns) {
        outputArray.push(inputString);
        break;
    }

    var spaceMatches = inputString.match(spaceMatchRegexp);
    var ptagMatches = inputString.match(ptagMatchRegexp);

    // skip to next candidate area
    if(spaceMatches && spaceMatches[0]) {
        howMuchToGrab = spaceMatches[0].length;
        charCount = 0;
    // jump html tags
    } else if(inputString.charAt(0) == "<" && inputString.indexOf(">") > -1) {
        howMuchToGrab = inputString.indexOf(">") + 1; // include the closing ">" so it isn't counted as text
        if (ptagMatches) charCount = 0; // if it's a paragraph tag, reset the character count
    // jump entities
    } else if(inputString.match(entityMatchRegexp) ) { 
        howMuchToGrab = (inputString.indexOf(";") > -1) ? 1+inputString.indexOf(";") : inputString.length;
        charCount++;
    // normal character
    } else {
        howMuchToGrab = 1;
        charCount++;
    }

    // move chunk from input to output 
    outputArray.push(inputString.substring(0, howMuchToGrab));
    inputString = inputString.substring(howMuchToGrab);

    // if the limit is hit, add a word break
    if(charCount >= maxColumns) {
        outputArray.push("<wbr>");
        charCount = 2; // not sure why, but testing bears this out
    }

}
txt = outputArray.join("");

The little gem in this code is the following regular expression: ^((<[^>]+>|[^<\s]){0,30}\s+)+. It knows how to skip across large tracts of acceptable content, while intelligently ignoring HTML tags. It's the optimisation that makes the rest of the code go fast.
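
A quick way to see what the expression does is to run it against a few inputs. The sample strings below are made up for illustration; only the regular expression itself comes from the code above:

```javascript
var spaceMatchRegexp = /^((<[^>]+>|[^<\s]){0,30}\s+)+/;

// Ordinary prose with an inline tag: everything up to the last whitespace
// is consumed in one match, and each tag counts as a single unit.
var prose = "Some <b>bold</b> text here and a tail";
var m1 = prose.match(spaceMatchRegexp);
console.log(m1 ? m1[0] : null);

// More than 30 non-space characters before any whitespace: no match,
// so the caller has to fall back to character-by-character handling.
var longWord = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa end";
var m2 = longWord.match(spaceMatchRegexp);
console.log(m2);

// Anything in angle brackets is treated as a tag, whether or not it is
// a real HTML element.
var fake = "<faketagthatdoesnothing> blah </faketagthatdoesnothing>";
var m3 = fake.match(spaceMatchRegexp);
console.log(m3 ? m3[0] : null);
```

Note that the match always ends on whitespace, so a trailing word (or closing tag) with no whitespace after it is left for the next pass of the loop.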

When this regexp isn't matched, the code loops over each printable character, periodically dropping <wbr> tags as it goes.
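
The fallback path can be sketched in isolation. This is a simplified illustration, not the routine above: the function name `insertWordBreaks` is hypothetical, it omits the whitespace fast path, and it resets the counter to 0 after each break rather than to the empirically chosen 2.

```javascript
// Simplified sketch of the character-by-character fallback only.
function insertWordBreaks(text, maxColumns) {
    var out = [];
    var count = 0; // visible characters since the last break opportunity
    var i = 0;
    while (i < text.length) {
        var rest = text.substring(i);
        var tag = rest.match(/^<[^>]+>/);
        var entity = rest.match(/^&\w+;/); // named entities only, as in the original pattern
        var chunk;
        if (tag) {
            chunk = tag[0]; // tags take up no visible width
        } else if (entity) {
            chunk = entity[0]; // an entity renders as one character
            count++;
        } else {
            chunk = text.charAt(i);
            // whitespace is already a break opportunity, so it resets the count
            count = /\s/.test(chunk) ? 0 : count + 1;
        }
        out.push(chunk);
        i += chunk.length;
        // add a word break once the limit is hit, unless we are at the end
        if (count >= maxColumns && i < text.length) {
            out.push("<wbr>");
            count = 0;
        }
    }
    return out.join("");
}

console.log(insertWordBreaks("aaaaaaaaaaaaaaaaaaaaaaaaa", 10));
console.log(insertWordBreaks("<b>abcde</b>fghij", 5));
console.log(insertWordBreaks("ab&amp;cd", 3));
```

Taking a fresh substring on every iteration is wasteful on long inputs, which is exactly why the whitespace-skipping regular expression matters in the real code.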

8 comments

A question about the maxlength of a field that has escaped characters: how do you handle that in the database? If you say that special and escaped characters should not count towards length limits, how do you select a length for the database field?

Firstly, asking the database server to impose your arbitrary length limits is an inherently controversial subject. This technique was written primarily for handling body text stored in fields without such limits.

But doesn't this expression ^((<[^>]+>|[^<\s]){0,30}\s+)+ also ignore tags that don't really exist such as:
<faketagthatdoesnothing> blah </faketagthatdoesnothing>
BTW HTML/JavaScript isn't my strong suit :P

Don't update the blog much, do you?

Looking forward to using "Whirlcode 2", Simon. The current system works well, but I can see that it will be even better after you implement the new features I've seen you discussing in this blog.

Would it be possible for you to implement a way of neatly displaying tabular data?

The reason is that it's currently very hard to read messages where people have needed to include chunks of data, like their usage figures for the month.

A simple solution would be to allow the use of HTML tables.

Whirlcode 2 is already implemented:
http://whirlpool.net.au/wiki/?tag=whirlcode2

Thanks for the link Adam, Whirlcode 2 has got some useful features that I was not previously aware of.









