September 2009 – Jeremy Smyth's blog

Dan Brown’s The Lost Symbol makes a passing reference to some “redacted text”, where only a highlighted portion of the text becomes available.

In the story, we find a redacted text where results appear like this:

######## secret location UNDERGROUND where the ########
######## somewhere in WASHINGTON D.C., the coordinates ########
######## uncovered an ANCIENT PORTAL that led ###########
######## warning the PYRAMID holds dangerous ###########
######## decipher this ENGRAVED SYMBOLON to unveil #######

…so Trish Dunne, a metasystems expert, constructs a spider to find the document for Katherine Solomon, but the document itself had been redacted.

Now, constructing a “search spider” is not exactly the technical exercise Dan Brown makes it out to be, but the redacted text the search spider produced got me thinking.

A while back, I put together a regular expression for someone who wanted their search engine to produce a summary of searched text, only showing the surrounding words. This isn’t exactly what Trish achieves, but it gave me an idea….

Although it’s not a common requirement, redacting text looks like a bit of fun. Let’s try it in Perl. Given the following input text, let’s do a bit of searching and redacting:

“Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum”

Right. Let’s look for the words “dolor”, “nisi”, and “officia deserunt”:

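#!/usr/bin/perl -w

# Alternation of the terms to reveal; everything else gets redacted.
$match = "dolor|nisi|officia deserunt";

while(<>){
    # Two words of context, the search term, then two more words of context.
    while(m!((?:\w+\b\s*){2})($match)(\w*\s*(?:\w+\b\s*){2})!g){
        $x = $`;                # text before this match
        $y = "$1\U$2\E$3";      # context, with the search term uppercased
        $_ = $';                # carry on with the text after the match

        $x =~ tr/A-Za-z/#/;     # redact the letters we skipped over
        print "$x$y";
    }
    tr/A-Za-z/#/;               # no more matches: redact what is left
    print;
}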

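Now this isn’t quite as complex as we’d need it to be; although it allows multiple search words, and catches them all, it doesn’t cope with runs of them: a search word that falls within two words of an earlier match gets swallowed as context rather than highlighted. Never mind, let’s see how it works.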

Unsurprisingly, the match is the magic part. Here it is, exploded:

m!                      # start the match
    (                   # find and remember the two words before the match
        (?:             # define a word (but don't remember it)
            \w+         # ...as a bunch of letters/numbers
            \b          # (stop matching letters/numbers)
            \s*         # ...followed by zero or more spaces/tabs/etc.
        ){2}            # Now we know what a word is, we want two of them.
    )                   # that's the two words...
    (                   # Now remember the matched word itself.
        $match          # Use the $match variable above.
    )                   #
    (                   # Now the next two words
        \w*\s*          # Any number of alphanum (even zero, because "dolor"
                        # should also match "dolore")
        (?:             # second word (but don't remember it)
            \w+\b\s*    # same as above
        ){2}
    )
!gx                     # the "!" matches "m!" above, g matches globally
                        # The "x" is explained below.

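Little aside first: the exploded code above is valid Perl, and does the same job as the one-line version in the complete program. The magic is performed by the x suffix on the expression; this tells perl to ignore whitespace and comments within a regular expression, and lets people like me explain what’s going on right there in the code. Handy, eh? Two things to watch: the delimiter (“!” here) must not appear inside one of those comments, and x also ignores the literal space inside $match, so a phrase such as “officia deserunt” only matches if $match is built with qr// or its space is written as [ ].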
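To see how the x suffix treats whitespace that arrives through an interpolated variable, here is a minimal sketch. The helper names (find_compact, find_x_string, find_x_qr) and the $match_qr variable are made up for this illustration; only the search terms come from the program above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Plain string, as in the program above; note the space in the last term.
my $match_string = "dolor|nisi|officia deserunt";

# Hypothetical alternative: a precompiled pattern keeps its own flags,
# so its space stays significant even inside an x-flagged match.
my $match_qr = qr/dolor|nisi|officia deserunt/;

# Each helper returns the term it found, or undef if nothing matched.
sub find_compact  { my ($t) = @_; return $t =~ /($match_string)/      ? $1 : undef }
sub find_x_string { my ($t) = @_; return $t =~ / ( $match_string ) /x ? $1 : undef }
sub find_x_qr     { my ($t) = @_; return $t =~ / ( $match_qr ) /x     ? $1 : undef }

my $text = "in culpa qui officia deserunt mollit anim";
for my $case ([compact      => \&find_compact],
              ['x + string' => \&find_x_string],
              ['x + qr//'   => \&find_x_qr]) {
    my ($label, $finder) = @$case;
    my $found = $finder->($text);
    printf "%-10s %s\n", $label, defined $found ? "found '$found'" : "no match";
}
```

Run against that sample line, the compact and qr// versions should find the phrase, while the x-plus-string version should not, because its pattern has effectively become “officiadeserunt”.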

So, once we’ve identified each matching pattern, we display it ($y, assembled from the three captured groups with the search term uppercased) after showing “#” for each letter that comes before the match (the $`). Then we reset the text to match ($_) to the remainder of the string (the $'), and go again.
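The same loop can also be sketched as a function that returns the redacted string rather than printing as it goes. This is only one possible arrangement, not the program above: redact_text is a hypothetical name, and it tracks offsets with Perl's @- and @+ arrays instead of using $` and $' and reassigning $_.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: redact everything in $text except each search term
# in $match (an alternation string) and two words of context either side.
sub redact_text {
    my ($text, $match) = @_;
    my $out  = '';
    my $last = 0;    # offset where the not-yet-emitted text starts

    while ($text =~ m!((?:\w+\b\s*){2})($match)(\w*\s*(?:\w+\b\s*){2})!g) {
        # Everything between the previous match and this one gets hashed.
        my $before = substr($text, $last, $-[0] - $last);
        $before =~ tr/A-Za-z/#/;

        # Context stays readable; the search term itself is uppercased.
        $out .= $before . $1 . uc($2) . $3;
        $last = $+[0];    # end of this match
    }

    # No more matches: hash the letters in whatever is left.
    my $rest = substr($text, $last);
    $rest =~ tr/A-Za-z/#/;
    return $out . $rest;
}

print redact_text("aa bb cc dolor dd ee ff", "dolor|nisi"), "\n";
```

Because the g flag resumes from the end of the previous match, the effect should be the same as resetting $_ to the remainder each time round.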

(Another aside: if we wanted to replace the spaces and punctuation with ‘#’, as well as all letters, we could do this instead of the tr///: $x = '#' x length($x);)
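For comparison, here are the two masking styles side by side, as a small sketch; mask_letters and mask_all are hypothetical names used only here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace letters only; spaces, digits and punctuation survive.
sub mask_letters {
    my ($s) = @_;
    $s =~ tr/A-Za-z/#/;
    return $s;
}

# Replace every character, whatever it is.
sub mask_all {
    my ($s) = @_;
    return '#' x length($s);
}

my $sample = "Duis aute, irure.";
print mask_letters($sample), "\n";   # word shapes and punctuation remain visible
print mask_all($sample), "\n";       # one unbroken run of hashes
```

The first keeps the outline of the sentence, which is what gives the redacted output its look; the second hides word lengths as well.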

At the end of matching, we fail to match any more, so we simply translate the remainder to hashes again.

After running it against our file, we get this:

Lorem ipsum DOLOR sit amet, ########### ########### ####, ### ## ####### ###### ########## ## labore et DOLORe magna aliqua. ## #### ## ##### ######, #### ####### ############ ullamco laboris NISI ut aliquip ## ## ####### #########. #### aute irure DOLOR in reprehenderit ## ######### ##### esse cillum DOLORe eu fugiat ##### ########. ######### #### ######## ######### ### ########, #### ## culpa qui OFFICIA DESERUNT mollit anim ## ### #######.

Job done.

Redacting Text the Dan Brown Way.