Archive for September, 2009

Redacting Text the Dan Brown Way.

Author: Jeremy Smyth

Dan Brown’s The Lost Symbol makes a passing reference to some “redacted text”, where only a highlighted portion of the text becomes available.

In the story, we find a redacted text where results appear like this:

######## secret location UNDERGROUND where the ########
######## somewhere in WASHINGTON D.C., the coordinates ########
######## uncovered an ANCIENT PORTAL that led ###########
######## warning the PYRAMID holds dangerous ###########
######## decipher this ENGRAVED SYMBOLON to unveil #######

…so Trish Dunne, a metasystems expert, constructs a spider to find the document for Katherine Solomon, but the document itself had been redacted.

Now, constructing a “search spider” is not exactly the technical exercise Dan Brown makes it out to be, but the redacted text the search spider produced got me thinking.

A while back, I put together a regular expression for someone who wanted their search engine to produce a summary of searched text, only showing the surrounding words. This isn’t exactly what Trish achieves, but it gave me an idea….

Although it’s not a common requirement, redacting text looks like a bit of fun. Let’s try it in Perl. Given the following input text, let’s do a bit of searching and redacting:

“Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum”

Right. Let’s look for the words “dolor”, “nisi”, and “officia deserunt”:

#!/usr/bin/perl -w $match = "dolor|nisi|officia deserunt"; while(<>){     while(m!((?:\w+\b\s*){2})($match)(\w*\s*(?:\w+\b\s*){2})!g){         $x = $`;         $y = "$1\U$2\E$3";         $_ = $';                 $x =~ tr/A-Za-z/#/;         print "$x$y";     }     tr/A-Za-z/#/;     print; }

Now this isn’t quite as complex as we’d need it to be; although it allows multiple search words, and catches them all, it doesn’t catch longer strings of them. Nevermind, let’s see how it works.

Unsurprisingly, the match is the magic part. Here’s it exploded:

m!                      # start the match     (                   # find and remember the two words before the match         (?:             # define a word (but don't remember it)             \w+         # ...as a bunch of letters/numbers             \b          # (stop matching letters/numbers!)             \s*         # ...followed by zero or more spaces/tabs/etc.         ){2}            # Now we know what a word is, we want two of them.     )                   # that's the two words...     (                   # Now remember the matched word itself.         $match          # Use the $match variable above.         )                   #     (                   # Now the next two words         \w*\s*          # Any number of alphanum (even zero, because "dolor"                         # should also match "dolore")                   (?:             # second word (but don't remember it)             \w+\b\s*    # same as above         ){2}     ) !gx                     # the "!" matches "m!" above, g matches globally                         # The "x" is explained below.

Little aside first: the exploded code above is perfectly valid Perl, and would work just as well as the code in the complete program above. The magic is performed by the x suffix on the expression; this tells perl to ignore whitespace and comments within a regular expression, and lets people like me explain what’s going on right there in the code. Handy, eh?

So, once we’ve identified each matching pattern, we display it (the $&) after showing “#” for each letter that comes before the match (the $`). Then we reset the text to match ($_) to the remainder of the string, and go again.

(Another aside: if we wanted to replace the spaces and punctuation with ‘#’, as well as all letters, we could do this instead of the tr///: $x = '#' x length($x);)

At the end of matching, we fail to match any more, so we simply translate the remainder to hashes again.

After running it against our file, we get this:

Lorem ipsum DOLOR sit amet, ########### ########### ####, ### ## ####### ###### ########## ## labore et DOLORe magna aliqua. ## #### ## ##### ######, #### ####### ############ ullamco laboris NISI ut aliquip ## ## ####### #########. #### aute irure DOLOR in reprehenderit ## ######### ##### esse cillum DOLORe eu fugiat ##### ########. ######### #### ######## ######### ### ########, #### ## culpa qui OFFICIA DESERUNT mollit anim ## ### #######.

Job done.

  • chad ochocinco career stats
  • bangles eternal flame mp3bengals forum
  • tea party lies
  • memo
  • vince young redskins
  • search engines for kids
  • randy moss combine results
  • chad ochocinco parents
  • arno
  • hp support quick test pro
  • astronomy
  • new england patriots underwear
  • new england patriots 1997 roster
  • dashboard
  • tea party table settings
  • bea exhibitors
  • c span 4 to 5
  • battleship hacked
  • hyosung
  • tea party zombies download
  • search 3 bodybuilding other index
  • bengals preseason schedule 2011
  • carwash
  • mtv rivals
  • bengals history
  • randy moss legal issues
  • zara phillips husband
  • vince young uncle rico
  • vince young dadvince young eagles
  • chad ochocinco and cheryl burke
  • trojans
  • search 50 cent
  • search engines other than google
  • chad ochocinco to detroit
  • la ink ink
  • textured
  • battleship aurora
  • tims
  • glaze
  • northfield
  • vince young released
  • canary
  • la ink 105
  • c span yesterdayc span zelaya
  • new england patriots espn blog
  • speak
  • new england patriots 3 4
  • zara phillips wedding date
  • fenwick
  • new england patriots gillette stadium
  • chicago bears 17 lisa lampanelli
  • federated
  • battleship ipad
  • infrared
  • comforters
  • tea party manifesto
  • chicago bears expo 2011
  • chad ochocinco quotes video
  • steering
  • bea per capita income
  • bengals arrests
  • disassembledis boards
  • nicu
  • dis quand reviendras-tu
  • dis tester
  • chad ochocinco ultimate catch cast
  • stronger
  • bengals undraftedbengals vs steelers
  • cspan ap government review
  • connecticut renaissance faire
  • powertrain
  • bengals hard knocks episode 1
  • hp support error 1005
  • c span youtube obama
  • hp support 6310hp support 7200
  • towns
  • tint
  • cspan facebook
  • la ink 3rd season
  • hangers
  • chicago bears gifts
  • vince young 10 11
  • digitizing
  • connecticut food bank
  • mtv executivesmtv fantasy factory
  • hp support contact number
  • tea party hobbits
  • greg olsen puzzles
  • bea 2011 map
  • chad ochocinco xpchad ochocinco youtube
  • bengals qb situation
  • randy moss college
  • c span shelby foote
  • pudding
  • zara phillips and the queen
  • rufus
  • tea party chicago
  • hp support chat
  • bea goldfishberg
  • randy moss 98 vikings
  • zara phillips school
  • dis pater
  • tea party 8 28 09
  • michelin
  • tea party medicare
  • sticky
  • chicago bears tattoos
  • chicago bears 08 record
  • vince young endorsementsvince young foundation
  • cspan presidents
  • battleship kirishima
  • search and seizure
  • zara phillips and the queen
  • mtv youtube channel
  • search 78search 800 numbers
  • battleship yamato wreck
  • mtv live
  • la ink phone number
  • tea party zombies download
  • beagle
  • battleship lexington
  • bengals merchandise
  • battleship excel
  • dist 95
  • greensboro
  • search xml file
  • dailymotion
  • bea fox
  • bea rims
  • hp support chat
  • bengals cats for sale
  • hp support id
  • bengals new uniforms 2012
  • polos
  • cspan washington correspondents dinner 2011
  • bridesmaid
  • vince young injury
  • xanadu bengals
  • randy moss korey stringer
  • chad ochocinco vs skip bayless
  • opera
  • bea 0b0 105
  • vince young yahoo stats
  • bengals insider
  • squadron
  • dis poem
  • vince young jay cutler
  • hp support 6500a plus
  • mtv 5 cover
  • nozzle
  • bea 4603
  • bengals forum
  • bengals football
  • hp support 530
  • battleship bismarck wreck
  • freida pinto jeansfreida pinto kissing
  • hp support hard drive replacement
  • mtv oddities
  • scored
  • tea party ribbons
  • christina
  • birthdate
  • vince young 6
  • mtv 90s music videos
  • zara phillips tongue
  • search vim
  • bea test
  • connecticut lottery
  • freida pinto 1995
  • randy moss vikings 2011
  • dis 2012 conference
  • vince young status
  • connecticut quarter error
  • dis boards cruise
  • connecticut state parks
  • new england patriots 07
  • search 5500
  • vince young z
  • search tumblr
  • vince young rumors
  • c span yesterdayc span zelaya
  • connecticut 97.7connecticut attorney general
  • bea oracle
  • randy moss bio
  • freida pinto boyfriend