Redacting Text the Dan Brown Way.

Dan Brown’s The Lost Symbol makes a passing reference to some “redacted text”, where only a highlighted portion of the text becomes available.

In the story, we find a redacted text where results appear like this:

######## secret location UNDERGROUND where the ########
######## somewhere in WASHINGTON D.C., the coordinates ########
######## uncovered an ANCIENT PORTAL that led ###########
######## warning the PYRAMID holds dangerous ###########
######## decipher this ENGRAVED SYMBOLON to unveil #######

…so Trish Dunne, a metasystems expert, constructs a spider to find the document for Katherine Solomon, but the document itself had been redacted.

Now, constructing a “search spider” is not exactly the technical exercise Dan Brown makes it out to be, but the redacted text the search spider produced got me thinking.

A while back, I put together a regular expression for someone who wanted their search engine to produce a summary of searched text, only showing the surrounding words. This isn’t exactly what Trish achieves, but it gave me an idea….

Although it’s not a common requirement, redacting text looks like a bit of fun. Let’s try it in Perl. Given the following input text, let’s do a bit of searching and redacting:

“Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum”

Right. Let’s look for the words “dolor”, “nisi”, and “officia deserunt”:

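#!/usr/bin/perl -w

$match = "dolor|nisi|officia deserunt";

while(<>){
    while(m!((?:\w+\b\s*){2})($match)(\w*\s*(?:\w+\b\s*){2})!g){
        $x = $`;                # everything before the match
        $y = "$1\U$2\E$3";      # context words, with the search term upper-cased
        $_ = $';                # carry on with the rest of the line

        $x =~ tr/A-Za-z/#/;
        print "$x$y";
    }
    tr/A-Za-z/#/;
    print;
}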

Now this isn’t quite as complex as we’d need it to be; although it allows multiple search words, and catches them all, it doesn’t catch longer strings of them. Never mind, let’s see how it works.

Unsurprisingly, the match is the magic part. Here it is, exploded:

m!                      # start the match
    (                   # find and remember the two words before the match
        (?:             # define a word (but don't remember it)
            \w+         # ...as a bunch of letters/numbers
            \b          # (stop matching letters/numbers here)
            \s*         # ...followed by zero or more spaces/tabs/etc.
        ){2}            # Now we know what a word is, we want two of them.
    )                   # that's the two words...
    (                   # Now remember the matched word itself.
        $match          # Use the $match variable above.
    )                   #
    (                   # Now the next two words
        \w*\s*          # Any number of alphanum (even zero, because "dolor"
                        # should also match "dolore")
        (?:             # second word (but don't remember it)
            \w+\b\s*    # same as above
        ){2}
    )
!gx                     # the "!" matches "m!" above, g matches globally
                        # The "x" is explained below.

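Little aside first: the exploded code above is perfectly valid Perl, and would work almost as well as the code in the complete program above. The magic is performed by the x suffix on the expression; this tells perl to ignore whitespace and comments within a regular expression, and lets people like me explain what’s going on right there in the code. Handy, eh? Two catches: the delimiter (here “!”) must not appear inside a comment, because perl would take it as the end of the pattern, and whitespace inside an interpolated variable is ignored too, so the space in “officia deserunt” would need to be written as \s+.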
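A quick way to see how the x suffix treats whitespace in an interpolated pattern is to try the same search term three ways. This is only a sketch; the variable names are made up for illustration, and the sample text is a fragment of the Lorem Ipsum above.

```perl
use strict;
use warnings;

my $text = "culpa qui officia deserunt mollit anim";

my $plain   = "officia deserunt";      # literal space in the pattern
my $escaped = 'officia\s+deserunt';    # single quotes keep the backslash intact

# Without x, the literal space must match a space in the text.
my $without_x = $text =~ /($plain)/ ? 1 : 0;

# With x, the space is dropped, so this looks for "officiadeserunt".
my $with_x_plain = $text =~ /($plain)/x ? 1 : 0;

# With x and an explicit \s+, the two-word term matches again.
my $with_x_escaped = $text =~ /($escaped)/x ? 1 : 0;

print "without x: $without_x, x + space: $with_x_plain, x + \\s+: $with_x_escaped\n";
```

So if the search terms come from a variable and may contain spaces, either leave x off or spell the whitespace out.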

So, once we’ve identified each matching pattern, we display it (the matched text, with the search term upper-cased by \U and \E) after showing “#” for each letter that comes before the match (the $`). Then we reset the text to match ($_) to the remainder of the string (the $'), and go again.

(Another aside: if we wanted to replace the spaces and punctuation with ‘#’, as well as all letters, we could do this instead of the tr///: $x = '#' x length($x);)
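To make the difference between the two approaches concrete, here is a small sketch that applies both to the same sample string (the string and variable names are just for illustration):

```perl
use strict;
use warnings;

my $prefix = "Ut enim, ad minim.";

# tr/// replaces letters only, so spaces and punctuation survive.
(my $letters_only = $prefix) =~ tr/A-Za-z/#/;

# The x operator repeats '#' once per character, hiding everything.
my $everything = '#' x length($prefix);

print "$letters_only\n";    # ## ####, ## #####.
print "$everything\n";      # ##################
```

The first form leaves the shape of the sentence visible, which is closer to the look of the redacted text in the book.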

At the end of matching, we fail to match any more, so we simply translate the remainder to hashes again.

After running it against our file, we get this:

Lorem ipsum DOLOR sit amet, ########### ########### ####, ### ## ####### ###### ########## ## labore et DOLORe magna aliqua. ## #### ## ##### ######, #### ####### ############ ullamco laboris NISI ut aliquip ## ## ####### #########. #### aute irure DOLOR in reprehenderit ## ######### ##### esse cillum DOLORe eu fugiat ##### ########. ######### #### ######## ######### ### ########, #### ## culpa qui OFFICIA DESERUNT mollit anim ## ### #######.

Job done.
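For anyone who would rather have the result as a string than printed straight out, the same idea can be wrapped in a sub. This is a sketch of a variation, not the script above: redact() is a hypothetical name, and it uses the @- and @+ match-offset arrays with a single global match instead of resetting $_ from $'.

```perl
use strict;
use warnings;

# Hypothetical helper: returns $text with everything hashed out except
# each match of $pattern and the two words either side of it.
sub redact {
    my ($text, $pattern) = @_;
    my $out = '';
    my $pos = 0;    # end of the previous match

    while ($text =~ /((?:\w+\b\s*){2})($pattern)(\w*\s*(?:\w+\b\s*){2})/g) {
        # Text between the previous match and this one gets hashed.
        my $before = substr($text, $pos, $-[0] - $pos);
        $before =~ tr/A-Za-z/#/;

        $out .= $before . $1 . uc($2) . $3;
        $pos = $+[0];
    }

    # Whatever follows the last match gets hashed too.
    my $rest = substr($text, $pos);
    $rest =~ tr/A-Za-z/#/;
    return $out . $rest;
}

print redact("Lorem ipsum dolor sit amet, consectetur adipisicing elit", "dolor"), "\n";
```

Called with the first sentence of the sample text, it keeps “Lorem ipsum DOLOR sit amet” and hashes the rest.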


Who's Afraid of the Big Bad Bignum?

Short one this time.

If you’ve a bignum in Perl, it might be a bit intimidating. For example, how on Earth would you get the third last digit from an enormous number? Convert it to a string first? Would that even work? Mod 1000?

Fortunately, bignums are transparently available; once you’ve got one in a variable, you treat it just like any other Perl scalar, so this will work:

$digit = substr($bignum, -3, 1);

Neat, eh?
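Here is a self-contained sketch of that one-liner, assuming the bignum comes from the core Math::BigInt module (the post doesn’t say where the bignum came from, so that part is my assumption):

```perl
use strict;
use warnings;
use Math::BigInt;

# 2**100 is far too large for a native integer.
my $bignum = Math::BigInt->new(2)->bpow(100);

# The object turns into its decimal digits when used as a string,
# so substr works on it directly.
my $digit = substr($bignum, -3, 1);

print "$bignum\n";    # 1267650600228229401496703205376
print "$digit\n";     # 3
```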

How to do application support: Padre

I stumbled across Padre, a Perl-native IDE, the other day, and although I wasn’t hugely impressed with its feature set, one small aspect stood out. It’s only a year or so old, yet it already has quite a few necessary features.

I wouldn’t compare it to Eclipse or any other relatively mature IDE, but I was massively impressed with one idea, that of user support.

The screenshots page almost casually refers to the “Live Support” option. Yes, they have Live Support. I’m sure it doesn’t come with a guarantee of instant access to the developers, but for an open source project, it’s pure genius. The developers and enthusiastic early-adopters probably hang out in the IRC channel anyway, even without any payment or expectation of payment. Tying this to the users, via a freely available web-based IRC client, is a remarkably clever use of existing technologies and the developer culture.

I’ve been to application-specific IRC channels before, and they tend to be polarised to either rather elitist “are you a developer? no? then buzz off” attitudes, or the opposite, policed by folk who like answering easy questions, and are therefore lording it over their own fiefdoms, kicking curious lurkers who want to pick up the accumulated wisdom by osmosis.

Padre is the first example I’ve seen where, rather than expecting the curious (or desperate) users to get onto IRC (IRC? what’s that?) themselves, they turn it into a clickable “Live Support” option, and bring the users to them with little or no difficulty.

I wish more applications did this.

Perl editing with Eclipse

Part of the problem with developing Java is the plethora of IDEs out there, and the lack of standardisation. It’s not really a problem with IDEs as much as with Java server platforms, as IDEs are largely the same; server platforms are rarely the same.

With Perl, the lack of standardisation in IDEs is not considered as much a problem, for the simple reason that many Perl programmers are really old-school, and tend to prefer simple text editors. Most of my recent Perl work has been done through Vim.

However, after teaching a Java course recently through a combination of Eclipse and ConTEXT, I had a look at Eclipse’s support for Perl, particularly with a view to debugging support; Vim doesn’t have native step-through debugging, and Eclipse seems already suited to things like that.

If you’re already familiar with debugging in Eclipse, then the EPIC plugin is well worth looking at for its Perl support.

It’s got stepped debugging within the Debug perspective, just like Eclipse has with other languages. Its Perl support is not as strong as the Java support — the Watch features and relatively simple editor features like refactoring support leave a lot to be desired — but it’s got an easier learning curve than e.g. “perl -d” (the ‘standard’ way to debug perl), or even learning a new editor like Emacs, with its Perl debugging integration. Of course, as a Vim user I haven’t even learned to hack Perl in Emacs…

Faking "Private" methods with Perl

Although Perl doesn’t really do the whole OO concept of private/public access modifiers, it’s somewhat possible to approximate them using subroutine references.

my $print_rev = sub { print scalar reverse $_[0]; };

Now we have a lexically scoped variable, visible only within the enclosing block or file (not the whole package, if the package spans several files), containing a reference to an anonymous sub.

Within our class/package, we call it by doing something like:

&$print_rev("Reversed!");

Voila! Instant private method!
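Put together in a class, it might look something like the sketch below. Greeter and its methods are hypothetical names for illustration; the enclosing braces are there so that the lexical variable really is invisible to the rest of the file.

```perl
use strict;
use warnings;

{
    package Greeter;    # hypothetical class, for illustration only

    # "Private" method: a lexical sub reference, invisible outside this block.
    my $reverse_text = sub {
        my ($text) = @_;
        return scalar reverse $text;
    };

    sub new {
        my ($class) = @_;
        return bless {}, $class;
    }

    # Public method that delegates to the private one.
    sub backwards {
        my ($self, $text) = @_;
        return $reverse_text->($text);
    }
}

package main;

my $greeter = Greeter->new;
print $greeter->backwards("Reversed!"), "\n";    # prints "!desreveR"
```

The arrow form $reverse_text->(...) does the same job as &$print_rev(...) above. Since the sub never enters the package’s symbol table, callers outside the block can’t reach it by name.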

Note: this is partway to creating closures, which are anonymous subs created at runtime that capture variables from the scope in which they were defined. I might mention them some other time.

Perl source-code formatting with Perltidy

I was working on an open-source Perl application recently, paying particular attention to one of the modules within. Unfortunately, the formatting left a little to be desired, with a highly idiosyncratic and inconsistent level of indentation and use of bracketing.

Not one to be put off by this, I quickly installed perltidy, and ran it against the file.

perltidy Package.pm

After a few seconds, this created a file in the same directory called Package.pm.tdy, with various changes made to the formatting.

The manpage showed it had a huge number of options, allowing one to choose from various different styles. For the most part, I was happy with the defaults. Although when coding C or LPC I prefer 3-space indents rather than two, Perl’s frequent use of block early-returns and flow modifiers like last and next means it makes sense to outdent them slightly. It’s easier to read such outdents when using 4-space indents.

My chosen options, in the end, looked like this:
-b – inplace tidying, saves the original file to .bak, rather than creating a newly-styled file with .tdy extension.
-ce – cuddle elses – the default places else on a new line after the previous closing brace, which allows closing-side comments, but disrupts the flow of the if statement.
-syn – do a syntax check with Perl while tidying
-okw – outdent keywords like next and last
-csc – enable closing-side comments – comments after the closing brace of a long sub or conditional statement.
-csci=12 – minimum number of lines in a block to add closing-side comment – the default is 6.

perltidy -b -ce -syn -okw -csc -csci=12 Package.pm

The problem now is how to check my changes back in without seriously upsetting the package maintainer; almost every line in the package has been changed, so the patch will be practically impossible for him to verify. Oh, well.

Object-Oriented Perl: Inheritance, and why it's not Java

I’m quite a fan of both Java and Perl, for quite opposite reasons; Java is very strict, syntactically, and is rather good in collaborative environments as a result, where Perl is entirely the opposite of strict (even with use strict; in operation), and is therefore a delight to program in.

One thing they share is the ability to work with objects. However, this gives rise to an interesting difference: Java is object-oriented, where Perl is not.

In Java, you call a “function” by using its name, just like in any other language.

// declare the function
double square(int x){
   return x * x;
}


//somewhere else, we call it
double y = square(x);

Ditto, Perl:

# declare the sub
sub square {
   my $x = shift;
   return $x * $x;
}

# somewhere else, we call it
my $y = square($x);

All very well, until we get into inheritance.

In Java, if the function “square” is declared in a superclass of the calling code, the above snippet will still work perfectly well; if we run double y = square(x); in code in class Banana, it will happily execute square() as if it were a local function.

Perl, on the other hand, is fundamentally a procedural language, so it assumes that function calls as shown above are internal to the current package. This means that if the sub “square” as shown above is in a superclass of the calling code, it won’t work.

This is where we have to _tell_ Perl that it’s using objects.

Firstly, some background. Inheritance in Perl is pretty easy: you create the inheriting package in the usual way, and you populate the @ISA array (pronounced “is-a”) with a list of packages to inherit, with the highest priority/affinity first. Yes, Perl supports multiple inheritance, which Java does not. As an example:

package Banana;
our @ISA = ('Aardvark');

If package Banana contains the line of code my $y = square($x);, it will cause an error on execution, because Perl doesn’t know that square should be treated as an OO function, but rather it looks in the current scope for that function.

Now, if the calling code above is in a _different_ package, and contained an object reference to our package Banana in a variable $b, then we could get it to work by doing this:
my $y = $b->square($x);

Interestingly, this will work whether square() is implemented in Aardvark or Banana. Go figure.

So, to get over this, we call the function via an object reference. In Perl, when an instance method is called (e.g. $b->square($x);), the first parameter to the function is the object reference. Conventionally, this is often written into a variable called $self.

If we need to call another function from that one, while retaining OO techniques, we use $self->otherfunction(). This solves our problem with inheritance, and is not a million miles removed from Java’s this keyword, although it is far from implicit; remember, Java is fundamentally object-oriented, where Perl needs to be reminded.
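To see the difference between the two kinds of call side by side, here is a single-file sketch. The plain_call and method_call subs are hypothetical, added only to show one call failing and the other succeeding; the packages sit in bare blocks so no separate .pm files are needed.

```perl
use strict;
use warnings;

{
    package Aardvark;
    sub square {
        my ($self, $x) = @_;
        return $x * $x;
    }
}

{
    package Banana;
    our @ISA = ('Aardvark');

    sub new {
        my ($class) = @_;
        return bless {}, $class;
    }

    # Plain function call: Perl looks for square() in package Banana only.
    sub plain_call {
        my ($self, $x) = @_;
        return square($self, $x);
    }

    # Method call: Perl searches Banana first, then everything in @ISA.
    sub method_call {
        my ($self, $x) = @_;
        return $self->square($x);
    }
}

package main;

my $banana = Banana->new;

# The plain call dies at runtime, so trap it with eval.
my $plain_ok = eval { $banana->plain_call(3); 1 };
print "plain call failed: $@" unless $plain_ok;

print $banana->method_call(3), "\n";    # prints 9
```

The plain call fails because no square() exists in Banana; the method call succeeds because method lookup follows @ISA.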

Code samples:

# Aardvark.pm (each package lives in its own file, as the "use" lines imply)
package Aardvark;

sub square {
    my($self, $x) = @_;
    return $x * $x;
}

1;

# Banana.pm
package Banana;

use Aardvark;
our @ISA = ('Aardvark');


sub new {
    my $type=shift;
    return bless {}, $type;
}

# test calling an inherited instance method
sub printstuff {
    my $self=shift;
    print $self->square(5.2) . "\n";
}

1;

# the main script
package main;

use Banana;
my $b = Banana->new();

# see if square works when called from Banana
$b->printstuff();

# see if square works when called from here, via Banana
print $b->square(1.5) . "\n";