Intermediate Perl Workshop

More I/O Fun

Getting Input from the User

We've all been prompted for input from computers. It's very common in some contexts. Perl isn't, generally speaking, one of them. But you'll still find that it can be handy to ask the user for input from time to time.

The way you do this in Perl is <STDIN>. (Nota bene: We'll see the line-input operator (<>) again shortly. And we'll learn what it really means, too.) This retrieves input from the user via "STDIN". This is C/UNIX/Perl-speak for "standard input" (get it?), which is usually the keyboard. (Technically, you could change this if you're sufficiently inclined to do so. But honestly, are you? I didn't think so.) You'll want to assign the result of this operation to a variable, of course, since you are probably planning to do things to it.

Chomp!

One quirk of most input methods is that they result in strings that end in newline characters (\n, generally). Most of the time, we don't want these as they are only used to break up the input (be it a file or something a user has typed). Wouldn't it be nice to have a way to get rid of these newlines?

Well, now you can! With the handy-dandy chomp() function! (By the way, chomp() is in serious contention for the best-named built-in function in Perl.) The way it works is this: chomp($my_variable_to_be_chomped); If the string passed to chomp() ends in a newline, it removes the newline. If it does not end in a newline, chomp() sits back and thinks to itself, "Man, I'm so good that the newlines are running away before I even get to them!" and does nothing to the string. So after the chomping, the variable will have lost its trailing newline, if it had one. Note that you don't assign the result of chomp() to anything; it alters the variable in situ. (So if you want a copy of the variable with the newline, either don't chomp() it, or create a backup copy first. (There's another option, but it's a bit silly. Still, can you see what it is?))
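
Here's a minimal sketch of chomp() at work. To keep it runnable without anyone at the keyboard, it assumes a made-up string standing in for what <STDIN> would hand back; the variable names are just for illustration.

```perl
use strict;
use warnings;

# Stand-in for a line typed at the keyboard. In a real program this would be:
#   my $name = <STDIN>;
my $name = "Ada\n";

# chomp() edits $name in place. What it returns is the number of
# characters it removed, not the chomped string.
my $removed = chomp($name);
print "Hello, $name! (chomp removed $removed character)\n";

# Chomping again is harmless: there is no newline left to remove.
my $removed_again = chomp($name);
print "Second chomp removed $removed_again characters\n";
```

The return value is why you don't write $name = chomp($name); that would leave you with a count rather than your string.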

Command-Line (Invocation) Arguments

There's another way to get data into a Perl program. (Well, and a third: read a file in. See below.) That's to call the program with arguments in the command line. There are a number of uses for this, particularly when it comes to performing system tasks. (For example, renaming a directory of files.)

The way you pick up command-line arguments in Perl is with the special variable @ARGV. (C-folks: look familiar? But be careful...there's a subtle difference here!) This array contains the command-line arguments passed to Perl in order. So it starts with $ARGV[0] and goes until it runs out of arguments. (C-folks again: see the difference? If not, look carefully where the respective indexings begin for the arguments.)

An almost-functional example would be the following:

      
	foreach my $filename (@ARGV)
	{
	  # Code that does something to each file goes here.
	  print "Working on $filename\n";
	}
      
    

This would loop over every argument and do the tasks to each of them. This can be extremely handy, needless to say. (It's even handier when you learn that when you invoke the program with an argument like *.dat, the shell (on UNIX-like systems, at least) first expands the wildcard (this is called "globbing") so that the list of arguments that Perl sees isn't "*.dat", but rather all of the files ending in ".dat". How's dat for nifty?)
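
A small sketch of poking at @ARGV follows. It assumes made-up file names (first.dat and second.dat), which it fills in only when the program is run with no arguments, so that it does something sensible either way.

```perl
use strict;
use warnings;

# If no arguments were given, pretend the program was run as
#   perl list_args.pl first.dat second.dat
# (the script and file names are invented for this sketch).
@ARGV = ("first.dat", "second.dat") unless @ARGV;

# $0 holds the program's own name; $ARGV[0] is the first real argument.
print "Program: $0\n";
print "Number of arguments: ", scalar(@ARGV), "\n";

my @seen;
foreach my $filename (@ARGV)
{
  push(@seen, $filename);
  print "Would process $filename\n";
}
```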

Opening and Closing Files

The next step is to talk about file I/O. But first, we need to learn how to open and close files. The short answer is that you use open() and close(). The latter is very easy to use, as we'll see in a moment. The former is slightly more involved.

To open a file, you'll need to pass open() two parameters. The first is the filehandle. A filehandle is the internal Perl reference for this file which will be used until the file is closed. Filehandles don't start with a leading special character (unlike scalars, arrays, and hashes). It seems to be tradition to use all capitals for them, as well. I tend to go with names like INFILE (an input file), OUTFILE (output file), or ORBITAL_DATA (guess). Once the file is open, you'll always use this filehandle to work with the file. (Whether you're reading from or writing to it.)

The second parameter to open() is a string that contains two things. The first tells Perl if you're reading or writing with the file. To read, begin with <. To write, use > instead. (Careful: > wipes out whatever was already in the file.) You can view these as arrows sending the data either to the filehandle from the file (reading the file) or from the filehandle to the file (writing to the file). For our purposes, that's all you'll do. IDL lets you do both to a file at once; Perl does have read-write modes (+< for one), but they're fiddly, so if you want to read a file then write to it, do this in two steps. (There will be an exercise about this.)

The rest of the string is the file's name in the operating system. It can be an absolute path name (/home/origins/weissj/public_html/Perl/index.html) or a relative one (index.html). Which you use will depend on your needs at the time. A full open statement would look like:

      
	open(INFILE, "< index.html") or die "Can't open index.html: $!";
      
    

which opens the file for reading. To open it for writing instead,

      
	open(OUTFILE, "> index.html") or die "Can't write index.html: $!";
      
    

To close the file, close(FILEHANDLE) is all you need. If you fail to close a file by the end of the code, Perl closes it for you. However, this is poor practice as you're relying on Perl to do your work and because if you add to the code later, you might run into unexpected trouble. As a rule, explicitly close the file as soon as you're done with it. This way, you will always know where you stand with the file. (If you add to the code and need to work with the file more later, at least you'll get a nice error message if you forget to move the close statement. Failing to close a file you thought was no longer open won't produce an explicit error; it just might do things you didn't expect.)

Oh, one other option for opening files exists. >> opens a file for appending. I don't think that I've ever used this, and you can get the same effect with one read-pass and one write-pass, but >> is more convenient. And that's what Perl is all about, really.

Reading from a File

Remember the diamond operator? (<>) It's time to dust it off. When you put a file handle inside of it, it reads the next line of the file. (Provided the file is opened for reading in, of course.) As with STDIN (which you might now recognize as essentially just a filehandle as far as Perl knows), this should generally be assigned to a variable.

Sometimes you'll know how many lines to read from a file. But generally, you won't. So what to do? Use a while-loop!

      
	open(INFILE, "< my_input_file.dat") or die "Can't open my_input_file.dat: $!";
	my $line;
	while($line = <INFILE>)
	{
	  chomp($line);
	  # Do all kinds of cruel things to the line, for example:
	  print "Read: $line\n";
	}
	close(INFILE);
    

As long as the diamond operator returns something (even a blank line, which will end in a newline), the while-loop continues. As soon as the end of the file has been reached, the diamond operator doesn't return anything (or, rather, it returns undef, which is false) and the while exits. (Perl quietly tests the assignment in this kind of loop for definedness rather than plain truth, so a final line consisting of just "0" won't end the loop early.) You can't ask for a cleaner system than this! (By the way, this is one of the only times I break my usual rule about only doing one thing at a time. In this case, I'm both evaluating the looping condition and assigning the value of $line. You could get around this, if you wanted to, but this is a fairly clean system and I think one can live with it. Still, in general, don't do this sort of thing with other loops.)

Oh, since you'll be getting a newline at the end of each line (except the last one, maybe), I've added a chomp() to the loop. If you want the newline (and you might), don't include that. But usually you'll want it there.

Writing to a File

The flip side of reading from a file is, naturally, writing to said file. This is even easier. You just add the filehandle to your print statement: print OUTFILE "String that I want to print\n"; Or perhaps printf(OUTFILE "Format String with an integer: %2d\n", $my_integer);. (Note that there is no comma after the filehandle.) The newline at the end of the strings is so the printed line is on its own, well, line. If I don't do that, the next thing printed will get stuck right up to the back of what I just printed. This might be what I want, but it usually isn't.
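
To tie the file pieces together, here's a hedged round-trip sketch: write a file with >, add to it with >>, then read it back with a while-loop. The one thing not covered in this workshop is File::Temp, a module that ships with Perl; it's used here only to get a throwaway file name so the sketch can't clobber anything real.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file; tempfile() hands back an open handle and the file's name.
my ($tmp_fh, $path) = tempfile();
close($tmp_fh);

# ">" opens for writing and starts the file afresh.
open(OUTFILE, "> $path") or die "Can't write $path: $!";
print OUTFILE "first line\n";
printf OUTFILE "value: %2d\n", 7;
close(OUTFILE);

# ">>" opens for appending, so the earlier lines survive.
open(OUTFILE, ">> $path") or die "Can't append to $path: $!";
print OUTFILE "appended line\n";
close(OUTFILE);

# "<" opens for reading; collect the chomped lines.
open(INFILE, "< $path") or die "Can't read $path: $!";
my @lines;
my $line;
while($line = <INFILE>)
{
  chomp($line);
  push(@lines, $line);
}
close(INFILE);
unlink($path);    # tidy up the scratch file

print scalar(@lines), " lines read\n";
```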

Regular Expressions

What are regular expressions? Basically, regular expressions are a way of coding matches to strings. It's trivial in Perl to check if two strings are identical (see last week), but in real life we're seldom testing that. We want to know if a given bit of text exists inside of a bigger block of text, perhaps with the possibility of slight differences in spelling and punctuation ("gray" versus "grey", for example). We want to know if a string passed to us matches certain criteria; is it a valid email address or URL, for example. That's the power of regular expressions.

What's so regular about regular expressions? They eat more bran than normal. Next question.

Basics of Regular Expressions: Simple Matching

We're going to start with simple matching to give us a concrete way of playing with regular expressions. To check to see if there is a matching substring in a string with regular expressions, use:

      $my_string =~ m/Regular Expression/

The main result of this is either a true or false value. Now to break it down a bit.

The =~ is not an assignment (so be careful); it's called the "binding operator". We'll always use it (or a slight variation on it we'll see in a bit) with regular expressions, with one notable exception that I'll mention in a second. The m is not, strictly speaking, necessary for matches because Perl automatically assumes you mean m if there is nothing in that spot and you used slashes as the delimiter. (See below.) Still, going along with my philosophy of "be explicit", I'd say always use it. Finally, the slashes are how you delimit regular expressions. Well, that's the normal way, actually. Perl is smart and can handle other delimiters there. Just remember that you have to start and end with the same single symbol (or a matched pair of brackets, like m{...}). (Why would you ever use different symbols? Well, you might have slashes in your regular expression. In that event, you could use another symbol and not have to escape the slash. That said, I always use slashes and learn to live with escaping.)

What other binding could you use? Well, for matches (and only for matches, with the m// format) you can use !~ which means "does NOT match". It's the same as using the normal binding operator and negating the whole expression. In fact, that's what I usually do, out of sheer habit.
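
A quick sketch of both ideas: a different delimiter to dodge slash-escaping, and !~ versus a negated =~. The sample path is made up, and # is just one possible choice of delimiter.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";   # made-up sample string

# The same pattern twice: once with escaped slashes, once with # as delimiter.
my $with_slashes = ($path =~ m/\/usr\/local/) ? 1 : 0;
my $with_hashes  = ($path =~ m#/usr/local#)   ? 1 : 0;

# !~ is true when the pattern does NOT match...
my $no_drive = ($path !~ m/C:/) ? 1 : 0;
# ...which is the same as negating an ordinary match.
my $negated  = (!($path =~ m/C:/)) ? 1 : 0;

print "$with_slashes $with_hashes $no_drive $negated\n";
```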

Oh, the same special character escapes from double-quoted strings work in regular expressions. Things like \n, \t, and so forth. So you can match newlines and all of that stuff. (Spaces are just spaces in regular expressions, in case you are wondering. Convenient.)

Well, that's fine and dandy, is it not? Except that I haven't taught you regular expressions at all! So I should probably do that. The simplest regular expression is just the string containing what you want to match. That would be like an eq, except you don't need to match the entire string. For example:

      
	my $my_string = "It was the best of times, it was the blurst of times";

	if($my_string =~ m/best/) {print "It really was the best of times!\n";}
	if($my_string =~ m/worst/) {print "It really was the worst of times!\n";}
      
    

What would this print out? (See Answer 1 at the end of the page.)

But, naturally, it gets so much better.

Wildcards

Sometimes you won't know what character you want to match. The generic wildcard character in Perl is the dot (.). A dot will match any character (with the exception of the newline) in that place in your regular expression. Versatile little bugger, isn't it? The one problem you might foresee is that you might really want a dot in your string. (Actually, I can suggest a few other "problems", although they're all about being more picky. Regular expressions allow for that, though.) If you want the dot, guess what you do? Yep: escape it as \.
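
Here's a small sketch of the difference between the wildcard dot and an escaped, literal dot. The three file-ish names are invented for the example.

```perl
use strict;
use warnings;

my %wildcard;   # did m/data.txt/ match?
my %literal;    # did m/data\.txt/ match?

foreach my $name ("data.txt", "data-txt", "datatxt")
{
  $wildcard{$name} = ($name =~ m/data.txt/)  ? 1 : 0;   # . is any one character
  $literal{$name}  = ($name =~ m/data\.txt/) ? 1 : 0;   # \. is a real dot only
  print "$name: wildcard=$wildcard{$name} literal=$literal{$name}\n";
}
```

Note that the dot insists on exactly one character being there, which is why "datatxt" fails both tests.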

Groupings and Alternatives

Sometimes, you want to group bits of text together. Why? Things like quantifiers will then treat the whole group of text as a block rather than the previous character. For now, know that parentheses group text: (my_group) will now be treated as a block.

Also, you might want to offer alternatives to matches. For example, you want to match the word "gray", but the silly British spelling is "grey". What to do? Offer two separate checks? Never! The pipe, |, offers alternative matches. For example, gr(a|e)y would match "gray" or "grey". So there is trans-Atlantic happiness!
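
The following sketch shows why the parentheses matter with alternatives. Without them, the pipe splits the whole pattern in two, so gra|ey means "gra" or "ey". The test words are made up.

```perl
use strict;
use warnings;

my %grouped;     # results for m/gr(a|e)y/
my %ungrouped;   # results for m/gra|ey/ (no parentheses)

foreach my $word ("gray", "grey", "groy", "obey")
{
  $grouped{$word}   = ($word =~ m/gr(a|e)y/) ? 1 : 0;
  $ungrouped{$word} = ($word =~ m/gra|ey/)   ? 1 : 0;
  print "$word: grouped=$grouped{$word} ungrouped=$ungrouped{$word}\n";
}
```

The ungrouped version happily matches "obey", which is probably not what anyone wanted.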

Quantifiers

You'll also frequently want to specify how often a bit of regular expression occurs. For example, what if you want as many iterations of a sequence as available? There are a few quantifiers available:

+
Specifies that you want one or more of the previous character or group.
*
Like +, but 0 matches is also an option.
?
Specifies 0 or 1 matches of previous character/group.
{m,n}
Matches between m and n of the previous character/group, where m and n are non-negative integers. If you leave off a value for n, you'll get a match with a lower limit of repetitions (m), but no upper limit on what's OK. (To get the inverse, m=0 works.) If you have just one value and no comma, you specify an exact number of repetitions. (No fewer and no more than the integer given.) Not only can this save you typing a given character a lot, you'll see in a minute that you can do things like specify "word of a specific length".
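
The quantifiers above can be tried out with a sketch like this one. The test strings and the hash key names are invented; each entry records whether the match succeeded.

```perl
use strict;
use warnings;

my %ok;
$ok{plus_some}  = ("gooal"   =~ m/go+al/)     ? 1 : 0;   # one or more o's
$ok{plus_none}  = ("gal"     =~ m/go+al/)     ? 1 : 0;   # + needs at least one
$ok{star_none}  = ("gal"     =~ m/go*al/)     ? 1 : 0;   # * is happy with zero
$ok{optional}   = ("color"   =~ m/colou?r/)   ? 1 : 0;   # ? makes the u optional
$ok{range_in}   = ("gooal"   =~ m/go{2,3}al/) ? 1 : 0;   # two o's is in range
$ok{range_out}  = ("gooooal" =~ m/go{2,3}al/) ? 1 : 0;   # four o's is too many
$ok{open_ended} = ("gooooal" =~ m/go{2,}al/)  ? 1 : 0;   # no upper limit
$ok{exact}      = ("goooal"  =~ m/go{2}al/)   ? 1 : 0;   # exactly two, not three
$ok{group}      = ("hahaha"  =~ m/(ha){3}/)   ? 1 : 0;   # quantifier on a group

foreach my $name (sort keys %ok)
{
  print "$name: $ok{$name}\n";
}
```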

Character Classes

A character class is a list of possible characters in a given position. You denote this with square brackets ([]). For example, [aeiouy] would match any vowel. Combine this with quantifiers, and things get even more powerful; what would [aeiouy]+ match? (See Answer 2 at the end of the page.)

You can negate a character class by putting a caret (^) right after the opening bracket. [^aeiouy] matches non-vowels, for example. (Not necessarily consonants. Why not?)

Of course, typing all of the digits or all of the letters would get repetitive. So there are shortcuts! All shortcuts are letters which are escaped with a backslash. For example, \w matches any "word" character. (By the way, a "word" character is any letter, no matter the case, a digit, or an underscore.) \d matches digits (0-9) and \s matches whitespace (newline, tab, carriage return, form-feed, and space). (By the way, ranges are also allowed inside the square brackets. [a-z] matches all lowercase letters. [0-4] matches the first 5 digits.)

You can negate a character class a bit more cleanly by just capitalizing the letter. \D is a non-digit, for example.
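
A short sketch of character classes, ranges, and the backslash shortcuts in action. As before, the test strings and hash keys are made up for illustration.

```perl
use strict;
use warnings;

my %ok;
$ok{vowel}      = ("sky"     =~ m/[aeiouy]/)   ? 1 : 0;   # y counts here
$ok{non_vowel}  = ("42"      =~ m/[^aeiouy]/)  ? 1 : 0;   # a digit is a non-vowel
$ok{digits}     = ("Room 42" =~ m/\d+/)        ? 1 : 0;
$ok{no_digits}  = ("Room"    =~ m/\d/)         ? 1 : 0;
$ok{space}      = ("a b"     =~ m/\s/)         ? 1 : 0;
$ok{underscore} = ("a_b"     =~ m/\W/)         ? 1 : 0;   # _ is a word character
$ok{range_hit}  = ("x3"      =~ m/[a-z][0-4]/) ? 1 : 0;
$ok{range_miss} = ("x7"      =~ m/[a-z][0-4]/) ? 1 : 0;   # 7 is outside 0-4

foreach my $name (sort keys %ok)
{
  print "$name: $ok{$name}\n";
}
```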

Anchors

You can also specify where in a string a match is allowed to be made. For example, ^ (when not inside the character class square brackets) and $ match the beginning and the end of the string, respectively. (Why won't $ be confused with part of a scalar?) So m/^John Weiss$/ matches my name if that's the entire string. I've used m/^\s*/ a lot to account for the possibility of blank space at the start of a string.

(Subtle point: $ actually matches either at the very end of the string or just before a newline at the end. This could be important to you, but odds are not. I've never found myself worried about it.)

The other form of anchor is the word boundary, \b. This matches the boundaries between words (duh). But do note that this isn't always where you expect it to be. For one thing, apostrophes count as boundaries. Oops!
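
Here's a sketch of the anchors, including the two gotchas: $ tolerating a trailing newline, and the apostrophe acting as a word boundary. The test strings other than the name above are invented.

```perl
use strict;
use warnings;

my %ok;
$ok{whole}      = ("John Weiss"     =~ m/^John Weiss$/) ? 1 : 0;
$ok{extra}      = ("Dr. John Weiss" =~ m/^John Weiss$/) ? 1 : 0;   # ^ forbids the "Dr. "
$ok{newline}    = ("John Weiss\n"   =~ m/^John Weiss$/) ? 1 : 0;   # $ allows a final newline
$ok{word}       = ("the cat sat"    =~ m/\bcat\b/)      ? 1 : 0;
$ok{inside}     = ("concatenate"    =~ m/\bcat\b/)      ? 1 : 0;   # cat is buried mid-word
$ok{apostrophe} = ("don't"          =~ m/\bt\b/)        ? 1 : 0;   # ' splits off the t

foreach my $name (sort keys %ok)
{
  print "$name: $ok{$name}\n";
}
```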

Replacements with Regular Expressions

OK, so you've read the above. You know how to write a regular expression and you can generally read them. And matching is fun, but you crave more from life. You want to run, to skip, to search-and-replace out in the open air! Well, we can half satisfy that urge.

Search-and-replace in regular expressions is fairly easy. You just replace the m (the bit right after the binding operator) with an s to start. Then you provide a second argument to the regular expression, the replacement text. Thusly:

      
	$my_string =~ s/old text/new text/;
      
    

Of course, the new text won't use much regular expressiony goodness. Why not? Because it makes no sense to use a wildcard or a quantifier since you need to be assertive and definite about what you want to do. However, you can use all of the coolness that is regular expressions for the old text. And you can't beat that with a stick. (No, really, you can't. It's an abstract concept and therefore unbeatable with current stick technology.)
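
A minimal search-and-replace sketch, with made-up sample strings. It also peeks at what s/// hands back, which is the number of replacements made (and something false when nothing matched).

```perl
use strict;
use warnings;

# Only the first match is replaced by default.
my $text  = "The colour of the colour wheel";
my $count = ($text =~ s/colou?r/color/);
print "$text ($count replacement)\n";

# The old-text side can use anchors, classes, and quantifiers.
my $padded = "   indented";
$padded =~ s/^\s+//;
print "[$padded]\n";

# No match means no change, and a false return value.
my $untouched = "no match here";
my $hits = ($untouched =~ s/xyz/abc/);
print "Nothing replaced\n" unless $hits;
```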

However, there are fun things that you can do, like remembering what you just matched...

Memory Variables

So here's the score: you've put together a great regular expression to match a bit of text. And now you want to either replace it, but with something based on the matched text, or manipulate the matched bit. The trouble is that you used character classes, wildcards, quantifiers, and everything else you could think of. (Duct tape was only passed over because it's so difficult to trim it down to the size of electrons for use in computers.) Gosh, it's a pity that Perl just throws away that information, isn't it?

Ha-ha! Fooled you! Perl doesn't throw that away at all! (Perl wouldn't do that! Perl is watching out for you, like a mafioso godfather. Only not so evil or keen on oranges.) Perl stores the values in each set of parentheses in variables named \1, \2, and so forth. The first set of parentheses is 1, the second is 2, and on and on. As if that wasn't pretty intuitive. These variables are available inside of a match or a replacement string as soon as they've been matched. (In the replacement string, Perl prefers the $1 form described next; \1 still works there, but it draws a warning if you have warnings turned on.) This can be handy in matches (so that you can check to be sure that both delimiters match, even if you don't know what delimiter is in the text) and in replacements (you can replace all of the text around a given string and keep that string intact).

More often, I've found that I need a match after the regular expression has done its thing. In this case, the variables change appearance just slightly to $1, $2, and so on. You know what they are and who is who. It does get a wee bit more difficult, though. These variables are only available after the match has been successfully realized in full. So just because what was in the first set of parentheses matched, it doesn't mean that $1 has been set. (In fact, you should pretty much always put matches in if-conditions and only use the memory variables inside the if-block.)

One more potential snag for you: if you have nested parentheses in your regular expression, be careful about which match variable is which. The best way to work it out is to just count the opening parentheses. The match variables will match that order.

Why did I put this off until this section, rather than do it with simple matching? Because, while you can use memory variables with simple matching, they are a lot more useful with search-and-replace. You can see why: you can reformat the text around a particular string, for example, and leave that string intact. In fact, I don't think that I've ever used \1 inside of a simple match; I have had much cause to use it with search-and-replace.
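
The memory variables can be sketched as follows. The date and the quoted strings are made-up sample data, and the sketch uses $1-style variables in the replacement text (an assumption about style, chosen because it keeps Perl's warnings quiet).

```perl
use strict;
use warnings;

my $date = "2024-03-15";   # made-up sample data
my ($year, $month, $day);

# Only trust $1, $2, $3 inside the if-block of a successful match.
if($date =~ m/(\d{4})-(\d{2})-(\d{2})/)
{
  ($year, $month, $day) = ($1, $2, $3);
}

# Rearrange the remembered pieces in a replacement.
my $us_style = $date;
$us_style =~ s/(\d{4})-(\d{2})-(\d{2})/$2\/$3\/$1/;

# \1 inside the pattern: the closing quote must be the same character
# as the opening one, whichever it was.
my $inner      = ("say \"hello\" now" =~ m/(["'])(.*)\1/) ? $2 : "";
my $mismatched = ("say \"hello' now"  =~ m/(["'])(.*)\1/) ? 1 : 0;

# Nested parentheses are numbered by counting opening parentheses.
my @nested;
if("abc" =~ m/((a)(b))c/)
{
  @nested = ($1, $2, $3);
}

print "$month/$day/$year, $us_style, $inner, @nested\n";
```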

Other Match Variables

There are three other memory variables that you might want to know about. (Mind you, I've never used them myself. But whatever.) They are $`, $&, and $'. These store everything in the string before the part that matched the entire regular expression, the part that matched the entire regular expression, and then everything after what matched the entire regular expression. So $`$&$' is the entire original string.
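
For the curious, a tiny sketch of these three variables, using a shortened version of the earlier sample sentence. The variable names $before, $match, and $after are just labels chosen for this example.

```perl
use strict;
use warnings;

my $my_string = "It was the best of times";
my ($before, $match, $after) = ("", "", "");

if($my_string =~ m/best/)
{
  # Copy them right away; the next successful match overwrites them.
  ($before, $match, $after) = ($`, $&, $');
}

print "before: [$before]\n";
print "match:  [$match]\n";
print "after:  [$after]\n";

# Gluing the three pieces together rebuilds the original string.
my $rebuilt = $before . $match . $after;
print "Rebuilt OK\n" if $rebuilt eq $my_string;
```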

Modifiers

Like with most things, it's possible to pass along other specifications about how you want matches to be carried out. These apply to both search-and-replace regular expressions and to matching regular expressions. They go right after the final solidus and you can use none of them or as many of them as you like, with order not mattering.

i
Case insensitive matching. Capital and lowercase letters are interchangeable.
g
Globally replace all instances of a pattern. By default, a search-and-replace only replaces the first instance it encounters. With this modifier, Perl keeps going and replaces every instance it can find in the string.
s
Turns the dot (.) into a match for all characters including the newline.
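
Here's a sketch that tries each of the three modifiers, plus two of them at once. The sample strings are invented.

```perl
use strict;
use warnings;

# i: case no longer matters.
my $case = ("PERL rocks" =~ m/perl/i) ? 1 : 0;

# g: replace every instance instead of only the first.
my $first = "a-b-c";
$first =~ s/-/+/;
my $all = "a-b-c";
$all =~ s/-/+/g;

# s: the dot is allowed to match a newline.
my $two_lines = "one\ntwo";
my $dot_plain = ($two_lines =~ m/one.two/)  ? 1 : 0;
my $dot_s     = ($two_lines =~ m/one.two/s) ? 1 : 0;

# Modifiers combine, in any order.
my $both = "Abracadabra";
$both =~ s/a/o/gi;

print "$case $first $all $dot_plain $dot_s $both\n";
```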

Transliteration

That's a fancy word for switching letters. Except Perl doesn't care about the distinction between letters and everything else. But it's a pretty word, so we'll run with it. Anyway, tr/search list/replacement list/ is for "transliteration". (It can also be done with y///, which is just a synonym kept around for sed users, but that's just silly. So don't. Just forget I even said anything.) It finds items in the search list and replaces them with the corresponding item in the replacement list. (You could do this as a whole lotta s///'s, but who wants to?) The transliteration is done so that the first item in the replacement list replaces the first item in the search list, etc. If there are fewer replacement items than search items, the last replacement item is used for all of the excess search items. (There's a modifier, d, that tells Perl to delete the search items that have no corresponding replacement item rather than replace them, by the way.)

Probably the most common use of this operator is:

       $my_string =~ tr/a-z/A-Z/;
    

which capitalizes every lowercase letter. You'll see this and its inverse (tr/A-Z/a-z/) most often.
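
A short sketch of tr at work, covering the case swaps, a replacement list that is shorter than the search list, and the d modifier. The sample strings are made up.

```perl
use strict;
use warnings;

my $shout = "Hello, World";
$shout =~ tr/a-z/A-Z/;      # lowercase to uppercase

my $quiet = "Hello, World";
$quiet =~ tr/A-Z/a-z/;      # and the inverse

# Shorter replacement list: a becomes x, while b and c both become y.
my $short = "aabbcc";
$short =~ tr/abc/xy/;

# d with an empty replacement list deletes the listed characters.
my $deleted = "a-b-c";
$deleted =~ tr/-//d;

print "$shout | $quiet | $short | $deleted\n";
```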


Answers to Questions

Answer 1: It really was the best of times! and then a line break. The pattern in the second if-statement won't match, since the string says "blurst", not "worst".

Answer 2: Any run of one or more vowels in a row. (So a "word" of at least one letter made up of only vowels, wherever it turns up in the string.)