We've all been prompted for input from computers. It's very common in some contexts. Perl isn't, generally speaking, one of them. But you'll still find that it can be handy to ask the user for input from time to time.
The way you do this in Perl is <STDIN>.
(Nota bene – We'll see the line-input operator
(<>) again shortly. And we'll learn what it really means,
too.) This retrieves input from the user via "STDIN". This is
C/UNIX/Perl-speak for "standard input" (get it?), which is usually
the keyboard. (Technically, you could change this if you're
sufficiently inclined to do so. But honestly, are you? I didn't
think so.) You'll want to assign the result of this operation to
a variable, of course, since you are probably planning to do
things to it.
One quirk of most input methods is that they result in strings that end in newline characters (\n, generally). Most of the time, we don't want these as they are only used to break up the input (be it a file or something a user has typed). Wouldn't it be nice to have a way to get rid of these newlines?
Well, now you can! With the handy-dandy chomp()
function! (By the way, chomp() is in serious
contention for the best-named built-in function in Perl.) The way
it works is this: chomp($my_variable_to_be_chomped);
If the string passed to chomp() ends in a newline, it
removes the newline. If it does not end in a newline,
chomp() sits back and thinks to itself, "Man, I'm so
good that the newlines are running away before I even get to
them!" In that case, it does nothing to the string. So after the
chomping, the variable will have lost its trailing
newline, if it had one. Note that you don't assign
chomp() to anything; it alters the variable in
situ. (What chomp() hands back is the number of characters it
removed, not the cleaned-up string.) (So if you want a copy of the variable with the
newline, either don't chomp() it, or create a backup
copy first. (There's another option, but it's a bit silly.
Still, can you see what it is?))
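Putting the two pieces together, here is a minimal sketch of reading a line and chomping it. The prompt text and the variable names ($name, $with_newline, $without_newline) are made up for illustration.

```perl
use strict;
use warnings;

print "What is your name? ";
my $name = <STDIN>;

# If the input ends before a line arrives, <STDIN> gives back undef.
# Swap in an empty string so chomp() has something to work on.
$name = "" unless defined $name;

chomp($name);    # alters $name in place; there is nothing to assign
print "Hello, $name!\n";

# chomp() removes a trailing newline if there is one and
# otherwise leaves the string alone.
my $with_newline    = "orbit\n";
my $without_newline = "orbit";
chomp($with_newline);
chomp($without_newline);
print "Same after chomping\n" if $with_newline eq $without_newline;
```

The guard against undef is an addition of this sketch; for quick interactive scripts you will often see it left out.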
There's another way to get data into a Perl program. (Well, and a third: read a file in. See below.) That's to call the program with arguments in the command line. There are a number of uses for this, particularly when it comes to performing system tasks. (For example, renaming a directory of files.)
The way you pick up command-line arguments in Perl is with the
special variable @ARGV. (C-folks: look familiar?
But be careful...there's a subtle difference here!) This array
contains the command-line arguments passed to Perl in order. So
it starts with $ARGV[0] and goes until it runs out of
arguments. (C-folks again: see the difference? If not, look
carefully where the respective indexings begin for the
arguments.)
An almost-functional example would be the following:
foreach my $filename (@ARGV)
{
    # Code that does something to the file named $filename goes here.
}
This would loop over every argument and perform the task on each one. This can be extremely handy, needless to say. (It's even handier when you learn that when you invoke the program with an argument like *.dat, the shell (on UNIX-like systems, at least) first expands the wildcard (this is called "globbing") so that the list of arguments that Perl sees isn't "*.dat", but rather all of the files ending in ".dat". How's dat for nifty?)
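To see exactly what ends up in @ARGV, a small sketch like the following can help. The helper name label_arguments is hypothetical (it exists only so the loop can be tried on any list); nothing like it is built into Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: label each argument with its index in @ARGV.
sub label_arguments {
    my @arguments = @_;
    my @labels;
    my $index = 0;
    foreach my $argument (@arguments) {
        push(@labels, "\$ARGV[$index] is $argument");
        $index++;
    }
    return @labels;
}

foreach my $label (label_arguments(@ARGV)) {
    print "$label\n";
}
print "No arguments were given.\n" if @ARGV == 0;
```

Saved under a made-up name such as show_args.pl and run as perl show_args.pl *.dat, this would print one line per matching file, starting from index 0.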
The next step is to talk about file I/O. But first, we need to
learn how to open and close files. The short answer is that you
use open() and close(). The latter is
very easy to use, as we'll see in a moment. The former is
slightly more involved.
To open a file, you'll need to pass open() two
parameters. The first is the filehandle. A filehandle is the
internal Perl reference for this file which will be used until the
file is closed. Filehandles don't start with a leading special
character (unlike scalars, arrays, and hashes). It seems to be
tradition to use all capitals for them, as well. I tend to go
with names like INFILE (an input file),
OUTFILE (output file), or ORBITAL_DATA
(guess). Once the file is open, you'll always use this filehandle
to work with the file. (Whether you're reading from or writing to
it.)
The second parameter to open() is a string that
contains two things. The first tells Perl if you're reading or
writing with the file. To read, begin with <. To
write, use > instead. You can view these as
arrows sending the data either to the filehandle from the file
(reading the file) or from the filehandle to the file (writing to
the file). For our purposes, that's all you can do. IDL lets you do both
to a file at once. Perl does have a read-write mode (+<), but it's
easy to get wrong, so if you want to read a file then write to it in Perl
you should do this in two steps. (There will be an exercise
about this.)
The rest of the string is the file's name in the operating
system. It can be an absolute path name
(/home/origins/weissj/public_html/Perl/index.html) or
a relative one (index.html). Which you use will
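depend on your needs at the time. A full open statement would look
like:
open(INFILE, "< index.html") or die "Can't open index.html: $!";
which opens the file for reading. (The or die part stops the program, with the system's error message in $!, if the open fails.) To write to it,
open(OUTFILE, "> index.html") or die "Can't write index.html: $!";
Be warned that opening an existing file with > wipes out its old contents.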
To close the file, close(FILEHANDLE) is all you
need. If you fail to close a file by the end of the code, Perl
closes it for you. However, this is poor practice as you're
relying on Perl to do your work and because if you add to the code
later, you might run into unexpected trouble. As a rule,
explicitly close the file as soon as you're done with it. This
way, you will always know where you stand with the file. (If you
add to the code and need to work with the file more later, at
least you'll get a nice error message if you forget to move the
close statement. Failing to close a file you thought was no
longer open won't produce an explicit error; it just might do
things you didn't expect.)
Oh, one other option for opening files exists.
>> opens a file for appending. I don't think
that I've ever used this, and you can get the same result with one read-pass and
one write-pass, but appending is more convenient. And that's what
Perl is all about, really.
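The three modes can be tried out side by side. The following is only a sketch: it works on a scratch file made with the core File::Temp module (an assumption of this example, so that nothing real gets overwritten), and the variable names are invented. It sticks to the two-parameter open() described above; current Perl documentation recommends the three-argument form, open(my $fh, "<", $name), for new code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Make a scratch file so we don't clobber anything real.
my ($scratch_handle, $scratch_name) = tempfile();
close($scratch_handle);

# ">" opens for writing and empties the file first.
open(OUTFILE, "> $scratch_name") or die "Can't write $scratch_name: $!";
print OUTFILE "first line\n";
close(OUTFILE);

# ">>" opens for appending: the old contents survive.
open(OUTFILE, ">> $scratch_name") or die "Can't append to $scratch_name: $!";
print OUTFILE "second line\n";
close(OUTFILE);

# "<" opens for reading.
open(INFILE, "< $scratch_name") or die "Can't read $scratch_name: $!";
my $first  = <INFILE>;
my $second = <INFILE>;
close(INFILE);

print "Read back: $first";
print "Read back: $second";

unlink($scratch_name);    # tidy up the scratch file
```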
Remember the diamond operator? (<>) It's
time to dust it off. When you put a file handle inside of it, it
reads the next line of the file. (Provided the file is opened for
reading in, of course.) As with STDIN (which you
might now recognize as essentially just a filehandle as far as
Perl knows), this should generally be assigned to a variable.
Sometimes you'll know how many lines to read from a file. But generally, you won't. So what to do? Use a while-loop!
open(INFILE, "< my_input_file.dat") or die "Can't open my_input_file.dat: $!";
my $line;
while($line = <INFILE>)
{
    chomp($line);
    # Do all kinds of cruel things to the line
}
close(INFILE);
As long as the diamond operator returns something (even a blank
line, which will end in a newline), the while-loop continues. As
soon as the end of the file has been reached, the diamond operator
doesn't return anything (or, rather, it returns the undefined value,
undef, which is false) and the while exits. You can't ask for a cleaner
system than this! (By the way, this is one of the only times I
break my usual rule about only doing one thing at a time. In
this case, I'm both evaluating the looping condition and
assigning the value of $line. You could get around
this, if you wanted to, but this is a fairly clean system and I
think one can live with it. Still, in general, don't do this sort
of thing with other loops.)
Oh, since you'll be getting a newline at the end of each line
(except the last one, maybe), I've added a chomp() to
the loop. If you want to keep the newline (and you might), leave the
chomp() out. But usually you'll want the chomp() there.
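Here is the same loop as a runnable sketch. The input file is a scratch file created on the spot with the core File::Temp module, and its contents (including a blank line and a last line with no newline) are invented to show how the loop treats them.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small scratch input file (made-up contents).
my ($scratch_handle, $scratch_name) = tempfile();
print $scratch_handle "alpha\n\nbeta\ngamma";    # no newline after gamma
close($scratch_handle);

open(INFILE, "< $scratch_name") or die "Can't read $scratch_name: $!";
my $line;
my $line_count = 0;
my @lines;
while ($line = <INFILE>)
{
    chomp($line);                # harmless on the last, newline-free line
    $line_count++;
    push(@lines, $line);
    print "Line $line_count: [$line]\n";
}
close(INFILE);
unlink($scratch_name);           # tidy up the scratch file
```

The blank second line still counts as a line, because before chomping it holds a newline and is therefore true.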
The flip side of reading from a file is, naturally, writing to
said file. This is even easier. You just add the filehandle to
your print statement: print OUTFILE "String that I want to print\n";
Or perhaps printf OUTFILE "Format String with an integer: %2d\n", $my_integer;
Note that there is no comma between the filehandle and what gets
printed; put one there and Perl will object. The newline at the
end of the strings is so the outputted line is on its own, well,
line. If I don't do that, the next thing printed will get stuck
right up to the back of what I just printed. This might be what I
want, but it usually isn't.
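A short sketch of both forms, writing to a scratch file (created with the core File::Temp module, an assumption of this example) and then reading it back to check the result. The value of $my_integer is invented.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($scratch_handle, $scratch_name) = tempfile();
close($scratch_handle);

my $my_integer = 7;

open(OUTFILE, "> $scratch_name") or die "Can't write $scratch_name: $!";
print OUTFILE "String that I want to print\n";
# No comma between the filehandle and the format string.
printf OUTFILE "Format String with an integer: %2d\n", $my_integer;
close(OUTFILE);

# Read the file back to see what landed in it. Assigning <INFILE>
# to an array slurps every remaining line at once.
open(INFILE, "< $scratch_name") or die "Can't read $scratch_name: $!";
my @written = <INFILE>;
close(INFILE);
unlink($scratch_name);

print @written;
```

The %2d pads the integer to two columns, so a single digit comes out with a space in front of it.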
What are regular expressions? Basically, regular expressions are a way of coding matches to strings. It's trivial in Perl to check if two strings are identical (see last week), but in real life we're seldom testing that. We want to know if a given bit of text exists inside of a bigger block of text, perhaps with the possibility of slight differences in spelling and punctuation ("gray" versus "grey", for example). We want to know if a string passed to us matches certain criteria; is it a valid email address or URL, for example. That's the power of regular expressions.
What's so regular about regular expressions? They eat more bran than normal. Next question.
We're going to start with simple matching to give us a concrete way of playing with regular expressions. To check to see if there is a matching substring in a string with regular expressions, use
$my_string =~ m/Regular Expression/
The main result of this is either a true or false value. Now to break it down a bit.
The =~ is not an assignment (so be careful), it's
called the "binding operator". We'll always use it (or a slight
variation on it we'll see in a bit) with regular expressions with
one notable exception that I'll mention in a second. The
m is not, strictly speaking, necessary for matches
because Perl automatically assumes you mean m if
there is nothing in that spot and you used slashes as the
delimiter. (See below.) Still, going along with my philosophy of
"be explicit", I'd say always use it. Finally, the slashes are
how you delimit regular expressions. Well, that's the normal way,
actually. Perl is smart and can handle nearly any delimiter there.
Just remember that you have to start and end with the same single
symbol (or a matched pair of brackets, as in m{...}). (Why would you ever use different symbols? Well, you
might have slashes in your regular expression. In that event, you
could use another symbol and not have to escape the slash. That
said, I always use slashes and learn to live with escaping.)
What other binding could you use? Well, for matches (and only
for matches, with the m// format) you can use
!~ which means "does NOT match". It's the same as
using the normal binding operator and negating the whole
expression. In fact, that's what I usually do, out of sheer
habit.
Oh, the same special character escapes from double-quoted
strings work in regular expressions. Things like \n,
\t, and so forth. So you can match newlines and all
of that stuff. (Spaces are just spaces in regular expressions, in
case you are wondering. Convenient.)
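The pieces described so far (the binding operator, its negated form, alternate delimiters, and escapes like \t) can be sketched as follows. The strings and the flag variable names are invented for the example, and the patterns are deliberately nothing fancier than literal text.

```perl
use strict;
use warnings;

my $my_string = "Name:\tJohn Weiss";    # \t is a tab character

my $found_john = 0;
if ($my_string =~ m/John/) { $found_john = 1; }

# "Does NOT match", written two equivalent ways.
my $no_jane = 0;
if ($my_string !~ m/Jane/) { $no_jane = 1; }
my $no_jane_negated = 0;
if (!($my_string =~ m/Jane/)) { $no_jane_negated = 1; }

# The same escapes as in double-quoted strings work in the pattern.
my $found_tab = 0;
if ($my_string =~ m/Name:\tJohn/) { $found_tab = 1; }

# With slashes as delimiters, slashes in the pattern need escaping...
my $path = "/home/origins/weissj";
my $with_slashes = 0;
if ($path =~ m/\/origins\//) { $with_slashes = 1; }
# ...with another delimiter, they don't.
my $with_hashes = 0;
if ($path =~ m#/origins/#) { $with_hashes = 1; }

print "John: $found_john, no Jane: $no_jane/$no_jane_negated, tab: $found_tab\n";
print "slashes: $with_slashes, hashes: $with_hashes\n";
```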
Well, that's fine and dandy, is it not? Except that I haven't
taught you regular expressions at all! So I should probably do
that. The simplest regular expression is just the string
containing what you want to match. That would be like an
eq, except you don't need to match the entire string.
For example:
my $my_string = "It was the best of times, it was the blurst of times";
if($my_string =~ m/best/) {print "It really was the best of times!\n";}
if($my_string =~ m/worst/) {print "It really was the worst of times!\n";}
What would this print out? (See Answer 1 at the end.)
But, naturally, it gets so much better.
Sometimes you won't know what character you want to match. The
generic wildcard character in Perl is the dot (.). A
dot will match any character (with the exception of the newline)
in that place in your regular expression. Versatile little
bugger, isn't it? The one problem you might foresee is that you
might really want a dot in your string. (Actually, I can suggest
a few other "problems", although they're all about being more
picky. Regular expressions allow for that, though.) If you want
the dot, guess what you do? Yep, you escape it: \. (backslash, then dot).
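A quick sketch of the difference between the wildcard dot and the escaped dot; the candidate strings are invented for illustration.

```perl
use strict;
use warnings;

my @candidates = ("3.14", "3x14", "3\n14");
my @wildcard_matches;
my @literal_matches;

foreach my $candidate (@candidates) {
    # A bare dot stands for any single character except a newline.
    push(@wildcard_matches, $candidate) if $candidate =~ m/3.14/;
    # An escaped dot stands only for a real dot.
    push(@literal_matches, $candidate) if $candidate =~ m/3\.14/;
}

print "Wildcard dot matched: @wildcard_matches\n";
print "Escaped dot matched: @literal_matches\n";
```

The wildcard accepts both "3.14" and "3x14" but not the string with a newline in the middle; the escaped dot accepts only "3.14".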
Sometimes, you want to group bits of text together. Why?
Things like quantifiers will then treat
the whole group of text as a block rather than the previous
character. For now, know that parentheses group text:
(my_group) will now be treated as a block.
Also, you might want to offer alternatives to matches. For
example, you want to match the word "gray", but the silly British
spelling is "grey". What to do? Offer two separate checks?
Never! The pipe, | offers alternative matches. For
example, gr(a|e)y would match "gray" or "grey". So
there is trans-Atlantic happiness!
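The gray/grey pattern can be tried directly, and the sketch below also hints at why the parentheses matter: without them, the pipe splits the whole pattern in two. The test words are made up.

```perl
use strict;
use warnings;

my @spellings = ("gray", "grey", "groy", "gry");
my @accepted;
foreach my $spelling (@spellings) {
    push(@accepted, $spelling) if $spelling =~ m/gr(a|e)y/;
}
print "Accepted: @accepted\n";

# Without the grouping, the alternatives are "gra" and "ey",
# so a word that merely contains "ey" sneaks through.
my $ungrouped = ("monkey" =~ m/gra|ey/)    ? 1 : 0;
my $grouped   = ("monkey" =~ m/gr(a|e)y/)  ? 1 : 0;
print "monkey, ungrouped: $ungrouped, grouped: $grouped\n";
```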
You'll also frequently want to specify how often a bit of regular expression occurs. For example, what if you want as many iterations of a sequence as available? There are a few quantifiers available:
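+ matches one or more of the preceding item.
* is like +, but 0 matches is also an option.
? matches zero or one of the preceding item.
{m,n} matches at least m and at most n of the preceding item.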
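To get a feel for the quantifiers, the sketch below runs four of them over the same invented candidates and collects which ones match. The last two lines show a quantifier applied to a parenthesized group instead of a single character.

```perl
use strict;
use warnings;

my @candidates = ("ac", "abc", "abbc", "abbbbc");
my ($plus, $star, $question, $range) = ("", "", "", "");

foreach my $candidate (@candidates) {
    $plus     .= "$candidate " if $candidate =~ m/ab+c/;       # one or more b
    $star     .= "$candidate " if $candidate =~ m/ab*c/;       # zero or more b
    $question .= "$candidate " if $candidate =~ m/ab?c/;       # zero or one b
    $range    .= "$candidate " if $candidate =~ m/ab{2,3}c/;   # two or three b
}

print "ab+c     matches: $plus\n";
print "ab*c     matches: $star\n";
print "ab?c     matches: $question\n";
print "ab{2,3}c matches: $range\n";

# On a group, the quantifier repeats the whole group.
my $group_on_abab = ("ababc" =~ m/(ab){2,3}c/) ? 1 : 0;
my $group_on_abb  = ("abbc"  =~ m/(ab){2,3}c/) ? 1 : 0;
print "(ab){2,3}c on ababc: $group_on_abab, on abbc: $group_on_abb\n";
```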
A character class is a list of possible characters in a given
position. You denote this with square brackets ([]).
For example, [aeiouy] would match any vowel. Combine
this with quantifiers, and things get even more powerful; what
would [aeiouy]+ match? (See Answer 2 at the end.)
You can negate a character class by putting a caret ^
at the front of it, just inside the opening bracket. [^aeiouy] matches non-vowels, for
example. (Not necessarily consonants. Why not?)
Of course, typing all of the digits or all of the letters would
get repetitive. So there are shortcuts! All shortcuts are
letters which are escaped with a backslash. For example,
\w matches any "word" character. (By the way, a
"word" character is any letter, no matter the case, a digit, or the underscore.)
\d matches digits (0-9) and \s matches
whitespace (newline, tab, carriage return, form-feed, and space).
(By the way, ranges are also allowed inside the square brackets. [a-z] matches
all lowercase letters. [0-4] matches the first 5
digits.)
You can negate a shortcut a bit more cleanly by just
capitalizing the letter. \D is a non-digit, for
example.
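A handful of quick checks, as a sketch, of what the classes and shortcuts do and do not match. Each result is stored as "yes" or "no"; the test strings and variable names are invented.

```perl
use strict;
use warnings;

my $vowel_in_perl      = ("Perl"   =~ m/[aeiouy]/)  ? "yes" : "no";
my $non_vowel_in_aei   = ("aei"    =~ m/[^aeiouy]/) ? "yes" : "no";
my $digit_in_perl5     = ("Perl 5" =~ m/\d/)        ? "yes" : "no";
my $space_in_perl5     = ("Perl 5" =~ m/\s/)        ? "yes" : "no";
my $non_digit_in_12345 = ("12345"  =~ m/\D/)        ? "yes" : "no";
my $low_digit_in_789   = ("789"    =~ m/[0-4]/)     ? "yes" : "no";
my $lowercase_in_PERL  = ("PERL"   =~ m/[a-z]/)     ? "yes" : "no";
my $underscore_is_word = ("_"      =~ m/\w/)        ? "yes" : "no";

print "A vowel in 'Perl'?        $vowel_in_perl\n";
print "A non-vowel in 'aei'?     $non_vowel_in_aei\n";
print "A digit in 'Perl 5'?      $digit_in_perl5\n";
print "Whitespace in 'Perl 5'?   $space_in_perl5\n";
print "A non-digit in '12345'?   $non_digit_in_12345\n";
print "A digit 0-4 in '789'?     $low_digit_in_789\n";
print "A lowercase in 'PERL'?    $lowercase_in_PERL\n";
print "Is '_' a word character?  $underscore_is_word\n";
```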
You can also specify where in a string a match is allowed to be
made. For example, ^ (when not inside the character
class square brackets) and $ match the beginning and
the end of the string, respectively. (Why won't $ be
confused with part of a scalar?) So m/^John Weiss$/
matches my name if that's the entire string. I've used
m/^\s*/ a lot to account for the possibility of blank
space at the start of a string.
(Subtle point: $ actually matches either at the very end
of the string or just before a newline at the end. This could be
important to you, but odds are it won't be. I've never found myself
worried about it.)
The other form of anchor is the word boundary, \b.
This matches the boundaries between words (duh). But do note that
this isn't always where you expect it to be. For one thing,
apostrophes count as boundaries. Oops!
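The anchors can be checked with a few one-liners. This is just a sketch with invented test strings; each result is 1 for a match and 0 for no match.

```perl
use strict;
use warnings;

# ^ and $ pin the match to the start and end of the string.
my $exact        = ("John Weiss"     =~ m/^John Weiss$/) ? 1 : 0;
my $with_title   = ("Dr. John Weiss" =~ m/^John Weiss$/) ? 1 : 0;
# The subtle point: $ also matches just before a final newline.
my $with_newline = ("John Weiss\n"   =~ m/^John Weiss$/) ? 1 : 0;

# Allowing for blank space at the start of a string.
my $indented     = ("   orbit" =~ m/^\s*orbit/) ? 1 : 0;

# \b matches at the edges of words...
my $whole_word   = ("the cat sat" =~ m/\bcat\b/) ? 1 : 0;
my $inside_word  = ("concatenate" =~ m/\bcat\b/) ? 1 : 0;
# ...and an apostrophe counts as an edge, so the t in "don't"
# looks like a word of its own.
my $apostrophe   = ("don't" =~ m/\bt\b/) ? 1 : 0;

print "exact: $exact, with title: $with_title, with newline: $with_newline\n";
print "indented: $indented\n";
print "whole word: $whole_word, inside word: $inside_word, apostrophe: $apostrophe\n";
```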
OK, so you've read the above. You know how to write a regular expression and you can generally read them. And matching is fun, but you crave more from life. You want to run, to skip, to search-and-replace out in the open air! Well, we can half satisfy that urge.
Search-and-replace in regular expressions is fairly easy. You
just replace the m (the bit right after the binding
operator) with an s to start. Then you provide a
second argument to the regular expression, the replacement text.
Thusly:
$my_string =~ s/old text/new text/;
Of course, the new text won't use much regular expressiony goodness. Why not? Because it makes no sense to use a wildcard or a quantifier since you need to be assertive and definite about what you want to do. However, you can use all of the coolness that are regular expressions for the old text. And you can't beat that with a stick. (No, really, you can't. It's an abstract concept and therefore unbeatable with current stick technology.)
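Two small substitutions, as a sketch: one with plain old text, and one where the old text uses a character class and a quantifier. The second string is invented. Note that without any modifiers, only the first match in the string is replaced.

```perl
use strict;
use warnings;

my $my_string = "It was the best of times, it was the blurst of times";
$my_string =~ s/blurst/worst/;
print "$my_string\n";

my $spaced = "too     many   spaces";
$spaced =~ s/\s+/ /;    # only the first run of whitespace is squeezed
print "$spaced\n";
```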
However, there are fun things that you can do, like remembering what you just matched...
So here's the score: you've put together a great regular expression to match a bit of text. And now you want to either replace it, but with something based on the matched text, or manipulate the matched bit. The trouble is that you used character classes, wildcards, quantifiers, and everything else you could think of. (Duct tape was only passed over because it's so difficult to trim it down to the size of electrons for use in computers.) Gosh, it's a pity that Perl just throws away that information, isn't it?
Ha-ha! Fooled you! Perl doesn't
throw that away at all! (Perl wouldn't do that! Perl is watching
out for you, like a mafioso godfather. Only not so evil or keen
on oranges.) Perl stores the values in each set of parentheses in
variables named \1, \2, and so forth.
The first set of parentheses is 1, the second is 2, and on and on.
As if that wasn't pretty intuitive. These variables are available
inside of a match or a replacement string as soon as they've been
matched. (In a replacement string, Perl would rather you wrote $1 than \1, and says so if warnings are on.) This can be handy in matches (so that you can check to
be sure that both delimiters match, even if you don't know what
delimiter is in the text) and in replacements (you can replace all
of the text around a given string and keep that string intact).
More often, I've found that I need a match after the regular
expression has done its thing. In this case, the variables change
appearance just slightly to $1, $2, and
so on. You know what they are and who is who. It does get a wee
bit more difficult, though. These variables are only available
after the match has been successfully realized in full. So just
because what was in the first set of parentheses matched, it
doesn't mean that $1 has been set. (In fact, you
should pretty much always put matches in if-conditions and only
use the memory variables inside the if-block.)
One more potential snag for you: if you have nested parentheses in your regular expression, be careful about which match variable is which. The best way to work it out is to just count the opening parentheses. The match variables will match that order.
Why did I put this off until this section, rather than do it
with simple matching? Because, while you can use memory variables
with simple matching, they are a lot more useful with
search-and-replace. You can see why: you can reformat the text
around a particular string, for example, and leave that string
intact. In fact, I don't think that I've ever used
\1 inside of a simple match; I have had much cause to
use it with search-and-replace.
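The three situations described in this section can be sketched as follows: \1 inside the pattern (checking that two delimiters agree), $1 and $2 in a replacement, and counting opening parentheses when they are nested. All of the sample strings and variable names are invented.

```perl
use strict;
use warnings;

# \1 inside the pattern: the closing quote must equal the opening one.
my $text = 'He said "hello" to me';
my ($delimiter, $inside) = ("", "");
if ($text =~ m/(["'])(.*)\1/) {
    # Copy the memory variables while we know the match succeeded.
    $delimiter = $1;
    $inside    = $2;
}
my $mismatched_ok = (q{"oops'} =~ m/(["']).*\1/) ? 1 : 0;

# $1 and $2 in the replacement: swap two words, keep them intact.
my $name = "Weiss, John";
$name =~ s/(\w+), (\w+)/$2 $1/;

# Nested parentheses: number them by their opening parenthesis.
my $date = "1999-12-31";
my ($year_month, $year, $month, $day) = ("", "", "", "");
if ($date =~ m/((\d+)-(\d+))-(\d+)/) {
    ($year_month, $year, $month, $day) = ($1, $2, $3, $4);
}

print "Delimiter $delimiter around [$inside]; mismatched quotes match? $mismatched_ok\n";
print "$name\n";
print "$year_month | $year | $month | $day\n";
```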
There are three other memory variables that you might want to
know about. (Mind you, I've never used them myself. But
whatever.) They are $`, $&, and
$'. These store everything in the string before the
part that matched the entire regular expression, the part
that matched the entire regular expression, and then everything
after what matched the entire regular expression. So
$`$&$' is the entire original string.
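A short sketch showing the three pieces and that they glue back together into the original string. The sample string and the variable names $before, $matched, and $after are invented.

```perl
use strict;
use warnings;

my $my_string = "It was the best of times";
my ($before, $matched, $after) = ("", "", "");

if ($my_string =~ m/b.st/) {
    $before  = $`;    # everything before the match
    $matched = $&;    # the match itself
    $after   = $';    # everything after the match
}

print "[$before][$matched][$after]\n";
print "Pieces add up to the original\n" if "$before$matched$after" eq $my_string;
```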
Like with most things, it's possible to pass along other specifications about how you want matches to be carried out. These apply to both search-and-replace regular expressions and to matching regular expressions. They go right after the final solidus and you can use none of them or as many of them as you like with order not mattering.
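i makes the match case-insensitive.
g makes the match global, so it finds (or replaces) every occurrence rather than only the first.
s turns the wildcard (.) into a match for all
characters including the newline.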
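Each of these three modifiers can be seen at work in the sketch below, with and without the modifier. The test strings and variable names are invented.

```perl
use strict;
use warnings;

# i: ignore case.
my $case_sensitive   = ("PERL" =~ m/perl/)  ? 1 : 0;
my $case_insensitive = ("PERL" =~ m/perl/i) ? 1 : 0;

# g: replace every occurrence, not just the first.
my $first_only = "a-b-c";
$first_only =~ s/-/+/;
my $every_one = "a-b-c";
$every_one =~ s/-/+/g;

# s: let the dot match a newline too.
my $two_lines   = "first\nsecond";
my $dot_stops   = ($two_lines =~ m/first.second/)  ? 1 : 0;
my $dot_crosses = ($two_lines =~ m/first.second/s) ? 1 : 0;

# Several at once, in either order.
my $combined_is = ($two_lines =~ m/FIRST.SECOND/is) ? 1 : 0;
my $combined_si = ($two_lines =~ m/FIRST.SECOND/si) ? 1 : 0;

print "i: $case_sensitive -> $case_insensitive\n";
print "g: $first_only -> $every_one\n";
print "s: $dot_stops -> $dot_crosses\n";
print "is: $combined_is, si: $combined_si\n";
```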
That's a fancy word for switching letters. Except Perl doesn't
care about the distinction between letters and everything else.
But it's a pretty word, so we'll run with it. Anyway,
tr/search list/replacement list/ is for
"transliteration". (It can also be done with y///,
which is just a synonym kept around for fans of sed, but that's just silly. So
don't. Just forget I even said anything.) It finds items in the
search list and replaces them with the corresponding item in the
replacement list. (You could do this as a whole lotta
s///'s, but who wants to?) The transliteration is
done so that the first item in the replacement list replaces the
first item in the search list, etc. If there are fewer
replacement items than search items, the last replacement item is
used for all of the excess search items. (There's a modifier,
d, that tells Perl to delete the instances of those
search items rather than replace them, by the way.)
Probably the most common use of this operator is:
$my_string =~ tr/a-z/A-Z/;
which capitalizes every lowercase letter. You'll
see this and its inverse (tr/A-Z/a-z/) most often.
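The three behaviors described above (one-for-one replacement, a replacement list that runs out early, and the d modifier) can be sketched like this. The sample strings are invented.

```perl
use strict;
use warnings;

# One-for-one: each lowercase letter becomes its uppercase partner.
my $my_string = "Hello, World";
$my_string =~ tr/a-z/A-Z/;

# Fewer replacement items than search items: the last one (y)
# is reused, so both b and c turn into y.
my $short_list = "aabbcc";
$short_list =~ tr/abc/xy/;

# The d modifier with an empty replacement list deletes the matches.
my $no_vowels = "transliteration";
$no_vowels =~ tr/aeiou//d;

print "$my_string\n";
print "$short_list\n";
print "$no_vowels\n";
```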
Answer 1: It really was the best of
times! and then a line break. The second if-statement doesn't
match, so it prints nothing.
Answer 2: Any word (of at least one letter!) made up of only vowels.