Introductory HTML

This part of the course is intended for novices. In this seminar, we will discuss how to create a website in your own directory, the basic layout of the HTML document, and a few of the most important tags. In the next two sessions, we will build on these ideas, adding more tags, formatting capabilities, and other niftiness. If this is your first time using HTML, I encourage you to spend some time before next session creating a few personal pages for yourself. These need not be fancy, but they should give you some experience in writing HTML.

Making a Website

Where to Put the Files

If this is your first time making a website, the first thing we need to talk about is where to put your various files. If you have an account on one of the ITS-run Unix machines on campus (Origins, Cosmos, Bogart, etc.), you can create a personal website by creating a directory in your home directory titled public_html. (Remember, this is case-sensitive!) This will make your website URLs look like http://server.colorado.edu/~username/files, where server is the name of the server (origins, bogart, etc.) and username is your username on that machine. (files will be the name of the files in the directory. More on that shortly.)

A few words on permissions: your documents need to be readable to people other than you! Set permissions so that group and world have read privileges. You probably do not want to make your files world writable (or even group writable, most of the time). The execute privilege isn't really relevant to HTML documents, since they aren't programs. I personally tend to periodically issue the command chmod 755 *.html in my web directories. This gives me read, write, and execute privileges and gives everyone else read and execute privileges. This is mostly habit, though, and 744 is probably wiser. Incidentally, the only reason not to give yourself write privileges is if you are worried that you might destroy or delete a webpage that you have carefully worked on and don't intend to change. However, be aware that it is the nature of webpages to be dynamic entities, changing as our tastes, needs, and whims change. The better solution to the danger of destroying your hard work is to back up your website periodically by downloading it to a local hard drive, writable CD-ROM, etc. (Some servers are, of course, backed up by the operators. I'd encourage you not to rely on that, however. Not only are you counting on things well beyond your control, recovering files off of tapes is often a rather annoying task and I suspect that the Ops would be better used dealing with other matters.)

What should you call your file? Well, here is a good start: index.html. This file, if it exists in any of your web directories (either public_html or any subdirectory of that directory), will automatically be loaded if a user types a URL ending in that directory name rather than a file inside that directory. For example, my homepage's URL (as I usually give it out) is http://moonlets.org. The source for that page, however, is actually in http://moonlets.org/index.html. This feature is nice for a couple of reasons. The first is that the first URL is shorter and probably easier for someone to remember, especially if they're even passingly familiar with the ~ convention. The second is that this allows you to hide your directory contents from snoopers. That's right, if you do not have an index.html file, anyone typing a URL ending in your directory name will get to see the contents of your directory. (OK, technically this is dependent on your server's configuration. That can change with the server or if the operators alter things... so I wouldn't wager on server configurations for privacy.) If you have pages not intended for public viewing or just plain don't like people snooping, this is unfortunate.

Finally, I should mention document extensions. HTML documents should have an extension .html or .htm. Typically, on a Unix machine, the former is preferred. (The latter is, I believe, a relic of DOS's inability to have extensions with more than three characters.) This tells the browser that the document is a hypertext document and that it should render it as such. What happens if you mis-label an HTML document as, say, a .txt document? The user will probably see the document source rather than the document as you intended it to be seen.

What is HTML?

HTML stands for HyperText Mark-up Language. "Hypertext" refers to the ability to link parts of documents together. The "mark-up" bit of the name describes the type of language it is. Specifically, you use tags to indicate instructions as to how to display the content of the page. Those of you who have used LaTeX will recognize this concept quite well: it's the same idea. The syntax is, however, quite different-looking in HTML and the two languages are clearly designed for very different applications. (Math in HTML is dicey business, while LaTeX has to make many assumptions about the medium in which it will be presented.)

A tag in HTML is enclosed in a <> combination. Anything inside these delimiters will not be rendered into the final document. Most tags have a start and stop pairing, where the stop entity starts with a /. For example, I start a paragraph with <p> and end it with </p>. (<p> is HTML for "paragraph." More on this matter shortly.) A few tags do not have stop tags associated with them. They will be kind of obvious when you get to them and I'll try to point them out. (Note that the <p> tag did not have a stop element for quite some time. HTML 4.01 still technically lets you leave it off, but newer standards such as XHTML require it. Most browsers will know what you mean if you forget the stop tag, but the results are... unpredictable.)
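
As a rough illustration of the start/stop pairing described above, here is a minimal sketch. The sentence text is made up for the example; only the tags matter.

```html
<!-- The start tag <p> and the stop tag </p> enclose the content.
     Neither tag shows up on the rendered page; only the text does. -->
<p>This sentence is the content of one paragraph.</p>

<!-- A tag with no stop tag: <br> simply marks a spot in the text. -->
<p>First line of the paragraph<br>
second line of the same paragraph.</p>
```

In a browser, you would see two paragraphs of plain text and no angle brackets at all.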

A few more acronyms so that you can speak like a complete nerd. (Amaze your friends! Terrify your enemies! Get no attention at all from people who are totally indifferent to you!) URL stands for "Uniform Resource Locator," which just means that it's a general purpose addressing system for all manner of data online. HTTP stands for HyperText Transfer Protocol, which is complete computer nerd for "how to transfer the data back and forth." Many of you will recognize the similarity to FTP (File Transfer Protocol), and this isn't a coincidence. HTTP, HTML, and the Web were actually invented at CERN as a sort of glorified FTP to be used to read papers from other scientists. (And you thought that porn drove all developments in this kind of technology.)

Editing files

How do you edit HTML files, anyway? Well, that's kind of up to you. You can edit them with fancy proprietary software, but that's kind of silly. You can also use free software, such as what comes packaged with Netscape/Mozilla. (I'd suggest the latter, but I'm rather keen on Mozilla in general. Feel free to ignore my advice.) You can get Mozilla at www.mozilla.org. However, and I'm a bit of a snob about this, I am a big fan of the good old text editor of your choice. On a Unix machine, you can use vi, nedit, or [X]emacs, as well as a host of other editors, many of which are painful to use. I'm a big fan of [X]emacs, as it is powerful, full of handy shortcuts, has syntax-specific features that you can get (such as color highlighting of parts of your HTML code and automatically creating the nifty date stamp/reply-to stuff at the bottom of the page), and is generally available. (I'd suggest using Xemacs when you can, since it's a little bit friendlier in terms of graphical buttons and what-not.) Most ITS-run machines should have the syntax-specific packages already installed.

And here's where I get on my soap box. (Ever notice that you don't see soap boxes around anymore?) Wherever it is feasible for you, try to avoid converting anything to HTML as a way of generating a webpage. A Powerpoint presentation is an example. These things tend to render very badly in many browsers/platforms and the code is so ugly as to rival the Gorgons. In fact, this is sort of why I stump for editing source code directly: you will tend to write more elegant, easy-to-read code than any software package. (For example, a lot of software will leave vestigial tags in your files, which only serves to slow downloads and potentially confuse a browser.)

Since HTML is largely oblivious to white space, it is easy to make your documents look nice inside as well as when they're viewed in a browser. Put returns between parts of your document (more returns for more significant breaks), indent your elements, etc. And get rid of tags that are no longer needed. This will make maintaining your webpages easier. Also, since people might occasionally look at your source code (more on that later), it'll impress them to see a nice, well laid-out document.

Parts of an HTML Document

This section will look at the parts of an HTML document to provide us with the context for how to write such a document. We'll get to that shortly, I promise.

Document Type

A good HTML document should start with a line like the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

This line tells the agent (the browser or whatever) what kind of document it is about to get. Is this obvious (I mean, it is a web browser, right)? Almost, but not quite. There are other agents running around out there and your documents could be in any number of other formats. In this specific case, we're telling the agent that what is coming is an HTML document that follows the W3C's standards (hopefully), version 4.01 (transitional). (If you want to follow the strict standards, just drop the Transitional from the definition.) The EN is actually sort of trite: it says that the document type definition itself is written in English, not that your page is.

What if you forget this? Will the Earth be sucked into an enormous black hole? Will your pages totally fail? Unlikely. But it never hurts to be explicit to web browsers about what they are to do and it definitely never hurts to follow the standards. On a personal note, I have never seen the lack of this line cause a problem as such. But as I said, it never hurts to be explicit.

The next line should start the HTML, so put an <html> tag there. (Not that it has to be on its own line, but it does look nicer that way.)

Next up is the head of your document. Starting with a <head> tag, the following information is not supposed to get displayed in the rendered document. However, it can set up a lot of the stuff in the document, so make good use of the elements in the header.

The first line in the header should be

<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">

This is just another note about how the document is encoded (the character set in particular). Why you need this, I don't know. (Strictly speaking, the server should send this along. It seems that the ITS-run machines do not, however, so it's up to you to do so.) Again, it doesn't hurt to be explicit.

I should note whenever I start a new HTML document, I just copy and paste everything up to here into the new document. Actually, I typically copy and paste a few more lines, too, but we won't see those until later sessions. For now, just go ahead and copy and paste this stuff.

On to the real meat of the HTML document! The first thing that you should always add to an HTML document's header is a title. Enclose your title in <title> and </title> tags. The title should be descriptive, but not overly long. (When I first started writing HTML, back when it was on clay tablets, I was told that the title couldn't be more than six words long. I'm not sure if that was ever strictly true, or if it was just good advice. It certainly isn't true now. But it's still a good rule of thumb.) I should note that the title will not explicitly appear in the final rendered document. So what does it do, then? First off, it makes the title appear at the top of most (all?) browsers and in the tabs if the user is using tabbed browsing. This is very handy for anyone who has multiple windows/tabs opened and needs to jump to one without having to examine every single window/tab. Additionally, if a user bookmarks your page, the title will be the title of the bookmark (at least, in all browsers that I've seen). While the user can always modify a bookmark's title, it's nice to help her out by giving her at least a name to start with. (I have several bookmarks that I've never gotten around to giving actual, useful names. If the writers had actually taken a few seconds to add titles, I'd at least know what the bookmarks are for.)

That's it for the header for the time being. In later sessions I'll point out other nifty things that you can tuck in here to make your webpages cool. End off this part of the document with a </head> tag.

Body

At last, the main part of the webpage, the part that people actually really see! Start off this section with a <body> tag. A few more words on the body tag, since it can take various arguments. First of all, you can set the background image for your document with the background option. If you want a solid color background (or if you want to specify one should the background image not load, which is always a good idea), use the bgcolor option. You can change the color of your main text, of the links, and of the already-viewed links with text, link, and vlink, respectively. So, for example, your body tag might look like:

<body background = "ursae2.jpg" bgcolor = "#000000" text = "#FFFF00" link = "#FF0000" vlink ="#FF8800">

This says to the browser, "I'm starting the body of the document. Please use the file 'ursae2.jpg', in this current directory, for the background. Failing that, set the color to 000000. The text should be FFFF00, the links FF0000, and the viewed links FF8800".

Wait a second! What do those colors mean? They're hexadecimal (base 16) representations of RGB (red, green, blue) colors. The # alerts the browser that a hexadecimal number is coming. Each of the three color components is two digits long (for a total of six digits), with each digit ranging from 0 to F (the latter being 15 in hex). The components are in RGB order, so "#000000" is no colors at all, which is black. "#FF0000" says to turn on all the red and nothing else, so we get bright red. "#FFFF00" is all red and all green on and no blue. This makes a bright yellow. I'll leave it as an exercise to the reader to determine what "#FF8800" should mean.
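
To make the arithmetic concrete, here is a worked example written as an HTML comment above a body tag. The particular colors are my own picks for illustration and are not used elsewhere on this page.

```html
<!--
  Reading "#336699" two digits at a time:
    red   = 33 hex = 3*16 + 3 =  51
    green = 66 hex = 6*16 + 6 = 102
    blue  = 99 hex = 9*16 + 9 = 153
  Each component runs from 00 (0, fully off) to FF (15*16 + 15 = 255, fully on),
  so this color is mostly blue with some green and a little red.
-->
<body bgcolor="#FFFFFF" text="#336699" link="#0000FF" vlink="#808080">
```

Here "#FFFFFF" turns everything fully on (white), "#0000FF" is pure blue, and "#808080" sets all three components to 128, which gives a medium gray.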

I should note that hexadecimal isn't the only way to notate colors. For example, some colors (like "#FF0000" = "red") have names. But I wouldn't count on that. Besides, hexadecimal is extremely flexible, letting you tweak the colors to juuuuust what you want. So I'd learn them. You can always use trial and error to find the color that you want, of course.

Another note: in the last session, we'll discover that none of the above optional arguments to the <body> tag are necessary. Style-sheets have taken over this kind of thing entirely, and I highly recommend them. But for now, we'll stick with this.

The rest of this session will be devoted to creating the body of your document. But before we get to that, I want to point out a few more general things. First, don't forget to close off the body and the html tags at the end of your document! Most browsers will know what you meant, but it's never wise to make them guess.
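
Putting the pieces from this section together, a complete (if very dull) page might look like the sketch below. The title and the paragraph text are placeholders that I made up; the DOCTYPE and meta lines are the ones quoted above.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>

  <head>
    <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
    <title>My First Web Page</title>
  </head>

  <body bgcolor="#FFFFFF" text="#000000">
    <!-- Everything the reader actually sees goes between the body tags. -->
    <p>Hello, world!</p>
  </body>

</html>
```

Note the order in which things are closed off: the head closes before the body opens, and the body closes before the html tag does.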

Second, there are two types of tags within the body of your document. The first of these is the block-level tag, while the second is an inline-level tag. Block-level tags are (usually) not inside any other tags, except for the html and body tags. These are self-standing entities, like paragraphs, headings, tables, lists, and so forth. The inline-level tags should always appear encased in some block tag. An example is the emphasis tag, <em>, which tweaks the appearance of the font inside of it, but isn't really a part of the document in the component sense. (As a warning: the image tag is an inline tag, even though I'm sure that you, like I, will want to use it in a block context a lot. Wrap it in a paragraph tag and you'll be fine.) I'll try to point out which are which in what follows. After a while, you'll probably be able to guess on your own, though.
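
Here is a small sketch of the block-level versus inline distinction. The heading text, the wording, and the image file name photo.jpg are all hypothetical.

```html
<!-- Block-level elements: each one stands on its own inside the body. -->
<h2>About This Page</h2>

<p>
  <!-- Inline elements live inside a block-level element. -->
  This paragraph contains some <em>emphasized</em> text.
</p>

<!-- An image is inline, so it gets a paragraph wrapped around it. -->
<p><img src="photo.jpg" alt="A photo that goes with this page"></p>
```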

Basic Tags

This final section of this session is devoted to what I consider the most basic tags. These are the ones that you'll use over and over again, and are generally pretty easy to employ. One warning: you can nest tags, but never close off the outer layer of tags without closing off the inner layer first!
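
The nesting warning is easier to see than to describe, so here is a quick sketch with made-up text. The second paragraph is deliberately wrong.

```html
<!-- Correct: <em> opens inside the paragraph and closes before it does. -->
<p>This is <em>properly nested</em> text.</p>

<!-- Incorrect: the outer <p> is closed while the inner <em> is still open. -->
<p>This is <em>badly nested text.</p></em>
```

A handy way to think about it: the last tag opened is the first tag closed.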

Paragraph

Probably the single most often used tag, the paragraph tag (<p>) denotes that the contents between the start and stop tags are a paragraph. Simple idea, and easy to use. As I have previously noted, older versions of HTML apparently didn't support (or at least didn't encourage) the closing tag for paragraphs. Newer standards, however, push for these for a variety of reasons that I'll defer for later. Suffice it to say, you should use both tags with each paragraph.

The paragraph tag is a block-level tag.

Heading

You'll want to denote headings for different sections in your document, or even the title of the document as a whole. For this, use the heading tags: <h#>, where # is a number 1 through 6. <h1> is the largest, boldest, most in-your-face heading and <h6> is the smallest, meekest heading that you can employ. (Note that we'll see how to adjust these appearances later. That's another style-sheets thing, really.)

This is also a block-level tag, so close it off before you start the paragraphs of wonderful text that you plan to write!
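
A rough sketch of headings and paragraphs working together follows. The page topic and all of the text are invented for the example.

```html
<!-- The title of the whole page gets the biggest heading. -->
<h1>My Summer Vacation</h1>

<p>An introductory paragraph about the trip as a whole.</p>

<!-- Sections within the page get the next size down. -->
<h2>The Drive Out</h2>

<p>A paragraph about the drive.</p>

<h2>The Campsite</h2>

<p>A paragraph about the campsite.</p>
```

Each heading is closed off before the paragraph that follows it begins.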

Breaking up Text

Sometimes you will want to break up bits of your document not just as paragraphs, headings, and the like. HTML contains a tag that inserts a single return (unlike the two that you tend to get with the paragraph tag) without breaking the current block-element. This is the <br> tag. It is an inline tag, by the way. Its use is pretty straightforward and there is no closing element. (This should make sense: there is no real need to show that you've "stopped" a return.)

The other tag I wanted to mention at this point is the <hr> tag. This tag inserts a "horizontal rule" into your document, further dividing the document in an obvious manner. This fellow is block-level and has no closing tag, either.
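
As a small sketch of both tags in use, here is a fragment with made-up text. The address lines are placeholders.

```html
<!-- <br> forces a new line but stays inside the same paragraph. -->
<p>
  Jane Q. Student<br>
  123 Example Street<br>
  Boulder, CO
</p>

<!-- <hr> is block-level, so it sits between paragraphs, not inside one. -->
<hr>

<p>The text after the rule starts a new part of the page.</p>
```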


Image

The image tag, <img>, is the second sort of complex tag we've encountered so far. (The first was the body tag.) An image tag really needs to have more than just the tag name to be useful, since you want to load a particular image. The syntax for this is <img src="image.ext">, where you should read src as "source" and image.ext is the file name.

The image tag is an inline-level element; it must be wrapped in a block-level tag to be correctly employed. Inserting the image into a paragraph is quite easy and makes perfect sense.

The image tag also has additional options. The most important is the alt option. alt is the alternate version of your image, a short bit of text to tell the user what they've missed because the image didn't load for them. (It can range from "a pretty picture of my house" to "Click Here to Submit", depending on what the image was for.) To be totally compliant with HTML standards, you should use the alt option. For a variety of reasons, you cannot guarantee that the image will load, so at the very least you should tell the user what was there.

An example of an image tag in action:

<p><img src="../Images/Cavies/minipigs3.jpg" alt="Little Baby Pallas, a Few Hours Old"></p>

This looks like:

[Image: Little Baby Pallas, a Few Hours Old]

(Isn't he cute?)

If the image wouldn't load, we'd get

Little Baby Pallas, a Few Hours Old

Anchor

The anchor tag, <a>, is another important beast. Like the <img> tag, the anchor tag needs other options to be of any use. Unlike the <img> tag, the anchor tag needs to be closed off. (</a>)

The main use of anchor tags is to link to another document. The syntax for this is <a href="link_url">The Linked Text Goes Here</a>, where link_url is the URL of the other page. Which leads us to a discussion we were going to have to have, sooner or later: relative versus absolute URLs.

URLs come in two flavors, the relative and the absolute. The difference is really in how the addressing works. A relative URL is in relation to the current page, while an absolute URL is the same from anywhere. Not clear yet? Here's an analogy: street addresses. I could give my street address as "walk 3 blocks east of here, on the left." In some cases, this is a great reference. However, if you were planning to mail me something, I'd suggest something more like

1600 Pennsylvania Avenue
Washington DC, 20500
USA

The difference is that for the latter format, you have a reference that anyone, anywhere can decipher, while the first one requires you to be in a specific spot.

OK, so the absolute reference has its appeal. It's the only way to refer to a totally different website, for example. Also, no matter how you shuffle your pages around, it still works. So why on Earth would you bother with a relative URL? Two reasons: simplicity and ease of maintenance. The first reason is rather trite, but relative URLs are generally shorter than absolute URLs. The second point is a stronger one. If you move your website and you have used absolute URLs for links between your pages, you need to update every single URL to the new machine or directory. A relative URL, however, is still accurate if the pages still bear the same relative "positions" to each other. (For example, if they all started in the same directory and they all ended in the same directory, the relative URLs haven't changed.)

Mechanically, here's the difference: a link with a relative URL looks something like <a href="index.html">. Notice that it's simple and short. All this says is, "find the file index.html in the same directory that I am in right now." A link with an absolute URL would look like <a href="http://moonlets.org/WebClass/index.html">. The key difference is that the href starts with http:// ("HyperText Transfer Protocol") and it requires knowing the name of the server.

Some of you are probably already asking if the http:// can ever be something else. The answer is yes, indeed. For example, you can use ftp:// and then an FTP reference. (I've never actually needed to do this, by the way. You need an FTP server to make it work, of course, and HTTP seems more powerful in any case.) A more common and more handy form of link is mailto: in place of the http://. This form of link tells the browser that the address that follows is a mail-to link. So if the user clicks the linked text, their email client starts up and prepares to compose a message to the address in the link. (Provided that their email client is configured on that machine, etc. You can't really bet on this working for everyone, but it's a nice feature to those for whom it does work.)
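
Here is a sketch of the three kinds of links discussed above. The link text, the photos directory, and the email address someone@example.com are assumptions for the sake of illustration; the absolute URL is the one quoted earlier.

```html
<!-- Relative URL: a file in the same directory as the current page. -->
<p><a href="index.html">Back to my main page</a></p>

<!-- Relative URL: a file in a subdirectory of the current directory. -->
<p><a href="photos/index.html">My photo pages</a></p>

<!-- Absolute URL: works from anywhere, but names the server explicitly. -->
<p><a href="http://moonlets.org/WebClass/index.html">The workshop page</a></p>

<!-- mailto link: asks the user's email client to start a new message. -->
<p><a href="mailto:someone@example.com">Send me email</a></p>
```

If the whole site moved to another server, only the absolute link would need editing (assuming, of course, that it pointed at one of your own pages).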

Font Styles

What about making our font look cooler? While straight text is great for emails, data files, and other text-only interfaces, HTML has a lot more power than that. You've all seen italic text, for example. How do you do that? Happily, it's an easy thing to pull off with HTML. However, there are two different ways to do this sort of thing, and you need to decide which one makes the most sense for each application. (Don't worry. It sounds like a lot of thinking, but after a few applications you find that you don't really need to think about it much.)

Italic and Bold Font

Basically, there are two ways to tell HTML that there is something stylistically different about some bit of text. The first is a physical style and the second is a logical style. Here's the difference (and here's where you start hearing me preach about keeping your HTML "pure"): a physical style is one that says "make this text look like X" while a logical style says "this text has this role in the overall flow, so format it appropriately."

So here's the preaching. HTML is not a typesetting language and it was never meant to be. (I know that going back to my early days of learning HTML this point was made to me. Then I pretty much ignored it, like everyone else, until recently when it became apparent that there was a problem.) What HTML is really meant to do is tell the client (browser, usually) what role the different parts of the document play and then let the browsers handle displaying it. You can compare this to Word or LaTeX, where you generally know what your final display will look like so that you can (and do) control what the appearance will be quite a bit. With the Web, you can make some guesses, but since a webpage can be viewed on different computers with different monitor settings with different browsers (if any browser at all!) in different operating systems, you really can't say a lot about how your end-user will be viewing your page. So this is why it's best to let the browser make more of the decisions. (As we'll see in the final session, style sheets let us give a healthy dose of suggestions to the user's client. The suggestions are generally abided by, but the final choice is still on the user's end.)

In practical terms, here's what happens. There are two tags for italicizing text and two for bolding it. For italics, you have <i> and <em> ("emphasis"). For bold-face you have <b> and <strong>. The former in each pair is a physical style which tells the browser, "Yes, I really want this text to be italicized/bold, definitely." The latter in each pair is a logical style which says, "Hey, this bit of text is to be emphasized/strengthened because it is important. Do whatever you think best to render this." In most browsers, these are the same thing, but not in all. For example, if I were blind and using a speech browser, italics make little sense. But emphasized text can be spoken with emphasis by my browser. For most uses, you really want the logical styles, since you're not so much interested in the exact appearance of the text as in making sure that the text gets its due attention. Occasionally, you really do want to control the actual appearance, such as when following certain typesetting rules. (For example, technically, though seldom in reality, foreign words and phrases are supposed to be set in italics.) So you should probably default to the logical styles of <em> and <strong>.

A final word on italic and bold text. All of these tags (as with all text-styling tags) have closing tags. You want to remember this because all of the text inside the opening and closing tags has the style applied to it. If you forget a closing tag (or misplace it) you'll have extra text in that style. This can be very bad because lots of bolded text is read as shouting at the reader. This gets really annoying and rude if half of the web page is rendered this way! Italicized text is sort of worse, since it's actually rather hard to read in large quantities. So close off those tags! (And check your pages in a browser to make sure that you've done so.)
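
A short sketch of the logical and physical styles side by side follows. The sentences are invented for the example.

```html
<!-- Logical styles: say what role the text plays. -->
<p>You <em>really</em> should back up your website.
   <strong>Do not</strong> rely on the server backups.</p>

<!-- Physical style: used here only to follow a typesetting convention. -->
<p>The committee was formed <i>ad hoc</i>.</p>

<!-- Every styling tag above is closed before the paragraph ends,
     so the style stops exactly where it should. -->
```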

Other Style Tags

HTML contains a lot of style tags of interest. Most of them are logical, not physical. (Actually, the only remaining physical tag which I can think of, <u> (underline), has been deprecated and should not even be used.) Here are some examples of nifty tags:

<code>
Designates code snippets. You've seen a lot of these in action in this page since I have used them wherever there is a bit of HTML code. Usually, this is rendered as a monospace font, such as Courier, to set it apart from the surrounding text. (I've also used the wonder that are style sheets to make the code come out in a different color.) This is very handy since it lets your reader know exactly what gets typed in to the computer.
<pre>
Pre-formatted text. Text enclosed inside of this tag will appear exactly as entered into the HTML document. Returns and multiple spaces are all heeded in this case, unlike other places in HTML. This is useful for copying and pasting text from other sources (when the formatting is even somewhat important) or for making sure that you control the formatting of the text for whatever reason. (However, see next session's discussion of tables for a way to achieve some of the same effect.) Note that this tag, unlike the others, is a block-level element. Therefore, it should not be nested inside another block-level set of tags, such as the <p> tags.
<blockquote>
Like the <pre> tag, the <blockquote> tag is a block-level element. In this case, it designates an extended quotation (more than a few sentences, typically). Most browsers treat this element by increasing the margins so that the text is narrower than other paragraphs.
<q>
For shorter quotes. This is an inline element so that it sets off quotes, but keeps them inside of paragraphs. Most browsers automatically add quotation marks around the text enclosed in this tag.
<sub> and <sup>
Subscripts and superscripts the enclosed text. Usually, this also decreases the font size of the subscript or superscript.
<cite>
Indicates citation of a source. Enclosed text is generally in italics.
<kbd>, <samp>, and <var>
"Keyboard," "sample," and "variable," respectively. The first can be used to designate text that was entered by the user and parsed by a script. The second is for sample output from a program. The last is for variables. The former two generally appear as monospace font, the latter as italic font.
<address>
Used to indicate the email address of the webpage owner.
<abbr> and <acronym>
Designates abbreviations and acronyms. Not really useful for most browsers, but this could be terribly handy for alternate clients (such as text-readers for the blind). In this document, I have tried to use these tags wherever I could (and also remembered to). I could have used the tags to somehow make the text stand out (for example, by changing the color), but that seemed a bit over the top. Also, if you set the title attribute to the full version of the acronym/abbreviation, when a user mouses over it they will see the translation. At least, this occurs in some browsers. This is kinda nifty, I think.

Note that this is not exactly an exhaustive list and that there are other tags out there. If you have logical-tag needs that aren't met by this list, check the HTML specifications to see if the tag you need exists.
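
To show several of the tags from the list above in context, here is a hedged sketch. All of the text is placeholder material that I made up, and the little table inside the pre tag is just there to show that spacing is preserved.

```html
<!-- Inline tags sit inside a paragraph. -->
<p>Start a paragraph with <code>&lt;p&gt;</code>. At the prompt, type
   <kbd>ls</kbd> to list your files.</p>

<p>Water is H<sub>2</sub>O, and the area of a square is s<sup>2</sup>.</p>

<p>As my advisor likes to say, <q>check it in a browser first</q>.</p>

<p><abbr title="HyperText Mark-up Language">HTML</abbr> is a mark-up language.</p>

<!-- <pre> and <blockquote> are block-level, so they stand alone. -->
<pre>
Name      Files
alice        12
bob           3
</pre>

<blockquote>
  <p>A longer quotation of several sentences would go here.</p>
</blockquote>
```

Notice that the angle brackets inside the code tag are written as &lt; and &gt; so that the browser displays them instead of treating them as a tag.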

A Few Deprecated Tags

These tags were either never part of HTML as such or have been officially given the ax. They still work in many cases, but should be avoided. Why, then, am I telling you about them? There are two reasons. The first is because, like above with the <body> tag attributes, I haven't told you how to handle styling of documents using the preferred method, so these folks will tide you over until I do so. The second reason is what I like to call the "sex ed. rationale": you'll probably see these tags around either in source code or hearing people discuss them so it is just as well that I tell you about them now. As I said, avoid them for now (but use them if you really feel a compelling need) and expect to replace them in a few weeks, never to use them again.

Center

The <center> tag will do just that, center (horizontally) everything that comes between it and its closing element. Honestly, it's pretty simple.

Font

The <font> tag is a way of controlling the look of parts of your text. It has a number of attributes that you can set, including color (sets the color), face (sets the actual font/font family), and size (sets the size of the font). Color just takes a color value (see above for more on that topic). I'm not going to even try to explain font families here, but you shouldn't be too desperate to play with this for a while. Size, however, bears discussing some. You have a few options with the size attribute. One is to do a relative size measurement, like size="+1" (bumps the font-size up by one unit). You can use any integer from -7 to +7 in this case. (0 is not allowed for obvious reasons.) You can also specify an absolute measurement, as in size="3".
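
For completeness, here is a sketch of the two deprecated tags in action. The text and the choice of Arial are placeholders of mine. Note that face is the attribute name that HTML 4.01 defines for picking the typeface.

```html
<!-- Everything between the center tags is centered horizontally. -->
<center>
  <h1>Welcome to My Page</h1>

  <!-- Red text, one size larger than the surrounding text. -->
  <p><font color="#FF0000" size="+1" face="Arial">Under construction!</font></p>

  <!-- An absolute size instead of a relative one. -->
  <p><font size="3">Ordinary-sized text.</font></p>
</center>
```

Treat this as a stopgap: once style sheets come up in the later session, both tags can go.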

Exercises

Finally, I've created a few exercises for you to try out. These aren't exactly brilliant, but really the best thing for you to do is to work on your own little website in order to try things out. And we'll see each other in a little while!


Weiss John