Data In Computers

Web Pages

Web pages can look complicated;  for the most part they are not.

A web page is essentially just a text file.

The difference is that the text is meant to be read by a program called a browser, which treats bits of text surrounded by the characters “<” and “>” as instructions, for instance:

  • the location of a linked file;
  • the meaning of a bit of text, a heading for instance;
  • an instruction to include a picture;
  • how to lay the page out.

The remainder is treated as the page content. Most of a page is usually readable characters that might have been typed.
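
The split between instructions and content can be sketched in a few lines of Python using the standard library's html.parser module. This is only an illustration: the sample page, the class name MarkupSplitter and the file names in it are made up, and a real browser does far more than this.

```python
from html.parser import HTMLParser


class MarkupSplitter(HTMLParser):
    """Sort a page into tag names and readable content."""

    def __init__(self):
        super().__init__()
        self.tags = []     # names found between "<" and ">"
        self.content = []  # everything else: the text a reader sees

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.content.append(text)


# A made-up page, used only for illustration.
SAMPLE_PAGE = (
    '<h1>Web Pages</h1>'
    '<p>A web page is <a href="file.htm">just text</a>.</p>'
    '<img src="pic.jpg">'
)

splitter = MarkupSplitter()
splitter.feed(SAMPLE_PAGE)
splitter.close()

print(splitter.tags)     # ['h1', 'p', 'a', 'img']
print(splitter.content)  # ['Web Pages', 'A web page is', 'just text', '.']
```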


Text

Text is material that could be written or typed, the equivalent of typescript on paper.  Many modern web pages don't look as though they are like that.  Pages are often filled with colourful pictures looking like a glossy magazine.

What makes "hypertext" is the "markup" that links a web page to other files, which might be pictures or other pages.  Markup is symbols and text that aren't normally meant to be seen.  For instance, 30 years ago if you were sending something handwritten to be typed you might mark each paragraph with a "p" in the margin to tell the typist or typesetter to make a paragraph break at that point.  You might devise a mark to tell the typist to leave the right side of the page blank for a photograph or diagram, for instance by writing a tag like “<img>” and some dimensions.

Film scripts are a good example of markup. Part of the text names the characters and gives the stage directions; the rest is actually spoken. The whole thing is text, but we are supposed to understand that one part is to be spoken by the actors while the other part does a different job, giving staging instructions to the director and production crew.

Essentially this is what HTML markup still does.


Files

Web pages are files.  A computer file is a way to group information so that it is normally moved as one thing. Computer files are usually kept on mass storage devices such as disks and then moved into memory or across a communications link when they are needed. Computer systems present information as files so that people don't have to be too aware of its true nature. A disk file is a one-dimensional array of bytes (or even a stream of bits), and there are other groupings and labels such as blocks, sectors and volumes, file headers, directories and file suffixes. Sometimes we do have to know a few details - the directory path to a file or the meaning of a suffix. For instance:

Something called file.txt would be expected to contain plain alphanumeric characters probably in a form that is quite readable by people.

Something called file.htm is likely to contain hypertext markup language; if we read it with a text editor the <body> part would be readable but there would be:

  • a header containing terse machine instructions about the file type, its title and other files it is related to.
  • tags scattered through the text giving hints on its meaning or layout, and some other codes beginning “&” and ending “;” which provide punctuation and special characters.
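
As a small illustration of the “&” ... “;” codes, Python's standard html module can turn them back into the characters they stand for. The sample sentence is made up.

```python
import html

# "&amp;", "&lt;" and "&pound;" stand in for characters that are awkward
# to type or that would otherwise be mistaken for markup.
source_text = "Fish &amp; chips cost &lt; &pound;5"
readable = html.unescape(source_text)

print(readable)  # Fish & chips cost < £5
```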

Something called file.jpg is likely to be a picture in a format specified by the Joint Photographic Experts Group. A program that knows the structure finds a header followed by compressed pixel data specifying the picture, but in a text editor the same file would look like gibberish.
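
A rough sketch of how a program might guess what a file holds, looking at both the suffix and the first few bytes. The function name describe and the sample byte strings are invented for the example; the one firm fact relied on is that JPEG data begins with the two bytes FF D8. The checks are deliberately crude.

```python
from pathlib import Path

JPEG_START = b"\xff\xd8"  # JPEG data begins with these two bytes


def describe(name: str, data: bytes) -> str:
    """Guess a file's nature from its suffix and its leading bytes."""
    suffix = Path(name).suffix.lower() or "(none)"
    if data.startswith(JPEG_START):
        kind = "JPEG picture"
    elif b"<html" in data[:200].lower():
        kind = "HTML page"
    else:
        try:
            data.decode("ascii")
            kind = "plain text"
        except UnicodeDecodeError:
            kind = "unknown binary data"
    return f"{name}: suffix {suffix}, content looks like {kind}"


# Made-up file contents, just enough bytes to show the idea.
print(describe("file.txt", b"Hello, world\n"))
print(describe("file.htm", b"<html><body>Hello</body></html>"))
print(describe("file.jpg", b"\xff\xd8\xff\xe0\x00\x10JFIF"))
```

Note that the suffix is only a hint; the bytes themselves decide what the file really is.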


Getting the File

Fetching a web page means giving its file-name, either to the local computer's operating system or, more likely, to the World-Wide Web. There are a great many computer operating systems, so a way to specify any file on the Web was devised, called a Uniform Resource Locator or “URL”. It specifies a protocol, then a “domain” or place where a file should be, and then a filename, so we might have “http://en.wikipedia.org/wiki/URL”. When you use a browser the software in your machine interprets that along the lines of: look up the IP address of the domain “en.wikipedia.org”, then use Hypertext Transfer Protocol to ask that address for its file “wiki/URL”. Expect an answer carried by HTTP and probably written in HTML (although in this case the filename doesn't say so).
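
The way a URL breaks into its parts can be seen with Python's urllib.parse module. The request text built at the end is a simplified sketch of what might be sent over HTTP/1.1; a real browser sends many more header lines, and nothing here actually touches the network or looks up an address.

```python
from urllib.parse import urlsplit

url = "http://en.wikipedia.org/wiki/URL"
parts = urlsplit(url)

print(parts.scheme)    # http              (the protocol)
print(parts.hostname)  # en.wikipedia.org  (the domain)
print(parts.path)      # /wiki/URL         (the file being asked for)

# Simplified text of the request; real browsers add further headers.
request_text = (
    f"GET {parts.path} HTTP/1.1\r\n"
    f"Host: {parts.hostname}\r\n"
    "\r\n"
)
print(request_text)
```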

What looks like one page when presented on a screen may be one file. More usually there will be several pictures - or perhaps movies and sound and these are in separate files. So when the browser gets a file it may immediately ask for a cluster of others.
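
How a browser might work out which other files to ask for can be sketched as follows. The class name ResourceFinder, the page address and the file names are all invented, and only the “src” attribute of a few tags is examined; real browsers follow many more kinds of reference.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceFinder(HTMLParser):
    """Collect the addresses of extra files a page refers to."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("img", "audio", "video", "script"):
            src = dict(attrs).get("src")
            if src:
                # Relative names are resolved against the page's own URL.
                self.resources.append(urljoin(self.page_url, src))


# Hypothetical page address and content.
finder = ResourceFinder("http://example.org/pages/index.htm")
finder.feed(
    '<p>Hello</p>'
    '<img src="pics/cat.jpg">'
    '<video src="/media/clip.mp4"></video>'
)
finder.close()

for address in finder.resources:
    print(address)
```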

Web pages are interpreted by programs called “browsers”. A more formal term is “Web client” or even “User agent” but the whole ethos of the Web has always been one of informality and “a three year old can use it” so the program has an informal name. People are thought of as browsing the web - even if what they are doing might actually be a highly systematic search.


Web Browsers

A “browser” is a program that helps people view hypertext. Browsers turn the text, pictures and links in HTML into their graphical form on a screen. Making the screen layout from the source code is called “rendering”. If the nature of things seems in doubt, look at the source code. (In Firefox it's View > Page Source.)

The real nature of a page is the source.   Sometimes a page source can look simple; the source code for this one should be fairly straightforward. However, pages can be made up from a mixture of text, graphics and scripts, and the result can be confusing to look at. Sometimes page authors intend the source to be confusing, to keep others from understanding it and copying their ideas.


Markup

On the whole programmers like to keep things simple. Markup is normally distinguished from ordinary text by surrounding it with a couple of simple characters. The angle brackets < and > were never much used in English text so they were chosen as markup delimiters. A paragraph begins <p>. Anything within the angle brackets is markup whilst things outside are ordinary text.

One way to look at things is that the markup is a stream of commands to the browser and particularly the rendering engine whilst the remainder is the information to display.

Markup applies until it is closed so if the markup <b> meaning “bold” is given then all the subsequent text will be bold until the markup is closed with </b>. Likewise a paragraph having begun with a <p> will end </p>.
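
The idea that markup stays in force until it is closed can be sketched by keeping a count of open <b> tags while reading through the text. The class name BoldTracker and the sample sentence are made up; a real rendering engine tracks far more than bold.

```python
from html.parser import HTMLParser


class BoldTracker(HTMLParser):
    """Record each piece of text along with whether <b> was open."""

    def __init__(self):
        super().__init__()
        self.bold_depth = 0  # how many <b> tags are currently open
        self.pieces = []     # (text, is_bold) pairs in page order

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        self.pieces.append((data, self.bold_depth > 0))


tracker = BoldTracker()
tracker.feed("<p>Plain <b>bold text</b> plain again</p>")
tracker.close()

print(tracker.pieces)
# [('Plain ', False), ('bold text', True), (' plain again', False)]
```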

The letters and words enclosed in angle brackets are usually called “tags”. They specify things like headings and paragraphs, links and pictures.

The web's creators took the opportunity to include specifications for font, size, emphasis, colour, page background and many other things specifying how a page is to look. A page can be made of nothing but pictures. But essentially even if a page doesn't appear to have a word of “text” visible it is still hypertext.


StyleSheets

Specifying fonts and layout directly in the page source becomes a bit of a nuisance. Suppose you use a bold font both for headings and for emphasis: using the same tag for both means there is no easy way to have the computer make a list of headings. Now suppose you have hundreds of pages and decide you'd prefer the headings to be blue - that means editing all those pages.

Separating semantics such as “<h2>” (heading) and “<em>” (emphasis) from presentation such as “<b>” (bold) has become an objective for web designers. If a heading always starts with a heading tag such as “<h2>” then Cascading Style Sheets can be used to specify its colour in one place.
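
Once headings carry their own tags, having the computer list them becomes easy. The sketch below assumes headings are marked with <h1> to <h6>; the class name HeadingLister and the sample page are invented for the example.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingLister(HTMLParser):
    """Collect the text of every heading on a page."""

    def __init__(self):
        super().__init__()
        self.current = None  # text pieces of the heading being read, if any
        self.headings = []   # (tag, text) pairs in page order

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.current = []

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self.current is not None:
            self.headings.append((tag, "".join(self.current).strip()))
            self.current = None


lister = HeadingLister()
lister.feed(
    "<h2>Files</h2>"
    "<p>Some <b>bold</b> text that is not a heading.</p>"
    "<h2>Getting the <em>File</em></h2>"
)
lister.close()

print(lister.headings)  # [('h2', 'Files'), ('h2', 'Getting the File')]
```

The bold word in the paragraph is ignored, which is exactly what could not be done if headings were marked only with <b>.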


HyperText

Computers have dealt with text files for years. Wikipedia reckons the first use of the word is in an advert for an RCA computer in 1950: “...the results of countless computations can be kept "on file" and taken out again. Such a "file" now exists in a "memory" tube developed at RCA Laboratories. ...” Link

The basic addition hypertext makes is that if there is a reference to another page, such as the one in the link above, you can click on it and go straight to another text. With older computer text mechanisms it would have been necessary to manually open a new file by giving its name - which if it was at all complicated would have meant either a feat of memory or writing it down.   Hypertext is at heart all about linking.   The pretty pictures and glossy magazine design ethos are later additions.


In Summary

Web pages involve three main things:

  • A language called HyperText Markup Language (HTML), which is basically about linking but extends to marking up page semantics and presentation.
  • Uniform Resource Locators (URLs), which provide a way to find other pages and items within them.
  • Hypertext Transfer Protocol (HTTP), which is a way to get (and save) pages as needed.

For the most part web-site users don't need to know much about these things. They just point and click wherever there is a link and their browser does the rest.   Of course you do need to know a bit about how things work in order to create pages.   Pages with lots of design features tend to require quite a lot of knowledge - or some sort of application package that does most of the work for you.

However, if you want to know how pages work, look at the underlying text that is actually transmitted. Most browsers can make it visible. It may take some time to understand the more decorative pages, but this one is relatively straightforward.