HTML

Before we start learning to build websites, it would be beneficial to briefly explore what they actually are and look at a simplified version of what goes on 'under the hood' when you visit one. For the purposes of this example, we will use a page that should be familiar to most: the Google Homepage (https://www.google.com/).

The Internet and the Web

You are probably used to thinking of websites and webpages as existing somewhere 'in the cloud' or on the internet. The very name web browser suggests something that allows us to look through the world wide web as a thing that exists somewhere else. Like most people, you may have heard of the terms 'internet' and 'world wide web', and may even use the two interchangeably. In reality, however, websites are very much rooted to physical objects and the web and the internet are two separate, but related systems that allow us to access them.

The Internet has been around for over half a century, and is a collection of computers all over the world that are connected (or networked) together to allow exchanging information between them. Each computer connected to the internet is assigned a unique address (an IP, or Internet Protocol address), which is a string of numbers that allows sending information to that computer specifically. Think of it a little like the houses in a country, each one has a unique postal address and, using that address, we can send something to whoever lives there.

There are several systems that use the internet as an infrastructure to exchange information. For most people, the two that they will interact with most frequently are email, and the World Wide Web; there are others, but many have become little used by the majority of internet users.

Email

Email is conceptually the simpler of the two, and can be thought of a little like communicating with your friends or relations via letter. You write a letter and then put it into an envelope (a packet) addressed to the house where your friend lives, then the postal service (the internet) delivers it. Once the packet arrives, your friend can open it and read what you have written.

The important think to note here, is that the delivery mechanism (the post office or internet) doesn't need to understand the contents of the packet, only where it is addressed to. This means that any systems that build upon the existing infrastructure are free to use whatever form of information they like; you can write letters to your friend in any language, using some kind of secret code, or even communicate purely through pictures or objects - the postal service only cares about delivering your package successfully.

This is, of course, a heavily simplified version of what happens, but should be sufficient for our current purposes.

The World Wide Web

Like email, the World Wide Web is another system that is built on top of the internet. It was created at CERN (The European Organization for Nuclear Research) in the early 1990s as a way of sharing scientific data. Since then, the web has evolved enormously, but its fundamental structure is still the same; a collection of documents (now more frequently called pages) organised into sites. Each document conveys a piece of information, whether that is scientific data, instructions on how to cook a dish or fix a piece of machinery, a list of products you might be interested in buying, the latest news stories, or any other of the myriad uses that the modern web is put to.

Some webpages are drifting further and further from the original model of being a document conveying information (online games for example), but the fundamental structure of the web is still tied to that purpose.

Making Requests

You can think of the web like a distributed library; where documents (web pages) are spread out over several different locations (web sites). If you wanted to read a document, you would first send a letter to the site which held the document you are interested in, and they would send you back a copy of the original.

For us to view a webpage, our web browser must request a copy of it from the computer (referred to as a server) on which it is being stored. We could use the server's IP addresses to do this, but that has several downsides; a string of random number and to do that, it must have a way of uniquely identifying the particular page it needs, along with the computer that is storing it. These unique addresses are called URLs.

URL

URL stands for Uniform Resource Locator, and each is divided into 3 main parts: the protocol, which is the first part of the url up until the :// (in this case https); the domain name, which is the central portion of the url (in this case www.google.com); and finally the path, which is the last part of the url from the forward slash onwards / (here only a single forward slash as we are accessing the website homepage). Your browser may hide the protocol (and path if it is the homepage) when a page is displayed in its address bar, but they are still there and will be visible if you copy and paste the url somewhere else.

Illustration of domain breakdown.
Url breakdown of https://www.google.com/. Protocol = https, domain name = www.google.com, and path = /.

Protocols

The protocol is the method in which the browser accesses the resource, think of it a little like the various ways you could send a message to someone, such as writing a letter, calling them on the phone, or sending them a text message etc. For the purposes of building websites we don't need to understand how these work in-depth, just that http is the primary protocol used by the web, and that https is its more secure cousin.

Domain names

The domain name is the unique identifier of the particular website we are visiting. It is also the part of the URL that allows the browser to locate the computer on which the resource it is trying to access is being stored.

If we were to think of resources as items stored in a warehouse, the domain name would be the address of the warehouse itself. Domain names are actually split up into smaller chunks, but we don't need to cover that for now.

Paths

The path is the location of each individual resource within the domain. To reuse the analogy from earlier, if the domain name is the address to a warehouse, the path is the code for the slot on a shelf where the resource is stored.

As each URL must be unique, a domain could not have two resources with the same path; however, the same path on different domains would point to completely different resources (e.g. https://www.smashingmagazine.com/contact/ and https://www.mozilla.org/contact/ are totally different pages on separate websites).

Requests and Responses

When you come to type a web address into your browser's address bar it must first make what is called a DNS lookup. This essentially involves consulting an online address book to find the IP address (a bit like a direct phone number) of the computer that stores resources for the domain name. In web parlance a computer that stores resources in this way is known as a server, and storing said resources is called hosting.

There are far too many websites out there for browsers to maintain their own address books of servers. Not only would they have to keep up to date with new ones being added, but also manage any changes of IP address where a website moves to a different server. Instead, they first consult the DNS directory to find the current IP address of the website you are requesting.

Using this address the browser then makes a request to the server hosting the website for the page that you want to view.

All being well, the server should respond to this request by sending the page back as a .html file. This .html file contains a description of the content of the page you wish to view and your browser uses this to render the page on your computer (displaying it on the screen for many users, but also reading it out for those who are blind or have a visual impairment).

Illustration of simplified request / response for web page. Icons from font-awesome.

To build websites, therefore, we need to create them in the format that browsers can understand: HTML.

Hyper-Text Markup Language

HTML stands for Hyper-Text Markup Language. Hyper-Text is essentially techno-geek speak for documents that link to other documents.

To return to the example of www.google.com, when you make a search on google it scans the web for pages that might match the words you are looking for, and presents them as a list of links sorted by what it feels are the most relevant. Selecting any of these links will take you to that page so you can continue your search. It would probably not be exaggerating things much to say that links between documents are the single thing that contributes the most to the usefulness of the web.

To continue on with our breakdown of HTML you will notice that the L stands for language. In this case we are not referring to a language like English, Chinese, or Arabic, that is used by people to communicate with one another, but instead what is know as a computer language; which allows people to communicate with computers.

Fundamentally, computers deal in binary, familiar to most as long strings of 0s and 1s like this: 011000012. Having to use binary every time we want to tell a computer to do something would obviously become quickly tedious and error prone, so very clever people have invented a whole series of languages which are closer to those you would use in everyday life and which we can use to write instructions that the computer will then convert to binary before it follows them. There are a whole host of different computer languages for different use cases, and HTML is the one we use for describing webpages.

That, then, just leaves us with Markup. Markup describes the fact that HTML is a language for marking up documents; annotating parts of them to give them additional meaning. These annotations are what allow the browser to determine whether a particular part of the page is a piece of text or an image (along with a whole host of other possibilities) and serve as the instructions for how to build the page.

In the next chapter we will start to look at how html documents are formed.

Footnotes

  1. Anything that can be accessed on the web is classed as a resource, so not just pages, but also images; videos' pdfs etc.
  2. The number 01100001 is a binary representation of the lowercase letter a in the ASCII character set (a sort of code for representing letters with just binary numbers). As you can see, just typing out the word "Hello" would be very time consuming if we had to work in pure binary.