[notes on 1/21 by Marissa and Tony]
HTTP is the HyperText Transfer Protocol. A protocol is a language or format for inter-operation between different systems. HyperText refers to any text with links in it.
URI/URL/URN is a Uniform Resource Identifier/Locator/Name – an identifier for some digital object (text, image, PDF, article, audio/video, etc.). I don’t care so much about the distinction between these three, but this is one aspect of it:
An identifier (URI/URN) does not necessarily tell you where to get the object from (e.g., a book’s ISBN), but a URL also tells you how to access or retrieve the resource (e.g., FTP = file transfer protocol).
The part up to the colon is the scheme, and is usually http but can be ftp (file transfer), tel (telephone number), and others. An example telephone URL to my office phone is tel:17184881274 – maybe that will be clickable when this page is viewed on your mobile phone.
The host part of the URL can be an IP address or dotted host name (which can be converted to an IP address by DNS). The port can be specified after a colon; if it is omitted, the default port is 80 for http and 443 for https. A computer on the Internet can be listening for connections on multiple ports, and uses the port number of the connection to determine which service to provide.
The path starts with the slash, after the host portion, so /cgi/calendar.cgi in the example above. Sometimes it ends with a file-name extension (.jpg), sometimes it ends with a slash, and sometimes it’s bare.
The query starts with a question mark, and consists of a set of parameters (variables) and their values. Multiple parameters are separated with &. For example:
https://www.youtube.com/watch?v=kGOpY2J31pI&t=0m28s
In this URL, the scheme is https, the host is www.youtube.com, the port defaults to 443, the path is /watch and the query is ?v=kGOpY2J31pI&t=0m28s, which defines two variables: the video identifier v=kGOpY2J31pI and the time at which to begin video playback t=0m28s.
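As a rough illustration of the breakdown above, Python’s standard urllib.parse module can split the same YouTube URL into its parts. The DEFAULT_PORTS table is an assumption added for this sketch; it is not part of the library.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.youtube.com/watch?v=kGOpY2J31pI&t=0m28s"
parts = urlsplit(url)

# Default ports for the two schemes discussed in these notes (our own table).
DEFAULT_PORTS = {"http": 80, "https": 443}

scheme = parts.scheme                        # text before the first colon
host = parts.hostname                        # host name (or IP address)
port = parts.port or DEFAULT_PORTS[scheme]   # explicit port, else the default
path = parts.path                            # starts with the slash after the host
query = parse_qs(parts.query)                # parameter name -> list of values

print(scheme, host, port, path, query)
```

Each query parameter maps to a list because the same name may legally appear more than once in a query string.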
Many sites also support short/abbreviated URLs, such as http://youtu.be/kGOpY2J31pI for the same video. In this case, it uses the youtu domain name within Belgium’s .be country code.
Another example where you see a query is when you do a Google search. Search for “web jobs” and you’ll see ?q=web+jobs appear in the URL. The plus sign appears because a URL cannot have a space in it. It also cannot contain plenty of other special characters, and characters such as & are reserved for delimiting query variables, so there are other ways to encode special characters that we’ll see later. (Search Google for “the & symbol” and notice ?q=the+%26+symbol appear in the URL.)
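The two search examples above can be reproduced with Python’s urllib.parse module. This is only a sketch of the encoding rules, not of how Google itself builds its URLs.

```python
from urllib.parse import quote_plus, unquote_plus, urlencode

# Spaces become "+", and the reserved "&" becomes the percent-escape "%26".
encoded = quote_plus("the & symbol")

# Decoding reverses the process.
decoded = unquote_plus("web+jobs")

# urlencode builds a whole query string from names and values.
query = urlencode({"q": "the & symbol"})

print(encoded)  # the+%26+symbol
print(decoded)  # web jobs
print(query)    # q=the+%26+symbol
```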
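An HTTP conversation is a series of requests and responses. A request always starts from the client, and the server responds. A request begins with one line which has three parts:
The method, which is usually GET. We’ll see the others later.
The path (and query, if any) of the requested resource, such as /watch?v=kGOpY2J31pI.
The protocol version, such as HTTP/1.0 or HTTP/1.1.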
So a complete, correct request line would be
GET /watch?v=kGOpY2J31pI HTTP/1.0
Following that one line, the client will specify a series of headers that modify the request in some way. It’s just a generic way for the client/server to exchange further information. Some common examples:
User-Agent: Mozilla Firefox (Win8; v35) identifies what browser or other client software is being used.
Accept-Language: da, en-gb, es-ec identifies the preferred languages of the user, in order. If the web site is multi-lingual, it will try to match its content to the preferred languages. These codes mean Danish, British English, and Ecuadorian Spanish. See http://www.metamodpro.com/browser-language-codes for more.
Host: www.youtube.com gives the host portion of the URL. If specified, it allows the same web server to serve multiple web sites, called virtual hosting. (Required in version 1.1 of HTTP.)
Referer: https://liucs.net/cs120s15/ (yes, the name of this header is misspelled!) gives the URL of the page on which this link was clicked (in other words, the previous page in your browser’s history).
The headers end with a blank line. So here is a complete HTTP request that my browser used to fetch the language-codes page:
GET /browser-language-codes HTTP/1.1
Host: www.metamodpro.com
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCAQFjAA&url=http%3A%2F%2Fwww.metamodpro.com%2Fbrowser-language-codes&ei=uJXCVOOPL4LYggTS_4DQDQ&usg=AFQjCNE2XZTwYEgvHFkw0JQ-vHc2z4ukKA&sig2=h1QSWPH5pxjG9i1JanilBw&bvm=bv.84349003,d.eXY
Connection: keep-alive
If-Modified-Since: Fri, 23 Jan 2015 18:41:00 GMT
Cache-Control: max-age=0
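To make the layout of a request concrete, here is a minimal sketch that assembles one as text. The helper name build_request is hypothetical, and the CRLF line endings come from the HTTP specification rather than from these notes.

```python
def build_request(method, target, headers, version="HTTP/1.1"):
    """Assemble the text of an HTTP request that has no body."""
    lines = [f"{method} {target} {version}"]          # the request line
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Each line ends with CRLF; an empty line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"

request = build_request(
    "GET",
    "/browser-language-codes",
    {"Host": "www.metamodpro.com", "Accept-Language": "en-US,en;q=0.5"},
)
print(request)
```

Only two of the browser’s headers are included here to keep the example short; a real browser sends many more, as the full request above shows.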
All the web standards we’re describing are hosted at http://www.w3.org/
The GET method, described in the previous section, is meant to fetch a resource. HTTP is designed so that GET requests can safely be cached and repeated whenever necessary. Therefore, it would be disastrous if a GET were to trigger actions that have important consequences like adding an item to a database or charging a credit card. (There are some horror stories of this happening in the early days of the Web, before HTTP was widely understood.)
So instead of GET, we can use the POST method. Browsers, servers, and proxies know that it would be inappropriate to repeat or cache the results of a POST. When you submit a form on the web, such as to log in to a server, register a new user account, upload a file, or enter your credit card details – those forms are transmitted using POST.
In addition to the headers of the GET request illustrated above, a POST often carries a payload (also called the request body). This is an encoding of the fields in the form, or of the file being uploaded. In the request below, notice the new headers Content-Length and Content-Type:
POST /checkin2 HTTP/1.1
User-Agent: curl/7.40.0
Host: localhost:3000
Accept: */*
Content-Length: 19
Content-Type: application/x-www-form-urlencoded

name=Chris&score=32
When there is a non-zero content length, then the blank line ends the headers section and the server waits for (in this example) 19 additional bytes of payload. The payload shown above looks a lot like the query string part of a URL. That format is called x-www-form-urlencoded, but other content types are possible too.
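The relationship between the form fields, the payload, and the Content-Length header can be sketched as follows. This only builds the bytes of the request shown above (minus the User-Agent and Accept headers); it does not send anything.

```python
from urllib.parse import urlencode

fields = {"name": "Chris", "score": 32}
body = urlencode(fields).encode("ascii")   # b"name=Chris&score=32"

headers = {
    "Host": "localhost:3000",
    "Content-Type": "application/x-www-form-urlencoded",
    "Content-Length": str(len(body)),      # number of payload bytes
}

head = "POST /checkin2 HTTP/1.1\r\n" + "".join(
    f"{name}: {value}\r\n" for name, value in headers.items()
)
# The blank line separates the headers from the payload.
request = head.encode("ascii") + b"\r\n" + body

print(request.decode("ascii"))
```

Counting the payload in bytes rather than characters matters once a form contains non-ASCII text, although after URL-encoding the two counts are the same.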
In addition to GET and POST, HTTP supports several other request methods, but they are far less widely used. They are:
HEAD – identical to a GET, but omits the response body, so that we just receive meta-data about the resource such as its size and last-modified time (useful)
PUT – create or replace a named resource on the server (sometimes useful)
DELETE – remove a named resource from the server (sometimes useful)
PATCH – make partial updates to a resource on the server (rarely useful)
CONNECT – asks a proxy to open a tunnel to another host, typically so that secure (HTTPS) traffic can pass through it; not often used directly (ignore)
OPTIONS – determine what methods the server supports (ignore)
TRACE – asks the server to echo the request back, which helps diagnose what proxy servers along the way did to it (ignore)
Different subsets of these methods are defined to be safe and idempotent.
Safe means that the method should not have any significant consequence other than retrieval of information. Safe methods include GET and HEAD.
Idempotent means multiple identical requests should have the same effect as a single request. All the safe methods are also idempotent, but so are PUT and DELETE.
POST is explicitly not idempotent, which is why we reserve it for actions that absolutely should not be repeated, such as charging customers’ credit cards.
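One way to picture these two properties is as sets of method names. The sets below follow the HTTP specification (RFC 7231), which also counts OPTIONS and TRACE as safe; the helper may_retry is a hypothetical name for this sketch.

```python
# Methods that only retrieve information.
SAFE = {"GET", "HEAD", "OPTIONS", "TRACE"}

# Every safe method is idempotent, and so are PUT and DELETE.
IDEMPOTENT = SAFE | {"PUT", "DELETE"}

def may_retry(method):
    """True if repeating the request should have the same effect as sending it once."""
    return method.upper() in IDEMPOTENT

for method in ["GET", "PUT", "DELETE", "POST", "PATCH"]:
    print(method, may_retry(method))
```

A client or proxy could use a check like this to decide whether a request that timed out can be resent automatically.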
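After the server has received and processed the client’s request, it will issue its own response. Responses begin with a single status line that contains three parts:
The protocol version, such as HTTP/1.1.
A numeric status code, such as 200, 404, or 500.
A short textual description of that code, such as OK, Not Found, or Internal Server Error.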
Following the status line is a headers section. Commonly used headers include:
Content-Type: The type of data to be transmitted in the payload. Often text/html for an HTML page but can be text/plain or some other data format.
Expires: Provides a date until which this content can be cached (stored and reused without issuing a new request).
Server: A string that identifies the server software and version – serves the same purpose as the User-Agent header that identifies the client software.
Set-Cookie: Provides some data to be returned to the server on the next request, to help implement sessions – we’ll learn more about that soon.
Here is the status line and complete set of headers from the response to a request that my Firefox browser made to metamodpro.com for that language-codes page:
HTTP/1.1 200 OK
Cache-Control: post-check=0, pre-check=0
Connection: Keep-Alive
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
Date: Mon, 26 Jan 2015 22:06:27 GMT
Expires: Mon, 1 Jan 2001 00:00:00 GMT
Keep-Alive: timeout=5, max=100
Last-Modified: Mon, 26 Jan 2015 22:06:27 GMT
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Pragma: no-cache
Server: Apache
Transfer-Encoding: chunked
X-Content-Encoded-By: Joomla! 1.5
X-Powered-By: PHP/5.3.29
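A status line and headers like these can be taken apart with a few string operations. The function parse_response_head is a hypothetical helper, and the sample text is a shortened version of the response above.

```python
def parse_response_head(text):
    """Split the status line and headers of an HTTP response."""
    lines = text.splitlines()
    # The reason phrase may contain spaces, so split at most twice.
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line:              # blank line: end of the headers section
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers

sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Server: Apache\r\n"
    "\r\n"
)
version, code, reason, headers = parse_response_head(sample)
print(version, code, reason, headers)
```

Header names are lower-cased because they are case-insensitive. A plain dictionary keeps only the last value of a repeated header such as Set-Cookie, so real clients use a richer structure.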
The numeric response codes are organized into several categories, identified by the first digit. I’ll show only the best-known, most commonly used codes here; the complete list is available from Wikipedia or the W3C.
Codes beginning with 2 (indicated as 2xx) declare a successful transaction:
200 OK is the most common, by far. You can see that in the extended example above.
201 Created is used just to indicate the creation of a new resource on the server (used with the
Codes such as 3xx indicate some form of redirection – for example, they ask the client to repeat the request at a different URL.
302 Found provides a new URL in the Location header.
304 Not Modified is used when the client’s request contained an If-Modified-Since header, and the resource has not been modified, so the client should continue to use its cached copy.
Codes such as 4xx indicate an error on the part of the client.
400 Bad Request means that somehow the syntax of the request was not understood.
403 Forbidden means that access has been denied to the requested resource.
404 Not Found means the requested resource could not be found, but may be available again in the future.
Codes such as 5xx indicate an error on the part of the server.
500 Internal Server Error is a completely generic error message. If the server-side program crashes or experiences a run-time error, this is the typical response.
501 Not Implemented means the server lacks the ability to fulfill the request, but it may be implemented in the future.
503 Service Unavailable means the server is currently unavailable, because it is overloaded or down for maintenance.
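Because the category is determined by the first digit, a program can classify a code with integer division. The sketch below uses Python’s standard http.HTTPStatus for the official descriptions; the CATEGORIES table and the describe function are names made up for this example.

```python
from http import HTTPStatus

# Category names for the first digit, as described in these notes.
CATEGORIES = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}

def describe(code):
    """Return the code, its standard phrase, and its category."""
    category = CATEGORIES.get(code // 100, "other")
    return f"{code} {HTTPStatus(code).phrase} ({category})"

for code in [200, 302, 404, 503]:
    print(describe(code))
```

HTTPStatus raises a ValueError for a number that is not a registered status code, so a more defensive version would catch that.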
The curl command is an indispensable tool for web developers. It allows you to issue highly-customized HTTP requests directly from the command line, bypassing the web browser.
When you enter curl by itself at the terminal prompt, you should get this informational message:
curl: try 'curl --help' or 'curl --manual' for more information
The simplest use of curl is just to provide a URL on the command line. It will issue a GET request, and then dump the payload (response body) into your terminal. For example, try:
By specifying the -I option (that’s the capital letter that rhymes with ‘eye’), you instruct curl to do a HEAD instead of GET and show the response headers instead of the payload.
curl -I http://www.google.com/
On my system, the result was:
HTTP/1.1 200 OK
Date: Mon, 26 Jan 2015 22:52:58 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=4ae6a2df633f4c61:FF=0:TM=1422312778:LM=1422312778:S=WkV98-7Qk9-y01NS;expires=Wed, 25-Jan-2017 22:52:58 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=TCZT1yid-KNBAX4NqXJ8QVKIp48mGjzmBFYYE_d9rdvybLTQqNgsID13YmCssBG54kRC7kLAVeLokFrOBNmzh4-kfVX5C4LPXzi2DBFYvZUArv3yJS0aSqaE_uNSsPV0; expires=Tue, 28-Jul-2015 22:52:58 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic,p=0.02
Transfer-Encoding: chunked
Accept-Ranges: none
Vary: Accept-Encoding
The -v option will show the (mostly) complete conversation between client and server. For example, let’s try it, but also remove the www. from the URL, so we try to access http://google.com:
curl -v http://google.com
The result is large, but manageable. Lines starting with * are debugging messages from curl itself. Then data sent by the client is marked > and by the server is <. Finally, between <HTML> and </HTML> is the response body.
* Rebuilt URL to: http://google.com/
* Trying 188.8.131.52...
* Connected to google.com (184.108.40.206) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.40.0
> Host: google.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: http://www.google.com/
< Content-Type: text/html; charset=UTF-8
< Date: Mon, 26 Jan 2015 22:59:11 GMT
< Expires: Wed, 25 Feb 2015 22:59:11 GMT
< Cache-Control: public, max-age=2592000
< Server: gws
< Content-Length: 219
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
< Alternate-Protocol: 80:quic,p=0.02
<
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
* Connection #0 to host google.com left intact
You can see that this request resulted in a 301 Moved Permanently response, and the Location header tells us we should access http://www.google.com/ instead. (You can add the -L option to ask curl to follow these 3xx redirect messages automatically.)
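The decision a client makes when it sees a redirect can be sketched in a few lines. This is an illustration of the idea only, not how curl implements -L; the function next_url and the example.com URL are hypothetical.

```python
from urllib.parse import urljoin

def next_url(current_url, status, headers):
    """Return the URL to request next, or None if the response is not a redirect."""
    if 300 <= status < 400 and "Location" in headers:
        # Location may be relative, so resolve it against the current URL.
        return urljoin(current_url, headers["Location"])
    return None

# The 301 response from the conversation above.
print(next_url("http://google.com/", 301, {"Location": "http://www.google.com/"}))

# A relative Location on a made-up site.
print(next_url("http://example.com/a/b", 302, {"Location": "/c"}))
```

A real client also limits how many redirects it will follow, so that two pages redirecting to each other cannot trap it in a loop.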
Another fancy trick with curl is that we can force it to send particular headers in its request, using the -H option. You must put the header content in quotes. For example, if you want to use the Accept-Language header to get a Spanish version of the Google web page:
curl -H "Accept-Language: es" http://www.google.com/
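For comparison, here is roughly how the same request could be prepared in Python with the standard urllib.request module. The sketch only builds the request object and inspects it; nothing is sent over the network.

```python
from urllib.request import Request

# Prepare a GET request carrying one custom header.
req = Request("http://www.google.com/", headers={"Accept-Language": "es"})

method = req.get_method()                 # "GET", because there is no payload
sent_headers = dict(req.header_items())   # headers that would be sent

print(method, sent_headers)
# Passing req to urllib.request.urlopen() would actually issue the request.
```

urllib normalizes the capitalization of header names when storing them, which is harmless because HTTP header names are case-insensitive.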
We’ll cover one more trick that curl can do: specifying form parameters for a POST request, using -d and then a quoted variable assignment. This is part of how to do Check-in 2, so we’ll use the URL provided there as an example:
curl -v -d "name=My+Name" http://cs120.liucs.net/checkin2
With -v we get to see (almost) the entire conversation, but I’ll abbreviate it slightly here:
> POST /checkin2 HTTP/1.1
> User-Agent: curl/7.40.0
> Host: cs120.liucs.net
> Accept: */*
> Content-Length: 12
> Content-Type: application/x-www-form-urlencoded
>
< HTTP/1.1 400 Bad Request
< Server: Warp/3.0.5
< Content-Type: text/plain; charset=utf-8
<
InvalidArgs ["Missing required parameter: score"]
The request went through to the server, and curl sent 12 bytes of payload (count up the number of characters in name=My+Name). The response was 400 Bad Request because you are also expected to specify a parameter score. See the Check-in 2 spec for more details.
To specify additional parameters, use a separate -d for each.