The http: URL Format

This article describes the http: URL format, which is used for URLs that access HTTP resources (collectively known as the web).

What it is used for

The http scheme is used for accessing documents, resources and services through the HTTP protocol, which carries most of the information retrieval network known today as the World Wide Web. HTTP stands for Hypertext Transfer Protocol. The http scheme was historically described by IETF RFC 1738, and is a registered scheme with IANA.

General http: URL Format

The http: scheme format is generally of the form:

http://user:password@hostname:port/folders/subfolders/document;parameter?query#fragment-or-ajax
----   ------------- -------- ---- ------------------ -------- --------- ----- ----------------
(1)    (2)           (3)      (4)  (5)                (6)      (7)       (8)   (9)
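
As a rough illustration, the layout above can be taken apart with Python's standard urllib.parse module. This is only a sketch: the literal "port" in the diagram is replaced by an assumed numeric port (8080), since a port has to be a number, and the dictionary name "components" is chosen here purely for illustration.

```python
from urllib.parse import urlparse

# Same layout as the diagram above; 8080 is an assumed numeric port.
url = ("http://user:password@hostname:8080"
       "/folders/subfolders/document;parameter?query#fragment-or-ajax")

parts = urlparse(url)

# Map each parsed attribute back to the numbered parts of the diagram.
components = {
    "scheme": parts.scheme,      # (1)
    "username": parts.username,  # (2)
    "password": parts.password,  # (2)
    "hostname": parts.hostname,  # (3)
    "port": parts.port,          # (4)
    "path": parts.path,          # (5) and (6) together
    "params": parts.params,      # (7)
    "query": parts.query,        # (8)
    "fragment": parts.fragment,  # (9)
}

for name, value in components.items():
    print(f"{name:9} {value!r}")
```

Note that the parser returns the directory location and the document name together as one path; splitting them is left to the caller.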

1. HTTP Scheme

Normally this will be "http", but it will be "https" if the HTTP protocol is wrapped in an SSL/TLS secure transport (often used by e-commerce sites to provide secure transactions).

2. User Authentication

This provides a user name and password for logging into sites that use the HTTP Auth method. This type of authentication is very rare on user-facing web sites, which use POST forms instead, but it is more common on backend web systems with API access.

Because the "at" (@) symbol is easy to overlook among the query strings of common URLs, this feature has been exploited to spoof sites with another server: the name of the spoofed site is shown after the "http://" but before the "@", with no slash in between, so the host that is actually contacted is the one after the "@".
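
The spoofing trick can be made visible with a short sketch using Python's urllib.parse. The URL and the helper name contacted_host are hypothetical, chosen for illustration only; the point is that the familiar-looking name ends up in the user part, not in the hostname.

```python
from urllib.parse import urlsplit

def contacted_host(url):
    """Return the hostname a user agent would actually connect to."""
    return urlsplit(url).hostname

# Hypothetical spoofing URL: the trusted-looking name sits before the "@".
spoofed = "http://www.example.com@attacker.example/login"

parts = urlsplit(spoofed)
print(parts.username)           # www.example.com
print(contacted_host(spoofed))  # attacker.example
```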

3. Hostname

The DNS-resolvable hostname of the server with which the HTTP communication will take place. For historical reasons alone, many HTTP URL hostnames have had "www" as a subdomain (and FTP URLs had "ftp" as a subdomain), as DNS-based obscuring and load balancing were not common. In fact, the DNS hierarchy originated with organizational systems as subdomains, with sub-subdomains for machines under different departments, and so on. Many user-based systems such as forums automatically treat text fragments that start with "www." and end with ".com" (or another TLD) as URLs. The "www" is usually optional, although some sites may be configured incorrectly such that either the www or the non-www form won't resolve.

4. Port

The ":port" part is optional and rarely used, except for internal system design purposes (specialized web server software running behind other web servers such as Apache), cases where desktop software exposes a browser-usable interface, and other situations where use of the privileged port 80 isn't possible.

Due to the possibility of "cross-protocol exploits", Firefox will block URLs that use certain port numbers.
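
When the port is left out, the scheme's default applies: 80 for http and 443 for https. A minimal sketch of that fallback follows; the table DEFAULT_PORTS and the helper effective_port are names invented for this example, not part of any library.

```python
from urllib.parse import urlsplit

# Well-known default ports for the two schemes discussed in this article.
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url):
    """Return the explicit port if present, else the scheme's default."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)

print(effective_port("http://hostname/document"))       # 80
print(effective_port("https://hostname/document"))      # 443
print(effective_port("http://hostname:8080/document"))  # 8080
```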

5. Directory Location

This portion is significant because a relative URL is resolved against this part of the current document's URL to determine where the other resource is located. Slashes are the explicit delimiters in a directory location. However, the advertised directory location may not necessarily map to the web server's file system, or to a file system at all, as is the case for rewritten URLs.
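
The way relative URLs depend on the directory location can be sketched with urljoin from Python's urllib.parse. The base URL reuses the placeholder names from the general format above, and the relative references are made up for illustration.

```python
from urllib.parse import urljoin

# URL of the current document, using the placeholder names from above.
base = "http://hostname/folders/subfolders/document"

# A bare name replaces the document but keeps the directory location.
same_dir = urljoin(base, "other")
# ".." climbs one directory level before descending again.
sibling = urljoin(base, "../sibling/page")
# A leading slash discards the directory location entirely.
from_root = urljoin(base, "/top")

print(same_dir)   # http://hostname/folders/subfolders/other
print(sibling)    # http://hostname/folders/sibling/page
print(from_root)  # http://hostname/top
```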

6. Document Name

This is the part of the URL that names the document, resource or service. Sometimes this part of the URL string will be used as the suggested file name if no "Content-Disposition" information is available. It may also hint at the content type to the user agent when MIME information fails, for example through the extension used (".html", ".pdf", etc). Use of extensions is completely optional, and they are sometimes hidden to improve URL readability and to hide details of the web server implementation.
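
A user agent's fallback on the document name might look roughly like the following sketch, which uses Python's standard mimetypes table. The helper guess_from_url and the example URLs are hypothetical; real browsers apply their own, more involved rules.

```python
import mimetypes
import posixpath
from urllib.parse import urlsplit

def guess_from_url(url):
    """Return (document name, guessed MIME type or None) for a URL."""
    path = urlsplit(url).path
    name = posixpath.basename(path)  # last path segment
    content_type, _encoding = mimetypes.guess_type(name)
    return name, content_type

print(guess_from_url("http://hostname/folders/report.pdf"))
print(guess_from_url("http://hostname/folders/document"))
```

A name without an extension yields no guess, which matches the point that extensions are optional.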

7. Semicolon Parameters

Some web services use the semicolon parameter section to store session information, as well as other mode-like information for the current user agent and/or document. An example of semicolon parameter use is found in servlets that use the PLT Scheme web server.
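
In Python's urllib.parse, the semicolon parameter is only separated out by urlparse; urlsplit leaves it attached to the path. The sketch below shows the difference. The parameter "session=abc123" is an invented example, not the format of any particular server.

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL carrying session information as a semicolon parameter.
url = "http://hostname/app/document;session=abc123?view=full"

parsed = urlparse(url)  # splits the parameter off the last path segment
split = urlsplit(url)   # keeps the parameter inside the path

print(parsed.path)    # /app/document
print(parsed.params)  # session=abc123
print(split.path)     # /app/document;session=abc123
```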

8. Query Parameters

Sometimes known as the query string, this part is usually produced by the user agent from a GET form, such as those used by search engines, selector apps, etc. Some sites also link directly to URLs with query parameters, a practice slightly discouraged by search engine indexers; this is the case with web-driven systems like forums and blogs that don't have URL rewriting set up.

More recently, some search engines such as Google have begun to index through GET forms, so design a web service with that in mind.

Very early on, the query parameter was simply text sent to the server to retrieve the results of a search using <ISINDEX>; this usage is now obscure and deprecated. Because of it, however, the "query string" can actually be just flat text with no key/value convention, depending on the purpose.
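
The difference between a form-style query and a flat-text query can be sketched as follows. Both URLs are invented for illustration; the second imitates an old-style keyword search, where the words are joined by "+" and there are no keys at all.

```python
from urllib.parse import parse_qs, unquote_plus, urlsplit

# Hypothetical URLs: one produced by a GET form, one with flat search text.
form_url = "http://hostname/search?q=url+format&page=2"
flat_url = "http://hostname/index?url+format"

form_query = urlsplit(form_url).query
flat_query = urlsplit(flat_url).query

# Key/value parsing works for the form-style query...
form_fields = parse_qs(form_query)
# ...but finds nothing in the flat one, which has no "=" separators.
flat_fields = parse_qs(flat_query)
# The flat text has to be decoded as plain keywords instead.
keywords = unquote_plus(flat_query).split()

print(form_fields)  # {'q': ['url format'], 'page': ['2']}
print(flat_fields)  # {}
print(keywords)     # ['url', 'format']
```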

9. Document Fragment and AJAX Accessor

Document fragments were historically used in the hypertext realm as a way to link to a specific fragment of a document. After such a link is clicked, the browser usually "jumps" to the fragment.

Today, the document fragment is also used as an AJAX location, where one JavaScript-intensive document is fetched from the HTTP server, and further interactions and requests are handled by AJAX from within that document. Since this approach breaks the traditional hypertext linking system, using the fragment to refer to AJAX resources has become common practice.
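
Both uses rely on the fact that the fragment is handled by the user agent and is not sent to the server as part of the request. A small sketch with urldefrag from Python's urllib.parse separates the part that is requested from the part that stays in the browser; the URL reuses the placeholder names from the general format above.

```python
from urllib.parse import urldefrag

url = "http://hostname/folders/document?query#fragment-or-ajax"

# request_url is what gets fetched; fragment is kept by the user agent.
request_url, fragment = urldefrag(url)

print(request_url)  # http://hostname/folders/document?query
print(fragment)     # fragment-or-ajax
```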

The IE http: URL Format Limitations

Internet Explorer works with http URLs but, unlike other browsers, limits them to 2,083 characters, which could cause problems for URLs with large parameter or query portions.
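
A service that builds long query strings could guard against this limit with a simple length check, sketched below. The names IE_MAX_URL_LENGTH, build_url and fits_in_ie, as well as the example URLs, are assumptions made for this illustration.

```python
from urllib.parse import urlencode

IE_MAX_URL_LENGTH = 2083  # the limit quoted in the text above

def build_url(base, params):
    """Append URL-encoded query parameters to a base URL."""
    return base + "?" + urlencode(params)

def fits_in_ie(url):
    """Return True if the URL is within the Internet Explorer limit."""
    return len(url) <= IE_MAX_URL_LENGTH

short_url = build_url("http://hostname/search", {"q": "url format"})
long_url = build_url("http://hostname/search", {"q": "x" * 3000})

print(fits_in_ie(short_url))  # True
print(fits_in_ie(long_url))   # False
```

When the check fails, one option is to send the data in a POST form instead of the query string.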

Links

RFC 1738 contains information on the original URL schemes