Anatomy of a URL

From Well Designed Urls WiKi

Warning and advice

This document is incorrect. This document is too brief and approachable to be correct.

The most authoritative document on the syntax of URLs is RFC 3986, “Uniform Resource Identifier (URI): Generic Syntax”. The RFC Editor maintains the most authoritative version of RFC 3986 at <http://www.rfc-editor.org/rfc/rfc3986.txt>. Roy Fielding, one of the authors of RFC 3986, maintains an HTML version of “Uniform Resource Identifier (URI): Generic Syntax”.

Components of a URL

The major parts of a URL are called “components”. Each component is one of the following: scheme, authority, path, query, or fragment.

Syntax at a glance

URL components

Name      | Aliases                   | Necessity | Position                  | Emptiness | Leader | Trailer | Subcomponents
scheme    | protocol                  | mandatory | first                     | forbidden |        | :       |
authority | domain name               | optional  | after scheme              | allowed   | //     |         | userinfo, host, port
path      | file name, directory      | mandatory | after authority or scheme | allowed   |        |         | path segments
query     | query string, querystring | optional  | after path                | allowed   | ?      |         |
fragment  | anchor                    | optional  | last                      | allowed   | #      |         |

URL subcomponents

Name         | Parent    | Necessity                | Position in parent           | Emptiness | Leader | Trailer
userinfo     | authority | optional                 | first                        | allowed   |        | @
host         | authority | mandatory                | after userinfo or after “//” | allowed   |        |
port         | authority | optional                 | after host                   | allowed   | :      |
path segment | path      | mandatory and repeatable | all                          | allowed   | /      |

Syntax of a URL

Within a given URL, each component either appears once or does not appear at all; it is not possible for a given URL to have, say, two schemes.

Each URL has a scheme component. Each URL has a path component. The authority, query, and fragment components are optional in the generic syntax, with the proviso that the definitions of particular URL schemes may require or forbid the authority and query components.
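
The split into components described above can be sketched with the regular expression that RFC 3986 gives in its Appendix B. This is a minimal Python sketch, not a validator; the names URI_PATTERN, Components, and split_components are illustrative assumptions. The RFC's expression also handles relative references, which is why it treats the scheme as optional. A component that is absent comes back as None, and a component that is present but empty comes back as an empty string.

```python
import re
from typing import NamedTuple, Optional

# Regular expression from RFC 3986, Appendix B.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


class Components(NamedTuple):
    """Hypothetical container for the five components."""
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


def split_components(url: str) -> Components:
    """Split a URL into components without validating any of them."""
    # Every part of the pattern is optional, so it matches any string.
    m = URI_PATTERN.match(url)
    return Components(
        scheme=m.group(2),
        authority=m.group(4),
        path=m.group(5),       # always a string, possibly empty
        query=m.group(7),
        fragment=m.group(9),
    )


if __name__ == "__main__":
    print(split_components("http://en.wikipedia.org/wiki/URL#URLs_as_locators"))
    print(split_components("mailto:billg@microsoft.com"))
```

Because the groups that hold the leaders (“//”, “?”, “#”) are separate from the groups that hold the component text, the sketch can tell “http://example.com/?” (empty query) from “http://example.com/” (no query).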

The scheme component is the first component within a given URL. Many people mistakenly call a scheme component a “protocol”. This practice likely started because the definitions of several widespread URL schemes (“http”, “https”, “ftp”) identify particular protocols as authoritative mechanisms for resolving URLs of the respective schemes. Immediately following the scheme component is a colon (“:”, U+003A).

The authority component, if present, immediately follows a sequence of two slashes (“//”, U+002F U+002F); that sequence immediately follows the colon that follows the scheme component. An authority component has three subcomponents: userinfo, host, and port.

Each authority component has a host subcomponent. The userinfo and port subcomponents are optional in the generic syntax, with the proviso that the definitions of particular URL schemes may require or forbid the userinfo and port subcomponents.

The userinfo subcomponent, if present, immediately precedes a commercial-at sign (“@”, U+0040) which immediately precedes the host subcomponent. The userinfo subcomponent may be empty, in which case the commercial-at sign immediately follows the sequence of two slashes which precedes the authority component (“//@”, U+002F U+002F U+0040). If a URL has an authority component but does not have a userinfo subcomponent, then the authority component does not have a commercial-at sign (“@”, U+0040). An empty userinfo subcomponent is not equivalent to the absence of a userinfo subcomponent.

The port subcomponent, if present, immediately follows a colon (“:”, U+003A) which immediately follows the host subcomponent. The port subcomponent may be empty. If a port subcomponent is not empty, the port subcomponent comprises a sequence of decimal digits ({“0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, “9”}, {U+0030–U+0039}) representing an integer. The integer identifies the port (as in TCP or UDP) on which the host identified in the host subcomponent listens for requests on the resource which the entire URL identifies. If a URL has an authority component but does not have a port subcomponent, then the authority component does not have a colon (“:”, U+003A) following the host subcomponent. An empty port subcomponent and the absence of a port subcomponent have the same meaning.

The host subcomponent takes one of four forms: IPvFuture literal, IPv6 literal, IPv4address, or reg-name.

The IPvFuture literal form is for identifying Internet hosts by an address from some future version of the Internet Protocol. Except for testing conformance to the URL-syntax specification, nobody uses IPvFuture literal forms as of 2007.

The IPv6 literal form identifies Internet hosts by an address from Internet Protocol version 6. Briefly and not quite correctly, an IPv6 literal comprises an opening square bracket (“[”, U+005B), a sequence of hexadecimal digits and colons ({“0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, “9”, “a”, “b”, “c”, “d”, “e”, “f”, “A”, “B”, “C”, “D”, “E”, “F”, “:”}, {U+0030–U+0039, U+0061–U+0066, U+0041–U+0046, U+003A}), and a closing square bracket (“]”, U+005D).
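
A rough check for the bracketed form can lean on Python's standard ipaddress module. This is a sketch under assumptions: the function name is_ipv6_literal is hypothetical, and the address syntax that ipaddress accepts may differ in small ways from the grammar in RFC 3986, so treat the result as approximate.

```python
import ipaddress


def is_ipv6_literal(host: str) -> bool:
    """Return True if host looks like "[" + IPv6 address + "]"."""
    if not (host.startswith("[") and host.endswith("]")):
        return False
    try:
        # Strip the square brackets and let the standard library
        # decide whether the remainder is an IPv6 address.
        ipaddress.IPv6Address(host[1:-1])
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for candidate in ["[2001:db8::1]", "2001:db8::1", "[example.com]"]:
        print(candidate, is_ipv6_literal(candidate))
```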

The IPv4address form identifies Internet hosts by an address from Internet Protocol version 4. An IPv4address is a sequence which comprises a dec-octet, a period (“.”, U+002E), a second dec-octet, a second period (“.”, U+002E), a third dec-octet, a third period (“.”, U+002E), and a fourth dec-octet. Each dec-octet is a sequence of decimal digits ({“0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, “9”}, {U+0030–U+0039}) representing an integer in the range 0–255, inclusive. Except for the dec-octet representing zero, a dec-octet does not begin with the digit zero (“0”, U+0030). In other words, a dec-octet is one of the strings in the set {“0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, “9”, “10”, “11”, “12”, “13”, “14”, “15”, “16”, “17”, “18”, “19”, “20”, “21”, “22”, “23”, “24”, “25”, “26”, “27”, “28”, “29”, “30”, “31”, “32”, “33”, “34”, “35”, “36”, “37”, “38”, “39”, “40”, “41”, “42”, “43”, “44”, “45”, “46”, “47”, “48”, “49”, “50”, “51”, “52”, “53”, “54”, “55”, “56”, “57”, “58”, “59”, “60”, “61”, “62”, “63”, “64”, “65”, “66”, “67”, “68”, “69”, “70”, “71”, “72”, “73”, “74”, “75”, “76”, “77”, “78”, “79”, “80”, “81”, “82”, “83”, “84”, “85”, “86”, “87”, “88”, “89”, “90”, “91”, “92”, “93”, “94”, “95”, “96”, “97”, “98”, “99”, “100”, “101”, “102”, “103”, “104”, “105”, “106”, “107”, “108”, “109”, “110”, “111”, “112”, “113”, “114”, “115”, “116”, “117”, “118”, “119”, “120”, “121”, “122”, “123”, “124”, “125”, “126”, “127”, “128”, “129”, “130”, “131”, “132”, “133”, “134”, “135”, “136”, “137”, “138”, “139”, “140”, “141”, “142”, “143”, “144”, “145”, “146”, “147”, “148”, “149”, “150”, “151”, “152”, “153”, “154”, “155”, “156”, “157”, “158”, “159”, “160”, “161”, “162”, “163”, “164”, “165”, “166”, “167”, “168”, “169”, “170”, “171”, “172”, “173”, “174”, “175”, “176”, “177”, “178”, “179”, “180”, “181”, “182”, “183”, “184”, “185”, “186”, “187”, “188”, “189”, “190”, “191”, “192”, “193”, “194”, “195”, “196”, “197”, “198”, “199”, “200”, “201”, “202”, “203”, “204”, “205”, “206”, “207”, “208”, 
“209”, “210”, “211”, “212”, “213”, “214”, “215”, “216”, “217”, “218”, “219”, “220”, “221”, “222”, “223”, “224”, “225”, “226”, “227”, “228”, “229”, “230”, “231”, “232”, “233”, “234”, “235”, “236”, “237”, “238”, “239”, “240”, “241”, “242”, “243”, “244”, “245”, “246”, “247”, “248”, “249”, “250”, “251”, “252”, “253”, “254”, “255”}.
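
The dec-octet rule above (decimal digits, range 0–255, no leading zero except for “0” itself) can be written as a short regular expression instead of an enumeration. The following Python sketch is illustrative only; the names DEC_OCTET, IPV4ADDRESS, is_dec_octet, and is_ipv4address are assumptions made for the example.

```python
import re

# One alternative per range: 250-255, 200-249, 100-199, 10-99, 0-9.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# Four dec-octets separated by three periods.
IPV4ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def is_dec_octet(text: str) -> bool:
    """Return True if text is one of the strings "0" through "255"."""
    return re.fullmatch(DEC_OCTET, text) is not None


def is_ipv4address(text: str) -> bool:
    """Return True if text has the IPv4address form described above."""
    return IPV4ADDRESS.fullmatch(text) is not None


if __name__ == "__main__":
    # The pattern accepts exactly the 256 strings in the enumeration.
    print(all(is_dec_octet(str(n)) for n in range(256)))
    print(is_dec_octet("01"), is_dec_octet("256"))
    print(is_ipv4address("192.0.2.1"), is_ipv4address("192.0.2.01"))
```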

The reg-name form is a name from some registry of names. Most URL schemes that use reg-name forms accept or require an Internet domain name. A reg-name may be empty.

A host subcomponent may be an empty reg-name, an authority component may omit the userinfo subcomponent, and an authority component may omit the port subcomponent. Therefore an authority component may be empty, in which case the path component immediately follows the sequence of two slashes which precedes the authority component.
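
The rules for the authority subcomponents can be sketched as a small splitting function. This is a minimal Python sketch, not a full validator; the name split_authority is hypothetical, and the function does not check the host beyond locating a bracketed literal. An absent subcomponent is reported as None and an empty one as an empty string, so the two cases stay distinct.

```python
from typing import Optional, Tuple


def split_authority(
    authority: str,
) -> Tuple[Optional[str], str, Optional[str]]:
    """Split an authority component into (userinfo, host, port)."""
    userinfo: Optional[str] = None
    hostport = authority
    if "@" in authority:
        # The commercial-at sign separates userinfo from the host.
        userinfo, _, hostport = authority.rpartition("@")

    port: Optional[str]
    if hostport.startswith("["):
        # Bracketed literal: colons inside the brackets belong to the host.
        end = hostport.find("]")
        if end == -1:
            raise ValueError("unterminated bracketed host")
        host, rest = hostport[: end + 1], hostport[end + 1 :]
        if rest == "":
            port = None
        elif rest.startswith(":"):
            port = rest[1:]
        else:
            raise ValueError("unexpected text after bracketed host")
    elif ":" in hostport:
        host, _, port = hostport.rpartition(":")
    else:
        host, port = hostport, None

    # A non-empty port must consist of ASCII decimal digits only.
    if port and not (port.isascii() and port.isdigit()):
        raise ValueError("port must be a sequence of decimal digits")
    return userinfo, host, port


if __name__ == "__main__":
    print(split_authority("store.apple.com"))
    print(split_authority("@example.com:"))
    print(split_authority(""))
```

The last call shows the empty authority: no userinfo, an empty host, and no port.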

The path component may be empty. If the path component is not empty and follows an authority component, the path component must begin with a slash (“/”, U+002F). The subcomponents of a path component are path segments. A single slash (“/”, U+002F) is the separator between path segments.

Path segments may be empty. Therefore a path component may have a sequence of two or more slashes and may end in a slash (“/”, U+002F). An empty path segment is not equivalent to the absence of that path segment.
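
A short Python sketch shows how empty path segments survive a split on the slash. The function name path_segments is hypothetical, and treating an empty path component as having no segments at all is an assumption made for this example.

```python
from typing import List


def path_segments(path: str) -> List[str]:
    """Split a path component into its segments, keeping empty ones."""
    if path == "":
        return []  # assumption: an empty path has no segments
    if path.startswith("/"):
        # The leading slash introduces the first segment; it is not
        # itself preceded by an empty segment.
        path = path[1:]
    return path.split("/")


if __name__ == "__main__":
    print(path_segments("/wiki/URL"))
    print(path_segments("/wiki/URL/"))
    print(path_segments("/wiki//URL"))
```

The second and third calls each produce a list with an empty string in it, which is why “/wiki/URL” and “/wiki/URL/” are different paths.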

In the general case, an empty path component is not equivalent to any non-empty path component. The definitions of URL schemes may specify a particular non-empty path as equivalent to an empty path component; the equivalence holds for that scheme only.

The query component, if present, immediately follows a question mark (“?”, U+003F) which immediately follows the path component. Many people call a query component a “query string” or “querystring”. A query component may be empty. An empty query component is not equivalent to the absence of a query component. A query component has no subcomponents, but, because of the legacy of HTML forms, many people talk about “query string parameters” or “query parameters”.
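
Although the generic syntax gives the query component no subcomponents, the HTML-form convention of “name=value” pairs joined by ampersands is common enough that Python's standard library can decode it. The sketch below uses urllib.parse.parse_qsl; the wrapper name query_parameters is an assumption made for the example, and the convention it decodes comes from HTML forms, not from the generic URL syntax.

```python
from typing import List, Tuple
from urllib.parse import parse_qsl


def query_parameters(query: str) -> List[Tuple[str, str]]:
    """Decode a query component as HTML-form style name/value pairs."""
    # keep_blank_values=True keeps pairs such as "b=" whose value is empty.
    return parse_qsl(query, keep_blank_values=True)


if __name__ == "__main__":
    print(query_parameters("family=iMac"))
    print(query_parameters("a=1&b=&a=2"))
```

The result is a list of pairs, not a dictionary, because a name may repeat and the order may matter.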

The fragment component, if present, immediately follows a number sign (“#”, U+0023). If a query component is present, the number sign immediately follows the query component. If the query component is absent, the number sign immediately follows the path. People sometimes call a fragment component an “anchor”. A fragment component may be empty. An empty fragment component is not equivalent to the absence of a fragment component. A fragment component has no subcomponents.
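
The difference between an empty and an absent query or fragment can be made concrete with a few lines of Python. This is a sketch; the function name split_query_fragment is hypothetical. It returns None for an absent component and an empty string for an empty one.

```python
from typing import Optional, Tuple


def split_query_fragment(
    url: str,
) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (everything before the query, query, fragment)."""
    # The fragment starts at the first number sign.
    rest, number_sign, fragment = url.partition("#")
    # The query starts at the first question mark before the fragment.
    rest, question_mark, query = rest.partition("?")
    return (
        rest,
        query if question_mark else None,
        fragment if number_sign else None,
    )


if __name__ == "__main__":
    for url in ["http://example.com/a", "http://example.com/a?",
                "http://example.com/a#", "http://example.com/a?#"]:
        print(url, split_query_fragment(url)[1:])
```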

Examples

“http” URL with Scheme, Authority (Host), Path and Query

For example:

http://store.apple.com/1-800-MY-APPLE/WebObjects/AppleStore?family=iMac

The above gives us:

  • Scheme = http
  • Authority = store.apple.com
    • Userinfo is absent
    • Host = store.apple.com
    • Port is absent
  • Path = /1-800-MY-APPLE/WebObjects/AppleStore
  • Query = family=iMac
  • Fragment is absent

“http” URL with Scheme, Authority (Host), Path and Fragment

As another example:

http://en.wikipedia.org/wiki/URL#URLs_as_locators

This gives us:

  • Scheme = http
  • Authority = en.wikipedia.org
    • Userinfo is absent
    • Host = en.wikipedia.org
    • Port is absent
  • Path = /wiki/URL
  • Query is absent
  • Fragment = URLs_as_locators

“mailto” URL with Scheme and Path

And as a final example:

mailto:billg@microsoft.com

This gives us:

  • Scheme = mailto
  • Authority is absent
  • Path = billg@microsoft.com
  • Query is absent
  • Fragment is absent
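
The three breakdowns above can be reproduced with urllib.parse.urlsplit from Python's standard library. This is a sketch for comparison only; the helper name components is an assumption. Note that urlsplit calls the authority “netloc” and reports an absent authority, query, or fragment as an empty string, so it does not by itself distinguish “absent” from “empty”.

```python
from typing import Tuple
from urllib.parse import urlsplit

EXAMPLES = [
    "http://store.apple.com/1-800-MY-APPLE/WebObjects/AppleStore?family=iMac",
    "http://en.wikipedia.org/wiki/URL#URLs_as_locators",
    "mailto:billg@microsoft.com",
]


def components(url: str) -> Tuple[str, str, str, str, str]:
    """Return (scheme, authority, path, query, fragment) via urlsplit."""
    parts = urlsplit(url)
    return (parts.scheme, parts.netloc, parts.path,
            parts.query, parts.fragment)


if __name__ == "__main__":
    for url in EXAMPLES:
        print(components(url))
```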