URLs: A Magnificent Obsession
A couple of weeks ago, I admitted to a long-term obsession with search engines. This time around, I'm admitting to a long-term obsession with URLs. I guess I'm just a neurotically obsessed individual.
I'm into URLs for a number of reasons. First of all, without them, there's no hyperlink-based Web. This should be reason enough. Second, they are elegantly rule-based. They have a syntax. Third, they are functional in all sorts of ways. They determine the nature of the conversation between your browser and the Web server host of the page you're requesting. They can not only take you to a destination, but they can also query a database. They can call a script, which in turn can do any number of things. They can track sessions. They can point you to a proxy server to give you off-campus access to licensed research materials. They can lead you from sources to the targets of an OpenURL. And so on.
Despite all this, I don't know too many librarians (except for some catalogers and people who do proxy server configurations) who pay close attention to the URLs that they create. Librarians put them all over their Web pages but rarely test them when they do, and even more rarely correct them when they're broken. I always found this unacceptable in a Library 1.0 world. It's even less acceptable now.
A few years ago, I wrote an (entirely obsessive) article about URLs, "Issues in URL Management for Digital Collections" (PDF format), in Information Technology and Libraries. This helped to get some of these matters out of my system, but I'm not finished yet.
From what I've seen, these are the most common problems with URLs on library Web sites:
- Librarians leave off the trailing slash on directory URLs. Example: http://library.albany.edu/reference rather than http://library.albany.edu/reference/. My favorite URL-related article on the planet, Tom Dahm's 2001 "Load Time Tip: Use a Trailing Slash on Directory URLs", alerted me to this. If you leave off the trailing slash of a directory URL, the server will first look for a file with the name of the directory, rather than the default file name (such as index.html or default.htm) configured for that directory on the server. Finding no such file, it redirects the browser to the same URL with the slash added. In fact, good link checking programs will return a 301 "permanent redirect" error because of the missing slash. The problem is self-correcting, but the point is, it needs to be corrected. Leaving off the slash in a directory URL doubles the traffic between the browser and the server to complete the process of retrieving the default file. This wastes bandwidth and, potentially, the users' time because of the lag time for the extra trip. Wasting bandwidth is not good Internet citizenship.
- Librarians use absolute rather than relative URLs for links to pages located within the same Web site. Example: http://library.albany.edu/reference/ rather than /reference/. Using absolute URLs sends the browser request out to the campus DNS server and back into the Web site to retrieve the file. This is another waste of bandwidth. Also, if the base URL of the Web site changes, all your internal links will also need to be changed.
- If the library Web site has its own absolute root (/), librarians use relative URLs (../../images/image.jpg) rather than absolute local URLs (/images/image.jpg). There's nothing terribly wrong with this, except when you want to move the page containing this link. In this case, you'll have to modify all the relative URLs. If these URLs were absolute local ones, nothing would need to be changed.
- And then there's the matter of starting point URLs. (As far as I can tell, this is a term popularized by EZproxy's Chris Zagar.) Starting point URLs are the URLs to put on your Web pages. They become destination URLs on connection. These latter can be non-durable, especially in the case of databases and e-journals. This might be because, for example, session information becomes embedded within the URL. When the session no longer applies, the URL won't connect. You always need to use a starting point URL.
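To make a couple of these points concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any real server: the DIRECTORIES set is a made-up stand-in for the folders on a Web server, and the two page locations are hypothetical. The first part models the extra round trip caused by a missing trailing slash. The second part uses the standard library's urljoin to show how document-relative and root-relative links resolve before and after a page is moved.

```python
from urllib.parse import urljoin

# Part 1: a toy model of the trailing-slash round trip.
# DIRECTORIES is a hypothetical stand-in for the folders on a Web server.
DIRECTORIES = {"/reference"}


def serve(path):
    """Answer a request the way a typical server treats a directory path."""
    if path in DIRECTORIES:
        # Slash missing: redirect the browser to the slash form.
        return 301, path + "/"
    return 200, path


def round_trips(path):
    """Count the requests sent before the server answers with a 200."""
    trips = 1
    status, location = serve(path)
    while status == 301:
        trips += 1
        status, location = serve(location)
    return trips


# Part 2: how links resolve against the page that contains them.
# Both page locations are made up for illustration.
OLD_PAGE = "http://library.albany.edu/reference/guides/index.html"
NEW_PAGE = "http://library.albany.edu/services/instruction/tutorials/index.html"


def resolve(page_url, link):
    """Resolve a link against the URL of the page it appears on."""
    return urljoin(page_url, link)


if __name__ == "__main__":
    print(round_trips("/reference"), round_trips("/reference/"))
    for page in (OLD_PAGE, NEW_PAGE):
        print(resolve(page, "../../images/image.jpg"))
        print(resolve(page, "/images/image.jpg"))
```

In this sketch, the slashless path takes two requests and the slashed path takes one. The document-relative link points to /images/image.jpg from the old page but to /services/images/image.jpg once the page has moved, while the root-relative link resolves to the same address from both locations.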
These are universal issues. My library also has a few local URL issues of its own, ones that some other libraries are dealing with, too. The first can be found on our Web site. We've set up a redirect URL script that saves librarians from the need to update the URL of a database if it changes. If you use this specialized URL, the link will point to a master database record maintained by one person (me). Example: http://library.albany.edu/databases/libresre.asp?resourceid=413. In this case, record 413 contains the current URL, and it will load into browsers whenever the script is called from anywhere on the Web site. As vendors update their URLs, or as resources switch from one vendor to another, only one person has to worry about it. This has worked out really well. The only issue has been librarian training, and this hasn't been a problem.
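The idea behind a redirect script like this can be sketched in a few lines of Python. This is only an illustration, not our actual script, which is an ASP page: the RESOURCES table, the handle function, and both vendor addresses are hypothetical placeholders. Only the resourceid parameter and the ID 413 come from the example above.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical master table: resource ID -> current vendor URL.
# The vendor addresses are placeholders, not real database URLs.
RESOURCES = {
    413: "http://vendor.example.com/database/",
}


def handle(request_url):
    """Return (status, headers) for a URL such as ...libresre.asp?resourceid=413."""
    query = parse_qs(urlsplit(request_url).query)
    try:
        resource_id = int(query["resourceid"][0])
    except (KeyError, ValueError):
        # Parameter missing or not a number.
        return 400, {}
    target = RESOURCES.get(resource_id)
    if target is None:
        return 404, {}
    # A temporary redirect, so browsers keep asking the script for the
    # current target instead of remembering the old one.
    return 302, {"Location": target}


if __name__ == "__main__":
    link = "http://library.albany.edu/databases/libresre.asp?resourceid=413"
    print(handle(link))
    # The vendor changes its address: one record is edited, no links change.
    RESOURCES[413] = "http://newvendor.example.com/db/"
    print(handle(link))
```

The design choice worth noticing is that the link on the Web page never changes. Only the one record behind it does.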
Another thing we've done is to proxy our OPAC. We use EZproxy for off-campus authentication, and it has an amazing feature, called OPAC proxying. This feature allows us to proxy the OPAC URL so that the many thousands of URLs within the catalog will become transformed automatically to point to the proxy server for off-campus authentication. This feature saves catalogers from needing to prepend the EZproxy string in front of umpteen URLs. However, it also results in an ungainly URL for our catalog. In our case, it's http://[our-proxy-address]/login?url=http://[our-catalog-address]. This looks great in advertising flyers and on Web pages, and is so easy to explain at the reference desk and to type into Web pages...NOT! Besides, this URL also sends on-campus catalog requests through the proxy server, and that's a waste of resources. So, we created a simple script that is called from a handsome URL, http://minerva.albany.edu/. This script redirects to the catalog server for on-campus users, and sends the browser request through the proxy server for off-campus users. This is well and good, except that some librarians forget to use the "minerva" URL and instead use the direct URL. This has the effect of preventing off-campus users from accessing our licensed resources.
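The logic of such a script might look roughly like the following Python sketch. It assumes that "on campus" can be decided from the client's IP address, which may not match how every site does it. The network range, the catalog address, and the proxy address are all placeholders, since the real ones are not given here.

```python
import ipaddress

# All three values are hypothetical placeholders.
CAMPUS_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
CATALOG_URL = "http://catalog.example.edu/"
PROXY_LOGIN = "http://proxy.example.edu/login?url="


def catalog_redirect(client_ip):
    """Pick the redirect target for a visitor to the short catalog URL."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in CAMPUS_NETWORKS):
        # On campus: go straight to the catalog and skip the proxy server.
        return CATALOG_URL
    # Off campus: go through the proxy so the links inside the catalog
    # are rewritten for authentication.
    return PROXY_LOGIN + CATALOG_URL


if __name__ == "__main__":
    print(catalog_redirect("192.0.2.17"))
    print(catalog_redirect("203.0.113.5"))
```

Everyone types the same short URL, and the script decides which route each visitor takes.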
We also proxy our electronic reserves system, ERes, but - on request from our ERes administrator - without an underlying redirect script. So the complete URL is http://[our-proxy-address]/login?url=http://eres.ulib.albany.edu/. If you click on this URL, it becomes http://eres.ulib.albany.edu.[our-proxy-address]/eres/default.aspx. This URL has been rewritten by EZproxy into its destination URL, and there is no guarantee that it will work for off-campus users. You've got to use the former, starting point URL. Not everyone in my library is remembering to do this.
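A link audit for this problem could be sketched as follows. The proxy host name is a placeholder, and the check assumes that rewritten destination URLs always take the form shown above, with the proxy address appended to the original host name. The function names are mine, not anything provided by EZproxy.

```python
from urllib.parse import urlsplit

PROXY_HOST = "proxy.example.edu"  # hypothetical placeholder


def starting_point(target_url):
    """Build the starting point URL for a target, in the form shown above."""
    return "http://" + PROXY_HOST + "/login?url=" + target_url


def is_rewritten_destination(url):
    """True when the host name ends with the proxy address, as in a rewritten URL."""
    host = urlsplit(url).hostname or ""
    return host.endswith("." + PROXY_HOST)


def audit(links):
    """Return the links that look like rewritten destination URLs."""
    return [link for link in links if is_rewritten_destination(link)]


if __name__ == "__main__":
    links = [
        starting_point("http://eres.ulib.albany.edu/"),
        "http://eres.ulib.albany.edu." + PROXY_HOST + "/eres/default.aspx",
    ]
    print(audit(links))
```

Run against the two links in the example, the audit flags only the second one, the rewritten destination URL that someone copied from the browser's address bar.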
Believe me, I'm not blaming anyone. I'm merely pointing out that URL management is a complicated matter and needs to be respected as an issue worthy of some care. You don't have to be obsessed like me in order to give this some thought - though it wouldn't hurt!
If Library 2.0 is all about users, then we need to pay attention to the URLs that get these users to the resources they need. It may not seem terribly exciting to concern ourselves with URL structures, but the effect of ignoring this issue could be lag times, complaints from off-campus users, broken links when relocating Web pages, and a sloppy infrastructure. In a Library 2.0 world, who needs that?

Comments
Thanks for your paper. I agree with the rules. Just two comments on your statements:
1. trailing slash
It's correct, another HTTP request is necessary. However, with today's persistent connections (HTTP/1.1) or keep-alive (HTTP/1.0), that's not an issue. Usually loading an HTML page also requires subsequent HTTP requests anyway (e.g. for CSS file, JS file, background image, inline images, ...).
2. relative URLs
For maintenance reasons relative URLs should be preferred, that's ok. The browser expands relative URLs to absolute URLs. Regarding DNS calls there is absolutely no difference between the use of relative or absolute URLs.
Just my 0.02,
MFG
Posted by: MFG | February 20, 2007 12:40 PM
MFG, Thank you. I hear what you're saying, but I'm still comfortable recommending the trailing slash. Any improved efficiency in processing browser requests is a good thing. The Web server logs for my library's public site are filled with 301 errors - nearly 250,000 last year. In my view, an error is an error! The link checking software used in my library (LinkScan) returns a warning for all URLs missing the slash. It claims that the omission may cause significant problems for some users that access the Web via proxy servers. I realize this isn't referring to the EZproxy type of proxy server.
I'm curious about your second statement, since the browser's DNS cache is limited.
Interesting issues.
Posted by: Laura Cohen | February 20, 2007 02:47 PM
Alright, you consider code 301 an error code. I don't. IMHO only 4xx and 5xx codes are error codes. 3xx are just redirects (see RFC below). BTW another popular 3xx code to be found in log files is "304 Not Modified" (a possible answer to "GET If-Modified-Since" requests from proxy/cache servers or browsers to check whether the cached copy of a document is still fresh).
See also RFC 2616 HTTP/1.1 http://www.ietf.org/rfc/rfc2616.txt
or RFC 1945 HTTP/1.0 http://www.ietf.org/rfc/rfc1945.txt
Redirection 3xx
This class of status code indicates that further action needs to be
taken by the user agent in order to fulfill the request. The action
required may be carried out by the user agent without interaction
with the user if and only if the method used in the subsequent
request is GET or HEAD.
...
Client Error 4xx
...
Server Error 5xx
...
MFG
Posted by: MFG | February 20, 2007 04:54 PM
MFG, I see where you're coming from. By error, I mean that a correction, i.e., a further action, must take place in order for the file to be retrieved due to faulty URL construction. Code 304 represents the activity of a cache, and does not occur due to poor infrastructure.
Posted by: Laura Cohen | February 20, 2007 07:29 PM
Hi,
With regard to relative URLs - our CMS, Drupal, detects and deletes any absolute roots or URLs for us. So although I avoid writing them in the first place, my colleagues (who may be less tech-savvy) don't have to worry about it.
With regard to /news/, is there an issue here to do with clean URLs? For example, we use them for blog entries: "/team-blog/feb-links" So here, it's not a directory with an index file but an actual URL. If you don't use clean URLs, can you do something to .htaccess in order to pre-empt the redirect? Here my knowledge is thinning!
Your blog's just been recommended to me this morning by a colleague, thought I should mention that to explain the sudden influx of comments by me today!
Cheers
Jb
Posted by: James Brown | March 15, 2007 06:57 AM
James, Good for Drupal! URL management is definitely a feature to look for in a good CMS, and it's good to know that Drupal pays attention to this.
I'm not sure I can answer your second point. If something like "/team-blog/feb-links" queries a database to get the content, then a final slash would probably prevent the query from running. Also, I work on Windows servers, so I can't speak to the possibilities with .htaccess files. In the Windows server world, I don't know of anything that would pre-empt a redirect. But mine is hardly the final word on this!
Posted by: Laura Cohen | March 15, 2007 11:01 AM
Hi Laura,
I can't take the credit for choosing Drupal (it was installed on our website before I arrived!), but it's been pretty good.
It must be really quite clever since the trailing slash actually makes no difference to the page showing. Not sure what's going on behind the scenes there. I think my insight on the matter should come to an end since I don't want to make claims about knowledge I don't have!
Let me summarise what I wanted to say at the start - you're absolutely right that URL management is an issue people should be thinking about; and I'm lucky that Drupal does much of the work for me. If I was somewhere else, I might think about issuing a memo to any web-authors/editors - with my fingers crossed that people would take notice. I suppose from a monetary pov you could argue that decreased bandwidth and decreased resources spent implementing later redirects both help the bottom line.
Cheers
Jb
Posted by: James Brown | March 15, 2007 11:55 AM
I am like librarians. I never believed in putting anything extra at the end of a URL. But you are right. I tested it and it's faster. Thanks.
Posted by: John Cooley | May 28, 2007 09:35 AM