Apache mod_rewrite - making your dynamic content look static

mod_rewrite: how to get those dynamic pages cached

The Problem: My pages are not getting cached

The reason: Dynamically generated web pages can trap web robots that follow links. The robot ends up following an endless chain of hyperlinks, because every generated page contains links to yet more generated pages.

Why is it important for a web bot or spider to cache my page?

Modern web bots and spiders take a snapshot of the page they visit. This is the cached version of your page. The content of the page is then used to evaluate its ranking or applicability to the search query. If your pages don't get crawled, they won't get indexed. If they don't get indexed, people won't know about them. If people don't know about them, then there's no point in maintaining a website.

Be careful which dynamic pages you open up to bots.

If a bot does get trapped in your dynamic pages, for you it could be a horror trip into the land of "web bot has swallowed my monthly transfer of megabytes", and it can seriously mess up any statistical analysis of your web logs.

How do web bots and spiders avoid getting trapped in a website?

Web bots and spiders combat this in a number of ways:

  1. They fully ignore links to CGI scripts by taking note of the ? and skipping any URL that carries an HTTP query string.
  2. They limit the CGI pages crawled by crawling only to a certain depth or by limiting the number of links followed from a page.
  3. They ignore DHTML and anything else in HTML pages that builds links with JavaScript. That is why linking via JavaScript is a bad idea. Try surfing your DHTML menus in Lynx!
  4. They are careful of pages with specific query string phrases, e.g. ?date=.... or ?session=...
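
The four tactics above can be pictured as a small URL filter. This is only a rough Python sketch, not the logic of any real bot. The name should_crawl, the depth limit of 5 and the list of suspicious parameter names are all assumptions made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical values, chosen only for this illustration.
MAX_DEPTH = 5
SUSPICIOUS_PARAMS = {"date", "session"}


def should_crawl(url, depth=0, skip_all_queries=True):
    """Return True if a cautious bot would follow this link."""
    # Tactic 2: stop once the link is too many hops from the start page.
    if depth > MAX_DEPTH:
        return False
    query = urlsplit(url).query
    # A URL without a query string looks static and is followed.
    if not query:
        return True
    # Tactic 1: the strictest bots skip anything containing a "?".
    if skip_all_queries:
        return False
    # Tactic 4: more lenient bots skip only known trouble-makers.
    params = parse_qs(query, keep_blank_values=True)
    return not (SUSPICIOUS_PARAMS & set(params))
```

With the strict default, widget.cgi?color=B&size=10 is skipped while widget_B_10.html is followed, which is exactly the gap the rest of this article closes.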

Making the website seem static

My website is dynamic because all my products come from a database, and I only have 600 products, which equate to 600 pages. Is there no way I can get them cached without converting them all into static HTML pages?

Yes. Use mod_rewrite on your Apache server to make the web bot think all your pages are static HTML. The mod_rewrite module is the Swiss Army Knife of URL manipulation! The idea is to replace the CGI form of each link with a structured, static-looking one. When the Apache web server receives such a URL, it uses a regular expression to rewrite it back into the CGI script URL.

I am still not with you. Can you show me an example?

Example. You sell widgets and you have them categorized in your database by color and size. Usually your CGI links to the widget description with the following URL:

widget.cgi?color=B&size=10

The web bots are refusing to cache this because it's a dynamic link. What we want is a link that looks like:

widget_B_10.html or widget/B/10.html 

You may know the .htaccess file as the place to set up login areas (insecure), but it is the place for a whole lot more.

So what goes in the .htaccess file?

RewriteEngine On
# widget_B_10.html -> widget.cgi?color=B&size=10
RewriteRule ^widget_([^_]+)_([^_]+)\.html$ widget.cgi?color=$1&size=$2 [L]

Now any visitor would think you have a single directory full of static web pages, with the name determining the widget. The second form of link (widget/B/10.html) would look like you have categorized all your static pages in subdirectories; it needs a rule of the same shape with slashes in place of the underscores.

What is happening?

Upon receiving the URL widget_B_10.html, the Apache server uses a regular expression to match whatever lies between widget_ at the start of the page name and the second underscore. This match is passed into $1, and whatever lies between the second underscore and .html is passed into $2.

The second part of the rule is our script name with the variables it expects to receive. We simply pop the matched values from $1 and $2 into their respective places.
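
If you want to watch the matching happen without touching your server, the same idea can be tried out in a few lines of Python. This is only a simulation of what the rule does, not how Apache works internally. The RULES list and the rewrite function are names invented for this sketch, the patterns are a tightened version of the rule (assumed here: no underscores or slashes inside a value), and the second pattern is an assumed equivalent for the widget/B/10.html style of link.

```python
import re

# Patterns of the kind a RewriteRule could use, paired with the target URL.
# Apache writes back-references as $1 and $2; Python's re module uses \1 and \2.
RULES = [
    (r"^widget_([^_]+)_([^_]+)\.html$", r"widget.cgi?color=\1&size=\2"),
    (r"^widget/([^/]+)/([^/]+)\.html$", r"widget.cgi?color=\1&size=\2"),
]


def rewrite(path):
    """Return the rewritten URL, or the path unchanged if no rule matches."""
    for pattern, target in RULES:
        match = re.match(pattern, path)
        if match:
            # Drop the captured groups into the target, like $1 and $2.
            return match.expand(target)
    return path


print(rewrite("widget_B_10.html"))  # widget.cgi?color=B&size=10
print(rewrite("widget/B/10.html"))  # widget.cgi?color=B&size=10
print(rewrite("index.html"))        # index.html
```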

Is that it?

In the .htaccess file, yes. But notice that the link now has to keep a fixed structure so it can be parsed correctly. Before, you could write your links as either widget.cgi?color=B&size=10 or widget.cgi?size=10&color=B. They now have to be written as normal HTML links with the color variable first and the size variable second. The main benefit is that links become much easier to write within your CGI scripts, but you have to be careful about the order, and think about what to do if you suddenly need more variables.
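
One way to stay out of trouble with the ordering is to build every link through a single helper, so the order is decided in exactly one place. Here is a minimal Python sketch of that idea. The function name widget_link is hypothetical, and the check for underscores and slashes is an assumption about what your values look like (a value such as dark_blue would shift the fields when the name is split apart again).

```python
def widget_link(color, size):
    """Build the static-looking link, always color first and size second."""
    parts = [str(color), str(size)]
    for part in parts:
        # Empty values or values containing a separator cannot be
        # taken apart again unambiguously by the rewrite pattern.
        if not part or "_" in part or "/" in part:
            raise ValueError(f"value not usable in a widget link: {part!r}")
    return "widget_{}_{}.html".format(*parts)


print(widget_link("B", 10))  # widget_B_10.html
```

If you later need a third variable, you only have to change this one function and the matching rule in .htaccess, rather than hunting through every script that prints a link.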

In case you have to configure your own Apache server...

You need to ensure that the module is activated in Apache's httpd.conf and that the .htaccess file is being read by Apache. It's all in the documentation.

Firstly - LoadModule rewrite_module

The LoadModule rewrite_module should not have a hash in front of it.

#LoadModule speling_module     libexec/httpd/mod_speling.so
#LoadModule userdir_module     libexec/httpd/mod_userdir.so 
#LoadModule alias_module       libexec/httpd/mod_alias.so 
LoadModule rewrite_module     libexec/httpd/mod_rewrite.so 
#LoadModule access_module      libexec/httpd/mod_access.so 
#LoadModule auth_module        libexec/httpd/mod_auth.so 
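
If your httpd.conf is long, a quick script can tell you whether the line is really active. This is just a sketch in Python that reads the text of the file and looks for an uncommented LoadModule line. The function name module_enabled is made up for this example, and it assumes each LoadModule directive sits on a single line, as in the listing above.

```python
def module_enabled(conf_text, module_name):
    """Return True if an uncommented LoadModule line names the module."""
    for line in conf_text.splitlines():
        fields = line.split()
        # A commented-out line starts with '#', so its first field
        # is never exactly 'LoadModule'.
        if (len(fields) >= 2
                and fields[0].lower() == "loadmodule"
                and fields[1] == module_name):
            return True
    return False


sample = """\
#LoadModule alias_module       libexec/httpd/mod_alias.so
LoadModule rewrite_module     libexec/httpd/mod_rewrite.so
"""
print(module_enabled(sample, "rewrite_module"))  # True
print(module_enabled(sample, "alias_module"))    # False
```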

Secondly - Enable FollowSymLinks for your web directory in httpd.conf

FollowSymLinks has to be on for rewrite rules in .htaccess files to work, not only if you are using the slash as your separator.

# This may also be "None", "All", or any combination of "Indexes",
# "Includes", "FollowSymLinks", "ExecCGI", or "MultiViews".
#
# Note that "MultiViews" must be named *explicitly* --- "Options All"
# does not give it to you.
#
#    Options Indexes FollowSymLinks MultiViews ExecCGI
    Options FollowSymLinks MultiViews ExecCGI
#
# This controls which options the .htaccess files in directories can
# override. Can also be "All", or any combination of "Options", "FileInfo",
# "AuthConfig", and "Limit"
#
#    AllowOverride None
#
# Controls who can get stuff from this server.
#

Side effects?

A bit of additional CPU processing before every page is delivered. This is because each requested URL is compared against the rules in the .htaccess file while a match is being looked for. If you're delivering nothing but dynamic content, you will probably be running your Apache server on a monster CPU thing with gigabytes of RAM. If not, just buy a new computer and run it on Linux or FreeBSD.