
Caching Feeds

These instructions show you how to copy a web page to your local disk. This is useful if you are fetching our XML/RSS feed data, or if you are using our JavaScript feeds and want to cache the data on your own server. Caching the data can make the feed display faster because visitors' browsers do not have to establish a connection to the MagPortal.com server to download the data when they view the page.

This information is for use on Unix/Linux systems, where cron can be used to periodically invoke a Perl (version 5.004 or higher) script that fetches the data. If anyone can provide us with suitable instructions for doing this on Windows NT, we would appreciate it.

  1. Get "libwww-perl" and install it. You can get it from: http://search.cpan.org/search?module=LWP
  2. Copy the Perl script provided at the end of this page. We assume that you will name it mp_fetch_feed.pl for this example. The parts that you must modify are marked with "MODIFY" comments in the script, which explain the necessary changes. There are 3 things to modify:
    • The location of Perl on your computer.
    • Set $robot_from to your email address.
    • Insert one call to the fetch function for each page you want to cache. Each call specifies the URL of the page to fetch and the filename to store it in (the file will be overwritten if it already exists); see the example after this list.
  3. Set the permissions of the script to make it executable:
       chmod 755 mp_fetch_feed.pl
    and move the script to a suitable place. For the next step we will assume it is in /usr/local/bin
  4. Edit your list of cron jobs by doing:
       crontab -e
    and insert a line like this:
       27 3 * * * /usr/local/bin/mp_fetch_feed.pl
    This will fetch the feed data at 3:27am every day (pick a time in your time zone that corresponds to 2am-6am Eastern Time). Note that the cron job runs with the permissions of the user who created it, so make sure that any necessary output directories exist and have appropriate permissions.
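As an example of the modification described in step 2, the MODIFY section of the script might end up looking like this with two feeds being cached. The URLs and output filenames are placeholders only: the first call is the sample from the script's own comments and the second is purely hypothetical, so substitute your own feed URLs and an output directory that actually exists on your system.
  fetch('http://MagPortal.com/nr/feed.php?c=92&t=1&i=33', '/tmp/mp_cache/uspolitics.js');
  fetch('http://MagPortal.com/nr/feed.php?c=92&t=1&i=34', '/tmp/mp_cache/moreitems.js');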

That is all you need to do to copy the data to your computer every day. If you are using the JavaScript feeds, you will need to modify the HTML code on your web pages to load the data from the cached copy on your server instead of from the MagPortal.com server. Look for a line like:
  <SCRIPT LANGUAGE="JavaScript" SRC="http://MagPortal.com/nr/feed.php?c=92&t=1&i=33"></SCRIPT>
and change the URL so that it points to the cached copy of the data on your own server.
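For example, if your web server makes the cached file available at /mp_cache/uspolitics.js (the host name and path here are placeholders; use your own server name and whatever filename you chose in your fetch call), the line would become something like:
  <SCRIPT LANGUAGE="JavaScript" SRC="http://www.yoursite.com/mp_cache/uspolitics.js"></SCRIPT>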

Here is the mp_fetch_feed.pl script. Make sure that the #!/usr/bin/perl line is the first line of the file (no blank lines above it).


#!/usr/bin/perl -w
#### MODIFY LINE ABOVE TO HAVE PROPER PERL PATH FOR YOUR COMPUTER

# This program is provided by Hot Neuron LLC free of charge without
# any warranty of any kind whatsoever. It is entirely your responsibility
# to assess the suitability of this program for your use. By using this
# program you accept all responsibility for any damage that it may cause.

# NOTE: This program checks the robots.txt file on the server and it will
#       not send requests too rapidly (don't strain the server) so it
#       normally takes a little time to run: 1 minute for each page request

# You will need the LWP module which is in "libwww-perl" which you can get at
# http://search.cpan.org/search?module=LWP
use LWP::RobotUA;

$robot_from='[email protected]';  ######## MODIFY - PROVIDE YOUR EMAIL ADDRESS
$robot_name='mp_fetch_feed';    # don't change - name of the spider
$ua = new LWP::RobotUA($robot_name, $robot_from);

#######################################################
# MODIFY - Put in one call to the 'fetch' function for each
#          page you want to fetch below.
# 1st argument = URL of page to fetch
# 2nd argument = full path of file to store page in (FILE WILL BE OVERWRITTEN)
#######################################################
# Example:
# fetch('http://MagPortal.com/nr/feed.php?c=92&t=1&i=33', '/tmp/mp_cache/uspolitics.js');


# print an error message to STDERR
sub print_error($)
{
    my $msg = shift;
    print STDERR "mp_fetch_feed.pl: ERROR - $msg\n";
}

# fetch the page at $url and store it in $filename
sub fetch($$)
{
    my ($url, $filename) = @_;
    my $req = new HTTP::Request(GET => $url);
    my $page = $ua->request($req);
    # if we got some sort of server error, try the request one more time
    if ($page->code() >= 500 && $page->code() < 600)
    {
        $page = $ua->request($req);
    }
    if ($page->is_success)
    {
        if (open(OUTFILE, ">$filename"))
        {
            print OUTFILE $page->content;
            close OUTFILE;
        }
        else
        {
            print_error("Unable to open output file: $filename");
        }
    }
    else
    {
        print_error("Unable to fetch data from URL: $url");
    }
}
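
Before depending on the cron job, you may want to run the script once by hand to confirm that it works and that the cached files appear. The directory shown below is just the example path used above; create and check whatever directory you actually named in your fetch calls:
  mkdir -p /tmp/mp_cache
  /usr/local/bin/mp_fetch_feed.pl
  ls -l /tmp/mp_cache
Remember that the script deliberately pauses between requests (about 1 minute per page), so do not be surprised if it takes a few minutes to finish.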