

Log analysis with Pig and gnuplot

There are lots of easy ways to analyse standard Apache (and Tomcat) access logs. Tools like Webalizer and AWStats have existed for some time and are well tailored to logs generated from "standard" web traffic. But what happens when you want to get a bit more advanced, look for different trends and graph things that standard tools just won't do? When your servers generate gigabytes (or terabytes) of data, the natural choice right now is to gravitate towards Hadoop. However, hacking away at MapReduce jobs in Java somehow seems a bit, well, wrong to be honest. Then along came a little Pig.

As per the website, "Pig is a platform for analysing large data sets that consists of a high-level language for expressing data analysis programs..." Just as Hadoop is the open source (aka Yahoo!) version of Google's MapReduce framework, Pig can be thought of (in a way) as the open source version of Google Sawzall. With Pig you can quickly say goodbye to low-level MapReduce functions and express your goals in a more natural, high-level language. Pig's contrib library, Piggy Bank, also has a ready-built parser for the standard Apache (or Tomcat) access logs, which makes it all the more appealing.

Okay, enough talk, let's get started! Here's what we're going to achieve in this post: extract a subset of GET requests from the access logs and create a histogram in order to determine how effective an HTTP cache might be.

Prerequisites


  • Hadoop Core
  • Pig (instructions)
  • PiggyBank (comes with Pig)
  • Apache "common" or "combined" format access log files

Pig Scripting

After getting Pig installed, we're ready to start scripting some simple Pig Latin. To start with, let's just get the logs parsed and something running through Pig.

REGISTER piggybank.jar;

DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();

logs = LOAD '$LOGS' USING ApacheCommonLogLoader
    AS (remoteHost, hyphen, user, time, method, uri, protocol, statusCode, responseSize);

Next we want to filter and keep just the requests that we care about. Additionally, we will do a simple projection to keep only the fields we need.

-- keep only successful GET requests on the places resource
logs = FILTER logs BY method == 'GET'
    AND statusCode >= 200 AND statusCode < 300
    AND uri matches '/.*/rest/v1/places/.*';

-- project down to the fields we need
logs = FOREACH logs GENERATE uri, responseSize;

We have kept responseSize but won't use it here. We could conceivably use it to determine how much bandwidth we would save by caching responses, or how large the cache will likely need to be. From here, we will group by URI and count the occurrences of each. We will store this intermediate result for possible later analysis.

groupedByUri = GROUP logs BY uri;

uriCounts = FOREACH groupedByUri GENERATE
    group AS uri, COUNT(logs) AS numHits;

STORE uriCounts INTO 'uri_counts';

At this point we have something that produces output and could run as is. Pig runs stand-alone by default, so testing is pretty simple. I strongly recommend running against a sample log file first, just to verify correctness. We have one parameter defined in the Pig script, so we need to remember to pass it in on the command line.

pig -x mapreduce -f apacheAccessLogAnalysis.pig -param LOGS='access.test.log'

Running what we have so far would give us the following output.

/pnss/pmds/rest/v1/places/URI3	3
/pnss/pmds/rest/v1/places/URI4	4
/pnss/pmds/rest/v1/places/URI1A	1
/pnss/pmds/rest/v1/places/URI1B	1
/pnss/pmds/rest/v1/places/URI2A	2
/pnss/pmds/rest/v1/places/URI2B	2
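For readers who want to sanity-check the Pig output on a small sample without Hadoop, the same filter, group and count can be sketched in plain Python. This is only an illustration and not part of the original script: the regular expression is a simplified assumption about the Apache "common" log format, the names `LOG_LINE`, `URI_FILTER` and `count_uri_hits` are made up for this sketch, and the sample log lines are invented.

```python
import re
from collections import Counter

# Simplified pattern for the Apache "common" log format (an assumption;
# real logs may need a more forgiving expression).
LOG_LINE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)$'
)

# Same URI pattern as the Pig script. Pig's "matches" has to match the
# whole string, so fullmatch() is used below to mirror that.
URI_FILTER = re.compile(r'/.*/rest/v1/places/.*')


def count_uri_hits(lines):
    """Count successful GET requests per URI, like the GROUP/COUNT step."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m is None:
            continue  # skip lines that do not parse
        method, uri, status = m.group(5), m.group(6), int(m.group(8))
        if method == "GET" and 200 <= status < 300 and URI_FILTER.fullmatch(uri):
            counts[uri] += 1
    return counts


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        '10.0.0.1 - - [01/Jan/2010:00:00:01 +0000] "GET /pnss/pmds/rest/v1/places/URI2A HTTP/1.1" 200 512',
        '10.0.0.2 - - [01/Jan/2010:00:00:02 +0000] "GET /pnss/pmds/rest/v1/places/URI2A HTTP/1.1" 200 512',
        '10.0.0.3 - - [01/Jan/2010:00:00:03 +0000] "GET /pnss/pmds/rest/v1/places/URI1A HTTP/1.1" 200 256',
    ]
    for uri, hits in sorted(count_uri_hits(sample).items()):
        print(f"{uri}\t{hits}")
```

Comparing this against the contents of uri_counts for the same sample file is a quick way to confirm the filter is keeping the requests you expect.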

The last step in creating the histogram data set is a further grouping based on the above output. This gives us an idea of how many URIs would benefit from being cached, that is, how many URIs are hit multiple times.

groupedByCount = GROUP uriCounts BY numHits;

histogram = FOREACH groupedByCount GENERATE
    group AS numHits, COUNT(uriCounts) AS numUris;

STORE histogram INTO 'histogram';

The final output is pretty concise and ready to be graphed. The first column shows the hit count and the second column shows the total number of URIs that have that hit count. So far so good: our numbers also match what we can see in the first output.

1	2
2	2
3	1
4	1
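The second grouping is easy to verify by hand on the sample data. A minimal Python sketch of the same idea, assuming the per-URI counts are already in a dictionary (the function name `hits_histogram` is made up for this example):

```python
from collections import Counter


def hits_histogram(uri_counts):
    """Map each hit count to the number of URIs that have that many hits."""
    return Counter(uri_counts.values())


# Per-URI hit counts taken from the sample uri_counts output above.
uri_counts = {
    "/pnss/pmds/rest/v1/places/URI3": 3,
    "/pnss/pmds/rest/v1/places/URI4": 4,
    "/pnss/pmds/rest/v1/places/URI1A": 1,
    "/pnss/pmds/rest/v1/places/URI1B": 1,
    "/pnss/pmds/rest/v1/places/URI2A": 2,
    "/pnss/pmds/rest/v1/places/URI2B": 2,
}

# Print in the same two-column, tab-delimited layout as the Pig output.
for num_hits, num_uris in sorted(hits_histogram(uri_counts).items()):
    print(f"{num_hits}\t{num_uris}")
```

Counting the values of the first result is all the second GROUP does: the key becomes the hit count and the value becomes the number of URIs sharing it.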

At this point we can easily create some aggregates as well to show the percentage of cache hits and misses. To do this, we first separate the first row (the URIs with just 1 hit) from the rest of the values in the second column. I'll call these the one hit wonders.

numOneHitWonders = FILTER histogram BY numHits == 1;
numOneHitWonders = FOREACH numOneHitWonders GENERATE numUris AS num;

STORE numOneHitWonders INTO 'single_hits';

The last step is to get the total number of hits on URIs with 2 or more hits, and also to count the number of URIs that have 2 or more hits on them. These are the URIs that are candidates for caching.

-- gather up the rest that have 2 or more hits
multiHits = FILTER histogram BY numHits >= 2;

-- count the total number of hits on URIs with 2 or more hits
numHitsOnCachedUris = FOREACH multiHits GENERATE
    (numHits * numUris) AS num;

numHitsOnCachedUris = GROUP numHitsOnCachedUris ALL;
numHitsOnCachedUris = FOREACH numHitsOnCachedUris GENERATE SUM(numHitsOnCachedUris.num);

STORE numHitsOnCachedUris INTO 'multi_hits';

-- count the number of URIs that will benefit from caching
numUrisCached = GROUP multiHits ALL;
numUrisCached = FOREACH numUrisCached GENERATE SUM(multiHits.numUris);

STORE numUrisCached INTO 'uris_benefit_cache';

Given these results, we could easily calculate the expected cache hit rate. That is easier done outside of Pig since it just involves some simple algebra, so I will leave that exercise up to the reader.
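For readers who would like to check their answer, here is one way the algebra might look in Python. It is a sketch under a strong assumption that the post does not state: an idealised cache of unlimited size whose entries never expire, so the first request for each URI is a miss and every repeat is a hit. The function name `expected_hit_rate` and its parameter names are made up for this example; the three inputs correspond to the single_hits, multi_hits and uris_benefit_cache outputs.

```python
def expected_hit_rate(single_hits, multi_hits, uris_cached):
    """Estimate the cache hit rate from the three Pig aggregates.

    single_hits -- number of URIs requested exactly once (all misses)
    multi_hits  -- total requests to URIs requested 2 or more times
    uris_cached -- number of URIs requested 2 or more times

    Assumes an ideal cache: only the first request per URI misses.
    """
    total_requests = single_hits + multi_hits
    if total_requests == 0:
        return 0.0
    # Each repeatedly requested URI costs exactly one miss to warm up.
    cache_hits = multi_hits - uris_cached
    return cache_hits / total_requests


# Values for the sample data: 2 one hit wonders, 2*2 + 3 + 4 = 11 hits
# spread over 4 URIs with 2 or more hits.
rate = expected_hit_rate(single_hits=2, multi_hits=11, uris_cached=4)
print(f"expected hit rate: {rate:.1%}")  # prints "expected hit rate: 53.8%"
```

A real cache with limited size or expiring entries would do worse than this, so the figure is best read as an upper bound.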

Okay, enough plain old numbers. What I want to see are some graphs, sweet graphs!


I could easily use Excel for this since I have everything in nice tab-delimited columns, but what geeky project is complete without a graph generated from the command line?! Plus, with gnuplot, we could easily automate this to run on a monthly, weekly or even nightly basis.

Here's a gnuplot configuration to chart the histogram we just generated. It loads a file called histogram.txt (which is the output from Pig renamed to a nice filename) and charts this simply as a png file.

set boxwidth 1 relative
set style data histograms
set style fill solid 1.0 border -1

set yrange [0:*]
set xlabel '# hits'
set ylabel '# URIs'

set term png
set output 'histogram.png'

# Pig writes tab-delimited columns
set datafile separator "\t"

# column 2 is the bar height, column 1 labels the x-axis
plot 'histogram.txt' using 2:xticlabels(1) notitle

Based on our sample data set, this generates a small, concise histogram.
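The histogram.txt file used above has to come from somewhere: Pig's STORE writes a directory of part files rather than a single file, and the row order is not guaranteed. Since gnuplot draws the bars in file order, a small helper along these lines could merge and sort the output first. This is a sketch, not part of the original post; the function name `merge_pig_output` is made up, and it assumes the part files are named `part-*` and hold two tab-delimited integer columns.

```python
from pathlib import Path


def merge_pig_output(output_dir, target_file):
    """Merge Pig part files into one file sorted by the first column."""
    rows = []
    for part in sorted(Path(output_dir).glob("part-*")):
        for line in part.read_text().splitlines():
            if line.strip():
                num_hits, num_uris = line.split("\t")
                rows.append((int(num_hits), int(num_uris)))
    rows.sort()  # numeric order, so the x-axis reads 1, 2, 3, ...
    with open(target_file, "w") as out:
        for num_hits, num_uris in rows:
            out.write(f"{num_hits}\t{num_uris}\n")
    return rows
```

Calling `merge_pig_output("histogram", "histogram.txt")` after the Pig run would then produce the file the gnuplot script expects.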



In the end, was this really necessary? No, not really. Tools like awk could probably handle an average access log covering a day or a week. Was it an interesting experience? Absolutely! Plus, larger data sets would definitely benefit from some parallelism, and more complex scripts and analysis can be built on top of this simple script.

Here's a look at a real histogram produced from production traffic. A little more interesting than the sample chart! Note that this output was charted as logarithmic on the y-axis since the numbers are so much larger.

[Figure: histogram of production traffic, logarithmic y-axis]

To see the real scale, here is the same graph charted without a logarithmic y-axis.

[Figure: the same histogram, non-logarithmic y-axis]

For reference, this Pig script took about 25 minutes (with some debug output) to run in stand-alone mode on my laptop at about 80% CPU utilization. The data set it ran against was from access logs totalling ~15M entries. It could definitely process a lot more data on a cluster.


All of the source for this post can be downloaded here.

Comments (2) Trackbacks (1)
  1. Hi,

    I get an error "ERROR – ERROR 2998: Unhandled internal error. Implementing class".

    No helpful documentation on the internet.

    Can you help me out?

  2. Not sure, off the top of my head, but these kinds of errors tend to vary a bit by Pig release so send a message to the mailing list. They’re very helpful!
