There are lots of easy ways to analyse standard Apache (and Tomcat) access logs. Tools like Webalizer and AWStats have existed for some time and are well tailored to logs generated from "standard" web traffic. But what happens when you want to get a bit more advanced, look for different trends, and graph things that standard tools just won't do? When your servers generate gigabytes (or terabytes) of data, the natural choice right now is to gravitate towards Hadoop. However, starting to hack away at MapReduce jobs in Java somehow seems a bit, well, wrong to be honest. Then along came a little Pig.
As per the website, "Pig is a platform for analysing large data sets that consists of a high-level language for expressing data analysis programs..." Just as Hadoop is the open source (Yahoo!-backed) version of Google's MapReduce framework, Pig can be thought of (in a way) as the open source version of Google's Sawzall. With Pig you can quickly say goodbye to low-level MapReduce functions and express your goals in a more natural, high-level language. Pig's contrib library, Piggy Bank, also has a ready-built parser for the standard Apache (or Tomcat) access log format, just to make it all the more appealing.
Okay, enough talk, let's get started! Here's what we're going to achieve in this post: extract a subset of GET requests from the access logs and build a histogram in order to determine how effective an HTTP cache might be. Here's what you'll need:
- Hadoop Core
- Pig (instructions)
- PiggyBank (comes with Pig)
- Apache "common" or "combined" format access log files
After getting Pig installed, we're ready to start scripting some simple Pig Latin. To start with, let's just get the logs parsed and something running through Pig.
```
REGISTER piggybank.jar;

DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();

logs = LOAD '$LOGS' USING ApacheCommonLogLoader AS (remoteHost, hyphen, user, time, method, uri, protocol, statusCode, responseSize);
```

Next we want to filter and keep just the requests that we care about. Additionally, we will do a simple projection to keep only the fields we need.
```
logs = FILTER logs BY method == 'GET' AND statusCode >= 200 AND statusCode < 300 AND uri matches '/.*/rest/v1/places/.*';

logs = FOREACH logs GENERATE uri, responseSize;
```

We have kept responseSize but won't use it here. We could conceivably use it to determine how much bandwidth we would save by caching responses, or how large the cache would likely need to be. From here, we will group based on the URI and count the occurrences of each. We will store this intermediate result for possible later analysis.
```
groupedByUri = GROUP logs BY uri;

uriCounts = FOREACH groupedByUri GENERATE group AS uri, COUNT(logs) AS numHits;

STORE uriCounts INTO 'uri_counts';
```

At this point we have something that produces output and could run as is. Pig can run stand-alone (in local mode) without a cluster, so testing is pretty simple; I strongly recommend running against a sample log file first, just to verify correctness. We have one parameter defined in the Pig script, so we need to remember to pass it in on the command line.
```
pig -x mapreduce -f apacheAccessLogAnalysis.pig -param LOGS='access.test.log'
```
Running what we have so far would give us the following output.
The last step in creating the histogram data set is a further grouping based on the above output. This gives us an idea of how many URIs would benefit from being cached; that is, how many URIs are hit multiple times.
```
groupedByCount = GROUP uriCounts BY numHits;

histogram = FOREACH groupedByCount GENERATE group AS numHits, COUNT(uriCounts) AS numUris;

STORE histogram INTO 'histogram';
```

The final output is pretty concise and ready to be graphed. The first column shows us the hit count and the second column shows the total number of URIs that have that hit count. So far so good; it also looks like our numbers match up with what we can see from the first output.
At this point we can easily create some aggregates as well to show the percentage of cache hits and misses. To do this, we need two numbers: the count of URIs that were hit just once (the first row of the histogram), which I'll call the one hit wonders, and the total number of hits on everything below that row.
```
numOneHitWonders = FILTER histogram BY numHits == 1;
numOneHitWonders = FOREACH numOneHitWonders GENERATE numUris AS num;

STORE numOneHitWonders INTO 'single_hits';
```

The last step is to get the total number of hits on URIs with two or more hits, and also to count the number of URIs that have two or more hits on them. These are the URIs that are candidates for caching.
```
-- gather up the rest that have 2 or more hits
multiHits = FILTER histogram BY numHits >= 2;

-- count the total number of hits on URIs with 2 or more hits
numHitsOnCachedUris = FOREACH multiHits GENERATE (numHits * numUris) AS num;
numHitsOnCachedUris = GROUP numHitsOnCachedUris ALL;
numHitsOnCachedUris = FOREACH numHitsOnCachedUris GENERATE SUM(numHitsOnCachedUris.num);

STORE numHitsOnCachedUris INTO 'multi_hits';

-- count the number of URIs that will benefit from caching
numUrisCached = GROUP multiHits ALL;
numUrisCached = FOREACH numUrisCached GENERATE SUM(multiHits.numUris);

STORE numUrisCached INTO 'uris_benefit_cache';
```

Given these last two results, we could easily calculate the expected cache hit rate. That is easier done outside of Pig, though, since it just involves some simple algebra, so I will leave the exercise up to the reader.
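For illustration, that algebra might look something like the following Python sketch. The function name is mine, and it assumes a best-case cache that is large enough to hold every cacheable response; the example numbers are hypothetical, not from the post.

```python
def expected_hit_rate(one_hit_uris, hits_on_cached_uris, uris_cached):
    """Estimate the best-case hit rate of a sufficiently large cache.

    one_hit_uris:        number of URIs requested exactly once ('single_hits')
    hits_on_cached_uris: total requests to URIs hit 2+ times ('multi_hits')
    uris_cached:         number of distinct URIs hit 2+ times ('uris_benefit_cache')
    """
    # Each cacheable URI misses once (its first request) and hits thereafter.
    cache_hits = hits_on_cached_uris - uris_cached
    total_requests = one_hit_uris + hits_on_cached_uris
    return cache_hits / total_requests

# Hypothetical numbers: 1000 one hit wonders, plus 9000 requests spread
# across 2000 cacheable URIs.
print(expected_hit_rate(1000, 9000, 2000))  # (9000 - 2000) / 10000 = 0.7
```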
Okay, enough plain old numbers. What I want to see are some graphs, sweet graphs!
I could easily use Excel for this since I have everything in nice tab delimited columns, but what geeky project is complete without a command line generated graph?! Plus, with gnuplot, we could easily automate this to run on a monthly, weekly or even nightly basis.
Here's a gnuplot configuration to chart the histogram we just generated. It loads a file called histogram.txt (the output from Pig, renamed to a nicer filename) and charts it as a simple bar chart.

```
set boxwidth 1 relative
set style data histograms
set style fill solid 1.0 border -1

set yrange [0:*]
set xlabel '# hits'
set ylabel '# URIs'

set term png
set output 'histogram.png'

set datafile separator "\t"

plot 'histogram.txt' using 2:xticlabels(1) notitle
```

Based on our sample data set, this generates a small, concise histogram.
In the end, was this really necessary? No, not really. A tool like awk could probably handle an average access log file covering a day or a week. Was it an interesting experience? Absolutely! Plus, larger data sets would definitely benefit from the parallelism, and more complex scripts and analysis can be built on top of this simple one.
Here's a look at a real histogram produced from production traffic. A little more interesting than the sample chart! Note that this output was charted with a logarithmic y-axis since the numbers are so much larger.
To see the real scale, here is the same graph charted without a logarithmic y-axis.
For reference, this Pig script took about 25 minutes (with some debug output) to run in stand-alone mode on my laptop at about 80% CPU utilization, against access logs totalling ~15M entries. It could definitely process a lot more data on a cluster.
All of the source for this post can be downloaded here.