Module 4b: Perl for Web Log Analysis

152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453 "http://www.yisou.com/search?p=data+mining&source=toolbar_yassist_button&pid=400740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /images/KDnuggets_logo.gif HTTP/1.1" 200 784 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
© 2006 KDnuggets

Perl - introduction
§ A full-featured, fast, and easy to use scripting language
§ Very powerful pattern-matching facilities
§ More powerful than gawk; very popular for web programming and CGI files
§ Many Perl tutorials, e.g.
    learn.perl.org/
    www.perl.com/pub/a/2000/10/begperl1.html
    www.perlmonks.org/index.pl?node=Tutorials

Perl - historical note
§ PERL stands for Practical Extraction and Report Language
§ Developed by Larry Wall
§ Perl 1.0 was released to usenet's alt.comp.sources in 1987
§ Perl is the most popular web programming language, due to its powerful text manipulation and quick development.
§ Perl is widely known as "the duct-tape of the Internet".

Perl - running
§ First Perl script (on Unix), file1.pl:
    #!/usr/local/bin/perl -w
    print "Hi there!\n";
Note: On Windows, the first line is usually
    #!c:/Perl/bin/perl.exe -w
Running it:
    % file1.pl
Result:
    Hi there!

Perl for Windows
§ ActivePerl: ready-to-install Perl distribution
§ Runs on Windows, Linux, Mac OS, and other operating systems
§ Free download: www.activestate.com/Products/ActivePerl/

Perl basics
§ Two data types: numbers and strings
§ Perl uses many special characters ($, @, %) as part of its syntax
§ Perl variables:
    § Scalars (simple variables, things) start with $, e.g. $count
    § Arrays (lists) start with @, e.g. @array1
    § Hashes (associative arrays) start with %
§ Usual control structures
§ Full introduction to Perl is beyond the scope of this module

What does this code do?
    @P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
    @p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
    ($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;map{$p{$_}=~/^[P.]/&&
    close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print
Answer: We do NOT want to know!

The Tao of Coding
§ Human time is MUCH more precious than computer time
§ It is much better (and faster) to develop programs using methods that AVOID mistakes than to hunt for bugs in badly written programs

Perl style: understandability first
§ Perl allows you to write tricky programs that save a few lines of text
§ AVOID this approach
§ Use careful, step-by-step development
§ Test after every step
§ A good program should be easy to understand
§ Improve efficiency only after you have an understandable program, and only if you need to

Perl coding
§ Variables can be declared implicitly by their first use, e.g.
    $oldvar = $nevar + 27;
§ If $nevar was not assigned before, it is undefined and is treated as zero in arithmetic
§ Danger! This can lead to hard-to-find errors (what if the variable was misspelled and was supposed to be $newvar?)
§ Much better to declare variables explicitly, e.g.
    my $newvar = 0;
§ Explicit declaration is enforced by the pragma
    use strict;
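
The effect of use strict on a misspelled variable can be seen in a small sketch. The snippet text and the variable names $snippet, $compiled and $error are illustrative assumptions; the faulty code is compiled with a string eval only so that the error can be captured and printed instead of stopping the script.

```perl
use strict;
use warnings;

# Hypothetical snippet containing the typo $nevar (instead of $newvar).
# It is compiled with a string eval so that the compile-time error raised
# under "use strict" can be captured instead of aborting this script.
my $snippet = 'use strict; my $newvar = 0; my $oldvar = $nevar + 27; 1;';

my $compiled = eval $snippet;   # undef, because compilation fails
my $error    = $@;              # message naming the undeclared variable

print "strict caught the typo: $error" unless $compiled;
```

Without use strict, the same statement would run silently and $oldvar would simply be 27.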

Sample log file
§ We will again use file d100.log, the first 100 lines from the Nov 16, 2005 KDnuggets log file.
§ We will give useful code examples. You are encouraged to try the code examples in this lecture on this file
§ You should get the same answers!

Perl for parsing a web log file
Program 0: logparse0.pl - read and print log file
    #!c:/Perl/bin/perl.exe -w
    use strict;
    while (<>) {
        my $line = $_;   # current line
        print $line;
    }

Perl regular expressions, 1
§ Usage: $var =~ /regex/ where regex is a regular expression.
E.g. $line =~ /google/ is true for every line containing "google"
Note: the / characters delimit the regular expression, so / can't be used inside it unless escaped like this: \/

Perl log parsing, 1
Check how many lines refer to google
    #!c:/Perl/bin/perl.exe -w
    use strict;
    my $cnt=0;
    while (<>) {
        my $line = $_;
        if ($line =~ /google/) {$cnt++;}
    }
    print " $cnt lines matched google";
Applying this code to d100.log, you get:
    2 lines matched google
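
The same counting logic can be tried without a log file. The sketch below wraps it in a hypothetical helper, count_matching, and runs it on three made-up lines (they are not taken from d100.log).

```perl
use strict;
use warnings;

# Count the lines that contain $word anywhere in the line.
sub count_matching {
    my ($word, @lines) = @_;
    my $cnt = 0;
    foreach my $line (@lines) {
        $cnt++ if $line =~ /\Q$word\E/;   # \Q...\E treats $word as literal text
    }
    return $cnt;
}

# Illustrative lines only (not taken from d100.log).
my @sample = (
    '192.0.2.1 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=data+mining" "Mozilla/4.0"',
    '192.0.2.2 - - [16/Nov/2005:16:33:10 -0500] "GET / HTTP/1.1" 200 12453 "-" "Mozilla/4.0"',
    '192.0.2.3 - - [16/Nov/2005:16:34:02 -0500] "GET /news/ HTTP/1.1" 200 9120 "http://www.google.com/search?q=kdnuggets" "Mozilla/4.0"',
);

print count_matching('google', @sample), " lines matched google\n";
```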

Perl regular expressions, 2
Special characters:
    .   : matches any one character (except newline)
    a*  : matches zero or more repeats of "a"
    a+  : matches 1 or more repeats of "a"
    \S  : matches any non-whitespace character
    ^   : anchor, matches beginning of string
    $   : anchor, matches end of string
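
A quick way to get a feel for these special characters is to test a few patterns against short strings. The helper name matches and the test strings below are illustrative choices.

```perl
use strict;
use warnings;

# Return 1 if $text matches the regular expression $pattern, else 0.
sub matches {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

print matches('^c.t$',  'cot'),       "\n";  # 1: . stands for any one character
print matches('^ca*t$', 'ct'),        "\n";  # 1: a* allows zero repeats of "a"
print matches('^ca+t$', 'ct'),        "\n";  # 0: a+ needs at least one "a"
print matches('^ca+t$', 'caaat'),     "\n";  # 1: three repeats of "a"
print matches('^\S+$',  'no_blanks'), "\n";  # 1: every character is non-whitespace
print matches('^\S+$',  'two words'), "\n";  # 0: the space is not matched by \S
```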

Log parse 2: IP address
§ The IP address is the first item on the log line.
§ In almost all log files it is followed by " - - ", representing the missing "ident_user" and "auth_user" fields
§ Regular expression for matching these 3 fields:
    $line =~ /^(\S+) - - /;

Perl regex: parentheses capture match variables
§ Perl regex items enclosed in parentheses () correspond to special match variables.
§ Variable $1 contains the value matched by the regular expression in the first pair of parentheses, $2 the value matched in the second pair, and so on

Perl regex: match variables
Note: The first line (the path to Perl) is probably different on your machine
    #!c:/Perl/bin/perl.exe -w
    use strict;
    my $cnt=0;
    while (<>) {
        my $line = $_;
        if ($line =~ /^(\S+) - - /) {
            my $ip = $1;
            print "ip $ip\n";
            $cnt++;
        }
        else {
            print "bad line $line\n";
        }
    }
    print " processed $cnt log lines\n";
This program shows how to assign the IP address to variable $ip; it also shows error processing if the match is not successful
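
The capture step can also be isolated in a small function, which makes it easy to check on single lines. The name extract_ip and the two test lines are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Return the IP address at the start of a log line, or undef if the
# line does not begin with "<non-blanks> - - ".
sub extract_ip {
    my ($line) = @_;
    return $line =~ /^(\S+) - - / ? $1 : undef;
}

my $good = '252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453';
my $bad  = 'this is not a log line';

foreach my $line ($good, $bad) {
    my $ip = extract_ip($line);
    print defined $ip ? "ip $ip\n" : "bad line $line\n";
}
```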

Perl regular expression 4: brackets
§ Brackets [ ] allow you to match any one of the characters inside
§ Example:
    § [cmt]an will match can, man or tan,
    § will not match ban or dan.

Perl regular expression 4b: brackets [^ ]
§ [^x] will match any character except x (note: here ^ is not the beginning-of-text anchor)
Example: [^:]* will match any string that does not include a colon.
Example: if $date is 16/Nov/2005:031415, then after
    $date =~ /([^:]*):.*/
[^:]* will have matched 16/Nov/2005. Because it was enclosed in (), the match result is stored in $1
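
A minimal sketch of the negated bracket in use: the hypothetical helper split_at_first_colon captures everything before the first colon with [^:]* and the remainder with a second pair of parentheses. The input string is an illustrative date and time.

```perl
use strict;
use warnings;

# Split a string at its first colon.
# Returns (before, after), or an empty list if there is no colon.
sub split_at_first_colon {
    my ($text) = @_;
    if ($text =~ /^([^:]*):(.*)$/) {
        return ($1, $2);
    }
    return;
}

my ($day, $time) = split_at_first_colon('16/Nov/2005:16:32:50');
print "date=$day time=$time\n";   # date=16/Nov/2005 time=16:32:50
```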

Parsing log: Date, Time
§ Date, Time is specified in the log as [DD/Mon/YYYY:HH:MM:SS timezone]
Matching regular expression:
    \[([^:]+):(..):(..):(..) -0500\]

Parsing log: Date, Time
Matching regular expression in detail:
    \[([^:]+):(..):(..):(..) -0500\]
§ \[ and \] match the literal brackets [ and ]
§ [^:] matches any single character other than :
§ ([^:]+) will match DD/Mon/YYYY; value in $1
§ first (..) will match HH (hours); value in $2
§ second (..) will match MM; value in $3
§ third (..) will match SS; value in $4
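
The four captures can be checked on a single line with a sketch like the following. The function name parse_timestamp and the hash keys are assumptions made for illustration; the regular expression is the one from the slide, including the fixed -0500 zone.

```perl
use strict;
use warnings;

# Extract date and time fields from the bracketed timestamp of a log line.
# Returns a hash reference, or undef if no -0500 timestamp is found.
sub parse_timestamp {
    my ($line) = @_;
    if ($line =~ /\[([^:]+):(..):(..):(..) -0500\]/) {
        return { date => $1, hour => $2, min => $3, sec => $4 };
    }
    return;
}

my $t = parse_timestamp('192.0.2.1 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1"');
print "$t->{date} at $t->{hour}h $t->{min}m $t->{sec}s\n";
```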

Parsing log: Time Zone
§ The time zone is given as an offset relative to GMT
§ The time zone in the log file is that of the SERVER, not of the visitor, so it is nearly always the same throughout the log
§ but it changes during daylight saving time
§ In our test log file the time zone is -0500, the US Eastern time zone
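
Because the offset changes with daylight saving time, a pattern with a hard-coded -0500 will reject lines logged in summer. One possible variation, not part of the original lecture code, is to capture the offset instead. The function name parse_timestamp_tz and the sample lines are assumptions.

```perl
use strict;
use warnings;

# Variation on the timestamp pattern: capture the zone offset
# (a sign followed by four digits) instead of requiring -0500.
sub parse_timestamp_tz {
    my ($line) = @_;
    if ($line =~ /\[([^:]+):(..):(..):(..) ([+-]\d\d\d\d)\]/) {
        return { date => $1, time => "$2:$3:$4", zone => $5 };
    }
    return;
}

my $winter = parse_timestamp_tz('192.0.2.1 - - [16/Nov/2005:16:32:50 -0500] "GET / HTTP/1.1"');
my $summer = parse_timestamp_tz('192.0.2.1 - - [16/Jul/2005:16:32:50 -0400] "GET / HTTP/1.1"');
print "$winter->{date} zone $winter->{zone}\n";
print "$summer->{date} zone $summer->{zone}\n";
```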

Parsing log: Request
Regular expression for parsing the Request field:
    "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)"
§ " ... " : opening and closing quotes
§ (GET|HEAD|POST|OPTIONS) : method
§ first (\S+) : URL, captures any string of 1 or more non-blanks
§ HTTP(\S+) : HTTP version - usually ignored
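
Applied to one line, the Request pattern yields three captures. In this sketch the function name parse_request and the test line are illustrative; note that the third capture is the text after "HTTP", so it includes the slash.

```perl
use strict;
use warnings;

# Extract method, URL and HTTP version from the quoted Request field.
# Returns an empty list if the request does not match.
sub parse_request {
    my ($line) = @_;
    if ($line =~ /"(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)"/) {
        return ($1, $2, $3);
    }
    return;
}

my ($method, $url, $version) =
    parse_request('192.0.2.1 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140');
print "method=$method url=$url version=$version\n";
```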

Parsing log: Status code and Object size
§ Status (Response) code is always a 3-digit number followed by a space, so it can be matched with
    (\d\d\d)
§ Object size is either a number or "-", followed by a space. The simplest regex to match it is
    (\S+)
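
A short sketch of the two patterns together, anchored on the closing quote of the Request field. The function name parse_status_size is hypothetical, and converting a "-" size to 0 is an assumption made here for convenience when summing bytes; the lecture code may treat it differently.

```perl
use strict;
use warnings;

# Extract status code and object size, which follow the quoted request.
sub parse_status_size {
    my ($line) = @_;
    if ($line =~ /" (\d\d\d) (\S+) /) {
        my ($status, $size) = ($1, $2);
        $size = 0 if $size eq '-';   # assumption: count "-" as zero bytes
        return ($status, $size);
    }
    return;
}

my ($status, $size)   = parse_status_size('"GET /jobs/ HTTP/1.1" 200 15140 "-" "Mozilla/4.0"');
my ($status2, $size2) = parse_status_size('"HEAD / HTTP/1.1" 304 - "-" "Mozilla/4.0"');
print "status $status, size $size\n";
print "status $status2, size $size2\n";
```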

Parsing log: Referrer
§ The Referrer is a string enclosed in double quotes "…"
§ It can have anything inside except a double quote
§ It can also be "-" in case of a direct request
§ Not documented, but it can be "" (nothing between the quotes)
The Referrer can be matched by:
    "([^"]*)"
§ " ... " : opening and closing quotes
§ [^"]* : anything except a double quote, appearing zero or more times

Parsing log: User agent
§ The user agent is also a string enclosed in double quotes "…" that can have anything inside except a double quote
§ It can also be "-"
The user agent can be matched by:
    "([^"]+)"
§ " ... " : opening and closing quotes
§ [^"]+ : anything except a double quote, appearing one or more times
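
The referrer and user agent are the last two quoted strings on the line, so one way to try both patterns is to anchor them at the end of the line. The end-of-line anchor, the function name parse_referrer_agent and the test lines are assumptions of this sketch.

```perl
use strict;
use warnings;

# Extract the last two quoted fields of a log line: referrer and user agent.
sub parse_referrer_agent {
    my ($line) = @_;
    if ($line =~ /"([^"]*)" "([^"]+)"\s*$/) {
        return ($1, $2);
    }
    return;
}

my ($ref, $agent) = parse_referrer_agent(
    '"GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0)"');
my ($ref2, $agent2) = parse_referrer_agent('"GET / HTTP/1.1" 200 145 "" "Mozilla/4.0"');

print "referrer=$ref agent=$agent\n";
print "empty referrer is allowed: [$ref2]\n";
```

The referrer uses * so that an empty "" still matches, while the user agent uses + and therefore needs at least one character.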

Parsing a web log line: putting all together
The matching is done by the following (should be all on one line)
    if ($line =~ /^(\S+) - - \[([^:]+):(..):(..):(..) -0500\] "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)" (\d\d\d) (\S+) "([^"]*)" "([^"]+)"/ ) { … }
Full code is in program weblog_parse.pl
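
The one-line pattern is hard to read, so the sketch below spells out the same pieces using Perl's /x modifier, which allows whitespace and comments inside the pattern (a literal space is then written as [ ]). This is not the contents of weblog_parse.pl; the names $LOG_RE, @FIELDS, parse_line and the hash keys are assumptions made for illustration.

```perl
use strict;
use warnings;

# Names for the twelve captures, in order (hypothetical field names).
my @FIELDS = qw(ip date hour min sec method url version status size referrer agent);

my $LOG_RE = qr{
    ^(\S+) [ ] - [ ] - [ ]                                 # IP address, then two "-" fields
    \[ ([^:]+) : (..) : (..) : (..) [ ] -0500 \] [ ]       # date, HH, MM, SS, fixed time zone
    "(GET|HEAD|POST|OPTIONS) [ ] (\S+) [ ] HTTP(\S+)" [ ]  # method, URL, HTTP version
    (\d\d\d) [ ] (\S+) [ ]                                 # status code, object size
    "([^"]*)" [ ] "([^"]+)"                                # referrer, user agent
}x;

# Parse one log line into a hash reference keyed by @FIELDS,
# or return undef if the line does not match.
sub parse_line {
    my ($line) = @_;
    my @values = ($line =~ $LOG_RE);
    return unless @values;
    my %rec;
    @rec{@FIELDS} = @values;
    return \%rec;
}

my $rec = parse_line('252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"');

print "$rec->{ip} requested $rec->{url} on $rec->{date}, status $rec->{status}\n";
```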

Perl arrays
§ A Perl array is an ordered list of items
§ Array names begin with @
§ Array initialization:
    my @days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");

Perl arrays, num of items
§ When referring to a single array item, the name begins with "$". E.g. we print the first array item (index 0) using
    print $days[0];
§ The index of the last item in an array is $#array; $#days is 6
§ The number of items is scalar(@days), i.e. $#days + 1, which is 7

Perl array iteration
§ Iterating over the entire array
    foreach my $day (@days) { print $day, "\n" };
§ is the same as
    for (my $n=0; $n<7; $n++) { print $days[$n], "\n" };
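
The array operations from the last three slides can be checked together in a short sketch. The variable names $last_index, $count, @by_foreach and @by_index are illustrative.

```perl
use strict;
use warnings;

my @days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");

my $last_index = $#days;          # index of the last item
my $count      = scalar(@days);   # number of items

# Collect the items with both loop styles to confirm they visit
# the same items in the same order.
my @by_foreach;
foreach my $day (@days) { push @by_foreach, $day; }

my @by_index;
for (my $n = 0; $n < $count; $n++) { push @by_index, $days[$n]; }

print "first item $days[0], last index $last_index, $count items\n";
print "both loops give the same order\n" if "@by_foreach" eq "@by_index";
```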

Perl hash
§ A hash is an unordered collection of key, value pairs.
§ Hash names begin with %
§ Hash initialization:
    my %capitals = ("USA", "Washington D.C.", "France", "Paris", "China", "Beijing");

Perl hash reference
§ When referring to a single hash item, the name begins with "$".
§ To get the capital of China from %capitals we use
    $capitals{"China"}
§ To add the capital of UK, we use
    $capitals{"UK"} = "London";

Perl hash iteration
Iteration over the entire hash
    foreach my $country (keys %capitals) {
        print "$country capital $capitals{$country}\n";
    }
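
Hashes are the natural tool for counting things in a log. As a sketch that combines the IP pattern from earlier with hash iteration, the hypothetical function hits_per_ip below counts requests per IP address; the sample lines are made up, and the keys are sorted only to make the output order predictable.

```perl
use strict;
use warnings;

# Count how many log lines each IP address produced.
# Lines that do not start with "<ip> - - " are skipped.
sub hits_per_ip {
    my (@lines) = @_;
    my %hits;
    foreach my $line (@lines) {
        next unless $line =~ /^(\S+) - - /;
        $hits{$1}++;
    }
    return \%hits;
}

# Illustrative lines only.
my @sample = (
    '192.0.2.1 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "-" "Mozilla/4.0"',
    '192.0.2.2 - - [16/Nov/2005:16:33:10 -0500] "GET / HTTP/1.1" 200 12453 "-" "Mozilla/4.0"',
    '192.0.2.1 - - [16/Nov/2005:16:34:02 -0500] "GET /kdr.css HTTP/1.1" 200 145 "-" "Mozilla/4.0"',
);

my $hits = hits_per_ip(@sample);
foreach my $ip (sort keys %$hits) {
    print "$ip $hits->{$ip}\n";
}
```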

Additional tools for Web log analysis
§ Perl for web log analysis: www.oreilly.com/catalog/perlwsmng/chapter/ch08.html
Some web log analysis tools
§ Analog: www.analog.cx/
§ AWstats: awstats.sourceforge.net/
§ Webalizer: www.mrunix.net/webalizer/
§ FTPweblog: www.nihongo.org/snowhare/utilities/ftpweblog/