Friday, July 18, 2008

Monitoring Your Web Server Logs

In my SEO post, I noted that you should really pay attention to how people are using your site. You can do this with services like the free Google Analytics, or your web hosting provider may provide some traffic analysis tool. Other products that process your server logs are also available.

On my smaller sites, I actually just use Excel to sift through my server logs! Here's how I do it. (My web hosting company uses the Apache web server, so my notes are specific to that type of server.) Here's an example line from my Apache log file:


193.53.87.89 - - [17/Jul/2008:00:51:18 -0400] "GET /index.shtml HTTP/1.1" 200 16640 "http://www.google.com/search?hl=en&q=helpful+information&btnG=Search" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"


Here's what each bit of that logged data is:

  • Host - 193.53.87.89 - This is the internet address ("IP address") of the site vistitor's computer. Sometimes you'll see a slightly more recognizable web domain name like c-193-53-87-89.hsd1.ga.comcast.net here instead of just numbers.

  • The two dashes are reserved for password-protected pages. If my site were password protected, the dashes would be replaced with the logged-on user's id.

  • Time - [17/Jul/2008:00:51:18 -0400] - This is the date and time when the visitor requested this page from the server. The -0400 is my server's time zone (i.e. 4 hours before GMT).

  • File - "GET /index.shtml HTTP/1.1" - This is the page requested from my site, with the technical name of the request method (HTTP 1.1). When the page is a form that's submitted, you'll see POST instead of GET. When someone's server is just making sure your page is still there, you may see HEAD instead of GET.

  • Code - 200 - This is the technical code for whether the server is returning a page or an error. You're probably familiar with the code 404 for a page not found error. 200 just means, "OK nothing wrong. Here's the requested page."

  • Bytes - 16640 - This is the number of bytes the server delivered. My site's home page is about 16,640 bytes. This number probably won't exactly match your page file sizes because the number includes the size of extra information the servers sends with your page ("headers").

  • Referer - "http://www.google.com/search?hl=en&q=helpful+information&btnG=Search" - This value is the page your visitor was on right before coming to your page. This is really valuable information for a marketer. Most of the time the page is just another page on your own site, but in this case, it tells me that my visitor used Google to search for sites about "helpful information." However, if this value is just "-", then someone used a bookmark to visit your site, or typed your web site address into their browser by hand. More commonly, web indexing spiders often will not provide a referer value and you'll just see "-" here.

  • Browser - "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" - This big long hairy string contains information about your visitor's web browser and computer operating system. This can be helpful if you want to know whether people are using PCs or Macs to view your site. Web developers can also use this information to be sure your site works well with the various web browsers available.



I download log files daily from my web hosting account using FTP. I usually download the previous day's file, which on my server is called access.yesterday. This is just a plain text file that you could open with Wordpad or Notepad, but it's tough to read through that way.

I open my access.yesterday file in Excel. When Excel's Text Import wizard pops up, I choose Delimited and click Next. Then I unselect the Delimiters button for Tab and select the button for Space. I leave all the other settings alone and click the Finish button.

Now I have a spreadsheet with each bit of log information in a separate column, but the data is still a little messy, and there are no column headings. I used to add a header row, do a lot of cleanup, and sort the data manually, but finally I recorded an Excel macro to do that for me. Here's the macro code; feel free to use it. Again, it only works for Apache server log files. (By the way, change 2008 to the year you're in if you're reading this post from the future!)


Sub Apache_Log_Cleanup()
'
' Apache_Log_Cleanup Macro
' Cleans up an imported Apache web server log
'
'
Rows("1:1").Select
Selection.Insert Shift:=xlDown, CopyOrigin:=xlFormatFromLeftOrAbove
Range("A1").Select
ActiveCell.FormulaR1C1 = "Host"
Range("D1").Select
ActiveCell.FormulaR1C1 = "Time"
Range("F1").Select
ActiveCell.FormulaR1C1 = "File"
Range("G1").Select
ActiveCell.FormulaR1C1 = "Code"
Range("H1").Select
ActiveCell.FormulaR1C1 = "Bytes"
Range("I1").Select
ActiveCell.FormulaR1C1 = "Referer"
Range("J1").Select
ActiveCell.FormulaR1C1 = "Browser"
Range("B:C,E:E").Select
Range("E1").Activate
Selection.Delete Shift:=xlToLeft
Columns("B:B").Select
Selection.Replace What:="[", Replacement:="", LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False
Selection.Replace What:="2008:", Replacement:="2008 ", LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False
Columns("C:C").Select
Selection.Replace What:="GET ", Replacement:="", LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False
Selection.Replace What:=" HTTP/1.0", Replacement:="", LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False
Selection.Replace What:=" HTTP/1.1", Replacement:="", LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False
Rows("1:1").Select
Selection.Font.Bold = True
Range("A2").Select
ActiveWindow.FreezePanes = True
Range("C1").Select
Range("A1").Select
ActiveWorkbook.Worksheets("access").Sort.SortFields.Clear
ActiveWorkbook.Worksheets("access").Sort.SortFields.Add Key:=Range("A:A" _
), SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
ActiveWorkbook.Worksheets("access").Sort.SortFields.Add Key:=Range("B:B" _
), SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
ActiveWorkbook.Worksheets("access").Sort.SortFields.Add Key:=Range("C:C" _
), SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
With ActiveWorkbook.Worksheets("access").Sort
.SetRange Range("A:G")
.Header = xlYes
.MatchCase = False
.Orientation = xlTopToBottom
.SortMethod = xlPinYin
.Apply
End With
Columns("A:A").ColumnWidth = 25
Columns("B:B").EntireColumn.AutoFit
Columns("C:C").ColumnWidth = 25
Columns("C:C").ColumnWidth = 25
Columns("D:D").EntireColumn.AutoFit
Columns("E:E").EntireColumn.AutoFit
Columns("F:F").ColumnWidth = 30
End Sub


Once you run that macro, you'll be able to sort your data in Excel however you want. The macro automatically sorts by Host, then Time, then File to group visits from the same user (or web spider) together. Pages accessed from the same Host generally represent a visit from one user.

Now that your web log data is all cleaned up, you can scroll through it to find any insteresting visit patterns. I like to sort it on Referer so I can see where my visitors found a link to my site, or what terms they used on search engines that turned up a link to my site. If often also sort on File and delete all the rows with images, css, js, xml and other files that aren't web pages. That makes it easier to focus just on visits to your web pages.

No comments: