(Statistics and Data Mining I)
For a variety of reasons, meaningful website visitation and visitor behavior statistics are an elusive data set to generate. This article introduces the visitor statistics problem, and describes seven challenges that must be overcome by statistical and data analysis techniques aiming for accurate estimates. Along the way, we’ll encounter the “Good News Cheap, Bad News Expensive” Paradox of Data Mining — or, why information is often used “as-is”.
This article is the first in a series on algorithms, statistics and data analysis techniques, built on free and open source tools, that uses the visitor statistics problem as a vehicle for illustration.
What are Visitor Statistics?
Every site with online articles deals with a fundamental question: who’s the audience? And the more fine-grained the answer, the better. For example, one might like to know what kinds of readers are reading (academic?, individuals?, commercial?, military?, government?), from where they are reading (by country?, city?, browser?, from a desktop or mobile platform?), whether they are new visitors to the site or returning visitors, how often they visit, how many pages were read on each visit and which ones. And if a reasonable gauge of interest or engagement level could be had, that would be even better.
Why is it difficult to get meaningful results? The answer lies in where website visitation information (the “raw data”) comes from and how it is generated.
Where the Information comes from: Server-Side vs. Client-Side Statistics
There are two fundamentally different ways in which one obtains web traffic statistics: at the server side, generated by the web-server itself, or at the client side, generated by client-side code that executes within the user’s browser whenever a page is loaded that has this tracking code in it. Each method has its own pros and cons, and its own biases.
For client-side statistics, fast traffic-recording servers with high uptime are crucial in order to minimize the overhead that the user experiences due to the additional round trip caused by the execution of the client-side code. For this reason, client-side statistics were typically a paid-for service. You paid for two things: the uptime and fast response time of the traffic-recording server, and the analytical methods (sometimes transparent, sometimes proprietary) that were used to generate the statistics made available to you as reports. But the client-side statistics game is changing, thanks to Google Analytics, a free client-side statistics package. (Wikipedia’s article on Google Analytics gives an overview.)
By contrast, server-side statistics are typically generated from information that resides in the logs that your own web server and its various modules maintain by default. Thus server-side statistics do not affect the user’s browsing experience. From the processing standpoint, server-side statistics are typically free, courtesy of some excellent open source server-side web statistics packages, notably AWStats and Webalizer.
Perhaps more significantly, with server-side statistics you have complete control over how the entire history of your web-site’s traffic data is analyzed. The advantage is that you know exactly what processing methods you’re using to obtain your visitation numbers, and thus what kinds of biases are involved. You can analyze, re-analyze, and set up heuristic filters of your own, re-processing the entire history at will. This also means that you can evolve your heuristic filters to better account for your site’s particular traffic patterns. As your algorithms evolve, the entire history can be re-processed using the same algorithm, thus allowing the entire data set to be examined as a time-series.
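To make the idea of re-processing your own history concrete, here is a minimal sketch of the first step: turning a raw log line into fields you can filter on. It assumes your server writes the common Apache/NGINX "combined" log format; the function name parse_line and the sample line are illustrative, and a custom log format would need a different pattern.

```python
import re
from datetime import datetime

# Assumption: the server uses the "combined" log format
# (client IP, identity, user, time, request, status, size, referrer, user agent).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 10/Oct/2010:13:55:36 -0700
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

# Hypothetical log line, using documentation-reserved addresses and hosts.
sample = ('203.0.113.7 - - [10/Oct/2010:13:55:36 -0700] '
          '"GET /articles/stats.html HTTP/1.1" 200 2326 '
          '"http://example.com/" "Mozilla/5.0 (X11; Linux x86_64)"')
record = parse_line(sample)
print(record["ip"], record["path"], record["status"])
```

Because the raw log files stay on disk, any filter built on top of records like these can be changed later and re-run over the full history.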
Fortunately, these two methods are not mutually exclusive. And as with any estimation problem with elusive quantities, you should ideally use more than one estimator, exploiting the particular characteristics of each to improve the joint estimate.
With this as background, let’s look at the headlined question:
Why is it Difficult to Obtain Meaningful Website Visitation Statistics?
For most content generating sites, “meaningful” visitation statistics means a count of interested humans reading the material. Robot and Web Crawler visits don’t count. Casual glances (i.e. the “gone in 60 seconds” visitors) don’t count. The driving desire is to understand one’s audience through its online behavior, as opposed to using survey techniques.
With the advance in web based technologies and platforms, one might think that it would be reasonably straightforward to get meaningful traffic-based statistics. But if you’re looking for counts and histograms that are ready to be interpreted as “interested human visitors”, you are likely to come away less than satisfied. Seven obstacles stand in the way:
- Ubiquitous Auto-bot Traffic (Autonomous Web Crawlers)
- Congestion Reducing Proxy Cache Services
- Dynamically Allocated IP Addresses
- The Absence of Unique Identifiers
- Content Distributing Feeds and Feed Readers
- Your Own Organization’s Interfering Traffic
- Self-Reinforcing Visitation Patterns and Time-Varying Change in Variance
Let’s look at each of the seven reasons in turn. This discussion will set the stage for the reasoning behind the heuristic filters we’ve implemented.
(1) Ubiquitous Auto-bot Traffic (Autonomous Web Crawlers)
For the past several years, there has been an ever increasing volume of autonomous network traffic that does not represent human eyeballs. This traffic consists of auto-bots, indexers, search engine spiders, content scrapers, link fishers, automatic comment spammers, and automated hacking bots, among others, all of which generate “hits” and often even page loads that inflate your server-side traffic statistics.
Now, some auto-bots identify themselves as such, making it relatively easy to filter for them. But the increasing global availability of cheap server space through server farms or low cost cloud computing platforms (such as Amazon’s) has made the identification problem more difficult. Autonomous activity originating from these clusters is now able to draw from a much larger pool of IP addresses and to operate from better and faster platforms. Typically, these auto-bots are not the helpful type, consisting largely of hacking bots, automatic comment spammers, content scrapers, link fishers, and the like.
The type of auto-bots that visit your site, and the regularity (or lack of it) with which they visit, affect the kind of biases you can expect in your statistics. For example, search related auto-bots typically return repeatedly to your site, and do so from a small and relatively stable pool of IP addresses. Thus, visits from such auto-bots make the apparent proportion of returning (human) visitors look larger than it really is. As another example, indexers typically return within seconds or minutes of a change being made to a page that they are tracking. So, if you tend to make lots of edits “live” on your site, you will generate spikes of auto-bot traffic correlated with your edits, spikes that may be misinterpreted as interested readers.
Not all auto-bots are bad: indexers and search engines are important for your site to be discovered by those who would find it useful. But if your goal is understanding your site’s human readership, then auto-bots of all types are an undesirable source of noise in your data set, and need to be filtered out.
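For the easy case, auto-bots that announce themselves, a first-pass filter can be as simple as a substring check on the User-Agent header. The sketch below is illustrative only: the marker list is an assumption and far from complete, treating an empty user agent as a bot is a judgment call, and bots that disguise themselves as browsers will pass straight through.

```python
# Assumption: a small, illustrative set of substrings that self-identifying
# bots commonly include in their User-Agent header. Extend it for your site.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp", "feedfetcher")

def is_self_identified_bot(user_agent):
    """Return True if the user agent is empty or contains a known bot marker."""
    ua = user_agent.strip().lower()
    if ua in ("", "-"):
        # Assumption: real browsers always send a user agent string.
        return True
    return any(marker in ua for marker in BOT_MARKERS)

agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "-",
]
flags = [is_self_identified_bot(agent) for agent in agents]
print(flags)
```

A filter like this removes only the cooperative bots; the disguised ones call for behavioral criteria such as request rate or visit regularity.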
(2) Congestion Reducing Proxy Cache Services
Proxy Caches are servers that cache frequently requested pages on a network so that they can re-serve them without having to send a request for content onward to your web server and wait for the return. Proxy Caches are often part of wide area network congestion management solutions, and can be quite efficient in reducing traffic. But in order to maintain their relevance, they reach out to your website periodically (many quite frequently), and perform a quick check to see if their cached pages have changed.
The challenge for server-side statistics is that there is no server-side indication of how many eyeballs accessed your content pages from the proxy cache server instead of from your web server. This is one of the areas where client-side statistics do better than server-side, since the tally is generated by tracking code executing within the user’s browser whenever any of your content is loaded, regardless of whether that content is served from your server or from a proxy server.
(3) Dynamically Allocated IP Addresses
Raw visitor statistics are primarily based on IP Address. But a large proportion of IP addresses are issued dynamically by an ISP or an institutional ASN. So, if you merely look at returning IP Addresses to measure returning visits, your perception of returning visitors will be biased low.
For example, a typical ISP may respond to every IP address request with one of a number of IP addresses available in its allocation space. This may mean a new IP address allocation every time a user connects to the Internet, or after a reboot, or once a particular IP allocation has expired.
At the extreme end of dynamic IP address allocation are ISPs such as AOL that issue a new IP address for every page that a visitor views.
Dynamic allocation makes it difficult to equate IP Addresses with visitors, an assignment that requires additional criteria.
(4) The Absence of Unique Visitor Identifiers
Browser privacy being what it is, you typically cannot (nor in good conscience should you) search for an actual identity. However, what you can do is put together a virtual “pseudo-identity” that is likely to be associated with either an individual or a small number of closely associated individuals (e.g. those sharing a computer or access point). This gives enough information to obtain a reasonable estimate of whether a given visitor is new or returning, and of their engagement level with the site and content.
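One way such a pseudo-identity might be assembled is sketched below. This is an assumption for illustration, not a description of any particular package: it combines the visitor's IPv4 network prefix (rather than the full address, to tolerate some dynamic re-allocation) with the user agent and the Accept-Language header, and hashes the result so that no raw values need to be stored. The /24 prefix length is an arbitrary choice, and Accept-Language is only available if your server is configured to log it.

```python
import hashlib
import ipaddress

def pseudo_identity(ip, user_agent, accept_language="", prefix_len=24):
    """Return a short opaque key for a visitor (hypothetical scheme, IPv4 only)."""
    # Collapse the address to its enclosing network, e.g. 203.0.113.7 -> 203.0.113.0/24.
    network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    key = "|".join([str(network), user_agent.strip(), accept_language.strip().lower()])
    # Hash so that the key cannot be read back as an address or browser string.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

ua = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
first_visit = pseudo_identity("203.0.113.7", ua, "en-US")
later_visit = pseudo_identity("203.0.113.200", ua, "en-US")   # same /24, new address
other_visitor = pseudo_identity("198.51.100.7", ua, "en-US")  # different network
print(first_visit == later_visit, first_visit == other_visitor)
```

The trade-off is deliberate: a coarser key merges more distinct people into one pseudo-identity, while a finer key splits one person across several.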
(5) Feeds and Feed Readers
Feed issuers and feed readers mean that users can (depending on how you set these up) read text versions of your content without actually visiting your site. This is a problem for both server-side and client-side statistics, and even for feed stat-keepers such as Google’s FeedBurner.
(6) Your Own Organization’s Interfering Traffic
As a site owner, your activity on your site is an additional, possibly non-trivial source of “noise” in your visitor statistics. Every time you review your site, add content, test it, or maintain it (especially in the early stages of a site launch), you add to the total hit and page counters. Unless you filter out your own activity, you will be biasing your results high.
Similarly, if your organization spends time on various parts of your site, then these visits should also be filtered out — unless page reads by your team count as ‘reads’ that you want to track.
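If your own traffic comes from a known set of addresses, it can be excluded with a simple network membership test. The sketch below uses Python's standard ipaddress module; the networks listed are placeholders from the documentation-reserved ranges and would be replaced with your office, home, and server addresses. It does not help when you browse from a dynamically allocated address.

```python
import ipaddress

# Assumption: placeholder networks standing in for your organization's addresses.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),      # e.g. an office network
    ipaddress.ip_network("198.51.100.17/32"),  # e.g. a single home connection
]

def is_internal(ip):
    """Return True if the address falls inside any of the listed networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in INTERNAL_NETWORKS)

visits = ["192.0.2.55", "198.51.100.17", "198.51.100.18", "203.0.113.1"]
external = [ip for ip in visits if not is_internal(ip)]
print(external)
```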
(7) Self-Reinforcing Visitation Patterns and Time-Varying Change in Variance
As your web traffic grows, traffic monitors and other bots are attracted, which makes volumetric comparisons tricky. The variance of the time-series is affected not only by your own and your organization’s behavior, but also by the behavior of other users.
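A simple way to see whether the variance of a traffic series is changing over time is to compute the mean and variance over a trailing window. The sketch below is only an illustration: the function name, the 7-day window, and the daily counts are all made up for the example and are not measurements from any real site.

```python
from statistics import mean, pvariance

def rolling_stats(counts, window=7):
    """Return a (mean, variance) pair for each full trailing window of counts."""
    return [
        (mean(counts[i - window:i]), pvariance(counts[i - window:i]))
        for i in range(window, len(counts) + 1)
    ]

# Hypothetical daily visit counts: a quiet first week, then an erratic second week.
daily = [100, 102, 98, 101, 99, 100, 100, 150, 90, 160, 80, 170, 95, 155]
stats = rolling_stats(daily, window=7)

first_mean, first_var = stats[0]
last_mean, last_var = stats[-1]
print(first_var < last_var)
```

When the windowed variance drifts like this, comparing raw counts from different periods as if they had the same noise level is likely to mislead.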
The Need for Filtering
The reasons discussed above mean that using raw visitor statistics, without careful cleanup, for estimates of visitor traffic is fraught with inaccuracy. Both the direction of the biases and the variance in the estimates may be changing. Trying to untangle the data takes some effort.
“Good News Cheap, Bad News Expensive” — or Why Information is often used “As Is” in Business Decision-Making
As we’ve seen, website statistics are a perfect example of a moving, noisy target. It is clear that reasonably good answers are buried somewhere in the wealth of raw traffic data. But how those answers relate to the raw data, and how to get at them, is not obvious and takes work.
I’ll call this one the “Good News Cheap, Bad News Expensive” Paradox. The name highlights the problem. Everyone wants good news. But, if the raw data already deliver so-called “good news”, in this case through high visitation numbers, or a large proportion of returning visitors, then this “good news” is in itself a counter-motivation to investing further time and energy in developing better heuristic filters. This is even more so if it is clear that the advanced techniques are likely to deliver worse news, for example, that your real visitors (the human ones who are likely to be actually reading your articles) are fewer than the raw numbers suggest.
The difference between the two numbers can be quite dramatic. For example, in a small, relatively new site, the difference between raw or lightly filtered visitor stats and thoroughly filtered stats, in which a substantial amount of spurious traffic is removed, can be as much as 30% of the total traffic. That kind of reduction in “good news” requires substantial additional insights to soften the blow.
Enter data mining techniques.
Escaping the Paradox — the Value of Data Mining
Like many paradoxes, the resolution to the “Good News Cheap, Bad News Expensive” Paradox is a change in point of view. Through the use of data mining techniques, one obtains a more detailed, multi-dimensional characterization of visitor profiles and behavior, something that is impossible when looking only at the raw or lightly filtered statistics. It is this additional information that offers the additional value required to escape the “Good News Cheap, Bad News Expensive” Paradox.
For those willing to put in the effort in pursuit of that elusive better understanding, heuristic filtering and data mining techniques are an approach well worth considering.
In the second article of this series, I’ll turn to the questions of designing heuristic filters. In particular, I’ll discuss a radar-tracking approach to improving visitor estimates that works by modelling the web-server as a detection sensor. Stay tuned!