Web server statistics

The Apache web server can log information about client access in many independent configurable ways. However, a typical installation will create log files in a common fixed content format.

This exercise demonstrates how perl or php can read an http access log and store the information, encoded to reduce storage requirements, in a PostgreSQL database.

The logfile

The common logfile format is as follows:

remotehost rfc931 authuser [date] "request" status bytes <"software">

The information contained in a line of a logfile can be read using this perl statement:

( $l_clientAddress
, $l_rfc1413
, $l_username
, $l_localTime
, $l_httpRequest
, $l_statusCode
, $l_bytesSentToClient
, $l_referer
, $l_clientSoftware
)
= /^(\S+) (\S+) (\S+) \[(.+)\] \"(.+)\" (\S+) (\S+) \"(.*)\" \"(.*)\"/o;

or this php statement:

preg_match (
  '/^(\S+) (\S+) (\S+) \[(.+)\] \"(.+)\" (\S+) (\S+) \"(.*)\" \"(.*)\"/'
, $logLine
, $match
);

    $l_clientAddress     = $match[1];
    $l_rfc1413           = $match[2];
    $l_username          = $match[3];
    $l_localTime         = $match[4];
    $l_httpRequest       = $match[5];
    $l_statusCode        = $match[6];
    $l_bytesSentToClient = $match[7];
    $l_referer           = $match[8];
    $l_clientSoftware    = $match[9];

So essentially, we have nine pieces of information

( v0, v1, v2, v3, v4, v5, v6, v7, v8 )

The database

A database system will be used to store the information read from the access log. The next task is to determine a data structure that will contain the same information as the logfile, but will take up less space. Since each value (containing strings of characters) found in the logfile will have a probability of being repeated in different logfile lines, it makes sense to encode the value by assigning a numeric identifier to correspond with the value and using the numeric identifier in place of the string.

Consider these two core tables:

access
  logfile_id    INTEGER
  line_number   INTEGER
  client_id     INTEGER
  rfc1413_id    INTEGER
  username_id   INTEGER
  local_time    TIMESTAMP
  request_id    INTEGER
  status        INTEGER
  bytes_sent    INTEGER
  referer_id    INTEGER
  software_id   INTEGER

and

ids
  id        SERIAL
  id_value  VARCHAR(200)
  id_type   INTEGER

The id_type corresponds to v0 to v8 fields read from the logfile. Since the values of fields v3, v5 and v6 can be stored directly in an integer column, they do not need to be encoded using identifiers. The actual id_type values for the various types are not fixed, since they are also encoded into the ids table. A first run of the program should yield this table portion:

 id |          id_value          | id_type
----+----------------------------+---------
  1 | id types                   |       0
  2 | logfile                    |       1
  3 | client ip                  |       1
  4 | rfc1413                    |       1
  5 | username                   |       1
  6 | request                    |       1
  7 | referer                    |       1
  8 | software                   |       1

When a line in the logfile is read, the values of v0, v1, v2, v4, v7 and v8 are looked up in the ids table using a getId function. If the value is found, the identifier is returned; if not, the value is inserted into ids and the new identifier is returned.

For client ip id_types, there will be a corresponding additional id for the remote host of the client per gethostbyaddr. The id_type of a client ip's host is the client ip's id.

Populating the database tables

A perl or php program is used to read the logfile and insert the information into your database tables.

The program can be run any number of times at any interval you wish. The logfile from which an access record came from is identified by the localTime value of the first line in the logfile and the line number in the logfile.
Note: The program reads all access log files in a path, including older ones that are gzipped.

When the program is run, the logfile identifier is determined, as is the last row stored in access. The program will not begin inserting new data until parsing data in the logfile beyond the previous last row processed.

The programs

perl	http-log-to-db.pl	Uses DBI. Database is Postgresql, so it also uses package Pg
php	http-log-to-db.php	Database is Postgresql, so it uses pg_* functions

Using the data in the database

Reports can be generated from data stored in the backend by creating appropriate SQL statements.

This query recreates the original access log

create view client_info as
select
  A.id
, A.id_value as client
, B.id_value as client_host
from
  ids A
, ids B
where
  A.id = B.id_type
  and A.id_type = (select id from ids where id_value = 'client ip')
;

create view access_log as
select
  A.client
, A.client_host
, B.id_value as rfc1413
, C.id_value as username
, access.local_time
, D.id_value as request
, access.status
, access.bytes_sent
, E.id_value as referer
, substr (E.id_value
         , 1
         , position ('/' in substr (E.id_value
                                   , 2 + position ('//' in E.id_value)
                                   )
                    )
         + position ('//' in E.id_value)
         + 1)
  as referer_host
, F.id_value as software
from
  access
, client_info A
, ids B
, ids C
, ids D
, ids E
, ids F
where
    access.client_id   = A.id
and access.rfc1413_id  = B.id
and access.username_id = C.id
and access.request_id  = D.id
and access.referer_id  = E.id
and access.software_id = F.id

Rank referers by descending frequency

select
  count (*) as n
, referer
from
  access_log
where
  referer != '-'
group by
  referer
order by
  n DESC;

Rank referer sites by descending frequency

select
  count (*) as n
, referer_host
from
  access_log
where
  referer_host != '-'
group by
  referer_host
order by
  n DESC;

Etc...

This page was last updated 25 March 2001.