The Apache web server can log information about client access in many independent configurable ways. However, a typical installation will create log files in a common fixed content format.
This exercise demonstrates how perl or php can read an http access log and store the information, encoded to reduce storage requirements, in a PostgreSQL database.
The common logfile format is as follows:
remotehost rfc931 authuser [date] "request" status bytes
<"software">
The information contained in a line of a logfile can be read using this perl statement:
( $l_clientAddress , $l_rfc1413 , $l_username , $l_localTime , $l_httpRequest , $l_statusCode , $l_bytesSentToClient , $l_referer , $l_clientSoftware ) = /^(\S+) (\S+) (\S+) \[(.+)\] \"(.+)\" (\S+) (\S+) \"(.*)\" \"(.*)\"/o;
or this php statement:
preg_match ( '/^(\S+) (\S+) (\S+) \[(.+)\] \"(.+)\" (\S+) (\S+) \"(.*)\" \"(.*)\"/' , $logLine , $match ); $l_clientAddress = $match[1]; $l_rfc1413 = $match[2]; $l_username = $match[3]; $l_localTime = $match[4]; $l_httpRequest = $match[5]; $l_statusCode = $match[6]; $l_bytesSentToClient = $match[7]; $l_referer = $match[8]; $l_clientSoftware = $match[9];
So essentially, we have nine pieces of information
( v0, v1, v2, v3, v4, v5, v6, v7, v8 )
A database system will be used to store the information read from the access log. The next task is to determine a data structure that will contain the same information as the logfile, but will take up less space. Since each value (containing strings of characters) found in the logfile will have a probability of being repeated in different logfile lines, it makes sense to encode the value by assigning a numeric identifier to correspond with the value and using the numeric identifier in place of the string.
Consider these two core tables:
access logfile_id INTEGER line_number INTEGER client_id INTEGER rfc1413_id INTEGER username_id INTEGER local_time TIMESTAMP request_id INTEGER status INTEGER bytes_sent INTEGER referer_id INTEGER software_id INTEGER
and
ids id SERIAL id_value VARCHAR(200) id_type INTEGER
The id_type corresponds to v0 to v8 fields read from the logfile. Since the values of fields v3, v5 and v6 can be stored directly in an integer column, they do not need to be encoded using identifiers. The actual id_type values for the various types are not fixed, since they are also encoded into the ids table. A first run of the program should yield this table portion:
id | id_value | id_type ----+----------------------------+--------- 1 | id types | 0 2 | logfile | 1 3 | client ip | 1 4 | rfc1413 | 1 5 | username | 1 6 | request | 1 7 | referer | 1 8 | software | 1
When a line in the logfile is read, the values of v0, v1, v2, v4, v7 and v8 are looked up in the ids table using a getId function. If the value is found, the identifier is returned; if not, the value is inserted into ids and the new identifier is returned.
For client ip id_types, there will be a corresponding additional id for the remote host of the client per gethostbyaddr. The id_type of a client ip's host is the client ip's id.
A perl or php program is used to read the logfile and insert the information into your database tables.
The program can be run any number of times at any interval you wish. The logfile from which an access record came from is identified by the localTime value of the first line in the
logfile and the line number in the logfile.
Note: The program reads all access log files in a path, including older ones that are gzipped.
When the program is run, the logfile identifier is determined, as is the last row stored in access. The program will not begin inserting new data until parsing data in the logfile beyond the previous last row processed.
perl | http-log-to-db.pl | Uses DBI. Database is Postgresql, so it also uses package Pg |
php | http-log-to-db.php | Database is Postgresql, so it uses pg_* functions |
Reports can be generated from data stored in the backend by creating appropriate SQL statements.
create view client_info as select A.id , A.id_value as client , B.id_value as client_host from ids A , ids B where A.id = B.id_type and A.id_type = (select id from ids where id_value = 'client ip') ; create view access_log as select A.client , A.client_host , B.id_value as rfc1413 , C.id_value as username , access.local_time , D.id_value as request , access.status , access.bytes_sent , E.id_value as referer , substr (E.id_value , 1 , position ('/' in substr (E.id_value , 2 + position ('//' in E.id_value) ) ) + position ('//' in E.id_value) + 1) as referer_host , F.id_value as software from access , client_info A , ids B , ids C , ids D , ids E , ids F where access.client_id = A.id and access.rfc1413_id = B.id and access.username_id = C.id and access.request_id = D.id and access.referer_id = E.id and access.software_id = F.id
select count (*) as n , referer from access_log where referer != '-' group by referer order by n DESC;
select count (*) as n , referer_host from access_log where referer_host != '-' group by referer_host order by n DESC;