optimisation
juust | December 15, 2008Adriaan asked about handling a large mysql table with regard to a traffic log :
I have a similar problem with a rank tracker, a table that grows fast, and after a while queries become painful slow. I want to show results per domain on screen in less than a second the next two years (generally 5 to 20 urls per domain, with 200.000 urls for 80.000 domains per month).
warehouse
Some information systems offer a snapshot reporting system. Using it means shifting the focus from querying on raw entry tables towards aggregating the entry data into tables and reporting from these tables (technorati cosmos, google analytics, sap business warehouse). Successful systems minimize their costs (resources like database and application server capacity, and admin hours) whilst offering end users a rich data representation with short report generation time.
If I use the traditional “data function form” application model, I can optimise all three areas and avoid querying on large tables.
data
First the data layer :
For raw entries I use one archive table, one table to store entries of the current period,
and besides that, a set of aggregated data tables.
- for raw entries : a small temp-table and basic insert-queries
- for raw entries : an archive table with a periodic insert of the temp-table
- for end user reporting : basic queries on the snapshot/warehouse tables
- a procedure to archive raw entries
- a procedure to aggregate to snapshot tables
The insert instructions are performed on a small table, which works faster. Transforming data into information is done in the aggregation procedure, once a day or once a month. These are the same queries I would use for end-user reports, but now I report to the snapshot ‘warehouse’.
The cycles and time for operations that I would normally perform on the whole set for each report query, are stored in the warehouse aggregates. After aggregating the raw entry data, I archive the records (=dump the temp table into the archive table) and clear the temp tables for new input. With a bit of luck, I don’t have to access the archive anymore.
functions
After bringing down query-time in the data layer, I can diminish the number of times functions are run by caching parts of pages to disk, as html tables or a serialized array. In stead of generating the same table every time a user hits a page, I can store the generated table and retrieve it from a server cache directory. That can also save some connection handles.
forms
In the form layer I can use the apache gzip module and compress data sent to a clients browser which decreases traffic and page loading time.
evaluation
I am testing it with a site, for now it saves up to 80% in data traffic and page loading, next few months I’ll see if it holds up. It seems to work fine for registration systems with a lot of static historic data like a hit log, rank tracker, time sheets, banking statements.
Negative :
- more tables
- requires more programming and planning
- I cannot easily add ‘on the fly’ reports
Positive :
- minimal claim on database/application server resources
- more control over transformation of data
- good basis for a ‘rich’ frontend
- scalable
- fast pages
- no large tables, fast queries







