The archiving process

Log data cannot be used directly for end-user reports because that would require processing an enormous amount of data every time a report is requested.

To solve that problem, the archiving process aggregates log data into archive data. Reports are then built using archive data.

Example

Let's take as an example a website that received 1000 page views in one day. The log data would be the list of those 1000 events along with other information, for example:

URL         Time     ...
/homepage   17:00:19 ...
/about      17:01:10 ...
/homepage   17:05:30 ...
/categories 17:06:14 ...
/homepage   17:10:03 ...
...

The archiving process aggregates this raw data into archive data.

For example, to build the report of the number of views per page (to see the most popular pages), the archiving process lists all pages and sums the number of views for each page:

URL         Page views
/homepage   205
/categories 67
/about      5
...

That data is the archive data.
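The aggregation described above can be sketched in a few lines of plain PHP. This is only an illustration of the idea, not Piwik's implementation: the $logRows array and the aggregatePageViews() function are hypothetical names, and real log data lives in the database.

```php
<?php
// Hypothetical log rows: one entry per page view.
$logRows = [
    ['url' => '/homepage',   'time' => '17:00:19'],
    ['url' => '/about',      'time' => '17:01:10'],
    ['url' => '/homepage',   'time' => '17:05:30'],
    ['url' => '/categories', 'time' => '17:06:14'],
    ['url' => '/homepage',   'time' => '17:10:03'],
];

/**
 * Aggregate raw page view rows into "URL => page views", most viewed first.
 */
function aggregatePageViews(array $logRows): array
{
    // Count how many times each URL appears in the log rows.
    $pageViews = array_count_values(array_column($logRows, 'url'));
    // Sort by view count, descending, while keeping the URL keys.
    arsort($pageViews);
    return $pageViews;
}

$archive = aggregatePageViews($logRows);
print_r($archive);
```

A report can then read the small $archive array instead of scanning every log row again.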

Pre-computing archive data may seem superfluous for 1000 page views, but it is not when dealing with larger amounts of data.

When?

By default, archive data is calculated and cached on demand. When a specific report is requested, Piwik checks whether the required archive data exists and generates it if it does not.
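The on-demand behaviour follows a check-then-generate pattern, which the following minimal sketch illustrates. It assumes an in-memory array as the archive store; getArchive(), the $compute callback and the key format are hypothetical and do not correspond to Piwik's API.

```php
<?php
/**
 * Return archive data for a key, computing and caching it on first request.
 * $cache and $compute are hypothetical stand-ins for the archive storage
 * and the aggregation step.
 */
function getArchive(array &$cache, string $key, callable $compute)
{
    if (!array_key_exists($key, $cache)) {
        // Archive data is missing: generate it and store it for later requests.
        $cache[$key] = $compute();
    }
    return $cache[$key];
}

$cache = [];
$computations = 0;
$compute = function () use (&$computations) {
    $computations++; // track how often the expensive aggregation runs
    return ['/homepage' => 205, '/categories' => 67, '/about' => 5];
};

// Illustrative key only; the same key is requested twice.
$first  = getArchive($cache, 'site1.day', $compute);
$second = getArchive($cache, 'site1.day', $compute);
```

In this sketch the second request is served from the cache, so the aggregation runs only once.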

Pre-archiving

When tracking a website with a lot of traffic, on-demand archiving might take too much time. In those situations, on-demand archiving must be disabled and pre-archiving needs to run in the background at a scheduled time.

Pre-archiving can be run for every site and period (except custom date ranges) using the core:archive console command:

$ ./console core:archive

A usual setup is to run that command at a fixed interval using cron.

The command will remember when it was last executed and will only archive a website if there have been new visits.
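The "only archive if there have been new visits" rule can be pictured as a comparison between the time of the previous run and the latest visit of each site. The sketch below is an assumption about the general idea, not the command's actual code; sitesToArchive() and the timestamps are hypothetical.

```php
<?php
/**
 * Pick the sites that need archiving: those with a visit newer than the
 * previous run. All names and timestamps here are hypothetical.
 *
 * @param array    $lastVisitBySite site ID => Unix timestamp of the latest visit
 * @param int|null $lastRun         Unix timestamp of the previous run, or null
 */
function sitesToArchive(array $lastVisitBySite, ?int $lastRun): array
{
    $sites = [];
    foreach ($lastVisitBySite as $siteId => $lastVisit) {
        // On the very first run there is nothing to compare against.
        if ($lastRun === null || $lastVisit > $lastRun) {
            $sites[] = $siteId;
        }
    }
    return $sites;
}

$lastVisits = [1 => 1000, 2 => 2500, 3 => 1800];
$toArchive  = sitesToArchive($lastVisits, 2000);
$firstRun   = sitesToArchive($lastVisits, null);
```

Here only site 2 has a visit after the previous run, so it is the only one selected; on a first run every site is selected.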

How?

Log data is aggregated into archive data for each:

  • site
  • period: day, week, month, year or custom date range (custom date ranges cannot be pre-archived)
  • segment

Archiving logic (i.e. the way of aggregating log data) is defined by plugins. All reports defined by a plugin are archived together rather than individually.

If no segment is given in the query and the requested data cannot be found, every report of every plugin is generated and cached at once. If a segment is supplied, only the reports that belong to the same plugins as the requested data are generated and cached.
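The rule above, deciding which plugins get archived when data is missing, can be written as a small function. This is a sketch of the rule as stated in this section only; pluginsToArchive(), the plugin list and the segment string are illustrative and not taken from Piwik's code.

```php
<?php
/**
 * Decide which plugins' reports are archived when requested data is missing.
 *
 * @param array  $requestedPlugins plugins that own the requested data
 * @param string $segment          segment definition, '' when none is given
 * @param array  $allPlugins       every plugin that defines archiving logic
 */
function pluginsToArchive(array $requestedPlugins, string $segment, array $allPlugins): array
{
    if ($segment === '') {
        // No segment: archive every report of every plugin at once.
        return $allPlugins;
    }
    // Segment supplied: archive only the plugins that own the requested data.
    return $requestedPlugins;
}

$allPlugins = ['Actions', 'Referrers', 'UserSettings'];
$withoutSegment = pluginsToArchive(['Actions'], '', $allPlugins);
$withSegment    = pluginsToArchive(['Actions'], 'browserCode==FF', $allPlugins);
```

Restricting segmented requests to the relevant plugins keeps the cost of archiving a segment proportional to what was actually asked for.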

Period aggregations

Archive data is calculated differently based on the period type:

  • "day" periods are aggregations of log data
  • "week", "month", "year" and custom date ranges are aggregations of "day" archive data

For example, archive data for a week is created by aggregating the archive data of the 7 days of that week. This is much faster than aggregating log data.
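To illustrate why this is cheap, the sketch below builds a multi-day total by summing already-aggregated "day" arrays instead of re-reading log rows. It assumes an additive metric (page views) and uses a hypothetical aggregateDayArchives() helper; in Piwik this work is done by ArchiveProcessor.

```php
<?php
/**
 * Sum per-URL page views across several "day" archives.
 * Each day archive is an array of "URL => page views".
 */
function aggregateDayArchives(array $dayArchives): array
{
    $total = [];
    foreach ($dayArchives as $day) {
        foreach ($day as $url => $views) {
            // Add this day's views to the running total for the URL.
            $total[$url] = ($total[$url] ?? 0) + $views;
        }
    }
    return $total;
}

// Three hypothetical day archives (a full week would have seven).
$days = [
    ['/homepage' => 205, '/about' => 5],
    ['/homepage' => 190, '/categories' => 67],
    ['/about' => 12],
];
$period = aggregateDayArchives($days);
```

The loop touches one row per page and day rather than one row per page view, which is where the speed-up comes from.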

Plugin Archivers

Plugins that want to archive reports and metrics define a class called Archiver that extends Piwik\Plugin\Archiver. This class will be automatically detected and called during the archiving process.

Log data aggregation is handled by the LogAggregator class. Archive data aggregation is handled by the ArchiveProcessor::aggregateDataTableRecords() and ArchiveProcessor::aggregateNumericMetrics() methods.

Plugins can access a LogAggregator instance and an ArchiveProcessor instance through Piwik\Plugin\Archiver.

To learn more about how aggregation is accomplished with Piwik's MySQL backend, read about the database schema.

Persisting archive data

Archive data is persisted using ArchiveProcessor.

Metrics are inserted using ArchiveProcessor::insertNumericRecord().

Reports are first serialized using DataTable::getSerialized() and then inserted using ArchiveProcessor::insertBlobRecord():

use Piwik\Config;
use Piwik\Metrics;

// insert a numeric metric
$myFancyMetric = // ... calculate the metric value ...
$archiveProcessor->insertNumericRecord('MyPlugin_myFancyMetric', $myFancyMetric);

// insert a record (with all of its subtables)
$maxRowsInTable = Config::getInstance()->General['datatable_archiving_maximum_rows_standard'];

$dataTable = // ... build by aggregating visits ...
$serializedData = $dataTable->getSerialized(
    $maxRowsInTable,
    $maxRowsInSubtable = $maxRowsInTable,
    $columnToSortBy = Metrics::INDEX_NB_VISITS
);

$archiveProcessor->insertBlobRecord('MyPlugin_myFancyReport', $serializedData);

Persisted reports and metrics are indexed by website ID, period and segment. The date and time of archiving are also attached to the data. To learn the specifics of how this is done with MySQL, see the database schema.

Reports vs Records

When a report is archived, it is called a record, not a report. We make this distinction because multiple reports can sometimes be generated from one record.

For example, the UserSettings plugin uses one record to hold browser details of visitors. This record is used to generate both the UserSettings.getBrowserVersion and UserSettings.getBrowser reports. The second report simply processes the first to make a new report. The plugin could have archived both reports, but this would have been a massive waste of space, considering the new report would be cached for every website/period/segment combination.
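The relationship between one record and two reports can be sketched as follows. The record layout (a "browser;version" label mapped to a visit count) and both functions are hypothetical simplifications for illustration; they are not the UserSettings plugin's actual data format or code.

```php
<?php
// Hypothetical archived record: visits per "browser;version" label.
$record = [
    'Firefox;26.0' => 40,
    'Firefox;27.0' => 25,
    'Chrome;32.0'  => 60,
];

/** Report 1: the record as stored, one row per browser version. */
function getBrowserVersionReport(array $record): array
{
    return $record;
}

/** Report 2: derived from the same record by grouping rows per browser. */
function getBrowserReport(array $record): array
{
    $report = [];
    foreach ($record as $label => $visits) {
        // Keep only the browser name, dropping the version part.
        $browser = explode(';', $label)[0];
        $report[$browser] = ($report[$browser] ?? 0) + $visits;
    }
    return $report;
}

$byVersion = getBrowserVersionReport($record);
$byBrowser = getBrowserReport($record);
```

Only $record needs to be persisted; the per-browser report is recomputed from it whenever it is requested.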

Record storage guidelines

Care must be taken to store as little as possible when persisting records. Make sure to follow the guidelines below before inserting records as archive data:

  • Records should not be stored with string column names. Instead, string column names should be replaced with integer column IDs (see Metrics for a list of existing ones).
  • Metadata that can be derived from existing data should not be stored with records. Instead, it should be added in API methods when turning records into reports.
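Both guidelines can be illustrated with a small round trip: string column names are swapped for integer IDs before storage, and derivable metadata is added only when the stored row is turned back into a report row. Everything in this sketch is an assumption made for illustration: the COLUMN_IDS map, the two helper functions and the example.org URL are hypothetical, and the real column IDs are the constants defined in Metrics.

```php
<?php
// Hypothetical integer column IDs; the real ones are listed in Metrics.
const COLUMN_IDS = ['nb_visits' => 2, 'nb_actions' => 3];

/** Before storage: replace string column names with integer IDs. */
function toRecordRow(array $row): array
{
    $stored = [];
    foreach ($row as $name => $value) {
        $stored[COLUMN_IDS[$name]] = $value;
    }
    return $stored;
}

/** In the API: restore column names and add metadata derived from the label. */
function toReportRow(array $stored, string $label): array
{
    $names = array_flip(COLUMN_IDS);
    $row = ['label' => $label];
    foreach ($stored as $id => $value) {
        $row[$names[$id]] = $value;
    }
    // Derivable metadata is computed here instead of being stored.
    $row['url'] = 'http://example.org' . $label;
    return $row;
}

$stored = toRecordRow(['nb_visits' => 10, 'nb_actions' => 25]);
$report = toReportRow($stored, '/homepage');
```

The stored row carries only two small integer keys and their values, which matters because every record is kept for each website, period and segment combination.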