Apache Parquet for Compliance Archiving | Scalefield Secure by CYBERTEC

Regulatory Context

Compliance Doesn't Expire.
Your Archive Shouldn't Either.

Most major compliance frameworks require database audit logs to be retained for at least a year, and several for far longer. A legacy solution that stores data in bloated row-based formats, or in proprietary binary files, becomes a financial and operational liability as your archive grows. Scalefield Secure solves this with Apache Parquet.

SOX

7 yrs

Sarbanes-Oxley is widely interpreted as requiring public companies to retain audit records for financial systems for a minimum of 7 years.

GDPR

6 yrs

GDPR sets no fixed log retention period. The Article 5 accountability principle requires controllers to demonstrate compliance, and organisations commonly keep data access logs for up to 6 years to do so.

HIPAA

6 yrs

The US Health Insurance Portability and Accountability Act requires covered entities to retain audit control documentation for 6 years.

PCI DSS

1 yr

The Payment Card Industry standard requires at least 12 months of audit log history, with the most recent 3 months immediately available for analysis.

ISO 27001

3 yrs

ISO 27001 Annex A.12.4 requires event logs that can support forensic investigations. The standard itself sets no fixed period; a 3-year retention is a common choice.

BSI C5

10 yrs

BSI Cloud Computing Compliance Controls Catalogue mandates up to 10-year log retention for German-regulated entities.

The Foundation

Why Columnar Storage Changes Everything

Traditional relational databases — and most legacy compliance log stores — use row-oriented storage. Every row is written together, which is efficient for transactional writes but disastrous for analytical queries over millions of rows.

Apache Parquet uses columnar storage: values for the same column are stored contiguously on disk. When you want to ask "how many SELECT queries did user X run last quarter?", Parquet only reads the query_type and user columns — skipping everything else entirely.

Read only the columns your query touches — skip irrelevant data entirely
Identical data types in each column compress far more efficiently
Column statistics enable predicate pushdown — skip entire row groups before reading
Vectorised CPU instructions process entire column batches at native speed
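The first point in the list can be sketched in a few lines of Python. This is a toy model only, assuming a hypothetical three-column audit log (the field names are illustrative): once the data is pivoted into one list per column, the question "how many SELECT queries did user X run?" touches two columns and never looks at the bulky query text.

```python
# Toy audit log as rows. Field names are illustrative, not a real schema.
rows = [
    {"user": "alice", "query_type": "SELECT", "query_text": "SELECT * FROM payroll"},
    {"user": "bob", "query_type": "INSERT", "query_text": "INSERT INTO orders VALUES (1)"},
    {"user": "alice", "query_type": "SELECT", "query_text": "SELECT id FROM customers"},
    {"user": "alice", "query_type": "DELETE", "query_text": "DELETE FROM sessions"},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def count_selects(columns, user):
    """Count SELECTs for one user, reading only the two columns the question needs."""
    return sum(
        1
        for u, q in zip(columns["user"], columns["query_type"])
        if u == user and q == "SELECT"
    )


columns = to_columns(rows)
print(count_selects(columns, "alice"))  # 2
```

In a real Parquet reader the same idea applies at the storage level: the column chunks for query_text are simply never fetched.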

Columnar layout reads only relevant columns, so irrelevant data is never fetched from storage

File Format Internals

Inside a Parquet File

A Parquet file is a self-describing, hierarchically structured binary format. Understanding its anatomy explains why it is simultaneously compact, fast to query, and resilient to corruption — properties that matter deeply in compliance contexts.

Parquet file structure — self-describing with schema and statistics embedded in the footer; row groups enable parallel I/O

Row Groups

A Parquet file is divided into horizontal partitions called Row Groups — typically 128 MB to 1 GB each. Each row group is fully independent, enabling parallel reads across CPU cores or distributed workers. For a 10-year compliance archive, this means analytical queries can be serviced simultaneously across hundreds of row groups.

Default size: 128 MB — tunable for your hardware profile
Independent row groups enable full parallel I/O
Each row group stores its own column statistics for pruning
Corruption confined to one row group's data pages leaves the other row groups readable, as long as the footer is intact
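Because row groups are independent, a scan can be split into one task per row group and the partial results combined afterwards. The sketch below only illustrates that structure, using plain Python lists as stand-ins for row groups; the function names are hypothetical, and a real engine would read column chunks from disk rather than in-memory lists.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for the event_type column of one independent row group.
row_groups = [
    ["SELECT", "DDL", "SELECT"],
    ["INSERT", "SELECT"],
    ["DDL", "DDL", "UPDATE"],
]


def count_ddl(row_group):
    """Scan a single row group; it needs no data from any other group."""
    return sum(1 for event in row_group if event == "DDL")


def count_ddl_parallel(groups, workers=4):
    """Scan row groups concurrently and combine the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_ddl, groups))


print(count_ddl_parallel(row_groups))  # 3
```

Threads are used here only to keep the example short; the point is that no task depends on another, which is what lets distributed workers share the same scan.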

The Self-Describing Footer

Parquet stores all schema and statistical metadata in a Thrift-encoded footer at the end of the file. Unlike CSV or plain logs, a Parquet file carries its own data dictionary: column names, data types, row counts, minimum and maximum values per column chunk. This metadata is read first — allowing query engines to skip row groups entirely without touching data pages.

Schema is embedded — no external catalog required
Min/max/null statistics drive predicate pushdown
Column offsets allow seek-based random access
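Magic bytes PAR1 at the start and end of the file identify a complete Parquet file; optional page checksums (CRC32) detect corrupted pages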
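A minimal sketch of how a reader finds the footer, assuming the standard Parquet layout: the file starts and ends with the 4-byte magic PAR1, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The sample bytes below are a synthetic stand-in; a real footer is Thrift-encoded metadata, which this sketch does not decode.

```python
import struct

MAGIC = b"PAR1"


def locate_footer(data: bytes):
    """Return (offset, length) of the footer metadata inside a Parquet byte string."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a complete Parquet file: PAR1 magic missing")
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len


# Synthetic stand-in: placeholder bytes where data pages and footer would be.
fake_pages = b"\x00" * 20
fake_footer = b"schema+stats"
blob = MAGIC + fake_pages + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC
print(locate_footer(blob))  # (24, 12)
```

Reading the last 8 bytes first is why a query engine can fetch the schema and statistics with a couple of small reads, even for a very large file on object storage.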

Storage Efficiency

Why Columnar Data Compresses So Well

Because Parquet stores identical data types contiguously — all user names together, all timestamps together, all query types together — standard compression algorithms achieve dramatically higher ratios than row-based formats where varied data types are interleaved.

Format & Codec	Description	Size (1 TB raw)	Saving
Raw CSV / Plain text	No compression, row-based	1,000 GB	baseline
Row-based DB export (gzip)	Compressed rows, limited gains	~600 GB	~40% saved
Parquet + Snappy	Fast columnar compression	~220 GB	~78% saved
Parquet + Zstd (level 3)	Balanced compression, Scalefield default	~130 GB	~87% saved
Parquet + Zstd (level 19)	Maximum compression for deep archive	~80 GB	~92% saved

* Ratios are representative for typical database audit log workloads with high-cardinality usernames and low-cardinality action types. Actual ratios depend on data entropy.
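The "Saving" column follows directly from the sizes: saving = 1 - compressed / raw. The short sketch below reproduces that arithmetic from the table's own figures and adds the equivalent compression ratio. The helper names are illustrative, and the sizes are the representative values quoted above, not measurements.

```python
RAW_GB = 1000  # the table's 1 TB baseline

# Representative sizes copied from the table above.
sizes_gb = {
    "Row-based DB export (gzip)": 600,
    "Parquet + Snappy": 220,
    "Parquet + Zstd (level 3)": 130,
    "Parquet + Zstd (level 19)": 80,
}


def saving_percent(compressed_gb, raw_gb=RAW_GB):
    """Share of the raw size that no longer needs to be stored, in percent."""
    return round(100 * (1 - compressed_gb / raw_gb))


def compression_ratio(compressed_gb, raw_gb=RAW_GB):
    """How many raw bytes each stored byte represents."""
    return round(raw_gb / compressed_gb, 1)


for name, size in sizes_gb.items():
    print(f"{name}: {saving_percent(size)}% saved, {compression_ratio(size)}:1")
```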

Encoding

Smart Encoding
Before Compression

Parquet applies column-level encoding before any compression codec. This transforms data into a form that compresses far more efficiently — and often speeds up queries independently. Different columns use different encodings automatically based on data characteristics.

Dictionary Encoding

A lookup table stores unique values once. Each data page references values by index. Perfect for low-cardinality columns like query_type, database_name, or status_code — replaces long strings with tiny integers.

SELECT → 0
INSERT → 1
UPDATE → 2
DELETE → 3

[0,0,1,0,3,2,0] → 7 bytes
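A minimal sketch of the idea, assuming the dictionary is built in order of first appearance (so the input below is arranged to produce the same SELECT, INSERT, UPDATE, DELETE numbering as the lookup table above). Real Parquet writers also bit-pack the resulting indices; this sketch stops at the index list.

```python
def dictionary_encode(values):
    """Replace each value with its index in a dictionary of unique values."""
    dictionary = []  # unique values, in order of first appearance
    index_of = {}    # value -> position in the dictionary
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the index list."""
    return [dictionary[i] for i in indices]


column = ["SELECT", "SELECT", "INSERT", "SELECT", "UPDATE", "DELETE", "SELECT"]
dictionary, indices = dictionary_encode(column)
print(dictionary)  # ['SELECT', 'INSERT', 'UPDATE', 'DELETE']
print(indices)     # [0, 0, 1, 0, 2, 3, 0]
```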

Delta Encoding

Instead of storing absolute values, only differences (deltas) between consecutive values are stored. Ideal for monotonically increasing data like timestamps and auto-increment IDs — common in audit logs.

Timestamps:
1700000000
1700000003 → +3
1700000009 → +6
1700000011 → +2

Stores: first value 1700000000, then deltas [3, 6, 2]
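The timestamp example can be reproduced with a few lines of Python. This is a simplified sketch: it keeps the first value plus the list of differences, whereas Parquet's actual delta encoding additionally groups deltas into blocks and bit-packs them.

```python
def delta_encode(values):
    """Return the first value and the differences between consecutive values."""
    if not values:
        raise ValueError("need at least one value")
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first, deltas):
    """Rebuild the original sequence by accumulating the deltas."""
    values = [first]
    for delta in deltas:
        values.append(values[-1] + delta)
    return values


timestamps = [1700000000, 1700000003, 1700000009, 1700000011]
first, deltas = delta_encode(timestamps)
print(first, deltas)  # 1700000000 [3, 6, 2]
```

The small deltas need only a few bits each, while every absolute timestamp would need a full 32 or 64 bits.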

Run-Length Encoding (RLE)

Consecutive repeated values are stored as a count + value pair rather than writing the value repeatedly. Works extremely well for boolean flags, null bitmaps, and status columns in audit data.

success × 128
failure × 2
success × 47

Stores:
[128:true, 2:false, 47:true]
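The same run structure in Python, as a minimal sketch using the standard library's groupby. Parquet itself uses a hybrid of run-length encoding and bit packing; this shows only the run-length half.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (count, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (count, value) pairs back into the full sequence."""
    return [value for count, value in runs for _ in range(count)]


# True = success, False = failure, matching the example above.
status = [True] * 128 + [False] * 2 + [True] * 47
runs = rle_encode(status)
print(runs)  # [(128, True), (2, False), (47, True)]
```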

Bit Packing

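Small integer values use only as many bits as needed. Dictionary indices for low-cardinality compliance columns rarely need more than 8 bits, so eight of them fit in a single 64-bit word instead of eight separate 64-bit integers, shrinking that data by up to 8×. Columns with 16 or fewer distinct values need only 4 bits per index.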

Index values 0–15
require only 4 bits each

Standard: 8 × 8 bits = 64 bits
Bit-packed: 8 × 4 bits = 32 bits
Saving: 50%
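The arithmetic above can be checked with a small packing routine. This is a simplified sketch that packs values starting from the least significant bit of one large integer; it illustrates the size saving and is not a byte-exact copy of Parquet's on-disk layout.

```python
def bit_pack(values, bit_width):
    """Pack small non-negative integers into bytes, bit_width bits each."""
    acc = 0
    for position, value in enumerate(values):
        if value >> bit_width:
            raise ValueError(f"{value} does not fit in {bit_width} bits")
        acc |= value << (position * bit_width)
    total_bits = len(values) * bit_width
    return acc.to_bytes((total_bits + 7) // 8, "little")


def bit_unpack(packed, bit_width, count):
    """Recover count integers of bit_width bits each from packed bytes."""
    acc = int.from_bytes(packed, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]


indices = [0, 15, 3, 7, 1, 2, 9, 4]  # eight values in the range 0 to 15
packed = bit_pack(indices, bit_width=4)
print(len(packed) * 8)  # 32 bits, versus 64 bits at one byte per value
```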

Query Performance

Predicate Pushdown & Column Pruning

The most powerful feature of Parquet for compliance queries is predicate pushdown. Because each column chunk stores min/max statistics in the footer, a query engine can decide — without reading a single data byte — which row groups cannot possibly contain matching rows.

For example: "Find all DDL events in the last 30 days." The query engine reads the footer, checks the event_time statistics for each row group, and skips every row group whose max timestamp is older than 30 days. In a 10-year archive written in time order, 30 days is under 1% of the history, so more than 99% of the data is never read from disk.

Row group skipping based on min/max statistics
Bloom filters for high-cardinality point lookups (user IDs, IP addresses)
Column projection — read only queried columns
Page-level statistics for finer-grained skipping within row groups
Compatible with DuckDB, Spark, Trino, Athena pushdown engines
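Row group skipping, the first item in the list, can be modelled in plain Python. The sketch below assumes each row group carries min/max statistics for event_time (the RowGroup class and its field names are hypothetical): a group whose maximum is below the cutoff is skipped without looking at its rows.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_time: int  # smallest event_time in the group (footer statistic)
    max_time: int  # largest event_time in the group (footer statistic)
    events: list   # (event_time, event_type) pairs, standing in for data pages


def scan_since(row_groups, cutoff, event_type):
    """Return matching events and the number of row groups skipped via statistics."""
    matches, skipped = [], 0
    for group in row_groups:
        if group.max_time < cutoff:
            skipped += 1  # nothing in this group can match; its rows are never read
            continue
        matches.extend(
            e for e in group.events if e[0] >= cutoff and e[1] == event_type
        )
    return matches, skipped


groups = [
    RowGroup(100, 199, [(100, "DDL"), (199, "SELECT")]),
    RowGroup(200, 299, [(250, "DDL"), (299, "DDL")]),
    RowGroup(300, 399, [(310, "SELECT"), (390, "DDL")]),
]
matches, skipped = scan_since(groups, cutoff=260, event_type="DDL")
print(matches)  # [(299, 'DDL'), (390, 'DDL')]
print(skipped)  # 1
```

Skipping is only as effective as the statistics are selective, which is why archives written in time order benefit most from filters on event_time.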

Predicate pushdown skips entire row groups — a 10-year archive scanned in milliseconds

No Lock-In

Query Your Archive with Any Tool

Because Scalefield Secure writes standard Apache Parquet files, your compliance archive is readable by the entire modern data ecosystem — today, tomorrow, and in 10 years when you switch vendors. No migration, no export, no proprietary reader.

Want to run an ad-hoc forensic investigation? Open DuckDB and query directly. Need to feed your SIEM? Query the S3 bucket from Athena. Building a compliance dashboard? Point Superset at Trino. The data is always yours.

DuckDB — query petabytes from a laptop, zero infra
Apache Spark — distributed processing for bulk forensics
Trino / Presto — federated SQL across data lakes
AWS Athena / Azure Synapse — serverless cloud query
Polars: Rust-native DataFrame library for scripts
Pandas / PyArrow — Python data science workflows

-- Query 10 years of audit history directly from S3
-- using DuckDB: no server, no import, no waiting
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'eu-central-1';

SELECT
    user_name,
    COUNT(*) AS ddl_count,
    MIN(event_time) AS first_seen,
    MAX(event_time) AS last_seen
FROM read_parquet(
    's3://my-compliance-archive/audit/year=*/month=*/*.parquet'
)
WHERE event_type IN ('DDL', 'DROP', 'TRUNCATE')
  AND event_time >= '2020-01-01'
GROUP BY user_name
ORDER BY ddl_count DESC;

-- Result: 847,293 rows scanned in 0.4 seconds
-- 99.6 GB of unread row groups skipped via
-- predicate pushdown on event_time statistics
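The year=*/month=* layout in the path above also allows whole directories to be ruled out before any file is opened. A minimal sketch of that pruning step, assuming the same Hive-style directory naming; the file names and the helper function are hypothetical.

```python
import re

# Matches the Hive-style partition directories used in the path above.
PARTITION = re.compile(r"year=(\d{4})/month=(\d{1,2})/")


def prune_partitions(paths, since_year, since_month=1):
    """Keep only files whose partition is on or after the given year and month."""
    kept = []
    for path in paths:
        match = PARTITION.search(path)
        if match is None:
            continue  # not a partitioned audit file
        year, month = int(match.group(1)), int(match.group(2))
        if (year, month) >= (since_year, since_month):
            kept.append(path)
    return kept


paths = [
    "audit/year=2018/month=11/part-0.parquet",
    "audit/year=2019/month=12/part-0.parquet",
    "audit/year=2020/month=01/part-0.parquet",
    "audit/year=2023/month=06/part-0.parquet",
]
print(prune_partitions(paths, since_year=2020))
```

Engines that understand Hive partitioning can apply this kind of pruning themselves when the query filters on the partition columns.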

Open Ecosystem

Supported by the
Entire Data Industry

Apache Parquet is backed by the Apache Software Foundation and supported natively by every major cloud, analytics, and data engineering platform. It is the de-facto standard for analytical data lakes — your compliance archive will never become stranded data.

DuckDB

In-process analytics engine — query Parquet from any environment in milliseconds

Apache Spark

Distributed processing framework — scales to petabyte compliance forensics

Trino / Presto

Federated SQL query engine — join compliance data with any other source

AWS Athena

Serverless S3 query — pay only for bytes scanned, zero management

Azure Synapse

Native Parquet support — ADLS Gen2 backed compliance archive on Azure

Google BigQuery

External Parquet tables — query GCS-hosted compliance data at petabyte scale

Polars

Rust-native DataFrame library — blazing-fast Parquet processing in Python scripts

PyArrow / Pandas

Python data science — read compliance archives directly into DataFrames

Apache Iceberg

Table format that stores its data as Parquet files, adding time-travel queries, ACID transactions, and schema evolution

Our Philosophy

Why We Built Scalefield Secure
on Open Formats

Compliance data is not just operational — it is a legal and forensic asset that must remain accessible for years, through technology cycles, through vendor changes, and through audits you cannot predict today. We made a deliberate decision: Scalefield Secure will never lock your compliance archive in a proprietary format.

Your Data Belongs to You

A compliance archive is a legal record. Locking it inside a vendor's proprietary binary format creates a dependency that is fundamentally incompatible with the independence compliance requires. Every Parquet file Scalefield Secure writes is readable by hundreds of tools without any license, any runtime, or any call to our servers — today, in 10 years, even if CYBERTEC ceased to exist.

Archives Outlive Software

SOX requires 7 years of audit history. That is longer than most enterprise software contracts, and longer than many products exist. We chose Apache Parquet specifically because it is an openly specified format with independent implementations in most major programming languages. An archive you write today will be trivially readable by tools that do not exist yet.

Auditability Requires Transparency

When a regulator or external auditor arrives, they will not install your compliance vendor's proprietary reader. They will use standard tools. Parquet means your evidence is immediately verifiable by any independent party — no black box, no trust-me-it's-correct, no export-to-CSV step that could theoretically alter data. The archive is the evidence.

No Egress Tax on Your Own History

Many proprietary compliance platforms charge per query, per export, or per GB to retrieve your own data. Because Scalefield Secure stores plain Parquet on your own S3 bucket or on-prem storage, there is no vendor middleman between you and your archive. Run 10,000 forensic queries or export your entire 10-year history with zero licensing cost per access, ever.

"We believe long-term compliance archives must be stored in open, self-describing, widely-supported formats. Proprietary formats create dependency, reduce transparency, and ultimately undermine the independence that compliance is supposed to guarantee. Apache Parquet is not a compromise — it is the right choice."

— CYBERTEC Scalefield Secure Engineering Team

Production ML Capabilities

What Scalefield Secure
Data Scientists Actually Do

Scalefield Secure ships with a built-in data science pipeline. The following are real, production-ready capabilities — not a roadmap, not theoretical possibilities. Every one of them is made practical at scale specifically because the underlying archive is Apache Parquet.

Built In

Zero-Copy ML Feature Engineering

Scalefield Secure's ML pipeline reads Parquet column chunks directly into Apache Arrow memory buffers, the in-memory format that Polars uses natively and that Pandas and NumPy can consume, often without copying. There is no CSV conversion and no intermediate export file. Loading five years of audit features for model training takes a single API call. The columnar layout means that only the feature columns the model needs are read from disk; irrelevant columns (raw query text, comments, metadata) are never touched.

# Load audit features for model training: only the columns
# the model actually needs. The path below covers the 2020
# partition; widen it to span more years.
import pyarrow.parquet as pq

dataset = pq.ParquetDataset(
    's3://audit-archive/year=2020/',
    filters=[('event_type', 'in', ['DDL', 'DROP'])]
)
df = dataset.read(
    columns=['user_name', 'event_time', 'query_type', 'source_ip']
).to_pandas()  # Arrow to Pandas; some column types are converted with a copy

Built In

Isolation Forest Anomaly Detection

Scalefield Secure runs Isolation Forest directly on the Parquet archive — partitioned by year/month/day — to identify anomalous access patterns without any labelled training data. Time-aligned partitions mean the model trains on temporally coherent batches, model drift is measurable across periods, and new anomalies can be scored against historical baselines stored in the same archive.

Built In

DBSCAN Behavioural Clustering

Scalefield Secure uses DBSCAN to cluster users into behavioural peer groups across the full compliance archive. Because Parquet lets the engine load only the feature vector columns (query frequency, table access patterns, time-of-day distributions) and skip everything else, full-archive clustering across years of history runs on standard hardware without distributed infrastructure.

Built In

Reproducible Forensic Notebooks

Every investigation Scalefield Secure data scientists run is a Jupyter notebook that reads directly from immutable Parquet files. Re-running the exact same notebook three years later against the same files, with the same library versions, produces byte-for-byte identical results. There is no mutable database state, no log rotation, no export step. A notebook shared with a regulator is the auditable forensic record.
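One way to show that a notebook is reading the same files years later is to record a checksum for each archive file and compare it before re-running. The sketch below is an illustration under that assumption, not a description of how Scalefield Secure does it; the file names and byte contents are placeholders.

```python
import hashlib


def build_manifest(files):
    """Map each file name to the SHA-256 hex digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def changed_files(files, manifest):
    """Names whose current content no longer matches the recorded digest."""
    current = build_manifest(files)
    return sorted(name for name in manifest if current.get(name) != manifest[name])


# Placeholder bytes standing in for Parquet files read from the archive.
archive = {
    "year=2021/month=03/part-0.parquet": b"PAR1...original bytes...PAR1",
    "year=2021/month=04/part-0.parquet": b"PAR1...more bytes...PAR1",
}
manifest = build_manifest(archive)
print(changed_files(archive, manifest))  # []
```

Storing the manifest alongside the notebook gives a reviewer a quick check that the inputs are unchanged before comparing results.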

Scalefield Secure Production Data Science Pipeline

01 Parquet Archive: immutable, partitioned by date on S3/Ceph
02 Column Pruning: read only the feature columns needed
03 Arrow Memory: zero-copy into Pandas / NumPy
04 ML Models: Isolation Forest, DBSCAN, LLM
05 Dashboard: alerts, reports and evidence packs
