Technology Deep Dive

Apache Parquet for Compliance Archiving

Compliance regulations demand years — sometimes a decade — of tamper-evident audit history. Scalefield Secure uses Apache Parquet as its native archive format because it is the most cost-efficient, analytically capable, and openly portable way to store compliance data at scale.

10 yr Retention support
12× Compression vs CSV
<1 s Query billions of rows
100% Open standard

Compliance Doesn't Expire.
Your Archive Shouldn't Either.

Every major compliance framework mandates that database audit logs are retained for years, not months. A legacy solution that stores data in bloated row-based formats — or proprietary binary files — becomes a financial and operational liability as your archive grows. Scalefield Secure solves this with Apache Parquet.

  • SOX (7 yrs): Sarbanes-Oxley requires financial system audit logs for a minimum of 7 years for public companies.
  • GDPR (6 yrs): The GDPR Article 5 accountability principle requires controllers to be able to demonstrate compliance, which in practice means keeping data access logs, here for up to 6 years. The regulation itself does not fix a retention period.
  • HIPAA (6 yrs): The US Health Insurance Portability and Accountability Act mandates that covered entities retain audit control documentation for 6 years.
  • PCI DSS (1 yr): The Payment Card Industry Data Security Standard requires 12 months of audit log history, with the most recent 3 months immediately accessible.
  • ISO 27001 (3 yrs): ISO 27001 Annex A.12.4 requires audit logs that support forensic investigations. The standard leaves the retention period to the organisation; 3 years is the minimum assumed here.
  • BSI C5 (10 yrs): The BSI Cloud Computing Compliance Controls Catalogue mandates up to 10-year log retention for German-regulated entities.

Why Columnar Storage Changes Everything

Traditional relational databases — and most legacy compliance log stores — use row-oriented storage. Every row is written together, which is efficient for transactional writes but disastrous for analytical queries over millions of rows.

Apache Parquet uses columnar storage: values for the same column are stored contiguously on disk. When you want to ask "how many SELECT queries did user X run last quarter?", a query engine reading Parquet touches only the query_type, user, and event_time columns and skips everything else entirely.

  • Read only the columns your query touches — skip irrelevant data entirely
  • Identical data types in each column compress far more efficiently
  • Column statistics enable predicate pushdown — skip entire row groups before reading
  • Vectorised CPU instructions process entire column batches at native speed
[Figure: row storage vs. Parquet columnar layout for four audit events (columns user, action, db, ts). Row storage reads all columns for every row; the columnar layout reads only the needed columns and skips the rest.]
Columnar layout reads only relevant columns — skipping irrelevant data at the hardware level
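
As a rough illustration of the figure, the sketch below stores the same four audit events both ways and counts how many values each layout has to touch to answer "how many SELECT statements did alice run?". The in-memory lists stand in for on-disk pages and the function names are hypothetical; real engines work on encoded pages, not Python objects.

```python
# Four audit events from the figure, stored two ways.
ROWS = [
    ("alice", "SELECT", "prod", "09:01"),
    ("bob", "INSERT", "dev", "09:03"),
    ("alice", "DELETE", "prod", "09:07"),
    ("carol", "SELECT", "staging", "09:12"),
]
COLUMNS = {
    "user": [r[0] for r in ROWS],
    "action": [r[1] for r in ROWS],
    "db": [r[2] for r in ROWS],
    "ts": [r[3] for r in ROWS],
}


def count_row_layout(rows, user, action):
    """Row layout: every field of every row is read to test two of them."""
    values_read = 0
    matches = 0
    for row in rows:
        values_read += len(row)
        if row[0] == user and row[1] == action:
            matches += 1
    return matches, values_read


def count_columnar_layout(columns, user, action):
    """Columnar layout: only the two columns named in the query are read."""
    users, actions = columns["user"], columns["action"]
    values_read = len(users) + len(actions)
    matches = sum(1 for u, a in zip(users, actions) if u == user and a == action)
    return matches, values_read


row_result = count_row_layout(ROWS, "alice", "SELECT")
col_result = count_columnar_layout(COLUMNS, "alice", "SELECT")
print(row_result)  # (1, 16)
print(col_result)  # (1, 8)
```

Both layouts return the same answer; the columnar one simply touches half the values in this toy case, and the gap widens with every extra column a real audit table carries.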

Inside a Parquet File

A Parquet file is a self-describing, hierarchically structured binary format. Understanding its anatomy explains why it is simultaneously compact, fast to query, and resilient to corruption — properties that matter deeply in compliance contexts.

[Figure: Parquet file layout. PAR1 magic bytes (4 bytes); row groups (e.g. 1M rows each) made of column chunks such as user_name and query_type, each with a dictionary page, encoded data pages, and column statistics (min / max / null_count); row groups can be processed in parallel; Thrift-encoded file footer (schema, row group offsets, column stats, key-value metadata); footer length (4 bytes); closing PAR1.]
Parquet file structure — self-describing with schema and statistics embedded in the footer; row groups enable parallel I/O

Row Groups

A Parquet file is divided into horizontal partitions called Row Groups — typically 128 MB to 1 GB each. Each row group is fully independent, enabling parallel reads across CPU cores or distributed workers. For a 10-year compliance archive, this means analytical queries can be serviced simultaneously across hundreds of row groups.

  • Default size: 128 MB — tunable for your hardware profile
  • Independent row groups enable full parallel I/O
  • Each row group stores its own column statistics for pruning
  • Damage confined to one row group's data leaves the other row groups readable
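
Because row groups are independent, a scan can be split into one task per row group and the partial results combined afterwards. The following is a minimal sketch of that idea using Python's standard thread pool; the nested lists are a hypothetical stand-in for the query_type column chunk of each row group, not a real Parquet reader.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in: each inner list is the query_type column chunk
# of one row group.
ROW_GROUPS = [
    ["SELECT", "INSERT", "SELECT"],
    ["DELETE", "SELECT"],
    ["UPDATE", "UPDATE", "INSERT"],
]


def count_selects(column_chunk):
    """Scan one row group without looking at any other."""
    return sum(1 for value in column_chunk if value == "SELECT")


def parallel_select_count(row_groups, workers=4):
    """Fan row groups out to worker threads and add up the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_selects, row_groups))


total = parallel_select_count(ROW_GROUPS)
print(total)  # 3
```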

The Self-Describing Footer

Parquet stores all schema and statistical metadata in a Thrift-encoded footer at the end of the file. Unlike CSV or plain logs, a Parquet file carries its own data dictionary: column names, data types, row counts, minimum and maximum values per column chunk. This metadata is read first — allowing query engines to skip row groups entirely without touching data pages.

  • Schema is embedded — no external catalog required
  • Min/max/null statistics drive predicate pushdown
  • Column offsets allow seek-based random access
  • Magic bytes PAR1 at both ends of the file identify the format and expose truncated writes
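
The framing described above can be checked with a few lines of standard-library Python. This sketch relies on the Parquet layout in which the file ends with a 4-byte little-endian footer length followed by the PAR1 magic; the function name is hypothetical, and the sample bytes are synthetic framing, not a real Thrift footer.

```python
import struct

MAGIC = b"PAR1"


def footer_length(file_bytes):
    """Return the footer length recorded in the last 8 bytes of a Parquet file.

    Raises ValueError if the leading or trailing magic bytes are missing,
    which is what a truncated or non-Parquet file looks like.
    """
    if len(file_bytes) < 12 or file_bytes[:4] != MAGIC or file_bytes[-4:] != MAGIC:
        raise ValueError("not a complete Parquet file")
    # The 4-byte little-endian footer length sits just before the closing magic.
    (length,) = struct.unpack("<I", file_bytes[-8:-4])
    if length > len(file_bytes) - 12:
        raise ValueError("footer length exceeds file size")
    return length


# Synthetic bytes with the same framing (not a real Parquet footer).
fake_footer = b"\x00" * 10
fake_file = (
    MAGIC + b"row-group-bytes" + fake_footer
    + struct.pack("<I", len(fake_footer)) + MAGIC
)
print(footer_length(fake_file))  # 10
```

A reader that knows the footer length can seek straight to the metadata, which is why engines can plan a query before touching any data page.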

Why Columnar Data Compresses So Well

Because Parquet stores identical data types contiguously — all user names together, all timestamps together, all query types together — standard compression algorithms achieve dramatically higher ratios than row-based formats where varied data types are interleaved.

Format & Codec               Description                                Size (1 TB raw)   Saving
Raw CSV / Plain text         No compression, row-based                  1,000 GB          baseline
Row-based DB export (gzip)   Compressed rows, limited gains             ~600 GB           ~40% saved
Parquet + Snappy             Fast columnar compression                  ~220 GB           ~78% saved
Parquet + Zstd (level 3)     Balanced compression, Scalefield default   ~130 GB           ~87% saved
Parquet + Zstd (level 19)    Maximum compression for deep archive       ~80 GB            ~92% saved

* Ratios are representative for typical database audit log workloads with high-cardinality usernames and low-cardinality action types. Actual ratios depend on data entropy.
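
For readers who prefer ratios to percentages, the arithmetic behind the table is straightforward. The short sketch below recomputes the saving and the compression ratio from the representative sizes quoted above; the numbers are taken from the table, not measured here.

```python
RAW_GB = 1000  # 1 TB of raw CSV, the baseline row of the table

SIZES_GB = {
    "Row-based DB export (gzip)": 600,
    "Parquet + Snappy": 220,
    "Parquet + Zstd (level 3)": 130,
    "Parquet + Zstd (level 19)": 80,
}


def saving_and_ratio(raw_gb, stored_gb):
    """Return (percent saved, compression ratio) for one table row."""
    saved_pct = round(100 * (raw_gb - stored_gb) / raw_gb)
    ratio = round(raw_gb / stored_gb, 1)
    return saved_pct, ratio


summary = {name: saving_and_ratio(RAW_GB, size) for name, size in SIZES_GB.items()}
for name, (saved_pct, ratio) in summary.items():
    print(f"{name}: {saved_pct}% saved, {ratio}x smaller")
```

The last row works out to 12.5x, which is where the headline "12× compression vs CSV" figure comes from.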

Smart Encoding Before Compression

Parquet applies column-level encoding before any compression codec. This transforms data into a form that compresses far more efficiently — and often speeds up queries independently. Different columns use different encodings automatically based on data characteristics.

Dictionary Encoding

A lookup table stores unique values once. Each data page references values by index. Perfect for low-cardinality columns like query_type, database_name, or status_code — replaces long strings with tiny integers.

SELECT → 0
INSERT → 1
UPDATE → 2
DELETE → 3

[0,0,1,0,3,2,0] → 7 bytes
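
A minimal sketch of the idea, assuming a dictionary built in first-seen order (the function names are hypothetical, and Parquet's real dictionary pages are binary structures, not Python lists):

```python
def dictionary_encode(values):
    """Return (dictionary, indices); dictionary entries are in first-seen order."""
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(positions)
        indices.append(positions[value])
    return list(positions), indices


def dictionary_decode(dictionary, indices):
    """Look each index up in the dictionary to restore the original values."""
    return [dictionary[i] for i in indices]


events = ["SELECT", "SELECT", "INSERT", "SELECT", "UPDATE", "DELETE", "SELECT"]
dictionary, indices = dictionary_encode(events)
print(dictionary)  # ['SELECT', 'INSERT', 'UPDATE', 'DELETE']
print(indices)     # [0, 0, 1, 0, 2, 3, 0]

# Decoding the index list shown above with the same dictionary:
print(dictionary_decode(dictionary, [0, 0, 1, 0, 3, 2, 0]))
```

Seven strings of six characters each shrink to seven small integers plus a four-entry lookup table, and the integers can be shrunk further by the encodings that follow.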

Delta Encoding

Instead of storing absolute values, only differences (deltas) between consecutive values are stored. Ideal for monotonically increasing data like timestamps and auto-increment IDs — common in audit logs.

Timestamps:
1700000000
1700000003 → +3
1700000009 → +6
1700000011 → +2

Stores: first value 1700000000, then deltas [3, 6, 2]
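
The timestamp example can be reproduced with a simplified sketch. Parquet's actual delta encoding additionally groups deltas into blocks and bit-packs them; the version below shows only the core idea of keeping the first value plus the differences, and its function names are hypothetical.

```python
from itertools import accumulate


def delta_encode(values):
    """Return (first value, list of differences between neighbours)."""
    if not values:
        raise ValueError("need at least one value")
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first, deltas):
    """Rebuild the original values with a running sum."""
    return list(accumulate(deltas, initial=first))


timestamps = [1700000000, 1700000003, 1700000009, 1700000011]
first, deltas = delta_encode(timestamps)
print(first, deltas)  # 1700000000 [3, 6, 2]
print(delta_decode(first, deltas) == timestamps)  # True
```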

Run-Length Encoding (RLE)

Consecutive repeated values are stored as a count + value pair rather than writing the value repeatedly. Works extremely well for boolean flags, null bitmaps, and status columns in audit data.

success × 128
failure × 2
success × 47

Stores:
[128:true, 2:false, 47:true]
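
The same run structure can be sketched with Python's itertools.groupby. This is an illustration of plain run-length encoding only; Parquet combines it with bit packing in a hybrid scheme, and the function names here are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (count, value) pairs."""
    return [(sum(1 for _ in run), value) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (count, value) pairs back into the full sequence."""
    return [value for count, value in runs for _ in range(count)]


flags = [True] * 128 + [False] * 2 + [True] * 47
runs = rle_encode(flags)
print(runs)  # [(128, True), (2, False), (47, True)]
print(rle_decode(runs) == flags)  # True
```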

Bit Packing

Small integer values use only as many bits as needed. Dictionary indices rarely exceed 8 bits for compliance data, so Parquet packs eight of them into a single 64-bit word rather than spending a full 64-bit integer on each, an up to 8× reduction. Narrower indices shrink further still.

Index values 0–15
require only 4 bits each

Standard: 8 × 8 bits = 64 bits
Bit-packed: 8 × 4 bits = 32 bits
Saving: 50%
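
The 4-bit case can be sketched as follows. The helper names are hypothetical and the bit order (first value in the lowest bits) is a choice made for this illustration; it is a simplified stand-in for the packing Parquet performs inside its encoded pages.

```python
def bit_pack(values, bit_width):
    """Pack non-negative integers into bytes using bit_width bits per value."""
    packed = 0
    for position, value in enumerate(values):
        if value >> bit_width:
            raise ValueError(f"{value} does not fit in {bit_width} bits")
        packed |= value << (position * bit_width)
    n_bytes = (len(values) * bit_width + 7) // 8
    return packed.to_bytes(n_bytes, "little")


def bit_unpack(data, bit_width, count):
    """Recover count values of bit_width bits each from packed bytes."""
    packed = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(packed >> (position * bit_width)) & mask for position in range(count)]


indices = [3, 0, 15, 7, 1, 12, 9, 4]   # eight dictionary indices in the range 0-15
packed = bit_pack(indices, bit_width=4)
print(len(packed) * 8)                 # 32 bits instead of 64
print(bit_unpack(packed, 4, len(indices)) == indices)  # True
```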

Predicate Pushdown & Column Pruning

The most powerful feature of Parquet for compliance queries is predicate pushdown. Because each column chunk stores min/max statistics in the footer, a query engine can decide — without reading a single data byte — which row groups cannot possibly contain matching rows.

For example: "Find all DDL events in the last 30 days." The query engine reads the footer, checks the event_time statistics for each row group, and skips every row group whose max timestamp is older than 30 days. In a 10-year archive, this routinely means 99%+ of data is never read from disk.

  • Row group skipping based on min/max statistics
  • Bloom filters for high-cardinality point lookups (user IDs, IP addresses)
  • Column projection — read only queried columns
  • Page-level statistics for finer-grained skipping within row groups
  • Compatible with DuckDB, Spark, Trino, Athena pushdown engines
[Figure: predicate pushdown for the query WHERE event_time > '2025-12-01'. The engine reads the file footer to load the (tiny) row group statistics, then checks each row group's min/max event_time: row group 1 (2015–2016), row group 2 (2017–2022), and row group 3 (2023–2024) are skipped; only row group 4 (Dec 2025 – present) is read. Result in <1 second, 99% of the archive never touched.]
Predicate pushdown skips entire row groups — a 10-year archive scanned in milliseconds
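
The pruning decision itself is a simple comparison against the footer statistics. The sketch below mirrors the four row groups from the figure; the RowGroupStats class, the exact boundary dates, and the upper bound of the last row group are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical view of the footer statistics for one row group."""
    name: str
    min_event_time: date
    max_event_time: date


def groups_to_read(stats, cutoff):
    """Keep only row groups that may hold rows with event_time > cutoff."""
    return [s.name for s in stats if s.max_event_time > cutoff]


ARCHIVE = [
    RowGroupStats("row group 1", date(2015, 1, 1), date(2016, 12, 31)),
    RowGroupStats("row group 2", date(2017, 1, 1), date(2022, 12, 31)),
    RowGroupStats("row group 3", date(2023, 1, 1), date(2024, 12, 31)),
    # "Dec 2025 - present": the upper bound is an assumed placeholder date.
    RowGroupStats("row group 4", date(2025, 12, 1), date(2026, 1, 31)),
]

selected = groups_to_read(ARCHIVE, cutoff=date(2025, 12, 1))
print(selected)  # ['row group 4']
```

A row group whose maximum event_time is not later than the cutoff cannot contain a match, so it is dropped before any of its data pages are fetched.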

Query Your Archive with Any Tool

Because Scalefield Secure writes standard Apache Parquet files, your compliance archive is readable by the entire modern data ecosystem — today, tomorrow, and in 10 years when you switch vendors. No migration, no export, no proprietary reader.

Want to run an ad-hoc forensic investigation? Open DuckDB and query directly. Need to feed your SIEM? Mount the S3 bucket from Athena. Building a compliance dashboard? Point Superset at Trino. The data is always yours.

  • DuckDB — query petabytes from a laptop, zero infra
  • Apache Spark — distributed processing for bulk forensics
  • Trino / Presto — federated SQL across data lakes
  • AWS Athena / Azure Synapse — serverless cloud query
  • Polars — Rust-native DataFrame for scripts
  • Pandas / PyArrow — Python data science workflows
-- Query 10 years of audit history directly from S3
-- using DuckDB: no server, no import, no waiting
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'eu-central-1';

SELECT
    user_name,
    COUNT(*) AS ddl_count,
    MIN(event_time) AS first_seen,
    MAX(event_time) AS last_seen
FROM read_parquet(
    's3://my-compliance-archive/audit/year=*/month=*/*.parquet'
)
WHERE event_type IN ('DDL', 'DROP', 'TRUNCATE')
  AND event_time >= '2020-01-01'
GROUP BY user_name
ORDER BY ddl_count DESC;

-- Result: 847,293 rows scanned in 0.4 seconds
-- 99.6 GB of unread row groups skipped via
-- predicate pushdown on event_time statistics

Supported by the Entire Data Industry

Apache Parquet is backed by the Apache Software Foundation and supported natively by every major cloud, analytics, and data engineering platform. It is the de-facto standard for analytical data lakes — your compliance archive will never become stranded data.

DuckDB
In-process analytics engine — query Parquet from any environment in milliseconds
Apache Spark
Distributed processing framework — scales to petabyte compliance forensics
Trino / Presto
Federated SQL query engine — join compliance data with any other source
AWS Athena
Serverless S3 query — pay only for bytes scanned, zero management
Azure Synapse
Native Parquet support — ADLS Gen2 backed compliance archive on Azure
Google BigQuery
External Parquet tables — query GCS-hosted compliance data at petabyte scale
Polars
Rust-native DataFrame library — blazing-fast Parquet processing in Python scripts
PyArrow / Pandas
Python data science — read compliance archives directly into DataFrames
Apache Iceberg
Table format that stores its data in Parquet files — time-travel queries, ACID transactions, schema evolution

Why We Built Scalefield Secure on Open Formats

Compliance data is not just operational — it is a legal and forensic asset that must remain accessible for years, through technology cycles, through vendor changes, and through audits you cannot predict today. We made a deliberate decision: Scalefield Secure will never lock your compliance archive in a proprietary format.

Your Data Belongs to You

A compliance archive is a legal record. Locking it inside a vendor's proprietary binary format creates a dependency that is fundamentally incompatible with the independence compliance requires. Every Parquet file Scalefield Secure writes is readable by hundreds of tools without any license, any runtime, or any call to our servers — today, in 10 years, even if CYBERTEC ceased to exist.

Archives Outlive Software

SOX requires 7 years of audit history. That is longer than most enterprise software contracts, and longer than many products exist. We chose Apache Parquet specifically because it is an openly specified, Apache-governed format with independent implementations in every major programming language. An archive you write today will be trivially readable by tools that do not exist yet.

Auditability Requires Transparency

When a regulator or external auditor arrives, they will not install your compliance vendor's proprietary reader. They will use standard tools. Parquet means your evidence is immediately verifiable by any independent party — no black box, no trust-me-it's-correct, no export-to-CSV step that could theoretically alter data. The archive is the evidence.

No Egress Tax on Your Own History

Proprietary compliance platforms charge per query, per export, or per GB to retrieve your own data. Because Scalefield Secure stores plain Parquet on your own S3 bucket or on-prem storage, there is no vendor middleman between you and your archive. Run 10,000 forensic queries or export your entire 10-year history — zero licensing cost per access, ever.

"We believe long-term compliance archives must be stored in open, self-describing, widely-supported formats. Proprietary formats create dependency, reduce transparency, and ultimately undermine the independence that compliance is supposed to guarantee. Apache Parquet is not a compromise — it is the right choice."

— CYBERTEC Scalefield Secure Engineering Team

What Scalefield Secure Data Scientists Actually Do

Scalefield Secure ships with a built-in data science pipeline. The following are real, production-ready capabilities — not a roadmap, not theoretical possibilities. Every one of them is made practical at scale specifically because the underlying archive is Apache Parquet.

Built In

Zero-Copy ML Feature Engineering

Scalefield Secure's ML pipeline reads Parquet column chunks directly into Apache Arrow memory buffers, the columnar in-memory format that Polars uses natively and that Pandas and NumPy can consume, in many cases without copying. There is no CSV conversion and no intermediate export step. Loading five years of audit features for model training takes a single API call. With column projection, only the feature columns the model needs are read from disk; irrelevant columns (raw query text, comments, metadata) are never touched.

# Load audit features: only the columns the model actually needs
# are read from the archive
import pyarrow.parquet as pq

dataset = pq.ParquetDataset(
    's3://audit-archive/year=2020/',
    filters=[('event_type', 'in', ['DDL', 'DROP'])]
)
df = dataset.read(
    columns=['user_name', 'event_time', 'query_type', 'source_ip']
).to_pandas()  # Arrow → Pandas; zero-copy only for some column types

Isolation Forest Anomaly Detection

Scalefield Secure runs Isolation Forest directly on the Parquet archive — partitioned by year/month/day — to identify anomalous access patterns without any labelled training data. Time-aligned partitions mean the model trains on temporally coherent batches, model drift is measurable across periods, and new anomalies can be scored against historical baselines stored in the same archive.

DBSCAN Behavioural Clustering

Scalefield Secure uses DBSCAN to cluster users into behavioural peer groups across the full compliance archive. Because Parquet lets the engine load only the feature vector columns (query frequency, table access patterns, time-of-day distributions) and skip everything else, full-archive clustering across years of history runs on standard hardware without distributed infrastructure.

Reproducible Forensic Notebooks

Every investigation Scalefield Secure data scientists run is a Jupyter notebook that reads directly from immutable Parquet files. Re-running the exact same notebook three years later against the same files reads byte-for-byte identical input, so with pinned library versions and fixed random seeds it reproduces the same results. There is no mutable database state, no log rotation, no export step. A notebook shared with a regulator is the auditable forensic record.

Scalefield Secure Production Data Science Pipeline

  01  Parquet Archive: Immutable, partitioned by date on S3/Ceph
  02  Column Pruning: Read only feature columns needed
  03  Arrow Memory: Zero-copy into Pandas / NumPy
  04  ML Models: Isolation Forest · DBSCAN · LLM
  05  Dashboard: Alerts, reports and evidence packs

See Parquet-Backed Compliance in Action

Scalefield Secure gives you all of this out of the box — zero configuration required to start building a decade-proof compliance archive.