Compliance regulations demand years — sometimes a decade — of tamper-evident audit history. Scalefield Secure uses Apache Parquet as its native archive format because it is the most cost-efficient, analytically capable, and openly portable way to store compliance data at scale.
Every major compliance framework mandates that database audit logs are retained for years, not months. A legacy solution that stores data in bloated row-based formats — or proprietary binary files — becomes a financial and operational liability as your archive grows. Scalefield Secure solves this with Apache Parquet.
Traditional relational databases — and most legacy compliance log stores — use row-oriented storage. Every row is written together, which is efficient for transactional writes but disastrous for analytical queries over millions of rows.
Apache Parquet uses columnar storage: values for the same column are stored contiguously on disk. When you want to ask "how many SELECT queries did user X run last quarter?", Parquet only reads the query_type and user columns — skipping everything else entirely.
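The difference between the two layouts can be sketched in a few lines of plain Python. This is a toy illustration, not how Parquet is implemented: the field names and sample events are assumptions made up for the example, and the point is only that the question above touches two columns and nothing else.

```python
# Toy audit events; field names and values are illustrative assumptions.
rows = [
    {"event_time": 1, "user": "alice", "query_type": "SELECT", "query_text": "SELECT * FROM t"},
    {"event_time": 2, "user": "bob", "query_type": "INSERT", "query_text": "INSERT INTO t VALUES (1)"},
    {"event_time": 3, "user": "alice", "query_type": "SELECT", "query_text": "SELECT id FROM t"},
]


def to_columns(records):
    """Pivot row-oriented records into one contiguous list per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def count_selects(columns, user):
    """Answer the question using only the two columns it needs."""
    pairs = zip(columns["user"], columns["query_type"])
    return sum(1 for u, q in pairs if u == user and q == "SELECT")


columns = to_columns(rows)
print(count_selects(columns, "alice"))  # 2
```

In the row layout every record, including the long query_text, has to be visited; in the column layout the query_text and event_time lists are never touched.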
A Parquet file is a self-describing, hierarchically structured binary format. Understanding its anatomy explains why it is simultaneously compact, fast to query, and resilient to corruption — properties that matter deeply in compliance contexts.
A Parquet file is divided into horizontal partitions called Row Groups — typically 128 MB to 1 GB each. Each row group is fully independent, enabling parallel reads across CPU cores or distributed workers. For a 10-year compliance archive, this means analytical queries can be serviced simultaneously across hundreds of row groups.
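As a rough sketch of why independent row groups parallelise well, the example below treats each inner list as one row group and counts matching events in a thread pool. The data and the helper names are hypothetical; a real engine would decode Parquet pages rather than Python lists.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for the query_type column of one row group (toy data).
row_groups = [
    ["SELECT", "INSERT", "SELECT"],
    ["DELETE", "SELECT"],
    ["SELECT", "SELECT", "UPDATE", "SELECT"],
]


def count_selects_in_group(row_group):
    """Work on one row group without needing any other row group."""
    return sum(1 for query_type in row_group if query_type == "SELECT")


def parallel_count(groups, workers=4):
    """Fan the row groups out to a pool and add up the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_selects_in_group, groups))


total = parallel_count(row_groups)
print(total)  # 6
```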
Parquet stores all schema and statistical metadata in a Thrift-encoded footer at the end of the file. Unlike CSV or plain logs, a Parquet file carries its own data dictionary: column names, data types, row counts, minimum and maximum values per column chunk. This metadata is read first — allowing query engines to skip row groups entirely without touching data pages.
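The footer can be located without scanning the file: a Parquet file starts and ends with the 4-byte magic string PAR1, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The sketch below shows that lookup on a synthetic byte string; it does not decode the Thrift metadata, and the placeholder payloads are assumptions for illustration only.

```python
import struct

MAGIC = b"PAR1"


def locate_footer(data: bytes):
    """Return (offset, length) of the footer metadata inside a Parquet byte string."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len


# Synthetic stand-in for a file: the payloads are placeholders, not real Parquet data.
body = b"column-chunk-bytes"
footer = b"thrift-metadata"
fake_file = MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC

start, length = locate_footer(fake_file)
print(fake_file[start:start + length])  # b'thrift-metadata'
```

A truncated or overwritten file fails the magic-byte check immediately, which is a cheap first integrity test before any metadata is parsed.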
Because Parquet stores identical data types contiguously (all user names together, all timestamps together, all query types together), standard compression algorithms achieve dramatically higher ratios than row-based formats where varied data types are interleaved.
| Format & Codec | Notes | Size (1 TB raw) | Saving |
|---|---|---|---|
| Raw CSV / Plain text | No compression, row-based | 1,000 GB | baseline |
| Row-based DB export (gzip) | Compressed rows, limited gains | ~600 GB | ~40% saved |
| Parquet + Snappy | Fast columnar compression | ~220 GB | ~78% saved |
| Parquet + Zstd (level 3) | Balanced compression, Scalefield default | ~130 GB | ~87% saved |
| Parquet + Zstd (level 19) | Maximum compression for deep archive | ~80 GB | ~92% saved |
* Ratios are representative for typical database audit log workloads with high-cardinality usernames and low-cardinality action types. Actual ratios depend on data entropy.
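The saving percentages follow directly from the sizes in the table. As a quick check, the snippet below recomputes them from the 1 TB baseline; the figures are the representative ones quoted above, not new measurements.

```python
RAW_GB = 1000  # the 1 TB raw baseline from the table

# Representative sizes quoted in the table above.
sizes_gb = {
    "Row-based DB export (gzip)": 600,
    "Parquet + Snappy": 220,
    "Parquet + Zstd (level 3)": 130,
    "Parquet + Zstd (level 19)": 80,
}


def saving_percent(size_gb, raw_gb=RAW_GB):
    """Share of the raw volume that no longer needs to be stored."""
    return round(100 * (raw_gb - size_gb) / raw_gb)


for name, size in sizes_gb.items():
    print(f"{name}: {saving_percent(size)}% saved, {RAW_GB / size:.1f}x smaller")
```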
Parquet applies column-level encoding before any compression codec. This transforms data into a form that compresses far more efficiently — and often speeds up queries independently. Different columns use different encodings automatically based on data characteristics.
Dictionary encoding: a lookup table stores each unique value once, and each data page references values by index. This suits low-cardinality columns such as query_type, database_name, or status_code, because long strings are replaced with small integers.
Delta encoding: instead of absolute values, only the differences (deltas) between consecutive values are stored. This is ideal for monotonically increasing data such as timestamps and auto-increment IDs, both common in audit logs.
Run-length encoding (RLE): consecutive repeated values are stored as a count + value pair rather than writing the value repeatedly. This works extremely well for boolean flags, null bitmaps, and status columns in audit data.
Bit packing: small integer values use only as many bits as needed. Dictionary indices rarely exceed 8 bits for compliance data, so eight of them fit into a single 64-bit word, up to 8× smaller than storing each index as a 64-bit integer.
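The four ideas above can be sketched in plain Python. These are simplified illustrations of the concepts only: the encodings defined in the Parquet specification are more elaborate (for example, they work in blocks and combine run-length encoding with bit packing), and all function names and sample values here are assumptions for the example.

```python
from itertools import groupby


def dictionary_encode(values):
    """Store each unique value once and refer to it by a small integer index."""
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def delta_encode(values):
    """Keep the first value, then only the difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def run_length_encode(values):
    """Collapse each run of repeated values into a (count, value) pair."""
    return [(len(list(run)), value) for value, run in groupby(values)]


def bit_pack(indices, width):
    """Pack non-negative integers into one int, using `width` bits each."""
    packed = 0
    for position, value in enumerate(indices):
        packed |= value << (position * width)
    return packed


def bit_unpack(packed, width, count):
    """Reverse bit_pack for `count` values of `width` bits."""
    mask = (1 << width) - 1
    return [(packed >> (position * width)) & mask for position in range(count)]


query_types = ["SELECT", "SELECT", "INSERT", "SELECT", "DDL", "SELECT"]
dictionary, indices = dictionary_encode(query_types)
print(dictionary, indices)  # ['SELECT', 'INSERT', 'DDL'] [0, 0, 1, 0, 2, 0]

print(delta_encode([1700000000, 1700000003, 1700000004, 1700000010]))  # [1700000000, 3, 1, 6]
print(run_length_encode([True, True, True, False, False]))  # [(3, True), (2, False)]

width = max(1, max(indices).bit_length())  # 2 bits are enough for indices 0..2
packed = bit_pack(indices, width)
print(width, bit_unpack(packed, width, len(indices)) == indices)  # 2 True
```

Six strings become a three-entry dictionary plus six 2-bit indices, which is why the compression codec that runs afterwards has so little left to do.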
The most powerful feature of Parquet for compliance queries is predicate pushdown. Because each column chunk stores min/max statistics in the footer, a query engine can decide — without reading a single data byte — which row groups cannot possibly contain matching rows.
For example: "Find all DDL events in the last 30 days." The query engine reads the footer, checks the event_time statistics for each row group, and skips every row group whose max timestamp is older than 30 days. In a 10-year archive written in time order, 30 days is less than 1% of the history, so this routinely means 99%+ of data is never read from disk.
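A minimal sketch of that skipping decision, assuming a toy archive of 120 monthly row groups with timestamps expressed in days. The `RowGroupStats` class and `groups_to_read` function are hypothetical names for illustration; real engines read the equivalent min/max values from the Parquet footer.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Footer statistics for the event_time column of one row group."""
    group_id: int
    min_time: int
    max_time: int


def groups_to_read(stats, cutoff):
    """Keep only row groups that may contain events at or after `cutoff`."""
    return [s.group_id for s in stats if s.max_time >= cutoff]


# Toy archive: 120 row groups of 30 days each, roughly 10 years of history.
stats = [RowGroupStats(i, i * 30, i * 30 + 29) for i in range(120)]

now = 120 * 30     # day 3600
cutoff = now - 30  # "the last 30 days" starts at day 3570

selected = groups_to_read(stats, cutoff)
skipped = 1 - len(selected) / len(stats)
print(selected, f"{skipped:.1%} of row groups skipped")  # [119] 99.2% of row groups skipped
```

The decision uses only three integers per row group, so it is made before any data page is fetched. The skip rate depends on events being written in roughly chronological order, which is the normal case for audit logs.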
Because Scalefield Secure writes standard Apache Parquet files, your compliance archive is readable by the entire modern data ecosystem — today, tomorrow, and in 10 years when you switch vendors. No migration, no export, no proprietary reader.
Want to run an ad-hoc forensic investigation? Open DuckDB and query directly. Need to feed your SIEM? Query the S3 bucket from Athena. Building a compliance dashboard? Point Superset at Trino. The data is always yours.
Apache Parquet is backed by the Apache Software Foundation and supported natively by every major cloud, analytics, and data engineering platform. It is the de-facto standard for analytical data lakes — your compliance archive will never become stranded data.
Compliance data is not just operational — it is a legal and forensic asset that must remain accessible for years, through technology cycles, through vendor changes, and through audits you cannot predict today. We made a deliberate decision: Scalefield Secure will never lock your compliance archive in a proprietary format.
A compliance archive is a legal record. Locking it inside a vendor's proprietary binary format creates a dependency that is fundamentally incompatible with the independence compliance requires. Every Parquet file Scalefield Secure writes is readable by hundreds of tools without any license, any runtime, or any call to our servers — today, in 10 years, even if CYBERTEC ceased to exist.
SOX requires 7 years of audit history. That is longer than most enterprise software contracts, and longer than many products exist. We chose Apache Parquet specifically because it is an openly specified, broadly adopted format with independent implementations in every major programming language. An archive you write today will be trivially readable by tools that do not exist yet.
When a regulator or external auditor arrives, they will not install your compliance vendor's proprietary reader. They will use standard tools. Parquet means your evidence is immediately verifiable by any independent party — no black box, no trust-me-it's-correct, no export-to-CSV step that could theoretically alter data. The archive is the evidence.
Proprietary compliance platforms charge per query, per export, or per GB to retrieve your own data. Because Scalefield Secure stores plain Parquet on your own S3 bucket or on-prem storage, there is no vendor middleman between you and your archive. Run 10,000 forensic queries or export your entire 10-year history — zero licensing cost per access, ever.
"We believe long-term compliance archives must be stored in open, self-describing, widely-supported formats. Proprietary formats create dependency, reduce transparency, and ultimately undermine the independence that compliance is supposed to guarantee. Apache Parquet is not a compromise — it is the right choice."
— CYBERTEC Scalefield Secure Engineering Team
Scalefield Secure ships with a built-in data science pipeline. The following are real, production-ready capabilities — not a roadmap, not theoretical possibilities. Every one of them is made practical at scale specifically because the underlying archive is Apache Parquet.
Scalefield Secure's ML pipeline reads Parquet column chunks directly into Apache Arrow memory buffers, the in-memory columnar format that Polars builds on and that pandas and NumPy can consume. There is no CSV conversion and no intermediate export step. Loading five years of audit features for model training takes a single API call. The columnar layout means that only the exact feature columns the model needs are read from disk; irrelevant columns (raw query text, comments, metadata) are never touched.
Scalefield Secure runs Isolation Forest directly on the Parquet archive — partitioned by year/month/day — to identify anomalous access patterns without any labelled training data. Time-aligned partitions mean the model trains on temporally coherent batches, model drift is measurable across periods, and new anomalies can be scored against historical baselines stored in the same archive.
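To illustrate how a year/month/day layout lets a job select a time-aligned batch, the sketch below lists the partition directories for a date range. The Hive-style `year=/month=/day=` naming and the bucket name are assumptions for the example; the directory scheme Scalefield Secure actually uses is not specified here.

```python
from datetime import date, timedelta


def partition_path(root, day):
    """Build a Hive-style partition directory for one day (assumed naming scheme)."""
    return f"{root}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"


def partitions_between(root, start, end):
    """List daily partition directories for the inclusive range [start, end]."""
    span = (end - start).days
    return [partition_path(root, start + timedelta(days=n)) for n in range(span + 1)]


# Hypothetical bucket name; 2024 is a leap year, so the range covers three days.
paths = partitions_between("s3://example-bucket/audit", date(2024, 2, 28), date(2024, 3, 1))
for path in paths:
    print(path)
```

A training job for one period only needs the directories returned for that period, and a later scoring run can request a different range from the same archive.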
Scalefield Secure uses DBSCAN to cluster users into behavioural peer groups across the full compliance archive. Because Parquet lets the engine load only the feature vector columns (query frequency, table access patterns, time-of-day distributions) and skip everything else, full-archive clustering across years of history runs on standard hardware without distributed infrastructure.
Every investigation Scalefield Secure data scientists run is a Jupyter notebook that reads directly from immutable Parquet files. Re-running the exact same notebook three years later against the same files produces byte-for-byte identical results. There is no mutable database state, no log rotation, no export step. A notebook shared with a regulator is the auditable forensic record.
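One way to show that a re-run uses the same input files is to compare SHA-256 digests recorded at investigation time. The sketch below is an illustration only, not a description of a Scalefield Secure feature: the manifest structure, function names, and the sample file are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_unchanged(manifest):
    """`manifest` maps file paths to digests recorded at investigation time."""
    return all(file_digest(path) == recorded for path, recorded in manifest.items())


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder bytes, not a real Parquet file.
    sample = Path(tmp) / "part-0000.parquet"
    sample.write_bytes(b"PAR1 toy bytes PAR1")
    manifest = {sample: file_digest(sample)}
    print(archive_unchanged(manifest))  # True

    sample.write_bytes(b"tampered")
    print(archive_unchanged(manifest))  # False
```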
Scalefield Secure gives you all of this out of the box — zero configuration required to start building a decade-proof compliance archive.