SCALE
Petabytes
FORMATS
Parquet / ORC / CSV
QUERY ENGINE
Spark / Presto
COMPRESSION
Snappy / Zstd
-- CAPABILITIES --------
SCHEMA-ON-READ
Ingest data in its native format with no upfront schema definition. Apply structure at query time, so schemas can change without re-ingesting data.
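As a rough illustration of schema-on-read, the sketch below stores raw CSV text untouched and casts values only when a query reads them. It uses the Python standard library alone; the field names, the `SCHEMA` mapping, and the `read_with_schema` helper are assumptions made for the example, not part of the product.

```python
import csv
import io

# Raw data lands in the lake as-is: no types and no validation at ingest.
RAW_CSV = """sensor_id,temperature,recorded_at
a1,21.5,2024-01-01T00:00:00
b2,not_available,2024-01-01T00:05:00
a1,23.0,2024-01-01T00:10:00
"""

# The schema belongs to the query, not to the stored data (hypothetical fields).
SCHEMA = {"sensor_id": str, "temperature": float}


def read_with_schema(raw_text, schema):
    """Yield rows cast to `schema`; a value that fails to cast becomes None."""
    for record in csv.DictReader(io.StringIO(raw_text)):
        row = {}
        for column, cast in schema.items():
            try:
                row[column] = cast(record[column])
            except (KeyError, TypeError, ValueError):
                # Missing column or unparseable value: keep the row, null the field.
                row[column] = None
        yield row


rows = list(read_with_schema(RAW_CSV, SCHEMA))
for row in rows:
    print(row)
```

A different query could read the same raw text with a different `SCHEMA`, for example one that also parses `recorded_at`, without touching what was ingested.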
SPARK INTEGRATION
Native Apache Spark connector for distributed processing. Run PySpark, Spark SQL, and Spark MLlib directly against lake data at scale.
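A PySpark job against lake data might look like the sketch below. It relies only on the standard PySpark DataFrame API (`spark.read.parquet`, `filter`, `groupBy`, `avg`), since the connector's own options are not described here. The path, the column names, and both helper functions are hypothetical.

```python
def build_session(app_name="lake-demo"):
    """Create a SparkSession (requires pyspark and a Spark runtime)."""
    # Imported lazily so the query helper below can be read without Spark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def hot_sensor_averages(spark, path, threshold):
    """Average temperature per sensor, over readings above `threshold`.

    `path` is assumed to point at Parquet files with `sensor_id` and
    `temperature` columns.
    """
    readings = spark.read.parquet(path)
    return (
        readings
        .filter(f"temperature > {float(threshold)}")  # SQL-style filter string
        .groupBy("sensor_id")
        .avg("temperature")
    )


# Example usage, assuming a hypothetical lake location:
#   spark = build_session()
#   hot_sensor_averages(spark, "s3a://example-lake/iot/readings", 30.0).show()
```

The helper takes the session as an argument, so the same query can run from a notebook, a scheduled job, or a test harness.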
COLUMNAR FORMATS
First-class support for the Apache Parquet and Apache ORC columnar formats. Predicate pushdown and column pruning cut the data scanned by analytical queries.
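To show why these two optimisations help, here is a toy model of a columnar file in plain Python. It is not how Parquet or ORC readers are implemented; `RowGroup` and `scan` are invented for the illustration, and real formats keep min/max statistics in file metadata at write time rather than computing them on demand.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """One chunk of a columnar file, with values stored per column."""
    columns: dict  # column name -> list of values

    def stats(self, name):
        # Computed on demand for brevity; real files store this at write time.
        values = self.columns[name]
        return min(values), max(values)


def scan(row_groups, select, filter_column, min_value):
    """Return the selected columns for rows where filter_column >= min_value."""
    results = []
    groups_read = 0
    for group in row_groups:
        _, high = group.stats(filter_column)
        # Predicate pushdown: skip the whole group if its max rules out a match.
        if high < min_value:
            continue
        groups_read += 1
        # Column pruning: touch only the filter column and the selected columns.
        filter_values = group.columns[filter_column]
        picked = [group.columns[name] for name in select]
        for i, value in enumerate(filter_values):
            if value >= min_value:
                results.append(tuple(column[i] for column in picked))
    return results, groups_read


groups = [
    RowGroup({"sensor_id": ["a1", "b2"], "temperature": [18.0, 19.5], "site": ["x", "x"]}),
    RowGroup({"sensor_id": ["c3", "a1"], "temperature": [31.0, 22.0], "site": ["y", "x"]}),
]
rows, groups_read = scan(groups, select=["sensor_id"],
                         filter_column="temperature", min_value=30.0)
print(rows, groups_read)
```

In this run the first row group is skipped entirely because its maximum temperature is below the threshold, and the `site` column is never read.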
DATA CATALOGUING
Integrated metadata catalogue with automatic schema detection. Tag, search, and discover datasets across departments and projects.
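The idea of automatic schema detection plus tag-based discovery can be sketched in a few lines of standard-library Python. The `Catalogue` class, the `infer_type` helper, and the dataset names are all hypothetical; the product's actual catalogue API is not documented here.

```python
def infer_type(values):
    """Return the narrowest of 'int', 'float', 'string' that fits every value."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"


class Catalogue:
    """In-memory metadata store: one entry per dataset, with schema and tags."""

    def __init__(self):
        self.entries = {}

    def register(self, name, sample_rows, tags=()):
        # Detect a schema from sample rows (dicts of raw string values).
        columns = sample_rows[0].keys()
        schema = {c: infer_type([row[c] for row in sample_rows]) for c in columns}
        self.entries[name] = {"schema": schema, "tags": set(tags)}

    def search(self, tag):
        """Return the names of datasets carrying `tag`, sorted alphabetically."""
        return sorted(n for n, e in self.entries.items() if tag in e["tags"])


catalogue = Catalogue()
catalogue.register(
    "iot/readings",
    [{"sensor_id": "a1", "temperature": "21.5", "count": "3"}],
    tags=["iot", "timeseries"],
)
catalogue.register(
    "genomics/variants",
    [{"chrom": "chr1", "pos": "10177"}],
    tags=["genomics"],
)
print(catalogue.entries["iot/readings"]["schema"])
print(catalogue.search("iot"))
```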
-- USE CASES --------
▸ Genomics data lakes and variant analysis
▸ IoT sensor data aggregation and time-series analysis
▸ Research data warehousing across departments
▸ Log analytics and operational intelligence