Visualize parquet files with Apache Superset using Trino or PrestoSQL
I often want to visualize the contents of files in formats like parquet, csv, json, etc. in Apache Superset. This article provides a demo setup to achieve this.
Tech stack
- Trino (formerly PrestoSQL)
- Minio — for hosting the file. AWS S3 compatible.
- Hive Metastore — for accessing files from Trino using Hive connector
- Apache superset — for visualizing
This whole application is runnable on a local machine using a docker based flow. No external dependencies are involved. Once set up, I was able to add different data that I had, and it quickly became a productive environment for visualizing different kinds of data locally.
We will be using the `Hive metastore`. Though the title mentions parquet, we can visualize the following file formats:
- ORC
- Parquet
- Avro
- RCText (RCFile using ColumnarSerDe)
- RCBinary (RCFile using LazyBinaryColumnarSerDe)
- SequenceFile
- JSON (using org.apache.hive.hcatalog.data.JsonSerDe)
- CSV (using org.apache.hadoop.hive.serde2.OpenCSVSerde)
- TextFile
Source code
- The complete source for this demo is available here.
- It’s a docker based solution.
Architecture details
We would like to visualize rich data file formats like parquet in Apache Superset, but there is no direct mechanism to integrate them. On the other hand, Trino (formerly `PrestoSQL`) can connect to many different data sources, including parquet, csv, json, etc. However, Trino needs the Hive connector to access files; the connector is wired up through a catalog file, as sketched below.
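A minimal sketch of such a catalog file, assuming the service names and demo Minio credentials used here (the actual file in the repo is authoritative):

# etc/catalog/hive.properties: registers the "hive" catalog in Trino
connector.name=hive
# Thrift endpoint of the standalone Hive metastore service
hive.metastore.uri=thrift://hive-metastore:9083
# Point S3 access at the local Minio service; credentials are demo assumptions
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
# Minio needs path-style access rather than virtual-hosted buckets
hive.s3.path-style-access=true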
Hive can access object storage solutions like `AWS S3`; for our demo, we use minio. Bringing this ecosystem together manually can be tricky and tiresome. Using a docker based ecosystem simplifies the whole process.
Please have a look at the docker-compose.yaml.
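For orientation, here is a minimal sketch of what such a compose file looks like. The image names, ports, and credentials are assumptions for illustration (in particular, the metastore image name is hypothetical); the repo's docker-compose.yaml is the source of truth.

# Sketch of the ecosystem; not the repo's exact file.
version: "3.7"
services:
  trino-coordinator:
    image: trinodb/trino          # Trino, with the hive catalog file mounted under /etc/trino/catalog
    ports:
      - "8080:8080"
  minio:
    image: minio/minio            # S3-compatible object store holding the parquet files
    command: server /data
    environment:
      MINIO_ROOT_USER: minio      # demo credentials, an assumption
      MINIO_ROOT_PASSWORD: minio123
    ports:
      - "9000:9000"
  hive-metastore:
    image: example/hive-metastore # hypothetical image name; a standalone metastore build
    ports:
      - "9083:9083"
  superset:
    image: apache/superset
    ports:
      - "8088:8088"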
Hive metastore
For our purpose, we don't need the complete Hive application. With just the hive metastore standalone, we will be able to connect files with Apache Superset. The Hive metastore configuration can be found in the source repository.
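As a sketch of the relevant pieces, assuming the Minio endpoint and demo credentials above, the standalone metastore mainly needs a thrift listener plus S3A filesystem settings in its metastore-site.xml:

<!-- metastore-site.xml: sketch of the standalone metastore config -->
<configuration>
  <!-- Thrift endpoint that Trino's hive.metastore.uri points at -->
  <property>
    <name>metastore.thrift.uris</name>
    <value>thrift://0.0.0.0:9083</value>
  </property>
  <!-- S3A settings so the metastore can reach Minio (demo credentials) -->
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://minio:9000</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>minio</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>minio123</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>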
Getting started
docker-compose up
- Docker volumes are mounted locally. This should help in understanding the data of the different services.
Initialize Hive schema
Start the Trino CLI with `docker exec -it trino trino` and execute the following SQL commands.
-- Need to manually run each query for now.
CREATE SCHEMA IF NOT EXISTS hive.iris
WITH (location = 's3a://iris/');

-- Path s3a://iris/iris_data is the holding directory. We don't give the full file path, only the parent directory.
CREATE TABLE IF NOT EXISTS hive.iris.iris_data (
sepal_length DOUBLE,
sepal_width DOUBLE,
petal_length DOUBLE,
petal_width DOUBLE,
class VARCHAR
)
WITH (
external_location = 's3a://iris/iris_data',
format = 'PARQUET'
);
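Note that the table only points at a directory; the parquet file itself has to be uploaded to Minio first. A sketch using the MinIO client (`mc`), assuming the demo credentials above and a local iris.parquet file:

# Register the local Minio instance (credentials are demo assumptions)
mc alias set local http://localhost:9000 minio minio123
# Create the bucket and upload the parquet file into the table's parent directory
mc mb local/iris
mc cp iris.parquet local/iris/iris_data/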
Testing Trino with Minio
Let’s run a SQL query from Trino against a parquet file that is loaded into Minio.
trino> SELECT
-> sepal_length,
-> class
-> FROM hive.iris.iris_data
-> LIMIT 10;
->
sepal_length | class
--------------+-------------
5.1 | Iris-setosa
5.0 | Iris-setosa
4.4 | Iris-setosa
4.9 | Iris-setosa
4.6 | Iris-setosa
5.0 | Iris-setosa
5.4 | Iris-setosa
4.6 | Iris-setosa
4.9 | Iris-setosa
4.7 | Iris-setosa
(10 rows)

Query 20211229_201233_00002_wfmbr, FINISHED, 1 node
Splits: 10 total, 10 done (100.00%)
1.14 [15 rows, 8.08KB] [13 rows/s, 7.08KB/s]
Setup superset
- The first time, run `sh superset_init.sh`. This will set up Superset with a SQLite DB as the metadata store.
- In Superset, add a database > trino with the SQLAlchemy URI `trino://hive@trino-coordinator:8080/hive` (the URI can also be sanity-checked outside Superset, as sketched after this list).
- In Superset, add a dataset. Now we will be able to see our iris_data as a table.
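A small sketch of that sanity check, using the trino Python client's SQLAlchemy dialect (installable via pip install 'trino[sqlalchemy]'). Using localhost assumes the coordinator's port 8080 is published by docker-compose; inside the compose network, Superset uses the trino-coordinator hostname instead:

# Sanity-check the Trino connection that Superset will use.
from sqlalchemy import create_engine, text

# Same catalog as the Superset URI; reached via the published port on localhost.
engine = create_engine("trino://hive@localhost:8080/hive")

with engine.connect() as conn:
    rows = conn.execute(text("SELECT class, count(*) FROM iris.iris_data GROUP BY class"))
    for row in rows:
        print(row)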
Conclusion
With this full flow, we were able to view a parquet file as a table directly in Apache Superset. The same flow can be used to bring any rich data file format into Apache Superset. Once the data is in Superset, the sky's the limit for visualizing it however we desire.