Visualize parquet files with Apache Superset using Trino or PrestoSQL

Sairam Krish
3 min read · Dec 30, 2021


I often want to visualize data stored in formats like Parquet, CSV, and JSON in Apache Superset. This article provides a demo setup to achieve this.

Tech stack

  • Trino (formerly PrestoSQL)
  • Minio — for hosting the files; AWS S3 compatible
  • Hive Metastore — for accessing files from Trino through the Hive connector
  • Apache Superset — for visualizing

The whole application runs on a local machine using a Docker-based flow; no external dependencies are involved. Once set up, I was able to add different data I had, and it quickly became a productive environment for visualizing different kinds of data locally.

We will be using the `Hive metastore`. Though the title mentions Parquet, we can visualize the following file formats:

  • ORC
  • Parquet
  • Avro
  • RCText (RCFile using ColumnarSerDe)
  • RCBinary (RCFile using LazyBinaryColumnarSerDe)
  • SequenceFile
  • JSON (using org.apache.hive.hcatalog.data.JsonSerDe)
  • CSV (using org.apache.hadoop.hive.serde2.OpenCSVSerde)
  • TextFile

Source code

Architecture details

We want to visualize rich data file formats like Parquet in Apache Superset, but there is no direct mechanism to integrate them.

On the other hand, Trino (formerly `PrestoSQL`) can connect to many different data sources, including files in Parquet, CSV, JSON, and other formats. However, Trino needs the Hive connector to access files.

The Hive connector reads from object storage solutions like `AWS S3`. For our demo, we use Minio, which is S3 compatible.
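On the Trino side, this wiring lives in a catalog file. As a rough sketch (the service hostnames and credentials below are assumptions for this demo, not canonical values), an `etc/catalog/hive.properties` pointing at the metastore and Minio could look like:

# Hive connector catalog — hypothetical values for this demo
connector.name=hive-hadoop2
# Newer Trino releases use connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
# Minio endpoint and credentials — assumed demo values
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
hive.s3.path-style-access=true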

Bringing this ecosystem together manually can be tricky and tiresome. A Docker-based setup simplifies the whole process.

Please have a look at the docker-compose.yaml in the source repository.
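As a rough sketch of what such a compose file wires together (images, service names, and credentials here are illustrative assumptions, not the exact contents of the repository's file):

version: "3.7"
services:
  trino-coordinator:
    image: trinodb/trino
    ports:
      - "8080:8080"
    volumes:
      - ./trino/catalog:/etc/trino/catalog   # hive.properties lives here

  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minio          # assumed demo credentials
      MINIO_ROOT_PASSWORD: minio123
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - ./minio/data:/data

  hive-metastore:
    image: apache/hive:3.1.3          # hypothetical image/tag, run in metastore-only mode
    environment:
      SERVICE_NAME: metastore
    ports:
      - "9083:9083"
    volumes:
      - ./hive/conf:/opt/hive/conf

  superset:
    image: apache/superset
    ports:
      - "8088:8088"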

Hive metastore

For our purpose, we don’t need the complete Hive application. The standalone Hive metastore alone is enough to connect the files with Apache Superset.

Hive metastore configuration can be found in the source repository.
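For orientation, a minimal standalone `metastore-site.xml` (a sketch; the endpoint and credentials are assumed to match the Minio service above) mainly needs the thrift URI and the S3A filesystem settings:

<configuration>
  <!-- Where the metastore listens for thrift clients such as Trino -->
  <property>
    <name>metastore.thrift.uris</name>
    <value>thrift://0.0.0.0:9083</value>
  </property>
  <!-- S3A settings so schema locations like s3a://iris/ resolve to Minio -->
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://minio:9000</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>minio</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>minio123</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>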

Getting started

  • docker-compose up
  • Docker volumes are mounted locally. This helps in inspecting the data of each service.
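Before querying anything, the parquet file has to land in Minio. One way to upload it (a sketch using the MinIO client mc; the bucket layout matches the table definition below, and the credentials are the assumed demo values) is:

# Point mc at the local Minio, create the bucket, and upload the file
mc alias set local http://localhost:9000 minio minio123
mc mb local/iris
mc cp iris.parquet local/iris/iris_data/iris.parquet

The Minio web console works just as well for a one-off upload.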

Initialize Hive schema

Start the Trino CLI with docker exec -it trino trino and execute the following SQL commands.

-- Each query currently needs to be run manually.
CREATE SCHEMA IF NOT EXISTS hive.iris
WITH (location = 's3a://iris/');

-- Path s3a://iris/iris_data is the holding directory. We don't give the full file path, only the parent directory.
CREATE TABLE IF NOT EXISTS hive.iris.iris_data (
  sepal_length DOUBLE,
  sepal_width DOUBLE,
  petal_length DOUBLE,
  petal_width DOUBLE,
  class VARCHAR
)
WITH (
  external_location = 's3a://iris/iris_data',
  format = 'PARQUET'
);

Testing Trino with Minio

Let’s run a SQL query from Trino against a parquet file that is loaded into Minio.

trino> SELECT
    ->     sepal_length,
    ->     class
    -> FROM hive.iris.iris_data
    -> LIMIT 10;

 sepal_length |    class
--------------+-------------
          5.1 | Iris-setosa
          5.0 | Iris-setosa
          4.4 | Iris-setosa
          4.9 | Iris-setosa
          4.6 | Iris-setosa
          5.0 | Iris-setosa
          5.4 | Iris-setosa
          4.6 | Iris-setosa
          4.9 | Iris-setosa
          4.7 | Iris-setosa
(10 rows)

Query 20211229_201233_00002_wfmbr, FINISHED, 1 node
Splits: 10 total, 10 done (100.00%)
1.14 [15 rows, 8.08KB] [13 rows/s, 7.08KB/s]

Set up Superset

  • The first time, run `sh superset_init.sh`. This sets up Superset with a SQLite DB as the metadata store.
  • In Superset, add a database of type Trino with the SQLAlchemy URI `trino://hive@trino-coordinator:8080/hive` (see the driver note after this list).
  • In Superset, add a dataset. Now we will be able to see our iris_data as a table.
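One assumption worth calling out: the `trino://` URI only works if the Superset image has a Trino SQLAlchemy dialect installed. If the connection fails with an unknown-dialect error, installing the driver inside the Superset container should fix it (at the time of writing, the sqlalchemy-trino package provided the dialect; newer stacks use the trino client's sqlalchemy extra):

pip install sqlalchemy-trino
# or, on newer setups:
pip install 'trino[sqlalchemy]'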

Conclusion

With this full flow, we were able to view a parquet file as a table directly in Apache Superset. The same flow works for any of the rich data file formats listed above. Once the data is in Superset, the sky's the limit for visualizing it however we desire.
