How to connect Trino to Azure Data Lake Storage and generate Parquet files from Trino queries

To connect Trino to Azure Data Lake Storage (ADLS) Gen2 and generate Parquet files from Trino queries, you configure Trino to access ADLS Gen2 through the Hive or Delta Lake connector, set up authentication, and use SQL statements to write query results as Parquet files. Below is a step-by-step guide.



### **Prerequisites**
1. **Trino Installation**: Ensure Trino is installed (e.g., via Docker, Kubernetes, or a managed service like Starburst Galaxy).
2. **ADLS Gen2 Account**: You have an Azure Data Lake Storage Gen2 account with a storage account, container, and necessary credentials (e.g., access key or service principal).
3. **Hive Metastore or AWS Glue**: A Hive Metastore Service (HMS) or AWS Glue Data Catalog is required for metadata management. Alternatively, you can use the Delta Lake connector for a metastore-free setup in some cases.
4. **Java Runtime**: Trino requires a recent 64-bit JDK on every node (Java 21 or later, depending on your Trino version; check the requirements page for your release).
5. **Network Access**: Trino coordinator and worker nodes must have network access to ADLS Gen2 (endpoint: `dfs.core.windows.net`) and the metastore (default port: 9083 for HMS).



### **Step 1: Configure Trino to Connect to ADLS Gen2**
Trino can connect to ADLS Gen2 using the **Hive connector** or **Delta Lake connector**. The Hive connector is more common for general Parquet file operations, while Delta Lake is preferred for transactional data lakes. Below, we focus on the Hive connector, with notes on Delta Lake where applicable.

#### **1.1 Create a Catalog Configuration File**
Create a catalog properties file (e.g., `etc/catalog/hive.properties`) to configure the Hive connector for ADLS Gen2.

```properties
connector.name=hive
hive.metastore.uri=thrift://<metastore-host>:9083
fs.native-azure.enabled=true
azure.auth-type=ACCESS_KEY
azure.access-key=<your-adls-access-key>
azure.endpoint=dfs.core.windows.net
```

- **Explanation**:
  - `connector.name=hive`: Specifies the Hive connector.
  - `hive.metastore.uri`: Points to your Hive Metastore or a compatible service (e.g., AWS Glue).
  - `fs.native-azure.enabled=true`: Enables Trino's native Azure file system support for ADLS Gen2.
  - `azure.auth-type=ACCESS_KEY`: Uses the storage account access key for authentication. For more secure access, use `azure.auth-type=OAUTH` with a service principal, configured via `azure.oauth.client-id`, `azure.oauth.secret`, `azure.oauth.tenant-id`, and `azure.oauth.endpoint`.
  - `azure.access-key`: The access key for your ADLS Gen2 storage account.
  - `azure.endpoint`: The ADLS Gen2 endpoint (typically `dfs.core.windows.net`).
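If you manage several catalogs (for example in a Docker or Kubernetes deployment), it can help to generate these properties files programmatically. A minimal sketch in Python, where `render_catalog` is a hypothetical helper and the metastore host and access key are placeholders:

```python
# Sketch: render a Trino catalog .properties file from a dict.
# render_catalog is a hypothetical helper; host and key values are placeholders.

def render_catalog(props: dict) -> str:
    """Render a dict as the body of a Trino .properties file (key=value per line)."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"

hive_props = {
    "connector.name": "hive",
    "hive.metastore.uri": "thrift://metastore-host:9083",  # placeholder host
    "fs.native-azure.enabled": "true",
    "azure.auth-type": "ACCESS_KEY",
    "azure.access-key": "<your-adls-access-key>",          # placeholder key
}

# In a real deployment, write this to etc/catalog/hive.properties
# before starting the Trino server.
print(render_catalog(hive_props))
```

Trino reads each file in `etc/catalog/` as one catalog; the file name (minus `.properties`) becomes the catalog name.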

If using AWS Glue instead of a Hive Metastore:
```properties
hive.metastore=glue
hive.metastore.glue.region=<region>
hive.metastore.glue.aws-access-key=<aws-access-key>
hive.metastore.glue.aws-secret-key=<aws-secret-key>
hive.metastore.glue.catalogid=<aws-account-id>
```

#### **1.2 Alternative: Delta Lake Connector**
If you prefer using Delta Lake for a transactional data lake, create a catalog file (e.g., `etc/catalog/delta.properties`):

```properties
connector.name=delta_lake
hive.metastore.uri=thrift://<metastore-host>:9083
fs.native-azure.enabled=true
azure.auth-type=ACCESS_KEY
azure.access-key=<your-adls-access-key>
azure.endpoint=dfs.core.windows.net
```

Delta Lake stores table data as Parquet files and tracks changes in a transaction log, making it suitable for advanced use cases like time travel or ACID transactions.

#### **1.3 Advanced Configuration (Optional)**
For multiple ADLS accounts or complex setups using the legacy Hadoop file system support, configure credentials in Hadoop's `core-site.xml`:
```xml
<configuration>
  <property>
    <name>fs.azure.account.key.<account-name>.dfs.core.windows.net</name>
    <value><your-access-key></value>
  </property>
</configuration>
```
Reference this file in the catalog:
```properties
hive.config.resources=<path-to-core-site.xml>
```



### **Step 2: Verify Connectivity**
1. Start your Trino server.
2. Use the Trino CLI or a client (e.g., DBeaver) to connect to Trino.
3. Verify the catalog is accessible:
   ```sql
   SHOW CATALOGS;
   ```
   Ensure `hive` (or `delta_lake`) appears in the list.
4. Test access to ADLS Gen2 by creating a schema:
   ```sql
   CREATE SCHEMA hive.my_schema
   WITH (location = 'abfs://<container>@<account-name>.dfs.core.windows.net/<path>/');
   ```

If you encounter errors like “Could not read table schema” or connection failures, check:
- Metastore connectivity (port 9083).
- Correctness of the ADLS credentials and endpoint.
- Firewall rules allowing access to ADLS and the metastore.
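A quick way to rule out network issues is to test TCP reachability from a Trino node. A minimal sketch, assuming placeholder host names:

```python
import socket

# Sketch: check TCP reachability of the metastore (port 9083) and the ADLS
# endpoint (port 443) from a Trino node. Host names below are placeholders.

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Replace with your real hosts:
# print(port_open("metastore-host", 9083))
# print(port_open("myaccount.dfs.core.windows.net", 443))
```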



### **Step 3: Generate Parquet Files from Trino Queries**
To generate Parquet files in ADLS Gen2, use the `CREATE TABLE AS` statement with the `format = 'PARQUET'` property to write query results to ADLS.

#### **3.1 Create a Table with Query Results**
Example: Export query results to a Parquet file in ADLS Gen2.
```sql
CREATE TABLE hive.my_schema.orders_parquet
WITH (
  format = 'PARQUET',
  external_location = 'abfs://<container>@<account-name>.dfs.core.windows.net/<path>/orders/'
)
AS
SELECT orderkey, custkey, orderstatus, totalprice, orderdate
FROM tpch.sf1.orders;
```

- **Explanation**:
  - `hive.my_schema.orders_parquet`: The table name in the Hive catalog.
  - `format = 'PARQUET'`: Specifies Parquet as the output file format.
  - `external_location`: The ADLS Gen2 path where the Parquet files are written (e.g., `abfs://my-container@myaccount.dfs.core.windows.net/data/orders/`).
  - `AS SELECT …`: The query whose results are written as Parquet files.
  - The data is written as one or more Parquet files in the specified location.
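The same CTAS can also be issued programmatically with the Trino Python client (`pip install trino`). A sketch, where `build_ctas` is a hypothetical helper and the host, table, and ADLS path are placeholders:

```python
# Sketch: assemble and (optionally) execute the CTAS from Python.
# build_ctas is a hypothetical helper; host/table/path values are placeholders.

def build_ctas(table: str, location: str, query: str) -> str:
    """Assemble a CREATE TABLE AS statement that writes Parquet to ADLS."""
    return (
        f"CREATE TABLE {table} "
        f"WITH (format = 'PARQUET', external_location = '{location}') "
        f"AS {query}"
    )

sql = build_ctas(
    "hive.my_schema.orders_parquet",
    "abfs://my-container@myaccount.dfs.core.windows.net/data/orders/",
    "SELECT orderkey, custkey, totalprice FROM tpch.sf1.orders",
)
print(sql)

# Uncomment to run against a live cluster:
# import trino
# conn = trino.dbapi.connect(host="localhost", port=8080, user="admin", catalog="hive")
# cur = conn.cursor()
# cur.execute(sql)
# cur.fetchall()  # CTAS returns the number of rows written
```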

#### **3.2 Verify Parquet Files**
– Check the ADLS Gen2 container using Azure Portal, Azure CLI, or a tool like Azure Storage Explorer to confirm the Parquet files are created in the specified path.
– Query the table to verify data:
  “`sql
  SELECT * FROM hive.my_schema.orders_parquet LIMIT 10;
  “`

#### **3.3 Optional: Partitioning**
To improve query performance, partition the Parquet files by specifying columns in the `WITH` clause (partition columns must appear last in the `SELECT` list):
```sql
CREATE TABLE hive.my_schema.orders_parquet
WITH (
  format = 'PARQUET',
  external_location = 'abfs://<container>@<account-name>.dfs.core.windows.net/<path>/orders/',
  partitioned_by = ARRAY['orderdate']
)
AS
SELECT orderkey, custkey, orderstatus, totalprice, orderdate
FROM tpch.sf1.orders;
```
This creates subdirectories based on `orderdate` values, optimizing queries that filter on the partitioned column.
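Concretely, Hive-style partitioning writes each partition into a `column=value` subdirectory under the table location. A sketch of that naming convention, using a hypothetical `partition_path` helper:

```python
# Sketch: illustrate the key=value directory layout produced by Hive-style
# partitioning. partition_path is a hypothetical helper for illustration.

def partition_path(base: str, **partitions: str) -> str:
    """Append key=value partition directories to a base table location."""
    parts = "/".join(f"{key}={value}" for key, value in partitions.items())
    return f"{base.rstrip('/')}/{parts}/"

print(partition_path(
    "abfs://my-container@myaccount.dfs.core.windows.net/data/orders",
    orderdate="1995-01-01",
))
# abfs://my-container@myaccount.dfs.core.windows.net/data/orders/orderdate=1995-01-01/
```

Each such directory holds one or more Parquet files written by the Trino workers.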



### **Step 4: Best Practices and Considerations**
1. **Metastore Requirement**:
   - A Hive Metastore or AWS Glue is typically required for the Hive connector to manage table metadata. Without a metastore, you cannot create or query tables directly.
   - For lightweight alternatives, consider Starburst Galaxy's built-in metastore or Delta Lake's transaction log for metadata management.

2. **Performance Tuning**:
   - Use Parquet's columnar storage for efficient data retrieval, and sort data on join keys or frequently filtered columns to improve row-group pruning.
   - Enable dynamic filtering and partitioning for large datasets.
   - If Parquet operations fail with `java.lang.OutOfMemoryError`, increase the Trino JVM heap in `etc/jvm.config`, for example:
     ```
     -Xmx16G
     ```

3. **Security**:
   - Prefer service principal (OAuth) authentication over access keys for better security.
   - Ensure the service principal has the `Storage Blob Data Contributor` or `Storage Blob Data Owner` role for write operations.
   - Restrict network access to ADLS and the metastore using Azure storage firewalls or virtual network (VNet) rules.

4. **Table Format Choice**:
   - **Hive Connector**: Best for simple Parquet file generation without transactional needs.
   - **Delta Lake Connector**: Use for ACID compliance, time travel, or frequent updates. Delta Lake stores metadata in the same ADLS location, reducing metastore dependency.
   - **Iceberg Connector**: Another option for large-scale analytics with schema evolution support.

5. **Error Handling**:
   - If Trino fails to read Parquet files due to schema mismatches, ensure the table schema matches the Parquet file schema. Use `CREATE TABLE` to define the schema explicitly when querying existing Parquet files.
   - Trino may not correctly read `TIMESTAMP` values from Parquet files created by Hive 3.1 or later due to known bugs (HIVE-21002, HIVE-22167). Use Delta Lake or Iceberg to mitigate.



### **Step 5: Example Workflow**
Here’s a complete example to query an existing table, generate Parquet files, and store them in ADLS Gen2:
1. Configure `hive.properties` as shown in Step 1.1.
2. Start Trino and connect via the CLI:
   ```bash
   trino --server localhost:8080 --catalog hive
   ```
3. Create a schema:
   ```sql
   CREATE SCHEMA hive.sales
   WITH (location = 'abfs://my-container@myaccount.dfs.core.windows.net/sales/');
   ```
4. Generate partitioned Parquet files from an aggregate query (the partition column, `orderstatus`, appears last in the `SELECT` list):
   ```sql
   CREATE TABLE hive.sales.sales_summary
   WITH (
     format = 'PARQUET',
     external_location = 'abfs://my-container@myaccount.dfs.core.windows.net/sales/summary/',
     partitioned_by = ARRAY['orderstatus']
   )
   AS
   SELECT SUM(totalprice) AS total_revenue, COUNT(*) AS order_count, orderstatus
   FROM tpch.sf1.orders
   GROUP BY orderstatus;
   ```
5. Verify the Parquet files in ADLS Gen2 and query the table:
   ```sql
   SELECT * FROM hive.sales.sales_summary LIMIT 5;
   ```



### **Additional Notes**
- **Starburst Galaxy**: For a simpler setup, use Starburst Galaxy, which provides a hosted Trino instance with built-in ADLS Gen2 support and a metastore. Configure an ADLS connection via the UI and skip the manual metastore setup.
- **Existing Parquet Files**: To query existing Parquet files in ADLS Gen2, create an external table:
  ```sql
  CREATE TABLE hive.my_schema.existing_data (
    col1 BIGINT,
    col2 VARCHAR
  )
  WITH (
    format = 'PARQUET',
    external_location = 'abfs://<container>@<account-name>.dfs.core.windows.net/<path>/'
  );
  ```
- **Limitations**:
  - Trino's Hive connector does not support transactional (ACID) writes. Use Delta Lake for transactional updates.
  - Writing `TIMESTAMP` data to Parquet may require workarounds due to Hive compatibility issues.
- **Further Reading**:
  - [Trino Hive connector documentation](https://trino.io/docs/current/connector/hive.html)
  - [Trino Delta Lake connector documentation](https://trino.io/docs/current/connector/delta-lake.html)
  - [Trino Azure object storage documentation](https://trino.io/docs/current/object-storage/legacy-azure.html)



### **Troubleshooting**
- **Error: “Could not read table schema”**:
  - Verify metastore connectivity and the catalog configuration.
  - Ensure the ADLS path exists and is accessible.
- **Error: “Java heap space”**:
  - Increase the Trino JVM heap size (`-Xmx`) in `etc/jvm.config`.
- **Slow Queries**:
  - Enable partitioning, or use Delta Lake/Iceberg for better metadata management.
  - Optimize Parquet files by sorting data on frequently queried columns.



This guide provides a comprehensive approach to connecting Trino with ADLS Gen2 and generating Parquet files. If you encounter specific errors or need further customization, please provide additional details (e.g., error messages, Trino version, or metastore type) for tailored assistance.
