Load Data from a CSV File into a Trino Table

To create a table in Trino and load data from a CSV file stored in Azure Data Lake Storage (ADLS), you’ll use Trino’s Hive connector to register the CSV file as a table. The Hive connector, backed by a Hive metastore, allows Trino to query files in ADLS. Below is a step-by-step guide to achieve this, assuming the CSV file is in ADLS Gen2 and Trino is configured to access it.

### Prerequisites
- **Trino Setup**: Trino (version 476 or later recommended) is installed and running.
- **Hive Metastore**: A Hive metastore (standalone or managed) is configured and accessible to Trino.
- **ADLS Gen2 Access**: The CSV file is stored in ADLS Gen2 (e.g., `abfss://container@storageaccount.dfs.core.windows.net/path/to/file.csv`), and Trino has the necessary credentials.
- **Hive Connector**: The Hive connector is configured in Trino's catalog (e.g., `etc/catalog/hive.properties`) with ADLS support.
- **Azure Credentials**: You have the storage account name, container name, and authentication details (e.g., SAS token, OAuth, or access key).

### Steps to Create a Table and Load CSV Data in Trino

1. **Upload the CSV File to ADLS**:
   - Use Azure CLI, the Azure Portal, or another tool to upload the CSV file to ADLS Gen2. For example:
     ```bash
     az storage fs file upload --account-name <storage-account> --file-system <container> --path path/to/file.csv --source ./file.csv
     ```
   - Note the file's path (e.g., `abfss://container@storageaccount.dfs.core.windows.net/path/to/file.csv`).

2. **Configure the Hive Connector for ADLS**:
   - Edit the Hive catalog configuration file (e.g., `etc/catalog/hive.properties`) to include ADLS Gen2 settings. Recent Trino versions access ADLS through the native Azure filesystem support. Example configuration:
     ```properties
     connector.name=hive
     hive.metastore.uri=thrift://<hive-metastore-host>:9083
     fs.native-azure.enabled=true
     azure.auth-type=ACCESS_KEY
     azure.access-key=<access-key>
     # OR for OAuth (service principal):
     # azure.auth-type=OAUTH
     # azure.oauth.tenant-id=<tenant-id>
     # azure.oauth.client-id=<client-id>
     # azure.oauth.secret=<client-secret>
     ```
   - Replace placeholders with your ADLS credentials. For OAuth, ensure the service principal has the `Storage Blob Data Contributor` role on the storage account. (Older Trino versions used legacy `hive.azure.*` properties instead; check the documentation for your version.)
   - Restart Trino after modifying the configuration.

3. **Create a Schema in Trino**:
   - Connect to Trino using the Trino CLI, JDBC, or a client like DBeaver.
   - Create a schema in the Hive catalog if it doesn't exist:
     ```sql
     CREATE SCHEMA IF NOT EXISTS hive.my_schema;
     ```
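   Optionally, a schema can be created with an explicit base location in ADLS so that tables created in it default to paths under that directory (the path below is a placeholder):
     ```sql
     CREATE SCHEMA IF NOT EXISTS hive.my_schema
     WITH (location = 'abfss://<container>@<storage-account>.dfs.core.windows.net/warehouse/my_schema/');
     ```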

4. **Create a Table for the CSV File**:
   - Define an external table in Trino that maps to the CSV file in ADLS and set the format to `CSV`. Note that the Hive connector's CSV format only supports `VARCHAR` columns, so declare every column as `VARCHAR` and cast to other types in queries. Example for a CSV with columns `id`, `name`, and `amount`:
     ```sql
     CREATE TABLE hive.my_schema.my_table (
         id VARCHAR,
         name VARCHAR,
         amount VARCHAR
     )
     WITH (
         format = 'CSV',
         external_location = 'abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/',
         csv_escape = '"',
         csv_quote = '"',
         skip_header_line_count = 1
     );
     ```
     - **Notes**:
       - `external_location`: Points to the directory containing the CSV file (not the file itself); Trino reads every file in that directory. Ensure the path ends with a `/`.
       - `skip_header_line_count = 1`: Skips the header row if present. Omit it if the CSV has no header.
       - `csv_escape` and `csv_quote`: Control how quoted fields and embedded special characters (e.g., commas or quotes inside values) are parsed. Adjust them to match your CSV dialect.
       - Because every column is `VARCHAR`, cast to the intended types (`INTEGER`, `DOUBLE`, etc.) in queries or in a downstream table.
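   Since every CSV field arrives as a string, it can help to verify locally that the columns cast cleanly before deciding which casts to apply in Trino. A minimal standard-library sketch (the sample data and column names mirror the example in this guide; `check_casts` is an illustrative helper, not a Trino API):
     ```python
     import csv
     import io

     def check_casts(csv_text, casters):
         """Try to apply each column's cast to every data row.

         Returns a list of (row_number, column, value) entries that failed,
         i.e., values that would come back NULL or error under CAST in Trino.
         """
         reader = csv.reader(io.StringIO(csv_text))
         header = next(reader)  # mirrors skip_header_line_count = 1
         failures = []
         for row_num, row in enumerate(reader, start=2):
             for col, value in zip(header, row):
                 caster = casters.get(col)
                 if caster is None:
                     continue  # column stays VARCHAR, nothing to check
                 try:
                     caster(value)
                 except ValueError:
                     failures.append((row_num, col, value))
         return failures

     sample = "id,name,amount\n1,Alice,100.50\n2,Bob,200.75\n"
     # int/float stand in for CAST(... AS INTEGER) / CAST(... AS DOUBLE)
     print(check_casts(sample, {"id": int, "amount": float}))
     ```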

5. **Verify Data Loading**:
   - Query the table to ensure the CSV data is accessible:
     ```sql
     SELECT * FROM hive.my_schema.my_table;
     ```
   - Trino reads the CSV file from ADLS and returns the data based on the table definition.
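   If the table's columns were declared as `VARCHAR` (the usual pattern for the CSV format), casts restore the intended types at query time:
     ```sql
     SELECT
         CAST(id AS INTEGER) AS id,
         name,
         CAST(amount AS DOUBLE) AS amount
     FROM hive.my_schema.my_table;
     ```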

6. **Example CSV and Workflow**:
   - **Sample CSV** (`file.csv` in `abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/`):
     ```
     id,name,amount
     1,Alice,100.50
     2,Bob,200.75
     ```
   - **Steps**:
     - Upload `file.csv` to ADLS Gen2.
     - Configure `hive.properties` as shown above.
     - Run the `CREATE TABLE` statement in Trino.
     - Query the table to see:
       ```
       id | name  | amount
       1  | Alice | 100.50
       2  | Bob   | 200.75
       ```

7. **Optional: Optimize for Performance**:
   - **Convert to Parquet/ORC**: CSV files are not optimized for query performance. Convert the data to a columnar format like Parquet, restoring proper column types at the same time if the CSV table was declared with `VARCHAR` columns:
     ```sql
     CREATE TABLE hive.my_schema.my_table_parquet
     WITH (format = 'PARQUET')
     AS SELECT
         CAST(id AS INTEGER) AS id,
         name,
         CAST(amount AS DOUBLE) AS amount
     FROM hive.my_schema.my_table;
     ```
   - **Partitioning**: If the CSV data is large, consider laying it out in ADLS by partition (e.g., `region=us/`, `region=eu/` subdirectories) and include `partitioned_by` in the table definition. The partition column must also be declared in the column list, and Hive requires partition columns to come last:
     ```sql
     WITH (
         format = 'CSV',
         external_location = 'abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/',
         partitioned_by = ARRAY['region'],
         skip_header_line_count = 1
     )
     ```

### Troubleshooting
- **Authentication Errors**: Verify ADLS credentials in `hive.properties`. Ensure the access key or OAuth credentials are correct and have sufficient permissions.
- **Table Not Found**: Confirm the schema and table exist in the Hive metastore and the `external_location` matches the ADLS path.
- **NULL Values or Data Mismatch**: Check the CSV for inconsistent data types or unescaped special characters. Use `VARCHAR` for all columns if parsing issues persist, then cast in queries.
- **Path Issues**: Ensure the `external_location` points to the directory containing the CSV, not the file itself, unless only one file is expected.
- **Performance**: For large datasets, convert to Parquet/ORC or partition the data to improve query speed.

### Notes
- **No Direct Import Command**: Trino does not support a `LOAD DATA`-style command for CSV files. Creating the external table effectively "loads" the data by making it queryable in place.
- **CSV Format**: Trino's Hive connector may struggle with complex CSV files (e.g., unescaped newlines or commas). Preprocess the CSV if necessary to ensure compatibility.
- **Alternative Tools**: For automation, consider tools like Apache Airflow or Kestra to orchestrate CSV uploads and table creation.
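As a sketch of the preprocessing mentioned above, the snippet below rewrites a CSV with uniform quoting using only the standard library, so embedded delimiters end up consistently quoted (`normalize_csv` is an illustrative helper, not part of Trino):

```python
import csv
import io

def normalize_csv(raw_text):
    """Rewrite a CSV so every field is quoted with double quotes,
    giving Trino's CSV reader a consistent dialect to parse."""
    reader = csv.reader(io.StringIO(raw_text))
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

messy = 'id,name\n1,"Smith, Alice"\n'
print(normalize_csv(messy))
```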

If you have a specific CSV structure, ADLS path, or authentication method (e.g., OAuth vs. access key), adapt the configuration and `CREATE TABLE` statement above accordingly. For ad-hoc CSV queries without a metastore, the community `trino-storage` connector is also worth a look.
