Impala
Apache Impala can query data stored in Apache Ozone, serving as a scalable alternative to HDFS. This integration allows users to leverage Impala's high-performance SQL query engine directly on data stored in Ozone.
Version Compatibility
This integration requires:
- Apache Ozone 1.4.0 or later
- Apache Impala 4.2.0 or later
Always check release notes for both projects to ensure compatibility between specific versions.
Configuration
To enable Impala to access data in Ozone, you need to configure Impala to recognize the Ozone filesystem.
-
Configure
core-site.xml
for Impala: Ensure thecore-site.xml
used by Impala includes the necessary Ozone filesystem configurations. Primarily, set the default filesystem (fs.defaultFS
) to your Ozone cluster'sofs
address if you want Ozone to be the primary storage.ofs
(Ozone File System): Recommended for all Impala operations. Example:ofs://<ozone_manager_host>:9862
s3a
(S3 Compatible): For S3 compatible access. Example:s3a://<bucket>.<aws-region>.amazonaws.com
<configuration>
<property>
<name>fs.defaultFS</name>
<value>ofs://om-host.example.com:9862</value>
</property>
<!-- Add other Ozone client configurations as needed -->
<property>
<name>ozone.om.address</name>
<value>om-host.example.com:9861</value>
</property>
</configuration>Make sure the Ozone client JARs are available in Impala's classpath.
-
Security (Kerberos): If your Ozone and Impala clusters are secured with Kerberos, ensure proper configuration for authentication. Impala daemons will need appropriate Kerberos principals and keytabs to access Ozone securely. Refer to Ozone and Impala security documentation for detailed setup.
Usage
Once configured, you can create Impala tables that reside on Ozone.
-
Default Filesystem: If
fs.defaultFS
is set to Ozone, you can create tables without specifying aLOCATION
, and Impala will store the data in Ozone under the default warehouse directory.-- Creates table 't1' in the default Ozone location
CREATE TABLE t1 (
id INT,
name STRING,
event_date DATE
)
PARTITIONED BY (event_date)
STORED AS PARQUET;
-- Creates database 'mydatabase' in the default Ozone location
CREATE DATABASE mydatabase;
USE mydatabase;
CREATE TABLE t2 (col1 INT, col2 STRING) STORED AS PARQUET; -
Specific Location: You can explicitly specify an Ozone path in the
LOCATION
clause using theofs://
prefix.-- Creates table 't3' in a specific Ozone bucket and volume
CREATE EXTERNAL TABLE t3 (
user_id BIGINT,
url STRING
)
STORED AS PARQUET
LOCATION 'ofs://om-host.example.com:9862/volume1/bucket1/path/to/table3';
-- Creates database 'anotherdb' in a specific Ozone location
CREATE DATABASE anotherdb
LOCATION 'ofs://om-host.example.com:9862/volume1/bucket1/databases/anotherdb.db'; -
Querying Data: Query data stored in Ozone using standard Impala SQL.
SELECT name FROM t1 WHERE id > 100;
INSERT INTO t2 VALUES (1, 'apple'), (2, 'banana');
Advanced Features
Storage Layout Support
Impala works with both Ozone storage layouts:
- RATIS Replication: The default replication mechanism
- Erasure Coding: Provides efficient storage with reduced overhead
Specify the desired storage layout at the bucket level when creating your Ozone buckets.
Table Types
Both managed and external tables are fully supported:
- Managed tables: When dropped, Impala deletes the underlying data in Ozone
- External tables: When dropped, only the table metadata is removed while data remains intact in Ozone
Performance Considerations
- Block Size: Ozone uses a global block size configuration. For optimal Impala query performance, consider setting appropriate block sizes in your Ozone configuration.
- Data Locality: Unlike HDFS, Ozone doesn't automatically provide data locality to Impala. Consider co-locating Impala daemons with Ozone datanodes when possible.
- Cache Settings: For frequently accessed data, adjust Impala's memory settings to optimize query performance with Ozone storage.
Limitations and Considerations
- Protocol Support:
ofs
is the recommended protocol for Impala with Ozone. Theo3fs
protocol is not fully supported for Impala operations. PARQUET_FILE_SIZE
: ThePARQUET_FILE_SIZE
query option, used to control the size of Parquet files written by Impala, is ineffective when writing to Ozone.- Spill-to-Disk: Impala's spill-to-disk mechanism (used when queries exceed memory limits) can be configured to use an Ozone path by specifying a full Ozone URI (e.g.,
ofs://om-host.example.com:9862/volume1/bucket1/tmp/impala-spill
) in the Impala configuration for spill directories.
Troubleshooting
Common Issues
-
Connectivity Problems:
- Verify network connectivity between Impala daemons and Ozone Manager
- Check
core-site.xml
configuration for correct hostnames and ports
-
Authentication Failures:
- Ensure Kerberos principals and keytabs are correctly configured
- Verify that the security settings are consistent between Impala and Ozone
-
Performance Issues:
- Monitor query execution times and compare with HDFS performance
- Check Ozone metrics for potential bottlenecks in data access
-
File Not Found Errors:
- Verify that the Ozone volumes and buckets exist and are accessible
- Check permissions for the Impala service account on Ozone resources