# Hadoop S3A Connector (s3a://)

The Hadoop S3A connector (`s3a://`) is a client-side library that allows Hadoop ecosystem tools to interact with Amazon S3 or S3-compatible object stores, including Apache Ozone's S3 Gateway.
## Overview

While Ozone provides its native Hadoop Compatible File System (HCFS) implementation via `ofs://`, the `s3a` connector offers an alternative way for Hadoop applications (Spark, Hive, MapReduce, etc.) to access Ozone data using the S3 protocol.
**Key Points:**

- **Client-Side Translation:** `s3a` acts as a translator, converting Hadoop `FileSystem` API calls into S3 REST API requests, which are then sent to the configured endpoint (Ozone's S3 Gateway).
- **S3 Semantics:** Operations performed via `s3a` generally follow S3 object storage semantics. This means directory operations are often simulated client-side.
- **Configuration:** Requires configuring the Hadoop client with the Ozone S3 Gateway endpoint, access credentials, and specific `s3a` settings (see the sketch after this list).
- **Performance:** Performance characteristics can differ from `ofs://`, especially for operations involving directory renames or listings, as `s3a` may perform multiple S3 API calls to simulate these actions.
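For ad-hoc testing, these settings can also be supplied per command via Hadoop's generic `-D` options rather than in `core-site.xml`. A minimal sketch; the endpoint, credentials, and bucket name are placeholders:

```bash
# Pass s3a settings on the command line instead of core-site.xml.
hadoop fs \
  -D fs.s3a.endpoint=http://ozone-s3g.example.com:9878 \
  -D fs.s3a.access.key=your_ozone_access_key \
  -D fs.s3a.secret.key=your_ozone_secret_key \
  -D fs.s3a.path.style.access=true \
  -ls s3a://my-ozone-bucket/
```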
## Configuration

To use `s3a` with Ozone, configure the following properties in the client's `core-site.xml`:
```xml
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  <description>The implementation class for the s3a filesystem.</description>
</property>

<property>
  <name>fs.s3a.endpoint</name>
  <value>http://ozone-s3g.example.com:9878</value> <!-- URL of your S3 Gateway -->
  <description>Ozone S3 Gateway endpoint URL.</description>
</property>

<property>
  <name>fs.s3a.access.key</name>
  <value>your_ozone_access_key</value> <!-- Access Key obtained from Ozone -->
  <description>Ozone S3 Access Key ID.</description>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>your_ozone_secret_key</value> <!-- Secret Key obtained from Ozone -->
  <description>Ozone S3 Secret Access Key.</description>
</property>

<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
  <description>
    Recommended for Ozone S3 Gateway. Use path-style access (bucket name in path)
    instead of virtual-style (bucket name in hostname).
  </description>
</property>

<!-- Optional: Disable change detection for better compatibility/performance with Ozone -->
<property>
  <name>fs.s3a.change.detection.mode</name>
  <value>none</value>
</property>
<property>
  <name>fs.s3a.change.detection.version.required</name>
  <value>false</value>
</property>

<!-- Optional: Configure S3A committer if needed for specific frameworks -->
<!--
<property>
  <name>fs.s3a.committer.name</name>
  <value>directory</value> <!- or partition, magic ->
</property>
<property>
  <name>fs.s3a.committer.staging.conflict-mode</name>
  <value>append</value> <!- or fail, replace ->
</property>
-->
```
Replace endpoint and credential values with your specific Ozone S3 Gateway details. Ensure the `hadoop-aws` JAR (which contains the S3A filesystem) and its AWS SDK dependency are on the client classpath.
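One way to satisfy the classpath requirement is sketched below; the version shown is an example only and must match your Hadoop release:

```bash
# On Hadoop 3.x distributions that ship hadoop-aws under share/hadoop/tools,
# enabling the optional tool adds it to the classpath.
export HADOOP_OPTIONAL_TOOLS=hadoop-aws

# For Spark jobs, the connector can instead be pulled at submit time
# (my_job.py is a placeholder; pick the hadoop-aws version matching your Hadoop).
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.6 my_job.py
```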
## Usage Examples

**Hadoop FS CLI:**
```bash
# List contents (maps to S3 ListObjects)
hadoop fs -ls s3a://my-ozone-bucket/

# Create a directory (simulated by s3a, often creates a zero-byte object ending in '/')
hadoop fs -mkdir s3a://my-ozone-bucket/newdir

# Copy a local file to Ozone via s3a
hadoop fs -copyFromLocal /local/path/file.txt s3a://my-ozone-bucket/newdir/

# Rename (often involves copy + delete operations by s3a, NOT atomic)
hadoop fs -mv s3a://my-ozone-bucket/newdir s3a://my-ozone-bucket/renameddir
```
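Because the gateway speaks the S3 protocol, generic S3 tooling can sanity-check the same bucket. A sketch using the AWS CLI, assuming it is installed and reusing the Ozone-issued key pair from the configuration above:

```bash
# Point the AWS CLI at the Ozone S3 Gateway instead of AWS.
export AWS_ACCESS_KEY_ID=your_ozone_access_key
export AWS_SECRET_ACCESS_KEY=your_ozone_secret_key

# List the bucket through the gateway endpoint.
aws s3 ls --endpoint-url http://ozone-s3g.example.com:9878 s3://my-ozone-bucket/
```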
**Spark/Hive:**

Applications can use `s3a://` paths similarly to how they use `ofs://` or `hdfs://`.
```sql
-- Example Hive query
CREATE EXTERNAL TABLE my_s3a_table (...)
LOCATION 's3a://my-ozone-bucket/data/my_table';

LOAD DATA LOCAL INPATH '/local/data.csv' INTO TABLE my_s3a_table;
```
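For Spark specifically, `spark.hadoop.*` properties are forwarded into the Hadoop configuration, so the `s3a` settings can be supplied at launch time instead of in `core-site.xml`. A minimal sketch with a placeholder endpoint:

```bash
# Forward s3a settings to Spark's Hadoop configuration at launch.
# Access/secret keys (omitted here) can be passed the same way or via core-site.xml.
spark-shell \
  --conf spark.hadoop.fs.s3a.endpoint=http://ozone-s3g.example.com:9878 \
  --conf spark.hadoop.fs.s3a.path.style.access=true
```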
## s3a vs. ofs
| Feature | `s3a://` (via Ozone S3 Gateway) | `ofs://` |
|---|---|---|
| Protocol | S3 REST API | Ozone native RPC |
| Primary bucket layout | OBS (also works with FSO) | FSO (incompatible with OBS) |
| Semantics | Object storage | Filesystem |
| Directory rename/delete | Client-side simulation (copy + delete), not atomic | Server-side, atomic (on FSO buckets) |
| Performance | Can vary; potentially slower for renames/listings | Generally optimized for filesystem operations |
| Compatibility | S3-focused tools, Hadoop ecosystem | Hadoop ecosystem |
| Setup | Configure S3 Gateway endpoint & credentials | Configure OM service ID / address |
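To make the Setup row concrete, the same bucket can be addressed both ways. A sketch, assuming a placeholder OM service ID `ozone-om` and Ozone's default `s3v` volume for buckets created through the S3 Gateway:

```bash
# Via the S3 Gateway (bucket name only; endpoint comes from fs.s3a.endpoint):
hadoop fs -ls s3a://my-ozone-bucket/

# Via native RPC (ofs paths include the OM service ID, volume, and bucket):
hadoop fs -ls ofs://ozone-om/s3v/my-ozone-bucket/
```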
## When to Use s3a with Ozone

- When you need to access Ozone from Hadoop tools but prefer or require S3 API interaction (e.g., due to existing S3 tooling or specific application requirements).
- When accessing OBS buckets from Hadoop applications (since `ofs://` cannot access OBS buckets; the sketch after this list shows how to check a bucket's layout).
- When strict filesystem atomicity for directory operations is not a critical requirement.
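Since the choice often hinges on bucket layout, it can help to inspect a bucket before deciding. A sketch using the `ozone sh` CLI, with placeholder volume and bucket names; the command's output includes the bucket's layout (object store vs. file-system optimized):

```bash
# Show bucket metadata, including its layout.
ozone sh bucket info /s3v/my-ozone-bucket
```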
For most traditional Hadoop analytics workloads requiring filesystem semantics and performance, `ofs://` with FSO buckets is generally the preferred and more performant choice. However, `s3a` provides valuable flexibility for accessing Ozone via the widely adopted S3 protocol.