Hadoop S3A Connector (s3a)

The Hadoop S3A connector (s3a://) is a client-side library that lets Hadoop ecosystem tools interact with Amazon S3 or S3-compatible object stores, including Apache Ozone's S3 Gateway.

Overview

While Ozone provides its native Hadoop Compatible File System (HCFS) implementation via ofs://, using the s3a connector offers an alternative way for Hadoop applications (Spark, Hive, MapReduce, etc.) to access Ozone data using the S3 protocol.

Key Points:

  • Client-Side Translation: s3a acts as a translator, converting Hadoop FileSystem API calls into S3 REST API requests, which are then sent to the configured endpoint (Ozone's S3 Gateway).
  • S3 Semantics: Operations performed via s3a generally follow S3 object storage semantics. This means directory operations are often simulated client-side.
  • Configuration: Requires configuring the Hadoop client with the Ozone S3 Gateway endpoint, access credentials, and specific s3a settings (see the sketch after this list).
  • Performance: Performance characteristics can differ from ofs://, especially for operations involving directory renames or listings, as s3a might perform multiple S3 API calls to simulate these actions.
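
As a quick illustration of the configuration point above, the same s3a settings can also be supplied per command through Hadoop's generic -D options instead of core-site.xml; in this sketch the endpoint, credentials, and bucket name are placeholders:

# Sketch: supply s3a settings per command instead of via core-site.xml.
# Endpoint, keys, and bucket name below are placeholders, not real values.
hadoop fs \
  -D fs.s3a.endpoint=http://ozone-s3g.example.com:9878 \
  -D fs.s3a.access.key=your_ozone_access_key \
  -D fs.s3a.secret.key=your_ozone_secret_key \
  -D fs.s3a.path.style.access=true \
  -ls s3a://my-ozone-bucket/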

Configuration

To use s3a with Ozone, configure the following properties in the client's core-site.xml:

<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  <description>The implementation class for the s3a filesystem.</description>
</property>

<property>
  <name>fs.s3a.endpoint</name>
  <value>http://ozone-s3g.example.com:9878</value> <!-- URL of your S3 Gateway -->
  <description>Ozone S3 Gateway endpoint URL.</description>
</property>

<property>
  <name>fs.s3a.access.key</name>
  <value>your_ozone_access_key</value> <!-- Access Key obtained from Ozone -->
  <description>Ozone S3 Access Key ID.</description>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>your_ozone_secret_key</value> <!-- Secret Key obtained from Ozone -->
  <description>Ozone S3 Secret Access Key.</description>
</property>

<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
  <description>
    Recommended for Ozone S3 Gateway. Use path-style access (bucket name in the path)
    instead of virtual-hosted-style access (bucket name in the hostname).
  </description>
</property>

<!-- Optional: Disable change detection for better compatibility/performance with Ozone -->
<property>
  <name>fs.s3a.change.detection.mode</name>
  <value>none</value>
</property>
<property>
  <name>fs.s3a.change.detection.version.required</name>
  <value>false</value>
</property>

<!-- Optional: Configure an S3A committer if needed for specific frameworks -->
<property>
  <name>fs.s3a.committer.name</name>
  <value>directory</value> <!-- or partition, magic -->
</property>
<property>
  <name>fs.s3a.committer.staging.conflict-mode</name>
  <value>append</value> <!-- or fail, replace -->
</property>

Replace the endpoint and credential values with your specific Ozone S3 Gateway details. Ensure the necessary JARs (hadoop-aws, which contains the S3A filesystem, and its bundled AWS SDK dependency) are on the client classpath.
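
Once configured, a quick smoke test from the client can confirm that the JARs are present and the gateway is reachable; the bucket name below is a placeholder, and on a secured (Kerberized) cluster the access/secret key pair can be generated with ozone s3 getsecret:

# Check that the hadoop-aws and AWS SDK bundle JARs are on the classpath
hadoop classpath --glob | tr ':' '\n' | grep -E 'hadoop-aws|aws-java-sdk'

# On a secured cluster, generate an S3 access/secret key pair
# (requires a valid Kerberos ticket; prints awsAccessKey and awsSecret)
ozone s3 getsecret

# End-to-end smoke test: list a bucket through the gateway
hadoop fs -ls s3a://my-ozone-bucket/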

Usage Examples

Hadoop FS CLI:

# List contents (maps to S3 ListObjects)
hadoop fs -ls s3a://my-ozone-bucket/

# Create a directory (simulated by s3a, often creates a zero-byte object ending in '/')
hadoop fs -mkdir s3a://my-ozone-bucket/newdir

# Copy a local file to Ozone via s3a
hadoop fs -copyFromLocal /local/path/file.txt s3a://my-ozone-bucket/newdir/

# Rename (often involves copy + delete operations by s3a, NOT atomic)
hadoop fs -mv s3a://my-ozone-bucket/newdir s3a://my-ozone-bucket/renameddir
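
Because s3a speaks the standard S3 protocol, data written through it is equally visible to any other S3 client pointed at the same gateway. A sketch using the AWS CLI, with the endpoint and bucket name as placeholders:

# List the same bucket with the AWS CLI instead of hadoop fs
aws s3api list-objects-v2 \
  --endpoint-url http://ozone-s3g.example.com:9878 \
  --bucket my-ozone-bucket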

Spark/Hive:

Applications can use s3a:// paths similarly to how they use ofs:// or hdfs://.

-- Example Hive query
CREATE EXTERNAL TABLE my_s3a_table (...)
LOCATION 's3a://my-ozone-bucket/data/my_table';

LOAD DATA LOCAL INPATH '/local/data.csv' INTO TABLE my_s3a_table;
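
For Spark jobs, s3a settings can also be passed at submit time using the spark.hadoop.* prefix rather than editing core-site.xml; in this sketch the endpoint and the job script my_job.py are placeholders:

# Spark copies spark.hadoop.* properties into the Hadoop configuration
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=http://ozone-s3g.example.com:9878 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  my_job.py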

s3a vs. ofs

| Feature | s3a:// (via Ozone S3 Gateway) | ofs:// |
|---------|-------------------------------|--------|
| Protocol | S3 REST API | Ozone Native RPC |
| Primary Bucket Layout | OBS (works with FSO) | FSO (incompatible with OBS) |
| Semantics | Object Storage | Filesystem |
| Dir Rename/Delete | Client-side simulation (Copy+Delete), Not Atomic | Server-side, Atomic (on FSO buckets) |
| Performance | Can vary; potentially slower for renames/listings | Generally optimized for filesystem operations |
| Compatibility | S3-focused tools, Hadoop ecosystem | Hadoop ecosystem |
| Setup | Configure S3 Gateway endpoint & credentials | Configure OM Service ID / Address |
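
The two schemes can reach the same data: buckets created through the S3 Gateway are placed under Ozone's s3v volume by default, so an S3-created bucket with a compatible layout (e.g., FSO) is also addressable via ofs://. A sketch in which the OM service ID and bucket name are placeholders:

# Same bucket, two protocols: native Ozone RPC vs. the S3 Gateway
hadoop fs -ls ofs://ozone-om-service/s3v/my-ozone-bucket/
hadoop fs -ls s3a://my-ozone-bucket/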

When to Use s3a with Ozone

  • When you need to access Ozone from Hadoop tools but prefer or require S3 API interaction (e.g., due to existing S3 tooling or specific application requirements).
  • When accessing OBS buckets from Hadoop applications (as ofs:// cannot access OBS buckets).
  • When strict filesystem atomicity for directory operations is not a critical requirement.

For most traditional Hadoop analytics workloads requiring filesystem semantics and performance, ofs:// with FSO buckets is generally the preferred and more performant choice. However, s3a provides valuable flexibility for accessing Ozone via the widely adopted S3 protocol.