# Hadoop S3A Connector (s3a://)

The Hadoop S3A connector (`s3a://`) is a client-side library that allows Hadoop ecosystem tools to interact with Amazon S3 or S3-compatible object stores, including Apache Ozone's S3 Gateway.
## Overview

While Ozone provides its native Hadoop Compatible File System (HCFS) implementation via `ofs://`, the `s3a` connector offers an alternative way for Hadoop applications (Spark, Hive, MapReduce, etc.) to access Ozone data using the S3 protocol.
**Key Points:**

- **Client-Side Translation:** `s3a` acts as a translator, converting Hadoop `FileSystem` API calls into S3 REST API requests, which are then sent to the configured endpoint (Ozone's S3 Gateway).
- **S3 Semantics:** Operations performed via `s3a` generally follow S3 object storage semantics. This means directory operations are often simulated client-side.
- **Configuration:** Requires configuring the Hadoop client with the Ozone S3 Gateway endpoint, access credentials, and specific `s3a` settings (see the sketch after this list).
- **Performance:** Performance characteristics can differ from `ofs://`, especially for operations involving directory renames or listings, as `s3a` may perform multiple S3 API calls to simulate these actions.
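For ad-hoc testing, these settings can also be supplied per command via Hadoop's generic `-D` options rather than in `core-site.xml`. A minimal sketch; the endpoint, credentials, and bucket name are placeholders:

```bash
# Pass s3a settings on the command line instead of core-site.xml.
hadoop fs \
  -D fs.s3a.endpoint=http://ozone-s3g.example.com:9878 \
  -D fs.s3a.access.key=your_ozone_access_key \
  -D fs.s3a.secret.key=your_ozone_secret_key \
  -D fs.s3a.path.style.access=true \
  -ls s3a://my-ozone-bucket/
```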
## Configuration

To use `s3a` with Ozone, configure the following properties in the client's `core-site.xml`:
```xml
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  <description>The implementation class for the s3a filesystem.</description>
</property>

<property>
  <name>fs.s3a.endpoint</name>
  <value>http://ozone-s3g.example.com:9878</value> <!-- URL of your S3 Gateway -->
  <description>Ozone S3 Gateway endpoint URL.</description>
</property>

<property>
  <name>fs.s3a.access.key</name>
  <value>your_ozone_access_key</value> <!-- Access Key obtained from Ozone -->
  <description>Ozone S3 Access Key ID.</description>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>your_ozone_secret_key</value> <!-- Secret Key obtained from Ozone -->
  <description>Ozone S3 Secret Access Key.</description>
</property>

<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
  <description>
    Recommended for Ozone S3 Gateway. Use path-style access (bucket name in path)
    instead of virtual-style (bucket name in hostname).
  </description>
</property>

<!-- Optional: Disable change detection for better compatibility/performance with Ozone -->
<property>
  <name>fs.s3a.change.detection.mode</name>
  <value>none</value>
</property>
<property>
  <name>fs.s3a.change.detection.version.required</name>
  <value>false</value>
</property>

<!-- Optional: Configure S3A committer if needed for specific frameworks -->
<!--
<property>
  <name>fs.s3a.committer.name</name>
  <value>directory</value> <!- or partition, magic ->
</property>
<property>
  <name>fs.s3a.committer.staging.conflict-mode</name>
  <value>append</value> <!- or fail, replace ->
</property>
-->
```
Replace endpoint and credential values with your specific Ozone S3 Gateway details. Ensure the `hadoop-aws` JAR (which contains the S3A filesystem) and its AWS SDK dependency are on the client classpath.
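One way to satisfy the classpath requirement is sketched below; the version shown is an example only and must match your Hadoop release:

```bash
# On Hadoop 3.x distributions that ship hadoop-aws under share/hadoop/tools,
# enabling the optional tool adds it to the classpath.
export HADOOP_OPTIONAL_TOOLS=hadoop-aws

# For Spark jobs, the connector can instead be pulled at submit time
# (my_job.py is a placeholder; pick the hadoop-aws version matching your Hadoop).
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.6 my_job.py
```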
## Usage Examples

**Hadoop FS CLI:**
```bash
# List contents (maps to S3 ListObjects)
hadoop fs -ls s3a://my-ozone-bucket/

# Create a directory (simulated by s3a, often creates a zero-byte object ending in '/')
hadoop fs -mkdir s3a://my-ozone-bucket/newdir

# Copy a local file to Ozone via s3a
hadoop fs -copyFromLocal /local/path/file.txt s3a://my-ozone-bucket/newdir/

# Rename (often involves copy + delete operations by s3a, NOT atomic)
hadoop fs -mv s3a://my-ozone-bucket/newdir s3a://my-ozone-bucket/renameddir
```
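Because the gateway speaks the S3 protocol, generic S3 tooling can sanity-check the same bucket. A sketch using the AWS CLI, assuming it is installed and reusing the Ozone-issued key pair from the configuration above:

```bash
# Point the AWS CLI at the Ozone S3 Gateway instead of AWS.
export AWS_ACCESS_KEY_ID=your_ozone_access_key
export AWS_SECRET_ACCESS_KEY=your_ozone_secret_key

# List the bucket through the gateway endpoint.
aws s3 ls --endpoint-url http://ozone-s3g.example.com:9878 s3://my-ozone-bucket/
```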
**Spark/Hive:**

Applications can use `s3a://` paths similarly to how they use `ofs://` or `hdfs://`.
```sql
-- Example Hive query
CREATE EXTERNAL TABLE my_s3a_table (...)
LOCATION 's3a://my-ozone-bucket/data/my_table';

LOAD DATA LOCAL INPATH '/local/data.csv' INTO TABLE my_s3a_table;
```
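For Spark specifically, `spark.hadoop.*` properties are forwarded into the Hadoop configuration, so the `s3a` settings can be supplied at launch time instead of in `core-site.xml`. A minimal sketch with a placeholder endpoint:

```bash
# Forward s3a settings to Spark's Hadoop configuration at launch.
# Access/secret keys (omitted here) can be passed the same way or via core-site.xml.
spark-shell \
  --conf spark.hadoop.fs.s3a.endpoint=http://ozone-s3g.example.com:9878 \
  --conf spark.hadoop.fs.s3a.path.style.access=true
```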
## s3a vs. ofs
| Feature | `s3a://` (via Ozone S3 Gateway) | `ofs://` |
|---|---|---|
| Protocol | S3 REST API | Ozone native RPC |
| Primary bucket layout | OBS (also works with FSO) | FSO (incompatible with OBS) |
| Semantics | Object storage | Filesystem |
| Directory rename/delete | Client-side simulation (copy + delete), not atomic | Server-side, atomic (on FSO buckets) |
| Performance | Can vary; potentially slower for renames/listings | Generally optimized for filesystem operations |
| Compatibility | S3-focused tools, Hadoop ecosystem | Hadoop ecosystem |
| Setup | Configure S3 Gateway endpoint & credentials | Configure OM service ID / address |
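To make the Setup row concrete, the same bucket can be addressed both ways. A sketch, assuming a placeholder OM service ID `ozone-om` and Ozone's default `s3v` volume for buckets created through the S3 Gateway:

```bash
# Via the S3 Gateway (bucket name only; endpoint comes from fs.s3a.endpoint):
hadoop fs -ls s3a://my-ozone-bucket/

# Via native RPC (ofs paths include the OM service ID, volume, and bucket):
hadoop fs -ls ofs://ozone-om/s3v/my-ozone-bucket/
```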
## When to Use s3a with Ozone

- When you need to access Ozone from Hadoop tools but prefer or require S3 API interaction (e.g., due to existing S3 tooling or specific application requirements).
- When accessing OBS buckets from Hadoop applications (since `ofs://` cannot access OBS buckets; the sketch after this list shows how to check a bucket's layout).
- When strict filesystem atomicity for directory operations is not a critical requirement.
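Since the choice often hinges on bucket layout, it can help to inspect a bucket before deciding. A sketch using the `ozone sh` CLI, with placeholder volume and bucket names; the command's output includes the bucket's layout (object store vs. file-system optimized):

```bash
# Show bucket metadata, including its layout.
ozone sh bucket info /s3v/my-ozone-bucket
```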
For most traditional Hadoop analytics workloads requiring filesystem semantics and performance, `ofs://` with FSO buckets is generally the preferred and more performant choice. However, `s3a` provides valuable flexibility for accessing Ozone via the widely adopted S3 protocol.