Architecture

Apache Ozone is a scalable, distributed, and highly available object store designed to handle billions of objects of any size. This document provides an overview of Ozone's architecture, including its core components, data organization, and operational concepts.

Ozone Namespace

Ozone organizes data in a hierarchical namespace consisting of three levels:

Volumes

Volumes are the top-level entities in the Ozone namespace, conceptually similar to filesystems in traditional storage systems. They typically represent:

Multi-tenant boundaries
Organizational divisions
Project groupings

Volumes provide isolation and can have their own admins and quota limits. For more information, see Volumes Overview.

Buckets

Buckets exist within volumes and act as containers for objects (keys). Each bucket can be configured with specific properties:

Storage type and replication factor
Encryption settings
Access control policies
Quota limits

Buckets are analogous to directories in a filesystem or buckets in cloud object stores. For more information, see Buckets Overview.

Keys (Objects)

Keys are the actual data objects stored in buckets. They can be:

Files of any size
Binary data
Named using a path-like structure depending on bucket layout

For more details about Ozone's namespace, see Namespace Overview.
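The three-level hierarchy can be pictured with a small in-memory model. This is only an illustrative sketch: the class and method names (`Namespace`, `create_volume`, `put_key`, and so on) are hypothetical and are not the Ozone client API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Bucket:
    name: str
    keys: Dict[str, bytes] = field(default_factory=dict)  # key name -> object data


@dataclass
class Volume:
    name: str
    buckets: Dict[str, Bucket] = field(default_factory=dict)


class Namespace:
    """Toy namespace (hypothetical): volumes contain buckets, buckets contain keys."""

    def __init__(self) -> None:
        self.volumes: Dict[str, Volume] = {}

    def create_volume(self, volume: str) -> Volume:
        return self.volumes.setdefault(volume, Volume(volume))

    def create_bucket(self, volume: str, bucket: str) -> Bucket:
        # A bucket can only be created inside an existing volume.
        return self.volumes[volume].buckets.setdefault(bucket, Bucket(bucket))

    def put_key(self, volume: str, bucket: str, key: str, data: bytes) -> None:
        self.volumes[volume].buckets[bucket].keys[key] = data

    def get_key(self, volume: str, bucket: str, key: str) -> bytes:
        return self.volumes[volume].buckets[bucket].keys[key]


ns = Namespace()
ns.create_volume("sales")
ns.create_bucket("sales", "reports")
# The key name may look like a path; how it is interpreted depends on the bucket layout.
ns.put_key("sales", "reports", "2024/q1.csv", b"region,total\n")
print(ns.get_key("sales", "reports", "2024/q1.csv"))
```

Every object is addressed by the triple (volume, bucket, key), which is why a bucket cannot exist outside a volume in this model.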

Ozone Bucket Types

Ozone supports two distinct bucket layouts, each optimized for different use cases:

Object Store Layout

The Object Store layout (OBS) works like traditional object stores with a flat namespace:

Objects are stored with their full key path
No concept of directories (though path-like naming is supported)
Optimized for object storage workloads
Compatible with S3-style access patterns

File System Optimized Layout

The File System Optimized layout (FSO) provides hierarchical directory structure:

Directories are first-class entities
Supports efficient directory operations (listing, renaming)
Includes filesystem features like trash
Better performance for filesystem-style workloads
Default layout type

The bucket layout determines how data is organized and accessed within a bucket. For more information, see Bucket Layouts.
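To see why first-class directories make renames cheaper, the following sketch compares a flat key map with a directory tree. It is a simplified model under stated assumptions, not Ozone's metadata implementation; `rename_prefix_flat` and `Directory` are hypothetical names.

```python
from typing import Dict


def rename_prefix_flat(keys: Dict[str, bytes], old: str, new: str) -> int:
    """Flat (OBS-style) model: every key under the prefix must be rewritten."""
    moved = [k for k in keys if k.startswith(old)]
    for k in moved:
        keys[new + k[len(old):]] = keys.pop(k)
    return len(moved)  # number of entries touched


class Directory:
    """Hierarchical (FSO-style) model: directories are real entries."""

    def __init__(self) -> None:
        self.children: Dict[str, "Directory"] = {}
        self.files: Dict[str, bytes] = {}

    def rename_child(self, old: str, new: str) -> int:
        # Only the parent's entry changes; the subtree is left untouched.
        self.children[new] = self.children.pop(old)
        return 1  # number of entries touched


flat = {"logs/a.txt": b"a", "logs/b.txt": b"b", "data/c.txt": b"c"}
flat_touched = rename_prefix_flat(flat, "logs/", "archive/")

root = Directory()
root.children["logs"] = Directory()
root.children["logs"].files = {"a.txt": b"a", "b.txt": b"b"}
tree_touched = root.rename_child("logs", "archive")

print(flat_touched, tree_touched)
```

In the flat model the cost of a rename grows with the number of keys under the prefix, while in the tree model it stays constant regardless of how much data the directory holds.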

Diagram showing the Ozone namespace hierarchy: volumes contain buckets, which contain keys.

Core Components

Ozone has a modular architecture with several key components that work together to provide a scalable and reliable storage system.

Ozone Manager (OM)

The Ozone Manager is the metadata server that manages the namespace:

Handles all volume, bucket, and key operations
Maintains metadata in RocksDB
Allocates blocks for data storage
Manages access control
Supports HA deployment with Ratis consensus protocol

The OM is the entry point for all namespace operations. It tracks which objects exist and where they are stored. For more information, see Ozone Manager Details.

Storage Container Manager (SCM)

The Storage Container Manager orchestrates the container lifecycle and coordinates datanodes:

Manages container creation and allocation
Tracks datanode status and health
Handles container replication and EC
Serves block allocation requests from the Ozone Manager
Coordinates container balancing
Supports HA deployment with Ratis

SCM is the control plane for container management. For more information, see Storage Container Manager Details.

Datanode

Datanodes are the workhorses that store the actual data:

Store data in containers on local disks
Serve read and write requests
Report container statistics to SCM
Participate in replication pipelines
Handle data integrity checks

Each datanode manages a set of containers and serves read/write requests from clients. For more information, see Datanode Details.

Recon

Recon is the analytics and monitoring component:

Collects and aggregates metrics
Provides a web UI for monitoring
Offers a consolidated view of the cluster
Helps identify issues and bottlenecks
Syncs data from OM, SCM, and Datanodes

Recon is an optional but recommended component for operational visibility. For more information, see Recon Details.

S3 Gateway

The S3 Gateway provides S3-compatible API access:

Implements S3 REST API
Translates S3 operations to Ozone operations
Supports most S3 features and SDKs
Provides authentication and authorization

The S3 Gateway enables applications built for S3 to work with Ozone. For more information, see S3 Gateway Details.

HttpFS

HttpFS provides a REST interface compatible with WebHDFS:

Enables HTTP access to Ozone
Compatible with WebHDFS clients
Supports read/write operations
Facilitates integration with web applications

HttpFS allows web applications to interact with Ozone using familiar APIs. For more information, see HttpFS Details.

Ozone Client

The Ozone client is the software component that enables applications to interact with the Ozone storage system:

Provides Java libraries for programmatic access
Handles communication with OM for namespace operations
Manages direct data transfer with datanodes
Implements client-side caching for improved performance
Offers pluggable interfaces for different protocols (S3, OFS)
Handles authentication and token management

The client library abstracts away the complexity of the distributed system, providing applications with a simple, consistent interface to Ozone storage. For more information, see Client Details.

Component Interactions

The components of Ozone interact in well-defined patterns for different operations:

Diagram illustrating Ozone client interactions with OM, SCM, and Datanodes for read/write operations.

Write Path Sequence

The typical write sequence follows these steps:

Namespace Operations: The client contacts the Ozone Manager to create or locate the key in the namespace
Block Allocation: The Ozone Manager requests blocks from the Storage Container Manager
Data Transfer: The client directly writes data to the selected Datanodes according to the replication pipeline
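The three steps above can be sketched with toy stand-ins for the services. All classes here (`ToyOM`, `ToySCM`, `ToyDatanode`) and the block identifier shape are hypothetical, and the loop over the pipeline is a simplification of the real replication pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BlockId = Tuple[int, int]  # (container id, local block id); hypothetical shape


@dataclass
class BlockLocation:
    block_id: BlockId
    pipeline: List[str]  # names of the datanodes that hold the replicas


class ToyDatanode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[BlockId, bytes] = {}

    def write_block(self, block_id: BlockId, data: bytes) -> None:
        self.blocks[block_id] = data


class ToySCM:
    def __init__(self, datanodes: List[str]) -> None:
        self.datanodes = datanodes
        self.next_local_id = 0

    def allocate_block(self) -> BlockLocation:
        # Step 2: SCM picks a container and a pipeline for the new block.
        self.next_local_id += 1
        return BlockLocation((1, self.next_local_id), self.datanodes[:3])


class ToyOM:
    def __init__(self, scm: ToySCM) -> None:
        self.scm = scm
        self.keys: Dict[str, BlockLocation] = {}

    def create_key(self, path: str) -> BlockLocation:
        # Step 1: record the key, then request a block from SCM.
        location = self.scm.allocate_block()
        self.keys[path] = location
        return location


def write_key(om: ToyOM, datanodes: Dict[str, ToyDatanode],
              path: str, data: bytes) -> BlockLocation:
    location = om.create_key(path)
    # Step 3: the client sends data straight to the datanodes, bypassing OM.
    for name in location.pipeline:
        datanodes[name].write_block(location.block_id, data)
    return location


datanodes = {name: ToyDatanode(name) for name in ["dn1", "dn2", "dn3", "dn4"]}
om = ToyOM(ToySCM(list(datanodes)))
location = write_key(om, datanodes, "/vol1/bucket1/report.csv", b"payload")
print(location.pipeline)
```

The point of the sketch is the separation of paths: OM and SCM only exchange metadata, while the payload travels between the client and the datanodes.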

Read Path Sequence

For reads, the process is simpler:

The client requests key location information from the Ozone Manager
Using the block location information, the client reads data directly from Datanodes
In case of failures, the client retries with alternative replicas
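The retry behavior in the last step can be sketched as a simple failover loop. This is a minimal illustration, assuming the client already holds an ordered list of replica locations; `read_with_failover`, `ReplicaUnavailable`, and `toy_read` are hypothetical names, not part of the Ozone client.

```python
from typing import Callable, Dict, List


class ReplicaUnavailable(Exception):
    """Raised by the toy reader when a replica cannot be served."""


def read_with_failover(replicas: List[str],
                       read_from: Callable[[str], bytes]) -> bytes:
    """Try each replica in order and return the first successful read."""
    errors = []
    for node in replicas:
        try:
            return read_from(node)
        except ReplicaUnavailable as exc:
            errors.append(f"{node}: {exc}")
    raise ReplicaUnavailable("all replicas failed: " + "; ".join(errors))


# Toy cluster state: dn1 is down, dn2 and dn3 hold the data.
stored: Dict[str, bytes] = {"dn2": b"hello", "dn3": b"hello"}


def toy_read(node: str) -> bytes:
    if node not in stored:
        raise ReplicaUnavailable("node down")
    return stored[node]


result = read_with_failover(["dn1", "dn2", "dn3"], toy_read)
print(result)
```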

Monitoring and Management

Diagram showing Recon collecting data from OM, SCM, and Datanodes for monitoring.

The Recon service continuously:

Collects metrics from the Ozone Manager, Storage Container Manager, and Datanodes
Provides consolidated views of system health and performance
Facilitates troubleshooting and management

Ozone Internals

Ozone's internal structure explains how data is organized and protected.

Containers

Containers are the fundamental storage units in Ozone:

Fixed-size (typically 5GB) units of storage
Managed by the Storage Container Manager (SCM)
Replicated or erasure-coded across datanodes
Contain multiple blocks
Include metadata and chunk files

Containers are self-contained units that include all necessary metadata and data. They are the unit of replication in Ozone.

Blocks

Blocks are logical units of data within containers:

Represent portions of objects/keys
Created when clients write data
Referenced by object metadata
Allocated by SCM at the request of the Ozone Manager
Secured with block tokens

When a client writes data, the OM requests blocks from SCM, and the client writes data to these blocks through datanodes.

Chunks

Chunks are the physical data units stored on disk:

Fixed-size portions of blocks
Written sequentially in container data files
Checksummed for data integrity
Optimized for disk I/O

Chunks are the smallest units of data stored on disk and include checksums for integrity verification.
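A rough sketch of chunking with per-chunk checksums follows. The chunk size and the choice of CRC32 are assumptions made for illustration only; the helper names are hypothetical and the real on-disk format is not shown here.

```python
import zlib
from typing import List, Tuple

Chunk = Tuple[bytes, int]  # (chunk data, checksum of that data)


def split_into_chunks(data: bytes, chunk_size: int) -> List[Chunk]:
    """Cut a block into fixed-size pieces; the last piece may be shorter."""
    chunks = []
    for offset in range(0, len(data), chunk_size):
        piece = data[offset:offset + chunk_size]
        chunks.append((piece, zlib.crc32(piece)))
    return chunks


def verify_chunks(chunks: List[Chunk]) -> bool:
    """Recompute each checksum and compare it with the stored value."""
    return all(zlib.crc32(piece) == checksum for piece, checksum in chunks)


block = b"x" * 10
chunks = split_into_chunks(block, chunk_size=4)
print([len(piece) for piece, _ in chunks])  # [4, 4, 2]
print(verify_chunks(chunks))                # True

# Simulate on-disk corruption of the first chunk while keeping its old checksum.
corrupted = [(b"xxxy", chunks[0][1])] + chunks[1:]
print(verify_chunks(corrupted))             # False
```

Because each chunk carries its own checksum, a reader can detect corruption in one chunk without rereading the whole block.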

Diagram showing the structure of an Ozone container with metadata and chunks.

Replicated Containers

Ozone provides durability through container replication:

Default replication factor is 3
Uses Ratis (Raft) consensus protocol
Synchronously replicates data across datanodes
Provides strong consistency guarantees
Handles node failures transparently

Replicated containers ensure data durability by storing multiple copies of each container across different datanodes.

Diagram illustrating how data blocks and chunks are stored across datanodes in a replicated container setup.

Erasure Encoded Containers

Erasure coding provides space-efficient durability:

Splits data across multiple datanodes with parity
Supports various coding schemes (e.g., RS-3-2-1024k)
Reduces storage overhead compared to replication
Trades some performance for storage efficiency
Suitable for cold data storage

Erasure coding allows for data durability with less storage overhead than full replication.
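The overhead difference is simple arithmetic. The sketch below computes raw bytes stored per byte of user data for 3-way replication and for a scheme with 3 data and 2 parity units, as in RS-3-2-1024k; the 6 data plus 3 parity case is included only as a further illustration. Function names are hypothetical.

```python
def replication_overhead(factor: int) -> float:
    """Raw bytes stored per user byte when keeping `factor` full copies."""
    return float(factor)


def ec_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per user byte with data + parity striping."""
    return (data_units + parity_units) / data_units


print(replication_overhead(3))        # 3.0
print(round(ec_overhead(3, 2), 2))    # 1.67
print(round(ec_overhead(6, 3), 2))    # 1.5
```

With 3 data and 2 parity units, storing 3 TB of user data needs about 5 TB of raw capacity, compared with 9 TB under 3-way replication.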

Diagram illustrating how data and parity blocks are stored across datanodes in an erasure coded container setup.

Pipelines

Pipelines are groups of datanodes that work together to store data:

Managed by SCM
Consist of multiple datanodes
Handle write operations as a unit
Support different replication strategies

For detailed information, see Write Pipelines.

Ratis Replicated

Ratis pipelines use the Raft consensus protocol:

Typically three datanodes per pipeline
One leader and multiple followers
Synchronous replication
Strong consistency guarantees
Automatic leader election on failure
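Raft-style consensus relies on a majority of the group, which is why three nodes are a common pipeline size. The arithmetic below is a general property of majority quorums and is not a statement about Ozone's specific commit settings; the function names are hypothetical.

```python
def majority(nodes: int) -> int:
    """Smallest number of nodes that forms a majority of the group."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Nodes that can fail while a majority is still reachable."""
    return nodes - majority(nodes)


for n in (3, 5):
    print(n, majority(n), tolerated_failures(n))
```

For a three-node group the majority is two, so one node can fail and a leader can still be elected.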

Erasure Coded

Erasure coded pipelines distribute data and parity:

Datanodes store data or parity chunks
EC pipeline size depends on the coding scheme
Handles reconstruction on node failure
Optimized for storage efficiency
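How the pipeline size follows from the coding scheme can be shown by parsing a scheme name such as RS-3-2-1024k. The sketch assumes the name reads as codec, data units, parity units, and chunk size in KiB; that reading and the `parse_ec_scheme` helper are assumptions for illustration, not an Ozone API.

```python
import re
from typing import NamedTuple


class ECScheme(NamedTuple):
    codec: str
    data: int
    parity: int
    chunk_size_bytes: int

    @property
    def pipeline_size(self) -> int:
        # One datanode per data unit plus one per parity unit.
        return self.data + self.parity


def parse_ec_scheme(text: str) -> ECScheme:
    """Parse names of the assumed form CODEC-DATA-PARITY-<size>k."""
    match = re.fullmatch(r"([A-Za-z]+)-(\d+)-(\d+)-(\d+)k", text)
    if match is None:
        raise ValueError(f"unrecognized EC scheme: {text!r}")
    codec, data, parity, kib = match.groups()
    return ECScheme(codec.upper(), int(data), int(parity), int(kib) * 1024)


scheme = parse_ec_scheme("RS-3-2-1024k")
print(scheme.pipeline_size)      # 5
print(scheme.chunk_size_bytes)   # 1048576
```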

Multi-Protocol Access

Ozone supports multiple access protocols, making it versatile for different applications:

Native API

Command-line interface and Java API
Full feature access
Highest performance

Ozone File System (OFS)

Hadoop-compatible filesystem interface
Works with Hadoop ecosystem applications that use the Hadoop FileSystem API
Path format: ofs://om-host/volume/bucket/key

S3 Compatible

REST API compatible with Amazon S3
Works with S3 clients and SDKs
Path format: http://s3g-host/bucket/key

HttpFS

REST API compatible with WebHDFS
Enables web applications to access Ozone
Path format: http://httpfs-host/webhdfs/v1/volume/bucket/key

The multi-protocol architecture allows for flexible integration with existing applications and workflows. For more information, see Client Interfaces.
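The path formats listed above can be compared side by side by building each one for the same object. The host names and helper functions below are placeholders chosen for illustration.

```python
def ofs_path(om_host: str, volume: str, bucket: str, key: str) -> str:
    return f"ofs://{om_host}/{volume}/{bucket}/{key}"


def s3_url(s3g_host: str, bucket: str, key: str) -> str:
    # The S3 path format has no volume component.
    return f"http://{s3g_host}/{bucket}/{key}"


def httpfs_url(httpfs_host: str, volume: str, bucket: str, key: str) -> str:
    return f"http://{httpfs_host}/webhdfs/v1/{volume}/{bucket}/{key}"


print(ofs_path("om-host", "vol1", "bucket1", "dir/file.txt"))
print(s3_url("s3g-host", "bucket1", "dir/file.txt"))
print(httpfs_url("httpfs-host", "vol1", "bucket1", "dir/file.txt"))
```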

Summary

Apache Ozone's architecture provides:

Scalability through separation of metadata and data paths
Reliability via replication and erasure coding
Flexibility with multiple access protocols
Performance through optimized data paths
Manageability with comprehensive monitoring

This architecture enables Ozone to serve as both an object store and a filesystem, making it suitable for a wide range of applications from big data analytics to cloud-native workloads.

For more detailed information about Ozone's components and internals, see the System Internals section.
