High Level Design Series · Data Storage · Part 8 · Post 28 of 70

Object Storage (S3 Architecture)

Object vs File vs Block Storage

Before diving into S3, let’s understand the three fundamental storage paradigms that underpin modern infrastructure. Each makes a fundamentally different trade-off among access patterns, performance, and scalability.

| Dimension | Block Storage | File Storage | Object Storage |
|---|---|---|---|
| Unit | Fixed-size blocks (512B–4KB) | Files in directories (hierarchical) | Objects in flat namespace (bucket + key) |
| Access | Raw block I/O (iSCSI, FC) | POSIX file APIs (NFS, SMB) | HTTP REST API (PUT/GET/DELETE) |
| Metadata | Minimal (block address only) | File system metadata (permissions, timestamps) | Rich, custom key-value metadata per object |
| Mutability | In-place updates (random writes) | In-place updates (seek + write) | Immutable — replace entire object |
| Scalability | Limited to single volume (~16 TB) | Limited by NAS controller (~PBs with clustering) | Virtually unlimited (exabytes+) |
| Performance | Lowest latency (<1ms SSD) | Low latency (1–10ms NFS) | Higher latency (50–200ms first byte) |
| Durability | Depends on RAID/replication | Depends on NAS redundancy | Designed for 99.999999999% (11 nines) |
| Cost (per GB/mo) | $0.08–0.10 (EBS gp3) | $0.03–0.30 (EFS) | $0.023 (S3 Standard) |
| Best For | Databases, boot volumes, VMs | Shared file systems, home directories, CMS | Backups, media, data lakes, archives |
| Examples | AWS EBS, Azure Disk, GCP PD | AWS EFS, Azure Files, GCP Filestore | AWS S3, Azure Blob, GCP Cloud Storage, MinIO |
Key insight: Object storage sacrifices random-write and low-latency access in exchange for virtually unlimited scale, extreme durability, and the cheapest per-GB cost. This is why it dominates for unstructured data — which accounts for over 80% of enterprise data.

Why Object Storage Dominates at Scale

The flat namespace is the secret weapon. File systems maintain a hierarchical directory tree — every mkdir, rename, or ls must traverse and lock parts of this tree. As the tree grows to billions of files, metadata operations become the bottleneck (ask anyone who’s run ls on a directory with 10 million files).

Object storage eliminates this by treating each object as an independent entity addressed by a flat key. The “directory structure” you see in the S3 console (photos/2024/vacation/img001.jpg) is a visual illusion — the forward slashes are just characters in a flat string key. This means there is no tree to traverse or lock: listing a “folder” is simply a prefix query, and “renaming a folder” means copying every object under that prefix, as the sketch below shows.
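To make the flat namespace concrete, here is a minimal boto3 sketch (bucket name and keys are hypothetical): the “folders” only appear when a listing groups keys by a delimiter.

# Three independent objects; no directories are created anywhere:
import boto3

s3 = boto3.client('s3')
for key in ['photos/2024/vacation/img001.jpg',
            'photos/2024/vacation/img002.jpg',
            'photos/2024/work/badge.jpg']:
    s3.put_object(Bucket='my-photo-bucket', Key=key, Body=b'...')

# Asking S3 to group by '/' simulates directories out of key prefixes:
resp = s3.list_objects_v2(Bucket='my-photo-bucket', Prefix='photos/2024/', Delimiter='/')
print([p['Prefix'] for p in resp.get('CommonPrefixes', [])])
# ['photos/2024/vacation/', 'photos/2024/work/']

# "Renaming a folder" is therefore not a metadata operation: it is a copy + delete per object.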

S3 Architecture Deep Dive

Amazon S3 (Simple Storage Service), launched in 2006, stores over 350 trillion objects and handles millions of requests per second. Let’s dissect its architecture layer by layer.

Core Concepts: Buckets, Objects, Keys

Buckets are the top-level containers. Each bucket name is globally unique across all of AWS (because bucket names become part of the DNS hostname). You get up to 100 buckets per account (soft limit, can be raised to 1000).

# Bucket naming rules:
# - 3–63 characters, lowercase letters, numbers, hyphens
# - Globally unique across ALL AWS accounts
# - Cannot look like an IP address (e.g. 192.168.1.1)

# S3 URL formats:
# Path-style:   https://s3.amazonaws.com/my-bucket/photos/cat.jpg
# Virtual-host: https://my-bucket.s3.amazonaws.com/photos/cat.jpg  (preferred)
# Region:       https://my-bucket.s3.us-west-2.amazonaws.com/photos/cat.jpg

Objects are the actual data entities. Each object consists of a key (its full, path-like name within the bucket), the data itself (up to 5 TB), system metadata (size, ETag, last-modified, storage class), optional user-defined metadata, and a version ID when versioning is enabled.

# PUT an object with custom metadata
aws s3api put-object \
  --bucket my-data-lake \
  --key "raw/events/2024/03/15/events-001.parquet" \
  --body events-001.parquet \
  --content-type "application/octet-stream" \
  --metadata '{"source":"kafka-cluster-1","partition":"7","offset":"142857"}' \
  --storage-class STANDARD \
  --server-side-encryption "aws:kms" \
  --ssekms-key-id "arn:aws:kms:us-east-1:123456:key/abc-123"

# Response:
# {
#     "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
#     "VersionId": "3sL4kqtJlcpXroDTDmJ+rmSpXd3dIbrHY+MTRCxf3vjVBH40Nr8X8gdRQBpUMLUo",
#     "ServerSideEncryption": "aws:kms"
# }

Internal Architecture

While AWS doesn’t publish all internal details, we know from published papers and conference talks that S3 is built on several internal subsystems:

Front-End Layer: REST API, authentication, request routing, rate limiting
Index Layer (Metadata): object key → storage location mapping (distributed key-value store)
Placement Layer: decides which storage nodes & disks receive data chunks
Storage Layer: physical disks organized into storage nodes, erasure-coded chunks

Request flow for a PUT:

  1. Front-End authenticates the request (SigV4), validates the bucket & key, checks IAM policies and bucket policies.
  2. Index Layer reserves a new entry in the metadata store, mapping the key to a set of storage locations.
  3. Placement Layer selects target storage nodes based on fault domains (spread across racks, power zones, AZs).
  4. Storage Layer receives the data, splits it into chunks, applies erasure coding, and writes coded fragments to disks.
  5. Once a quorum of fragments is durably written, the index is committed and the client receives a 200 OK.

Consistency Model

S3’s consistency story is one of the most important evolutions in cloud storage history.

The Old Model: Eventual Consistency (2006–2020)

For its first 14 years, S3 had a mixed consistency model:

# The classic gotcha (pre-Dec 2020):
PUT s3://bucket/config.json  (version 2)     # overwrite existing object
GET s3://bucket/config.json                   # might return version 1!
# ... seconds later ...
GET s3://bucket/config.json                   # now returns version 2

# Even worse — the "negative cache" bug:
GET s3://bucket/new-file.txt    → 404         # object doesn't exist yet
PUT s3://bucket/new-file.txt    → 200         # create it
GET s3://bucket/new-file.txt    → 404!        # cached 404 haunts you

This caused real production bugs: data pipelines reading stale files, CI/CD systems failing intermittently, config deployments appearing to silently fail.

The New Model: Strong Read-After-Write (Dec 2020)

In December 2020, AWS announced that S3 now delivers strong read-after-write consistency for all operations — PUTs, DELETEs, and LIST — at no additional cost and with no performance penalty.

# Post Dec 2020 — guaranteed behavior:
PUT s3://bucket/config.json   → 200           # overwrite
GET s3://bucket/config.json   → returns new version (guaranteed)

DELETE s3://bucket/old.txt    → 204
GET s3://bucket/old.txt       → 404 (guaranteed, no stale reads)

PUT s3://bucket/new.txt       → 200
LIST s3://bucket/             → includes new.txt (guaranteed)
How did they achieve this? AWS redesigned the S3 metadata subsystem. The new system uses a replication protocol that ensures all metadata nodes see the same view before acknowledging a write. Think of it as a witness protocol — before returning success, the system confirms that all subsequent reads from any node will see the new state. This replaced an eventually-consistent cache layer without sacrificing throughput.
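In application code, this means the defensive sleep-and-retry loops the old model forced on you can go away. A minimal boto3 sketch (bucket name hypothetical) that leans on the guarantee:

# Write, then immediately read and list; all of it is guaranteed to see the new state.
import boto3

s3 = boto3.client('s3')
s3.put_object(Bucket='my-bucket', Key='config.json', Body=b'{"version": 2}')

body = s3.get_object(Bucket='my-bucket', Key='config.json')['Body'].read()
assert body == b'{"version": 2}'          # no stale read possible

listed = s3.list_objects_v2(Bucket='my-bucket', Prefix='config')
assert any(o['Key'] == 'config.json' for o in listed.get('Contents', []))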

Erasure Coding & 11-Nines Durability

S3 promises 99.999999999% (11 nines) annual durability. This means if you store 10 million objects, you can statistically expect to lose a single object once every 10,000 years. How?

Erasure Coding Fundamentals

Instead of simple replication (3 copies = 3× storage overhead), S3 uses erasure coding — a mathematical technique from coding theory that achieves the same or better durability with far less storage overhead.

The core idea: take k data chunks and produce m parity chunks, for a total of n = k + m chunks. You can reconstruct the original data from any k of the n chunks. This means you can tolerate the loss of up to m chunks.

Reed-Solomon (k, m) erasure code:
  k = number of data chunks
  m = number of parity chunks
  n = k + m = total chunks stored
  Storage overhead = n / k = (k + m) / k

Example: RS(10, 6) → 16 total chunks
  - Can lose ANY 6 chunks and still recover
  - Storage overhead = 16/10 = 1.6×
  - Compare: 3× replication needs 3.0× storage for same durability

Durability Mathematics

Let’s derive the 11-nines figure. Assume:

P(single fragment lost in a year) = AFR = 0.02
P(data loss) = P(more than 6 fragments lost simultaneously)
             = P(≥7 out of 16 fragments fail before repair)

Using binomial distribution:
  P(exactly i failures out of 16) = C(16,i) × 0.02^i × 0.98^(16-i)
  P(≥7 failures) = Σ(i=7 to 16) C(16,i) × 0.02^i × 0.98^(16-i)

Dominant term (i=7):
  C(16,7) × 0.02^7 × 0.98^9 = 11440 × 1.28×10⁻¹² × 0.834 = 1.22 × 10⁻⁸
Adding remaining terms: P(≥7) ≈ 1.22 × 10⁻⁸

But this assumes no repair! With continuous repair (rebuild time ~6 hours):
  P(data loss) drops to ≈ 10⁻¹⁴ to 10⁻¹⁵
  Durability = 1 - P(data loss) ≈ 99.9999999999999%
  = well beyond 11 nines (more like 14-15 nines with repair)

AWS conservatively quotes 11 nines to account for:
  - Correlated failures (power outage takes whole rack)
  - Software bugs (bit rot, firmware issues)
  - Human error (operational mistakes)
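The no-repair bound above takes a few lines of Python to reproduce, assuming the same 2% annual fragment-failure rate and the RS(10, 6) layout:

# Binomial estimate of annual data-loss probability for RS(k, m) with no repair.
from math import comb

def loss_probability(k: int, m: int, afr: float) -> float:
    """P(more than m of the n = k + m fragments fail), i.e. data becomes unrecoverable."""
    n = k + m
    return sum(comb(n, i) * afr**i * (1 - afr)**(n - i) for i in range(m + 1, n + 1))

p = loss_probability(k=10, m=6, afr=0.02)
print(f"P(data loss, no repair) ≈ {p:.2e}")      # ≈ 1.25e-08
print(f"durability ≈ {(1 - p) * 100:.8f}%")      # ≈ 99.99999875% (repair pushes this far higher)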

Storage efficiency comparison:

| Scheme | Chunks | Tolerated Failures | Storage Overhead | Approx. Durability |
|---|---|---|---|---|
| Simple replication (3 copies) | 3 data | 2 | 3.0× | 99.9999% (6 nines) |
| RS(4, 2) | 4+2 = 6 | 2 | 1.5× | 99.99999% (7 nines) |
| RS(6, 3) | 6+3 = 9 | 3 | 1.5× | 99.999999% (8 nines) |
| RS(10, 6) | 10+6 = 16 | 6 | 1.6× | 99.999999999%+ (11+ nines) |
| RS(16, 4) | 16+4 = 20 | 4 | 1.25× | 99.9999999% (9 nines) |
Why erasure coding instead of more replicas? For 1 PB of data: 3× replication costs $69,000/mo (3 PB at $0.023/GB). RS(10,6) at 1.6× costs only $36,800/mo (1.6 PB). That’s $32,200/mo saved — nearly $400K per year — while achieving better durability.

▶ Object Storage Architecture — Erasure Coding Flow

Watch how a client upload is split into data chunks, encoded with parity, and distributed across storage nodes. See how the system survives node failures.

Storage Classes

S3 offers a tiered storage model that lets you optimize cost based on access frequency. Each class differs in pricing, retrieval latency, minimum storage duration, and availability SLA.

| Storage Class | Use Case | $/GB/mo | Retrieval | Min Duration | Availability |
|---|---|---|---|---|---|
| S3 Standard | Frequently accessed data | $0.023 | Instant (ms) | None | 99.99% |
| S3 Intelligent-Tiering | Unknown/changing access patterns | $0.023–0.004 | Instant (ms) | None | 99.9% |
| S3 Standard-IA | Infrequent access, rapid retrieval | $0.0125 | Instant (ms) | 30 days | 99.9% |
| S3 One Zone-IA | Re-creatable infrequent data | $0.01 | Instant (ms) | 30 days | 99.5% |
| S3 Glacier Instant | Archive with instant access | $0.004 | Instant (ms) | 90 days | 99.9% |
| S3 Glacier Flexible | Archive, minutes-to-hours retrieval | $0.0036 | 1–12 hours | 90 days | 99.9% |
| S3 Glacier Deep Archive | Long-term archive, rare access | $0.00099 | 12–48 hours | 180 days | 99.9% |

Real-World Cost Calculation

Consider a media company storing 500 TB of video assets with different access patterns:

Scenario: 500 TB video library

Hot content (5% = 25 TB):     S3 Standard          → 25,000 GB  × $0.023   = $575/mo
Warm content (15% = 75 TB):   S3 Standard-IA       → 75,000 GB  × $0.0125  = $937.50/mo
Cold content (30% = 150 TB):  Glacier Flexible     → 150,000 GB × $0.0036  = $540/mo
Frozen (50% = 250 TB):        Glacier Deep Archive → 250,000 GB × $0.00099 = $247.50/mo

Total: $2,300/month for 500 TB
────────────────────────────────────────
If everything was S3 Standard: 500,000 GB × $0.023 = $11,500/mo
Savings: $9,200/month = $110,400/year (80% reduction!)

Plus retrieval costs (GET requests + data transfer):
  Standard:         $0.0004 per 1K GET requests (free retrieval)
  Standard-IA:      $0.001 per 1K GET + $0.01/GB retrieval
  Glacier Flexible: $0.0004 per 1K GET + $0.03/GB retrieval (standard tier)
  Deep Archive:     $0.0004 per 1K GET + $0.02/GB retrieval (standard tier)
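The same arithmetic as a tiny script (storage charges only; the retrieval and request pricing listed above is excluded):

# Monthly storage cost of the tiered layout vs. keeping everything in S3 Standard.
TIERS = {
    "S3 Standard":          (25_000,  0.023),     # GB, $/GB/mo
    "S3 Standard-IA":       (75_000,  0.0125),
    "Glacier Flexible":     (150_000, 0.0036),
    "Glacier Deep Archive": (250_000, 0.00099),
}

tiered = sum(gb * price for gb, price in TIERS.values())
all_standard = 500_000 * 0.023

print(f"tiered:       ${tiered:,.2f}/mo")                      # $2,300.00/mo
print(f"all standard: ${all_standard:,.2f}/mo")                # $11,500.00/mo
print(f"savings:      ${all_standard - tiered:,.2f}/mo ({1 - tiered / all_standard:.0%})")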

▶ Storage Class Lifecycle — Cost vs Access Trade-Off

See how an object transitions through storage tiers over time, trading access speed for lower cost at each stage.

Multipart Upload

For objects larger than about 100 MB (and required for anything over 5 GB), S3 provides multipart upload — a three-phase protocol that dramatically improves reliability and throughput for large objects.

How It Works

  1. Initiate — create a multipart upload session, receive an UploadId
  2. Upload Parts — upload each part (5 MB–5 GB each, up to 10,000 parts), in parallel if desired
  3. Complete — send the list of part numbers and ETags to finalize the object
# Phase 1: Initiate multipart upload
aws s3api create-multipart-upload \
  --bucket my-data-lake \
  --key "backups/db-snapshot-2024-03-15.tar.gz" \
  --storage-class STANDARD_IA \
  --server-side-encryption "aws:kms"

# Response: { "UploadId": "abc123...", "Bucket": "my-data-lake", "Key": "..." }

# Phase 2: Upload parts (can be parallel!)
# Split a 10 GB file into 100 MB parts:
split -b 100M -d -a 3 db-snapshot.tar.gz part-   # numeric suffixes: part-000 ... part-099

# Upload each part (can run in parallel with GNU parallel or xargs):
for i in $(seq 1 100); do
  aws s3api upload-part \
    --bucket my-data-lake \
    --key "backups/db-snapshot-2024-03-15.tar.gz" \
    --upload-id "abc123..." \
    --part-number $i \
    --body "part-$(printf '%02d' $i)"
done
# Each returns: { "ETag": "\"etag-hash-here\"" }

# Phase 3: Complete multipart upload
aws s3api complete-multipart-upload \
  --bucket my-data-lake \
  --key "backups/db-snapshot-2024-03-15.tar.gz" \
  --upload-id "abc123..." \
  --multipart-upload '{
    "Parts": [
      {"PartNumber": 1, "ETag": "\"etag1\""},
      {"PartNumber": 2, "ETag": "\"etag2\""},
      ...
      {"PartNumber": 100, "ETag": "\"etag100\""}
    ]
  }'
Why multipart matters:
  • Resilience — if one part fails, retry just that part (not the entire 10 GB upload)
  • Parallelism — upload 8 parts simultaneously to saturate your bandwidth
  • Pause/Resume — upload over hours or days; uploaded parts persist until you complete or abort the upload (pair this with a lifecycle rule that aborts stale uploads)
  • Throughput — smaller parts = more TCP connections = higher aggregate bandwidth
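In practice you rarely drive the three phases by hand: boto3’s managed transfer layer performs the split, the parallel part uploads, per-part retries, and the final completion for you. A sketch, with the file name and bucket as hypothetical placeholders:

# Managed multipart upload via boto3 (thresholds and concurrency are tunable).
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,   # switch to multipart above 100 MB
    multipart_chunksize=128 * 1024 * 1024,   # 128 MB parts
    max_concurrency=8,                       # 8 parts in flight at once
    use_threads=True,
)

s3.upload_file(
    Filename='db-snapshot.tar.gz',
    Bucket='my-data-lake',
    Key='backups/db-snapshot-2024-03-15.tar.gz',
    Config=config,
    ExtraArgs={'StorageClass': 'STANDARD_IA'},
)
# An interrupted run leaves an incomplete multipart upload behind; see the
# lifecycle section below for the rule that aborts stale uploads automatically.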

Optimal part size selection:

Object size     Recommended part size   Number of parts
─────────────   ─────────────────────   ───────────────
100 MB–1 GB     16 MB                   7–64
1 GB–10 GB      64 MB–128 MB            8–156
10 GB–100 GB    128 MB–512 MB           20–781
100 GB–5 TB     512 MB–1 GB             100–5000

Formula: part_size = max(5 MB, object_size / 10,000)
         parts     = ceil(object_size / part_size)
Hard limits: 10,000 parts per upload, 5 MB–5 GB per part, 5 TB per object
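The sizing rule as a small helper. The 5 MB / 5 GB / 10,000-part limits are the documented S3 constraints; note that the formula yields the minimum viable part size, while the table’s recommendations run larger for throughput:

# Pick a part size that keeps a multipart upload within S3's hard limits.
import math

MIN_PART = 5 * 1024**2          # 5 MB minimum part size
MAX_PART = 5 * 1024**3          # 5 GB maximum part size
MAX_OBJECT = 5 * 1024**4        # 5 TB single-object limit
MAX_PARTS = 10_000

def choose_part_size(object_size: int) -> tuple[int, int]:
    """Return (part_size, part_count) for the smallest part size that stays under 10,000 parts."""
    if object_size > MAX_OBJECT:
        raise ValueError("object exceeds the 5 TB single-object limit")
    part_size = max(MIN_PART, math.ceil(object_size / MAX_PARTS))
    return part_size, math.ceil(object_size / part_size)

print(choose_part_size(10 * 1024**3))   # 10 GB → (5 MB parts, 2048 of them)
print(choose_part_size(2 * 1024**4))    # 2 TB  → (~210 MB parts, 10,000 of them)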

Pre-Signed URLs

Pre-signed URLs allow you to grant temporary, scoped access to private S3 objects without exposing your AWS credentials or making the bucket public. The URL itself contains a cryptographic signature.

# Generate a pre-signed URL for downloading (GET)
aws s3 presign s3://my-bucket/reports/q1-2024.pdf \
  --expires-in 3600   # 1 hour

# Output:
# https://my-bucket.s3.amazonaws.com/reports/q1-2024.pdf
#   ?X-Amz-Algorithm=AWS4-HMAC-SHA256
#   &X-Amz-Credential=AKIA.../20240315/us-east-1/s3/aws4_request
#   &X-Amz-Date=20240315T120000Z
#   &X-Amz-Expires=3600
#   &X-Amz-SignedHeaders=host
#   &X-Amz-Signature=abc123...

# Generate a pre-signed URL for uploading (PUT)
import boto3

s3 = boto3.client('s3', region_name='us-east-1')
url = s3.generate_presigned_url(
    'put_object',
    Params={
        'Bucket': 'user-uploads',
        'Key': f'avatars/{user_id}.jpg',
        'ContentType': 'image/jpeg',
        'ContentLength': 5242880,  # pins the exact Content-Length; use a presigned POST with conditions to enforce a size range
    },
    ExpiresIn=900  # 15 minutes
)

# Client uploads directly to S3, bypassing your server:
# curl -X PUT -H "Content-Type: image/jpeg" \
#   --data-binary @avatar.jpg "$PRESIGNED_URL"

Common patterns for pre-signed URLs:
  • Direct-to-S3 browser uploads: clients PUT files straight to S3, so large payloads never pass through your application servers
  • Time-limited download links for private content (invoices, reports, purchased media)
  • Sharing individual objects with external partners without creating IAM users or making the bucket public

Security best practice: Pre-signed URLs inherit the permissions of the IAM user/role that created them. If that user’s permissions are revoked, outstanding pre-signed URLs stop working immediately. Always use the shortest practical expiration time.

Versioning

S3 versioning keeps every version of every object. Once enabled on a bucket, it cannot be disabled — only suspended (new objects won’t get versions, but existing versions persist).

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket my-bucket \
  --versioning-configuration Status=Enabled

# Upload same key twice:
aws s3 cp v1.txt s3://my-bucket/config.txt    # VersionId: "aaa111"
aws s3 cp v2.txt s3://my-bucket/config.txt    # VersionId: "bbb222"

# List all versions:
aws s3api list-object-versions --bucket my-bucket --prefix config.txt
# {
#   "Versions": [
#     { "Key": "config.txt", "VersionId": "bbb222", "IsLatest": true,  "Size": 1024 },
#     { "Key": "config.txt", "VersionId": "aaa111", "IsLatest": false, "Size": 512  }
#   ]
# }

# Get a specific version:
aws s3api get-object --bucket my-bucket --key config.txt \
  --version-id "aaa111" old-config.txt

# "Delete" an object (just adds a delete marker):
aws s3 rm s3://my-bucket/config.txt
# VersionId: "ccc333" (this is a delete marker)

# Object appears deleted, but old versions still exist!
# To truly delete, specify the version:
aws s3api delete-object --bucket my-bucket --key config.txt \
  --version-id "aaa111"   # permanently deletes this version

Versioning costs: Each version is a full copy, billed at the same rate. A 1 GB file overwritten 100 times = 100 GB stored. Use lifecycle policies to clean up old versions.
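To see what old versions are actually costing, here is a rough boto3 sketch (bucket and prefix hypothetical) that separates live bytes from noncurrent ones:

# Sum current vs. noncurrent version sizes under a prefix and estimate the monthly bill.
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_object_versions')

current_bytes, noncurrent_bytes = 0, 0
for page in paginator.paginate(Bucket='my-bucket', Prefix='config/'):
    for v in page.get('Versions', []):
        if v['IsLatest']:
            current_bytes += v['Size']
        else:
            noncurrent_bytes += v['Size']

print(f"live data:    {current_bytes / 1e9:.2f} GB")
print(f"old versions: {noncurrent_bytes / 1e9:.2f} GB "
      f"(~${noncurrent_bytes / 1e9 * 0.023:.2f}/mo at S3 Standard rates)")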

Lifecycle Policies

Lifecycle rules automate the transition and expiration of objects — the backbone of cost optimization in any S3-heavy architecture.

// lifecycle-rules.json — applied to bucket with:
// aws s3api put-bucket-lifecycle-configuration \
//   --bucket my-data-lake --lifecycle-configuration file://lifecycle-rules.json
{
  "Rules": [
    {
      "ID": "hot-to-warm-after-30d",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 2555    // delete after 7 years (compliance)
      }
    },
    {
      "ID": "cleanup-incomplete-uploads",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    },
    {
      "ID": "expire-old-versions",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 365
      }
    },
    {
      "ID": "delete-expired-markers",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": {
        "ExpiredObjectDeleteMarker": true
      }
    }
  ]
}
Hidden cost trap: Incomplete multipart uploads are invisible in the S3 console but still incur storage charges. The AbortIncompleteMultipartUpload rule above is essential — we’ve seen companies paying thousands per month for orphaned multipart fragments they didn’t know existed.
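To check whether you are already paying for this, enumerate the in-progress multipart uploads and abort the stale ones. A boto3 sketch (bucket name hypothetical, 7-day cutoff arbitrary):

# Find and abort multipart uploads that were started more than 7 days ago.
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client('s3')
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

for page in s3.get_paginator('list_multipart_uploads').paginate(Bucket='my-data-lake'):
    for upload in page.get('Uploads', []):
        if upload['Initiated'] < cutoff:
            print(f"aborting stale upload: {upload['Key']} (started {upload['Initiated']})")
            s3.abort_multipart_upload(Bucket='my-data-lake',
                                      Key=upload['Key'],
                                      UploadId=upload['UploadId'])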

Event Notifications (S3 → SNS/SQS/Lambda)

S3 can emit events when objects are created, deleted, restored, or replicated. This is the foundation of event-driven architectures built on object storage.

Event Types

# Supported events:
s3:ObjectCreated:*          # any create (Put, Post, Copy, CompleteMultipartUpload)
s3:ObjectCreated:Put        # specific PUT
s3:ObjectCreated:Post
s3:ObjectCreated:Copy
s3:ObjectCreated:CompleteMultipartUpload
s3:ObjectRemoved:*          # any delete
s3:ObjectRemoved:Delete
s3:ObjectRemoved:DeleteMarkerCreated
s3:ObjectRestore:Post       # Glacier restore initiated
s3:ObjectRestore:Completed  # Glacier restore completed
s3:Replication:*            # cross-region replication events
s3:LifecycleTransition      # object transitioned between storage classes
s3:IntelligentTiering       # automatic tier change

Notification Targets & Patterns

# Notification configuration (via AWS CLI)
aws s3api put-bucket-notification-configuration \
  --bucket media-uploads \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [
      {
        "Id": "thumbnail-generator",
        "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456:function:gen-thumb",
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {
          "Key": {
            "FilterRules": [
              {"Name": "prefix", "Value": "uploads/images/"},
              {"Name": "suffix", "Value": ".jpg"}
            ]
          }
        }
      }
    ],
    "QueueConfigurations": [
      {
        "Id": "transcode-queue",
        "QueueArn": "arn:aws:sqs:us-east-1:123456:video-transcode-queue",
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {
          "Key": {
            "FilterRules": [
              {"Name": "prefix", "Value": "uploads/video/"}
            ]
          }
        }
      }
    ],
    "TopicConfigurations": [
      {
        "Id": "audit-trail",
        "TopicArn": "arn:aws:sns:us-east-1:123456:s3-audit-topic",
        "Events": ["s3:ObjectRemoved:*"]
      }
    ]
  }'

Event-driven pipeline example:

S3 Upload
User uploads video to uploads/video/clip.mp4
↓ s3:ObjectCreated:Put
SQS Queue
Decouples, buffers burst uploads, provides retry
↓ poll
Lambda / ECS Worker
Transcodes to HLS (720p, 1080p, 4K)
↓ PUT
S3 Output
Processed files to processed/video/clip/
↓ s3:ObjectCreated
SNS → CDN Invalidation
Notify CDN + update database catalog
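A minimal handler for the thumbnail Lambda configured earlier might look like the sketch below; the make_thumbnail helper is a hypothetical stand-in for real image processing (e.g. Pillow):

# Lambda handler for s3:ObjectCreated:* notifications (direct S3 → Lambda wiring).
import urllib.parse
import boto3

s3 = boto3.client('s3')

def make_thumbnail(image_bytes: bytes) -> bytes:
    # Placeholder: a real implementation would decode, resize, and re-encode the image.
    return image_bytes

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in the event (spaces become '+'), so decode first:
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        image_bytes = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        s3.put_object(
            Bucket=bucket,
            Key=f"thumbnails/{key.split('/')[-1]}",
            Body=make_thumbnail(image_bytes),
            ContentType='image/jpeg',
        )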
EventBridge integration (preferred for new designs): Since 2021, S3 can send events to Amazon EventBridge, which supports content-based filtering, multiple targets per event, replay, and archive. EventBridge is now the recommended approach over direct SNS/SQS/Lambda notifications for complex event routing.

Content-Addressed Storage

Content-addressed storage (CAS) identifies objects by the cryptographic hash of their content rather than a user-assigned name. This creates a natural deduplication layer and provides integrity verification “for free.”

# Content-addressed key = hash of the content
import hashlib

def store_content_addressed(s3_client, bucket, data: bytes) -> str:
    """Store data using SHA-256 hash as the key."""
    content_hash = hashlib.sha256(data).hexdigest()
    key = f"cas/{content_hash[:2]}/{content_hash[2:4]}/{content_hash}"
    # ↑ Two-level prefix to avoid hot partitions

    # Check if already exists (dedup!)
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return key  # already stored, skip upload
    except s3_client.exceptions.ClientError:
        pass  # doesn't exist, upload it

    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=data,
        ContentType='application/octet-stream',
        Metadata={
            'content-hash': content_hash,
            'hash-algorithm': 'sha256'
        }
    )
    return key

# Usage:
key = store_content_addressed(s3, 'my-cas-bucket', video_bytes)
# key = "cas/a3/b1/a3b1c2d3e4f5...64-char-hex"

# Verification on read:
obj = s3.get_object(Bucket='my-cas-bucket', Key=key)
data = obj['Body'].read()
assert hashlib.sha256(data).hexdigest() == key.split('/')[-1]
# ↑ integrity guaranteed — if hash matches, data is uncorrupted

Where CAS is used:
  • Git: every blob, tree, and commit is addressed by the hash of its content
  • Container registries: Docker/OCI image layers are pulled and deduplicated by digest
  • Backup and deduplication tools (restic, Borg): chunks are stored once per unique hash
  • Data lakes and artifact caches on S3 that want deduplication and integrity checks for free

S3’s ETag is almost CAS: For single-part uploads, the ETag is the MD5 hash of the content. For multipart uploads, it’s the MD5 of the concatenated part MD5s, suffixed with -N (number of parts). The x-amz-checksum-sha256 header (added in 2022) provides true content addressing.
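For illustration, that multipart ETag can be reproduced client-side to verify an upload end to end. A sketch, assuming every part except possibly the last used the same fixed part size:

# Compute the expected S3 ETag for single-part and multipart uploads.
import hashlib

def expected_etag(data: bytes, part_size: int) -> str:
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        return hashlib.md5(data).hexdigest()                    # single-part: plain MD5
    combined = b''.join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(combined).hexdigest()}-{len(parts)}"  # MD5-of-MD5s, "-N" suffix

blob = b'x' * (12 * 1024 * 1024)                                # 12 MB object
print(expected_etag(blob, part_size=5 * 1024 * 1024))           # ends with "-3" (3 parts)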

S3 API: Key Operations

S3’s REST API is the de facto standard for object storage — virtually every cloud provider and open-source alternative implements this API.

# ──────── CRUD Operations ────────

# PUT Object
PUT /my-key HTTP/1.1
Host: my-bucket.s3.amazonaws.com
Content-Type: application/json
Content-Length: 1024
x-amz-storage-class: STANDARD_IA
x-amz-server-side-encryption: AES256
x-amz-meta-custom-field: my-value
Authorization: AWS4-HMAC-SHA256 Credential=...

{"data": "..."}

# GET Object
GET /my-key HTTP/1.1
Host: my-bucket.s3.amazonaws.com
Range: bytes=0-1048575    # partial read (first 1 MB)

# HEAD Object (metadata only, no data transfer)
HEAD /my-key HTTP/1.1
Host: my-bucket.s3.amazonaws.com
# Returns: Content-Length, Content-Type, ETag, Last-Modified, x-amz-meta-*

# DELETE Object
DELETE /my-key HTTP/1.1
Host: my-bucket.s3.amazonaws.com

# ──────── LIST Operations ────────

# List objects (v2, paginated)
GET /?list-type=2&prefix=photos/2024/&delimiter=/&max-keys=1000 HTTP/1.1
Host: my-bucket.s3.amazonaws.com
# Returns: CommonPrefixes (simulated "directories") + Contents (objects)

# ──────── COPY Object (server-side, no download) ────────
PUT /destination-key HTTP/1.1
Host: dest-bucket.s3.amazonaws.com
x-amz-copy-source: source-bucket/source-key

# ──────── Batch Operations ────────
# S3 Batch Operations can process billions of objects:
# - Copy objects between buckets
# - Set tags, ACLs, or metadata
# - Invoke Lambda per object
# - Restore from Glacier

Performance Limits

| Metric | Limit | Notes |
|---|---|---|
| GET requests per prefix | 5,500/sec | Per prefix per partition |
| PUT/POST/DELETE per prefix | 3,500/sec | Per prefix per partition |
| Max object size | 5 TB | Via multipart upload |
| Max single PUT size | 5 GB | Use multipart for larger |
| Max metadata per object | 2 KB | User-defined key-value pairs |
| Max parts per multipart | 10,000 | Part size: 5 MB–5 GB |
| Max buckets per account | 100 (soft) | Raisable to 1,000 |
| Max objects per bucket | Unlimited | Billions in production |
Scaling past 5,500 GET/s: S3 automatically partitions your bucket by key prefix. To maximize throughput, use high-cardinality prefixes (e.g., hash-based keys like a3b1/data.json instead of sequential keys like 2024-03-15/data.json). Since 2018, S3 handles this automatically for most workloads — the “randomize prefix” advice is largely outdated but still relevant for extreme throughput.

MinIO: Self-Hosted S3-Compatible Storage

MinIO is the leading open-source, S3-compatible object storage server. It implements the full S3 API, meaning any application built for S3 can switch to MinIO with zero code changes — just change the endpoint URL.

Architecture

# MinIO can run as a single binary or distributed cluster

# Single-node (development):
minio server /data --console-address ":9001"
# Exposes S3 API on :9000, web console on :9001

# Distributed mode (production) — 4 nodes × 4 disks each = 16 drives:
# Node 1:
minio server http://node{1...4}:9000/mnt/disk{1...4}/data

# This creates an erasure-coded cluster:
# - RS(8,8) by default: 8 data + 8 parity shards per object
# - Tolerates loss of up to 8 drives (half the cluster)
# - Storage efficiency: 50% usable (compared to 33% with 3× replication)

# Larger clusters — 16 nodes × 4 disks = 64 drives:
minio server http://node{1...16}:9000/mnt/disk{1...4}/data
# MinIO groups drives into erasure sets of at most 16 drives, so 64 drives form
# 4 independent erasure sets, each coded like the 16-drive example above

Usage with Standard S3 SDKs

# Python — boto3 works with MinIO by changing endpoint only:
import boto3

s3 = boto3.client('s3',
    endpoint_url='http://minio.internal:9000',
    aws_access_key_id='minioadmin',
    aws_secret_access_key='minioadmin',
    region_name='us-east-1'  # required but not used
)

# All standard S3 operations work:
s3.create_bucket(Bucket='my-app-data')
s3.put_object(Bucket='my-app-data', Key='users/123.json', Body=b'{"name":"Alice"}')
obj = s3.get_object(Bucket='my-app-data', Key='users/123.json')
print(obj['Body'].read())  # b'{"name":"Alice"}'

# Go — MinIO's own SDK:
// import "github.com/minio/minio-go/v7"
client, _ := minio.New("minio.internal:9000", &minio.Options{
    Creds:  credentials.NewStaticV4("minioadmin", "minioadmin", ""),
    Secure: false,
})
client.PutObject(ctx, "my-app-data", "users/123.json",
    strings.NewReader(`{"name":"Alice"}`), -1,
    minio.PutObjectOptions{ContentType: "application/json"})

# Kubernetes deployment (Helm):
helm repo add minio https://charts.min.io/
helm install minio minio/minio \
  --set replicas=4 \
  --set persistence.size=1Ti \
  --set resources.requests.memory=16Gi

When to use MinIO over S3:
  • On-premises or data-residency requirements where data cannot leave your own infrastructure
  • Edge, air-gapped, or Kubernetes-native deployments where you control the hardware
  • Read-heavy internal workloads where cloud egress and request charges would dominate
  • Local development and CI environments that need a disposable S3-compatible endpoint

Design Patterns & Best Practices

Key Design Patterns

# ❌ Bad: Sequential keys create hot partitions
2024-03-15-000001.json
2024-03-15-000002.json
2024-03-15-000003.json
# All requests hit the same partition → throttling at high throughput

# ✅ Good: Hash-prefixed keys distribute across partitions
a3b1/2024-03-15-000001.json    # hash of timestamp or UUID prefix
7f2e/2024-03-15-000002.json
c9d4/2024-03-15-000003.json

# ✅ Better: Use natural high-cardinality prefixes
users/a3b1c2d3/profile.json   # user ID as prefix
events/2024/03/15/14/30/evt-uuid.json   # hour-level partitioning

# ✅ Best (modern S3): Use random UUIDs as keys
550e8400-e29b-41d4-a716-446655440000.parquet
# S3 auto-partitions since 2018, but UUIDs still help at extreme scale

Security Patterns

# 1. Bucket policy: deny unencrypted uploads
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    },
    {
      "Sid": "DenyHTTP",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
      "Condition": {
        "Bool": { "aws:SecureTransport": "false" }
      }
    }
  ]
}

# 2. Enable S3 Block Public Access (account-level)
aws s3control put-public-access-block \
  --account-id 123456789012 \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

# 3. Enable access logging
aws s3api put-bucket-logging --bucket my-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "my-access-logs",
      "TargetPrefix": "s3-logs/my-bucket/"
    }
  }'

Summary