High Level Design Series · Data Storage · Part 8 · Post 28 of 70

Object Storage (S3 Architecture)

Object vs File vs Block Storage

Before diving into S3, let’s understand the three fundamental storage paradigms that underpin modern infrastructure. Each makes a fundamentally different trade-off among access patterns, performance, and scalability.

| Dimension | Block Storage | File Storage | Object Storage |
|---|---|---|---|
| Unit | Fixed-size blocks (512B–4KB) | Files in directories (hierarchical) | Objects in flat namespace (bucket + key) |
| Access | Raw block I/O (iSCSI, FC) | POSIX file APIs (NFS, SMB) | HTTP REST API (PUT/GET/DELETE) |
| Metadata | Minimal (block address only) | File system metadata (permissions, timestamps) | Rich, custom key-value metadata per object |
| Mutability | In-place updates (random writes) | In-place updates (seek + write) | Immutable — replace entire object |
| Scalability | Limited to single volume (~16 TB) | Limited by NAS controller (~PBs with clustering) | Virtually unlimited (exabytes+) |
| Performance | Lowest latency (<1ms SSD) | Low latency (1–10ms NFS) | Higher latency (50–200ms first byte) |
| Durability | Depends on RAID/replication | Depends on NAS redundancy | Designed for 99.999999999% (11 nines) |
| Cost (per GB/mo) | $0.08–0.10 (EBS gp3) | $0.03–0.30 (EFS) | $0.023 (S3 Standard) |
| Best For | Databases, boot volumes, VMs | Shared file systems, home directories, CMS | Backups, media, data lakes, archives |
| Examples | AWS EBS, Azure Disk, GCP PD | AWS EFS, Azure Files, GCP Filestore | AWS S3, Azure Blob, GCP Cloud Storage, MinIO |
Key insight: Object storage sacrifices random-write and low-latency access in exchange for virtually unlimited scale, extreme durability, and the cheapest per-GB cost. This is why it dominates for unstructured data — which accounts for over 80% of enterprise data.

Why Object Storage Dominates at Scale

The flat namespace is the secret weapon. File systems maintain a hierarchical directory tree — every mkdir, rename, or ls must traverse and lock parts of this tree. As the tree grows to billions of files, metadata operations become the bottleneck (ask anyone who’s run ls on a directory with 10 million files).

Object storage eliminates this by treating each object as an independent entity addressed by a flat key. The “directory structure” you see in the S3 console (photos/2024/vacation/img001.jpg) is a visual illusion — the forward slashes are just characters in a flat string key. This means there is no tree to traverse or lock: listing a “folder” is simply a prefix query, and “renaming a folder” means copying every object under that prefix, as the sketch below shows.
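To make the flat namespace concrete, here is a minimal boto3 sketch (bucket name and keys are hypothetical): the “folders” only appear when a listing groups keys by a delimiter.

# Three independent objects; no directories are created anywhere:
import boto3

s3 = boto3.client('s3')
for key in ['photos/2024/vacation/img001.jpg',
            'photos/2024/vacation/img002.jpg',
            'photos/2024/work/badge.jpg']:
    s3.put_object(Bucket='my-photo-bucket', Key=key, Body=b'...')

# Asking S3 to group by '/' simulates directories out of key prefixes:
resp = s3.list_objects_v2(Bucket='my-photo-bucket', Prefix='photos/2024/', Delimiter='/')
print([p['Prefix'] for p in resp.get('CommonPrefixes', [])])
# ['photos/2024/vacation/', 'photos/2024/work/']

# "Renaming a folder" is therefore not a metadata operation: it is a copy + delete per object.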

S3 Architecture Deep Dive

Amazon S3 (Simple Storage Service), launched in 2006, stores over 350 trillion objects and handles millions of requests per second. Let’s dissect its architecture layer by layer.

Core Concepts: Buckets, Objects, Keys

Buckets are the top-level containers. Each bucket name is globally unique across all of AWS (because bucket names become part of the DNS hostname). You get up to 100 buckets per account (soft limit, can be raised to 1000).

# Bucket naming rules:
# - 3–63 characters, lowercase letters, numbers, hyphens
# - Globally unique across ALL AWS accounts
# - Cannot look like an IP address (e.g. 192.168.1.1)

# S3 URL formats:
# Path-style:   https://s3.amazonaws.com/my-bucket/photos/cat.jpg
# Virtual-host: https://my-bucket.s3.amazonaws.com/photos/cat.jpg  (preferred)
# Region:       https://my-bucket.s3.us-west-2.amazonaws.com/photos/cat.jpg

Objects are the actual data entities. Each object consists of a key (its full, path-like name within the bucket), the data itself (up to 5 TB), system metadata (size, ETag, last-modified, storage class), optional user-defined metadata, and a version ID when versioning is enabled.

# PUT an object with custom metadata
aws s3api put-object \
  --bucket my-data-lake \
  --key "raw/events/2024/03/15/events-001.parquet" \
  --body events-001.parquet \
  --content-type "application/octet-stream" \
  --metadata '{"source":"kafka-cluster-1","partition":"7","offset":"142857"}' \
  --storage-class STANDARD \
  --server-side-encryption "aws:kms" \
  --ssekms-key-id "arn:aws:kms:us-east-1:123456:key/abc-123"

# Response:
# {
#     "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
#     "VersionId": "3sL4kqtJlcpXroDTDmJ+rmSpXd3dIbrHY+MTRCxf3vjVBH40Nr8X8gdRQBpUMLUo",
#     "ServerSideEncryption": "aws:kms"
# }

Internal Architecture

While AWS doesn’t publish all internal details, we know from published papers and conference talks that S3 is built on several internal subsystems:

Front-End Layer: REST API, authentication, request routing, rate limiting
Index Layer (Metadata): object key → storage location mapping (distributed key-value store)
Placement Layer: decides which storage nodes & disks receive data chunks
Storage Layer: physical disks organized into storage nodes, erasure-coded chunks

Request flow for a PUT:

  1. Front-End authenticates the request (SigV4), validates the bucket & key, checks IAM policies and bucket policies.
  2. Index Layer reserves a new entry in the metadata store, mapping the key to a set of storage locations.
  3. Placement Layer selects target storage nodes based on fault domains (spread across racks, power zones, AZs).
  4. Storage Layer receives the data, splits it into chunks, applies erasure coding, and writes coded fragments to disks.
  5. Once a quorum of fragments is durably written, the index is committed and the client receives a 200 OK.

Consistency Model

S3’s consistency story is one of the most important evolutions in cloud storage history.

The Old Model: Eventual Consistency (2006–2020)

For its first 14 years, S3 had a mixed consistency model:

# The classic gotcha (pre-Dec 2020):
PUT s3://bucket/config.json  (version 2)     # overwrite existing object
GET s3://bucket/config.json                   # might return version 1!
# ... seconds later ...
GET s3://bucket/config.json                   # now returns version 2

# Even worse — the "negative cache" bug:
GET s3://bucket/new-file.txt    → 404         # object doesn't exist yet
PUT s3://bucket/new-file.txt    → 200         # create it
GET s3://bucket/new-file.txt    → 404!        # cached 404 haunts you

This caused real production bugs: data pipelines reading stale files, CI/CD systems failing intermittently, config deployments appearing to silently fail.

The New Model: Strong Read-After-Write (Dec 2020)

In December 2020, AWS announced that S3 now delivers strong read-after-write consistency for all operations — PUTs, DELETEs, and LIST — at no additional cost and with no performance penalty.

# Post Dec 2020 — guaranteed behavior:
PUT s3://bucket/config.json   → 200           # overwrite
GET s3://bucket/config.json   → returns new version (guaranteed)

DELETE s3://bucket/old.txt    → 204
GET s3://bucket/old.txt       → 404 (guaranteed, no stale reads)

PUT s3://bucket/new.txt       → 200
LIST s3://bucket/             → includes new.txt (guaranteed)
How did they achieve this? AWS redesigned the S3 metadata subsystem. The new system uses a replication protocol that ensures all metadata nodes see the same view before acknowledging a write. Think of it as a witness protocol — before returning success, the system confirms that all subsequent reads from any node will see the new state. This replaced an eventually-consistent cache layer without sacrificing throughput.
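In application code, this means the defensive sleep-and-retry loops the old model forced on you can go away. A minimal boto3 sketch (bucket name hypothetical) that leans on the guarantee:

# Write, then immediately read and list; all of it is guaranteed to see the new state.
import boto3

s3 = boto3.client('s3')
s3.put_object(Bucket='my-bucket', Key='config.json', Body=b'{"version": 2}')

body = s3.get_object(Bucket='my-bucket', Key='config.json')['Body'].read()
assert body == b'{"version": 2}'          # no stale read possible

listed = s3.list_objects_v2(Bucket='my-bucket', Prefix='config')
assert any(o['Key'] == 'config.json' for o in listed.get('Contents', []))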

Erasure Coding & 11-Nines Durability

S3 promises 99.999999999% (11 nines) annual durability. This means if you store 10 million objects, you can statistically expect to lose a single object once every 10,000 years. How?

Erasure Coding Fundamentals

Instead of simple replication (3 copies = 3× storage overhead), S3 uses erasure coding — a mathematical technique from coding theory that achieves the same or better durability with far less storage overhead.

The core idea: take k data chunks and produce m parity chunks, for a total of n = k + m chunks. You can reconstruct the original data from any k of the n chunks. This means you can tolerate the loss of up to m chunks.

Reed-Solomon (k, m) erasure code:
  k = number of data chunks
  m = number of parity chunks
  n = k + m = total chunks stored
  Storage overhead = n / k = (k + m) / k

Example: RS(10, 6) → 16 total chunks
  - Can lose ANY 6 chunks and still recover
  - Storage overhead = 16/10 = 1.6×
  - Compare: 3× replication needs 3.0× storage for same durability

Durability Mathematics

Let’s derive the 11-nines figure. Assume:

P(single fragment lost in a year) = AFR = 0.02
P(data loss) = P(more than 6 fragments lost simultaneously)
             = P(≥7 out of 16 fragments fail before repair)

Using binomial distribution:
  P(exactly i failures out of 16) = C(16,i) × 0.02^i × 0.98^(16-i)
  P(≥7 failures) = Σ(i=7 to 16) C(16,i) × 0.02^i × 0.98^(16-i)

Dominant term (i=7):
  C(16,7) × 0.02^7 × 0.98^9 = 11440 × 1.28×10⁻¹² × 0.834 = 1.22 × 10⁻⁸
Adding remaining terms: P(≥7) ≈ 1.22 × 10⁻⁸

But this assumes no repair! With continuous repair (rebuild time ~6 hours):
  P(data loss) drops to ≈ 10⁻¹⁴ to 10⁻¹⁵
  Durability = 1 - P(data loss) ≈ 99.9999999999999%
  = well beyond 11 nines (more like 14-15 nines with repair)

AWS conservatively quotes 11 nines to account for:
  - Correlated failures (power outage takes whole rack)
  - Software bugs (bit rot, firmware issues)
  - Human error (operational mistakes)
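The no-repair bound above takes a few lines of Python to reproduce, assuming the same 2% annual fragment-failure rate and the RS(10, 6) layout:

# Binomial estimate of annual data-loss probability for RS(k, m) with no repair.
from math import comb

def loss_probability(k: int, m: int, afr: float) -> float:
    """P(more than m of the n = k + m fragments fail), i.e. data becomes unrecoverable."""
    n = k + m
    return sum(comb(n, i) * afr**i * (1 - afr)**(n - i) for i in range(m + 1, n + 1))

p = loss_probability(k=10, m=6, afr=0.02)
print(f"P(data loss, no repair) ≈ {p:.2e}")      # ≈ 1.25e-08
print(f"durability ≈ {(1 - p) * 100:.8f}%")      # ≈ 99.99999875% (repair pushes this far higher)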

Storage efficiency comparison:

| Scheme | Chunks | Tolerated Failures | Storage Overhead | Approx. Durability |
|---|---|---|---|---|
| Simple replication (3 copies) | 3 data | 2 | 3.0× | 99.9999% (6 nines) |
| RS(4, 2) | 4+2 = 6 | 2 | 1.5× | 99.99999% (7 nines) |
| RS(6, 3) | 6+3 = 9 | 3 | 1.5× | 99.999999% (8 nines) |
| RS(10, 6) | 10+6 = 16 | 6 | 1.6× | 99.999999999%+ (11+ nines) |
| RS(16, 4) | 16+4 = 20 | 4 | 1.25× | 99.9999999% (9 nines) |
Why erasure coding instead of more replicas? For 1 PB of data: 3× replication costs $69,000/mo (3 PB at $0.023/GB). RS(10,6) at 1.6× costs only $36,800/mo (1.6 PB). That’s $32,200/mo saved — nearly $400K per year — while achieving better durability.

▶ Object Storage Architecture — Erasure Coding Flow

Watch how a client upload is split into data chunks, encoded with parity, and distributed across storage nodes. See how the system survives node failures.

Storage Classes

S3 offers a tiered storage model that lets you optimize cost based on access frequency. Each class differs in pricing, retrieval latency, minimum storage duration, and availability SLA.

| Storage Class | Use Case | $/GB/mo | Retrieval | Min Duration | Availability |
|---|---|---|---|---|---|
| S3 Standard | Frequently accessed data | $0.023 | Instant (ms) | None | 99.99% |
| S3 Intelligent-Tiering | Unknown/changing access patterns | $0.023–0.004 | Instant (ms) | None | 99.9% |
| S3 Standard-IA | Infrequent access, rapid retrieval | $0.0125 | Instant (ms) | 30 days | 99.9% |
| S3 One Zone-IA | Re-creatable infrequent data | $0.01 | Instant (ms) | 30 days | 99.5% |
| S3 Glacier Instant | Archive with instant access | $0.004 | Instant (ms) | 90 days | 99.9% |
| S3 Glacier Flexible | Archive, minutes-to-hours retrieval | $0.0036 | 1–12 hours | 90 days | 99.9% |
| S3 Glacier Deep Archive | Long-term archive, rare access | $0.00099 | 12–48 hours | 180 days | 99.9% |

Real-World Cost Calculation

Consider a media company storing 500 TB of video assets with different access patterns:

Scenario: 500 TB video library

Hot content (5% = 25 TB):     S3 Standard          → 25,000 GB  × $0.023   = $575/mo
Warm content (15% = 75 TB):   S3 Standard-IA       → 75,000 GB  × $0.0125  = $937.50/mo
Cold content (30% = 150 TB):  Glacier Flexible     → 150,000 GB × $0.0036  = $540/mo
Frozen (50% = 250 TB):        Glacier Deep Archive → 250,000 GB × $0.00099 = $247.50/mo

Total: $2,300/month for 500 TB
────────────────────────────────────────
If everything was S3 Standard: 500,000 GB × $0.023 = $11,500/mo
Savings: $9,200/month = $110,400/year (80% reduction!)

Plus retrieval costs (GET requests + data transfer):
  Standard:         $0.0004 per 1K GET requests (free retrieval)
  Standard-IA:      $0.001 per 1K GET + $0.01/GB retrieval
  Glacier Flexible: $0.0004 per 1K GET + $0.03/GB retrieval (standard tier)
  Deep Archive:     $0.0004 per 1K GET + $0.02/GB retrieval (standard tier)
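The same arithmetic as a tiny script (storage charges only; the retrieval and request pricing listed above is excluded):

# Monthly storage cost of the tiered layout vs. keeping everything in S3 Standard.
TIERS = {
    "S3 Standard":          (25_000,  0.023),     # GB, $/GB/mo
    "S3 Standard-IA":       (75_000,  0.0125),
    "Glacier Flexible":     (150_000, 0.0036),
    "Glacier Deep Archive": (250_000, 0.00099),
}

tiered = sum(gb * price for gb, price in TIERS.values())
all_standard = 500_000 * 0.023

print(f"tiered:       ${tiered:,.2f}/mo")                      # $2,300.00/mo
print(f"all standard: ${all_standard:,.2f}/mo")                # $11,500.00/mo
print(f"savings:      ${all_standard - tiered:,.2f}/mo ({1 - tiered / all_standard:.0%})")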

▶ Storage Class Lifecycle — Cost vs Access Trade-Off

See how an object transitions through storage tiers over time, trading access speed for lower cost at each stage.

Multipart Upload

For objects larger than about 100 MB (and required for anything over 5 GB), S3 provides multipart upload — a three-phase protocol that dramatically improves reliability and throughput for large objects.

How It Works

  1. Initiate — create a multipart upload session, receive an UploadId
  2. Upload Parts — upload each part (5 MB–5 GB each, up to 10,000 parts), in parallel if desired
  3. Complete — send the list of part numbers and ETags to finalize the object
# Phase 1: Initiate multipart upload
aws s3api create-multipart-upload \
  --bucket my-data-lake \
  --key "backups/db-snapshot-2024-03-15.tar.gz" \
  --storage-class STANDARD_IA \
  --server-side-encryption "aws:kms"

# Response: { "UploadId": "abc123...", "Bucket": "my-data-lake", "Key": "..." }

# Phase 2: Upload parts (can be parallel!)
# Split a 10 GB file into 100 MB parts:
split -b 100M -d -a 3 db-snapshot.tar.gz part-   # numeric suffixes: part-000 ... part-099

# Upload each part (can run in parallel with GNU parallel or xargs):
for i in $(seq 1 100); do
  aws s3api upload-part \
    --bucket my-data-lake \
    --key "backups/db-snapshot-2024-03-15.tar.gz" \
    --upload-id "abc123..." \
    --part-number $i \
    --body "part-$(printf '%02d' $i)"
done
# Each returns: { "ETag": "\"etag-hash-here\"" }

# Phase 3: Complete multipart upload
aws s3api complete-multipart-upload \
  --bucket my-data-lake \
  --key "backups/db-snapshot-2024-03-15.tar.gz" \
  --upload-id "abc123..." \
  --multipart-upload '{
    "Parts": [
      {"PartNumber": 1, "ETag": "\"etag1\""},
      {"PartNumber": 2, "ETag": "\"etag2\""},
      ...
      {"PartNumber": 100, "ETag": "\"etag100\""}
    ]
  }'
Why multipart matters:
  • Resilience — if one part fails, retry just that part (not the entire 10 GB upload)
  • Parallelism — upload 8 parts simultaneously to saturate your bandwidth
  • Pause/Resume — upload over hours or days; uploaded parts persist until you complete or abort the upload (pair this with a lifecycle rule that aborts stale uploads)
  • Throughput — smaller parts = more TCP connections = higher aggregate bandwidth
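In practice you rarely drive the three phases by hand: boto3’s managed transfer layer performs the split, the parallel part uploads, per-part retries, and the final completion for you. A sketch, with the file name and bucket as hypothetical placeholders:

# Managed multipart upload via boto3 (thresholds and concurrency are tunable).
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,   # switch to multipart above 100 MB
    multipart_chunksize=128 * 1024 * 1024,   # 128 MB parts
    max_concurrency=8,                       # 8 parts in flight at once
    use_threads=True,
)

s3.upload_file(
    Filename='db-snapshot.tar.gz',
    Bucket='my-data-lake',
    Key='backups/db-snapshot-2024-03-15.tar.gz',
    Config=config,
    ExtraArgs={'StorageClass': 'STANDARD_IA'},
)
# An interrupted run leaves an incomplete multipart upload behind; see the
# lifecycle section below for the rule that aborts stale uploads automatically.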

Optimal part size selection:

Object size     Recommended part size   Number of parts
─────────────   ─────────────────────   ───────────────
100 MB–1 GB     16 MB                   7–64
1 GB–10 GB      64 MB–128 MB            8–156
10 GB–100 GB    128 MB–512 MB           20–781
100 GB–5 TB     512 MB–1 GB             100–5000

Formula: part_size = max(5 MB, object_size / 10,000)
         parts     = ceil(object_size / part_size)
Hard limits: 10,000 parts per upload, 5 MB–5 GB per part, 5 TB per object
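The sizing rule as a small helper. The 5 MB / 5 GB / 10,000-part limits are the documented S3 constraints; note that the formula yields the minimum viable part size, while the table’s recommendations run larger for throughput:

# Pick a part size that keeps a multipart upload within S3's hard limits.
import math

MIN_PART = 5 * 1024**2          # 5 MB minimum part size
MAX_PART = 5 * 1024**3          # 5 GB maximum part size
MAX_OBJECT = 5 * 1024**4        # 5 TB single-object limit
MAX_PARTS = 10_000

def choose_part_size(object_size: int) -> tuple[int, int]:
    """Return (part_size, part_count) for the smallest part size that stays under 10,000 parts."""
    if object_size > MAX_OBJECT:
        raise ValueError("object exceeds the 5 TB single-object limit")
    part_size = max(MIN_PART, math.ceil(object_size / MAX_PARTS))
    return part_size, math.ceil(object_size / part_size)

print(choose_part_size(10 * 1024**3))   # 10 GB → (5 MB parts, 2048 of them)
print(choose_part_size(2 * 1024**4))    # 2 TB  → (~210 MB parts, 10,000 of them)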

Pre-Signed URLs

Pre-signed URLs allow you to grant temporary, scoped access to private S3 objects without exposing your AWS credentials or making the bucket public. The URL itself contains a cryptographic signature.

# Generate a pre-signed URL for downloading (GET)
aws s3 presign s3://my-bucket/reports/q1-2024.pdf \
  --expires-in 3600   # 1 hour

# Output:
# https://my-bucket.s3.amazonaws.com/reports/q1-2024.pdf
#   ?X-Amz-Algorithm=AWS4-HMAC-SHA256
#   &X-Amz-Credential=AKIA.../20240315/us-east-1/s3/aws4_request
#   &X-Amz-Date=20240315T120000Z
#   &X-Amz-Expires=3600
#   &X-Amz-SignedHeaders=host
#   &X-Amz-Signature=abc123...

# Generate a pre-signed URL for uploading (PUT)
import boto3

s3 = boto3.client('s3', region_name='us-east-1')
url = s3.generate_presigned_url(
    'put_object',
    Params={
        'Bucket': 'user-uploads',
        'Key': f'avatars/{user_id}.jpg',
        'ContentType': 'image/jpeg',
        'ContentLength': 5242880,  # pins the exact Content-Length; use a presigned POST with conditions to enforce a size range
    },
    ExpiresIn=900  # 15 minutes
)

# Client uploads directly to S3, bypassing your server:
# curl -X PUT -H "Content-Type: image/jpeg" \
#   --data-binary @avatar.jpg "$PRESIGNED_URL"

Common patterns for pre-signed URLs:
  • Direct-to-S3 browser uploads: clients PUT files straight to S3, so large payloads never pass through your application servers
  • Time-limited download links for private content (invoices, reports, purchased media)
  • Sharing individual objects with external partners without creating IAM users or making the bucket public

Security best practice: Pre-signed URLs inherit the permissions of the IAM user/role that created them. If that user’s permissions are revoked, outstanding pre-signed URLs stop working immediately. Always use the shortest practical expiration time.

Versioning

S3 versioning keeps every version of every object. Once enabled on a bucket, it cannot be disabled — only suspended (new objects won’t get versions, but existing versions persist).

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket my-bucket \
  --versioning-configuration Status=Enabled

# Upload same key twice:
aws s3 cp v1.txt s3://my-bucket/config.txt    # VersionId: "aaa111"
aws s3 cp v2.txt s3://my-bucket/config.txt    # VersionId: "bbb222"

# List all versions:
aws s3api list-object-versions --bucket my-bucket --prefix config.txt
# {
#   "Versions": [
#     { "Key": "config.txt", "VersionId": "bbb222", "IsLatest": true,  "Size": 1024 },
#     { "Key": "config.txt", "VersionId": "aaa111", "IsLatest": false, "Size": 512  }
#   ]
# }

# Get a specific version:
aws s3api get-object --bucket my-bucket --key config.txt \
  --version-id "aaa111" old-config.txt

# "Delete" an object (just adds a delete marker):
aws s3 rm s3://my-bucket/config.txt
# VersionId: "ccc333" (this is a delete marker)

# Object appears deleted, but old versions still exist!
# To truly delete, specify the version:
aws s3api delete-object --bucket my-bucket --key config.txt \
  --version-id "aaa111"   # permanently deletes this version

Versioning costs: Each version is a full copy, billed at the same rate. A 1 GB file overwritten 100 times = 100 GB stored. Use lifecycle policies to clean up old versions.
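To see what old versions are actually costing, here is a rough boto3 sketch (bucket and prefix hypothetical) that separates live bytes from noncurrent ones:

# Sum current vs. noncurrent version sizes under a prefix and estimate the monthly bill.
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_object_versions')

current_bytes, noncurrent_bytes = 0, 0
for page in paginator.paginate(Bucket='my-bucket', Prefix='config/'):
    for v in page.get('Versions', []):
        if v['IsLatest']:
            current_bytes += v['Size']
        else:
            noncurrent_bytes += v['Size']

print(f"live data:    {current_bytes / 1e9:.2f} GB")
print(f"old versions: {noncurrent_bytes / 1e9:.2f} GB "
      f"(~${noncurrent_bytes / 1e9 * 0.023:.2f}/mo at S3 Standard rates)")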

Lifecycle Policies

Lifecycle rules automate the transition and expiration of objects — the backbone of cost optimization in any S3-heavy architecture.

// lifecycle-rules.json — applied to bucket with:
// aws s3api put-bucket-lifecycle-configuration \
//   --bucket my-data-lake --lifecycle-configuration file://lifecycle-rules.json
{
  "Rules": [
    {
      "ID": "hot-to-warm-after-30d",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 2555    // delete after 7 years (compliance)
      }
    },
    {
      "ID": "cleanup-incomplete-uploads",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    },
    {
      "ID": "expire-old-versions",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 365
      }
    },
    {
      "ID": "delete-expired-markers",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": {
        "ExpiredObjectDeleteMarker": true
      }
    }
  ]
}
Hidden cost trap: Incomplete multipart uploads are invisible in the S3 console but still incur storage charges. The AbortIncompleteMultipartUpload rule above is essential — we’ve seen companies paying thousands per month for orphaned multipart fragments they didn’t know existed.
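To check whether you are already paying for this, enumerate the in-progress multipart uploads and abort the stale ones. A boto3 sketch (bucket name hypothetical, 7-day cutoff arbitrary):

# Find and abort multipart uploads that were started more than 7 days ago.
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client('s3')
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

for page in s3.get_paginator('list_multipart_uploads').paginate(Bucket='my-data-lake'):
    for upload in page.get('Uploads', []):
        if upload['Initiated'] < cutoff:
            print(f"aborting stale upload: {upload['Key']} (started {upload['Initiated']})")
            s3.abort_multipart_upload(Bucket='my-data-lake',
                                      Key=upload['Key'],
                                      UploadId=upload['UploadId'])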

Event Notifications (S3 → SNS/SQS/Lambda)

S3 can emit events when objects are created, deleted, restored, or replicated. This is the foundation of event-driven architectures built on object storage.

Event Types

# Supported events:
s3:ObjectCreated:*          # any create (Put, Post, Copy, CompleteMultipartUpload)
s3:ObjectCreated:Put        # specific PUT
s3:ObjectCreated:Post
s3:ObjectCreated:Copy
s3:ObjectCreated:CompleteMultipartUpload
s3:ObjectRemoved:*          # any delete
s3:ObjectRemoved:Delete
s3:ObjectRemoved:DeleteMarkerCreated
s3:ObjectRestore:Post       # Glacier restore initiated
s3:ObjectRestore:Completed  # Glacier restore completed
s3:Replication:*            # cross-region replication events
s3:LifecycleTransition      # object transitioned between storage classes
s3:IntelligentTiering       # automatic tier change

Notification Targets & Patterns

# Notification configuration (via AWS CLI)
aws s3api put-bucket-notification-configuration \
  --bucket media-uploads \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [
      {
        "Id": "thumbnail-generator",
        "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456:function:gen-thumb",
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {
          "Key": {
            "FilterRules": [
              {"Name": "prefix", "Value": "uploads/images/"},
              {"Name": "suffix", "Value": ".jpg"}
            ]
          }
        }
      }
    ],
    "QueueConfigurations": [
      {
        "Id": "transcode-queue",
        "QueueArn": "arn:aws:sqs:us-east-1:123456:video-transcode-queue",
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {
          "Key": {
            "FilterRules": [
              {"Name": "prefix", "Value": "uploads/video/"}
            ]
          }
        }
      }
    ],
    "TopicConfigurations": [
      {
        "Id": "audit-trail",
        "TopicArn": "arn:aws:sns:us-east-1:123456:s3-audit-topic",
        "Events": ["s3:ObjectRemoved:*"]
      }
    ]
  }'

Event-driven pipeline example:

S3 Upload
User uploads video to uploads/video/clip.mp4
↓ s3:ObjectCreated:Put
SQS Queue
Decouples, buffers burst uploads, provides retry
↓ poll
Lambda / ECS Worker
Transcodes to HLS (720p, 1080p, 4K)
↓ PUT
S3 Output
Processed files to processed/video/clip/
↓ s3:ObjectCreated
SNS → CDN Invalidation
Notify CDN + update database catalog
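A minimal handler for the thumbnail Lambda configured earlier might look like the sketch below; the make_thumbnail helper is a hypothetical stand-in for real image processing (e.g. Pillow):

# Lambda handler for s3:ObjectCreated:* notifications (direct S3 → Lambda wiring).
import urllib.parse
import boto3

s3 = boto3.client('s3')

def make_thumbnail(image_bytes: bytes) -> bytes:
    # Placeholder: a real implementation would decode, resize, and re-encode the image.
    return image_bytes

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in the event (spaces become '+'), so decode first:
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        image_bytes = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        s3.put_object(
            Bucket=bucket,
            Key=f"thumbnails/{key.split('/')[-1]}",
            Body=make_thumbnail(image_bytes),
            ContentType='image/jpeg',
        )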
EventBridge integration (preferred for new designs): Since 2021, S3 can send events to Amazon EventBridge, which supports content-based filtering, multiple targets per event, replay, and archive. EventBridge is now the recommended approach over direct SNS/SQS/Lambda notifications for complex event routing.

Content-Addressed Storage

Content-addressed storage (CAS) identifies objects by the cryptographic hash of their content rather than a user-assigned name. This creates a natural deduplication layer and provides integrity verification “for free.”

# Content-addressed key = hash of the content
import hashlib

def store_content_addressed(s3_client, bucket, data: bytes) -> str:
    """Store data using SHA-256 hash as the key."""
    content_hash = hashlib.sha256(data).hexdigest()
    key = f"cas/{content_hash[:2]}/{content_hash[2:4]}/{content_hash}"
    # ↑ Two-level prefix to avoid hot partitions

    # Check if already exists (dedup!)
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return key  # already stored, skip upload
    except s3_client.exceptions.ClientError:
        pass  # doesn't exist, upload it

    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=data,
        ContentType='application/octet-stream',
        Metadata={
            'content-hash': content_hash,
            'hash-algorithm': 'sha256'
        }
    )
    return key

# Usage:
key = store_content_addressed(s3, 'my-cas-bucket', video_bytes)
# key = "cas/a3/b1/a3b1c2d3e4f5...64-char-hex"

# Verification on read:
obj = s3.get_object(Bucket='my-cas-bucket', Key=key)
data = obj['Body'].read()
assert hashlib.sha256(data).hexdigest() == key.split('/')[-1]
# ↑ integrity guaranteed — if hash matches, data is uncorrupted

Where CAS is used:
  • Git: every blob, tree, and commit is addressed by the hash of its content
  • Container registries: Docker/OCI image layers are pulled and deduplicated by digest
  • Backup and deduplication tools (restic, Borg): chunks are stored once per unique hash
  • Data lakes and artifact caches on S3 that want deduplication and integrity checks for free

S3’s ETag is almost CAS: For single-part uploads, the ETag is the MD5 hash of the content. For multipart uploads, it’s the MD5 of the concatenated part MD5s, suffixed with -N (number of parts). The x-amz-checksum-sha256 header (added in 2022) provides true content addressing.
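For illustration, that multipart ETag can be reproduced client-side to verify an upload end to end. A sketch, assuming every part except possibly the last used the same fixed part size:

# Compute the expected S3 ETag for single-part and multipart uploads.
import hashlib

def expected_etag(data: bytes, part_size: int) -> str:
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        return hashlib.md5(data).hexdigest()                    # single-part: plain MD5
    combined = b''.join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(combined).hexdigest()}-{len(parts)}"  # MD5-of-MD5s, "-N" suffix

blob = b'x' * (12 * 1024 * 1024)                                # 12 MB object
print(expected_etag(blob, part_size=5 * 1024 * 1024))           # ends with "-3" (3 parts)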

S3 API: Key Operations

S3’s REST API is the de facto standard for object storage — virtually every cloud provider and open-source alternative implements this API.

# ──────── CRUD Operations ────────

# PUT Object
PUT /my-key HTTP/1.1
Host: my-bucket.s3.amazonaws.com
Content-Type: application/json
Content-Length: 1024
x-amz-storage-class: STANDARD_IA
x-amz-server-side-encryption: AES256
x-amz-meta-custom-field: my-value
Authorization: AWS4-HMAC-SHA256 Credential=...

{"data": "..."}

# GET Object
GET /my-key HTTP/1.1
Host: my-bucket.s3.amazonaws.com
Range: bytes=0-1048575    # partial read (first 1 MB)

# HEAD Object (metadata only, no data transfer)
HEAD /my-key HTTP/1.1
Host: my-bucket.s3.amazonaws.com
# Returns: Content-Length, Content-Type, ETag, Last-Modified, x-amz-meta-*

# DELETE Object
DELETE /my-key HTTP/1.1
Host: my-bucket.s3.amazonaws.com

# ──────── LIST Operations ────────

# List objects (v2, paginated)
GET /?list-type=2&prefix=photos/2024/&delimiter=/&max-keys=1000 HTTP/1.1
Host: my-bucket.s3.amazonaws.com
# Returns: CommonPrefixes (simulated "directories") + Contents (objects)

# ──────── COPY Object (server-side, no download) ────────
PUT /destination-key HTTP/1.1
Host: dest-bucket.s3.amazonaws.com
x-amz-copy-source: source-bucket/source-key

# ──────── Batch Operations ────────
# S3 Batch Operations can process billions of objects:
# - Copy objects between buckets
# - Set tags, ACLs, or metadata
# - Invoke Lambda per object
# - Restore from Glacier

Performance Limits

| Metric | Limit | Notes |
|---|---|---|
| GET requests per prefix | 5,500/sec | Per prefix per partition |
| PUT/POST/DELETE per prefix | 3,500/sec | Per prefix per partition |
| Max object size | 5 TB | Via multipart upload |
| Max single PUT size | 5 GB | Use multipart for larger |
| Max metadata per object | 2 KB | User-defined key-value pairs |
| Max parts per multipart | 10,000 | Part size: 5 MB–5 GB |
| Max buckets per account | 100 (soft) | Raisable to 1,000 |
| Max objects per bucket | Unlimited | Billions in production |
Scaling past 5,500 GET/s: S3 automatically partitions your bucket by key prefix. To maximize throughput, use high-cardinality prefixes (e.g., hash-based keys like a3b1/data.json instead of sequential keys like 2024-03-15/data.json). Since 2018, S3 handles this automatically for most workloads — the “randomize prefix” advice is largely outdated but still relevant for extreme throughput.

MinIO: Self-Hosted S3-Compatible Storage

MinIO is the leading open-source, S3-compatible object storage server. It implements the full S3 API, meaning any application built for S3 can switch to MinIO with zero code changes — just change the endpoint URL.

Architecture

# MinIO can run as a single binary or distributed cluster

# Single-node (development):
minio server /data --console-address ":9001"
# Exposes S3 API on :9000, web console on :9001

# Distributed mode (production) — 4 nodes × 4 disks each = 16 drives:
# Node 1:
minio server http://node{1...4}:9000/mnt/disk{1...4}/data

# This creates an erasure-coded cluster:
# - RS(8,8) by default: 8 data + 8 parity shards per object
# - Tolerates loss of up to 8 drives (half the cluster)
# - Storage efficiency: 50% usable (compared to 33% with 3× replication)

# Larger clusters — 16 nodes × 4 disks = 64 drives:
minio server http://node{1...16}:9000/mnt/disk{1...4}/data
# MinIO groups drives into erasure sets of at most 16 drives, so 64 drives form
# 4 independent erasure sets, each coded like the 16-drive example above

Usage with Standard S3 SDKs

# Python — boto3 works with MinIO by changing endpoint only:
import boto3

s3 = boto3.client('s3',
    endpoint_url='http://minio.internal:9000',
    aws_access_key_id='minioadmin',
    aws_secret_access_key='minioadmin',
    region_name='us-east-1'  # required but not used
)

# All standard S3 operations work:
s3.create_bucket(Bucket='my-app-data')
s3.put_object(Bucket='my-app-data', Key='users/123.json', Body=b'{"name":"Alice"}')
obj = s3.get_object(Bucket='my-app-data', Key='users/123.json')
print(obj['Body'].read())  # b'{"name":"Alice"}'

# Go — MinIO's own SDK:
// import "github.com/minio/minio-go/v7"
client, _ := minio.New("minio.internal:9000", &minio.Options{
    Creds:  credentials.NewStaticV4("minioadmin", "minioadmin", ""),
    Secure: false,
})
client.PutObject(ctx, "my-app-data", "users/123.json",
    strings.NewReader(`{"name":"Alice"}`), -1,
    minio.PutObjectOptions{ContentType: "application/json"})

# Kubernetes deployment (Helm):
helm repo add minio https://charts.min.io/
helm install minio minio/minio \
  --set replicas=4 \
  --set persistence.size=1Ti \
  --set resources.requests.memory=16Gi

When to use MinIO over S3:
  • On-premises or data-residency requirements where data cannot leave your own infrastructure
  • Edge, air-gapped, or Kubernetes-native deployments where you control the hardware
  • Read-heavy internal workloads where cloud egress and request charges would dominate
  • Local development and CI environments that need a disposable S3-compatible endpoint

Design Patterns & Best Practices

Key Design Patterns

# ❌ Bad: Sequential keys create hot partitions
2024-03-15-000001.json
2024-03-15-000002.json
2024-03-15-000003.json
# All requests hit the same partition → throttling at high throughput

# ✅ Good: Hash-prefixed keys distribute across partitions
a3b1/2024-03-15-000001.json    # hash of timestamp or UUID prefix
7f2e/2024-03-15-000002.json
c9d4/2024-03-15-000003.json

# ✅ Better: Use natural high-cardinality prefixes
users/a3b1c2d3/profile.json   # user ID as prefix
events/2024/03/15/14/30/evt-uuid.json   # hour-level partitioning

# ✅ Best (modern S3): Use random UUIDs as keys
550e8400-e29b-41d4-a716-446655440000.parquet
# S3 auto-partitions since 2018, but UUIDs still help at extreme scale

Security Patterns

# 1. Bucket policy: deny unencrypted uploads
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    },
    {
      "Sid": "DenyHTTP",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
      "Condition": {
        "Bool": { "aws:SecureTransport": "false" }
      }
    }
  ]
}

# 2. Enable S3 Block Public Access (account-level)
aws s3control put-public-access-block \
  --account-id 123456789012 \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

# 3. Enable access logging
aws s3api put-bucket-logging --bucket my-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "my-access-logs",
      "TargetPrefix": "s3-logs/my-bucket/"
    }
  }'

Summary