Mastering Data Processing on AWS: A Python Developer's Guide to Boto3 and S3
Unlock the power of AWS for data processing. This guide covers what a Python developer needs to know about using Boto3 with S3, from efficient listing and streaming to parallel multipart transfers and batch operations.
Amazon S3 is the backbone of countless data architectures on AWS, serving as a durable, scalable, and cost-effective object store. For Python developers, boto3 is the essential tool for interacting with S3. But are you using it as efficiently as possible?
This guide moves beyond basic file uploads and explores practical techniques for mastering data processing with boto3 and S3, covering everything from handling large datasets to performing batch operations.
Getting Started: The Boto3 S3 Client and Resource
boto3 offers two ways to interact with S3:
- Client: A low-level interface that maps directly to the S3 API operations (e.g., list_objects_v2, put_object).
- Resource: A higher-level, object-oriented interface that provides a more intuitive way to manage S3 resources (e.g., bucket.objects.all()).
For most data processing tasks, the Client offers more control and is often more performant.
import boto3
# Use the client for fine-grained control
s3_client = boto3.client('s3')
# The resource can be more convenient for simple operations
s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket('my-data-bucket')
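To make the difference concrete, here is a minimal sketch of the same metadata lookup done both ways (the bucket name and key are illustrative placeholders):
# Client: call the API operation directly and read the response dict
head = s3_client.head_object(Bucket='my-data-bucket', Key='data/input.csv')
print(head['ContentLength'])
# Resource: work with an object-oriented abstraction and its attributes
obj = s3_resource.Object('my-data-bucket', 'data/input.csv')
print(obj.content_length)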
Technique 1: Efficiently Listing Objects with Paginators
When a bucket contains thousands or millions of objects, listing them all can be a challenge. A standard list_objects_v2 call only returns up to 1,000 objects at a time. Instead of manually handling the NextContinuationToken, you should use a paginator.
Paginators abstract away the complexity of token management, allowing you to iterate over the entire result set with a simple loop.
def count_objects_in_prefix(bucket: str, prefix: str) -> int:
    """Efficiently counts all objects in a given S3 prefix."""
    paginator = s3_client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket, Prefix=prefix)
    object_count = 0
    for page in page_iterator:
        if 'Contents' in page:
            object_count += len(page['Contents'])
    return object_count
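Paginators also expose a search() method that accepts a JMESPath expression, letting you filter each page as you iterate without writing the bookkeeping yourself. Here is a minimal sketch that yields only keys above a size threshold (the function name and the 1 MB cutoff are illustrative):
def list_large_objects(bucket: str, prefix: str, min_size: int = 1024 * 1024):
    """Yields keys of objects larger than min_size bytes under a prefix."""
    paginator = s3_client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket, Prefix=prefix)
    # search() applies the JMESPath filter to each page on the client side
    for obj in page_iterator.search(f"Contents[?Size > `{min_size}`]"):
        if obj is not None:
            yield obj['Key']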
Technique 2: Streaming Large Files
Loading a multi-gigabyte file from S3 directly into memory is a recipe for disaster. A much safer and more memory-efficient approach is to stream the data.
The StreamingBody object returned by get_object allows you to read the file in chunks, making it possible to process huge files with a small, constant amount of memory.
Here’s how you can process a large CSV file line by line without loading the whole file:
import csv
import codecs
def process_large_csv(bucket: str, key: str):
    """Streams and processes a large CSV file from S3 line by line."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # Use codecs to decode the streaming body line by line
    csv_reader = csv.reader(codecs.getreader('utf-8')(response['Body']))
    for row in csv_reader:
        # Process each row without loading the entire file
        print(f"Processing row: {row}")
Technique 3: Parallel Uploads and Downloads with transfer
For large files, boto3
's upload_file
and download_file
methods are highly recommended. These methods, part of the S3 transfer
module, automatically handle multipart uploads and downloads, using multiple threads to improve throughput.
They also manage retries and checksum validation, making your file transfers more robust.
from boto3.s3.transfer import TransferConfig
# Set multipart threshold to 10 MB and use 8 threads
config = TransferConfig(
    multipart_threshold=1024 * 1024 * 10,
    max_concurrency=8,
    use_threads=True
)
# Upload a file with parallel parts
s3_client.upload_file(
    Filename='local-large-file.zip',
    Bucket='my-data-bucket',
    Key='uploads/large-file.zip',
    Config=config
)
# Download a file with parallel parts
s3_client.download_file(
    Bucket='my-data-bucket',
    Key='uploads/large-file.zip',
    Filename='downloaded-large-file.zip',
    Config=config
)
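upload_file and download_file also accept a Callback argument, which is invoked with the number of bytes transferred so far; this is handy for progress reporting on long transfers. Here is a minimal sketch of a progress callback (the ProgressLogger class is illustrative, not part of boto3):
import os
import threading

class ProgressLogger:
    """Callable that logs transfer progress as a percentage."""
    def __init__(self, filename: str):
        self._filename = filename
        self._size = os.path.getsize(filename)
        self._seen = 0
        self._lock = threading.Lock()  # callbacks can fire from multiple threads

    def __call__(self, bytes_transferred: int):
        with self._lock:
            self._seen += bytes_transferred
            print(f"{self._filename}: {self._seen / self._size:.1%} transferred")

s3_client.upload_file(
    Filename='local-large-file.zip',
    Bucket='my-data-bucket',
    Key='uploads/large-file.zip',
    Config=config,
    Callback=ProgressLogger('local-large-file.zip')
)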
Technique 4: Performing Batch Operations
If you need to delete thousands of objects, calling delete_object for each one is slow and inefficient. The delete_objects operation allows you to delete up to 1,000 objects in a single API call.
Here’s a helper function to delete all objects within a prefix:
def delete_all_objects_in_prefix(bucket: str, prefix: str):
    """Deletes all objects in a given S3 prefix in batches."""
    paginator = s3_client.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        if 'Contents' in page:
            objects_to_delete = [{'Key': obj['Key']} for obj in page['Contents']]
            if objects_to_delete:
                s3_client.delete_objects(
                    Bucket=bucket,
                    Delete={'Objects': objects_to_delete}
                )
                print(f"Deleted {len(objects_to_delete)} objects.")
Conclusion
Working with S3 is about more than just storing files. By using the right boto3 patterns (paginators for listing, streaming for large files, transfer management for uploads, and batch operations for bulk changes), you can build data processing applications that are efficient, scalable, and robust. The next time you write a Python script for S3, think beyond the basics and apply these techniques to master your data workflow.