I’m going to show how to download CloudFront logs, parse them with Pandas, and extract some useful insights. All the code lives in a Jupyter notebook, so it’s easy to follow along and reproduce the results.

Downloading logs

First of all, “Standard logging” must be enabled for the CloudFront distribution so that logs are stored in an S3 bucket. This can be done in the AWS console, as described here. Each log file is a gzipped, tab-separated text file with a predefined structure. You can read more about the format, including the full list of fields, in the official documentation.
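
For reference, each unzipped log file starts with two comment lines: a version line and a #Fields line that lists the column names (the code below relies on that second line to name the DataFrame columns). Abridged, the header looks like this:

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status ...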

I prefer to download the logs once, store them in the optimized Parquet format, and then analyze them locally.

Make sure you have credentials for the AWS account stored in the ~/.aws/credentials file. There are other ways to authenticate with AWS, but this one is the simplest and works well for me.
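
If you haven’t set it up yet, the credentials file is a plain INI file with one section per profile; for the my-dev profile used below it would look roughly like this (with the keys replaced by placeholders):

[my-dev]
aws_access_key_id = <your access key id>
aws_secret_access_key = <your secret access key>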

Let’s take a look at the first part of the notebook, which is responsible for the basic configuration and downloading the logs.

import boto3
import os
import gzip
import shutil
import pandas as pd

aws_profile_name = 'my-dev'

# The bucket and the list of prefixes are used to select logs for a specific period of time
bucket_name = 'main-cloudfront-logs'
prefixes = [
  'E2NDJF3JKDD2FG.2023-08-01',
  'E2NDJF3JKDD2FG.2023-08-02',
]

# Where to store logs locally
parquet_file_name = 'logs.parquet'

# Load AWS credentials
session = boto3.Session(profile_name=aws_profile_name)

# Create an S3 resource object using the AWS credentials
s3 = session.resource('s3')

# Create a local directory to store the downloaded log files
os.makedirs('logs', exist_ok=True)

# List all log files in the S3 bucket that match the prefix
bucket = s3.Bucket(bucket_name)
log_files = []
for prefix in prefixes:
    log_files += [obj.key for obj in bucket.objects.filter(Prefix=prefix)]

print(f'Downloading {len(log_files)} log files from {bucket_name}.')

# Download, unzip, and parse each log file
data = []
for idx, log_file in enumerate(log_files):
    # Create subdirectories if they don't exist
    os.makedirs(os.path.dirname('logs/' + log_file), exist_ok=True)

    # Download the log file
    s3.meta.client.download_file(bucket_name, log_file, 'logs/' + log_file)

    # Unzip the log file
    with gzip.open('logs/' + log_file, 'rb') as f_in:
        with open('logs/' + log_file[:-3], 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

    # Get the column names from the second line of the file
    with open('logs/' + log_file[:-3], 'r') as f:
        f.readline()  # Skip the first line
        columns = f.readline().strip().split(' ')[1:]

    print(f'Parsing #{idx + 1} {log_file}')

    # Load the log file into a DataFrame
    df = pd.read_csv('logs/' + log_file[:-3], delimiter='\t', skiprows=2, names=columns)

    # Append the DataFrame to the data list
    data.append(df)

    # Delete original and unzipped log files
    os.remove('logs/' + log_file)
    os.remove('logs/' + log_file[:-3])

# Concatenate all dataframes
df = pd.concat(data, ignore_index=True)

# Convert 'sc-content-len' to numeric ('-' placeholders become NaN)
df['sc-content-len'] = pd.to_numeric(df['sc-content-len'], errors='coerce')

# Save the DataFrame to a Parquet file
df.to_parquet(parquet_file_name)

As a result, we have a Parquet file with all logs for the specified period of time.

Now it’s time to analyze them.

Analyzing logs

Let’s see what we have in the dataset.

# Load the DataFrame from the Parquet file
df = pd.read_parquet(parquet_file_name, engine='pyarrow')

# Print summary statistics for all columns
pd.set_option('display.max_columns', None)
df.describe(include='all').fillna('-')

You should see a table with all columns and their summary statistics.
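
If you just want a quick sanity check of the dataset size, something along these lines works too:

# Number of records and approximate in-memory size of the DataFrame
print(f'{len(df)} records')
print(f'{df.memory_usage(deep=True).sum() / 1024 ** 2:.1f} MiB in memory')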

From this point on, you can use the full power of Pandas to analyze the log records.

Let’s see a few simple examples.

# Show the most popular URLs
df.groupby('cs-uri-stem').size().sort_values(ascending=False)

# Show the most popular URLs with 404 status code
df[df['sc-status'] == 404].groupby('cs-uri-stem').size().sort_values(ascending=False)

# Show the slowest requests (by URL)
df.groupby('cs-uri-stem')['time-taken'].mean().sort_values(ascending=False)
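
Another angle that is often useful is traffic over time. Here’s a minimal sketch, assuming the standard date and time fields are present in your log format:

# Combine the 'date' and 'time' fields into a proper timestamp
df['timestamp'] = pd.to_datetime(df['date'] + ' ' + df['time'])

# Show the number of requests per hour
df.set_index('timestamp').resample('1H').size()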

We can try something more advanced. Let’s say we have requests like these:

/series/<UUID>/<FORMAT>

GET /series/1ef7f7ae-e002-445d-8e7a-c0b4cd11000f/raw
GET /series/1ef7f7ae-e002-445d-8e7a-c0b4cd11000f/compact
GET /series/1ef7f7ae-e002-445d-8e7a-c0b4cd11000f/simplified
GET /series/efa8b51c-972c-4e66-8f69-bcfd8d45f25d/raw
...

We can extract the UUID and the format from the URL and then aggregate the requests based on these values.

# Extract UUID and format from the URL
pattern = r'/series/([a-f0-9\-]+)/([a-z]+)'
df[['uuid', 'format']] = df['cs-uri-stem'].str.extract(pattern)

# Show the most popular UUIDs
df.groupby('uuid').size().sort_values(ascending=False)

# Show the number of requests by format
df.groupby('format').size().sort_values(ascending=False)
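
Since both extracted values are now ordinary columns, they can also be combined, for example into a per-UUID breakdown by format (a small sketch on top of the columns created above):

# Count requests per UUID and format: one row per UUID, one column per format
df.groupby(['uuid', 'format']).size().unstack(fill_value=0)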

Conclusion

I hope this notebook helps you get started with CloudFront log analysis.

The main drawback of this approach is that you have to download the logs from S3 to your local machine, and the whole dataset has to fit into your computer’s memory. This becomes a problem for a really large volume of logs. In that case, it’s better to use an analytical database or a distributed solution. You could also use AWS Athena to query the logs directly in S3.