OpenSearch to S3 Replication

To Nha Notes | June 18, 2025, 7:55 p.m.

 

Near Real-Time Data Replication from AWS OpenSearch Service to Amazon S3 and AWS Glue Data Catalog

 

 

I. Executive Summary

 

Replicating data from operational stores, such as AWS OpenSearch Service, to analytical data lakes like Amazon S3 and AWS Glue Data Catalog is a critical requirement for various business functions, including business intelligence, compliance, and long-term data retention. The mandate for "near real-time" replication, specifically leveraging Change Data Capture (CDC) or streaming methodologies, necessitates a continuous data flow rather than periodic batch exports. This report investigates viable architectural patterns and specific AWS service configurations to achieve this objective.

The analysis reveals that the most suitable AWS-native approaches for near real-time data replication from OpenSearch Service involve utilizing OpenSearch Ingestion (Data Prepper) for its fully managed, serverless streaming capabilities, or a self-managed Logstash pipeline for scenarios demanding greater customization. While traditional snapshot-based methods are invaluable for disaster recovery and data migration, their inherent point-in-time nature renders them unsuitable for stringent near real-time CDC requirements. Furthermore, a detailed examination of AWS Database Migration Service (DMS) and Kinesis Data Firehose capabilities confirms that these services are not designed to function as direct sources for data extraction from OpenSearch Service.

The primary conclusion is that OpenSearch Ingestion stands out as the recommended AWS-managed solution, offering simplicity, automatic scalability, and native integration for continuous data flow. Logstash presents a flexible alternative, albeit with increased operational overhead.

 

II. Understanding the Replication Mandate

 

Achieving effective data replication requires a clear understanding of the terminology and the capabilities of the source and target systems. This section defines the core requirements and clarifies common misconceptions regarding data flow within the AWS ecosystem.

 

Defining "Near Real-Time" and "CDC/Streaming" in the OpenSearch Context

 

"Near real-time" in the context of data replication implies a latency measured in seconds to minutes, ensuring that data is sufficiently fresh for immediate operational analytics or downstream consumption. This characteristic fundamentally distinguishes it from traditional batch processing, which typically involves hourly or daily cycles. The objective is to minimize the time lag between a data modification occurring in OpenSearch and its availability in the target S3 bucket or Glue Data Catalog.

"CDC/Streaming" refers to the process of capturing and propagating changes—including inserts, updates, and deletes—as they occur in the source system. For OpenSearch, this translates to identifying new documents, modifications to existing documents, or deletions, and then continuously streaming these changes to the designated target. This approach ensures that the target data lake remains consistently synchronized with the operational OpenSearch cluster.

It is crucial to distinguish this continuous, change-driven replication from OpenSearch snapshots. While OpenSearch snapshots are incremental, meaning they only store data that has changed since the last successful snapshot, optimizing storage efficiency 1, they are fundamentally point-in-time backups. A key limitation for near real-time CDC is that new documents and updates to existing documents that occur during a snapshot operation are generally not included in that specific snapshot.2 Consequently, relying solely on snapshots for near real-time CDC would necessitate complex post-processing to identify and reconcile changes between successive snapshots, introducing significant latency and operational complexity.

 

Clarifying the Source (AWS OpenSearch Service) and Targets (Amazon S3, AWS Glue Data Catalog)

 

The source for this replication task is an AWS OpenSearch Service domain, which is a fully managed service that simplifies the deployment, operation, and scaling of OpenSearch clusters on AWS. This managed environment handles much of the underlying infrastructure, patching, and scaling.

The primary target is Amazon S3, a highly scalable, durable, and cost-effective object storage service that serves as the foundational layer for data lakes. Data replicated to S3 can be stored in various formats, enabling flexible consumption by a wide array of analytical services.

The secondary target, or rather an augmentation of the S3 target, is the AWS Glue Data Catalog. This service acts as a centralized metadata repository for data stored in S3, providing schema definitions and discoverability. Integrating with the Data Catalog allows other AWS analytics services, such as Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum, to easily query and analyze the data without needing to understand the underlying file structures.

 

Distinguishing between Data Ingestion into OpenSearch and Data Export from OpenSearch

 

A common point of confusion in data pipeline design is the direction of data flow. The user's query explicitly asks for data replication from OpenSearch. This is a critical distinction from the more frequently discussed patterns of ingesting data into OpenSearch. Many AWS services are designed to facilitate data ingestion to OpenSearch Service, often acting as sinks for data streams that eventually land in OpenSearch. For example, Kinesis Data Firehose is commonly used to deliver streaming data to OpenSearch Service.4 Similarly, OpenSearch Ingestion pipelines can be configured with various sources to push data into an OpenSearch domain or collection.7

A thorough review of the capabilities of AWS Database Migration Service (DMS) and Kinesis Data Firehose reveals a significant architectural constraint: neither service directly supports OpenSearch Service as a source for replicating data to S3 or AWS Glue Data Catalog. All available documentation consistently shows AWS DMS with OpenSearch Service as a target endpoint, facilitating data migration into OpenSearch from other databases like relational or NoSQL stores, or even S3.9 Similarly, Kinesis Data Firehose is designed to deliver data to OpenSearch Service, and while it can optionally back up that ingested data to S3, it does not possess the capability to source data from an existing OpenSearch Service domain.4 This architectural limitation is a crucial clarification, preventing the pursuit of non-viable replication paths.

 

III. Architectural Approaches for Near Real-Time OpenSearch Data Export

 

Given the requirements for near real-time, CDC-like replication, the following architectural patterns are the most pertinent for exporting data from AWS OpenSearch Service to S3 and AWS Glue Data Catalog.

 

A. OpenSearch Ingestion (Data Prepper) for Continuous Export

 

Amazon OpenSearch Ingestion is a fully managed, serverless data collector powered by Data Prepper, an open-source data pipeline. It is specifically designed to ingest, filter, transform, enrich, and route real-time logs, metrics, and trace data.7 A key capability that makes it highly suitable for this use case is its support for Amazon S3 as a destination (sink).12

 

Configuring OpenSearch as a Source for Data Prepper

 

OpenSearch Ingestion pipelines can directly read indexes from OpenSearch clusters and OpenSearch Service domains through the opensearch source plugin.15 This direct integration is a fundamental enabler for CDC-like behavior from OpenSearch Service. The configuration of this source requires specifying the OpenSearch hosts (the domain endpoint), along with AWS-specific parameters such as the aws.region and an aws.sts_role_arn for authentication.15 This IAM role must possess the necessary permissions, such as es:ESHttpGet, to read data from the OpenSearch domain.

The opensearch source plugin supports various search context types for efficient pagination and change tracking, including point_in_time, scroll, or search_after.15 For OpenSearch Serverless collections, search_after is the default pagination method.15 These mechanisms allow the pipeline to efficiently query for new or modified documents without rescanning the entire index. The "near real-time" aspect is influenced by scheduling options like interval (e.g., "PT1H" for a one-hour interval) and index_read_count (which defaults to 1) that control the frequency of reprocessing data from the source.15 The native capability of OpenSearch Ingestion to directly source data from OpenSearch, leveraging its built-in change tracking mechanisms, makes it a superior choice for a managed CDC solution. This eliminates the need for complex custom logic or external tools to detect changes, providing a direct and operationally efficient way to achieve near real-time replication from AWS OpenSearch Service to S3.

 

Implementing S3 as a Sink for Data Prepper

 

The s3 sink plugin within OpenSearch Ingestion is responsible for saving and writing batches of processed events to Amazon S3 objects.14 The codec configuration within the sink determines how the data is serialized into S3, with options such as JSON, CSV, or potentially Parquet if a compatible codec is used.14

For efficient data delivery and cost optimization, Data Prepper buffers incoming streaming data based on configurable thresholds. These include maximum_size (default 50 MB), event_collect_timeout (maximum time to wait before writing), and event_count (number of events to accumulate).14 The choice of buffering type (e.g., in_memory, local_file, or multipart) also influences performance. Critically, the key_path_prefix setting supports dynamic naming using Data Prepper expressions (e.g., /${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/).14 This feature is invaluable for implementing data partitioning in S3, which is a best practice for later integration with AWS Glue Data Catalog and optimizing analytical queries.

 

Leveraging SQS Notifications for Event-Driven Processing

 

While not directly part of the export mechanism from OpenSearch, it is worth noting that OpenSearch Ingestion can use S3-SQS processing for near real-time scanning of files after they are written to S3.12 This pattern could be relevant if a subsequent pipeline or process needs to be triggered by the arrival of new objects in the S3 bucket where OpenSearch data is replicated.

 

Data Transformation and Schema Management

 

OpenSearch Ingestion supports a rich ecosystem of built-in processors that can filter, transform, enrich, and normalize data before it is published to the S3 sink.7 This capability allows for on-the-fly schema adjustments, data cleansing, or enrichment, ensuring that the data arriving in S3 is in the desired format for downstream analytics.

 

Key Configuration Parameters for OpenSearch Ingestion Pipeline

 

The following table outlines essential configuration parameters for an OpenSearch Ingestion pipeline designed for this replication task:

| Component | Parameter | Description |
| --- | --- | --- |
| Source (opensearch) | hosts | OpenSearch domain endpoint (e.g., https://search-my-domain.us-east-1.es.amazonaws.com) |
| | aws.region | AWS Region of the OpenSearch domain |
| | aws.sts_role_arn | IAM role ARN for OpenSearch Ingestion to assume for authentication to OpenSearch Service (e.g., arn:aws:iam::123456789012:role/my-domain-role) |
| | indices.include / exclude | Regex patterns to include or exclude specific OpenSearch indexes for replication |
| | scheduling.interval | Frequency of data reprocessing (e.g., "PT5M" for 5 minutes) |
| | search_options.batch_size | Number of documents to read per batch from OpenSearch (default 1000) |
| | search_options.search_context_type | Pagination method: point_in_time, scroll, or none (uses search_after) |
| Sink (s3) | bucket | Name of the target S3 bucket |
| | codec | Data serialization format (e.g., json, parquet if supported by codec) |
| | key_path_prefix | S3 key prefix for partitioning (e.g., data/%{yyyy}/%{MM}/%{dd}/) |
| | threshold.event_count | Number of events to accumulate before writing to S3 |
| | threshold.maximum_size | Maximum size in bytes to accumulate before writing to S3 (default 50mb) |
| | threshold.event_collect_timeout | Maximum time to wait before writing events to S3 |
| Processors (Optional) | date | Example: from_time_received: true, destination: "@timestamp" for timestamping |
| | mutate | Example: add_entries, rename_keys, delete_entries for data manipulation |
| | grok | Example: For parsing unstructured log data into structured fields |

B. Logstash-Based Data Export Pipeline

 

Logstash is an open-source, server-side data processing pipeline widely used for ETL (extract, transform, and load) operations.17 It offers a high degree of flexibility through its input, filter, and output plugins, allowing it to connect to diverse data sources and send data to various destinations, with optional transformations in between.18

 

Configuring Logstash with OpenSearch Input Plugin

 

The logstash-input-opensearch plugin enables Logstash to read data directly from an OpenSearch Service domain.17 The configuration typically involves specifying the OpenSearch domain endpoint (hosts), along with username and password for basic authentication.17 For OpenSearch Service domains with fine-grained access control (FGAC) enabled, Logstash needs to be configured to send signed requests using appropriate IAM credentials. This involves attaching an IAM role to the Logstash host (e.g., an EC2 instance) with permissions like es:ESHttp* actions on the OpenSearch domain resource, and then configuring the Logstash plugin to use AWS authentication.20

For continuous replication, Logstash utilizes sinceDB files to track the progress of processed events from input sources.19 This mechanism ensures that only new events are processed on subsequent runs or after restarts, which is crucial for maintaining a continuous data flow and preventing duplicate processing.

 

Configuring Logstash with S3 Output Plugin

 

The logstash-output-s3 plugin allows Logstash to write processed data to an Amazon S3 bucket.21 Essential configuration parameters include the bucket name and region.21 For managing file sizes in S3, the size_file option can be used to rotate files when they reach a specified size.21

A critical security best practice for the S3 output plugin is to avoid hardcoding AWS access_key_id and secret_access_key directly in the configuration. Instead, Logstash can automatically resolve IAM credentials through the use of IAM instance profiles (when running on EC2) or environment variables.22 This approach significantly enhances security by leveraging AWS's native identity and access management mechanisms.
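A minimal Logstash configuration combining the two plugins might look like the following sketch. Hostnames, index patterns, and bucket names are placeholders, plugin options should be checked against the installed plugin versions, and credentials are intentionally omitted so they resolve from the EC2 instance profile:

```conf
input {
  opensearch {
    hosts    => ["https://search-my-domain.us-east-1.es.amazonaws.com:443"]
    index    => "logs-*"
    query    => '{ "query": { "match_all": {} } }'
    # Re-poll the source periodically for near real-time behavior.
    schedule => "*/5 * * * *"
  }
}

output {
  s3 {
    region    => "us-east-1"
    bucket    => "my-replica-bucket"
    prefix    => "data/%{+YYYY}/%{+MM}/%{+dd}"
    size_file => 52428800   # rotate files at roughly 50 MB
    codec     => "json_lines"
    # No access_key_id/secret_access_key here: credentials come from the
    # IAM instance profile attached to the EC2 host.
  }
}
```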

 

Deployment and Scaling Strategies for Logstash Instances

 

Logstash can be deployed in various environments, including Amazon EC2 instances, Docker containers, or Kubernetes clusters.23 Scaling a Logstash-based pipeline involves either adding more Logstash instances (horizontal scaling) or increasing the compute and memory resources of existing instances (vertical scaling).23 Data transfer can be parallelized by running multiple Logstash processes, each configured to handle different data slices or specific indexes.17

 

Considerations for Data Consistency and Latency

 

Logstash's sinceDB mechanism plays a vital role in maintaining state for continuous replication, providing a form of change data capture. To ensure data resiliency and robust error handling, Logstash can be configured with Persistent Queues (PQ) and Dead Letter Queues (DLQ).23 Persistent queues provide durability for events in transit, buffering them to disk to prevent data loss in case of Logstash process failures. Dead Letter Queues are essential for capturing events that fail to process or deliver to the sink, allowing for manual inspection, debugging, and potential re-processing, thereby safeguarding data integrity.

Logstash offers significant flexibility, allowing for highly customized data transformations and routing through its rich plugin ecosystem. However, this flexibility comes with increased operational management compared to serverless options. Unlike OpenSearch Ingestion, Logstash requires manual deployment and scaling of its underlying compute infrastructure, including patching and managing the operating system and Logstash software.23 This translates to higher operational overhead. The cost model for Logstash typically involves continuous compute costs for the provisioned EC2 instances, which may be less cost-efficient for fluctuating workloads compared to the consumption-based model of serverless services like OpenSearch Ingestion.7 Therefore, Logstash is a strong choice for organizations with existing Logstash expertise, complex and unique transformation needs, or specific control requirements over the processing environment, but for a purely managed and low-overhead solution, OpenSearch Ingestion is generally preferable.

 

C. Snapshot-Based Replication with Incremental Processing (Hybrid Approach)

 

OpenSearch snapshots are a fundamental feature for data backup and recovery, but their utility for near real-time CDC is limited.

 

Understanding OpenSearch Snapshots: Incremental Nature vs. True CDC

 

OpenSearch snapshots are incremental, meaning they only store data that has changed since the last successful snapshot.1 This design optimizes storage efficiency. However, snapshots inherently capture a point-in-time view of the cluster. Any new documents or updates to existing documents that occur during the snapshot process are generally not included in that particular snapshot.2 This characteristic means that snapshots cannot provide true "near real-time" CDC, as there will always be a window during which changes are not captured by the snapshot. While shallow snapshot v2 significantly reduces snapshot creation time to mere seconds by referencing existing data checkpoints in S3, improving efficiency and reducing I/O and network operations, it does not alter the fundamental point-in-time nature of the backup.24

 

Automating Snapshots to S3

 

OpenSearch offers mechanisms to automate snapshots to S3:

  • OpenSearch Snapshot Management (SM): Policies can be configured within OpenSearch Dashboards to automate periodic snapshot creation and deletion to a user-registered S3 repository.25 Each policy can manage up to 400 snapshots for a given repository.25

  • AWS Lambda/CloudWatch Events: For more flexible scheduling, AWS Lambda functions can be triggered by CloudWatch Events (e.g., on a cron schedule) to programmatically initiate OpenSearch snapshot API calls using the AWS SDK (e.g., boto3 for Python).27

  • S3 Repository Registration: Before taking manual snapshots, an S3 bucket must be registered as a snapshot repository with the OpenSearch Service domain.1 This one-time operation requires specific IAM roles (e.g., TheSnapshotRole) and permissions (iam:PassRole, es:ESHttpPut, and S3 actions like s3:PutObject, s3:GetObject, s3:ListBucket) for the user or role initiating the registration.1 If fine-grained access control is enabled, the manage_snapshots role in OpenSearch Dashboards must be mapped to the relevant IAM role.1

 

Post-Snapshot Processing for Incremental Updates

 

To achieve any semblance of CDC using snapshots, a complex post-processing mechanism would be required. This would involve comparing successive snapshots stored in S3, identifying the differences (inserts, updates, deletes), and then processing only those changes. This "diffing" logic is not natively provided by OpenSearch snapshot features for external consumption and would require a custom ETL solution. For instance, S3 event notifications could trigger AWS Glue jobs when new snapshot files land in S3, but the Glue job would then need to implement the bespoke logic to identify and extract the incremental changes.

 

Limitations for Achieving Strict Real-Time CDC

 

The fundamental point-in-time nature of snapshots means there will always be a lag and a potential for missed changes that occur during the snapshot window. This approach is best suited for disaster recovery, data migration 39, or long-term archiving, rather than meeting a strict "near real-time" CDC requirement.

It is important to clarify that searchable snapshots, while a powerful OpenSearch feature, are not a solution for external data replication. Searchable snapshots allow OpenSearch to mount index snapshots stored in S3 as live, searchable indexes within the OpenSearch cluster itself.40 Their primary purpose is to enhance cost efficiency and simplify management of older or infrequently accessed data by offloading it to remote storage while keeping it queryable by OpenSearch.40 The data remains in OpenSearch's proprietary snapshot format in S3 and is inherently read-only for OpenSearch. It is not exposed in a format (like Parquet or CSV) that is directly consumable by AWS Glue Data Catalog or other analytics services outside of OpenSearch. Therefore, this feature should not be confused with a mechanism for replicating data to S3 or Glue Data Catalog for external consumption and analysis.

 

D. AWS Database Migration Service (DMS) (Not Viable as a Source)

 

AWS Database Migration Service (DMS) is a robust service designed for migrating databases quickly and securely. It supports both full data loads and continuous replication (CDC) of ongoing changes.9 DMS can migrate data to Amazon OpenSearch Service from various sources, including relational databases (e.g., Oracle, Amazon Aurora), NoSQL databases (e.g., MongoDB), or even S3.9

However, despite its name and broad capabilities, AWS DMS does not support OpenSearch Service as a source endpoint for data replication.9 Its design is focused on migrating data into OpenSearch, not extracting it. Therefore, AWS DMS is not a viable approach for replicating data from AWS OpenSearch Service to S3 or AWS Glue Data Catalog.

 

E. Kinesis Data Firehose (Not Viable as a Source)

 

Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to various destinations, including Amazon S3, Amazon Redshift, and Amazon OpenSearch Service.4 It can buffer incoming data and, when delivering to OpenSearch Service, can optionally back up the source data to S3 concurrently.4

Similar to AWS DMS, Firehose is fundamentally an ingestion service designed to deliver data into OpenSearch Service. It does not possess the capability to read or source data from an existing OpenSearch Service domain.4 Therefore, Kinesis Data Firehose is not a viable approach for replicating data from AWS OpenSearch Service to S3 or AWS Glue Data Catalog.

 

IV. Data Storage and Cataloging Best Practices in S3 and AWS Glue

 

The choice of data format and the strategy for cataloging data in S3 are as critical as the replication mechanism itself for ensuring efficient and cost-effective downstream analytics.

 

A. Optimal Data Formats for Analytics in S3

 

When storing data in S3 for analytical workloads, the choice of file format significantly impacts storage costs, query performance, and overall data lake efficiency.

 

Comparative Analysis of Data Formats:

 

  • JSON/CSV: These are simple, human-readable formats.46 OpenSearch Service direct queries can support them.47 However, they are row-based, which makes them less efficient for analytical queries that typically select a subset of columns. Reading an entire row to access a few columns can be I/O intensive and costly.46 JSON can be suitable for small, trickling data streams.49

  • Parquet/ORC: These are columnar storage formats 46 and are highly recommended for data lakes due to several advantages:

    • Compression: They achieve significant compression ratios, reducing storage costs and improving I/O efficiency during data processing.46

    • Query Performance: By storing data column-wise, analytical queries can read only the necessary columns, leading to much faster query execution and reduced data scanning costs.46 OpenSearch Service direct queries also support Parquet.47

    • Splittability: Parquet and ORC files are designed to be splittable, enabling parallel processing in distributed computing frameworks like Apache Spark (used by AWS Glue).46

    • Schema Evolution: They offer better support for evolving schemas compared to flat file formats like CSV.50

 

Recommendation

 

For analytical workloads in S3 and seamless integration with AWS Glue Data Catalog, Parquet or ORC are the optimal choices. Parquet is often preferred due to its broader industry adoption.46 The choice of target data format is as critical as the replication mechanism for downstream analytics success. The performance and cost of querying data in S3 via Glue/Athena are directly tied to how the data is stored (format, partitioning). A fast replication pipeline is only half the solution if the target data is inefficiently stored. Therefore, the replication pipeline (e.g., OpenSearch Ingestion S3 sink or Logstash S3 output) should ideally output data directly into Parquet format. If direct Parquet output is not feasible, an intermediate AWS Glue ETL job should be planned to convert data from JSON/CSV to Parquet, ensuring the entire data pipeline is optimized from source to consumption.

 

Comparison of Data Formats for S3/Glue Analytics

 

| Format | Storage Efficiency/Compression | Query Performance (Analytical) | Schema Evolution Support | Write Cost (Small Records) | Best Use Case |
| --- | --- | --- | --- | --- | --- |
| CSV | Low | Poor (full row scan) | Limited | Low | Simple logs, small datasets, human readability |
| JSON | Medium (with gzip) | Fair (full row scan, parsing overhead) | Good | Low | Event streams, semi-structured data, small batches |
| Parquet | High (columnar compression) | Excellent (columnar reads) | Excellent | High (requires batching) | Data Lake Analytics, large datasets, complex queries |
| ORC | High (columnar compression) | Excellent (columnar reads) | Excellent | High (requires batching) | Data Lake Analytics, large datasets, complex queries (similar to Parquet) |

 

B. Integrating with AWS Glue Data Catalog

 

The AWS Glue Data Catalog plays a pivotal role in transforming raw data in S3 into a structured and discoverable asset for analytics.

 

Role of AWS Glue Data Catalog

 

The Data Catalog functions as a unified metadata repository for data stored in S3.53 It allows other AWS analytics services (e.g., Athena, EMR, Redshift Spectrum) to discover, understand, and query the data by providing schema definitions, table locations, and partitioning information.

 

Automated Schema Inference with Glue Crawlers

 

AWS Glue Crawlers are instrumental in automatically discovering data, inferring schemas, and populating the Data Catalog.55 Crawlers can classify data formats, group data into tables or partitions, and write the inferred metadata to the Data Catalog.57 They infer schemas by reading a sample of data (e.g., the first 1MB or 1000 records for CSV/JSON, or directly from Parquet file headers).57
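To illustrate, a crawler over the replicated S3 prefix could be registered programmatically. This is a sketch under assumptions: the crawler name, role ARN, database, and S3 path are hypothetical, and the helper only assembles the keyword arguments that would be passed to the Glue create_crawler API.

```python
"""Sketch: assembling the arguments for a Glue crawler that catalogs the
replicated OpenSearch data in S3. All names below are placeholders."""


def build_crawler_config(name, role_arn, database, s3_path):
    """Return the keyword arguments for glue.create_crawler()."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Update catalog tables in place when the crawler detects schema drift,
        # and deprecate (rather than delete) tables whose data disappears.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    }


# Usage (requires AWS credentials; commented out in this sketch):
#   import boto3
#   glue = boto3.client("glue", region_name="us-east-1")
#   glue.create_crawler(**build_crawler_config(
#       "opensearch-replica-crawler",
#       "arn:aws:iam::123456789012:role/GlueCrawlerRole",
#       "opensearch_replica_db",
#       "s3://my-replica-bucket/data/",
#   ))
```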

 

Best Practices for Data Partitioning in S3 for Glue Data Catalog

 

Effective data lake integration hinges on proactive schema management and a well-defined partitioning strategy. For large datasets, strategically partitioning data in S3 based on anticipated query patterns (e.g., by date, customer ID, or event type) is critical. This practice significantly reduces the amount of data scanned by query engines like Athena, leading to faster query execution and lower costs.52 AWS Glue supports native partitioning when writing DynamicFrames to Parquet using the partitionKeys option.61 To facilitate this, the replication pipeline (e.g., OpenSearch Ingestion S3 sink or Logstash S3 output) should be designed to write data in a partitioned manner, typically using S3 prefixes (folders) like transactions/year=YYYY/month=MM/day=DD/.14 A well-defined partitioning strategy from the outset prevents costly re-processing and slow queries later in the analytical workflow.
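As a small worked example of this layout, the helper below derives a Hive-style partition prefix from an event timestamp; the "transactions" dataset name is illustrative, not taken from any specific pipeline.

```python
"""Sketch: deriving a Hive-style S3 partition prefix (year=/month=/day=)
from an event timestamp, matching the layout queried via the Data Catalog."""
import datetime


def partition_prefix(dataset, ts):
    """Build an S3 key prefix like dataset/year=YYYY/month=MM/day=DD/."""
    return f"{dataset}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


# Example: place an object under its date partition.
ts = datetime.datetime(2025, 6, 18, 7, 55)
key = partition_prefix("transactions", ts) + "part-0000.parquet"
# → transactions/year=2025/month=06/day=18/part-0000.parquet
```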

 

Managing Schema Evolution in Glue Data Catalog

 

For data with evolving schemas, columnar formats like Parquet are more robust. Glue Crawlers can detect schema changes in subsequent runs and update the corresponding table definitions in the Data Catalog.58 This capability is vital for maintaining data usability as source schemas naturally evolve over time.

 

V. Cross-Cutting Concerns and Best Practices

 

Successful implementation of a near real-time data replication pipeline requires careful attention to security, performance, cost, monitoring, and data consistency.

 

A. IAM Permissions and Security

 

Adhering to the principle of least privilege is paramount: grant only the necessary permissions to IAM roles and users to minimize security risks and potential attack surfaces.63 Misconfigured IAM permissions are a leading cause of operational failures and security vulnerabilities, often manifesting as security_exception, repository_exception, or 403 Forbidden errors.38 Therefore, a structured approach to IAM is crucial.

 

OpenSearch Service Permissions:

 

  • For OpenSearch Ingestion to source from OpenSearch: The OpenSearch Ingestion pipeline role, specified via aws.sts_role_arn, needs permissions such as es:ESHttpGet to read data from the OpenSearch domain.15

  • For Logstash to read from OpenSearch: A dedicated OpenSearch user account for Logstash is required, mapped to roles with specific cluster permissions (e.g., cluster_monitor, cluster_composite_ops) and index permissions (e.g., write, create, delete, create_index for the target index, if Logstash is also writing to OpenSearch).18 When FGAC is enabled, Logstash must sign its requests with IAM credentials, requiring es:ESHttp* actions on the OpenSearch domain ARN for the associated IAM role.20

  • For Snapshot Management: The IAM user or role initiating snapshot operations (e.g., via Lambda) needs iam:PassRole permission to pass TheSnapshotRole to OpenSearch Service, along with es:ESHttpPut to register the snapshot repository.1 Additionally, if fine-grained access control is enabled, the manage_snapshots role in OpenSearch Dashboards must be explicitly mapped to the relevant IAM role.1

 

S3 Bucket Permissions:

 

  • For OpenSearch Ingestion S3 sink: The pipeline's IAM role requires s3:PutObject to write objects, s3:ListBucket to list bucket contents, and s3:GetBucketLocation to determine the bucket's region.12

  • For Logstash S3 output: The IAM role attached to the EC2 instance running Logstash (or the configured credentials) needs s3:PutObject to write, s3:GetObject to read (if needed for state management), and s3:ListBucket.21

  • For Snapshot repositories: The TheSnapshotRole used by OpenSearch Service for snapshots requires s3:PutObject, s3:GetObject, s3:ListBucket, and s3:DeleteObject permissions on the designated snapshot bucket.35

 

AWS Glue Permissions:

 

The IAM role assumed by AWS Glue crawlers and jobs requires the AWSGlueServiceRole managed policy. Additionally, it needs S3 permissions (s3:ListBucket, s3:GetObject for sources, and s3:PutObject, s3:DeleteObject for targets) on the relevant S3 buckets. If the S3 data is encrypted with AWS Key Management Service (KMS), kms:Decrypt permissions are also necessary for Glue to access the data.64

 

Data Encryption:

 

  • At Rest: OpenSearch Service supports encryption of data at rest using KMS.63 For S3 buckets, Server-Side Encryption (SSE-S3) or SSE-KMS should be enabled for data stored there.1 Manual snapshots are not encrypted by default but can be protected using S3-managed or KMS keys.1

  • In Transit: Node-to-node encryption and Transport Layer Security (TLS) should be enabled for all data transfer channels to ensure secure communication between components.63

 

Cross-Account Access:

 

If components of the replication pipeline reside in different AWS accounts (e.g., OpenSearch Service in one account, S3 target in another), careful configuration of cross-account IAM roles and S3 bucket policies is required to grant the necessary permissions.35

 

Essential IAM Permissions for Replication Components

 

| Component | Required Actions | Resource ARNs | Notes |
| --- | --- | --- | --- |
| OpenSearch Service (Source) | es:ESHttpGet, es:ESHttpPost, es:ESHttpDelete | arn:aws:es:region:account-id:domain/domain-name/* | For OpenSearch Ingestion source or Logstash input with IAM authentication |
| OpenSearch Ingestion (Pipeline Role) | es:ESHttpGet, s3:ListBucket, s3:GetBucketLocation, s3:PutObject, sqs:ReceiveMessage, sqs:DeleteMessage | arn:aws:es:region:account-id:domain/domain-name/*, arn:aws:s3:::bucket-name/*, arn:aws:sqs:region:account-id:queue-name | For sourcing from OpenSearch and sinking to S3 (if SQS notifications are used for S3 source processing) |
| S3 (Target Bucket) | s3:PutObject, s3:GetObject, s3:ListBucket, s3:DeleteObject | arn:aws:s3:::bucket-name, arn:aws:s3:::bucket-name/* | For OpenSearch Ingestion S3 sink, Logstash S3 output, and Snapshot repository |
| AWS Glue (Crawler/Job Role) | AWSGlueServiceRole (managed policy), s3:ListBucket, s3:GetObject, s3:PutObject, s3:DeleteObject, kms:Decrypt (if applicable) | arn:aws:s3:::source-bucket/*, arn:aws:s3:::target-bucket/*, arn:aws:kms:region:account-id:key/key-id | For crawling S3 data and transforming/writing data with Glue jobs |
| Logstash (EC2 Instance Role) | es:ESHttp*, s3:PutObject, s3:GetObject, s3:ListBucket | arn:aws:es:region:account-id:domain/domain-name/*, arn:aws:s3:::bucket-name/* | For Logstash running on an EC2 instance, assuming a role for OpenSearch input and S3 output |
| Snapshot Role (TheSnapshotRole) | sts:AssumeRole (trust policy), s3:PutObject, s3:GetObject, s3:ListBucket, s3:DeleteObject | arn:aws:s3:::snapshot-bucket, arn:aws:s3:::snapshot-bucket/* | For OpenSearch Service to manage snapshots in S3 |
| User/Role initiating Snapshot | iam:PassRole, es:ESHttpPut | arn:aws:iam::account-id:role/TheSnapshotRole, arn:aws:es:region:account-id:domain/domain-name/_snapshot/* | For the user or role that calls the snapshot API |

 

B. Performance, Scalability, and Cost Optimization

 

Designing a near real-time replication pipeline necessitates careful consideration of performance, scalability, and cost efficiency across all components.

 

Tuning OpenSearch for Export Workloads:

 

For snapshot-based approaches, scheduling snapshots during periods of low traffic is a recommended practice to minimize the load on the OpenSearch cluster.27 The introduction of OpenSearch optimized instances and shallow snapshot v2 significantly reduces snapshot creation time (to seconds) and minimizes I/O and network operations, improving the overall scalability of the snapshot process.24 For continuous polling mechanisms, such as the OpenSearch Ingestion source, tuning parameters like scheduling.interval and search_options.batch_size directly affects both the load on the source OpenSearch cluster and the end-to-end latency of the replication.15
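To make these knobs concrete, a minimal OpenSearch Ingestion pipeline definition might look like the following sketch. Key names follow Data Prepper's opensearch source and s3 sink conventions, but the host, role ARNs, bucket, and index pattern are placeholders, and exact options vary by pipeline version:

```yaml
version: "2"
opensearch-to-s3-pipeline:
  source:
    opensearch:
      hosts: ["https://my-domain.us-east-1.es.amazonaws.com"]
      indices:
        include:
          - index_name_regex: "orders.*"
      scheduling:
        interval: "PT30S"        # how often the source re-polls OpenSearch
      search_options:
        batch_size: 1000         # documents fetched per page; higher means fewer
                                 # round trips but more load on the source cluster
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
  sink:
    - s3:
        bucket: "my-datalake-bucket"
        object_key:
          path_prefix: "opensearch/orders/"
        codec:
          ndjson: {}             # newline-delimited JSON; convert to Parquet
                                 # downstream (e.g., with a Glue job)
        compression: gzip
        aws:
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
```

A shorter interval and smaller batch size lower end-to-end latency at the cost of more frequent queries against the source domain.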

 

Optimizing S3 and Glue for Efficient Data Processing:

 

  • Columnar Formats: As discussed, storing data in columnar formats like Parquet or ORC in S3 significantly reduces storage costs and dramatically improves query performance for analytical workloads.46

  • Partitioning: Implementing a strategic partitioning scheme for data in S3, aligned with common query access patterns, is crucial. This practice enables query engines like Athena to prune partitions, scanning only a subset of the data, which directly translates to lower query costs and faster performance.52

  • Compression: Applying appropriate compression (e.g., Snappy for Parquet) further reduces data size, minimizing storage costs and accelerating read/write operations.46
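To make the partitioning idea concrete, the small Python sketch below builds a Hive-style S3 object key (year=/month=/day=) of the kind Athena and Glue partition pruning rely on; the table prefix and file name are hypothetical:

```python
from datetime import datetime, timezone

def partition_key(table_prefix: str, event_time: datetime, file_name: str) -> str:
    """Build a Hive-style partitioned S3 object key (year=/month=/day=)
    so query engines can prune partitions instead of scanning everything."""
    return (
        f"{table_prefix}/"
        f"year={event_time.year:04d}/"
        f"month={event_time.month:02d}/"
        f"day={event_time.day:02d}/"
        f"{file_name}"
    )

key = partition_key(
    "datalake/orders",
    datetime(2025, 6, 18, tzinfo=timezone.utc),
    "part-0000.snappy.parquet",
)
print(key)  # datalake/orders/year=2025/month=06/day=18/part-0000.snappy.parquet
```

A query filtered on `year = 2025 AND month = 6` then reads only the matching prefixes, which is what keeps Athena scan costs low.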

 

Cost Implications of Different Replication Strategies:

 

  • OpenSearch Ingestion: This service operates on a serverless, pay-per-use model, where costs are based on OpenSearch Compute Units (OCUs) consumed for ingestion. Its auto-scaling capability helps prevent over-provisioning, aligning costs more closely with actual workload demands.7

  • Logstash: A Logstash-based pipeline requires provisioning and managing EC2 instances, incurring continuous compute costs regardless of data volume. Scaling Logstash typically involves manual configuration or setting up auto-scaling groups, which adds operational complexity and can lead to less efficient cost utilization for fluctuating loads.23

  • Snapshots: Storing manual snapshots in S3 incurs standard S3 storage charges.1 Automated snapshots provided by OpenSearch Service are typically free for recovery purposes.1 Any subsequent post-processing (e.g., using AWS Glue jobs to extract changes) will incur additional compute and storage costs.

The "near real-time" requirement directly influences cost and architectural complexity, generally favoring serverless solutions. For truly dynamic streaming workloads, OpenSearch Ingestion's serverless model scales automatically and aligns costs with actual consumption rather than requiring fixed provisioned capacity, making it a strong recommendation for dynamic workloads.

 

C. Monitoring, Alerting, and Observability

 

Proactive monitoring is paramount for maintaining the "near real-time" guarantee of any replication pipeline. Any disruption, whether due to network issues, resource bottlenecks, or service-specific problems, can compromise the desired latency.

 

Leveraging Amazon CloudWatch for Metrics and Logs:

 

  • OpenSearch Service automatically emits performance metrics (e.g., CPU utilization, JVM pressure) to Amazon CloudWatch, which should be regularly reviewed.66

  • Enabling log publishing (error logs, search slow logs, indexing slow logs, and audit logs) from OpenSearch Service to CloudWatch Logs provides invaluable data for troubleshooting performance and stability issues.66

  • OpenSearch Ingestion pipelines also support performance monitoring in CloudWatch and error logging in CloudWatch Logs, offering visibility into pipeline health and processing errors.8

 

Setting up Alerts for Replication Lag and Errors:

 

Configuring CloudWatch alarms for critical metrics (e.g., high CPU utilization, high JVM pressure on OpenSearch, pipeline processing errors, or replication lag) is crucial for immediate detection and response.27 Amazon EventBridge can be configured to receive events from OpenSearch Serverless (e.g., OCU usage thresholds) and trigger automated actions or notifications via Amazon SNS topics or Amazon SQS queues.33 Automated alerts are essential for ensuring operational stability and adherence to latency requirements.
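For example, an alarm on sustained JVM memory pressure could be defined with a parameter file like the sketch below and created via `aws cloudwatch put-metric-alarm --cli-input-json file://alarm.json`. The domain name, account ID, and SNS topic ARN are placeholders, and the threshold should be tuned to the workload:

```json
{
  "AlarmName": "opensearch-high-jvm-pressure",
  "Namespace": "AWS/ES",
  "MetricName": "JVMMemoryPressure",
  "Dimensions": [
    { "Name": "DomainName", "Value": "my-domain" },
    { "Name": "ClientId", "Value": "123456789012" }
  ],
  "Statistic": "Maximum",
  "Period": 300,
  "EvaluationPeriods": 3,
  "Threshold": 80,
  "ComparisonOperator": "GreaterThanOrEqualToThreshold",
  "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:replication-alerts"]
}
```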

 

D. Data Consistency and Error Handling

 

Ensuring data consistency in a near real-time, distributed replication pipeline requires careful design for idempotency and robust error handling. Due to the continuous flow of data and the inherent distributed nature of the systems involved, transient failures, network issues, or processing delays can lead to duplicates, out-of-order delivery, or even data loss if not properly addressed.

 

Strategies for Ensuring Data Integrity:

 

  • OpenSearch Ingestion: While the provided information does not explicitly detail transactionality from the OpenSearch source to the S3 sink, the managed nature of OpenSearch Ingestion and its internal batching mechanisms imply built-in reliability. Furthermore, processors within the pipeline can be used for data validation and transformation to ensure data quality before it reaches the sink.

  • Logstash: Logstash uses sincedb files to track its read position in input sources, which helps prevent re-processing of the same data on restarts and contributes to data consistency.19 For enhanced durability, Logstash can be configured with Persistent Queues (PQ) to buffer events to disk, safeguarding against data loss during process failures. Dead Letter Queues (DLQ) are critical for isolating and managing events that fail to process or deliver to the sink, allowing for manual inspection and re-processing.23

  • Idempotency: Designing the target data load to be idempotent is a crucial strategy. This means that re-processing the same record multiple times should not lead to unintended side effects or duplicate data in the target. For S3-based data lakes, this often involves implementing upsert logic (update if exists, insert if new) based on a unique key, or overwriting files for specific partitions.
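The upsert idea can be sketched in-memory: keyed writes make a replayed batch a no-op, which is exactly the property an idempotent S3/Glue load needs. This toy Python example uses hypothetical record and field names:

```python
def upsert(target, records, key="id"):
    """Idempotently merge records into target keyed by a unique id:
    update if the key exists, insert if it is new. Replaying the same
    batch leaves the target unchanged (no duplicates)."""
    for record in records:
        target[record[key]] = record
    return target

batch = [{"id": "a1", "status": "shipped"}, {"id": "b2", "status": "pending"}]
table = {}
upsert(table, batch)
upsert(table, batch)  # replay of the same batch: still exactly two rows
assert len(table) == 2
assert table["a1"]["status"] == "shipped"
```

In an S3 data lake the same effect is typically achieved by overwriting a deterministic object key per record batch or partition, or by merge-on-read table formats.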

 

Mechanisms for Handling Failures and Retries:

 

Managed services like OpenSearch Ingestion typically handle internal retries and error propagation automatically. For self-managed solutions like Logstash, DLQs are indispensable for capturing and managing failed events, providing a mechanism for manual intervention and recovery.23 Comprehensive logging to CloudWatch Logs and the configuration of alarms provide critical visibility into errors, enabling prompt troubleshooting and resolution.8
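In practice, Logstash's durability features are switched on in logstash.yml; the following sketch shows representative settings (the DLQ path and queue size are illustrative and should be tuned per workload):

```yaml
# logstash.yml -- durability settings (sketch)
queue.type: persisted             # buffer in-flight events to disk (Persistent Queue)
queue.max_bytes: 4gb              # cap on-disk queue size
dead_letter_queue.enable: true    # capture events that fail to process or deliver
path.dead_letter_queue: /var/lib/logstash/dlq
```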

 

VI. Conclusion and Recommendations

 

Based on the detailed investigation into approaches for near real-time data replication from AWS OpenSearch Service to Amazon S3 and AWS Glue Data Catalog, the following conclusions and recommendations are presented:

 

Summary of the Most Suitable Approaches:

 

  • Primary Recommendation: OpenSearch Ingestion (Data Prepper): This is the most robust, AWS-native, and operationally efficient solution for achieving near real-time, CDC-like replication from OpenSearch Service to S3/Glue Data Catalog. Its serverless nature, automatic scaling capabilities, and the native opensearch source plugin for Data Prepper 7 significantly minimize operational overhead and align well with dynamic workloads. It is the ideal choice for organizations prioritizing managed services, rapid deployment, and a consumption-based cost model.

  • Secondary Recommendation: Logstash-Based Pipeline: This approach serves as a viable alternative for organizations that require greater control over the data processing environment, have complex and highly custom transformation requirements, or possess existing expertise in managing Logstash deployments. While offering flexibility, it introduces more operational management for deployment, scaling, and ongoing maintenance. Its sincedb and Dead Letter Queue features provide good control over data consistency and error handling.19

  • Not Recommended: Snapshot-Based Replication: While OpenSearch snapshots are essential for disaster recovery, data migration, and archiving, they are inherently point-in-time backups. They do not provide true near real-time CDC, as changes occurring during the snapshot window are typically not included.2 Implementing post-processing to extract incremental changes from snapshots is complex and introduces significant latency, making it unsuitable for the "near real-time" requirement.

  • Not Viable: AWS Database Migration Service (DMS) and Kinesis Data Firehose: Neither AWS DMS nor Kinesis Data Firehose support OpenSearch Service as a source for data replication. Both services are designed for data ingestion into OpenSearch, not for extracting data from it.9

 

Comparison of Replication Approaches

 

| Approach | Latency (CDC) | Management Overhead | Cost Model | Data Consistency (CDC) | Complexity | Ideal Use Case |
| --- | --- | --- | --- | --- | --- | --- |
| OpenSearch Ingestion | Seconds to minutes | Low | Serverless/consumption-based | Near real-time with strong guarantees | Low to Medium | Managed streaming, dynamic workloads, rapid deployment |
| Logstash-Based Pipeline | Minutes to hours | Medium to High | Provisioned/EC2-based | Near real-time with configurable guarantees (PQ/DLQ) | Medium to High | Custom ETL, complex transformations, existing Logstash expertise |
| Snapshot-Based | Hours to days | Low to Medium | Storage-based (S3) + post-processing compute | Point-in-time | Low (for backup) to High (for CDC simulation) | Disaster recovery, archiving, data migration |

 

Future Considerations and Evolving AWS Capabilities:

 

It is advisable to continuously monitor updates from AWS OpenSearch Service and OpenSearch Ingestion for potential enhancements in CDC capabilities or new direct integrations. Further advancements in OpenSearch features, such as the search_after parameter and Point in Time API, may offer even more efficient mechanisms for data extraction. The new OR1 instances for OpenSearch Service, which leverage S3 as primary storage for improved durability and snapshot efficiency 67, represent an evolution in OpenSearch's internal storage architecture, though their primary benefit is for OpenSearch's internal operations rather than direct external data export.
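For illustration, a deep-pagination request against a hypothetical orders index (with an updated_at timestamp and an order_id tiebreaker field) might look like the sketch below, sent to the index's _search endpoint; search_after carries the sort values of the last document from the previous page, and the field names and values are placeholders:

```json
{
  "size": 1000,
  "query": {
    "range": { "updated_at": { "gt": "2025-06-18T00:00:00Z" } }
  },
  "sort": [
    { "updated_at": "asc" },
    { "order_id": "asc" }
  ],
  "search_after": ["2025-06-18T00:00:00Z", "order-000123"]
}
```

Combining this with a Point in Time keeps pagination stable against a consistent view of the index while new writes continue to arrive.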

References

  • opensearch.org — s3 source - OpenSearch Documentation
  • docs.opensearch.org — Cross-cluster replication security - OpenSearch Documentation
  • cloudtechsavvy.com — CSV, JSON conversion into Parquet - Cloud Tech Savvy
  • reddit.com — CSV format VS parquet for storing user usage data in S3 : r/dataengineering
  • reddit.com — Which format to store data in S3 for best future proofing?
  • infoq.com — Amazon OpenSearch Zero ETL with S3 and New OR1 Instances - InfoQ
  • serverlessland.com — Amazon Kinesis Data Streams to Amazon OpenSearch via Amazon Kinesis Data Firehose | Serverless Land
  • aws.amazon.com — Open-Source Data Ingestion – Amazon OpenSearch Service
  • docs.aws.amazon.com — Overview of Amazon OpenSearch Ingestion
  • docs.aws.amazon.com — Using an Amazon OpenSearch Service cluster as a target for AWS Database Migration Service
  • docs.aws.amazon.com — Sources for data migration - AWS Database Migration Service
  • docs.aws.amazon.com — Creating an Amazon S3 Tables catalog in the AWS Glue Data Catalog
  • docs.aws.amazon.com — Accessing Amazon S3 tables using the AWS Glue Iceberg REST endpoint
  • repost.aws — How does the AWS Glue crawler detect the schema?
  • aws-dojo.com — AWS Glue Studio Enhancements - Spark SQL, Catalog Target & Infer S3 Schema
  • docs.aws.amazon.com — Using an OpenSearch Ingestion pipeline with Amazon S3
  • opensearch.org — s3 - OpenSearch Documentation
  • docs.aws.amazon.com — Working with Amazon OpenSearch Service direct queries - Amazon ...
  • opensearch.org — opensearch - OpenSearch Documentation
  • elastic.co — Deploying and Scaling Logstash - Elastic
  • support.nagios.com — Logstash Amazon S3 Output Plugin - Nagios Support Forum
  • discuss.elastic.co — S3 output using IAM roles that gives access to AWS security credentials - Elastic Discuss
  • dev.to — Working with Amazon OpenSearch Service Direct Queries with Amazon S3: The First-Ever Detailed Guide - DEV Community
  • docs.aws.amazon.com — Operational best practices for Amazon OpenSearch Service
  • docs.aws.amazon.com — Step 2: Create an IAM role for AWS Glue - AWS Glue
  • dev.to — AWS Data Lake Best Practices for Machine Learning Feature Engineering - DEV Community
  • docs.aws.amazon.com — AWS Glue Documentation
  • docs.aws.amazon.com — What is AWS Glue? - AWS Glue
  • docs.opensearch.org — Logstash - OpenSearch Documentation
  • instaclustr.com — Connect Logstash to OpenSearch - NetApp Instaclustr
  • docs.aws.amazon.com — Step 2: Choose data sources and classifiers - AWS Glue
  • docs.aws.amazon.com — Supported data sources for crawling - AWS Glue
  • docs.aws.amazon.com — Best practices - AWS Prescriptive Guidance
  • stackoverflow.com — AWS Glue write parquet with partitions - Stack Overflow
  • repost.aws — Connect to Amazon OpenSearch Service using Filebeat and Logstash | AWS re:Post
  • https://docs.opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/opensearch/
  • https://github.com/awsdocs/amazon-opensearch-service-developer-guide/blob/master/doc_source/pipeline-domain-access.md#pipeline-access-domain
  • https://github.com/awsdocs/amazon-opensearch-service-developer-guide/blob/master/doc_source/ingestion.md
  • https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-self-managed-opensearch.html