How to determine what database technology to use?

To Nha Notes | Aug. 13, 2023, 5:24 p.m.

The answers to these questions will determine what database technology to use.

  • How often will the data be retrieved?
  • How fast should it be accessed?
  • Will the data be updated often, or will it be primarily new?
  • How often will the data be ingested?
  • How fast does ingestion need to be?
  • Will the ingested data be sent in batches or in real time?
  • How many users will be consuming the data?
  • How many simultaneous processes will there be for ingestion?

Two technologies have been the standards to address these questions for many years: online transaction processing (OLTP) systems and online analytics processing (OLAP) systems. The main question that needs to be answered is - is it more important for the database to perform during data ingestion or retrieval? These databases can be divided into two categories depending on the use case; they need to be read-heavy or write-heavy.

Online Transaction Processing (OLTP) systems

OLTP databases’ main characteristics are the fact that they process a large number of transactions (such as inserts and updates). The focus in OLTP systems is placed on fast ingestion and modification of data while maintaining data integrity, typically in a multi-user environment with less emphasis on the retrieval of the data. OLTP performance is generally measured by the number of transactions executed in a given time (usually seconds). Data is typically stored using a schema that has been normalized, usually using the 3rd normal form (3NF). Before moving on, let’s quickly discuss 3NF. 3NF is a state that a relational database schema design can possess.

A table using 3NF will reduce data duplication, minimize data anomalies, guarantee referential integrity, and increase data management. 3NF was first specified in 1971 by Edgar F. Codd, the inventor of the relational model for database management.

A database relation (for example, a database table) meets the 3NF standard if each table’s columns only depend on the table’s primary key. Let’s look at an example of a table that fails to meet 3NF. Let’s say you have a table that contains a list of employees. This table, in addition to other columns, contains the employee’s supervisor’s name as well as the supervisor’s phone number. A supervisor can undoubtedly have more than one employee under supervision, so the supervisor’s name and phone number will be repeated for employees working under the same supervisor. To resolve this issue, we could add a supervisor table, put the supervisor’s name and phone number in the supervisor table, and remove the phone number from the employee table.

The following table shows a comparison between OLTP and OLAP:

Let’s go further and learn about the various kinds of database services available in AWS and how they fit to address different workload needs.

Choosing the right tool for the job

Suppose you have large datasets with many connections between them. For instance, a sports company should link its athletes with its followers and provide personalized recommendations based on the interests of millions of users. Managing all these connections and providing fast queries can be challenging with traditional relational databases. In this case, you can use Amazon Neptune, a graph database designed to efficiently handle complex queries with interconnected data.

A wide-column database is an excellent choice for applications that require fast data processing with low latency, such as industrial equipment maintenance, trade monitoring, fleet management, and route optimization. Amazon Keyspaces for Apache Cassandra provides a wide-column database option that allows you to develop applications that can handle thousands of requests per second with practically unlimited throughput and storage.

 

You should spend a significant amount of time clearly articulating the business problem you are trying to solve. Some of the questions the requirements should answer are as follows:

  • How many users are expected?
  • How many transactions per day will occur?
  • How many records need to be stored?
  • Will there be more writes or reads?
  • How will the data need to be accessed (only by primary key, by filtering, or some other way)?

In instances where there is a lot of data and it needs to be accessed quickly, NoSQL databases might be a better solution. 

Depending on your project, you might want to draw up a Proof of Concept using a couple of options to determine which option performs better and fits your needs better.

Allowing users to enter records without the benefit of a schema may lead to inconsistent data, which becomes too variable and creates more problems than it solves. Just because we can create databases without a schema in a NoSQL environment, we should not forgo validation checks before creating a record. If possible, a validation scheme should be implemented, even when using a NoSQL option.

It is true that going schema-less increases implementation agility during the data ingestion phase. However, it increases complexity during the data access phase. So, make your choice by making a required trade-off between data context vs. data performance.

 

References

AWS for Solutions Architects - Second Edition

Saurabh ShrivastavaNeelanjali SrivastavAlberto ArtasanchezImtiaz SayedDr. Siddhartha Choubey Ph.D