I always start my classes with this: It’s called optimization, not minimization or maximization. Why? Because when you make something faster, you’re usually making something else slower. It’s always a tradeoff, so before you tweak anything, you need to be very conscious of your business priorities, so you can decide whether the thing you’re speeding up is worth the sacrifice of the thing you’re slowing down.
Broadly, there are two types of data operations – writing (transactions) and reading (analytics). Usually, when you’re making a write operation faster, you’re making a read operation slower, and vice-versa.
Writing: Inserts, Updates, Data Integrity & Normalization
The fastest way to do write operations is to simply store the data in the same form as you receive it. However, things get tricky when you want to update that data. If you want the inventory count of an item to be deducted once you’ve sold it, or if you want a change to a customer’s address to reflect correctly everywhere, you need to organize your data in a structure so that there’s a single source of truth for certain things. What you want is called data integrity, the structure is called a schema, and organizing things so that each fact has a single source of truth is called normalization.
Adding a structure imposes a cost on inserts, since you need to split up the data and place each piece in its appropriate table and field, but doing so allows you to maintain data integrity.
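As a rough sketch of what that structure might look like in PostgreSQL – the table and column names here are purely illustrative, not from any particular system – a normalized schema for the examples above could be:

    -- Illustrative normalized schema: each fact lives in exactly one place.
    CREATE TABLE customers (
        customer_id BIGSERIAL PRIMARY KEY,
        name        TEXT NOT NULL,
        address     TEXT NOT NULL        -- change it here and it changes everywhere
    );

    CREATE TABLE products (
        product_id  BIGSERIAL PRIMARY KEY,
        name        TEXT NOT NULL,
        inventory   INTEGER NOT NULL     -- single source of truth for stock on hand
    );

    CREATE TABLE orders (
        order_id    BIGSERIAL PRIMARY KEY,
        customer_id BIGINT NOT NULL REFERENCES customers (customer_id),
        ordered_at  TIMESTAMPTZ NOT NULL DEFAULT now()
    );

    CREATE TABLE order_items (
        order_id    BIGINT NOT NULL REFERENCES orders (order_id),
        product_id  BIGINT NOT NULL REFERENCES products (product_id),
        quantity    INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );

Recording a single sale now means inserting into orders and order_items and updating products.inventory – that’s the extra write effort we pay for integrity.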
Reading: Indexing & Denormalization
If I asked you to look for a customer name in an unordered list of customer records in a physical logbook that was 100 pages long, you’d need to go through each page one-by-one until you found the name. However, if the names were in alphabetical order, then you’d be a lot quicker.
Now imagine you’re the one maintaining customer records in the physical logbook, in pencil. If there was no order, then you would just need to add every new customer to the end of the logbook. However, if you needed to keep the records alphabetized by customer name, then for each new customer, you would need to find its place in the order, erase all succeeding records, and then rewrite them. It would be just as much of a hassle if a customer changed their name. Maintaining this order now makes writes more costly.

What if you were also tasked with providing a quick way to find customer records by date of registration, by address, or by other fields? You’d probably need to maintain multiple logbooks, and every insert or update would take multiple times the effort.
In databases, structures that provide some sort of order to specific fields of the data are called indexes. Indexes will speed up finding the records you want if the indexes are on the fields you are searching by, but each additional index will slow down write operations, since the order of data in each index needs to be maintained. Also, if your WHERE clauses apply functions or expressions to a field, the index you created for that field won’t work; you’d need to create an index built on that same function or expression.
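Here’s a minimal sketch of this in PostgreSQL, against the illustrative customers table above:

    -- A plain index speeds up searches on the indexed field...
    CREATE INDEX idx_customers_name ON customers (name);

    -- ...so this query can use it:
    SELECT * FROM customers WHERE name = 'Alice Reyes';

    -- But wrapping the field in a function defeats that index:
    SELECT * FROM customers WHERE lower(name) = 'alice reyes';

    -- You'd need an expression index built on the same function:
    CREATE INDEX idx_customers_name_lower ON customers (lower(name));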
Another cost when reading is the cost of JOINs. For example, if you’re going to generate a sales report, you’d need to match data from Orders, Order Items, Products, Customers… Our transactional data was split up into multiple normalized tables for data integrity, but now it takes time to find and match those records from multiple tables to generate our report. Another problem is that, depending on your database’s locking and isolation settings, reading all those records can lock them up, which means any transactions needing to modify those records will have to wait – a common scenario where users find their system slowing to a crawl while a report is being generated. Aggregation, ordering, and grouping of data also incur cost.
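To make that cost concrete, a hypothetical sales report against the illustrative tables above might look like this – every JOIN, GROUP BY, and ORDER BY is work the database does at read time:

    -- Illustrative report query: joining, grouping, and ordering all cost time.
    SELECT c.name           AS customer,
           p.name           AS product,
           SUM(oi.quantity) AS units_sold
    FROM orders o
    JOIN order_items oi ON oi.order_id   = o.order_id
    JOIN products    p  ON p.product_id  = oi.product_id
    JOIN customers   c  ON c.customer_id = o.customer_id
    WHERE o.ordered_at >= date_trunc('month', now())
    GROUP BY c.name, p.name
    ORDER BY units_sold DESC;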
If you’re running into this problem, you might consider creating additional analytical tables with data that has been pre-joined, as well as pre-aggregated, pre-ordered, and pre-grouped if needed. Creating these additional tables for analytical operations is called denormalization. You’re not getting rid of your normalized tables; you’re just creating new tables that hold data in a form that’s ready to read for specific reports or views. Sometimes these analytical tables are placed on a separate database server dedicated to reports or views. If the data for these analytical database servers comes from multiple sources, then you have what’s called a data warehouse.
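In PostgreSQL, one way (among several) to keep a pre-joined, pre-aggregated copy is a materialized view; a plain summary table you maintain yourself works just as well. Continuing the illustrative example:

    -- A denormalized, read-ready copy of the report data.
    CREATE MATERIALIZED VIEW monthly_sales AS
    SELECT date_trunc('month', o.ordered_at) AS month,
           c.name                            AS customer,
           p.name                            AS product,
           SUM(oi.quantity)                  AS units_sold
    FROM orders o
    JOIN order_items oi ON oi.order_id   = o.order_id
    JOIN products    p  ON p.product_id  = oi.product_id
    JOIN customers   c  ON c.customer_id = o.customer_id
    GROUP BY 1, 2, 3;

    -- Reports now read from one flat structure, with no joins at query time.
    SELECT * FROM monthly_sales WHERE month = date_trunc('month', now());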
Strong Consistency vs Eventual Consistency
The problem we encounter when we add analytical tables is how to keep those tables consistent with changes to the transactional tables. If the changes are propagated in real time (e.g. using triggers), this incurs a cost on writing, since each change also has to update the relevant analytical tables.
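As a sketch of that real-time (strongly consistent) approach in PostgreSQL – daily_product_sales is a hypothetical summary table, and a trigger is just one way to keep it in sync – notice that every insert into order_items now also pays for the summary update:

    -- Hypothetical summary table kept strongly consistent by a trigger.
    CREATE TABLE daily_product_sales (
        day        DATE   NOT NULL,
        product_id BIGINT NOT NULL,
        units_sold BIGINT NOT NULL DEFAULT 0,
        PRIMARY KEY (day, product_id)
    );

    CREATE OR REPLACE FUNCTION bump_daily_sales() RETURNS trigger AS $$
    BEGIN
        INSERT INTO daily_product_sales (day, product_id, units_sold)
        VALUES (current_date, NEW.product_id, NEW.quantity)
        ON CONFLICT (day, product_id)
        DO UPDATE SET units_sold = daily_product_sales.units_sold + NEW.quantity;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    -- Every write to order_items now also updates the summary, synchronously.
    CREATE TRIGGER trg_bump_daily_sales
    AFTER INSERT ON order_items
    FOR EACH ROW EXECUTE FUNCTION bump_daily_sales();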
We can avoid the slower writes by making the analytical table updates asynchronous. The updates can be done in a separate thread or process, or scheduled for when the server is not busy. This is called eventual consistency. Of course, users will experience a delay in seeing the freshest data, so again, as with everything in software engineering, you need to look at your business priorities to decide on the tradeoffs in performance among your various operations. If users don’t need to see the freshest data immediately but you need to maintain high transaction throughput, then you should opt for eventual consistency.
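Sticking with the illustrative monthly_sales materialized view from earlier, the eventually consistent version can be as simple as refreshing it on a schedule – how you schedule it (cron, pg_cron, an application job) is up to you:

    -- CONCURRENTLY lets reports keep reading during the refresh,
    -- but it requires a unique index on the materialized view.
    CREATE UNIQUE INDEX idx_monthly_sales_key
        ON monthly_sales (month, customer, product);

    REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_sales;

    -- With the pg_cron extension (one scheduling option among many),
    -- the refresh could run every 10 minutes:
    -- SELECT cron.schedule('refresh-monthly-sales', '*/10 * * * *',
    --     'REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_sales');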
What Next?
In my data optimization classes, I cover the following additional topics:
- Understanding query plans
- Tweaking memory parameters
- Avoiding deadlocks
- Index types & when to use them
- Storage formats & when to use them
- Replication & backup
- Recovering storage
- Keeping the query planner’s statistics up-to-date
- Partitioning & archiving
I use PostgreSQL for the hands-on exercises, but the lessons can be applied to any modern RDBMS.
If interested in training or consulting, contact [email protected].