eBay Architect Jay Patel recently posted an article about data modeling using the Cassandra data store. In his article, he breaks down how his team modeled their data in Cassandra, how they approached the use of Columns and Column Families, and how they optimized their queries. The post is very detailed and a great read.
What I enjoyed most about the article was the high-level approach that Jay and his team took. Here are my favorite takeaways from their approach to data modeling and query optimization, which I believe can be applied to any NoSQL database, including Cassandra, MongoDB, Redis, and others.
Model the Domain and Relationships
“It’s important to understand and start with entities and relationships…”
Jay reminds us that we must first understand the problem domain, then model the entities involved and the relationships between them. This may take the form of a domain model or an entity relationship diagram (ERD). Many people cringe at this step, preferring to focus on code rather than what is often considered “wasteful documentation”, but I prefer to start this way as well. This approach ensures a solid grasp of the problem domain, including its key concepts and relationships. From there, a data model can be built that will satisfy the problem domain.
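To make this step concrete, here is a minimal sketch of a domain model written down before any storage decisions are made. The User, Item, and Like entities and their fields are hypothetical, chosen only for illustration, and are not taken from Jay's data model.

```python
from dataclasses import dataclass


# Hypothetical entities for illustration only.
@dataclass(frozen=True)
class User:
    user_id: str
    name: str


@dataclass(frozen=True)
class Item:
    item_id: str
    title: str


# The relationship between the two entities: a user likes an item.
# It is many-to-many, so it is modeled as its own record.
@dataclass(frozen=True)
class Like:
    user_id: str
    item_id: str


users = [User("u1", "Ann"), User("u2", "Bob")]
items = [Item("i1", "Camera"), Item("i2", "Lens")]
likes = [Like("u1", "i1"), Like("u2", "i1"), Like("u1", "i2")]

# At this stage the model only captures what exists and how it relates;
# it says nothing yet about how the data will be stored or queried.
liked_by_ann = [like.item_id for like in likes if like.user_id == "u1"]
```

Nothing here is specific to Cassandra or any other store, which is the point: the entities and relationships come first, and the storage layout follows from them.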
Identify Query Patterns and Denormalization
“…then continue modeling around query patterns by de-normalizing and duplicating.”
You cannot optimize your data model until you understand how you will access it. With relational databases, this is usually the point at which you discover that queries are slow because of missing indexes or overly complex joins. You then look for ways to optimize the database, including denormalizing the data to avoid unnecessary joins or N+1 queries. Just keep in mind that, even with NoSQL data stores, denormalization comes with a price: if the duplicated data can change in the future, every copy has to be updated, which complicates writes.
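The trade-off can be sketched with plain Python dictionaries standing in for query-specific tables. This is an illustration under assumptions, not Cassandra code: the table names, the record_like and rename_item helpers, and the sample data are all hypothetical.

```python
from collections import defaultdict

# One "table" per query pattern, each keyed by what the query looks up.
# The item title and user name are deliberately duplicated so that each
# read is a single lookup with no join.
items_by_user = defaultdict(dict)  # user_id -> {item_id: item_title}
users_by_item = defaultdict(dict)  # item_id -> {user_id: user_name}


def record_like(user_id, user_name, item_id, item_title):
    """Write the same fact into both query-specific tables."""
    items_by_user[user_id][item_id] = item_title
    users_by_item[item_id][user_id] = user_name


def rename_item(item_id, new_title):
    """Update every duplicated copy of the title; return the number of writes."""
    writes = 0
    for user_id in users_by_item.get(item_id, {}):
        items_by_user[user_id][item_id] = new_title
        writes += 1
    return writes


record_like("u1", "Ann", "i1", "Camera")
record_like("u2", "Bob", "i1", "Camera")

# Reads are cheap: one lookup per query pattern.
anns_items = items_by_user["u1"]
fans_of_camera = users_by_item["i1"]

# Updates are where the price is paid: one write per duplicated copy.
writes_needed = rename_item("i1", "Vintage Camera")
```

With two users liking the same item, renaming that item takes two writes instead of one, and the number grows with every additional copy of the title.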
Recognize That Your Data Model Will Be Case-Specific
“Remember that there are many ways to model. The best way depends on your use case and query patterns.”
Always evaluate your data model against the intended use cases. While many Cassandra examples involve time series data and log capture, the way you intend to query and manipulate your own data will require you to adapt the common examples documented by others. Find the right fit for your specific needs, even if your final data model looks a little different from the reference models commonly cited.
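As a rough illustration, the same time series data can be laid out in two different ways depending on the question being asked. The sensor names, readings, and both layouts below are hypothetical and are not drawn from any reference model.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical readings: (sensor, timestamp, value).
events = [
    ("sensor-a", datetime(2024, 1, 1, 9, 0), 20.5),
    ("sensor-a", datetime(2024, 1, 1, 9, 5), 21.0),
    ("sensor-a", datetime(2024, 1, 2, 8, 0), 19.0),
    ("sensor-b", datetime(2024, 1, 1, 9, 1), 30.0),
]

# Layout 1: grouped by (sensor, day), sorted by time.
# Suits the query "show all readings for a sensor on a given day".
readings_by_sensor_day = defaultdict(list)
for sensor, ts, value in events:
    readings_by_sensor_day[(sensor, ts.date())].append((ts, value))
for rows in readings_by_sensor_day.values():
    rows.sort()

# Layout 2: only the most recent reading per sensor.
# Suits the query "what is each sensor reporting right now".
latest_by_sensor = {}
for sensor, ts, value in events:
    current = latest_by_sensor.get(sensor)
    if current is None or ts > current[0]:
        latest_by_sensor[sensor] = (ts, value)
```

Neither layout is more correct than the other. Each one answers its own query with a single lookup and would be a poor fit for the other query.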