#GlueCon 2013 Notes: Processing Big Data In Real-time with a Distributed SQL System – Ankur Goyal, MemSQL

1 minute read

Processing Big Data In Real-time with a Distributed SQL System – Ankur Goyal, MemSQL

“SQL scales, but the technology behind it may not”
When you write SQL against a distributed system, there is more to take into account than indexes and other SQL optimizations: how your data is distributed across partitions
They compute a hash on the PK and assign it to a partition
Sidenote: Added 90M rows to a SQL database in the first 2-3 minutes of their talk
If the PK isn’t specified in a query, it will scatter-gather query and aggregate the result – cost to queries
Indexes still need to be considered, as they will narrow the data set on each partition during scatter-gather queries
MemSQL uses the equiv to MapReduce to scatter-gather query partitions
Each core is assigned a slice of data for a partition
Sidenote: they are using clusterSSH and htop to visualize engagement of each partition+core and impact of queries
Caution: joining rows from one side of the cluster to another is expensive, so understand your queries and cardinalities!
They use reference tables to store and replicate smaller tables, allowing each partition to join locally – nice!
For high cardinality joins, try to reduce cardinality in subqueries first
PK-to-FK relationships can be defined and then joined locally
Many analytics tools require you to ETL data from primary datastores, MemSQL removes this need and allows for analyzing transactional data
They do not support fully-distributed transactions (yet), but will wrap inserts/updates in transactions
Both Synchronous (i.e. consistency) or Asynchronous replication are available
Default is redundancy level 2 (can be changed), uses replication internally for redundancy
2 Tiers – aggregators (metadata) and leaves (data storage node)
There is one primary aggregator, but all data is replicated so another node can become the master
Mimic MySQL protocol, so JDBC/ODBC drivers just work
They are an overlap between OLTP and OLAP systems – OLTP with OLAP speed

Share on

Twitter Facebook LinkedIn

James Higginbotham

#GlueCon 2013 Notes: Processing Big Data In Real-time with a Distributed SQL System – Ankur Goyal, MemSQL

Share on

You may also enjoy

The Pillars of an API Platform

Shifting to an API Platform Mindset

Lessons After a Decade of API Strategy

The Importance of API Governance