#GlueCon 2013 Notes: Processing Big Data In Real-time with a Distributed SQL System – Ankur Goyal, MemSQL
Processing Big Data In Real-time with a Distributed SQL System – Ankur Goyal, MemSQL
- “SQL scales, but the technology behind it may not”
- When you write SQL against a distributed system, there is more to take into account than indexes and other SQL optimizations: how your data is distributed across partitions
- They compute a hash on the PK and assign it to a partition
- Sidenote: Added 90M rows to a SQL database in the first 2-3 minutes of their talk
- If the PK isn’t specified in a query, it will scatter-gather query and aggregate the result – cost to queries
- Indexes still need to be considered, as they will narrow the data set on each partition during scatter-gather queries
- MemSQL uses the equiv to MapReduce to scatter-gather query partitions
- Each core is assigned a slice of data for a partition
- Sidenote: they are using clusterSSH and htop to visualize engagement of each partition+core and impact of queries
- Caution: joining rows from one side of the cluster to another is expensive, so understand your queries and cardinalities!
- They use reference tables to store and replicate smaller tables, allowing each partition to join locally – nice!
- For high cardinality joins, try to reduce cardinality in subqueries first
- PK-to-FK relationships can be defined and then joined locally
- Many analytics tools require you to ETL data from primary datastores, MemSQL removes this need and allows for analyzing transactional data
- They do not support fully-distributed transactions (yet), but will wrap inserts/updates in transactions
- Both Synchronous (i.e. consistency) or Asynchronous replication are available
- Default is redundancy level 2 (can be changed), uses replication internally for redundancy
- 2 Tiers – aggregators (metadata) and leaves (data storage node)
- There is one primary aggregator, but all data is replicated so another node can become the master
- Mimic MySQL protocol, so JDBC/ODBC drivers just work
- They are an overlap between OLTP and OLAP systems – OLTP with OLAP speed