#GlueCon 2013 Notes: Processing Big Data In Real-time with a Distributed SQL System – Ankur Goyal, MemSQL

1 minute read

Processing Big Data In Real-time with a Distributed SQL System – Ankur Goyal, MemSQL

  • “SQL scales, but the technology behind it may not”
  • When you write SQL against a distributed system, there is more to take into account than indexes and other SQL optimizations: how your data is distributed across partitions
  • They compute a hash on the PK and assign it to a partition
  • Sidenote: Added 90M rows to a SQL database in the first 2-3 minutes of their talk
  • If the PK isn’t specified in a query, it will scatter-gather query and aggregate the result – cost to queries
  • Indexes still need to be considered, as they will narrow the data set on each partition during scatter-gather queries
  • MemSQL uses the equiv to MapReduce to scatter-gather query partitions
  • Each core is assigned a slice of data for a partition
  • Sidenote: they are using clusterSSH and htop to visualize engagement of each partition+core and impact of queries
  • Caution: joining rows from one side of the cluster to another is expensive, so understand your queries and cardinalities!
  • They use reference tables to store and replicate smaller tables, allowing each partition to join locally – nice!
  • For high cardinality joins, try to reduce cardinality in subqueries first
  • PK-to-FK relationships can be defined and then joined locally
  • Many analytics tools require you to ETL data from primary datastores, MemSQL removes this need and allows for analyzing transactional data
  • They do not support fully-distributed transactions (yet), but will wrap inserts/updates in transactions
  • Both Synchronous (i.e. consistency) or Asynchronous replication are available
  • Default is redundancy level 2 (can be changed), uses replication internally for redundancy
  • 2 Tiers – aggregators (metadata) and leaves (data storage node)
  • There is one primary aggregator, but all data is replicated so another node can become the master
  • Mimic MySQL protocol, so JDBC/ODBC drivers just work
  • They are an overlap between OLTP and OLAP systems – OLTP with OLAP speed