shard-query, an mpp database for the cloud using the lamp stack
DESCRIPTION
This combined #SFMySQL and #SFPHP meetup talked about Shard-Query. You can find the video to accompany this set of slides here: https://www.youtube.com/watch?v=vC3mL_5DfEMTRANSCRIPT
Shard-QueryAN MPP DATABASE FOR THE CLOUD USING THE
LAMP STACK
Introduction
Presenter
⢠Justin Swanhart
⢠Principal Support Engineer at Percona
⢠Previously a trainer and consultant at Percona too
Developer
⢠Swanhart-tools
⢠Shard-Query â MPP sharding middleware for MySQL
⢠Flexviews â Materialized views (fast refresh) for MySQL
⢠bcmath UDF â arbitrary precision math for MySQL
Intended Audience
⢠MySQL users with data too large to query efficiently using a single machine
⢠Big Data
⢠Analytics / OLAP
⢠User generated content analysis
⢠People interested in distributed database processing
Terms
MPP â Massively Parallel Processing
⢠An MPP system is a system that can process a SQL statement in parallel on a single machine or even many machines
⢠A collection of machines is often called a Grid
⢠MPP is also sometimes called Grid Computing
MPP (cont)
⢠Not many open source databases (none?) support MPP
⢠Community editions of closed source offerings are limited
⢠Some closed source databases include Vertica, Greenplum, Redshift
The Cloud
⢠Managed collection of virtual servers
⢠Easy to add servers on demand
⢠Ideal for a federated, distributed database grid
⢠Easy to âscale upâ by moving to a VM with more cores
⢠Easy to âscale outâ by adding machines
⢠Amazon is one of the most popular cloud environments
LAMP stack
⢠Linux
⢠Amazon Linux
⢠RHEL
⢠Ubuntu LTS, etc.
⢠Apache Web Server
⢠Most popular web server on the planet
⢠MySQL
⢠The worldâs most popular open source database
⢠PHP
⢠High level language makes development easier
Database Middleware
⢠A piece of software that sits between an end-user application and the database
⢠Operates on the queries submitted by the application, then returns the results to the application
⢠Usually a proxy of some sort
⢠MySQL proxy is the open source user configurable proxy for MySQL
⢠Supports Lua scripts which intercept queries
⢠Shard-Query can use MySQL Proxy out of the box
Message Queue / Job Server
⢠Accepts jobs or messages and places them in a queue
⢠A worker reads jobs/messages from the queue and acts on them
⢠Offers support for asynchronous jobs
⢠Gearman
⢠My job server of choice for PHP
⢠Has two different PHP interfaces (pear and pecl)
⢠SQ comes bundled with a modified version* of the pear interface
⢠Excellent integration with MySQL as well (UDF)
* Removes warnings triggered by modern PHP strict mode
Sharding
⢠It is a short for Shared Nothing
⢠Means splitting up your data onto more than one machine
⢠Tables that are split up are called sharded tables
⢠Lookup tables are not sharded. In other words, they must be duplicated on all nodes
⢠Shard-Query supports directory based or hash based sharding
Shard mapper
⢠Shard-Query supports DIRECTORY and HASH mapping out of the box
⢠DIRECTORY based sharding allows you to add or remove shards from the system, but lookups may go over the network, reducing performance* compared to HASH mapping
⢠HASH based sharding uses a hash algorithm to balance rows over the sharded database. However, since a HASH algorithm is used, the number of database shards can not change after initial data loading.
* But only for queries like âselect count(*) from table where customer_id = 50â
What is âbig dataâ
Most machine generated data
⢠Line order information for a large organization like Wal-Martâ˘
⢠Any data so large that you canât effectively operate on it on one machine
⢠For example, an important query that needs to run daily executes in greater than 24 hours. It is impossible to meet the daily goal unless you can find a way to make the query execute faster.
⢠These kind of problems can happen on relatively small amounts of data (tens of gigabytes)
Analytics(OLAP) versus OLTP
⢠OLTP is focused on short lived small transactions that read or write small amounts of data
⢠OLAP is focused on bulk loading and reading large amounts of data in a single query.
⢠Aggregation queries are OLAP queries
⢠Shard-Query is designed for analytics (OLAP) not OLTP
⢠must parse all commands sent to it (and make multiple round trips)
⢠Minium query time of around 20ms
PROBLEM: Single Threaded QueriesTHE BIGGEST BOTTLENECK IN ANALYTICAL QUERIES IS THE SPEED OF A SINGLE CORE
Single thread queries in the database
⢠MySQL, PostgreSQL, Firebird and all other major open source databases have single threaded queries
⢠This means that a single query can only ever utilize the resources of a single core
⢠As the data size grows, analytical queries get slower and slower
⢠In memory, as the data grows the speed decreases because the data is accessed in a single query
⢠As the number of rows to be examined increases, performance decreases
Why single threaded
⢠MySQL is optimized for getting small amounts of data quickly(OLTP)
⢠It was created at a time when having more than one CPU was not common
⢠Adding parallelism now is a very complex task, particularly since MySQL supports multiple storage engines
⢠So adding parallel query is not a high priority (not even on the roadmap)
⢠Designed to run LOTS of small queries simultaneously, not one big query
Single Threading â bad for IO
⢠If the data set is significantly larger than memory, single threaded queries often cause the buffer pool to "churnâ
⢠For example, small lookup tables can easily be pushed out of the buffer pool, resulting in frequent IO to look up values
⢠While SSD may helps somewhat, one database thread can not read from an SSD at maximum device capacity
⢠While the disk may be capable of 1000+ MB/sec, a single thread is generally limited to <100MB/sec (usually 30-40)
⢠This is because a single thread shares doing IO AND running the query on one CPU (MySQL does not use read threads for queries)
The OLAP Example
⢠A large company maintains a star schema of their sales history for analytics purposes
⢠This company likes to present a sum total of orders for all time on the dashboard
⢠In the beginning the query is very fast
⢠It gets slower, though, as months of data are added and as the business grows, data increases too
⢠Eventually the query takes more than 24 hours to run, which means it can no longer be updated daily
⢠âDrill downâ gets slower as data increases
What can be done?
⢠Caching?
⢠Materialized views?
⢠Partitioning?
⢠Sharding?
Making OLAP more like OLTP!
⢠Shard-Query breaks on big query up into smaller queries that can access the database in parallel
⢠Partitioning and sharding are used to keep data size for any single query to a minimum
⢠If your table has 16 partitions, you can get up to 16 way parallelism
⢠If you also have 2 nodes, you get 32 way parallelism, and so on
⢠You can use multiple database schema on a single server instead (a form of sharding) if you donât partition your data
Shard-QueryADDING PARALLELISM TO QUERIES
Sharding Reviewed
⢠A sharded database contains multiple nodes or databases called shards
⢠One physical machine might host many shards
⢠Each shard has identical schema
Sharding Reviewed (cont)
⢠The multiple shards function together as one RDBMS system.
⢠You can think of the shards as a big UNION ALL of the data, with only a portion of the data on any one machine
⢠A mechanism must control which server on which to place particular pieces of data.
⢠In Shard-Query a particular column controls data placement â this is called the shard key
Sharding â Data distribution
⢠There are usually one or two large tables that are sharded
⢠These are usually called FACT tables
⢠An example might be blogs, blog_posts and blog_comments. All three share a âshard keyâ of blog_id
⢠Most common case is one big table with smaller lookup tables
Sharding Reviewed (cont)
⢠The shard key is very important!
⢠Since a specific column acts as the âshard keyâ, all sharded tables must contain the shard key.
⢠For example: blog_id might be the shard key.
⢠The rows for a specific blog_id are then located on the same shard in any table that has the blog_id column
Optimization - Shard Elimination
⢠When Shard-Query sees an expression on the shard key it looks up* the shard that contains the appropriate data and only sends queries to the necessary shards.
⢠Equality lookup is most efficient, but IN, BETWEEN and other operators are allowed as well
⢠Lookups may not use subqueries (ie, blog_id IN (1,2,3) is okay, not blog_id in (select âŚ))
⢠This is called âshard eliminationâ
⢠Shard elimination is analogous to partition elimination.
⢠where blog_id = 10, for example
Can Shard-Query help on 1 machine?
⢠Yes! - Use MySQL partitioning on a single machine
⢠Shard-Query can access the partitions of a table in parallel!
⢠This means that if you have many partitions, then Shard-Query can utilize many cores to answer the query
Use partitions for parallelism
How does that work?
⢠Shard-Query executes an EXPLAIN PLAN on the query
⢠This EXPLAIN PLAN shows the partitions that MySQL will access when running the query
⢠Shard-Query uses the 5.6 PARTITION hint to generate one query per partition
⢠These queries can execute in parallel
Sharding can help too
⢠How?
⢠Shard-Query adds parallelism to queries by spreading them over nodes in parallel
⢠Spread the data over four nodes and queries are 4x faster
MySQL database shards
Shard-Query
Sharding + Partitioning is best
⢠Why?
⢠Partition the tables to add parallelism to each node
⢠Use sharding to have multiple nodes working together
⢠4 nodes with 3 partitions each = 12 way parallelism
Shard-Query
MySQL database shards
Partitions
Shard-QueryARCHITECTURE
Configuration Repository
⢠Shard-Query stores all configuration information in a MySQL database called the configuration repository
⢠This should be a highly available replication pair (or XtraDBcluster) for HA
⢠Web interface can change the settings
⢠Manual settings changes can be done via SQL
⢠schemata_config table in Shard-Query repository
⢠Makes using Shard-Query easier, especially when using more than one node
PHP OO Apache Web
Interface
MySQL Proxy
Gearman Message Queue
Worker Worker Worker Worker
MySQL database shards
Shard-Query Architecture
Interfaces
Communication
Workers
Storage
ConfigRepository
Configuration Management
PHP OO Apache Web
Interface
MySQL Proxy
Gearman Message Queue
Worker Worker Worker Worker
MySQL database shards
Shard-Query ArchitectureGearman job server
⢠Provides the parallel mechanism for Shard-Query
⢠Multiple Gearman are supported for HA
⢠Enables Shard-Query to use a map/reduce like architecture
⢠Sends jobs to workers when they arrive at the queue
⢠If all workers are busy the job waits
Gearman at a glance
Shard-Query OO
Store-resultsetLoader workerSQ run SQL worker
PHP OO Apache Web
Interface
MySQL Proxy
Gearman Message Queue
Worker Worker Worker Worker
MySQL database shards
Shard-Query ArchitectureThree kinds of workers
⢠loader_worker â Listens for loader jobs and executes them. Used by parallel loader.
⢠shard_query_worker â Listens for SQL jobs, runs the job via Shard-Query and returns the results as JSON. Used by web and proxy interfaces.
⢠store_resultset_worker â Main worker used by Shard-Query. It runs SQL and stores the result in a table.
PHP OO Apache Web
Interface
MySQL Proxy
Gearman Message Queue
Worker Worker Worker Worker
MySQL database shards
Shard-Query ArchitecturePHP Object Oriented Interface
⢠Very simple to use
⢠Constructor parameters not even usually needed
⢠Just one function to run a SQL query and get results back
⢠Complete example comes with Shard-Query as: bin/run_query
PHP OO Example (from bin/run_query):
$shard_query = new ShardQuery();$stime = microtime(true);$stmt = $shard_query->query($sql);$endtime = microtime(true);
if(!empty($shard_query->errors)) {if(!empty($shard_query->errors)) {echo "ERRORS RETURNED BY OPERATION:\n";print_r($shard_query->errors);
}}
if(is_resource($stmt) || is_object($stmt)) {$count=0;while($row = $shard_query->DAL->my_fetch_assoc($stmt)) {
print_r($row);++$count;
}echo "$count rows returned\n";$shard_query->DAL->my_free_result($stmt);
} else {if(!empty($shard_query->info)) print_r($shard_query->info);echo "no query results\n";
}echo "Exec time: " . ($endtime - $stime) . "\n";
Simple data access layercomes with Shard-Query
Errors are returned as a memberof the object
Run the query
PHP OO Apache Web
Interface
MySQL Proxy
Gearman Message Queue
Worker Worker Worker Worker
MySQL database shards
Shard-Query ArchitectureApache web interface
⢠GUI
⢠Easy to set up
⢠Run queries and get results
⢠Serves as an example of using Shard-Query in a web app with asynchronous queries
⢠Submits queries via Gearman
⢠Simple HTTP authentication
PHP OO Apache Web
Interface
MySQL Proxy
Gearman Message Queue
Worker Worker Worker Worker
MySQL database shards
Shard-Query ArchitectureMySQL Proxy Interface
⢠LUA script for MySQL Proxy
⢠Supports most SHOW commands
⢠Intercepts queries, and sends them to Shard-Query using the MySQL Gearman UDF
⢠Serves as another example of using Gearman to execute queries.
⢠Behaves slightly differently than MySQL for some commands
Query submitted
SQL is parsed
Query rewritefor parallelismyields multiple
queries
Gearman Jobs (map/combine)
Final Aggregation (reduce)
Return result
Shard-Query Data FlowMap/reduce like workflow
Query submitted
SQL is parsed
Query rewritefor parallelismyields multiple
queries
Gearman Jobs (map/combine)
Final Aggregation (reduce)
Return result
Shard-Query Data Flow
SQL Parser
⢠Find it at http://github.com/greenlion/php-sql-parser
⢠Supports
⢠SELECT/INSERT/UPDATE/DELETE
⢠REPLACE
⢠RENAME
⢠SHOW/SET
⢠DROP/CREATE INDEX/CREATE TABLE
⢠EXPLAIN/DESCRIBE
Used by SugarCRM too, aswell as other open source projects.
Query submitted
SQL is parsed
Query rewritefor parallelismyields multiple
queries
Gearman Jobs (map/combine)
Final Aggregation (reduce)
Return result
Shard-Query Data Flow
Query Rewrite for parallelism
⢠Shard-Query has to manipulate the SQL statement so that it can be executed over more than on partition or machine
⢠COUNT() turns into SUM of COUNTs from each query
⢠AVG turns into SUM and COUNT
⢠SEMI-JOIN is turned into a materialized join
⢠STDDEV/VARIANCE are rewritten as well use the sum of squares method
⢠Push down LIMIT when possible
Query Rewrite for parallelism (cont)
⢠Because lookup tables are duplicated on all shards, the query executes in a shared-nothing way
⢠All joins, filtering and aggregation are pushed down
⢠Mean very little data must flow between nodes in most cases
⢠High performance
⢠Meets or beats Amazon Redshift in testing at 200GB of data
Query submitted
SQL is parsed
Query rewritefor parallelismyields multiple
queries
Gearman Jobs (map/combine)
Final Aggregation (reduce)
Return result
Shard-Query Data Flow
Map/Combine
⢠The store_resultset gearman worker runs SQL and stores the result in a table
⢠To keep the number of rows in the table (and the time it takes to aggregate results in the end) small, an INSERT ⌠ON DUPLICATE KEY UPDATE (ODKU) statement is used when inserting the rows
⢠There is a UNIQUE KEY over the GROUP BY attributes to facilitate the upsert
Query submitted
SQL is parsed
Query rewritefor parallelismyields multiple
queries
Gearman Jobs (map/combine)
Final Aggregation (reduce)
Return result
Shard-Query Data Flow
Final aggregation
⢠Shard-Query has to return a proper result, combining the results in the result table together to return the correct answer
⢠Again, for example COUNT must be rewritten as SUM to combine all the counts (from each shard) in the result table
⢠Aggregated result is returned to the client
Shard-Query Flow as SQL[justin@localhost bin]$ ./run_query --verboseselect count(*) from lineorder;Shard-Query optimizer messages:
SQL TO SEND TO SHARDS:Array(
[0] => SELECT COUNT(*) AS expr_2913896658FROM lineorder PARTITION(p0) AS `lineorder` WHERE 1=1
[1] => SELECT COUNT(*) AS expr_2913896658FROM lineorder PARTITION(p1) AS `lineorder` WHERE 1=1
[2] => SELECT COUNT(*) AS expr_2913896658FROM lineorder PARTITION(p2) AS `lineorder` WHERE 1=1
[3] => SELECT COUNT(*) AS expr_2913896658FROM lineorder PARTITION(p3) AS `lineorder` WHERE 1=1 )SQL TO SEND TO COORDINATOR NODE:SELECT SUM(expr_2913896658) AS ` count `FROM `aggregation_tmp_58392079`
Array(
[count ] => 0)1 rows returnedExec time: 0.03083610534668
Initial query
Query rewrite / map
Final aggregation / reduce
Final result
Map/Combine exampleselect LO_OrderDateKey, count(*) from lineorder group by LO_OrderDateKey;Shard-Query optimizer messages:* The following projections may be selected for a UNIQUE CHECK on the storage node operation:expr$0
* storage node result set merge optimization enabled:ON DUPLICATE KEY UPDATE
expr_2445085448=expr_2445085448 + VALUES(expr_2445085448)
SQL TO SEND TO SHARDS:Array(
[0] => SELECT LO_OrderDateKey AS expr$0,COUNT(*) AS expr_2445085448FROM lineorder PARTITION(p0) AS `lineorder` WHERE 1=1 GROUP BY expr$0
[1] => SELECT LO_OrderDateKey AS expr$0,COUNT(*) AS expr_2445085448FROM lineorder PARTITION(p1) AS `lineorder` WHERE 1=1 GROUP BY expr$0
[2] => SELECT LO_OrderDateKey AS expr$0,COUNT(*) AS expr_2445085448FROM lineorder PARTITION(p2) AS `lineorder` WHERE 1=1 GROUP BY expr$0
[3] => SELECT LO_OrderDateKey AS expr$0,COUNT(*) AS expr_2445085448FROM lineorder PARTITION(p3) AS `lineorder` WHERE 1=1 GROUP BY expr$0)SQL TO SEND TO COORDINATOR NODE:SELECT expr$0 AS `LO_OrderDateKey`,SUM(expr_2445085448) AS ` count `FROM `aggregation_tmp_12033903` GROUP BY expr$0
combine
reduce
Use cases
Machine generated data
⢠Sensor readings
⢠Metrics
⢠Logs
⢠Any large table with short lookup tables
Star schema are ideal
Call detail records
⢠Shard-Query is used in the billing system of a large cellular provider
⢠CDRs generate a lot of data
⢠Shard-Query includes a fast PERCENTILE function
Green energy meter processing
⢠High volume of data means sharding is necessary
⢠With Shard-Query, reporting is possible over all the shards, making queries possible that would not work with Fabric or other sharding solutions
⢠Used in India for reporting on a green power grid
Log analysis
⢠Performance logs from a web application for example
⢠Aggregate many different statistics and shard if log volumes are high enough
⢠Search text logs with regular expressions
Performance
Star Schema Benchmark â SF 20
⢠119 million rows of data (12GB)
⢠Infobright Community Database
⢠Only 1st query from each âflightâ selected
⢠Unsharded compared to four shards (box has 4 cpu - Amazon m1.xlarge)
COLD
⢠MySQL â 35.39s
⢠Shard-Query â 11.62s
HOT
⢠MySQL â 10.99s
⢠Shard-Query â 2.95s
Query 1
select sum(lo_extendedprice*lo_discount) as revenuefrom lineorder join dim_date on lo_orderdatekey = d_datekeywhere d_year = 1993and lo_discount between 1 and 3and lo_quantity < 25;
COLD
⢠MySQL â 34.24s
⢠Shard-Query â 12.74s
HOT
⢠MySQL â 12.74s
⢠Shard-Query â 3.26s
Query 2
select sum(lo_revenue), d_year, p_brandfrom lineorderjoin dim_date on lo_orderdatekey = d_datekeyjoin part on lo_partkey = p_partkeyjoin supplier on lo_suppkey = s_suppkeywhere p_category = 'MFGR#12'and s_region = 'AMERICA'group by d_year, p_brandorder by d_year, p_brand;
COLD
⢠MySQL â 27.29s
⢠Shard-Query â 7.97s
HOT
⢠MySQL â 18.89
⢠Shard-Query â 5.06s
Query 3
select c_nation, s_nation, d_year, sum(lo_revenue) as revenuefrom customer join lineorderon lo_custkey = c_customerkeyjoin supplier on lo_suppkey = s_suppkeyjoin dim_date on lo_orderdatekey = d_datekeywhere c_region = 'ASIA'and s_region = 'ASIA'and d_year >= 1992 and d_year <= 1997group by c_nation, s_nation, d_yearorder by d_year asc, revenue desc;
COLD
⢠MySQL â 23.02s
⢠Shard-Query â 8.48s
HOT
⢠MySQL â 14.77
⢠Shard-Query â 4.29s
Query 4
select d_year, c_nation, sum(lo_revenue - lo_supplycost) as profitfrom lineorder join dim_date on lo_orderdatekey = d_datekeyjoin customer on lo_custkey = c_customerkeyjoin supplier on lo_suppkey = s_suppkeyjoin part on lo_partkey = p_partkeywhere c_region = 'AMERICA'and s_region = 'AMERICA'and (p_mfgr = 'MFGR#1'or p_mfgr = 'MFGR#2')group by d_year, c_nationorder by d_year, c_nation;