Intro to NoSQL and MongoDB
1
NoSQL: Introduction
Asya Kamsky
2
• 1970's Relational Databases Invented
– Storage is expensive
– Data is normalized
– Data storage is abstracted away from app
3
• 1980's RDBMS commercialized
– Client/Server model
– SQL becomes the standard
4
• 1990's Things begin to change
– Client/Server=> 3-tier architecture
– Rise of the Internet and the Web
5
• 2000's Web 2.0
– Rise of "Social Media"
– Acceptance of E-Commerce
– Constant decrease of HW prices
– Massive increase of collected data
6
• Result
– Constant need to scale dramatically
– How can we scale?
7
Computers in 1985
• x286 5-35 MHz
• 56 kbps
• 64 KB RAM
• 10 MB HDD
8
Computers in 1995
• Pentium 100 MHz
• 20-50 Mbps
• 16 MB RAM
• 200 MB HDD
9
Phone in 2012
• Dual core 1.2 GHz
• WiFi 802.11n - 300+Mbps
• 1 GB RAM
• 48 GB SSD
10
Computers in 2012
• Dual core 1.8 GHz
• WiFi 802.11n - 300+Mbps
• 180+ Gbps
• 8 GB RAM
• 512 GB SSD
11
• Agile Development Methodology
– Shorter development cycles
– Constant evolution of requirements
– Flexibility at design time
12
• Relational Schema
– Hard to evolve
– Long, painful migrations
– Must stay in sync with application
– Few developers interact directly
13
OLTP / operational
+ complex transactions
+ tabular data
+ ad hoc queries
- O<->R mapping hard
- speed/scale problems
- not super agile
BI / reporting
+ ad hoc queries
+ SQL standard protocol between clients and servers
+ scales horizontally better than operational DBs
- some scale limits at massive scale
- schemas are rigid
- no real time; great at bulk nightly data loads
fewer issues here
a lot more issues here
14
OLTP / operational
BI / reporting
caching
flat files
map/reduce
app layer partitioning
15
16
17
18
19
• Horizontal scaling
• Run anywhere
• Flexible data model
• Faster development
• Low upfront cost
• Low cost of ownership
20
Relational
vs
Non-Relational
What is NoSQL?
21
scalable nonrelational
("nosql")
OLTP / operational
BI / reporting
+ speed and scale
- ad hoc query limited
- not very transactional
- no sql/no standard
+ fits OO well
+ agile
22
Non-relational, next-generation
operational data stores and databases
A collection of very different products
• Different data models (Not relational)
• Most are not using SQL for queries
• No predefined schema
• Some allow flexible data structures
23
• Relational
• Key-Value
• Document
• XML
• Graph
• Column
24
• Relational
• ACID
• Key-Value
• Document
• XML
• Graph
• Column
• BASE
• Some ACID properties
25
• Relational
• ACID
• Two-phase commit
• Key-Value
• Document
• XML
• Graph
• Column
• BASE
• Some ACID properties
• Atomic transactions on document level
26
• Relational
• ACID
• Two-phase commit
• Joins
• Key-Value
• Document
• XML
• Graph
• Column
• BASE
• Some ACID properties
• Atomic transactions on document level
• No Joins
27
28
• Fits your use case
• Reliability
• Maintainability
• Ease of Use
• Scalability
• Cost
29
MongoDB: Introduction
30
31
• Designed and developed by founders of DoubleClick, ShopWiki, Gilt Groupe, etc.
• GOAL: create a high-performance, fully consistent, horizontally scalable, general-purpose data store.
• Coding started fall 2007
• Open Source – AGPL, written in C++
• First production site March 2008 - businessinsider.com
• Currently version 2.2 – August 2012
32
MongoDB
Design Goals
33
34
• Document-oriented Storage
• Based on JSON Documents
• Data serialized to BSON
• Flexible Schema
• Scalable Architecture
• Replication
• High availability
• Auto-sharding
• Extensive use of memory-mapped files
• Durable
• Strong Consistency
• Key Features Include:
– Full-featured indexes
– Ad hoc Query Language
– Interactive shell
– Aggregation queries
– Map/Reduce
35
• Rich data models
• Seamlessly map to native programming language types
• Flexible for dynamic data
• Better data locality
36
Blogging website:
Register users
Users post blog entries
Comment on others' entries
Considering:
Tagging, Voting, ???
37
join
table
38
{
_id : ObjectId("4e2e3f92268cdda473b628f6"),
title : "My Very Important Thoughts",
published: ISODate("2011-07-26T19:49:00.147Z"),
author : { name:"Asya Kamsky", username:"asya" },
text : "It was a long and stormy night ..."
}
39
{
_id : ObjectId("4e2e3f92268cdda473b628f6"),
title : "My Very Important Thoughts",
published: ISODate("2011-07-26T19:49:00.147Z"),
author : { name:"Asya Kamsky", username:"asya" },
text : "It was a long and stormy night ...",
tags : ["business", "news", "north america"]
}
> db.posts.ensureIndex( { tags : 1 } )
40
{
_id : ObjectId("4e2e3f92268cdda473b628f6"),
title : "My Very Important Thoughts",
published: ISODate("2011-07-26T19:49:00.147Z"),
author : { name:"Asya Kamsky", username:"asya" },
text : "It was a long and stormy night ...",
tags : ["business", "news", "north america"]
}
> db.posts.find( { tags : "news" } )
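Why an index on an array field makes this query cheap can be sketched in plain JavaScript. This is only an illustration of the idea behind a multikey index (one index entry per array element), not MongoDB's actual B-tree code. The helper names `buildTagIndex` and `findByTag` and the second post are hypothetical.

```javascript
// Illustrative in-memory stand-in for a multikey index; not MongoDB internals.
const posts = [
  { _id: 1, title: "My Very Important Thoughts", tags: ["business", "news", "north america"] },
  { _id: 2, title: "Weekend Notes", tags: ["travel"] }, // hypothetical second post
];

// Build one index entry per array element, so each tag points back to its posts.
function buildTagIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const tag of doc.tags || []) {
      if (!index.has(tag)) index.set(tag, []);
      index.get(tag).push(doc._id);
    }
  }
  return index;
}

// Look up matching _id values without scanning every document.
function findByTag(index, tag) {
  return index.get(tag) || [];
}

const tagIndex = buildTagIndex(posts);
console.log(findByTag(tagIndex, "news"));
```

Because every element of `tags` gets its own entry, a query for a single tag only touches the entries for that tag.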
41
{
_id : ObjectId("4e2e3f92268cdda473b628f6"),
title : "My Very Important Thoughts",
published: ISODate("2011-07-26T19:49:00.147Z"),
author : { name:"Asya Kamsky", username:"asya" },
text : "It was a long and stormy night ...",
tags : ["business", "news", "north america"]
}
> db.posts.find( { tags : "news" } ).explain()
{ "cursor" : "BtreeCursor tags_1",
"isMultiKey" : true,
"n" : 1,
"nscannedObjects" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"tags" : [
[
"news",
"news"
42
{
_id : ObjectId("4e2e3f92268cdda473b628f6"),
title : "My Very Important Thoughts",
published: ISODate("2011-07-26T19:49:00.147Z"),
author : { name:"Asya Kamsky", username:"asya" },
text : "It was a long and stormy night ...",
tags : ["business", "news", "north america"],
votes : 3,
voters : ["dmerr", "sj", "jane" ]
}
> db.posts.update( { },   // query for documents to update
                   { }    // update to perform
                 )
43
{
_id : ObjectId("4e2e3f92268cdda473b628f6"),
title : "My Very Important Thoughts",
published: ISODate("2011-07-26T19:49:00.147Z"),
author : { name:"Asya Kamsky", username:"asya" },
text : "It was a long and stormy night ...",
tags : ["business", "news", "north america"],
votes : 3,
voters : ["dmerr", "sj", "jane" ]
}
> db.posts.update( {_id:..., voters:{$ne:"asya"} },
{ $push: {voters:"asya"},
$inc : {votes: 1}
} )
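The effect of this update can be modeled in plain JavaScript. The sketch below only mimics the semantics: the `$ne` condition means the document matches only if the user has not voted yet, and `$push` and `$inc` are applied together to the matched document. `applyVote` is a hypothetical helper, not a MongoDB API.

```javascript
// Plain-JavaScript model of the update's semantics; applyVote is hypothetical.
function applyVote(post, username) {
  // Query part: voters: {$ne: username} matches only if username is absent.
  if (post.voters.includes(username)) {
    return false; // no document matched, so nothing changes
  }
  // Update part: $push and $inc are applied together to the matched document.
  post.voters.push(username);
  post.votes += 1;
  return true;
}

const post = { votes: 3, voters: ["dmerr", "sj", "jane"] };
const first = applyVote(post, "asya");  // vote is recorded
const second = applyVote(post, "asya"); // repeat vote is ignored
console.log(first, second, post.votes, post.voters);
```

Putting the "has not voted" check into the query is what keeps a repeated vote from being counted twice.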
44
{
_id : ObjectId("4e2e3f92268cdda473b628f6"),
title : "My Very Important Thoughts",
published: ISODate("2011-07-26T19:49:00.147Z"),
author : { name:"Asya Kamsky", username:"asya" },
text : "It was a long and stormy night ...",
tags : ["business", "news", "north america"],
votes : 4,
voters : ["dmerr", "sj", "jane", "asya" ],
comments : [
{ by : "tim157", text : "great story", ... },
{ by : "gora", text : "i don’t think so", ... },
{ by : "dmerr", text : "also check out..." }
]
}
45
{
_id : ObjectId("4e2e3f92268cdda473b628f6"),
title : "My Very Important Thoughts",
published: ISODate("2011-07-26T19:49:00.147Z"),
author : { name:"Asya Kamsky", username:"asya" },
text : "It was a long and stormy night ...",
tags : ["business", "news", "north america"],
votes : 4,
voters : ["dmerr", "sj", "jane","asya" ],
comments : [
{ by : "tim157", text : "great story" },
{ by : "gora", text : "i don’t think so" },
{ by : "dmerr", text : "also check out..." }
]
}
> db.posts.ensureIndex( { "comments.by" : 1 } )
46
{
_id : ObjectId("4e2e3f92268cdda473b628f6"),
title : "My Very Important Thoughts",
published: ISODate("2011-07-26T19:49:00.147Z"),
author : { name:"Asya Kamsky", username:"asya" },
text : "It was a long and stormy night ...",
tags : ["business", "news", "north america"],
votes : 4,
voters : ["dmerr", "sj", "jane","asya" ],
comments : [
{ by : "tim157", text : "great story" },
{ by : "gora", text : "i don’t think so" },
{ by : "dmerr", text : "also check out..." }
]
}
> db.posts.find( { "comments.by" : "gora" } )
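How a dotted path such as "comments.by" reaches into an array of embedded documents can be sketched in plain JavaScript. This is a simplified model of the matching rule (a document matches if any array element has the value), not MongoDB's query engine. `valuesAtPath`, `matches`, and the second post are hypothetical.

```javascript
// Collect every value reachable through a dotted path, fanning out over arrays.
function valuesAtPath(value, parts) {
  if (Array.isArray(value)) {
    return value.flatMap((item) => valuesAtPath(item, parts));
  }
  if (parts.length === 0) return [value];
  if (value === null || typeof value !== "object") return [];
  return valuesAtPath(value[parts[0]], parts.slice(1));
}

// A document matches when any value found at the path equals the expected one.
function matches(doc, path, expected) {
  return valuesAtPath(doc, path.split(".")).includes(expected);
}

const posts = [
  {
    title: "My Very Important Thoughts",
    tags: ["business", "news", "north america"],
    comments: [
      { by: "tim157", text: "great story" },
      { by: "gora", text: "i don't think so" },
      { by: "dmerr", text: "also check out..." },
    ],
  },
  { title: "Weekend Notes", tags: ["travel"], comments: [{ by: "asya", text: "nice" }] }, // hypothetical
];

const byGora = posts.filter((p) => matches(p, "comments.by", "gora"));
console.log(byGora.map((p) => p.title));
```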
47
Seek = 5+ ms Read = really really fast
Post
Author Comment
48
Post
Author
Comment Comment Comment Comment Comment
Disk seeks and data locality
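A rough back-of-the-envelope version of this comparison, written as a small script. It assumes a worst case in which the post, the author, and every comment are stored separately and each costs one seek at the slide's 5 ms, with read time treated as negligible and no caching. The function names are hypothetical.

```javascript
const SEEK_MS = 5; // from the slide: "Seek = 5+ ms"

// Normalized layout: one seek for the post, one for the author, one per comment
// (worst-case assumption: every row lives in a different place on disk).
function normalizedReadMs(commentCount) {
  const seeks = 1 + 1 + commentCount;
  return seeks * SEEK_MS;
}

// Embedded layout: post, author, and comments are stored together as one document.
function embeddedReadMs() {
  return 1 * SEEK_MS;
}

console.log(normalizedReadMs(5), embeddedReadMs());
```

With the five comments shown in the diagram, this estimate gives 7 seeks against 1.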
49
• High Availability
• Data Redundancy
• Increase capacity with no downtime
• Transparent to the application
50
• A cluster of N servers
• Any (one) node can be primary
• All writes to primary
• Reads go to primary (default), optionally to a secondary
• Consensus election of primary
• Automatic failover
• Automatic recovery
Node 3
Node 1
Node 2
Primary
Pick me!
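The majority rule behind a consensus election can be sketched as follows. This is a deliberate simplification: it only checks that a strict majority of the set's members is reachable and ignores everything else a real election considers (such as how up to date each node is). `canElectPrimary` is a hypothetical helper.

```javascript
// A primary can be elected only when a strict majority of all members can vote.
function canElectPrimary(totalMembers, reachableMembers) {
  return reachableMembers > Math.floor(totalMembers / 2);
}

console.log(canElectPrimary(3, 2)); // one of three nodes is down
console.log(canElectPrimary(3, 1)); // a node isolated from the other two
```

This is why a three-node set can lose one node and still fail over automatically, while a node cut off from the other two cannot become primary on its own.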
51
Replica Sets
• High Availability/Automatic Failover
• Data Redundancy
• Disaster Recovery
• Transparent to the application
• Perform maintenance with no downtime
52
Asynchronous
Replication
53
54
55
56
Automatic
Election
57
58
• Increase capacity with no downtime
• Transparent to the application
• Range based partitioning
• Partitioning and balancing is automatic
59
Shard 1: Primary, Secondary, Secondary (Key Range min..25)
Shard 2: Primary, Secondary, Secondary (Key Range 26..50)
Shard 3: Primary, Secondary, Secondary (Key Range 51..75)
Shard 4: Primary, Secondary, Secondary (Key Range 76..max)
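Range-based partitioning with the key ranges above can be sketched as a simple lookup. The boundaries come from the diagram. The shard names, the `routeKey` helper, and the assumption of integer keys are illustrative only, and the sketch does not model automatic balancing.

```javascript
// Range boundaries taken from the diagram; shard names are hypothetical.
const ranges = [
  { max: 25, shard: "shard1" },       // min..25
  { max: 50, shard: "shard2" },       // 26..50
  { max: 75, shard: "shard3" },       // 51..75
  { max: Infinity, shard: "shard4" }, // 76..max
];

// Route a key to the first range whose upper bound is not below the key.
function routeKey(key) {
  return ranges.find((r) => key <= r.max).shard;
}

console.log([10, 26, 75, 900].map(routeKey));
```

Because each key falls in exactly one range, a read or write for a given key needs to reach only one shard.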
60
MongoS
Application
61
MongoS MongoS MongoS
Application
62
MongoS MongoS MongoS
Config Config Config
Application
MongoS
MongoS
Application Application Application
63
• Few configuration options
• Does the right thing out of the box
• Easy to deploy and manage
64
Better data locality
Relational MongoDB
In-Memory
Caching
Auto-Sharding
Write scaling
Read scaling
We just can't get any faster than the way MongoDB handles our data.
Tony Tam CTO, Wordnik
65
• Supported Platforms:
– Linux, Windows, Solaris, Mac OS X
– Packages available for all popular distributions
• No external/third-party software dependencies
• 10gen maintains drivers for over a dozen languages
66
User Data Management High Volume Data Feeds
Content Management Operational Intelligence E-Commerce
67
68
Open source, high performance database