Intro to NoSQL and MongoDB

1

NoSQL: Introduction

Asya Kamsky

2

• 1970s: Relational databases invented

– Storage is expensive

– Data is normalized

– Data storage is abstracted away from app

4





• 1980s: RDBMS commercialized

– Client/Server model

– SQL becomes the standard

• 1990s: Things begin to change

– Client/Server => 3-tier architecture

– Rise of the Internet and the Web

6

• 2000s: Web 2.0

– Rise of "Social Media"

– Acceptance of E-Commerce

– Constant decrease of hardware prices

– Massive increase of collected data

• Result

– Constant need to scale dramatically

– How can we scale?

9

Computers in 1985

• x286, 5-35 MHz

• 56 kbps

• 64 KB RAM

• 10 MB HDD

Computers in 1995

• Pentium, 100 MHz

• 20-50 Mbps

• 16 MB RAM

• 200 MB HDD

Phone in 2012

• Dual core 1.2 GHz

• WiFi 802.11n, 300+ Mbps

• 1 GB RAM

• 48 GB SSD

10

Computers in 2012

• Dual core 1.8 GHz

• WiFi 802.11n, 300+ Mbps

• 180+ Gbps

• 8 GB RAM

• 512 GB SSD

11

• Agile Development Methodology

– Shorter development cycles

– Constant evolution of requirements

– Flexibility at design time

12




• Relational Schema

– Hard to evolve

– Long, painful migrations

– Must stay in sync with application

– Few developers interact directly

13

OLTP / operational

+ complex transactions

+ tabular data

+ ad hoc queries

- O<->R mapping hard

- speed/scale problems

- not super agile

BI / reporting

+ ad hoc queries

+ SQL standard protocol between clients and servers

+ scales horizontally better than operational DBs

- some scale limits at massive scale

- schemas are rigid

- no real time; great at bulk nightly data loads

[Figure callouts: "fewer issues here", "a lot more issues here"]

14

OLTP / operational

BI / reporting

caching

flat files

map/reduce

app layer partitioning

19

• Horizontal scaling

• Run anywhere

• Flexible data model

• Faster development

• Low upfront cost

• Low cost of ownership

20

What is NoSQL?

Relational vs. Non-Relational

21

Scalable nonrelational ("NoSQL"), shown alongside OLTP / operational and BI / reporting

+ speed and scale

- ad hoc query limited

- not very transactional

- no SQL / no standard

+ fits OO well

+ agile

22

Non-relational, next-generation operational data stores and databases

A collection of very different products

• Different data models (Not relational)

• Most are not using SQL for queries

• No predefined schema

• Some allow flexible data structures

23

• Relational

– ACID

– Two-phase commit

– Joins

• Key-Value, Document, XML, Graph, Column

– BASE

– Some ACID properties: atomic transactions at the document level

– No joins

28

• Fits your use case

• Reliability

• Maintainability

• Ease of Use

• Scalability

• Cost

29

MongoDB: Introduction

31

• Designed and developed by founders of DoubleClick, ShopWiki, Gilt Groupe, etc.

• Goal: create a high-performance, fully consistent, horizontally scalable, general-purpose data store.

• Coding started fall 2007

• Open Source – AGPL, written in C++

• First production site March 2008 - businessinsider.com

• Currently version 2.2 – August 2012

32

MongoDB

Design Goals

34

• Document-oriented Storage

• Based on JSON Documents

• Data serialized to BSON

• Flexible Schema

• Scalable Architecture

• Replication

• High availability

• Auto-sharding

• Extensive use of memory-mapped files

• Durable

• Strong Consistency

• Key Features Include:

– Full-featured indexes

– Ad hoc Query Language

– Interactive shell

– Aggregation queries

– Map/Reduce

35

• Rich data models

• Seamlessly map to native programming language types

• Flexible for dynamic data

• Better data locality
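As a rough illustration of the points above, the sketch below uses plain JavaScript objects (no MongoDB driver or server) to show that a document with nested and array fields is an ordinary native value, and that two documents kept side by side can carry different fields. The `posts` array, its second entry, and the `fieldsOf` helper are invented for this example.

```javascript
// Plain JavaScript objects standing in for documents; no database involved.
const posts = [
  {
    title: "My Very Important Thoughts",
    author: { name: "Asya Kamsky", username: "asya" }, // nested sub-document
    text: "It was a long and stormy night ..."
  },
  {
    title: "A later post",        // hypothetical second document
    tags: ["business", "news"],   // array field the first document lacks
    votes: 3
  }
];

// Hypothetical helper: list the top-level field names of a document.
function fieldsOf(doc) {
  return Object.keys(doc);
}

// A JSON round trip keeps the nested structure of the first document.
const copy = JSON.parse(JSON.stringify(posts[0]));

console.log(fieldsOf(posts[0]));
console.log(fieldsOf(posts[1]));
console.log(copy.author.username);
```

Nothing here declares a schema up front: the second document simply has different fields from the first.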

36

Blogging website:

Register users

Users post blog entries

Comment on others' entries

Considering:

Tagging, Voting, ???

37

[Figure: relational schema with a join table]

38

{

_id : ObjectId("4e2e3f92268cdda473b628f6"),

title : "My Very Important Thoughts",

published: ISODate("2011-07-26T19:49:00.147Z"),

author : { name:"Asya Kamsky", username:"asya" },

text : "It was a long and stormy night ..."

}

39

{






tags : ["business", "news", "north america"]

}

> db.posts.ensureIndex( { tags : 1 } )
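An index on an array field such as tags holds one entry per array element, so a single document can be reached through any of its tags. The sketch below imitates that idea in memory with a `Map`; it is not MongoDB's actual B-tree, and the extra documents and the names `buildTagIndex` and `findByTag` are assumptions made for illustration.

```javascript
// In-memory sketch of an index over an array field; not MongoDB's B-tree.
const posts = [
  { _id: 1, tags: ["business", "news", "north america"] },
  { _id: 2, tags: ["sports", "news"] }, // hypothetical extra documents
  { _id: 3, tags: ["business"] }
];

// Build one index entry per array element: tag -> list of document _ids.
function buildTagIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const tag of doc.tags) {
      if (!index.has(tag)) index.set(tag, []);
      index.get(tag).push(doc._id);
    }
  }
  return index;
}

// Look up matching _ids without scanning every document.
function findByTag(index, tag) {
  return index.get(tag) || [];
}

const tagIndex = buildTagIndex(posts);
console.log(findByTag(tagIndex, "news")); // matches _id 1 and 2
```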

41

{







}

> db.posts.find( { tags : "news" } ).explain()
{ "cursor" : "BtreeCursor tags_1",
  "isMultiKey" : true,
  "n" : 1,
  "nscannedObjects" : 1,
  "scanAndOrder" : false,
  "indexOnly" : false,
  "nYields" : 0,
  "nChunkSkips" : 0,
  "millis" : 0,
  "indexBounds" : {
    "tags" : [ [ "news", "news" ] ]
  },
  ...
}

42

{






tags : ["business", "news", "north america"],

votes : 3,

voters : ["dmerr", "sj", "jane" ]

}

> db.posts.update( { },   // query for documents to update
                   { }    // update to perform
                 )

43

{







votes : 3,

voters : ["dmerr", "sj", "jane" ]

}

> db.posts.update( {_id:..., voters:{$ne:"asya"} },

{ $push: {voters:"asya"},

$inc : {votes: 1}

} )
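The update above only matches when "asya" is not yet in voters, and it applies both changes to the one matching document together, which is what prevents double voting. A minimal in-memory stand-in for that logic, with no database involved, might look like this; the function name `voteOnce` is hypothetical.

```javascript
// In-memory stand-in for the conditional update; no database involved.
// Mirrors: query { voters: { $ne: user } }, update { $push, $inc }.
function voteOnce(post, user) {
  // Query part: match only if the user is not already in voters.
  if (post.voters.includes(user)) {
    return false; // no document matched, nothing changes
  }
  // Update part: both modifications are applied together.
  post.voters.push(user); // like $push: { voters: user }
  post.votes += 1;        // like $inc:  { votes: 1 }
  return true;
}

const post = { votes: 3, voters: ["dmerr", "sj", "jane"] };
console.log(voteOnce(post, "asya")); // true
console.log(voteOnce(post, "asya")); // false
console.log(post.votes);             // 4
```

Running the same vote twice leaves the counter unchanged the second time, because the query part no longer matches.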

44

{







votes : 4,

voters : ["dmerr", "sj", "jane", "asya" ],

comments : [

{ by : "tim157", text : "great story", ... },

{ by : "gora", text : "i don’t think so", ... },

{ by : "dmerr", text : "also check out..." }

]

}

45

{







votes : 4,

voters : ["dmerr", "sj", "jane","asya" ],

comments : [

{ by : "tim157", text : "great story" },

{ by : "gora", text : "i don’t think so" },


]

}

> db.posts.ensureIndex( { "comments.by" : 1 } )

46

> db.posts.find( { "comments.by" : "gora" } )

47

Seek = 5+ ms; read = really, really fast

[Figure: Post, Author, Comment]

48

[Figure: Post, Author, and five Comments]

Disk seeks and data locality
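To put numbers on the seek cost, here is a back-of-the-envelope sketch. It assumes one disk seek per separately stored record, uses the 5 ms lower bound from the slide, and treats sequential read time as negligible; the model and the name `seekCostMs` are illustrative assumptions, not measurements.

```javascript
const SEEK_MS = 5; // lower bound from the slide ("Seek = 5+ ms")

// Hypothetical model: every separately stored record costs one seek;
// sequential read time is treated as negligible.
function seekCostMs(recordCount) {
  return recordCount * SEEK_MS;
}

const comments = 5;                            // as in the second figure
const separate = seekCostMs(1 + 1 + comments); // post + author + comments
const together = seekCostMs(1);                // one contiguous document

console.log(separate, together); // 35 5
```

Under these assumptions, fetching a post with its author and five comments costs about seven seeks when each record lives elsewhere, versus one seek when they are stored together.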

49

• High Availability

• Data Redundancy

• Increase capacity with no downtime

• Transparent to the application

50

• A cluster of N servers

• Any (one) node can be primary

• All writes to primary

• Reads go to the primary (default), optionally to a secondary

• Consensus election of primary

• Automatic failover

• Automatic recovery
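The election and failover behavior listed above can be sketched very roughly as follows. This is a simplified model, not the real election protocol: it assumes a primary needs a strict majority of members to be reachable and prefers the reachable member with the newest data. The fields `name`, `up`, and `optime` and the function `electPrimary` are hypothetical.

```javascript
// Simplified model of primary election; the real protocol is richer.
// Fields `name`, `up`, and `optime` are hypothetical.
function electPrimary(members) {
  const reachable = members.filter((m) => m.up);
  // A primary needs a strict majority of all members.
  if (reachable.length * 2 <= members.length) {
    return null; // no majority, so no primary is elected
  }
  // Assumption: prefer the reachable member with the newest data.
  return reachable.reduce((best, m) => (m.optime > best.optime ? m : best)).name;
}

const nodes = [
  { name: "Node 1", up: false, optime: 120 }, // former primary, now down
  { name: "Node 2", up: true, optime: 118 },
  { name: "Node 3", up: true, optime: 119 }
];

console.log(electPrimary(nodes)); // Node 3
```

With two of three nodes still up, a new primary is chosen without operator action; if only one node remained, the function would return `null`.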

[Figure: three-node replica set (Node 1, Node 2, Node 3) electing a Primary: "Pick me!"]

51

Replica Sets

• High Availability/Automatic Failover

• Data Redundancy

• Disaster Recovery


• Perform maintenance with no down time

52

Asynchronous Replication

56

Automatic Election

58

• Increase capacity with no downtime


• Range based partitioning

• Partitioning and balancing is automatic
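Range-based partitioning can be pictured as a lookup table from key ranges to shards. The sketch below uses the four key ranges from this deck's sharding diagram (min..25, 26..50, 51..75, 76..max) and assumes integer keys with inclusive bounds; the shard names and the functions `routeKey` and `routeRange` are invented for illustration and do not describe how the router is actually implemented.

```javascript
// Chunk table using the key ranges from the sharding diagram.
// Shard names are hypothetical; bounds are inclusive integers.
const chunks = [
  { min: -Infinity, max: 25, shard: "shard0" },
  { min: 26, max: 50, shard: "shard1" },
  { min: 51, max: 75, shard: "shard2" },
  { min: 76, max: Infinity, shard: "shard3" }
];

// Route a single key to the shard whose range contains it.
function routeKey(key) {
  const chunk = chunks.find((c) => key >= c.min && key <= c.max);
  if (!chunk) throw new Error("no chunk covers key " + key);
  return chunk.shard;
}

// Route a range query to every shard whose range overlaps [lo, hi].
function routeRange(lo, hi) {
  return chunks.filter((c) => c.max >= lo && c.min <= hi).map((c) => c.shard);
}

console.log(routeKey(42));       // shard1
console.log(routeRange(20, 60)); // shard0, shard1, shard2
```

A query on a single key touches one shard, while a range query touches only the shards whose ranges overlap it.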

59

[Figure: sharded cluster. Four shards, each a replica set of one Primary and two Secondaries, with key ranges min..25, 26..50, 51..75, and 76..max. Applications connect through MongoS processes; three Config servers are also shown.]

63

• Few configuration options

• Does the right thing out of the box

• Easy to deploy and manage

64

Better data locality

[Figure labels: Relational, MongoDB]

In-Memory Caching

Auto-Sharding

Write scaling

Read scaling

"We just can't get any faster than the way MongoDB handles our data."

Tony Tam, CTO, Wordnik

65

• Supported Platforms:

– Linux, Windows, Solaris, Mac OS X

– Packages available for all popular distributions

• No external/third-party software dependencies

• 10gen maintains drivers for over a dozen languages

66

• User Data Management

• High Volume Data Feeds

• Content Management

• Operational Intelligence

• E-Commerce

68

Open source, high performance database
