NoSQL Concepts

Chapter 2: NoSQL Concepts

NoSQL Concepts (Part II)

NoSQL Databases

Based on Chapter 2 – Making Sense of NoSQL


Outline

• Keeping Components Simple

• Using Application Tier

• Speeding Performance

• Using Consistent Hashing

• Comparing ACID and BASE

• Database sharding

• Brewer’s CAP theorem


Keeping Components Simple

• Relational databases can be very complex. They perform a lot of functions: managing multiple tables, performing join operations, optimizing queries, replicating transactions, running stored procedures, setting triggers, enforcing security, and indexing.

• From the NoSQL perspective, simple is good. It’s not necessary to include all functions in a single software application. NoSQL systems are often built from a number of simpler components that can be reassembled to meet the needs of different applications.

• For example, one function allows sharing of objects in RAM (memcache), another function runs batch jobs (MapReduce), and yet another function stores binary documents (key-value stores).


Using Application Tier

• Using application tiers in your architecture allows you to create flexible and reusable applications.

• By segregating an application into tiers, you have the option of modifying or adding a specific layer instead of reworking an entire application when modifications are required.

• Application designers have to choose whether they need to use application tiers to divide the overall functionality of their application.

• Identifying each layer breaks the application into separate architectural components, which allows the software designer to determine the responsibility of each component. This separation of concerns allows designers to make the complicated seem simple when explaining the system to others.


Using Application Tier (Cont.)

Application tiers are typically viewed in a layer-cake-like drawing.

• The user events (like a user clicking a button on a web page) trigger code in the user interface. The output or response from the user interface is sent to the middle tier.

• This middle tier may respond by sending something back to the user interface, or it could access the database layer.

• The database layer may in turn run a query and send a response back to the middle tier. The middle tier then uses the data to create a report and sends it to the user.
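The request flow described above can be sketched as three small functions, one per tier. This is only an illustration: the function names, the in-memory `ORDERS` dictionary standing in for a database, and the report format are all assumptions, not something the slides specify.

```python
# Hypothetical three-tier sketch; every name and value here is illustrative.

# Database layer: an in-memory dict stands in for a real database.
ORDERS = {
    "alice": [12.50, 7.25],
    "bob": [3.00],
}


def run_query(user):
    """Database layer: return the raw order amounts for a user."""
    return ORDERS.get(user, [])


def build_report(user):
    """Middle tier: fetch data from the database layer and shape a report."""
    amounts = run_query(user)
    return {"user": user, "orders": len(amounts), "total": sum(amounts)}


def on_button_click(user):
    """User interface tier: react to a user event and render the response."""
    report = build_report(user)
    return f"{report['user']}: {report['orders']} orders, total {report['total']:.2f}"


print(on_button_click("alice"))  # prints "alice: 2 orders, total 19.75"
```

Because each tier only talks to the one below it, the database layer could be swapped out without touching the user interface code.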


Using Application Tier (Cont.)


Using Application Tier (Cont.)

• Both the relational and the NoSQL models have a user interface tier at the top. In the relational database application, most of the application functionality is found in the database layer.

• In the NoSQL application, most of the application functionality is found in the middle tier.

• In addition, NoSQL systems leverage more services for managing BLOBs of data (the key-value store), for storing full-text indexes (the Lucene indexes), and for executing batch jobs (MapReduce).

• In traditional relational database systems, the complexity found in the database tier impacts the overall scalability of the application.


Pros and Cons of RDBMS

RDBMS pros:

• ACID transactions at the database level make development easier.

• Fine-grained security on columns and rows using views prevents unauthorized users from viewing or changing data.

• Most SQL code is portable to other SQL databases, including open source options.

• Typed columns and constraints will validate data before it’s added to the database and increase data quality.

• Existing staff members are already familiar with entity-relational design and SQL.

RDBMS cons:

• The object-relational mapping layer can be complex.

• Entity-relationship modeling must be completed before testing begins, which slows development.

• RDBMSs don’t scale out when joins are required.

• Sharding over many servers can be done but requires application code and will be operationally inefficient.

• Full-text search requires third-party tools.

• It can be difficult to store high-variability data in tables.


Pros and Cons of NoSQL

NoSQL pros:

• Loading test data can be done with drag-and-drop tools before ER modeling is complete.

• Modular architecture allows components to be exchanged.

• Linear scaling takes place as new processing nodes are added to the cluster.

• Lower operational costs are obtained by autosharding.

• Integrated search functions provide high-quality ranked search results.

• There’s no need for an object-relational mapping layer.

• It’s easy to store high-variability data.

NoSQL cons:

• ACID transactions can be done only within a document at the database level. Other transactions must be done at the application level.

• Document stores don’t provide fine-grained security at the element level.

• NoSQL systems are new to many staff members and additional training may be required.

• The document store has its own proprietary nonstandard query language, which hinders portability.

• The document store won’t work with existing reporting and OLAP tools.


Terminology of database cluster

• In general, each cluster consists of racks filled with commodity computer hardware.

• Nodes contain a single logical processor called a CPU, local RAM and disk. The CPU may in fact be implemented using multiple chips, each with multiple core processors. The disk system may also be composed of several independent drives.

• Nodes are grouped together in racks that have high-bandwidth connections between all the nodes within a rack.

• Racks are grouped together to form a database cluster within a data center.

• A single data center location may contain many database clusters.


Speeding Performance

• Generally, traditional database management systems weren’t concerned with memory management optimization.

• The solution to faster systems is to keep as much of the right information in RAM as you can, and check your local servers to see if they might also have a copy. This local fast data store is often called a RAM cache or memory cache.

• Many memory caches use a simple timestamp for each block of memory in the cache as a way of keeping the most recently used objects in memory.

• When memory fills up, the timestamp is used to determine which items in memory are the oldest and should be overwritten.

• A more refined view can take into account how much time or resources it’ll take to re-create the dataset and store it in memory.

• This “cost model” allows more expensive queries to be kept in RAM longer than similar items that could be regenerated much faster.
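The timestamp and cost-model ideas above can be sketched in a few lines. This is a minimal illustration, not how any particular cache product works: the class name `TimestampCache`, the logical clock, and the eviction score (idle time divided by cost) are all assumptions chosen for brevity.

```python
import itertools


class TimestampCache:
    """Tiny cache that evicts by timestamp, weighted by re-creation cost."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._clock = itertools.count()  # logical clock used as the timestamp
        self._items = {}                 # key -> [value, last_used, cost]

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None                  # cache miss
        entry[1] = next(self._clock)     # refresh the timestamp on every hit
        return entry[0]

    def put(self, key, value, cost=1.0):
        if key not in self._items and len(self._items) >= self.capacity:
            self._evict()
        self._items[key] = [value, next(self._clock), cost]

    def _evict(self):
        now = next(self._clock)
        # Idle time divided by cost: an expensive item must sit unused
        # longer than a cheap one before it is chosen for eviction.
        victim = max(
            self._items,
            key=lambda k: (now - self._items[k][1]) / self._items[k][2],
        )
        del self._items[victim]


cache = TimestampCache(capacity=2)
cache.put("report", "slow result", cost=10.0)  # expensive to regenerate
cache.put("lookup", "fast result", cost=1.0)   # cheap to regenerate
cache.put("other", "another result")           # cache is full: one entry must go
print(cache.get("report"), cache.get("lookup"))  # prints "slow result None"
```

With every cost left at the default of 1.0, the policy reduces to plain "oldest timestamp goes first".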


Speeding Performance (Cont.)

The effective use of a RAM cache depends on efficiently answering two questions:

• Have we run this query before?

• Have we seen this document before?

These questions can be answered by using consistent hashing, which lets you know if an item is already in the cache or if you need to retrieve it from SSD or HDD.


Using Consistent Hashing

• Consistent hashing quickly tells you if a new query or document is the same as one already in your cache.

• Knowing this information prevents you from making unnecessary calls to disk for information and keeps your databases running fast.

• A hash string (also known as a checksum or hash) is a sequence of letters and digits that a hash function calculates by looking at each byte of a document.

• For practical purposes, the hash string uniquely identifies each document.

• It is used to determine whether the document you’re presented with is the same document you already have on hand.

• If there’s any difference between two documents (even a single byte), the resulting hash will be different.
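The "have we run this query before?" check described above can be sketched by using a hash string as the cache key. The slides do not name a hash function; SHA-256 from Python's standard `hashlib` is an assumption here, and `slow_lookup` is a hypothetical stand-in for a read from SSD or HDD.

```python
import hashlib

_cache = {}                  # hash string -> cached result
stats = {"disk_reads": 0}    # counts how often we had to go to "disk"


def hash_string(data: bytes) -> str:
    """Return a hex hash string computed over every byte of the input."""
    return hashlib.sha256(data).hexdigest()


def slow_lookup(query: str) -> str:
    """Hypothetical stand-in for an expensive read from SSD or HDD."""
    stats["disk_reads"] += 1
    return f"result of {query}"


def run_query(query: str) -> str:
    key = hash_string(query.encode("utf-8"))
    if key not in _cache:    # not seen before: do the expensive work once
        _cache[key] = slow_lookup(query)
    return _cache[key]


run_query("SELECT name FROM users")
run_query("SELECT name FROM users")   # identical bytes, served from the cache
print(stats["disk_reads"])            # prints 1

# A single changed byte produces a different hash string.
print(hash_string(b"document v1") == hash_string(b"document v2"))  # prints False
```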


Using Consistent Hashing (Cont.)

• Consistent hashing occurs when two different processes running on different nodes in your network create the same hash for the same object.

• Consistent hashing confirms that the information in the document hasn’t been altered and allows you to determine whether the object exists in your cache or message store, saving precious resources by only rerunning processes when necessary.

• Consistent hashing is also critical for synchronizing distributed databases. For example, version control systems such as Git or Subversion can run a hash not only on a single document but also on hashes of hashes for all files within a directory.
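The "hashes of hashes" idea can be sketched as follows: hash each file, then hash the sorted list of file names and file hashes to get one value for the whole directory. This is a simplified illustration under assumed names (`file_hash`, `directory_hash`) and an assumed hash function (SHA-256); it is not the actual object format used by Git or Subversion.

```python
import hashlib


def file_hash(content: bytes) -> str:
    """Hash string for a single file's bytes."""
    return hashlib.sha256(content).hexdigest()


def directory_hash(files: dict) -> str:
    """Hash of hashes: combine each file name and file hash in sorted order.

    `files` maps a file name (str) to its content (bytes).
    """
    combined = hashlib.sha256()
    for name in sorted(files):           # sorted, so every node uses the same order
        combined.update(name.encode("utf-8"))
        combined.update(b"\0")
        combined.update(file_hash(files[name]).encode("ascii"))
        combined.update(b"\n")
    return combined.hexdigest()


node_a = {"a.txt": b"hello", "b.txt": b"world"}
node_b = {"b.txt": b"world", "a.txt": b"hello"}   # same files, listed differently
node_c = {"a.txt": b"hello", "b.txt": b"World"}   # one byte differs

print(directory_hash(node_a) == directory_hash(node_b))  # prints True
print(directory_hash(node_a) == directory_hash(node_c))  # prints False
```

Two nodes can compare a single directory hash first and only look at individual files when the directory hashes differ.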


ACID vs. BASE

• Typically, one of two transaction control models is used:

1. ACID, used in RDBMSs, which focuses on consistency.

2. BASE, found in many NoSQL systems, which focuses on availability.

• The difference is in the amount of effort required by application developers and the location (tier) of the transactional controls.


BASE

BASE stands for these concepts:

• Basic Availability allows systems to be temporarily inconsistent so that transactions are manageable. In BASE systems, the information and service capability are “basically available.”

• Soft-state recognizes that some inaccuracy is temporarily allowed and data may change while being used to reduce the amount of consumed resources.

• Eventual consistency means eventually, when all service logic is executed, the system is left in a consistent state.


BASE (Cont.)

• The number-one objective of BASE systems is to allow new data to be stored, even at the risk of being out of sync for a short period of time. These systems relax the rules and allow reports to run even if not all portions of the database are synchronized.

• They’re optimistic in that they assume that eventually all systems will catch up and become consistent.

• BASE systems tend to be simpler and faster because they don’t need code that deals with locking and unlocking resources.

• Their mission is to keep the process moving and deal with broken parts at a later time.

• BASE systems are ideal for web storefronts, where filling a shopping cart and placing an order is the main priority.
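The BASE behaviour described above (accept the write now, let the other copies catch up later) can be sketched with a toy store. The class `EventuallyConsistentStore`, its queue-based replication, and the explicit `sync` step are assumptions made for illustration; real systems replicate in the background and handle conflicts, which this sketch does not.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy store: replica 0 accepts writes, the others catch up on sync()."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        # One queue of not-yet-applied updates per replica (replica 0's stays empty).
        self.pending = [deque() for _ in range(replica_count)]

    def write(self, key, value):
        # Basically available: store the new data right away, with no locking.
        self.replicas[0][key] = value
        for queue in self.pending[1:]:
            queue.append((key, value))

    def read(self, key, replica=0):
        # Soft state: a lagging replica may return stale or missing data.
        return self.replicas[replica].get(key)

    def sync(self):
        # Eventual consistency: once every queued update is applied,
        # all replicas hold the same data.
        for replica, queue in zip(self.replicas, self.pending):
            while queue:
                key, value = queue.popleft()
                replica[key] = value


store = EventuallyConsistentStore(replica_count=2)
store.write("cart:42", ["book"])
print(store.read("cart:42", replica=1))  # prints None (replica 1 is out of sync)
store.sync()
print(store.read("cart:42", replica=1))  # prints ['book']
```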


Horizontal scalability with sharding

• There may come a point when the amount of data needed to run the business exceeds the current environment and some mechanism for breaking the information into reasonable chunks is required.

• Sharding means breaking a database into chunks called shards and spreading the chunks across a number of distributed servers.

• Organizations and systems that reach this capacity can use automatic database sharding.


Database Sharding

• Let’s say you’ve created a website that allows users to log in and create their own personal space to share with friends.

• They have profiles, text, and product information on things they like (or don’t like).

• You set up your website, store the information in a MySQL database, and you run it on a single CPU.

• People love it: they log in, create pages, invite their friends, and before you realize it your disk space is 95% full.

What do you do?


Database Sharding (Cont.)

The answer is to buy a new system and transfer half the users to the new system. Oh, and your old system might have to be down for a while so you can rewrite your application to know which database to get information from.

The process of moving from a single to multiple databases can be done in a number of ways; for example:

1. You can keep the users with account names that start with the letters A-N on the original system and put users from O-Z on the new system.

2. You can keep the people in the United States on the original system and put the people who live in Europe on the new system.

3. You can randomly move half the users to the new system.
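The three options above can each be expressed as a small routing function that tells the application which database (shard 0 is the original system, shard 1 the new one) holds a given user. All names here are hypothetical. Two assumptions are worth calling out: option 2 sends every non-US user to the new system, and option 3 swaps a true random choice for a hash of the user name, so the application can find the user again without a lookup table.

```python
import hashlib


def shard_by_name(username: str) -> int:
    """Option 1: account names starting with A-N on shard 0, O-Z on shard 1.

    Names that start with a digit sort before "N" and so land on shard 0.
    """
    return 0 if username[0].upper() <= "N" else 1


def shard_by_region(country: str) -> int:
    """Option 2: United States on shard 0; everyone else (assumed) on shard 1."""
    return 0 if country == "US" else 1


def shard_by_hash(username: str, shard_count: int = 2) -> int:
    """Option 3 variant: a repeatable, roughly even split based on a hash."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


print(shard_by_name("alice"), shard_by_name("oscar"))   # prints "0 1"
print(shard_by_region("US"), shard_by_region("FR"))     # prints "0 1"
print(shard_by_hash("alice") == shard_by_hash("alice"))  # prints True
```

Note that with the modulo approach in `shard_by_hash`, changing `shard_count` moves most users to a different shard, which is one reason the question of what happens when the site doubles again matters.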


Database Sharding (Cont.)

Some important questions:

• In option 1, if a user changes their name, should they be automatically moved to the new system?

• In option 2, if a user moves to a new country, should all their data be moved?

• If people tend to share links with people near them, would there be performance advantages to keeping these users together?

• What if people in the United States tend to be active at the same time in the evening?

• Would one database get overwhelmed and the other be idle?

• What happens if your site doubles in size again?

• Do you have to continue to rewrite your code each time this happens?

• Do you have to shut the system down for a weekend while you upgrade your software?



As the number of servers grows, you find that the chance of any given server being down remains the same.

• So for every server you add, the chance that some part of the system is not working increases.

• So you think that perhaps the same process you used to split the database between two systems can also be used to duplicate data to a backup or mirrored system if the first one fails.

• But then you have another problem. When there are changes to a master copy, you must also keep the backup copies in sync.

• You must have a method of data replication. The time it takes to keep these databases in sync can decrease system performance. You now need more servers to keep up!
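The earlier point that each added server raises the chance of some part being down can be made concrete with a little arithmetic. Assuming servers fail independently and each is down with probability p (both assumptions for illustration), the chance that at least one of n servers is down is 1 - (1 - p)^n. The 1% figure below is made up for the example.

```python
def chance_any_server_down(per_server_chance: float, server_count: int) -> float:
    """Probability that at least one of `server_count` servers is down,
    assuming each fails independently with the same probability."""
    return 1.0 - (1.0 - per_server_chance) ** server_count


# Illustrative only: assume each server is down 1% of the time.
for n in (1, 2, 10, 100):
    print(n, round(chance_any_server_down(0.01, n), 4))
```

The per-server chance never changes, yet the chance that something in the cluster is broken keeps growing with n, which is why replication becomes necessary as the cluster grows.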



Welcome to the world of database sharding, replication, and distributed computing.

• NoSQL systems have been noted for having many ways to allow you to grow your database without ever having to shut down your servers.

• Keeping your database running when there are node or network failures is called partition tolerance—a new concept in the NoSQL community and one that traditional database managers struggle with.


Understanding trade-offs with Brewer’s CAP theorem

The CAP theorem states that any distributed database system can have at most two of the following three desirable properties:

1. Consistency—Having a single, up-to-date, readable version of your data available to all clients. This isn’t the same as the consistency we talked about in ACID. Consistency here is concerned with multiple clients reading the same items from replicated partitions and getting consistent results.

2. High availability—Knowing that the distributed database will always allow database clients to update items without delay. Internal communication failures between replicated data shouldn’t prevent updates.

3. Partition tolerance—The ability of the system to keep responding to client requests even if there’s a communication failure between database partitions. This is analogous to a person still having an intelligent conversation even after a link between parts of their brain isn’t working.
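The trade-off can be illustrated with a toy model of two replicas and a link between them. When the link fails, the system must either refuse updates (keeping consistency, giving up availability) or accept them on one side only (keeping availability, giving up consistency). The class `ReplicaPair` and its `prefer_consistency` flag are hypothetical names for this sketch, which ignores everything a real database does to detect and heal partitions.

```python
class ReplicaPair:
    """Two replicas of one value; `partitioned` models a broken link between them."""

    def __init__(self, prefer_consistency: bool):
        self.prefer_consistency = prefer_consistency
        self.partitioned = False
        self.values = [None, None]

    def write(self, replica: int, value) -> bool:
        """Try to update via one replica; return True if the update was accepted."""
        if not self.partitioned:
            self.values = [value, value]  # link is up: both replicas get the update
            return True
        if self.prefer_consistency:
            return False                  # give up availability: refuse the update
        self.values[replica] = value      # give up consistency: replicas may differ
        return True


cp = ReplicaPair(prefer_consistency=True)
cp.write(0, "v1")
cp.partitioned = True
print(cp.write(0, "v2"), cp.values)  # prints "False ['v1', 'v1']"

ap = ReplicaPair(prefer_consistency=False)
ap.write(0, "v1")
ap.partitioned = True
print(ap.write(1, "v2"), ap.values)  # prints "True ['v1', 'v2']"
```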


CAP theorem


Questions?
