Friday, 11 August 2017

BIG DATA VS RDBMS TRADE-OFFS + STORING BIG DATA IN RDBMS


Source: https://www.quora.com/What-are-some-good-examples-of-Big-Data-vs-RDBMS-tradeoffs-What-are-good-signs-that-you-should-consider-making-said-tradeoffs-as-an-architect


**** 


What are some good examples of Big Data vs. RDBMS tradeoffs? What are good signs that you should consider making said tradeoffs as an architect?

5 Answers




Daniel Lemire
You are looking at it from a pure performance/scalability point of view, and I think this is the wrong point of view. With enough money, you can make Oracle scale up.

Look: you can buy a server with 2 TB of RAM from Dell. It is a bit expensive, but not at all unrealistic for a large corporation. There are very few databases that will fail to be fast with 2 TB of RAM.

The problem is deeper and more serious. It is about design. The RDBMS assumes that your data must be fundamentally legible. Indeed, in theory, an RDBMS uses Codd's relational model. That is what the "R" in RDBMS stands for, after all. How do you map your real-world problems to the relational model? There are many strategies, but the textbook approach is a three-step process: conceptual modelling, logical modelling and physical modelling.
So we have a solved problem: take your enterprise problem, map it to an RDBMS, and you are good to go.

There is a problem with this story though. It just does not work. The empirical evidence is that this modelling is never done. Never.

So, what is wrong? The relational approach makes some false assumptions, and they all have to do with the fact that in the real world, data is only partially legible:

1. Your information system is strongly consistent at all times.

2. Semantics are absolute (not defined in relation to other things, not relative or comparative).

3. Semantics are static.


And so on... These assumptions are always false, but they are "more" false in some cases.

A pretty good example is this: should you maintain web server access logs using the same database engine you use for financial transactions? Clearly, one requires a much higher degree of consistency than the other. It is quite artificial to force both into the same framework.

Facebook is another example. They have completely forgone the relational model. They use MySQL, but only as a key-value store. Part of the reason they do it this way is that they recognize that all three assumptions above (1, 2, 3) simply do not apply to their model. There is a new Facebook every week...

But *why* is it a problem in practice? Well, imagine that you are a Facebook developer and you need a new attribute to code a new cool feature. It is easy: you just do it, because there is no schema to change, as there is no explicit schema to begin with. But if you used an RDBMS the way it is supposed to be used, then you would need to update the schema. This would require an update to your conceptual and logical modelling, as well as an update to your physical model. This would involve meetings. This would involve ontological discussions... Thus, a feature that might take 2 hours to code might turn into a major project costing millions of dollars.
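To make that concrete, here is a minimal sketch of the key-value pattern being described, using sqlite3 as a stand-in for MySQL; the table and attribute names are invented for illustration:

# Key-value ("entity-attribute-value") table: new attributes need no DDL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_data (
        user_id   INTEGER NOT NULL,
        attribute TEXT    NOT NULL,
        value     TEXT,
        PRIMARY KEY (user_id, attribute)
    )
""")

# Existing attributes.
conn.execute("INSERT INTO user_data VALUES (1, 'name', 'Alice')")
conn.execute("INSERT INTO user_data VALUES (1, 'city', 'Boston')")

# A brand-new attribute for a new feature: no ALTER TABLE, no schema meeting,
# just another row with a previously unseen attribute name.
conn.execute("INSERT INTO user_data VALUES (1, 'poke_sound', 'quack')")

for row in conn.execute("SELECT attribute, value FROM user_data WHERE user_id = 1"):
    print(row)

The trade-off, of course, is that the database no longer knows or enforces anything about what those attributes mean.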

I invite you to read our paper where we discuss these issues further:

Antonio Badia and Daniel Lemire. A call to arms: revisiting database design. SIGMOD Record 40 (3), pages 61-69, 2011 http://arxiv.org/abs/1105.6001





Navin Kabra
(Warning: this answer is written so that not-so-technical readers get a good idea of the issues involved. For pedagogical purposes, there are a number of over-simplifications. In addition, while I believe I got the "overall picture" right, I might have mis-stated some of the capabilities of specific NoSQL databases. If you find any glaring errors, please submit a Suggest Edit.)

First you need to understand what an RDBMS does for you. In this context, that means you need to understand what transactions are, then you need to understand what guarantees an RDBMS gives you with respect to transactions, and then finally you can understand which of these guarantees are not provided by a particular NoSQL database.

To understand transactions, see my answer to What guarantees do Relational Databases (RDBMS) provide with respect to safe data storage? You should now know what a transaction is, and what ACID properties are. Specifically, atomicity, consistency, and durability are important for the purposes of this answer. Isolation is not relevant. For consistency, triggers, cascades, and integrity constraints are not important, but distributed consistency is extremely important. That answer also contains an example which you'll be expected to know (of Steven transferring 3000 Quora credits to me, and the fact that this update needs to happen on the 10 identical copies of the database on 10 different servers).

Before I start talking about NoSQL, there are two more concepts I need to introduce in relation to consistency: partition tolerance and availability. Here, partition tolerance simply says that if one or more of the Quora servers are not contactable (either because the server is down, or because the network link to the server is down), Quora is not allowed to tell its users that the service is unavailable just because some servers are down. And availability says that all read and update requests should complete in a reasonable amount of time.

The problem is that consistency, availability, and partition tolerance are very sadly interlinked by a theorem that says: no distributed system in the world can provide all three at the same time. This is known as the CAP Theorem (See: Distributed Systems: What is the CAP theorem? and What is the relation between SQL, NoSQL, the CAP theorem and ACID?)

Now, in the modern, cloud-computing enabled world, all users expect partition tolerance. Most users don't even know that the websites they go to have multiple servers, and that is the way it should be. So partition tolerance is table stakes. Which leaves the website owner with having to trade consistency off against availability.

This is primarily where NoSQL comes in. Providing all the ACID guarantees and availability at the same time is impossible. In fact, just providing all the ACID guarantees, even without providing availability, is expensive and needs lots of beefy servers if you're going to serve millions of users.

In the old world, where banks use RDBMSs, they value their ACID properties above all. This means that they give up on availability, and even performance. If a server goes down, your banking software will be unavailable. If millions of people use their online banking software at the same time, the response times will be terrible and/or the banks will have to spend quite a lot on their server and network infrastructure (not to mention database and server administrators). This is how the world worked for the first 30 years of the existence of databases.

However, as the Facebooks and Flickrs and Quoras of the world came up, they realized that keeping a user's poke and like data accurate up to the last microsecond is not as important as doing the same with bank data. And if, for a few minutes, Navin and Steven see a different number of credits, or if the number of friends Makarand has on Facebook is reported differently to different people, no one cares. So they said: let's give up some ACID properties, and see if we can get amazing availability and blazing fast websites with very few servers.

But the RDBMS systems refused to play ball. After spending their entire lives sacrificing everything on the altar of ACID properties, they just could not handle having to give up some of these properties. (As with any generalization, this one is not entirely true, but let's not let minor facts get in the way of a good story.)

So, the Googles and Facebooks and Amazons and other dotcoms of the world started writing new database systems that give up on one or more specific ACID guarantees in return for scalability, availability and performance.

Let's look at some examples:

Memcachedb and Redis are primarily in-memory key-value stores which give up on durability to get blindingly fast speeds. They are very appropriate for keeping secondary data like caches, or various summaries, pageview counts, etc. which are constantly being updated, needed in lots of places in the website, and which can be recreated from the "real" database in case there is a power failure and all the data is lost. Recent versions of these databases do allow the data to be saved to disk periodically, but the guarantees fall far, far short of the durability guarantees of an RDBMS.

Architecturally, these are almost never used as the primary data store. They are usually added close to the front end to significantly speed up things for which durability is not important.
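A minimal sketch of that "secondary data" use case, using the redis-py client and assuming a Redis server on localhost (the key names are made up):

import redis

r = redis.Redis(host="localhost", port=6379)

# Fast, non-durable counters: fine to lose on a crash, cheap to recreate.
r.incr("pageviews:/home")
r.incr("pageviews:/home")

# Cache a rendered fragment with a short expiry instead of hitting the RDBMS.
r.setex("cache:user:42:profile_html", 300, "<div>...</div>")

print(r.get("pageviews:/home"))  # b'2'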

Most "standard" NoSQL databases like Facebook's Cassandra, Amazon's Dynamo, Voldemort give up on the consistency requirement to give better availability and performance. These database usually provide atomicity, isolation, and durability guarantees, but for consistency, they provide a weaker guarantee called "eventual consistency". This is equivalent to saying that after a transaction has succeeded, all users will "eventually" see consistent data. For a while, different users might see different values, and this could be a few seconds, or a few minutes, or if the shit is really hitting the fan, it could be a few hours or days, but eventually, we'll update the copies of the data on all 10 servers, and then everyone will be happy.

Specifically, what they do is that when Steven transfers 3000 credits to me, they do not send this update to all 10 servers. Typical behavior would be to send the update to 2 or 3 servers. The actual number differs for various databases and implementations and can be configured, but basically the idea is that the update should happen in more than one place, to ensure that one server's disk getting totally toasted does not result in loss of data. But, since the update is not being sent to all 10 servers, it happens much faster, and does not block if one of the servers is unreachable. The downside is that users who are connected to servers which did not get the update will see old values.

The NoSQL database has various processes and algorithms to ensure that over the next few seconds or minutes, the updates spread through the servers, and the database guarantees that sooner or later each and every server gets the update. This is called eventual consistency.
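To make the "don't wait for all 10 servers" idea concrete, here is a minimal sketch using the DataStax cassandra-driver; the keyspace, table, and addresses are invented, but the consistency levels themselves are real Cassandra settings:

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect("quora")

# ONE: acknowledge after a single replica has the write (fast, eventually consistent).
# QUORUM: wait for a majority of replicas (slower, but QUORUM reads see the latest value).
fast_write = SimpleStatement(
    "UPDATE credits SET balance = 3000 WHERE user = 'navin'",
    consistency_level=ConsistencyLevel.ONE,
)
safer_write = SimpleStatement(
    "UPDATE credits SET balance = 3000 WHERE user = 'navin'",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(fast_write)
session.execute(safer_write)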

Architecturally, it is possible that one such database is your "primary" database, especially if you're a cloud-friendly web-2.0 company. But the implications of the lack of consistency guarantees need to be thought through, because most users expect consistency and can get surprised by its absence. Another issue is that you typically have some data (especially financial data) for which "eventual consistency" is not good enough, and that data might still be stored in a traditional RDBMS. Hence it is pretty common for a NoSQL database to co-habit with an RDBMS, especially in larger companies.

In this context, it is especially important to read Charlie Cheever's answer to the question Why does Quora use MySQL as the data store instead of NoSQLs such as Cassandra, MongoDB, or CouchDB? Are they doing any JOINs over MySQL? Are there plans to switch to another DB?

Other NoSQLs:

Google's BigTable is a strange beast. It does provide ACID guarantees, but only on certain subsets of the data. Specifically, updates are guaranteed to be transactional only if they touch data within the same "entity group" (some cross-group transactions are allowed), but you don't get ACID guarantees for arbitrary updates that cross group boundaries. So, you get a little of this and a little of that, and the best of both worlds if you're able to map your problem nicely to the entity groups and other constraints, but also frustration if at a later date your requirements change and you need transactions across entity groups.

There are a bunch of NoSQL databases which exist to get around the very rigid structure that an RDBMS imposes on each row of a table in the database.

MongoDB and CouchDB allow JSON "documents" (basically almost any JavaScript-style data structure) to be stored in the database instead of a DBMS "row". Because of this, their query language also tends to be very different. They give a lot of flexibility in terms of the structure of the data to be stored.
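A minimal sketch of that flexibility with pymongo (the database, collection and field names are invented; assumes a local mongod):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client.blog.posts

# Two documents in the same collection with different shapes: no schema change needed.
posts.insert_one({"author": "navin", "title": "CAP in practice", "tags": ["nosql"]})
posts.insert_one({"author": "steven", "title": "Credits", "credits_transferred": 3000,
                  "comments": [{"who": "makarand", "text": "nice"}]})

# The query language works over nested structure rather than joined tables.
for doc in posts.find({"tags": "nosql"}):
    print(doc["title"])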

Graph databases like Neo4j are optimized to store "graph" data, which could be used for sophisticated analysis of social networks, for example.
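A minimal sketch of the kind of traversal a graph database makes easy, using the official neo4j Python driver (the URI, credentials and friend-of-a-friend query are illustrative only):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    session.run("CREATE (:Person {name: $a})-[:FRIEND]->(:Person {name: $b})",
                a="Alice", b="Bob")
    # Friends-of-friends: the kind of traversal that is painful as SQL self-joins.
    result = session.run(
        "MATCH (p:Person {name: $name})-[:FRIEND*2]->(fof) RETURN fof.name",
        name="Alice",
    )
    for record in result:
        print(record["fof.name"])

driver.close()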

A more detailed look at the various NoSQL databases is beyond the scope of this answer, but here is a list of names for the motivated reader to Google:

Document Databases: Lotus Notes, CouchDB, MongoDB, SimpleDB, TerraStore
Key-Value Stores: Dynamo, Cassandra, Voldemort, Tokyo Cabinet, Riak, Hibari
Graph: AllegroGraph, VertexDB, Neo4j, FlockDB, Cerebrum
Column Databases: BigTable, HBase, Hypertable, Mnesia
You should also look at Geo NoSQL, Object Databases, and FileSystems (since modern filesystems often have some transactional guarantees and can in some cases act as databases - an area worth exploring).



What do you gain over an RDBMS?
1. Most NoSQL systems scale horizontally rather than vertically, unlike RDBMS systems.
2. Cost factor. Most of the open source NoSQL systems run on commodity hardware. If the team invests enough effort, there is almost no cost beyond the commodity hardware.
3. Some RDBMS systems cannot scale writes at all beyond a breaking point.
4. If you are a geographically distributed website serving content and you need cross-datacenter support, most RDBMSs were never built for such requirements, and where solutions are available, they are costly.

What do you lose compared to an RDBMS?
It depends on your use case. Everyone needs to ask what it is that is pushing them towards NoSQL systems. Is it cost? Is it scale? Don't fix things which are not broken.

Most NoSQL systems are distributed systems, and Brewer's theorem (http://en.wikipedia.org/wiki/CAP...) pretty much determines where a particular solution falls. You need to choose 2 out of the 3.

Every NoSQL solution has its own feature list.

In short, 3 things will mostly guide your choice: data models, the CAP theorem, and community.

As an architect, ask questions. Make a checklist of the current pain points and of what you want out of the system in the near future. That will usually narrow you down to 1 or 2 solutions; then pick the one you like.



Jonathan Jaffe
For NoSQL, the main tradeoff is whether there is a masterless model versus one or more nodes that act as a master.

Another tradeoff is whether the database is schemaless (such as CouchDB) versus a ragged table (such as Cassandra).

So I would look at replication, whether the database is optimized for read/write or read/write/update, and the structure of the database.



So first, I wouldn't say the tradeoff is simply between capabilities and performance. It's between one set of capabilities and another. The most commonly given reason is actually scalability (on a single node, RDBMS performance is likely competitive and very often much better). Having said that, here are some tradeoffs that existing systems have made:
1) Cassandra/HBase/Mongo etc. give up the general-purpose transactions that RDBMSs offer, in exchange for scalability.
2) The MapReduce implementation in Hadoop (which Hive/Pig inherit from) gives up pipelining between operators/processes (which offers better performance) in exchange for fault tolerance, by writing to intermediate files.
3) Cassandra (as compared to HBase or Mongo) gives the developer the option of giving up consistency for availability.

By the way, I personally find schemalessness a surprising argument for moving out of RDBMS land. Thanks to a lot of the extensibility work done in the '90s, many RDBMSs (Postgres, for example, as well as commercial systems) make it easy to change the structure of a table very quickly. You can even combine fixed fields (that will occur all the time) with dynamic fields. I guess these systems aren't really relational but more object-relational, with flexible type systems. Some commercial DBs also make it much easier to make schema changes dynamically in the true relational world (no stop-the-world when adding a column, for example). So the performance impact of schema change is minimal.
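As a minimal sketch of that point, assuming Postgres and the psycopg2 driver (connection details and names are invented), a table can combine fixed columns with a JSONB column for dynamic fields, and adding a nullable column is a cheap, metadata-only change:

import psycopg2

conn = psycopg2.connect("dbname=app user=app")
cur = conn.cursor()

# Fixed fields that always occur, plus a JSONB column for ad-hoc attributes.
cur.execute("""
    CREATE TABLE users (
        id      BIGSERIAL PRIMARY KEY,
        name    TEXT NOT NULL,
        extras  JSONB DEFAULT '{}'::jsonb
    )
""")

# In modern Postgres, adding a nullable column is a metadata-only change,
# so it does not rewrite the table ("no stop-the-world").
cur.execute("ALTER TABLE users ADD COLUMN last_login TIMESTAMPTZ")

# New attributes can also go straight into the JSONB column with no DDL at all.
cur.execute("INSERT INTO users (name, extras) VALUES (%s, %s::jsonb)",
            ("alice", '{"poke_sound": "quack"}'))
conn.commit()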

As for when to make the tradeoff, look at http://highscalability.com/blog/.... I don't agree with all of it, but it's as good a place to start as any. I did have a better reference that I need to find. Will edit if I get it.







***


Source: https://www.quora.com/Can-we-store-big-data-in-RDBMS-Why-and-why-not




Can we store big data in RDBMS? Why and why not?

8 Answers




Lavi Nigam
The answer to this question depends on whether your data is structured or unstructured. If it's unstructured, the answer is plainly no. You can't store unstructured data (text, music, documents, videos, etc.) in an RDBMS. It doesn't matter whether it is small data or big data. You have much better options in that scenario, as mentioned in other answers: NoSQL.
But assume you have structured data which is big (it may or may not follow the velocity, variety and volume paradigm). The answer to whether that can be stored in an RDBMS system is yes. But the catch here is performance more than storage. How can you get the same level of performance? How can you query a large database time-efficiently?
The answer is: use a columnar database (a column-oriented DBMS implementation of SQL), where your data is stored column-wise and not in row-column fashion. Try AWS | Amazon Redshift - Cloud Data Warehouse Solutions. I have seen amazing benchmarks.

You can perform operations over millions of rows in seconds. They are not traditional RDBMS systems, but they get the work done.
So, yes, you can store big data in 'RDBMS-type' systems, where the traditional RDBMS is replaced by a column-oriented implementation, which gives you the capability to store and analyze large data sets.
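As a hedged sketch of that column-oriented approach, here is what creating and querying such a table might look like on Amazon Redshift, issued through psycopg2 (Redshift speaks the Postgres wire protocol); the cluster endpoint, credentials, table and key choices are illustrative only:

import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...")
cur = conn.cursor()

# Columnar storage plus distribution and sort keys are what make
# "millions of rows in seconds" aggregations feasible.
cur.execute("""
    CREATE TABLE page_events (
        user_id    BIGINT,
        url        VARCHAR(2048),
        event_time TIMESTAMP
    )
    DISTKEY (user_id)
    SORTKEY (event_time)
""")

# A typical scan-heavy analytical query: it touches only the columns it needs.
cur.execute("""
    SELECT url, COUNT(*) AS views
    FROM page_events
    WHERE event_time >= '2017-01-01'
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10
""")
conn.commit()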





Gary
You can.
There are many forms of big data; take, for example, social network site data.
Facebook's and Twitter's primary store is MySQL.
You can shard your data and use MySQL as a key-value store. Maybe this approach is not the most efficient, but it's practical.
Facebook has one of the largest MySQL database clusters in the world.
Under the hood: MySQL Pool Scanner (MPS)



You'll get much better answers Googling this. However, let me give you an ELI5 version.

Big data primarily refers to data that has three characteristics - velocity, volume, variety.

Velocity - the data is produced at a very fast rate.

Variety - the data is not structured; you can't represent it in tables and columns. For example, web server logs, or the contents of a novel.

Volume - the size of the data is enormous, on the order of terabytes.

Now, the traditional approach is that you copy your data from the storage device to your RAM and process it; this becomes impractical in the case of such large amounts of data.

So what you do is store the data on different systems, copy the code to those systems, and do the processing in parallel. (That's what Hadoop does.)

The code would be only a few megabytes and you can easily copy it.
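As a rough illustration of shipping code to the data, here is a classic word-count job written with the mrjob library (a Python wrapper around Hadoop Streaming); the same small script can run locally or be submitted to a Hadoop cluster, and the file paths are illustrative only:

from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Runs in parallel on whichever nodes hold the input blocks.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

# Run locally:             python wordcount.py novel.txt
# Run on a Hadoop cluster: python wordcount.py -r hadoop hdfs:///data/novel.txt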

And yes, you wouldn't store big data in an RDBMS (you can, but only inefficiently) because you cannot represent unstructured big data in rows and columns.

If I give you a thousand-page novel and ask you to represent it in an RDBMS, how would you do it? What would be the primary keys?



Harut Martirosyan
Big data means not only vast amounts of data but also unstructured semantics, so fitting it into the strictly structured store that an RDBMS represents is rather inconvenient.

Along with that, most of the features that RDBMSs bring become too slow when data is huge, even if you have lots of beefy hardware in your datacenter. In particular, joins and aggregations become really slow, having secondary indexes becomes too expensive, and in the end you need something which can store your data firmly and supports random lookups and fast scans through your data.

DISCLAIMER: I'm NOT affiliated with any of the products listed below.
If you need a SQL-like approach to your data, have a look at Apache Cassandra. If you need more extensive SQL support, consider using Apache Spark SQL, paired with Cassandra or with Parquet. If you need fast scans for deep analysis, use HBase.
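As a rough illustration of the Spark SQL option, here is a minimal PySpark sketch that reads Parquet files and queries them with SQL; the path and column names are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-sql-demo").getOrCreate()

events = spark.read.parquet("hdfs:///data/events.parquet")
events.createOrReplaceTempView("events")

# Full SQL (joins, aggregations) over files that never passed through an RDBMS.
top_pages = spark.sql("""
    SELECT url, COUNT(*) AS views
    FROM events
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()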



Harish
If by big data you mean big data technologies such as Hadoop, here is my take.

I want to clarify a fundamental difference in need for big data and traditional RDBMS.

Traditional RDBMSs such as Oracle, MySQL, and DB2 are predominantly used for transactional purposes, for example an order placed on an e-commerce website. Though analytical databases can be and are built on RDBMSs, there is an upper bound to their processing capacity and speed.

In contrast, big data technologies are primarily aimed at batch processing of terabytes and petabytes of data; this is possible because of distributed computing. To my knowledge, there aren't any big data technologies that could support transactional workloads effectively.

In a typical large scale enterprise, both traditional RDBMS and big data technologies are necessary for business needs. Neither can replace the other.



Yes, we can store data up to the limit supported by the RDBMS. But several RDBMS strategies like querying, indexing, etc. do not apply to traditional 'Big Data', so retrieving the data and making sense of it will be almost impossible. Also, the core of Big Data is semi-structured data, which again defeats the core referential integrity, relational theory and rigid schema concepts of almost any modern-day RDBMS.


David Badenchini
Yes. I've done it.

I created simple tables of key/value pairs that could be used to store any manner of unstructured data.
The problem is that they get pretty large fairly quickly, so this is not very well suited to huge quantities of data. Tables grow, indexes grow, things get slow.
The advantage is that you can structure the data into table format using a view (as sketched below) for easy consumption by reporting or import into an operational system, if quantities remain small.
It was also advantageous to my client because they did not have the systems, support or expertise to take on a big data project.
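A minimal sketch of that "view over key/value pairs" idea, using sqlite3 for illustration (the same pattern works in any RDBMS; the table and attribute names are invented):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kv (entity_id INTEGER, attr TEXT, value TEXT);
    INSERT INTO kv VALUES (1, 'name', 'Widget'), (1, 'price', '9.99'),
                          (2, 'name', 'Gadget'), (2, 'colour', 'red');

    -- Pivot the pairs back into a tabular shape for reporting.
    CREATE VIEW products AS
    SELECT entity_id,
           MAX(CASE WHEN attr = 'name'   THEN value END) AS name,
           MAX(CASE WHEN attr = 'price'  THEN value END) AS price,
           MAX(CASE WHEN attr = 'colour' THEN value END) AS colour
    FROM kv
    GROUP BY entity_id;
""")

for row in conn.execute("SELECT * FROM products"):
    print(row)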



William Emmanuel Yu
If you trim all the fat, structure it properly and optimize the structure, then it is possible. People have been doing this for quite some time.

But this has a costly trade-off of throwing away some data that might be useful later on.


****


Source: https://www.quora.com/How-do-I-find-a-good-format-for-big-data



How do I find a good format for big data?

2 Answers






Hi,

There are several data formats to choose from to load your data into the Hadoop Distributed File System (HDFS). Each of the data formats has its own strengths and weaknesses, and understanding the trade-offs will help you choose a data format that fits your system and goals.

We performed tests on the Hortonworks, Cloudera, Altiscale, and Amazon EMR distributions of Hadoop.

For the writing tests we measured how long it took Hive to write a new table in the specified data format.

For the reading tests, we used Hive and Impala to perform queries and recorded the execution time of each of the queries.

We used Snappy compression for most of the data formats, with the exception of Avro, where we additionally used Deflate compression.

The queries run to measure read speed were of the form:

SELECT COUNT(*) FROM <table> WHERE <conditions>

Query 1 includes no additional conditions.

Query 2 includes 5 conditions.

Query 3 includes 10 conditions.

Query 4 includes 20 conditions.
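For concreteness, here is a hypothetical reconstruction of the shape of these two kinds of tests, issued through PySpark for illustration (the original tests used Hive and Impala directly); the table name, columns, and condition values are invented:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-benchmark-sketch").getOrCreate()

# "Write test": materialise the source data into a specific file format.
spark.sql("""
    CREATE TABLE events_parquet
    USING PARQUET
    AS SELECT * FROM events_raw
""")

# "Read test", query 2 style: COUNT(*) with five filter conditions.
spark.sql("""
    SELECT COUNT(*)
    FROM events_parquet
    WHERE country = 'US'
      AND status = 200
      AND bytes > 1024
      AND user_agent LIKE 'Mozilla%'
      AND event_date = '2017-08-01'
""").show()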

Thanks,

Priyanka







