The NoSQL Hype Curve is Bending

Posted by on December 29, 2010

The technology hype of 2010 was clearly NoSQL, which proved to be more of a brand name than a technical term.

Today in his tech blog, Bozho set out his view that NoSQL is probably not a good choice for startups that don’t know yet where their database and application bottlenecks are:

But an important downside of NoSQL solutions, which is mentioned by most sources (twitter, facebook, rackspace) is that in NoSQL (at least for Cassandra and HBase) you must know what will be the questions that you will be asking upfront. You can’t just query anything you like. … And I can bet that a startup does not yet know all the questions it is about to ask its data store.

I wrote a similar conclusion in a feature article in September’s php|architect. Relational data modeling is driven by data, and there are mathematical rules of normalization that guide this process. Whereas nonrelational data modeling has no formal rules. It’s driven by the queries you need to support. Either you define your schema up front, or you define your usage of data up front, or else you set yourself up for a lot of sub-optimal queries and laborious database refactoring. You find yourself writing lots of code to reinvent the wheels that SQL gives you for free, and before long you’ve unwittingly reinvented the relational database.
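As a rough illustration of what SQL "gives you for free", here is a minimal sketch using Python's built-in sqlite3 module. The blog schema, table names, and sample rows are assumptions made up for this example, not taken from the September article. The point is only that a normalized schema can answer two questions nobody planned for when the tables were designed.

```python
import sqlite3

# Hypothetical blog schema, normalized: authors and posts in separate tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        title TEXT NOT NULL,
        posted_on TEXT NOT NULL  -- ISO date string
    );
    INSERT INTO authors VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO posts VALUES
        (1, 1, 'Hello', '2010-12-01'),
        (2, 1, 'Indexes', '2010-12-15'),
        (3, 2, 'Caching', '2010-12-15');
""")

# Question 1, not planned for when the schema was designed: posts per author.
posts_per_author = conn.execute(
    "SELECT a.name, COUNT(*) FROM authors a "
    "JOIN posts p ON p.author_id = a.id "
    "GROUP BY a.name ORDER BY a.name"
).fetchall()

# Question 2, also unplanned: titles posted on a given date, by any author.
titles_on_date = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM posts WHERE posted_on = ? ORDER BY title",
        ("2010-12-15",),
    )
]
```

Neither query required changing how the data is stored. Whether they run fast at scale is a separate question, which is where indexes and denormalization come in.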

There were also a few high-profile walk-backs and failures associated with NoSQL adoption in 2010. The most dramatic was the implosion of Digg after it launched a ground-up rewrite, architected around Cassandra. It turns out that no secret sauce can compensate for bad business decisions.

And who can forget the instantly-classic viral video MongoDB is Web Scale?
A little knowledge is a dangerous thing — the more little, the more dangerous.

In spite of this, I’m sure in 2011 we’ll hear some new claims of a panacea that puts a “turbo button” on your web server, supposedly obsoleting quaint, old-fashioned habits like thoughtful architecture, design, testing, and monitoring. I wonder what the miracle technology will be this time?



 

Responses and Pingbacks

Bill, thanks for writing this. Basing a technology decision on hype is generally a bad idea. However, within the past year I’ve started using CouchDB and have a very different perspective on the issue than I did a year ago.

First of all, can everyone please stop using the term “NoSQL”? It is a meaningless word and tells you absolutely nothing about the various properties of any particular databases technology. I know why you used it here in this blog post but I think the conversation going forward would be more meaningful if everyone stopped using this word. Other words such as “document-oriented database” are more meaningful and useful.

Second, Bozho said, “But an important downside of NoSQL solutions … is that in NoSQL … you must know what will be the questions that you will be asking upfront… And I can bet that a startup does not yet know all the questions it is about to ask its data store.” While technically a true statement, this is also a very misleading statement. Yes, ad hoc queries tend to not be supported, but Bozho’s statement implies that a startup must know what all of their queries will ever be before they first deploy their database. This is not true. You can bring new queries online after your database is deployed if you realize that you have new questions to ask of your data store. Again, my experience is with CouchDB but my understanding is that other document-oriented databases have similar abilities.

Third (and this is more of a comment on Bozho’s statements again), performance is not the only reason to choose a non-relational database. I agree with Knuth that “premature optimization is the root of all evil.” For many problems (especially web problems) a non-relational database is a much better fit. For example, a blogging application or a CMS may be better off with a document-oriented database than a relational database. Trying to cram such applications into a relational database often results in developers using anti-patterns such as EAV (hat tip to your SQL Antipatterns book). If your data is better represented as documents, then use a document-oriented database.

 

I couldn’t agree more. That’s why I actually have no application using NoSQL yet. I still have so many questions in mind.

So I would be interested in resources on best practices and use cases.

 

Hi Bradley, it’s good to hear from you. I don’t mean to say that non-relational databases aren’t innovative, useful in the right scenarios, and generally nifty. Of course they are. But as a group these technologies certainly got over-hyped during 2010 to such a degree that no technology could match the expectations that were encouraged by the marketing.

 

Hi Nicolas, the way I see it, the best practice for non-relational databases is the same as the best practice for denormalization in an SQL database. Decide which queries you want to optimize for, and design your database with that in mind.

To use Bradley’s example of a document-oriented database for a blog, would you rather define a document as a logical blog post, with embedded fields for author, tags, date, etc.? Or would you instead define the document as an author, with embedded blog posts written by that author? Or does a document correspond to a date, with one or more blog posts within? The answer is that this choice depends on how you’re going to query the data most often. You might even have a need for multiple types of queries, in which case you might resort to duplicate storage or storing lists of document references.
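To make that trade-off concrete, here is a small sketch in plain Python dictionaries. It is not any particular database's API; the sample posts and the helper names (`group_posts`, `titles`) are hypothetical. It builds the same posts as per-author documents and as per-date documents, which is the "duplicate storage" option.

```python
# Hypothetical sample data: (author, date, title) for each blog post.
posts = [
    ("alice", "2010-12-01", "Hello"),
    ("alice", "2010-12-15", "Indexes"),
    ("bob", "2010-12-15", "Caching"),
]

def group_posts(key_index):
    """Build one document per distinct key, embedding the matching posts."""
    docs = {}
    for post in posts:
        author, date, title = post
        docs.setdefault(post[key_index], []).append(
            {"author": author, "date": date, "title": title}
        )
    return docs

# Shape 1: a document per author. Shape 2: a document per date.
# Keeping both is the "duplicate storage" option: each post is stored twice.
docs_by_author = group_posts(0)
docs_by_date = group_posts(1)

def titles(doc):
    """Pull the titles out of one document's embedded posts."""
    return [p["title"] for p in doc]

# Each question is a single key lookup, but only against the matching shape.
alice_titles = titles(docs_by_author["alice"])
dec15_titles = titles(docs_by_date["2010-12-15"])

# Asking the "wrong" shape still works, but it must open every document.
dec15_via_authors = [
    p["title"]
    for doc in docs_by_author.values()
    for p in doc
    if p["date"] == "2010-12-15"
]
```

Both shapes hold the same information. What differs is which question each one answers with a single lookup, and that is the decision you have to make up front.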

My point is that this process of designing to fit your queries instead of designing to fit your data is characteristic of all non-relational solutions, including denormalizing a relational database.

And none of this analysis happens by magic, as the hype claimed. You still have to do some work to know your data and your queries up front.

 

I think you are stretching it a bit with your Digg analogy. You’ve wrapped some careful wording around it, but you are essentially implying that somehow the move to Cassandra and Digg’s fall from favor are related. That’s misleading. I actually agree with your point: there are use cases for relational databases and use cases for non-relational databases. It’s dangerous to try to solve the wrong problem with either approach. But thoughtful design, testing, and monitoring are universally applicable across any technology selections.

 

I agree with Brad, NoSQL is much better for some applications; especially CMS.

I built an online community website that has used NoSQL (actually XML LOBs stored in an RDBMS) for 10+ years, and it is running great. Initially I was paranoid because I was leaving my RDBMS behind; then it made sense, and everything is much easier.

The downside is as this article describes: you need to know what you are doing, think the data model through, and know the limitations.

Play with it first; I suggest BigTable on Google’s AppEngine as a place to start. Flex your brains; abandoning SQL and RDBMS is a big change so it won’t be easy; and do not let this article scare you; it worked for Google & it might work for you!!!

 

This analysis is not entirely correct. Much of your critique is unique to Cassandra, which does have the property you describe. In Cassandra, you are essentially designing your indexes from the beginning, but this is less true with other datastores lumped under the NoSQL banner.

I’m sure we’ve all seen plenty of relational databases that could have benefitted from someone thinking through the model and usage.

Your analysis of how you would model a blog post and the impact on queries applies just as much to laying out a relational model. You don’t need to know your queries up front with CouchDB or Riak any more than you would with MySQL. Scale will inevitably break your 3rd normal relational data model anyway. At scale, SQL starts to look suspiciously like NoSQL. (see also: http://gigaom.com/cloud/nosql-is-for-the-birds/)

Blaming what happened at Digg on Cassandra is nonsensical. That was caused by a host of bad business decisions, not the least of which was to release by a set date without regard to scope or what your engineers are telling you. There have certainly been failures of that nature on relational datastores.

Fanbois aside, I agree some of the irrational hype is subsiding, but that’s the nature of hype cycles. The proliferation of ideas and approaches is good, but TANSTAAFL. Rushing into something without thought or understanding seldom solves more problems than it causes.

 

Regarding document-oriented databases, we need to find the best strategy for using them efficiently in our project. The problem is that designing our application to work with a document database will be a severe headache. We need to code for everything that a relational database can implicitly perform while querying. It should be mentioned that many successful applications like foursquare and bit.ly make use of them in production.

 

Bill I want to comment on your post above about choosing the right document for the queries you want.

I work with MongoDB and can really only speak to its features. When designing a document for a “document-oriented database”, I’ve found it better to design the documents around your application than around the queries you want to run, assuming you’re building your application with an object-oriented approach. Keeping with the blog concept, if I were to build it, I would have an Author object which contains Post objects. MongoDB, as I assume other “document-oriented databases” do, lets you embed objects, which makes sense for the application as well. But like you said, the queries then become difficult: what about all posts from a certain date? With MongoDB you are able to index on arrays and on sub-documents of documents, so you can still pull out Posts which match a certain date or date range, and the lookup is indexed. The way this hinders your application with Mongo is that the index slows down your inserts, but your queries will run fairly quickly. Back to the inserts: depending on how you build your application, you might not even notice a performance difference with the indexed inserts, because when you perform an insert with MongoDB you usually do not wait for a response (but you can). You just pass the insert to the database, it gets thrown on the operations list, and you keep on moving; the database gets to it when it gets to it.

Now, yes, with the structure proposed above it would get unwieldy to have all of an author’s posts grouped with their object, and it would probably end up breaking Mongo’s 4MB-per-document rule. A much better design would be a collection containing all blog posts, each with a reference to its author document. Or author documents which have references to their blog posts; either works, depending on your application.
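The referenced design with a date index can be sketched in plain Python. This is an in-memory stand-in, not the MongoDB API, and every name in it (`insert_post`, `posts_on`, `author_of`, the sample data) is hypothetical. It shows the two ideas from the paragraphs above: posts point at their author instead of being embedded, and a secondary index makes date queries cheap at the cost of extra work on every insert.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two collections; not the MongoDB API.
authors = {1: {"name": "alice"}, 2: {"name": "bob"}}
posts = {}                         # post id -> post document
posts_by_date = defaultdict(set)   # secondary index: date -> post ids

def insert_post(post_id, author_id, date, title):
    """Store the post with a reference to its author, then update the index."""
    posts[post_id] = {"author_id": author_id, "date": date, "title": title}
    posts_by_date[date].add(post_id)   # the extra work every insert pays

def posts_on(date):
    """Answer a date query from the index instead of scanning all posts."""
    return sorted(posts[i]["title"] for i in posts_by_date.get(date, ()))

def author_of(post_id):
    """Follow the reference from a post to its author document."""
    return authors[posts[post_id]["author_id"]]["name"]

insert_post(10, 1, "2010-12-01", "Hello")
insert_post(11, 1, "2010-12-15", "Indexes")
insert_post(12, 2, "2010-12-15", "Caching")
```

Here `posts_on("2010-12-15")` touches only the two indexed posts, and `author_of(12)` needs a second lookup because the author is referenced rather than embedded.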

With a “document-oriented database” you do not design to your queries; you design to your application.

Although I disagree with your feelings on “document-oriented database” design and queries, I do have to agree on the hype of “document-oriented databases”. It is not that they have issues and do not work, or that they are not made for the job, or that they are better or worse than relational databases. But when choosing one, be careful and weigh all of your options; most importantly, find out how it will benefit your application and how it will hinder it.

If anyone is interested in “document-oriented databases”, check out MongoDB’s website. They have an interactive shell with a tutorial so you can play with it without having to download and install it.
http://www.mongodb.org

 

Psst, your agenda is showing. Perhaps you should remove your bio from the end of this post.

“There were also a few high-profile walk-backs and failures associated with NoSQL adoption in 2010.”

Why do you mention Digg here and then admit the real problem was “bad business decisions”? Why not mention one that was actually due to NoSQL? While you’re at it, to be fair and balanced, you should probably mention failures in 2010 with relational db’s as well.

“It’s driven by the queries you need to support. Either you define your schema up front, or you define your usage of data up front,…”

Not true. This applies only to certain NoSQL db’s. All db types have their shortcomings somewhere. It’s not accurate to take a drawback (actually, it’s a tradeoff for crazy speed) of a few and stamp them all with it. The fact that you even present it this way makes me question how much you really know about NoSQL.

“…or else you set yourself up for a lot of sub-optimal queries and laborious database refactoring.”

And relational db’s are immune from this? I think not.

“Whereas nonrelational data modeling has no formal rules.”

If it’s formal rules you want, should you be writing articles for a site whose language of focus has a type system that is dynamic and weak?

 

[…] NoSQL Hype Curve is Bending The hype of 2010 was clearly NoSQL…. [full post] Bill Karwin php|architect – The site for PHP professionals opiniondatabasenosql […]

 

Thanks for all the sincere comments. All I can say is some of you have misjudged my message. I don’t have any objection to NoSQL technology — I have an objection to NoSQL marketing. (I respect Brad’s comment that “NoSQL technology” is kind of meaningless, but NoSQL *is* now a marketing term.)

And I reiterate that one cannot claim that “non-relational is better for CMS data.” Non-relational can be better for some queries, but not for data itself.

 

Interesting back and forth… I don’t have an ax to grind one way or the other. I plan to use both SQL and NoSQL technologies going forward. I doubt very seriously that most startups accurately get their sweet spot (where they are making the most money) right the very first time, that’s why the word ‘pivot’ is so heavily used around start-ups.

Regardless of the schema, methodology, databases, SQL product (MariaDB, MySQL, PostgreSQL, Oracle, DB2, ad nauseam), NoSQL product, CMS (over 500 of those…) etc…

Eventually someone is going to realize their ‘system’ is not giving the user the experience that they designed it to give and want, therefore will ultimately design ‘views’ that optimize the speed at which information is presented to the end user. Probably throw in an S-Load of caching, routers, additional servers, additional data storages (slaves | sharding) and more.

And good luck trying to get a smart business owner to pay for an excessive amount of up front “what-ifing” especially if a startup.

As to why some websites fail, IMO, it’s because end user bandwidth is throttled, restricted, reduced, limited so severely, especially by 100% of Cable Internet providers, that the real problem is the lack of upstream bandwidth.

You put too much front-end, CSS, JavaScript code, cookies, flash, etc stuff on a page (or all your websites pages) with upstream bandwidth throttled to less than 20Kb and you are going to have problems…or rather your users are going to have problems leaving you wondering why they are not returning.

Ask yourself if your testing group (you if you work alone) actually tests the user experience with a throttled bandwidth of less than 20Kb. Personally I am not aware of this type of testing getting funded, but sincerely hope that it is.

In my personal experience with Digg, stumbleupon, mashable, Yahoo, and a few others…the problem has always been my upstream bandwidth. (DD-WRT, OpenWRT and Tomato enabled hardware (firewall-routers) lets you see your actual bandwidth in real time 24 X 7, a great eyeopener, heck you turn your $30 -$200 firewall/router into a commercial strength solution costing up to $3000 for FREE… but that is another story)

We might pay $60 – $100 per month for 20Mb/2Mb, but I NEVER see more than 300Kb upstream except during the SpeedTest (what a farce they are). I rarely see even 300Kb upstream. In fact 98% of the time I see less than 110Kb upstream; less than 30Kb 80% of the time. You would be shocked to know how often your cable provider restricts your upstream bandwidth to less than 20Kb…it’s awful.

And since they lobby at the rate of $1.5 Million per week to keep their oligopoly, it’s unlikely to change anytime in the future. Probably be cheaper for them just to roll out the Fiber, but that might mean lowering their monthly prices…

Ask yourself if your frontend scripting will work decently with only 20Kb of upstream bandwidth. I already know the answer. The new twitter literally slowed to a crawl after their most recent upgrade, and while there has been a little improvement, it’s still not as fast as the old twitter…not really sure why, just know it’s been my experience. End result, I use it much less frequently than I used to. I stopped going to Digg and stumbleupon for the same reason and avoid mashable unless I am really interested in a topic. My guess is that this was Digg’s real problem, though like Herman Cain, I do not have any facts to back up my assertion.

My ultimate solution is to move to one of the less than 30 communities offering bi-directional synchronous Fiber To The Home (FTTH) (http://is.gd/HCi80q) in the USA. There are no other options, sadly. There you can get 10Mb/10Mb for less than what most of us pay our Cable providers, and that entire 10Mb upstream will be available to you 24 X 7…thus the usage of both bidirectional and synchronous. I seriously doubt that I would ever exceed that 10Mb upstream, however if I should, those FTTH providers will let me buy more, say 15Mb/15Mb; 30Mb/30Mb; 60Mb/60Mb; 100Mb/100Mb; 1Gb/1Gb. One thing is for sure, the Cable Internet providers will never be able to do that even if they wanted to. Which they do not. Note even FIOS (20Mb/5Mb) while using Fiber, throttles their upstream bandwidth to something less than 5Mb, because it’s not FTTH!

Ask yourself if a provider restricts your bandwidth to below the FCC definition of 768Kb for broadband…even for a second…should they be allowed to call their service “Broadband”? I think not.

But I digress, let’s get back on topic…

After I move, I look forward to enjoying what most of you are enjoying, websites with heavily scripted front ends. Until then, I know there is always another source for anything that I am looking for online and I will just move along. No amount of backend caching memory, hardware or algorithms is going to fix this…just a move to a FTTH community. LOL, Hey Cable providers, you can fix stupid. I know I will not purchase another house without bi-directional synchronous FTTH…the point is that it’s no longer enough just to be a Fiber network.

I honestly think the conversation between SQL vs NoSQL is highly overrated, they both offer tools to intuitive developers and just like an orchestra, will always work well together…still an interesting back and forth and a good read thank you for posting.

 

Bill, lol, The URL is http://is.gd/HCi80q and there might be an extra semicolon, hopefully people will just copy/paste the URL into their browsers….I don’t make any money on this, just trying to help others as I started looking for a solution for me.

Funny, I know I centered the website and the URL in is.gd to the center of the USA, but it opened for me in the middle of the ocean, probably a great analogy for US Cable providers and where they should be sent to fish for customers until they fix their junk…too funny. Even Japan had 100Mb/100Mb in 2000, it’s 2011 and soon to be 2012…Kansas City is sooooo fortunate as will be 4 other cities in the next year or two thanks to Google.

 

At least with Cassandra, if you need to ask different questions of your data, you can add additional column families and then backpopulate those from the original data using Hadoop. Also it has some rudimentary secondary indexes now.
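The backpopulation idea can be sketched in a few lines of plain Python. This uses neither Cassandra nor Hadoop; the data layout and the function name `backpopulate_by_day` are assumptions for illustration. One pass over the original rows builds a second table keyed for the new question.

```python
# Hypothetical original "column family": row key is the user, columns are events.
events_by_user = {
    "alice": [("2010-12-01", "login"), ("2010-12-15", "post")],
    "bob": [("2010-12-15", "login")],
}

def backpopulate_by_day(source):
    """One pass over the original rows builds a table keyed by day instead."""
    by_day = {}
    for user, events in source.items():
        for day, action in events:
            by_day.setdefault(day, []).append((user, action))
    return by_day

# The new question ("what happened on a given day?") is now a key lookup.
events_by_day = backpopulate_by_day(events_by_user)
```

At real data volumes that single pass is the expensive part, which is presumably why a batch framework is used for it.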

 
