Merce

NoSQL databases

It ain't no SQL

  • Our enterprise software teams often face resistance, sometimes rejection, whenever they propose the use of any database other than well-known relational databases. We suspect it is the job of enterprise IT managers to be conservative in the interests of risk aversion (they call it stability). This has not stopped us sailing into more interesting waters. In most cases, this has been smooth sailing, something which our more risk-averse customers find disconcerting.

  • A hash of things

    Our earliest contact with NoSQL databases was in our initial years with Unix, before the buzzword "noSQL" had been coined. We learned the power of the Unix dbm files, and then learned about the GNU gdbm equivalent on Linux. In parallel, we tasted the power of in-memory hash-maps as native data types for the first time with Perl. We realised that large datasets could be trivially extended to disk storage using the Perl tie facility. We later switched from gdbm to Berkeley DB files and gained a greater variety of ISAM file types. This became an essential part of our Perl Swiss Army knife.
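
    The idea of a hash table that lives on disk can be sketched in a few lines. This is only an illustration, assuming Python's standard dbm.dumb module in place of the Perl tie facility and gdbm files described above; the file name and the helper functions are hypothetical.

```python
import dbm.dumb
import os
import tempfile

def count_words(path, words):
    """Count word occurrences in a hash table that lives on disk."""
    # "c" opens the database for reading and writing, creating it if needed.
    with dbm.dumb.open(path, "c") as db:
        for word in words:
            key = word.encode("utf-8")
            # Keys and values are plain bytes; the application decides
            # that the value is a decimal integer.
            current = int(db.get(key, b"0"))
            db[key] = str(current + 1).encode("utf-8")

def read_count(path, word):
    """Look up one counter; a missing key counts as zero."""
    with dbm.dumb.open(path, "r") as db:
        return int(db.get(word.encode("utf-8"), b"0"))

# Hypothetical location for the database files.
db_path = os.path.join(tempfile.mkdtemp(), "wordcount")

count_words(db_path, ["unix", "perl", "unix"])
count_words(db_path, ["unix"])       # a later run sees the earlier counts
print(read_count(db_path, "unix"))   # 3
```

    The second call opens the same files again, so the counts survive between runs in the way a tied hash would.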

  • Why NoSQL

    The power of persistent hash tables (e.g. gdbm files) or ISAM files comes from the fact that they are less feature-rich than an RDB. There is no schema other than that which the application imposes. There is no interpretation or translation of data types, and there are no constraints or referential integrity checks. There are no complex queries or query optimisers. This makes basic inserts, replacements, and lookups so much faster than a relational database on the same hardware that inexperienced programmers (and their equally inexperienced managers) simply cannot grok it. One example of this speed was the incident cited in our distributed applications page, where another team had benchmarked a task at 12 months but we could finish it in about three weeks. We used the power of distributed processing there, but even on a single computer ISAM files were about four times as fast as a relational database. In some other use-cases, speedups can be 10 times or more.
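
    What "no schema other than that which the application imposes" means in practice can be sketched as follows. This is an assumption-laden illustration using Python's dbm.dumb and struct modules; the record layout, the SKU keys, and the helper names are hypothetical.

```python
import dbm.dumb
import os
import struct
import tempfile

# Hypothetical record layout chosen by the application, not by the store:
# a 4-byte unsigned quantity followed by an 8-byte float price.
RECORD = struct.Struct("<Id")

def put_item(db, sku, quantity, price):
    """Insert or replace a record; the store only ever sees bytes."""
    db[sku.encode("utf-8")] = RECORD.pack(quantity, price)

def get_item(db, sku):
    """Look up a record and decode it using the application's layout."""
    raw = db.get(sku.encode("utf-8"))
    if raw is None:
        return None
    quantity, price = RECORD.unpack(raw)
    return {"quantity": quantity, "price": price}

path = os.path.join(tempfile.mkdtemp(), "stock")
with dbm.dumb.open(path, "c") as db:
    put_item(db, "A-100", 12, 4.5)
    put_item(db, "A-100", 11, 4.5)   # a replacement is just another store
    db[b"junk"] = b"oops"            # nothing in the store checks the value
    item = get_item(db, "A-100")

print(item)   # {'quantity': 11, 'price': 4.5}
```

    The store accepts the malformed "junk" value without complaint. Any checking has to happen in the application, which is where both the speed and the responsibility come from.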

  • Modern NoSQL

    For a long time, the only NoSQL databases we encountered were object databases and ISAM files. Since 2010, we have worked with Lucene and Solr, and have deployed Solr-based solutions in production environments. We have also begun to explore CouchDB and MongoDB. These are very interesting options.

    One of the most important positive features of the RDB is its hard guarantee of serialisability of transactions. Most software designers have taken such guarantees for granted and have designed around relational databases without evaluating whether these powerful properties are actually needed. For instance, we use an LDAP server as a database, without getting any guarantee of serialisability of updates. A database of usernames and passwords rarely needs these guarantees. Once the designer is willing to let go of these properties, NoSQL databases open up a new world of possibilities. CouchDB makes distributed databases so easy that some problems which would have needed enormous quantities of code on top of relational databases now become trivially simple. Similarly, the power of Solr, with its full-text search and dynamic fields in records, is a culture shock for architects brought up on the relational school of thought. A Solr database can have a fixed set of fields and a potentially unlimited set of optional fields whose field names can be defined on the fly, record by record. There is no need for an ALTER TABLE ... ADD COLUMN. Each record can have a different set of fields. Many NoSQL databases can do similar tricks.
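
    The idea of a fixed set of fields plus optional fields that vary record by record can be sketched in plain Python. This toy in-memory store is not the Solr API; the class, the required field names, and the sample records are all hypothetical.

```python
# Hypothetical fixed part that every record must carry.
REQUIRED_FIELDS = {"id", "title"}

class RecordStore:
    """Toy store: fixed fields are mandatory, any other field is optional."""

    def __init__(self):
        self._records = {}

    def add(self, record):
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(f"missing required fields: {sorted(missing)}")
        # Optional fields are accepted as they come, with no schema change.
        self._records[record["id"]] = dict(record)

    def find(self, field, value):
        """Return the ids of records that carry `field` with this value."""
        return sorted(
            rid for rid, rec in self._records.items()
            if field in rec and rec[field] == value
        )

    def field_names(self):
        """Every field name seen so far, across all records."""
        names = set()
        for rec in self._records.values():
            names.update(rec)
        return sorted(names)

store = RecordStore()
store.add({"id": "1", "title": "Invoice", "amount_due": 120})
store.add({"id": "2", "title": "Photo", "camera": "X100", "iso": 400})

print(store.find("camera", "X100"))   # ['2']
print(store.field_names())            # ['amount_due', 'camera', 'id', 'iso', 'title']
```

    The second record introduces two field names that the first record never had, and no table definition had to change to allow it.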

    It is hard to put labels on some of these systems. One of the most popular data stores to challenge conventional thinking about databases is memcached. This system maintains a dataset of key-value pairs in memory and allows clients to get and set those pairs over the network. That is the limit of its data-manipulation feature set. Any SQL database with a two-column table and a network interface has been offering these features for the last four decades. However, memcached has taken up such a powerful place in scalable Web applications that it has defined its own solution category. This underscores the growing refusal to pick up a hammer and treat every problem as a nail.
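
    The remark about a two-column table can be made concrete with a short sketch. It assumes Python's standard sqlite3 module and an in-memory database, so it has no network interface and is not a memcached client; the class, table, and key names are hypothetical.

```python
import sqlite3

class TwoColumnStore:
    """Get and set key-value pairs on top of a two-column SQL table."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")

    def set(self, key, value):
        # INSERT OR REPLACE overwrites any existing row with the same key.
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value)
        )
        self._conn.commit()

    def get(self, key):
        row = self._conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        return row[0] if row else None

store = TwoColumnStore()
store.set("session:42", b"alice")
store.set("session:42", b"bob")    # a second set replaces the first value
print(store.get("session:42"))     # b'bob'
print(store.get("session:99"))     # None
```

    Functionally this covers get and set. What it does not capture is the operational simplicity and speed that earned memcached its own category.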

  • NoSQL databases are the most exciting thing to have happened to data management since Mr E F Codd's ideas. These are exciting times for software professionals and architects to design systems in, provided you are not strait-jacketed by enterprise-class risk aversion.

RELATED READING

  • dbm: Wikipedia

    Wikipedia page on dbm and descendants, including Berkeley DB

  • Lucene and Solr

    Open source document database with super-fast full-text indexing and search

  • CouchDB and MongoDB

    Two popular distributed document database systems

  • memcached

    Open source in-memory cache of key-value pairs for small chunks of arbitrary data