Cosmos DB in 5 Minutes

Recently I have had to dive relatively deep into Cosmos, and based on the questions I have been getting, a blog post seemed like a good opportunity to explain some stuff. As with everything I write in this blog, it’s a stream of consciousness with references.

What is a JSON document and how does it relate to something I know?
Starting from a standard JSON document, here is how a JSON document store lines up against a relational database:

Document – Table Row
Properties – Table Columns
Collection – Table
JSON is Denormalized – Tables are Normalized
JSON is Schemaless(ish) – Tables have an enforced schema
JSON has No RI – Tables have Referential Integrity
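
To make the mapping concrete, here is a minimal Python sketch. The order document and every field name in it are made up for illustration; the point is only that one denormalized document carries what a normalized schema would spread over a parent row and several child rows.

```python
import json

# Hypothetical order document: everything about the order lives in one place.
order_doc = {
    "id": "order-1001",
    "customer": {"name": "Ada", "city": "London"},
    "lines": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}


def to_relational_rows(doc):
    """Split one denormalized document into the rows a normalized schema would hold."""
    order_row = {
        "order_id": doc["id"],
        "customer_name": doc["customer"]["name"],
        "customer_city": doc["customer"]["city"],
    }
    # Each nested line becomes a child row pointing back at the order.
    line_rows = [
        {"order_id": doc["id"], "sku": line["sku"], "qty": line["qty"]}
        for line in doc["lines"]
    ]
    return order_row, line_rows


order_row, line_rows = to_relational_rows(order_doc)
print(json.dumps(order_row))
print(len(line_rows), "order line rows")
```

Nothing in the document store enforces that the child rows exist or match, which is the "No RI" row in the list above.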

NoSQL JSON Store

I use the phrase JSON dumping ground, but that is not really accurate. It is a JSON document store, and no other format of data is supported. While it does support the SQL API, Table API, MongoDB, Cassandra and Gremlin, everything is stored as JSON. This does not mean that the APIs are interoperable with each other. For instance, if you wake up one day and want to use Graph on your standard SQL API stored JSON data, you may have a problem if your JSON is not already compliant with the GraphSON format. That means your JSON needs vertices, edges, and properties, because this is how the graph relationships are formed. In a case I am looking at right now, I will need to take the data that I do have, determine the common relationships, and create a new collection.

What else is there? Well, generically in the NoSQL space: MongoDB, Couchbase, Neo4j, Google Bigtable and a few others. You can read the link to learn more…

Stored Procedures, Triggers and UDFs are not what you think

My first question when I teach this section is, “What language do you write your stored procedures in?” Not surprisingly, T-SQL comes back, because I typically draw a data audience. This is a new paradigm. Stored procedures, triggers and UDFs are all written in JavaScript; there is no other option. That makes this a very developer-centric data platform, and there is not much for the standard data pro to do in Cosmos beyond turn it on and go away. Most importantly, if you are looking for ACID transactions within Cosmos, they are done implicitly through stored procs.

On the other hand, stored procedures behave the way you think: they take parameters and execute. But they are scoped to the collection, and within it to the logical partition key they are executed against.

There is NO Relational Join

While the ANSI SQL JOIN keyword is supported, it is only designed to join a JSON doc to itself, not to another JSON doc. Nothing more I can say about that…
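
For anyone who wants to see what "join a doc to itself" means, here is a small Python sketch that mimics the shape of a query such as `SELECT c.id, t FROM c JOIN t IN c.tags`. The documents and the `tags` property are hypothetical, and this is plain Python running in memory, not a call into Cosmos.

```python
def intra_document_join(docs, array_property):
    """Pair every document with each element of one of its OWN arrays.

    Nothing here ever looks at a second document, which is the whole point.
    """
    for doc in docs:
        for element in doc.get(array_property, []):
            yield {"id": doc["id"], "t": element}


# Hypothetical documents; "tags" is an array nested inside each one.
docs = [
    {"id": "1", "tags": ["red", "blue"]},
    {"id": "2", "tags": ["green"]},
    {"id": "3", "tags": []},
]

rows = list(intra_document_join(docs, "tags"))
for row in rows:
    print(row)
```

A document with an empty array contributes no rows at all, much like an inner join against an empty child table.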

How to get my data into Cosmos

Getting data in is actually relatively easy. You can write an app using the Azure Cosmos DB .NET SDK, or use Azure Data Factory, the Data Migration Tool for Cosmos, or Azure Logic Apps (but don’t forget to manually add an ID to the Cosmos sink).
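
On the ID point, a minimal sketch of what I mean, in Python. The helper name `ensure_id` is my own and not part of any SDK or tool; it just makes sure each document carries a string `id` before whatever loader you use writes it.

```python
import uuid


def ensure_id(doc):
    """Return a copy of doc with a string 'id', generating a GUID when missing."""
    prepared = dict(doc)  # shallow copy so the caller's dict is left alone
    if prepared.get("id") in (None, ""):
        prepared["id"] = str(uuid.uuid4())
    else:
        prepared["id"] = str(prepared["id"])
    return prepared


source_rows = [{"name": "widget"}, {"id": 7, "name": "gadget"}]
prepared_rows = [ensure_id(row) for row in source_rows]
print(prepared_rows)
```

If your source already has a natural key, mapping that to `id` is usually a better idea than a random GUID, since it makes point lookups possible later.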

Partitioning

Partitioning is a religious and political argument rolled into one. Unless you have made a career out of understanding how each database platform performs at scale with specific data, you will need to test every variation based on the queries coming in, especially with Cosmos.

Cosmos partitioning is not like SQL Server, mostly because the stakes are different: getting this wrong will have huge performance implications, and cost will rack up fast. In some cases, it has been recommended to create multiple collections with the exact same data partitioned differently, not only to solve performance problems but also to reduce cost. Don’t assume a single collection can respond to all queries. If you need more than one, you use the change feed to keep them in sync. Be careful though, you have now entered a world where there may no longer be one source of truth if things go wacky…
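
Here is a toy Python model of the "same data, partitioned two ways" idea. The real change feed is read through the SDK or a hosted processor; in this sketch it is just a list of changed documents in order, and the property names `customerId` and `region` are invented.

```python
def apply_changes(target, changes, partition_key):
    """Upsert each changed document into target, keyed by (partition value, id)."""
    for doc in changes:
        target[(doc[partition_key], doc["id"])] = doc
    return target


# Two in-memory stand-ins for collections holding the same documents.
by_customer = {}
by_region = {}

# Stand-in for a change feed: inserts and updates, oldest first.
changes = [
    {"id": "1", "customerId": "c1", "region": "east", "total": 10},
    {"id": "2", "customerId": "c2", "region": "east", "total": 25},
    {"id": "1", "customerId": "c1", "region": "east", "total": 12},  # later update to doc 1
]

apply_changes(by_customer, changes, "customerId")
apply_changes(by_region, changes, "region")
print(len(by_customer), len(by_region))
```

If one of the two `apply_changes` calls fails or falls behind, the copies disagree, and that is the "no longer one source of truth" problem in miniature.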

Logical partitions are capped at a maximum of 10 GB and are determined by the partition key you have chosen. If you reach 10 GB it is your problem to solve; Cosmos will throw errors to indicate the logical partition is full.
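
One way to sanity check a candidate partition key before you commit to it is to total up document sizes per key value on a sample of your data. This is a rough Python sketch: the serialized JSON length is only an approximation of what Cosmos actually stores, the `tenant` property is hypothetical, and the cap constant is simply the 10 GB figure quoted above.

```python
import json
from collections import defaultdict

LOGICAL_PARTITION_CAP_BYTES = 10 * 1024**3  # the 10 GB cap quoted in this post


def logical_partition_sizes(docs, partition_key):
    """Approximate bytes per logical partition using serialized JSON length."""
    sizes = defaultdict(int)
    for doc in docs:
        sizes[doc[partition_key]] += len(json.dumps(doc).encode("utf-8"))
    return dict(sizes)


def fill_ratio(size_bytes, cap_bytes=LOGICAL_PARTITION_CAP_BYTES):
    """Fraction of the logical partition cap already used."""
    return size_bytes / cap_bytes


sample = [
    {"id": "1", "tenant": "a", "payload": "x" * 100},
    {"id": "2", "tenant": "a", "payload": "y" * 100},
    {"id": "3", "tenant": "b", "payload": "z" * 10},
]
sizes = logical_partition_sizes(sample, "tenant")
for key, size in sorted(sizes.items()):
    print(key, size, fill_ratio(size))
```

A key whose largest value dwarfs the rest is a warning sign for both the size cap and hot partitions.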

Physical partitions are a bit trickier. You have no control over these, but in an odd way, the larger the collection and the larger the logical partition, the more they can affect performance. Many logical partitions can be inside one physical partition, and all the RUs are allocated to that one physical partition. Let’s imagine you have 2 logical partitions, which will start off in one physical partition, and both logical partitions grow and approach 5 GB, which means that the cumulative size of the physical partition is approaching 10 GB. This is the point at which Cosmos will split the physical partition into 2 physical partitions, one for each logical, and when this happens it will also split the RUs. Um, what? If you have a 1000 RU collection and 2 physical partitions, these RUs will be divided evenly among the physical partitions. This means that you will have 500 RUs allocated to each physical partition, and as you can imagine this could cause a performance problem if you have a hot partition.
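
The arithmetic is trivial, but it helps to see the hot partition case worked out. A minimal Python sketch, assuming the even split described above; the 800 RU demand and the 80/20 traffic split are numbers I made up.

```python
def rus_per_physical_partition(provisioned_rus, physical_partitions):
    """Even split of provisioned RUs across physical partitions."""
    if physical_partitions < 1:
        raise ValueError("need at least one physical partition")
    return provisioned_rus / physical_partitions


def over_budget(demand_rus, traffic_share, budget_rus):
    """True when one partition's share of the demand exceeds its RU budget."""
    return demand_rus * traffic_share > budget_rus


budget = rus_per_physical_partition(1000, 2)
print(budget)

# Hypothetical workload: 800 RUs of demand, 80% of it hitting one partition.
hot = over_budget(800, 0.8, budget)
cold = over_budget(800, 0.2, budget)
print(hot, cold)
```

So a workload that fits comfortably inside 1000 RUs in total can still run one partition out of budget once the RUs are split.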

Developer Centric Database

There is no Management Studio and there is no Profiler trace. In the portal you can create the database and the collection, create the JavaScript stored procs, UDFs and triggers, run data queries and execute stored procs, and get a cursory, 30,000-foot view of metrics over the last hours or days, but that is pretty much it.

All of the query metrics are provided by the SDK on a request-by-request basis in the app, so if the developer swallows these you will have no way of knowing what is going on. That being said, the developer has every opportunity to determine exactly how much impact and cost each request has on the system, right down to the RUs per request. The dev will know what a fan-out query costs vs. a point lookup. Perhaps this will promote better code in the long run…
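
As a sketch of "not swallowing" the metrics: every response carries its RU cost in the `x-ms-request-charge` header, and the SDKs surface that per request. The Python below only shows the bookkeeping side with hand-written header dictionaries; the `ChargeLog` class and the sample charges are hypothetical, and how you get at the headers depends on the SDK you use.

```python
from collections import defaultdict


def request_charge(response_headers):
    """Read the RU charge from a response-header mapping, or None if absent."""
    raw = response_headers.get("x-ms-request-charge")
    return float(raw) if raw is not None else None


class ChargeLog:
    """Tally RU charges per operation name so they do not get thrown away."""

    def __init__(self):
        self.totals = defaultdict(float)

    def record(self, operation, response_headers):
        charge = request_charge(response_headers)
        if charge is not None:
            self.totals[operation] += charge
        return charge


log = ChargeLog()
# Hypothetical headers as they might look after two requests.
log.record("point_lookup", {"x-ms-request-charge": "1.0"})
log.record("fan_out_query", {"x-ms-request-charge": "24.5"})
print(dict(log.totals))
```

Shipping these totals to whatever logging you already have is what gives the data side of the house any visibility at all.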

RUs

Everyone wants to discuss this because it is the least understood and, if you get it wrong, the most impactful. While it’s not really difficult, the information you need to figure out what your document will cost is difficult to put together.

So what is it? An RU (request unit) is a measure of cost and performance relative to a 1 KB JSON document. Sticking to the documentation, I will follow their guidance.

Well, that’s all great, but what does that mean for my document? There is a document calculator that is still under an old URL, but it is for Cosmos. I do happen to know that as of January it is supported by the product group and that they are looking to improve it, so it may move.

So you have an RU number, now what? Go to the Azure Pricing Calculator, select Azure Cosmos DB, scroll down and look for RUs Reserved. Notice that the number is in 100s of RUs, so if the number in the box is 4, that means 400. I wanted to see what 30,000 would cost per month, so I entered 300, which came to about $1,700 per month.
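
The units-of-100 thing trips people up, so here it is as a few lines of Python. The per-unit price is not an official rate; it is simply backed out of my own 300 units for about $1,700 figure, so treat it as a placeholder and plug in whatever the calculator shows you today.

```python
def calculator_units(rus):
    """Convert an RU figure into the 100-RU units the calculator box expects."""
    if rus <= 0 or rus % 100 != 0:
        raise ValueError("RUs must be a positive multiple of 100")
    return rus // 100


def monthly_cost(rus, price_per_unit):
    """Estimated monthly cost given a price per 100-RU unit."""
    return calculator_units(rus) * price_per_unit


# Assumption: implied by the ~$1,700 for 300 units quoted in this post.
IMPLIED_PRICE_PER_UNIT = 1700 / 300

print(calculator_units(400))
print(calculator_units(30000))
print(round(monthly_cost(30000, IMPLIED_PRICE_PER_UNIT)))
```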

Now, what to do? Run a POC for as long as you possibly can. The stakes are higher because a screwed-up partition key means a collection recreation, not a 2-minute table update…

SLA

SLAs for Azure are actually pretty cool, but you have to understand what they are delivering.

In short, SLAs are financially guaranteed availability. What’s more interesting to consider is the read/write SLA: reads are <10ms at the 99th percentile, which means that 99% of the reads return in under 10ms. I have heard from some pretty connected people in the PG that the real number is closer to 5ms and the 50th percentile number is 2ms. To a tier 1 SQL person, a 10ms read is a crisis, but statistically in Cosmos a 10ms read is an outlier; most are much faster. But be cognizant of where your app, clients and database actually reside relative to each other: if your app is in East US and your Cosmos DB is in Australia, you will have about a 200ms wait…
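
If percentiles are not something you work with every day, this Python sketch shows how a 99th percentile number can look great while a single read is still awful. The latency samples are invented, and the nearest-rank method used here is just one common way to compute a percentile.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 100 hypothetical read latencies in milliseconds, with one slow outlier.
latencies_ms = [2] * 60 + [4] * 38 + [9, 40]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50, p99, max(latencies_ms))
```

Here the slowest read sits outside the 99th percentile, so it does not count against a <10ms at P99 target even though somebody waited for it.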

Global scale

This is so easy and boring it’s just not worth talking about: go click an Azure region on a map and you are scaled out… You will need to deal with some app changes, locations and some Traffic Manager, and decide on conflict resolution, but it’s super simple compared to trying to find space in a datacenter in another country on your own.

Consistency

There are 5 consistency models in Cosmos that can be set in the portal or by the app on connection. Spend some time on each. Once you have created a Cosmos DB account, the portal has a great graphic under Consistency to help you understand the differences between each.
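
For reference, the five levels from strongest to weakest are Strong, Bounded Staleness, Session, Consistent Prefix and Eventual. The Python sketch below just encodes that ordering. The `can_request` helper is my own illustration of the rule, as I understand it from the documentation, that an app can ask for something weaker than the account default but not something stronger; check the current docs before relying on it.

```python
from enum import IntEnum


class Consistency(IntEnum):
    """The five consistency levels; a higher value means a stronger guarantee."""
    EVENTUAL = 1
    CONSISTENT_PREFIX = 2
    SESSION = 3
    BOUNDED_STALENESS = 4
    STRONG = 5


def can_request(account_default, requested):
    """Assumed rule: a client may relax the account default but not strengthen it."""
    return requested <= account_default


strongest_first = sorted(Consistency, reverse=True)
print([level.name for level in strongest_first])
print(can_request(Consistency.SESSION, Consistency.EVENTUAL))
print(can_request(Consistency.SESSION, Consistency.STRONG))
```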
