Recently I have had to dive relatively deep into Cosmos and, based on the questions I have been getting, a blog seemed like a good opportunity to explain some of it. As with everything I write in this blog, it's a stream of consciousness with references.
What is a JSON document and how does it relate to something I know?
Based on a standard JSON document, here is how a JSON document database maps onto relational concepts:
Document – Table Row
Property – Table Column
Collection – Table
JSON is denormalized – Tables are normalized
JSON is schemaless(ish) – Tables have an enforced schema
JSON has no RI – Tables have Referential Integrity
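To make the mapping concrete, here is a small sketch with a hypothetical order document: one denormalized document carries what a relational design would spread across several normalized tables.

```python
import json

# A single denormalized document: the order, its line items, and the
# shipping address all live together, where a relational design would
# normalize these into separate Orders, OrderLines, and Addresses tables.
order_doc = {
    "id": "order-1001",            # every Cosmos document needs an "id"
    "customerId": "cust-42",
    "shippingAddress": {           # embedded object, not a foreign-key lookup
        "city": "Seattle",
        "state": "WA",
    },
    "lines": [                     # embedded array, not a child table
        {"sku": "widget", "qty": 2, "price": 9.99},
        {"sku": "gadget", "qty": 1, "price": 24.50},
    ],
}

# No joins needed: everything the query wants is already in the document.
total = sum(line["qty"] * line["price"] for line in order_doc["lines"])
print(json.dumps(order_doc, indent=2))
print(f"order total: {total:.2f}")
```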
NoSQL JSON Store
I use the phrase JSON dumping ground, but that is not really accurate; it is a JSON document store, and no other format of data is supported. While it supports the SQL API, Table API, MongoDB, Cassandra, and Gremlin, everything is stored as JSON. This does not necessarily mean the APIs are interoperable with each other. For instance, if you wake up one day and want to use Graph on your standard SQL-API stored JSON data, you may have a problem if your JSON is not already compliant with the GraphSON format. That means your JSON has vertices, edges, and properties; this is how the graph relationships are formed. In a case I am looking at right now, I will need to take the data that I do have, determine the common relationships, and create a new Collection.
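A rough sketch of what that reshaping looks like. The source shape ("person" docs with an embedded employer reference) is hypothetical; the point is that implicit relationships in plain documents have to become explicit vertex and edge documents before Gremlin can traverse them.

```python
# Hypothetical plain SQL-API documents: the person->company relationship
# exists only as an embedded field, which a graph cannot traverse.
people = [
    {"id": "p1", "name": "Ada", "employer": "c1"},
    {"id": "p2", "name": "Grace", "employer": "c1"},
]
companies = [{"id": "c1", "name": "Initech"}]

# Split into vertex documents and edge documents (GraphSON-style shape).
vertices, edges = [], []
for doc in people:
    vertices.append({"id": doc["id"], "label": "person", "name": doc["name"]})
for doc in companies:
    vertices.append({"id": doc["id"], "label": "company", "name": doc["name"]})
for doc in people:
    # the implicit relationship becomes an explicit edge document
    edges.append({"label": "worksAt", "outV": doc["id"], "inV": doc["employer"]})

print(len(vertices), "vertices,", len(edges), "edges")
```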
What else is there? Well, generically in the NoSQL space: MongoDB, Couchbase, Neo4j, Google Bigtable, and a few others; you can read the link to learn more…
Stored Procedures, Triggers and UDFs are not what you think
On the other hand, stored procedures behave the way you think: they take parameters and execute, but they are scoped to the collection and can only run within the scope of the logical partition key they are executed against.
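A toy illustration of that scoping rule (not the real SDK; Cosmos stored procedures are actually JavaScript running server-side): a procedure invoked with a partition key can only see documents in that one logical partition.

```python
# Toy model of partition-key scoping. The collection and field names
# are made up; the behavior being illustrated is that a stored procedure
# only ever sees documents sharing the partition key it ran against.
collection = [
    {"id": "1", "tenantId": "contoso", "total": 10},
    {"id": "2", "tenantId": "contoso", "total": 5},
    {"id": "3", "tenantId": "fabrikam", "total": 99},
]

def execute_sproc(partition_key, body):
    # the server hands the procedure only that partition's documents
    visible = [d for d in collection if d["tenantId"] == partition_key]
    return body(visible)

# Sum totals for one tenant; the fabrikam doc is simply not visible.
result = execute_sproc("contoso", lambda docs: sum(d["total"] for d in docs))
print(result)  # 15
```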
There is NO Relational Join
While the ANSI SQL JOIN keyword is supported, it is only designed to join a JSON doc to itself, not to another JSON doc. Nothing more I can say about that…
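Actually, one small illustration helps. A Cosmos SQL query like `SELECT c.id, t FROM c JOIN t IN c.tags` flattens each document against its own nested array. The Python sketch below mimics those semantics with made-up sample documents:

```python
# What JOIN means in Cosmos SQL: an intra-document flatten, roughly
#   SELECT c.id, t FROM c JOIN t IN c.tags
# Each document is joined to elements of its OWN nested array,
# never to another document. Sample docs are hypothetical.
docs = [
    {"id": "a", "tags": ["red", "blue"]},
    {"id": "b", "tags": ["green"]},
    {"id": "c", "tags": []},          # empty array -> contributes no rows
]

rows = [(d["id"], t) for d in docs for t in d["tags"]]
print(rows)  # [('a', 'red'), ('a', 'blue'), ('b', 'green')]
```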
How to get my data into Cosmos
Getting data in is actually relatively easy. You can write an app using the Azure Cosmos DB .NET SDK, use Azure Data Factory, use the Data Migration Tool for Cosmos, or use Azure Logic Apps (but don't forget to manually add an id to the Cosmos sink).
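On that last parenthetical: every document bound for Cosmos needs an "id" string, and pipelines don't always supply one. A minimal sketch of stamping one on before the write (the payload shape and helper name are my own):

```python
import uuid

# Hypothetical helper: ensure a document has the "id" string Cosmos
# requires before it reaches the sink. setdefault leaves an existing
# id untouched and generates one only when it is missing.
def prepare_for_cosmos(doc: dict) -> dict:
    doc.setdefault("id", str(uuid.uuid4()))
    return doc

doc = prepare_for_cosmos({"orderTotal": 44.48})
print("id present:", "id" in doc)
```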
Partitioning is a religious and political argument rolled into one. Unless you have made a career out of understanding how each database platform performs at scale with specific data, you will need to test every variation based on the queries coming in, especially with Cosmos.
Cosmos partitioning is not like SQL Server, mostly because the stakes are different: getting this wrong causes huge performance problems and cost that racks up fast. In some cases, it has been recommended to create multiple collections with the exact same data partitioned differently, to not only solve performance problems but reduce cost. Don't assume a single collection can respond to all queries; you may have to use the change feed to keep more than one collection in sync to satisfy them. Be careful though, you have now entered a world where there may no longer be one source of truth if things go wacky…
Logical partitions are capped at a maximum of 10 GB and are based on the partition key you have chosen. If you reach 10 GB, it is your problem to solve; Cosmos will throw errors to indicate the logical partition is full.
Physical partitions are a bit trickier. You have no control over these, but in an odd way, the larger the collection and the larger the logical partitions, the more they can affect performance. Many logical partitions can live inside one physical partition, and all the RUs are allocated to that one physical partition. Let's imagine you have 2 logical partitions, which start off in one physical partition. As both logical partitions grow and approach 5 GB each, the cumulative size of the physical partition approaches 10 GB. This is the point where Cosmos splits the physical partition into 2 physical partitions, one for each logical partition, and when this happens it also splits the RUs. Um, what? If you have a 1000 RU collection and 2 physical partitions, the RUs are divided evenly among the physical partitions. This means you will have 500 RUs allocated to each physical partition, which, as you can imagine, can cause a performance problem if you have a hot partition.
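The RU arithmetic in that scenario is worth staring at: throughput is divided evenly across physical partitions, so a split halves what any one (possibly hot) partition can draw.

```python
# Provisioned throughput is spread evenly across physical partitions,
# so a partition split halves the RUs available to a hot partition.
provisioned_rus = 1000

def rus_per_physical_partition(total_rus, physical_partitions):
    return total_rus / physical_partitions

before = rus_per_physical_partition(provisioned_rus, 1)  # one physical partition
after = rus_per_physical_partition(provisioned_rus, 2)   # after the split
print(before, "->", after)  # 1000.0 -> 500.0
```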
Developer Centric Database
All of the query metrics are available on a request-by-request basis via the SDK, so if the developer swallows them you will have no way of knowing what is going on. That being said, the developer has every opportunity to determine exactly how much impact and cost each request has on the system, right down to the RUs per request. The dev will know what a fan-out query costs vs. a point lookup. Perhaps this will promote better code in the long run…
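A sketch of surfacing that cost instead of swallowing it: Cosmos returns the RU charge of every request in the `x-ms-request-charge` response header. The fake responses and charge values below stand in for real SDK calls.

```python
# Stand-in for a real Cosmos response; only the header matters here.
class FakeResponse:
    def __init__(self, charge):
        self.headers = {"x-ms-request-charge": str(charge)}

def logged_charge(response, log):
    # pull the RU cost off the response instead of discarding it
    charge = float(response.headers["x-ms-request-charge"])
    log.append(charge)
    return charge

log = []
logged_charge(FakeResponse(2.83), log)   # a cheap point read (made-up value)
logged_charge(FakeResponse(47.2), log)   # a fan-out query costs far more
print(f"total RUs this session: {sum(log):.2f}")
```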
Everyone wants to discuss Request Units (RUs) because they are the least understood and, if you get them wrong, the most impactful. While it's not really difficult, the information you need to figure out what your document will cost is hard to put together.
So what is it? An RU is a normalized measure of cost and performance: a point read of a 1 KB JSON document costs 1 RU. Sticking to the documentation, I will follow their guidance;
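With that baseline you can do back-of-the-envelope sizing. The workload numbers below are made up, and the 5-RU write figure is a rough rule of thumb for a 1 KB write, not a guarantee:

```python
# Back-of-the-envelope throughput sizing from the 1 RU = one 1 KB
# point read baseline. Workload numbers are hypothetical.
point_reads_per_sec = 500      # 1 KB point reads
writes_per_sec = 100
ru_per_point_read = 1          # documented baseline for a 1 KB point read
ru_per_write = 5               # rough rule of thumb for a 1 KB write

needed_rus = (point_reads_per_sec * ru_per_point_read
              + writes_per_sec * ru_per_write)
print(needed_rus, "RU/s")  # 1000 RU/s
```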
Well, that's all great, but what does that mean for my document? There is a document calculator that still lives under an old URL, but it is for Cosmos. I happen to know that as of January it is supported by the product group and that they are looking to improve it, so it may move.
So you have an RU number, now what? Go to the Azure Pricing Calculator, select Azure Cosmos DB, scroll down, and look for RUs Reserved. Notice that the number is in hundreds of RUs, so if the number in the box is 4, that means 400. I wanted to see what 30,000 would cost per month, so I selected 300, which costs about $1,700 per month.
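Reproducing that calculator math by hand: provisioned throughput is billed in 100 RU/s units per hour. The hourly rate below is an assumption (roughly $0.008 per 100 RU/s for single-region provisioned throughput at the time of writing), so treat the result as a sketch, not a quote.

```python
# Rough monthly-cost math for provisioned throughput.
# ASSUMPTION: ~$0.008 per 100 RU/s per hour, single region.
rate_per_100rus_per_hour = 0.008
hours_per_month = 730          # the convention Azure pricing uses

def monthly_cost(rus):
    units = rus / 100          # the calculator box counts 100-RU units
    return units * rate_per_100rus_per_hour * hours_per_month

print(f"${monthly_cost(30_000):,.0f} per month")  # roughly $1,752
```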
Now, what to do? Run a POC for as long as you possibly can; the stakes are higher because an "I screwed up the partition key" mistake means recreating the collection, not a 2-minute table update…
SLAs for Azure are actually pretty cool, but you have to understand what they are delivering.
In short, SLAs are financially guaranteed availability. What's more interesting to consider is the read/write SLA: reads are <10ms at the 99th percentile, which means that 99% of reads return in under 10ms. I have heard from some pretty connected people in the PG that the real number is closer to 5ms and the 50th percentile number is 2ms. To a tier-1 SQL person a 10ms read is a critsit, but statistically in Cosmos a 10ms read is an outlier; most are much faster. But be cognizant of where your app, clients, and database actually reside relative to each other: if your app is in East US and your Cosmos DB is in Australia, you will have about a 200ms wait…
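If percentiles are not second nature, here is what "p99 < 10ms" means, illustrated with synthetic latencies (these numbers are random, not Cosmos measurements):

```python
import random

# Synthetic latency sample: 10,000 reads clustered around a low mean.
# This is an illustration of percentile math, not real Cosmos data.
random.seed(1)
latencies_ms = sorted(abs(random.gauss(2.5, 1.5)) for _ in range(10_000))

p50 = latencies_ms[len(latencies_ms) // 2]        # half the reads are faster
p99 = latencies_ms[int(len(latencies_ms) * 0.99)] # 99% of reads are faster
print(f"p50={p50:.1f}ms  p99={p99:.1f}ms")
```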
This is so easy and boring it's just not worth talking about: go click an Azure region on a map and you are scaled out… You will need to deal with some app changes, locations, and some Traffic Manager, and decide on conflict resolution, but it's super simple compared to trying to find space in a datacenter in another country on your own.
There are 5 consistency models in Cosmos that can be set in the portal or by the app on connection. Spend some time on each; once you have created a Cosmos DB account, the portal has a great graphic under consistency to help you understand the differences between each.
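For reference, the five levels from strongest to weakest. The per-client keyword shown in the comment is my best recollection of the Python SDK's parameter, so verify it against the SDK docs before relying on it.

```python
# The five Cosmos DB consistency levels, strongest to weakest.
# The SDK also lets you set this per client, e.g. (assumed keyword):
#   CosmosClient(url, credential, consistency_level="Session")
consistency_levels = [
    "Strong",
    "Bounded Staleness",
    "Session",            # the default for new accounts
    "Consistent Prefix",
    "Eventual",
]
print(len(consistency_levels), "levels:", ", ".join(consistency_levels))
```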