Schema Design

by Kevin Hanson, Solutions Architect, 10gen

Parallels

Tables == Collections Row == Document Column == Field Index == Index Join == Embedding & Linking Schema Object == None

The Big Question

Do we link or do we embed?

Blog posts and comments

Embedded

  • Faster
  • But large embeds can make the master document slow. Ex: If a post has a billion comments

Linked

  • Slower
  • Returning the master document requires extra logic

Each comment gets own doc

Comment gets its own copy of the master blog post

  • Fast but inverted
  • Great if you have gajillions of comments
  • Even more logic

Denormalization

  • Caching via memchached, redis, etc are functionally denormalized instances of data sets.
  • NoSQL means you cut out the middleman

More thoughts on denormalized data

  • Faster than normalized
  • More object-oriented
  • application level applications

Managing Arrays

  • Pussing to an array infinitely

    • Document will grow larger than Pre-allocated size
    • Document may increase max doc size of 16MB

Sometimes you have to limit size of an array

Logic idea:

first 200 comments are insert into the blog document
After that have a linked comment document

Schema decisions when sharding

  • Can we intelligently partition data?
  • Will this partitioning create hotspots?
  • Can our partitioning actually improve overall performance?

Bad shard key:

Sharding on "date" field and constantly inserting most recent data...

Good example:

sharding blog posts on "author"

Note

TODO find out why the Good example is actually good