Metadata

Metadata, courtesy of http://www.flickr.com/photos/sjcockell/6126442977

Metadata is data that describes and gives information about other data. A classic example is the information about a digital image file such as the ISO settings, resolution, pixel depth, exposure, camera settings, camera type and so forth.

The Metadata Schema is to handle the situation where the information we wish to store in a document has a varying number of fields but we still want to retain the ability to search efficiently on any field in the metadata. Let’s use the example of a digital image file.

Model

Let’s look at sample metadata for an image.

Field Value
File name img_1771.jpg
File size 32764 Bytes
MIME type image/jpeg
Image size 480 x 360
Camera make Canon
Camera model Canon PowerShot S40
Image timestamp 2003-12-14 12-01-44
Image number 117-1771
Exposure time 1500 s
Aperture F4.9
Exposure bias 0 EV
Flash No, auto

The first naive approach to modeling this might be to just translate the table directly into a corresponding document.

{
  "File name"       : "img_1771.jpg",
  "File size"       : "32764",
  "MIME type"       : "image/jpeg",
  "Image size"      : {"width": 480, "height": 360},
  "Camera make"     : "Canon",
  "Camera model"    : "Canon PowerShot S40",
  "Image timestamp" : ISODate("2014-01-01T10:01:00Z"),
  "Image number"    : "117-1771",
  "Exposure time"   : "1/500 s",
  "Aperture"        : "F4.9",
  "Exposure bias"   : "0 EV",
  "Flash"           : "No, auto"
}

One of the shortcomings of this model is the need for multiple indexes for multiple fields, increasing the time it takes to write documents and possibly hitting the max number of indexes available on a single collection. For each index you add to a collection it takes longer to insert new documents as all indexes have to potentially be updated.

Another possibility is to leverage the fact that one can index arrays of objects easily. Let’s modify the schema above to leverage this.

{
  "metadata": [
    {"key": "File Name", "value": "img_1771.jpg"},
    {"key": "File size", "value": 32764},
    {"key": "MIME type", "value": "image/jpeg"},
    {"key": "Image size", "value": {"width": 480, "height": 360}},
    {"key": "Camera make", "value": "Canon"},
    {"key": "Camera model", "value": "Canon PowerShot S40"},
    {"key": "Image timestamp", "value": ISODate("2014-01-01T10:01:00Z")},
    {"key": "Image number", "value": "117-1771"},
    {"key": "Exposure time", "value": "1/500 s"},
    {"key": "Aperture", "value": "F4.9"},
    {"key": "Exposure bias", "value": "0 EV"},
    {"key": "Flash", "value": "No, auto"}
  ]
}

Next lets add a compound index on the metadata array.

var col = db.getSisterDB("supershot").images;
db.images.ensureIndex({"metadata.key": 1, "metadata.value": 1});

So how do we actually query the data to efficiently use the index. Let’s take a look at two query operators called $all and $elemMatch and how we can leverage them.

$all

The $all operator is defined as selecting all the documents where the value of a field is an array that contains all the specified elements.

var col = db.getSisterDB("supershot").images;
col.findOne({tags: {$all: [ "appliance", "school", "book" ]}});

$elemMatch

The $elemMatch operator is defined as matching more than one component within an array element.

var col = db.getSisterDB("supershot").images;
col.findOne({metadata: {$elemMatch: {key: "File Name", value: "img_1771.jpg"}}});

The $elemMatch operator looks like the obvious first choice. However the problem is that our metadata array is defined as objects that all have key and value. If you attempt to enter multiple matches using key and value in the $elemMatch only the last pair will be used.

If you wish to locate a photo that has MIME type equal to image/jpeg and also Flash equal to No, auto we need to combine $all and $elemMatch.

Let’s take a look at how to correctly pick the document.

var col = db.getSisterDB("supershot").images;
col.find({ metadata: { $all: [
            { "$elemMatch" : { key : "MIME type", value: "image/jpeg" } },
            { "$elemMatch" : { key: "Flash", value: "No, auto" } }
          ]}
       }).toArray();

The first $elemMatch will use the index to locate all the documents with the MIME type equal to image/jpeg and then filter on the Flash key.

Indexes

One of the thing to remember when it comes to indexes is that the more indexes are on a collection the more BTree’s need to be updated each time a document is inserted or updated causing additional overhead and IO. In this case we replace 12+ indexes with a single compound index saving both space as well as increasing insert and update performance.