Metadata
Metadata is data that describes and gives information about other data. A classic example is the information about a digital image file such as the ISO settings, resolution, pixel depth, exposure, camera settings, camera type and so forth.
The Metadata Schema is to handle the situation where the information we wish to store in a document has a varying number of fields but we still want to retain the ability to search efficiently on any field in the metadata. Let’s use the example of a digital image file.
Model
Let’s look at sample metadata for an image.
Field | Value |
---|---|
File name | img_1771.jpg |
File size | 32764 Bytes |
MIME type | image/jpeg |
Image size | 480 x 360 |
Camera make | Canon |
Camera model | Canon PowerShot S40 |
Image timestamp | 2003-12-14 12-01-44 |
Image number | 117-1771 |
Exposure time | 1⁄500 s |
Aperture | F4.9 |
Exposure bias | 0 EV |
Flash | No, auto |
The first naive approach to modeling this might be to just translate the table directly into a corresponding document.
{
"File name" : "img_1771.jpg",
"File size" : "32764",
"MIME type" : "image/jpeg",
"Image size" : {"width": 480, "height": 360},
"Camera make" : "Canon",
"Camera model" : "Canon PowerShot S40",
"Image timestamp" : ISODate("2014-01-01T10:01:00Z"),
"Image number" : "117-1771",
"Exposure time" : "1/500 s",
"Aperture" : "F4.9",
"Exposure bias" : "0 EV",
"Flash" : "No, auto"
}
One of the shortcomings of this model is the need for multiple indexes for multiple fields, increasing the time it takes to write documents and possibly hitting the max number of indexes available on a single collection. For each index you add to a collection it takes longer to insert new documents as all indexes have to potentially be updated.
Another possibility is to leverage the fact that one can index arrays of objects easily. Let’s modify the schema above to leverage this.
{
"metadata": [
{"key": "File Name", "value": "img_1771.jpg"},
{"key": "File size", "value": 32764},
{"key": "MIME type", "value": "image/jpeg"},
{"key": "Image size", "value": {"width": 480, "height": 360}},
{"key": "Camera make", "value": "Canon"},
{"key": "Camera model", "value": "Canon PowerShot S40"},
{"key": "Image timestamp", "value": ISODate("2014-01-01T10:01:00Z")},
{"key": "Image number", "value": "117-1771"},
{"key": "Exposure time", "value": "1/500 s"},
{"key": "Aperture", "value": "F4.9"},
{"key": "Exposure bias", "value": "0 EV"},
{"key": "Flash", "value": "No, auto"}
]
}
Next lets add a compound index on the metadata array.
var col = db.getSisterDB("supershot").images;
db.images.ensureIndex({"metadata.key": 1, "metadata.value": 1});
So how do we actually query the data to efficiently use the index. Let’s take a look at two query operators called $all and $elemMatch and how we can leverage them.
$all
The $all operator is defined as selecting all the documents where the value of a field is an array that contains all the specified elements.
var col = db.getSisterDB("supershot").images;
col.findOne({tags: {$all: [ "appliance", "school", "book" ]}});
$elemMatch
The $elemMatch operator is defined as matching more than one component within an array element.
var col = db.getSisterDB("supershot").images;
col.findOne({metadata: {$elemMatch: {key: "File Name", value: "img_1771.jpg"}}});
The $elemMatch operator looks like the obvious first choice. However the problem is that our metadata array is defined as objects that all have key and value. If you attempt to enter multiple matches using key and value in the $elemMatch only the last pair will be used.
If you wish to locate a photo that has MIME type equal to image/jpeg and also Flash equal to No, auto we need to combine $all and $elemMatch.
Let’s take a look at how to correctly pick the document.
var col = db.getSisterDB("supershot").images;
col.find({ metadata: { $all: [
{ "$elemMatch" : { key : "MIME type", value: "image/jpeg" } },
{ "$elemMatch" : { key: "Flash", value: "No, auto" } }
]}
}).toArray();
The first $elemMatch will use the index to locate all the documents with the MIME type equal to image/jpeg and then filter on the Flash key.
Indexes
One of the thing to remember when it comes to indexes is that the more indexes are on a collection the more BTree’s need to be updated each time a document is inserted or updated causing additional overhead and IO. In this case we replace 12+ indexes with a single compound index saving both space as well as increasing insert and update performance.