Software engineer Taj Singh talks about how easy data deduplication becomes when using Cheddar with MTurk.

Solving a head-scratcher with Cheddar and Amazon MTurk

Tajinder Singh - 12 Jan, 2015

Property Directory is an online database of accommodation providers that we built pretty much by hand over a considerable period of time. It includes hundreds of thousands of hotels, motels, B&Bs, serviced apartments and other forms of rental accommodation, right around the globe.

But when it comes to providing a great meta search engine for accommodation, Property Directory is only part of the story. Think about it: a user submits their search criteria, which you then need to pass on to an array of hotel booking services, before collating the results and presenting them back to the user.

But if booking service X and booking service Y both return results for, say, the Green Dragon Inn, Hobbiton, you need to be able to identify that you have a duplicate so you can merge the results. If you don’t, the hotel will appear twice in your results, which is sucky UX.

The deduplication process is a real head-scratcher. Attempting to do it programmatically is dangerous because the name and address won’t always match: “The Green Dragon, Back Lane, Hobbiton” from one supplier is “Green Dragon Inn, Hobbiton, The Shire” from a different supplier. It’s hard to imagine a lexical analysis we could program to spot the duplicates every time (and not produce any false positives). In fact it’s really hard; we tried exactly that a few years ago and abandoned it.
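To make the difficulty concrete, here is a minimal sketch of one common lexical approach, token-based Jaccard similarity. It is not the analysis we originally tried; the class and method names are invented purely for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a naive lexical comparison of two property descriptions.
final class NameSimilarity {

    // Lower-case the text, replace punctuation with spaces and split into distinct words.
    static Set<String> tokens(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9 ]", " ").trim();
        if (cleaned.isEmpty()) {
            return Collections.emptySet();
        }
        return new HashSet<>(Arrays.asList(cleaned.split("\\s+")));
    }

    // Jaccard similarity: shared words divided by all distinct words across both texts.
    static double jaccard(String first, String second) {
        Set<String> left = tokens(first);
        Set<String> right = tokens(second);
        Set<String> shared = new HashSet<>(left);
        shared.retainAll(right);
        Set<String> all = new HashSet<>(left);
        all.addAll(right);
        return all.isEmpty() ? 0.0 : (double) shared.size() / all.size();
    }
}
```

For the two Green Dragon descriptions above, the texts share four of eight distinct words, so the score is 0.5. A threshold low enough to accept that pair would also accept plenty of properties that merely share a town and a couple of common words, which is exactly the false-positive problem.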

That was before we discovered Amazon Mechanical Turk (MTurk). MTurk provides a marketplace where businesses can ask questions which require human intelligence in return for a small reward.

A Human Intelligence Task (HIT) can be identifying objects in photos or videos, performing data de-duplication, transcribing audio recordings, or researching data details. Traditionally, tasks like this have been accomplished by hiring a large temporary workforce (which is time-consuming, expensive, and difficult to scale) or have gone undone.

So we employed MTurk to create HITs where we ask whether a pair of properties are actually duplicates. We request 5 answers for each HIT, so that we can be confident in the result when the answers reach a consensus. For example, if 4 answers say "Yes" and 1 says "No", then we conclude that the answer to the HIT is "Yes".
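The consensus rule can be sketched in a few lines. This is a minimal illustration rather than our production code: the class name, the `Verdict` enum and the `requiredAgreement` threshold are assumptions, since the only rule stated above is that 4 out of 5 matching answers is enough.

```java
import java.util.List;

// Illustrative only: turns the Yes/No answers for one HIT into a single verdict.
final class HitConsensus {

    enum Verdict { DUPLICATE, NOT_DUPLICATE, NO_CONSENSUS }

    // answers: true means a worker said "Yes, these are the same property".
    // requiredAgreement: assumed threshold, e.g. 4 when 5 answers are requested.
    static Verdict decide(List<Boolean> answers, int requiredAgreement) {
        long yes = answers.stream().filter(Boolean::booleanValue).count();
        long no = answers.size() - yes;
        if (yes >= requiredAgreement) {
            return Verdict.DUPLICATE;
        }
        if (no >= requiredAgreement) {
            return Verdict.NOT_DUPLICATE;
        }
        return Verdict.NO_CONSENSUS;
    }
}
```

With a threshold of 4, the example of four "Yes" answers and one "No" resolves to `DUPLICATE`, while a 3 to 2 split is reported as `NO_CONSENSUS` so that the pair can be handled some other way.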

As there are likely to be many HITs in flight at once, we use Cheddar's implementation of a Process Tracker (AbstractProcess). A Process Tracker is stateful and stores data about whatever is being tracked. The states are user defined (for example: Initiated, Processing, Completed). To implement your own Process Tracker, extend the AbstractProcess class.
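To show the idea of a stateful tracker without guessing at Cheddar's API, the sketch below is a standalone class that deliberately does not extend AbstractProcess. Every name in it, including the three states, is hypothetical.

```java
// Illustrative only: a standalone stand-in for a Process Tracker.
// It does NOT use Cheddar's AbstractProcess; names and states are assumptions.
final class DeduplicationProcess {

    enum State { INITIATED, AWAITING_ANSWERS, COMPLETED }

    private final String firstPropertyId;
    private final String secondPropertyId;
    private String hitId;
    private State state = State.INITIATED;

    DeduplicationProcess(String firstPropertyId, String secondPropertyId) {
        this.firstPropertyId = firstPropertyId;
        this.secondPropertyId = secondPropertyId;
    }

    // Record the HIT created for this pair and start waiting for answers.
    void hitCreated(String hitId) {
        requireState(State.INITIATED);
        this.hitId = hitId;
        this.state = State.AWAITING_ANSWERS;
    }

    // Mark the process as finished once the answers have been retrieved.
    void complete() {
        requireState(State.AWAITING_ANSWERS);
        this.state = State.COMPLETED;
    }

    private void requireState(State expected) {
        if (state != expected) {
            throw new IllegalStateException("Expected " + expected + " but was " + state);
        }
    }

    State state() { return state; }
    String hitId() { return hitId; }
    String firstPropertyId() { return firstPropertyId; }
    String secondPropertyId() { return secondPropertyId; }
}
```

Rejecting out-of-order transitions keeps a late or repeated message from moving a tracker that has already completed.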

Because HITs posted to MTurk are answered in their own time, we needed a way to find out which HITs are ready for review. Answers for those HITs could then be retrieved and the process tracker moved to the next state. One way would be to poll the status of each HIT periodically until it has all its answers or has expired.

Thankfully, MTurk provides a Notifications API as a more efficient way to keep track of a HIT. Notifications can be set up for any HIT type, specifying which type of HIT life cycle events need to be tracked. When MTurk detects an event that has been set up for notification, a message is sent to an email address or to an Amazon Simple Queue Service (Amazon SQS) queue.

As Cheddar has support for AWS, we specify the SQS queue we want to receive messages on when setting up the notification with MTurk. The HIT lifecycle event of interest is "HITReviewable", i.e. when the HIT has been answered and is ready for review.

The process in action:

  1. When the System is not sure that two properties are the same, it spawns a Process Tracker and creates a HIT with MTurk.
  2. The Process Tracker persists relevant information regarding the properties and the HIT id.
  3. A notification request is then sent to MTurk with the SQS queue name and event type for the HIT id.
  4. When the HIT is "reviewable", MTurk puts a message on the SQS queue.
  5. The System picks up the message and retrieves the HIT answer.
  6. The Process Tracker can then be set as "Complete".
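The steps above can be sketched end to end with in-memory stand-ins. This is only an illustration of the flow: a plain `Queue` takes the place of SQS, the `HitService` interface is a hypothetical placeholder for the MTurk client, and none of it reflects Cheddar's actual classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative only: the HIT lifecycle with in-memory stand-ins for SQS and MTurk.
final class ReviewableHitFlow {

    /** Hypothetical stand-in for whatever client talks to MTurk. */
    interface HitService {
        String createHit(String firstPropertyId, String secondPropertyId);
        boolean isDuplicateByConsensus(String hitId);
    }

    private final HitService hitService;
    // Stand-in for the SQS queue: each message is the id of a reviewable HIT.
    private final Queue<String> reviewableHitIds;
    // Stand-in for persisted trackers: HIT id -> the property pair it is about.
    private final Map<String, String> pendingPairsByHitId = new HashMap<>();
    private final Map<String, Boolean> duplicateByPair = new HashMap<>();

    ReviewableHitFlow(HitService hitService, Queue<String> reviewableHitIds) {
        this.hitService = hitService;
        this.reviewableHitIds = reviewableHitIds;
    }

    // Steps 1-2: create a HIT and remember which pair it belongs to.
    void askWorkers(String firstPropertyId, String secondPropertyId) {
        String hitId = hitService.createHit(firstPropertyId, secondPropertyId);
        pendingPairsByHitId.put(hitId, key(firstPropertyId, secondPropertyId));
    }

    // Steps 5-6: pick up queued messages, fetch the answer and complete the tracker.
    void processReviewableHits() {
        String hitId;
        while ((hitId = reviewableHitIds.poll()) != null) {
            String pair = pendingPairsByHitId.remove(hitId);
            if (pair == null) {
                continue; // unknown HIT, or one that was already processed
            }
            duplicateByPair.put(pair, hitService.isDuplicateByConsensus(hitId));
        }
    }

    boolean isPending(String hitId) {
        return pendingPairsByHitId.containsKey(hitId);
    }

    // Returns null while the pair is still waiting for an answer.
    Boolean resultFor(String firstPropertyId, String secondPropertyId) {
        return duplicateByPair.get(key(firstPropertyId, secondPropertyId));
    }

    private static String key(String firstPropertyId, String secondPropertyId) {
        return firstPropertyId + "|" + secondPropertyId;
    }
}
```

Nothing in this flow polls MTurk: work only happens when a message arrives on the queue, which is the saving the notification approach provides.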

This process takes advantage of Cheddar's event-driven architecture and inherent AWS integration. By leveraging the framework, we were able to save valuable processing time for a fraction of the cost and effort. And MTurk gives us the unique ability to solve a challenging task by tapping programmatically into a huge pool of intelligent human resources.

Killer combo.
