Last week I was talking with a software manager about the design of a feature for Yelp. For starters I am not a Yelp user. If you are like me, you can find more about Yelp in this Wikipedia link.
It seems that Yelp is a business directory and uses a crowd-source review forum to collect information and ratings about different businesses (i.e., barber shops, cinemas, coffee shops, restaurants). It happens that I use OpenTable for restaurant-reservations. Apparently on both platforms one can get nearby restaurants of desired types in a specified distance so you can decide where to go for dinner (or for that matter for any meal at any time) tonight. A side fact, both companies happen to be headquartered in the San Francisco, California area.
Given that there are many features in Yelp, let’s decide on a specific use case so we can discuss a design of it at a very high level. How would a user get a list of restaurants in her vicinity?
We would need to start by getting the location of the user. Once we have that, we need the type of food of interest. Perhaps the user starts by looking at all restaurants and based on the findings start narrowing it down to a specified cuisine (i.e., Indian, Italian, etc). We would probably need to get a radius from the user location to search our database. The user might be willing to walk, use public transit or drive for the meal and the radius might shrink or grow based on her mood. I guess we have all been in this situation.
My approach was to get the geolocation coordinates of the user and with that, query a database with geolocation coordinates. Note that this would require optimizations because the database might have billions of Point of Interest (POI) records for different businesses (i.e., clothing stores, grocery stores, restaurants, and zoos among others) with locations all over the world (i.e., China, India, Peru, US, etc.). A single search on a single database would not be a good idea, even if the database is replicated in different data centers distributed over the world.
The approach would be to locate the user in a grid and return entries that match the request within a specified radius (i.e., 100, 50, 20, miles). Of course if the user wishes to walk, the radius would have to be quite smaller. Based on statistics on use the scale for the radius would have impact on the size of the grids to use.
Allow me to draw an image showing what we have so far.
I suggested dividing the world into rectangular squares. The database could be sharded based on the rectangles. Of course this approach needs further refinement when the user is close to the edge of a rectangle. The software would also have to include the adjacent rectangle or in case of a corner the three adjacent ones.
I suggested using a graphic database such as Neo4j. I have worked with Neo4j and it is able to select records based on addresses. Not sure how well such database supports sharding.
After the chat I went and did some research on the subject. One of the recommendations was the use of a graphical database.
I also watch a very good YouTube video on the subject at hand named System Design Interview – Design Yelp Geospatial Geohash by Success in Tech.
There are also spatial databases. If you take a look at the Wikipedia article about spatial databases in the Spatial index section of the article you will find Geohash, Grid (spatial index) and R-tree data structure.
From Wikipedia:
“Types of grids – Square or rectangular grids are frequently used for purposes such as translating spatial information expressed in Cartesian coordinates (latitude and longitude) into and out of the grid system.”
Also from Wikipedia:
“Geohash is a public domain geocode system invented in 2008 by Gustavo Niemeyer and (similar work in 1966) G.M. Morton, which encodes a geographic location into a short string of letters and digits. It is a hierarchical spatial data structure which subdivides space into buckets of grid shape, which is one of the many applications of what is known as a Z-order curve, and generally space-filling curves.”
If I would be architecting, designing or implementing the Yelp feature here discussed, I would become familiar with the terms, algorithms and data structures mentioned in the links in this post. Of course if you start work at Yelp, you can learn from coworkers and existing code.
In my next post I will experiment with the R-tree data structure. I was planning on writing the code from scratch but due to time restrictions, will experiment with code developed by David Moten. You can find his code in the following GitHub repository.
It seems to me that the approach I suggested for the Yelp design seems quite reasonable, specially when I did not have access to Google during the conversation. If you disagree with my assertion, I would like to hear your comments.
By the way, I am not or have been associated in any way shape or form with Neo4j, OpenTable or Yelp.
If you have comments or questions regarding this, or any other post in this blog, or if you would like for me to serve of assistance with any phase in the SDLC (Software Development Life Cycle) of a project associated with a product or service, please do not hesitate and leave me a note below. If you prefer, send me a private message using the following address: john.canessa@gmail.com. I will reply as soon as possible.
Keep on reading and experimenting. It is the best way to learn, refresh your knowledge and enhance your developer toolset!
John
Twitter: @john_canessa
Aw, this was an extremely nice post. Taking a few minutes and actual effort to produce a top notch article… but what can I say… I hesitate a whole lot and never manage
to get nearly anything done.
Glad you liked it.
I use a “To Do List”.
Every day at the end of the day I jot down a list of things to do the following day.
Each morning I prioritize them. As I finish one I cross it and take the next one.
I do not finish all tasks every day, but in general I tend to get a considerable number of tasks done.
Hope this help.