Each day, hundreds of thousands of sellers aim to sell hundreds of millions of products to as many consumers. And this competition leads to thousands of price changes each second. In the process of building systems to gather, standardize, deduplicate and categorize this data, we’ve sifted through thousands of datapoints.
So after months of wrestling with this data, how intuitively do the guys know their data? I put together a data trivia quiz for my team, the catch being, of course, that they weren’t allowed to use our API to look up answers. The results turned out to be interesting, so I’m documenting a sample of questions and answers in this blog post for all to see.
[If you're in a hurry, scroll to question 2, which I think is the most interesting].
Question 1
Which product is sold by the most number of distinct sellers? I’ll give points for identifying the category of products, at the very least.
Answer
Gran Turismo 3 A-spec (Playstation 2) on amazon.com, with >900 sellers. Most of the items on sale are “used” goods, naturally.
Details
sem3_id: 1eta5uVvkOGsIekGumKIQe
API Query: products?q={"sitedetails":{"recentoffers_count":{"gte":INPUT}},"cat_id":CAT_ID}
Note: For each of the answers, I’ve provided the sem3_id unique identifier of the product, which can be used to retrieve products from the database through the query products?q={"sem3_id":SEM3_ID}.
Further note: For each answer, I’ve also provided the approximate query used to generate the answer. CAT_ID refers to the category of products you wish to restrict your search to and INPUT refers to your search value. You can look up category IDs through queries such as categories?q={"name":"Video games"}.
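For anyone who wants to script these lookups, here is a minimal Python sketch of how such a query string could be assembled. The base URL is a made-up placeholder, the values 900 and 4992 simply stand in for INPUT and CAT_ID, and authentication is left out entirely, so treat it as an illustration of the q parameter rather than a working client.

```python
import json
from urllib.parse import quote

# Hypothetical base URL, for illustration only; substitute the real API host.
BASE_URL = "https://api.example.com/v1/"


def build_query_url(endpoint, query):
    """Serialize `query` to compact JSON and attach it as the URL-encoded `q` parameter."""
    q = json.dumps(query, separators=(",", ":"))
    return f"{BASE_URL}{endpoint}?q={quote(q, safe='')}"


# Question 1 style query: products with many recent offers in one category.
# 900 and 4992 are placeholder values for INPUT and CAT_ID.
url = build_query_url(
    "products",
    {"sitedetails": {"recentoffers_count": {"gte": 900}}, "cat_id": 4992},
)
print(url)
```

Building the JSON with json.dumps instead of by hand also sidesteps the curly-quote problem that creeps in when queries are copied out of a blog post.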
Question 2
What’s the costliest product that you’ve come across?
Answer
The “costliest” items that we found turned out to be kinky sexual wellness toys. I’d like to keep this blog PG, so I won’t link to them. The costliest product that I will link to is this Classic Flame 23MM070EEPC/SO 23 Palisades Home Theater with Electric Fireplace, which currently costs $46,688,998,697.61 (~$46 billion)… to be precise. To put that in context, this home theater cum fireplace would bankrupt Mark Zuckerberg three times over, and is about twice the once fabled price of The Making of a Fly. Algorithmic pricing FTW?

$46,688,998,697.61 product on Amazon.com
The price history API (query 2 below) reveals the chronology of the rise in price of this product:
– 12th September, 2012: The product was priced at $1984.47 and shipped in 6-10 business days.
– 2nd December, 2012: The price had risen to $93,662.35. The hike was, you’ll be glad to know, justified by a reduced window of 1-2 day shipping.
– 21st Jan, 2013: $7,440,653,641.12. The price has risen 6.5x in the last week. Time for a quick buy and resell?
This isn’t the highest price that we’ve recorded for a product though. Turns out this Samsung TV was priced at $1,000,000,000,000.00 ($1 trillion) in early November last year. A dozen sales of this would have gone a long way towards offsetting the American national debt!
Details
sem3_id: 3wimVmBWkCsEAesceYkecK
API Query 1: products?q={"price":{"gte":1000000},"cat_id":CAT_ID}
API Query 2: offers?q={"sem3_id":"3wimVmBWkCsEAesceYkecK"}
Question 3
What’s the longest item that you’ve come across? That’s length, the physical dimension.
Answer
4000 feet jumbo roll toilet tissue! Some out of the box thinking needed there.
Details
API Query: products?q={"length":{"gte":LENGTH},"cat_id":CAT_ID}
Question 4
What’s the most popular color among all categories of products?
Answer
Black, of course! ~1MM of our 15MM or so products are black in color. Interestingly, only half as many products are white.
Details
API Query: products?q={"color":COLOR,"cat_id":CAT_ID}
Question 5
Which brand has the greatest number of distinct product listings?
Answer
I found ~65k products branded Wall Spirit.
Details
API Query: products?q={"brand":BRAND,"cat_id":CAT_ID}
Question 6
Which product contains the most variations (size-dimension-color-etc. combinations)?
Answer
1775 variations for Vermont Gage Steel No-Go Plug Gage.
Details
sem3_id: 7fimF6qaGG66oc2sYSwKaO
API Query: products?q={"variation_id":"6rV6YeWdvsyKCWI6QMeMSY"}
[Returns all variations of the specified product].
The world of e-commerce data, particularly price data, never ceases to amaze me. Managing this data is certainly an uphill challenge, but it is also paradise for the data geek in me.
Sign up here (link expires in a week) if you wish to explore the data for yourself. Lucky for you, we’re already handling the data management bit, so, I invite you to go crazy exploring. Do email me at govind at semantics3 dot com if you have any questions or insights.
By the way, for you non-developers out there, we’ve built an API playground through which anyone can run queries with no setup time.
The problem with APIs is that you often don’t grasp their entire utility until you write those first lines of code. And it takes something really compelling to get developers motivated enough to write those first lines.
This is particularly true of our Products API. It’s difficult to understand the depth of our data and the full capabilities of our API without live examples. So we’ve built a new tool, which we call the “API Playground”, to help users explore our API at the click of a few buttons.

Returns all Toshiba products that fall under the category of "Computers and Accessories" (ID 4992), further filtered by price, order, weight and the store
The API Playground is a simple interface that can quickly demonstrate the kind of queries you can make with our API. It’s a full-fledged console – anything you type in the box will be directly forwarded to the API. You can even play with our newly launched offers API through this console (more on this in a later post). Developers we have been working with (and even the engineers in our team!) use this to quickly test drive their queries before plugging them into their applications.
We’ve preloaded the Playground with some sample queries that will help you frame your queries better – category lookup, name lookup, sorting, pagination. We hope this will help you get up to speed with our API faster and start building cool applications with it!
Head over to http://www.semantics3.com/dashboard/playground to test it out.
Don’t have an account with us? Click here to sign up.
Do get in touch with us with your feedback and comments.
I gave a general technical talk on building a distributed web crawler at last Friday’s NUS Hackers meetup. Here are my presentation slides, which I used during the talk.
We are hiring!
If you are interested in joining a fast growing Cloud (80-node EC2 cluster), Mobile (SSH + Screen) and Social (We are all friendly hackers. No ‘ideas’ people.) startup, drop me a mail at varun [at] semantics3.com!
This post was written by our guest author, Youssef Mehamed, who is a Mobile Developer and Sys Admin. He is currently studying Computer Science in Melilla, Spain. StackOverflow. App Store Profile. LinkedIn.
The Semantics3 Products API uses the Oauth v1.0 2-legged authorization scheme, so let’s see how to get your iPhone app to interact with this webservice using Objective-C.
To achieve this, the first thing we need is to download and add to our project the OauthConsumer. I’ve chosen github.com/jdg/oauthconsumer. As the author says, it is an iPhone ready version of Google Code’s OauthConsumer (Mac).
Let’s get started!
1. We’re going to clone the project, add it to ours and then write the needed code to make correct queries to Semantics3 API.
git clone git://github.com/jdg/oauthconsumer.git
2. Copy the OauthConsumer entire directory to our project.
3. Add the required frameworks and libs:
- Security.framework
- libxml2.dylib
4. Edit the “header search paths” and include “$SDKROOT/usr/include/libxml2” with “Recursive” checked.
Note: The README says to include sys/types.h. I haven’t included it and everything works so far.
5. Next import OauthConsumer.h (wherever you want to code the queries):
(To make the code clearer I have defined the Oauth Key and Secret as macros)
6. We need at least three methods for the basic usage of the app:
The code is pretty self-explanatory: it creates and initialises the OauthConsumer and the OAuth request. After that, we only need to create the parameter q and assign it a value (continuing with the Quick Start example). Then we just need to perform the query and catch the success and error events using a delegate (in this example, the same class).
That’s it. Following these simple and easy steps we have a simple app capable of interacting with the Semantics3 Products API.
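If you are curious what the OauthConsumer library is doing on your behalf, the signing step of 2-legged OAuth 1.0 can be sketched in a few lines of Python. This is an illustration of the HMAC-SHA1 signature method as described in RFC 5849, not the library's code. The URL, key, secret and q value below are placeholders, and the sketch assumes each parameter name appears only once.

```python
import base64
import hashlib
import hmac
import secrets
import time
from urllib.parse import quote


def pct(value):
    """Percent-encode per RFC 3986 (only unreserved characters are left as-is)."""
    return quote(str(value), safe="")


def sign_request(method, url, params, consumer_key, consumer_secret,
                 nonce=None, timestamp=None):
    """Return the OAuth parameters, including oauth_signature, for a 2-legged request."""
    oauth = {
        "oauth_consumer_key": consumer_key,
        "oauth_nonce": nonce or secrets.token_hex(16),
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_timestamp": str(timestamp or int(time.time())),
        "oauth_version": "1.0",
    }
    # Request parameters and OAuth parameters are signed together,
    # sorted by their encoded names and values.
    pairs = sorted((pct(k), pct(v)) for k, v in {**params, **oauth}.items())
    param_string = "&".join(f"{k}={v}" for k, v in pairs)
    base_string = "&".join([method.upper(), pct(url), pct(param_string)])
    # 2-legged: there is no token, so the token-secret half of the key is empty.
    key = f"{pct(consumer_secret)}&"
    digest = hmac.new(key.encode(), base_string.encode(), hashlib.sha1).digest()
    oauth["oauth_signature"] = base64.b64encode(digest).decode()
    return oauth


# Placeholder endpoint, credentials and query, for illustration only.
signed = sign_request(
    "GET", "https://api.example.com/v1/products",
    {"q": '{"cat_id":4992}'}, "MY_KEY", "MY_SECRET",
)
print(signed["oauth_signature"])
```

The returned parameters would normally be sent in the Authorization header (or appended to the query string) alongside the original q parameter.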
Web crawling is one of those tasks that is so easy in theory (well you visit some pages, figure out the outgoing links, figure out which haven’t been visited, queue them up, pop the queue, visit the page, repeat), but really hard in practice, especially at scale. Launching a badly written webcrawler is equivalent to unleashing a DoS attack (and in a distributed crawler, it constitutes a DDoS attack!)
At Semantics3, we aggregate a large number of products from quite a few websites. Doing so requires some heavy duty web crawling and we have built a distributed web crawler to suit our needs. In this post I will be describing the design architecture of our web crawler, implementation details and finally some future improvements.
tl;dr:
- 60-node cluster (bulk of them are spot micro-instances) running on Amazon AWS. We crawl 1-3 million pages a day at a cost of ~$3 a day (excluding storage costs).
- Built using the Gearman map-reduce framework based on a supervisor-worker architecture.
- Stack is redis (supervisor)+gearman (coordinator)+perl workers on aws. Chef and Capistrano are used to manage the server herd.
- Future plans are to use bloom filters to keep track of visited URLs and switch to a distributed db like Riak or Cassandra for storing the state of the crawl.
Evolution
Let me begin with how our crawling has evolved over the past few months. We started off with a single threaded, single process crawler architecture written in 200 lines of Perl. As the number of requests increased, we rewrote it to run in a multi-threaded, multi-process manner. When even that was not enough, we went for bigger machines (more CPU power and more memory), like the extra-large high CPU EC2 instances. Finally even this didn’t seem adequate. It was pretty obvious that we had to figure out a way for our web crawler to run on multiple machines.
The original plan was to go with something out of the box. There were two solutions which we investigated. One was 80Legs, a web-crawler-as-a-service API. The other was running Nutch on a Hadoop cluster. The former was very restrictive as we had no control on how the pages were crawled, while the latter required a lot of grokking of internals to customize crawling to our needs.
We came to a conclusion that building our own web crawler architecture made a lot of sense. Another reason was that we had quite a few other tasks (processing our data, resizing images, etc.) that needed to be run in a distributed manner. Our web crawler is written as a specific type of job as part of a distributed job work system, built to suit our needs.
Architecture
High level overview
Our architecture is one based on a supervisor-worker model and follows the map-reduce pattern. The supervisor controls the state of each of the web crawls and determines which page to crawl next.
The workers are the ones that do the actual work (duh) – in this case the downloading of pages and processing of HTML. Once the workers have finished their work, they update the supervisor with details such as the HTTP status of the page download and links discovered. They then store the processed HTML content in our database.
Failed pages are retried (5XX responses) or discarded (404s). If the number of failed URLs crosses a threshold, the crawl is immediately paused.
(c) Mark Handley – http://bit.ly/QSX5q8
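The failure-handling rules described above can be sketched roughly as follows. This is a simplified Python illustration, not our Perl code: the class, the field names, the retry limit and the way the threshold is counted are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlState:
    failure_threshold: int      # assumed: pause once failures exceed this count
    max_retries: int = 3        # assumed per-URL retry limit
    status: str = "running"
    failed: int = 0
    retries: dict = field(default_factory=dict)
    queue: list = field(default_factory=list)


def record_result(state, url, http_status):
    """Update crawl state from one worker report and return the action taken."""
    if 200 <= http_status < 400:
        return "ok"
    state.failed += 1
    if state.failed > state.failure_threshold:
        state.status = "paused"          # too many failures: stop the crawl
        return "paused"
    if http_status >= 500 and state.retries.get(url, 0) < state.max_retries:
        state.retries[url] = state.retries.get(url, 0) + 1
        state.queue.append(url)          # transient server error: try again later
        return "retry"
    return "discard"                     # e.g. a 404: the page is gone, drop it
```

Pausing on a failure threshold is what keeps a misbehaving crawl from hammering a site that is already struggling.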
Implementation Details
Perl is our weapon of choice. We do a lot of text processing and Perl has been a godsend (though the quirky syntax sometimes annoys me). We use it for pretty much everything except for our API, which is written in Node.js.
The actual distribution of work tasks is done through Gearman, which is a work distribution library. This could be said to be the bedrock of our distributed web crawler. It handles concurrency related issues, failures, and most importantly, figures out which machine to farm out work to, all automagically without any fuss. The best thing about Gearman is that it automatically takes care of horizontal scaling. We can just keep spinning up more instances and Gearman will auto-detect them and start farming work to them.
If Gearman is the bedrock of our distributed system, then Redis is the oracle. Redis is used to hold the state of all web crawls. It stores the status of each crawl (running, paused, cancelled), the URL crawl queue and the hash of all the URLs that have been previously visited.
Our entire distributed system is built on top of Amazon AWS:
- The ‘Supervisor’ runs on a small instance, running Redis and a Perl script that figures out which URL to crawl next, which it then sends to the Gearman server.
- The ‘Gearman’ server runs on a micro instance. This farms out work to all the workers who have registered with the Gearman server.
- The ‘Workers’ all run on spot micro-instances. We have, at any given time, about 20-50 worker instances running based on our workload. The ‘Worker’ script runs on a single process, in an evented manner. They update the ‘Supervisor’ server with the status of the crawl and write the processed HTML to our ‘DB’ servers which run MongoDB. Another design feature is that the ‘Workers’ are stateless – they can be spun up or terminated at will without affecting jobs state.
The actual crawling is done 10 pages at a time from each domain. We resolve and cache the DNS request, keep the TCP connection open and fetch the 10 pages asynchronously.
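To make the batching concrete, here is a small Python sketch of just the grouping step: URLs are bucketed by host and cut into batches of 10, so that each batch can share one DNS lookup and one open connection. The function name is hypothetical, and the DNS caching, keep-alive and asynchronous fetching themselves are left out.

```python
from collections import defaultdict
from urllib.parse import urlsplit

BATCH_SIZE = 10  # from the post: 10 pages at a time from each domain


def batches_by_domain(urls, batch_size=BATCH_SIZE):
    """Group URLs by host and yield (host, batch) pairs of at most batch_size URLs."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc].append(url)
    for host, host_urls in by_host.items():
        for i in range(0, len(host_urls), batch_size):
            yield host, host_urls[i:i + batch_size]


urls = [f"http://shop-a.example/p/{n}" for n in range(23)]
urls += [f"http://shop-b.example/p/{n}" for n in range(3)]
for host, batch in batches_by_domain(urls):
    print(host, len(batch))
```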
[To handle our server farm, we use Chef to handle the configuring of all servers based on their roles and Capistrano for the actual deployment from our Git repos.]
In terms of costs, we are able to crawl about 1-3 million pages daily for just $3 a day (excluding storage costs).
Why ‘Almost’ Distributed?
The actual weak-point in our system is the ‘Supervisor’ which constitutes a single point of failure (to somewhat mitigate this, we use replication and also store Redis state on disk) and hence our system is not purely distributed.
The primary reason is that there is no distributed version of Redis and we are addicted to its support for in-memory data structures. We use pretty much all of its data structures – sets, sorted sets, lists and hashes – extensively.
Possible alternatives which we are investigating are distributed databases such as Riak and Cassandra to replace the role of Redis.
Data Structures
We use priority search based graph traversal for our web crawling. It’s basically breadth first search, but we use a priority queue instead of a queue to determine which URL to crawl next.
We assign an importance score to each URL which we discover and then crawl them accordingly. We use Redis sorted sets to store the priority associated with each URL and hashes to store the visited status of the discovered URLs. This, of course, comes with a large memory footprint.
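Here is a minimal in-memory sketch of that traversal in Python. A heap and a set stand in for the Redis sorted set and hash, and the link graph and importance scores are toy data invented for the example.

```python
import heapq


def priority_crawl(seed, links, score, limit=None):
    """Visit pages best-first: always expand the highest-scoring discovered URL next."""
    heap = [(-score(seed), seed)]   # heapq is a min-heap, so negate the score
    discovered = {seed}
    order = []
    while heap and (limit is None or len(order) < limit):
        _, url = heapq.heappop(heap)
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in discovered:
                discovered.add(nxt)
                heapq.heappush(heap, (-score(nxt), nxt))
    return order


# Toy link graph and importance scores, for illustration only.
links = {
    "home": ["about", "tv", "laptops"],
    "tv": ["tv/samsung"],
    "laptops": ["laptops/toshiba"],
}
scores = {"home": 1.0, "about": 0.1, "tv": 0.9, "laptops": 0.5,
          "tv/samsung": 0.8, "laptops/toshiba": 0.4}
print(priority_crawl("home", links, scores.get))
```

A plain breadth first search would visit "about" second. With the priority queue it is pushed to the very end, because everything else discovered along the way scores higher.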
A possible alternative would be to use Bloom filters. A Bloom filter is a probabilistic data structure used for answering set-membership questions (eg: has this URL been crawled before?). Due to its probabilistic nature, it can give erroneous results in the form of false positives. You can, however, tweak the error rate, allowing for only a small number of false positives. The great benefit is the large amount of memory you can save (it is much more memory efficient than Redis hashes). If we start crawling pages in the hundreds of millions, we definitely would have to switch to this data structure. As for the false positives, well, there ain’t much harm in occasionally skipping a page that we haven’t actually crawled.
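For the curious, a bare-bones Bloom filter fits in about twenty lines of Python. This is a sketch only: the sizing uses the standard textbook formulas, and deriving the bit positions from a single SHA-256 digest is my choice for the example, not something we run in production.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, error_rate):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hash functions.
        self.size = max(1, math.ceil(-expected_items * math.log(error_rate) / math.log(2) ** 2))
        self.hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two halves of one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


visited = BloomFilter(expected_items=1000, error_rate=0.01)
visited.add("http://shop.example/p/1")
print("http://shop.example/p/1" in visited)
```

With these settings, 1000 URLs need a little over one kilobyte of bits, regardless of how long the URLs are. The filter never forgets a URL it has seen. It can only, occasionally, claim to have seen one it hasn't.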
Recrawling
Since one of our tasks is building price histories of products, recrawling of pages needs to be done intelligently. We want to recrawl products whose prices change frequently more often (duh) than pages whose prices rarely change. Hence a brute-force complete recrawl doesn’t really make sense.
We use a power law distribution (some pages are crawled more frequently than other pages) for recrawling products, with pages ranked based on their importance (using signals ranging from previous price history changes to how many reviews they have). Another challenge is in page discovery (eg: if a new product has been launched, how quickly can we have it in our system).
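One simple way to picture such a schedule is sketched below in Python: rank the pages by importance and let the recrawl interval grow as a power of the rank. The function, the base interval and the exponent are all assumptions for illustration, and our actual strategy is not described here.

```python
def recrawl_intervals(importance, base_hours=1.0, alpha=1.0):
    """Map each URL to a recrawl interval in hours that grows as a power of its rank."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return {url: base_hours * rank ** alpha for rank, url in enumerate(ranked, start=1)}


# Toy importance scores; in practice they would come from signals such as
# past price changes and review counts.
importance = {"hot-tv": 0.9, "steady-laptop": 0.5, "dusty-cable": 0.1}
print(recrawl_intervals(importance, base_hours=1.0, alpha=2.0))
```

Under this toy schedule the top-ranked page is revisited every hour while the third-ranked page waits nine hours, so most of the crawl budget goes to the few pages most likely to have changed.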
I will be writing a post detailing our recrawl strategies sometime in the future.
Resources
Here are some academic papers for reference:
Building Blocks Of A Scalable Webcrawler
Finally a really good post I came across on HackerNews, How To Crawl A Quarter Billion Webpages in 40 Hours , which motivated me to write this post.
Conclusion
Building your own distributed web crawler is a really fun challenge and you get to learn so much about distributed systems, concurrency, scaling, databases, economics, etc. But once again, you may want to evaluate the pros and cons before you decide to roll out your own. If what you do is plain ol’ vanilla crawling, then you may want to just use an open source web crawler like Nutch.
Feel free to drop me an email (varun [at] semantics3.com) or leave a comment if you have any questions, suggestions or even improvements.
Sign-up
We just launched our closed beta of our API. We would be most glad if you could give it a try. Here is the signup link (it comes preloaded with the invitation code):
Don’t hesitate to share it with your developer friends.
In this post I am going to talk about how we built the backend infrastructure to support our Products API, Semantics3’s core product. This post can also be seen as a follow up or maybe even a reply to a post some time back by Zemanta, which dealt with building public APIs.
In that post, the author suggests using 3rd party API support vendors like 3Scale and Mashery to handle all API backend requirements. We seriously considered those two services and concluded that they weren’t appropriate for us because:
- Our API is our core product offering and we didn’t feel comfortable outsourcing such a critical part to a 3rd-party service.
- Those services aren’t cheap.
- We are engineers. We love to build things (while keeping a wary eye on reinventing the wheel).

(c) Techcrunch – http://tcrn.ch/ModUnx
Our core product is our Products API, which allows developers to get immediate access to millions of products data which is constantly updated. A sample query to the API may read: “return LCD televisions with price >= USD400, with length >= 65cms of brand Samsung”.
One can think of an API as a utility service. A paid API is essentially the sale of some sort of utility, either on a usage-based model or a monthly subscription model. In our case, the utility which we are selling is realtime access to products data.
In this post, I am going to focus mainly on the technical aspects of the administrative parts of running such a service. “Administrative parts” includes the delivery, the metering, the billing of customers, authentication of users, etc. How we built the actual utility service is a discussion for another day.
It took three of us (Govind, Vinoth and myself), three weeks to build the API support infrastructure. In terms of work allocation, Govind worked on the actual piping of the data. Vinoth worked on the billing and frontend analytics dashboard. I worked on the architecture, authentication, throttling and metering of customers.
So, what are some of the things you need to consider when building your own paid API offering?
0. Design Of The API
REST-based APIs have become the de facto standard, so you probably want to build one on these principles. Beware, it’s very easy to design an unREST-ful API, so do spend lots of time planning your endpoints. These resources from 3Scale helped us better understand the technicalities of designing a RESTful service. JSON has also become the most popular delivery format, so ensure that you support it.
1. The Language/Platform To Build The API
The language/platform you pick is of critical importance. You’d ideally want to choose something that can handle a large number of concurrent requests and scale well. Our original plan was to go with Perl, which is our predominant language of choice. However, after some investigation, it didn’t seem to be a great option (Plack unfortunately is lacking in good documentation). We evaluated Python (Tornado) and Node.js and eventually chose the latter. This is because Node.js has baked in async functionality, native JSON support (duh) and a great community behind it. Our API is built using the really good Restify framework.
Unfortunately, code in Node.js doesn’t run in an async manner automatically. You often need to write it with (ugly spaghetti-like) callbacks for it to happen. Finally, debugging javascript is not a very pleasant experience.
Two other languages which you may want to consider are Golang and Erlang. In hindsight, I probably would have picked Golang.
2. API Key Generation and User Management
You need to have a robust system of generating keys and secrets, since these credentials are used to authenticate customers who access your service. It has to be secure, unique and one-way (not decipherable). Our algorithm for generating keys uses the base64 encoding of the output of some well known hashing algorithms of user details and a random crypto number.
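In spirit, the scheme looks something like the Python sketch below. The choice of SHA-256, the 32 random bytes and the use of an email address as the "user details" are assumptions made for this example, not a description of our exact algorithm.

```python
import base64
import hashlib
import secrets


def generate_credentials(user_email):
    """Return (api_key, api_secret) derived from user details plus fresh randomness."""
    def token(label):
        # Hash a label, the user details and 32 cryptographically random bytes together.
        material = label.encode() + user_email.encode() + secrets.token_bytes(32)
        digest = hashlib.sha256(material).digest()
        # URL-safe base64 keeps the credential easy to paste into URLs and headers.
        return base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return token("key"), token("secret")


api_key, api_secret = generate_credentials("dev@example.com")
print(api_key, api_secret)
```

Because fresh random bytes go into every hash, the same user gets a different pair each time, and the output cannot be reversed to recover the user details.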
3. Authentication
We decided to go with OAuth v1.0 2-legged as our primary authentication scheme. The other authentication scheme we support is basic authentication (just send in an HTTP request header with your key present), but it’s restricted to our test endpoints.
Since Node.js didn’t have a suitable OAuth server library, we ended up writing our own (we will open source it at some point in the future). We then had to test it with the popular OAuth client libraries for all the major scripting languages: Perl, Python, Ruby, PHP and Node.js.
4. Metering
This is the most critical part of a paid API offering. Metering is used to track exactly how many resources each customer has used. It’s also the starting point for debugging your system.
When designing your system, try to capture as much information as possible from each API query. For each request made to our API, we log 25 different parameters related to the call, giving us more than enough data to hunt down even the hairiest of bugs. This information can later come in handy for analyzing your customers’ usage patterns – e.g.: How many requests are being made, how frequently? Which resources are requested for most often?
All API calls are logged on a MongoDB server. We then run map-reduce jobs to aggregate the number of calls made by each API key, to determine the daily usage of each of our customers.
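The aggregation itself is conceptually simple. Here is the same idea as a small in-memory Python sketch: the record fields api_key and ts are hypothetical names chosen for the example, and a Counter stands in for the MongoDB map-reduce job.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_usage(call_log):
    """Count calls per (api_key, UTC date) from an iterable of log records."""
    # "Map": emit one (api_key, day) pair per call. "Reduce": sum the pairs.
    return Counter(
        (call["api_key"],
         datetime.fromtimestamp(call["ts"], tz=timezone.utc).date().isoformat())
        for call in call_log
    )


# Toy log records with Unix timestamps, for illustration only.
log = [
    {"api_key": "KEY_A", "ts": 0},
    {"api_key": "KEY_A", "ts": 10},
    {"api_key": "KEY_B", "ts": 20},
    {"api_key": "KEY_A", "ts": 86400},
]
print(daily_usage(log))
```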
5. Throttling API usage
API throttling is very critical, because you don’t want your service (especially your free plan) to, bluntly put, become a free-for-all unlimited buffet service. We use the leaky bucket algorithm to throttle API requests based on the tier of the plan. [E.g.: our free plan is capped at 1000 calls for any given 24-hour period, from the time the first call was made.] We use a patched version of an implementation of this algorithm, which is available in the Restify API framework which we use.
Here is a great Stackoverflow thread that discusses request throttling.
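For readers who haven't met the leaky bucket before, here is a minimal Python sketch of the idea, used as a meter: every call pours one unit into the bucket, the bucket drains at a fixed rate, and a call is rejected if it would make the bucket overflow. This is an illustration only, not the Restify implementation, and the class and parameter names are made up.

```python
class LeakyBucket:
    """Leaky bucket as a meter: each call adds one unit, the bucket drains at a fixed rate."""

    def __init__(self, capacity, leak_per_second):
        self.capacity = capacity
        self.leak_per_second = leak_per_second
        self.level = 0.0
        self.last = None

    def allow(self, now):
        """Return True if a call arriving at time `now` (in seconds) is within the limit."""
        if self.last is not None:
            # Drain whatever has leaked out since the previous call.
            self.level = max(0.0, self.level - (now - self.last) * self.leak_per_second)
        self.last = now
        if self.level + 1 > self.capacity:
            return False
        self.level += 1
        return True


bucket = LeakyBucket(capacity=3, leak_per_second=1.0)
print([bucket.allow(0.0) for _ in range(4)])
```

A cap like "1000 calls per 24 hours" could be approximated with capacity=1000 and leak_per_second=1000/86400, though a continuously draining bucket is not identical to a fixed 24-hour window counted from the first call.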
6. Billing of Customers
We are based in Singapore, and hence have no access to Stripe. As a result we ended up choosing the not-so-easy-to-integrate Paypal API, which took a good one week for Vinoth to integrate, what with the Paypal API’s creaky url-callback system, poor documentation and buggy sandbox environment. It’s quite amusing that Paypal doesn’t even provide a REST API!
We support two types of payments. One is a monthly subscription (get a fixed number of calls per month) and the other is a bulk call purchase (purchase X number of calls at Y dollars).
7. Dashboard/Analytics Platform for Customers

Finally, you want to build a visual front-end so that your customers can track and monitor their usage.
Our analytics dashboard allows users to view the total number of calls that they made for each day in their chosen date range. We also display the last 100 calls made on the last day of the chosen date range.
We used client side rendering to build the dashboard, using the excellent ICanHaz.js library (which comes with built-in mustache templates support). Client-side rendering is a great strategy when building dashboards, because it makes work division between frontend and backend devs really convenient. More importantly, it allows for changes and new features to be introduced more easily as the various aspects of code (front-end display and back-end data generation) are clearly demarcated.
The javascript library we used for rendering the graphs (bar graph and pie chart) was JQuery Flot. JQuery DataTables was used to display the table of calls. Finally we used the Bootstrap datepicker library to allow for users to select their date range.
Conclusion
Building a paid API offering involves several considerations that need to be planned thoroughly. I hope this blog post serves as a simple guide for those looking to build something similar for their own startups. That said, if your API is just a non-critical add-on service and not your core offering, using a third party service like 3Scale or Mashery may be a much better choice.
On a side note, if our current idea doesn’t take off, we may set up a Mashery competitor (just kidding).
PS: We just launched a closed beta of our API. We would be most glad if you could give it a try. Here is the signup link (it comes preloaded with the invitation code). Don’t hesitate to share it with your developer friends.
I spoke about our back-end infrastructure at the May meetup of BigData.sg. Here are my presentation slides, which I used during the talk.
In the coming weeks I plan to write a few blog posts regarding specific aspects of our infrastructure and the various design decisions we undertook while building them. Stay tuned!
I am very excited to introduce our first engineering hire and fourth hacker to join our team, Srinivas Kidambi aka Kid!
Kid is a recent graduate from the National University of Singapore. He is a computational geometry expert, having taken pretty much every CG course offered and done a really cool final year project on 2 dimensional alpha shapes. We unfortunately do not deal with polygons and convex hulls, but hey! Rockstar-ninja-pirate-cyborgs make shit happen!
Story Time
I got to know Kid really well during my final semester at NUS. Both of us ended up taking an advanced grad-level cryptography course where we invented a novel method of hash collision... er, make that an introductory Philosophy of Science course. Only God knows how both of us ended up taking that course.
We soon became pretty good friends. During the entire duration of the course we probably discussed ‘the interesting philosophical problems plaguing science’ for about 2 mins 27 seconds. Most of the time it was Kid talking endlessly about CGAL and CUDA. Pretty soon I got sick of them as much as philosophy of science itself. (Recap: At that time Semantics3 was still only in idea stage.)
Inception
To fight back and also not wanting to miss out on a good hacker, I took a leaf out of Inception and started planting seeds in his mind without him knowing it. I got him addicted to data, structured data, unstructured data and semi-structured data and big data and small data… The Inception worked! After graduation, Kid rejected a couple of offers from big, badass companies to join us.
We expect great things from Kid!
The past few weekends I have been working on a personal side project (20% time project à la El-Goog) called Quanthunt with a friend of mine, Eileen Chan. It is basically an online trading platform where you compete with others – with the catch being that all trades can only be done programmatically using an API. I started on it mainly for fun but also to try my hand at something quite different from the things I have worked on previously. It is also pretty technically interesting as it requires the processing of a large number of trades from multiple users and tracking their positions in a leaderboard in real time.
Soon after I got started writing the code for Quanthunt, I hit a roadblock. I wasn’t too sure on going about implementing the trading engine (the part that executed the trades locked in by the users according to the latest market rates) in a simple, efficient manner. I could have done a while(1) loop iterating through a MySQL table but that was clearly a very naive way.
A Rabbit And A 4-letter Acronym
So it was back to doing background research. After many google queries and reading some excellent posts on Hacker News, I started to get an inkling on how to proceed. I learned about messaging queues and that the gold-standard was TIBCO and that most of the financial institutions used their software. Being a free software zealot I started searching for open source messaging queues. I soon discovered AMQP and read about RabbitMQ. It was my first AHA! moment. Now I knew how to go about implementing the trading engine.
But something else was going on in my mind. I constantly kept relating all the new stuff I was learning to the stuff I was working on at Semantics3. For example, one thing which I wanted to improve was our web crawler, which was implemented in a single-machine, single-process (monolithic) manner. Hadoop was a potential option but that was way overkill and didn’t really suit our needs. This was my second AHA! moment. The message queue was exactly what I needed to write a distributed web crawler that was completely tailored for our needs.
The Revelation
This got me really excited and I started reading extensively about message queues. I soon discovered ZeroMQ (which turned out to be more of a library than a framework) and also learned about how Redis could be used as a job queue (Small advice: Read up on Redis. It’s got so many different features and capabilities. It’s much, much more than a simple key-value in-memory store!) Later, I found out about dedicated job queue servers and learned about Beanstalk. Then I discovered Gearman. This was exactly what I needed and what I was unknowingly searching for.
With this new found knowledge I was able to completely redesign our core architecture from one that was centralized to one that was highly distributed and could potentially scale up and down according to our needs. From big fat machines doing all the work, we could now do with just using a large number of smaller machines. Not only has this made our infrastructure more robust, it has also brought us significant cost savings.
Now let me come to my point at hand. Looking back it seems very obvious. But hindsight is often 50/50. If I hadn’t worked on my side project, I don’t think I would have gotten to know about and appreciate messaging queues, their various implementations (Gearman, AMQP, RabbitMQ, ZeroMQ, etc.) and the possibilities that they offer, at least not for some time to come. Except for a bunch of load balancers, our infrastructure would have continued to be fat and monolithic.
Other Projects
Quanthunt is still a work in progress (I ended up using Redis to handle the queuing and execution of trades) and should be done in the next two weeks. I already have a couple of other personal weekend projects that I have lined up, which I plan to undertake once I am done with it. The first one is to build a quadcopter completely from scratch, including writing the control system. Part boyhood fantasy and part not to lose my roots in hardware engineering. In fact, I have been discussing this with a bunch of friends for quite some time now. Who knows what skills/lessons I might learn out of it, which I could apply at Semantics3?
Another project I have in mind is to build a clone of Facebook chat using Erlang/OTP (I read a lot about it while researching RabbitMQ). I have always wanted to learn a functional programming language, and I feel that Erlang is the best choice as it seems to be a very functional (pardon the pun) language for building practical web applications. Also, as Erlang is designed for writing distributed applications, it just might be the best language in which to implement the next version of our web crawler.
Words of Wisdom
So, get started on a weekend project. Pick a project in an area where you don’t have much background knowledge or experience, or one that requires a different stack of technologies from what you are familiar with. Use a different kind of database (e.g., try a project using a graph database like Neo4j), a different library or framework or, even better, a completely different language. It will not only make you a better engineer, it might just help you understand and improve your core work in different and better ways.
I woke up this morning to a mail from Varun. All I managed to see in the alert on my iPhone was the forwarded address “YCombinator” and the words “Oh. Well.”. I didn’t even bother reading the whole email. I went back to sleep.
When we started working on this idea in December we told ourselves – we are gonna work towards YC Summer 2012. That was our goal. We wanted to be part of the cult too. We wanted to be back in the valley.
So for the next few months, we worked hard on our offerings. Launched an MVP marketplace. Pushed out a beta version of the API. Our consulting work was bringing in enough money. Pitched at several events. When we felt we were finally on track, we switched our attention to the YC application. We spent weeks writing and rewriting it. Thought about each question thoroughly. Reached out to alumni. Got them to review it (secretly hoping they would put in a good word). After we finally submitted our application, we were very confident of at least landing ourselves an interview call. Varun posted “TO CALIFORNIA AND BEYOND!!” on Facebook. We were excited. Come on, 3 computer engineers who were making revenue on an idea that might change the way developers look at data!
But that night I realized something. We wanted YC too dearly. It made it sound like our company’s mere existence depended on us entering the sacred scrolls of YC startup alumni and being blessed by HRH PG. That was so wrong. That is not how startups are supposed to think. As Govind later put it, we were focusing on “becoming a YC company” more than “becoming a successful company”. That would have proven unhealthy in the long run. Rather than spending time on YC recommendations, HackerNews points, visa issues and other such distractions, we should focus on the product. And rather than answering questions from the angle of “what would PG think”, we should give ourselves honest answers about how we are different, who will buy from us and so on.
From that day, I started making a list of all the things I would look forward to in case we didn’t get in. Personally, I had my graduation ceremony to look forward to. I had a close friend’s engagement ceremony in India. Dream Theater was coming to Singapore and I had already bought tickets. Most importantly, I started imagining a life without YC. I imagined we went ahead and released our API in May. Made our first hire in June. Signed a few more deals. Raised an angel round later in the year. Saw the company growing.
When “the” mail finally came, we were obviously devastated. But I was prepared for it.
Reflections:
- If you don’t get into YC, move on! You didn’t apply to YC just because getting in is cool. You applied because you thought your idea was awesome and you hoped to use YC to build some contacts and get some recognition. I am sure there are other ways this can be done. You call yourself an entrepreneur for a reason.
- Their rejecting you doesn’t mean your idea sucks. Maybe it wasn’t their cup of tea, maybe they didn’t believe in your idea as much as you did, or maybe they were driving home when they read your application. If you have faith in it, continue working on it.
- DO NOT set your goal as “getting into YC”. It’s tough. It’s highly competitive. Chances of getting in are minuscule. Keep a few targets in mind, so that even if this one fails, you have something else to look forward to.
I woke up an hour later, took a quick shower, packed my stuff and left for the office. On my way I finally read through the email. Govind’s motivational messages soon followed. Varun suffered a migraine after spending 3 hours (from 4 am Singapore time) awaiting the (rejection) email. And there I was. Excited to get started on today’s work.
What if there is no YC? We will continue working on what we have set out to achieve. Building an easy-to-use platform to allow developers to access data.




