Nov 03
2010

What Happened to NewsBlur: A Hacker News Effect Post-Mortem

Last week I submitted my project, NewsBlur, a feed reader with intelligence, to Hacker News. This was a big deal for me. For the entire 16 months I had been working on the project, I was waiting for it to be Hacker News ready. It's open-source on GitHub, so I had the extra incentive to do it right.

And last week, after I had launched premium accounts and had just started polishing the classifiers, I felt it was time to show it off. I want to show you what the Hacker News effect has been on both my server and my project.

Hacker News As the Audience

When I wasn't writing code on the subway every morning and evening, I would think about what the reaction on Hacker News would be. Would folks find NewsBlur too buggy? Would they be interested at all? Let me tell you, it's a great motivator to have an audience in mind and to constantly channel them and ask their opinion. Is a big-ticket feature like Google Reader import necessary before it's Hacker News ready? It would take time, and time was the only currency I had to pay with. In my mind, all I had to do was ask. ("Looks cool, but if there's no easy way to migrate from Google Reader, this thing is dead in the water.")

Kurt Vonnegut wrote: "Write to please just one person. If you open a window and make love to the world, so to speak, your story will get pneumonia." (From Vonnegut's Introduction to Bagombo Snuff Box.)

Let's consider Hacker News as that "one person," since for all intents and purposes, it is a single place. I wasn't working to please every Google Reader user: the die-hards, the once-in-a-seasons, or the twitter-over-rss'ers. For the initial version, I just wanted to please Hacker News. I know this crowd from seeing how they react to any new startup. What's the unique spin, and what's the good use of technology, they would ask. What could make it better, and is it good enough for now?

If you're outsourcing tech and just applying shiny visuals to your veneer, the Hacker News crowd sniffs it out faster than a beagle in a meat market. So I figured the best way to appeal to this crowd was to make decisions about the UI that would confuse a few people but enormously please many more. From comments on the Hacker News thread, it looks like I didn't wait too long.

How the Server Handled the Traffic

Have I got some graphs to show you. I use munin, and god-love-it, it's fantastic for monitoring both server load and arbitrary data points. I watch CPU usage, load average, memory consumption, disk usage, db queries, IO throughput, and network throughput (both to external users and to internal private IPs).

I also have a whole suite of custom graphs to watch how many intelligence classifiers users are making, how many feeds and subscriptions users are adding, the rate of new users, premium users, old users returning, new users sticking around, and load times of feeds (rolling max, min, and average).
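A custom munin graph is just an executable that prints graph metadata when called with "config" and prints current values otherwise. Here's a minimal sketch of what one of these data-point plugins could look like; the settings module and import path are assumptions for illustration, not NewsBlur's actual plugin code:

#!python
import os
import sys

# Point Django at the project settings before importing models; the module
# name here is an assumption.
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'

def config():
    # Munin calls the plugin with "config" to learn how to draw the graph.
    print "graph_title NewsBlur feeds and subscriptions"
    print "graph_vlabel count"
    print "feeds.label feeds"
    print "subscriptions.label subscriptions"

def values():
    # Hypothetical import path, shown only to illustrate the idea.
    from apps.rss_feeds.models import Feed, UserSubscription
    print "feeds.value %s" % Feed.objects.count()
    print "subscriptions.value %s" % UserSubscription.objects.count()

if __name__ == '__main__':
    if len(sys.argv) > 1 and sys.argv[1] == 'config':
        config()
    else:
        values()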

Used to be that when a thundering herd of visitors came to NewsBlur, I'd have to watch the server nervously, as CPU would smack 400% (on a 4-core machine), the DB would thrash on disk, and inevitably some service or another would become overrun.

Let's see the CPU over the past week:

CPU - Past week

Spot the onslaught? NewsBlur's app server is only responsible for serving web requests, queueing feeds to be updated, and calculating unread counts. Needless to say, even with nearly a thousand new users, I offloaded so much of the CPU-intensive work to the task servers that I didn't have a single problem in serving requests.

This is a big deal. The task server was overwhelmed (partially due to a bug, but partially because I was fetching tens of thousands of new feeds), but everybody who wanted to see NewsBlur still could. Their web requests, and loading each feed, were near instantaneous. It was wonderful to watch it happen, knowing that everybody was being served.
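That split is the whole trick: web requests never fetch feeds themselves; they only queue work for the task servers to drain. As a sketch of the pattern, assuming a Celery-style task queue (the task body and the Feed methods here are illustrative, not NewsBlur's actual code):

#!python
from celery.decorators import task

@task()
def update_feed(feed_id):
    # Runs on a task server: fetch, parse, and store the feed's new stories.
    from apps.rss_feeds.models import Feed  # hypothetical import path
    feed = Feed.objects.get(pk=feed_id)
    feed.update()  # hypothetical fetch-and-parse method

def queue_feed_update(feed):
    # Runs on the app server: put the work on the queue and return immediately.
    update_feed.delay(feed.pk)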

CPU - Past year

Clearly, bugs have been fixed, and CPU-intensive work has been offloaded to task servers.

Load average - Past week

The load of the server went up and stayed up. Why did it not fall back down? Because the app server is calculating unread counts, it has more work to do even after the users are gone. This will become a pain point when one app server is not enough for the hundreds of concurrent users NewsBlur will soon have. But luckily, app servers are the easiest to scale out: each user only uses one app server at a time, so the data only has to be consistent on that one server while it propagates out to the other app servers (which may become db shards, too).

# of feeds and subscriptions - Past week

Economies of scale. The more feeds I have, the more likely a new subscription will be to a feed that already exists. I want that yellow line to run off into space, leaving the green line to grow linearly. That means fewer feeds to fetch.

Memory - Past week

Memory doesn't move, because I'm CPU bound. I'm not actually moving all that much more data around. I use gunicorn to rotate my web workers, so NewsBlur's few memory leaks get smoothed over.
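The worker rotation is just configuration. A minimal gunicorn config sketch, assuming the max_requests setting and with illustrative numbers rather than NewsBlur's actual values:

#!python
# gunicorn config file, passed in with -c
workers = 4          # roughly one worker per core
max_requests = 1000  # recycle a worker after 1,000 requests, releasing any leaked memory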

MongoDB Operations - Past week

I use MongoDB to serve stories. All indexes, no misses (there's a graph for this I won't bother showing). You can extrapolate traffic through this graph. Sure, you don't know average feeds per user, but you can take a guess.
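Keeping reads on an index mostly means indexing the way stories are actually fetched: per feed, newest first. A minimal pymongo sketch, with illustrative collection and field names rather than NewsBlur's actual schema:

#!python
from pymongo import Connection, DESCENDING

db = Connection()['newsblur']

# Stories are read per-feed, newest first, so index on (feed id, story date).
db.stories.ensure_index([('story_feed_id', 1), ('story_date', DESCENDING)])

# A typical read then walks the index instead of scanning the collection.
stories = db.stories.find({'story_feed_id': 42}) \
                    .sort('story_date', DESCENDING) \
                    .limit(30)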

My Way of Building NewsBlur

In order to build all of the separate pieces, I broke everything down into chunks that could be written down and crossed off. Literally written down. I have all of my priorities from the past 7 months. It's both a motivator and an estimator. I've learned how to estimate workload far better than back in May, when these priorities started. I finish more of what I try to start.

The way it works is simple: write down a priority for the month it's going to be built in, number it, then cross it off if it gets built before the end of the month. You get to go back and see how much you can actually do, and what it is you wanted to build. This means I'm setting myself up for a pivot every month, when I re-evaluate what it is I'm trying to build.

Google Reader as a Competitor

Lastly, what more could you ask for? A prominent competitor, known to every Gmail user as the empty inbox link. Feed reading is a complicated idea made simple by the fact that most users have already been exposed to a product that fulfills the need. By improving on that experience, I give users something they can directly compare, instead of asking them to learn NewsBlur on top of learning how to use RSS to track every site they read.

If your space has a major competitor and the barrier to entry is an OAuth import away, then consider yourself lucky. Anybody can try your product and become a paying customer in moments. It's practically a Lotus 1-2-3 to Excel import/export, except you don't need to buy the software before you try it out.

Going Forward

I'm half-way to being profitable. I only need 35 more premium subscribers. But so far, people are thrilled about the work I'm doing. Here are some tweets from a sample of users:

I'm e-mailing blogs, chatting with influential bloggers, and most importantly, continuing to launch new features and fix old ones. Thanks to Hacker News, I get to appeal to a graceful and sharp audience. And good looking.

I'm on Twitter as @samuelclay, and I'd love to hear from you.

Aug 22
2010

There are Two Paper Towel Rolls

It's almost time to restock, but the shelf can only hold 5 rolls, so you might as well restock at an appropriate time. But you have to choose which of the two remaining rolls is going in the business end of the side-gripping dispenser.

I can choose the larger of the two rolls. The Mega-Roll. Or I can choose the standard size, which is visibly puny compared to the bigger choice. If you know the answer, it seems obvious, and that's because it's an obvious answer.

But it's not so obvious if you start thinking about why you choose one in the first place. The larger roll is larger, but does that mean it should go first simply because it is preferable? The assumption is that you don't like changing rolls often and you don't think larger rolls look or work any differently than their smaller counterparts.

And maybe the smaller roll has preference, just to get it out of the way for more Megas when it's time to buy more. You need to remember to buy more. What causes you to remember to buy more? Absence or a dwindling stock. Once you get down to having one left, and it gets placed into service, you commit to memory that you need to stock up next time you remember. It's a modified version of The Game that you play with yourself, except that by remembering, you win.

The smaller roll goes in first, so that at exhaustion the larger roll has a longer opportunity for you to remember to buy more. Nothing shocks you more than an absence.

Jul 18
2010

Migrating Django from MySQL to PostgreSQL the Easy Way

I recently moved NewsBlur from MySQL to PostgreSQL for a variety of reasons, but most of all because I wanted to use connection pooling and database replication with Slony, and Postgres has a great track record and community. But all of my data was stored in MySQL, and there is no super easy way to move from one database backend to another.

Luckily, since I was using the Django ORM, Django 1.2's multi-db support let me use Django's serializers to move the data out of MySQL into JSON and then back into Postgres.

Unfortunately, if I were to use the command line, every single row of my models would have to be loaded into memory. Issuing a command like this:

#!python
python manage.py dumpdata --natural --indent=4 feeds > feeds.json

would take a long, long time, and it wouldn't even finish, since I don't have anywhere close to enough memory to make that work.

Luckily, the dumpdata and loaddata management commands are actually just wrappers around Django's internal serializers. I decided to iterate through my models and grab 500 rows at a time, serialize them, and then immediately de-serialize them (so Django could move from database to database without complaining).

#!python
import sys
from django.core import serializers

def migrate(model, size=500, start=0):
    # Copy a model's rows from the 'mysql' database to the default (Postgres)
    # database in batches of `size`, so memory usage stays flat.
    count = model.objects.using('mysql').count()
    print "%s objects in model %s" % (count, model)
    for i in range(start, count, size):
        print i,
        sys.stdout.flush()
        original_data = model.objects.using('mysql').all()[i:i+size]
        original_data_json = serializers.serialize("json", original_data)
        new_data = serializers.deserialize("json", original_data_json,
                                           using='default')
        for n in new_data:
            n.save(using='default')

migrate(Feed)

This assumes that you have both databases set up in your settings.py like so:

#!python
DATABASES = {
    'mysql': {
        'NAME': 'newsblur',
        'ENGINE': 'django.db.backends.mysql',
        'USER': 'newsblur',
        'PASSWORD': '',
    },
    'default': {
        'NAME': 'newsblur',
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'USER': 'newsblur',
        'PASSWORD': '',
    }
}

Note that I changed my default database to the Postgres database, because otherwise some management commands would still try to run against the default MySQL database. I may simply have been doing something wrong, but when I migrated, making Postgres the default worked.

I just run the short script in the Django shell and wait however long it takes. The script prints out which batch it's working on, so you can at least track its progress. It might still take a long, long time, but it's much less prone to crashing than dumpdata and loaddata.

A word of warning to those with large datasets. Instead of iterating straight through the table, see if you have a handier index already built on the table. I have a table with a million rows, but there are a few indices which can quickly find stories throughout the table, rather than having to order and offset the entire table by primary key. Adapt the following code to suit your needs, but notice that I use an index on the Feed column in the Story table.

#!python
import sys
from django.core import serializers

def migrate_with_model(primary_model, secondary_model, offset=0):
    # Walk the smaller, indexed model (feeds) and copy the related rows of the
    # big model (stories) one feed at a time, using the story_feed index.
    secondary_model_data = secondary_model.objects.using('mysql').all()
    for i, feed in enumerate(secondary_model_data[offset:].iterator()):
        stories = primary_model.objects.using('mysql').filter(story_feed=feed)
        print "[%s] %s: %s stories" % (i, feed, stories.count())
        sys.stdout.flush()
        original_data = serializers.serialize("json", stories)
        new_data = serializers.deserialize("json", original_data,
                                           using='default')
        for n in new_data:
            n.save(using='default')

migrate_with_model(primary_model=Story, secondary_model=Feed)

This makes it much faster, since I only have to sort a few hundred records at a time rather than the entire Story table and its million rows.

Also of note is that while all of the data made it into the Postgres tables, the sequences (the auto-incrementing id counters) were all off. Many were at 0. To remedy this easily, just set each sequence to the maximum id in its table, like so:

#!sql
select setval('rss_feeds_tag_id_seq', max(id)) from rss_feeds_tag;
select setval('analyzer_classifierauthor_id_seq', max(id)) from analyzer_classifierauthor;            
select setval('analyzer_classifierfeed_id_seq', max(id)) from analyzer_classifierfeed;              
select setval('analyzer_classifiertag_id_seq', max(id)) from analyzer_classifiertag;               
select setval('analyzer_classifiertitle_id_seq', max(id)) from analyzer_classifiertitle;             
select setval('analyzer_featurecategory_id_seq', max(id)) from analyzer_featurecategory;

I just made a quick text macro on the table names. This quickly set all of the sequences to their correct values.
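If you'd rather not hand-edit SQL, the same fix can be scripted from the Django shell. Here's a sketch, with an illustrative table list you'd replace with your own; Django's sqlsequencereset management command will also print this kind of SQL for you, one app at a time:

#!python
from django.db import connection

def reset_sequence(table):
    # Assumes Django's default <table>_id_seq sequence naming on Postgres.
    cursor = connection.cursor()
    cursor.execute("SELECT setval('%s_id_seq', (SELECT MAX(id) FROM %s))" % (table, table))

for table in ('rss_feeds_tag', 'analyzer_classifierauthor', 'analyzer_classifierfeed'):
    reset_sequence(table)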

This post has been translated to Spanish by Maria Ramos.

Samuel Clay is the founder of NewsBlur, a trainable and social news reader for web, iOS, and Android. He is also the founder of Turn Touch, a startup building hardware automation devices for the home. He lives in San Francisco, CA, but misses Brooklyn terribly. In another life in New York, he worked at the New York Times on DocumentCloud, an open-source repository of primary source documents contributed by journalists.

Apart from NewsBlur, his latest projects are Hacker Smacker, a friend/foe system for Hacker News, and New York Field Guide, a photo-blog documenting New York City's 90 historic districts. You can read about his past and present projects at samuelclay.com.

Follow @samuelclay on Twitter.

You can email Samuel at samuel@ofbrooklyn.com. He loves receiving email from new people. Who doesn't?
