Open source grid computing takes off

This has been fun to watch. The Hadoop team at Yahoo! is moving quickly to push the technology to reach its potential. They’ve now adopted it on one of the most important applications in the entire business, Yahoo! Search.

From the Hadoop Blog:

The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.

Some Webmap size data:

  • Number of links between pages in the index: roughly 1 trillion links
  • Size of output: over 300 TB, compressed!
  • Number of cores used to run a single Map-Reduce job: over 10,000
  • Raw disk used in the production cluster: over 5 Petabytes
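For a rough sense of what a Map-Reduce job over link data looks like, here is a minimal single-process sketch in Python. It is not the Webmap code and it does not use the Hadoop API. The function names and the three-page crawl are made up for illustration. It only shows the shape of the technique: a map step emits a (target, 1) pair per link, the pairs are grouped by key, and a reduce step sums each group to get inbound link counts.

```python
from collections import defaultdict


def map_phase(page, outlinks):
    """Emit one (target, 1) pair for every link found on a page."""
    for target in outlinks:
        yield target, 1


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one target page."""
    return key, sum(values)


def count_inbound_links(crawl):
    """Run map, shuffle, and reduce over a dict of page -> list of outlinks."""
    pairs = (
        pair
        for page, outlinks in crawl.items()
        for pair in map_phase(page, outlinks)
    )
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical three-page crawl, purely illustrative.
crawl = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}
inbound = count_inbound_links(crawl)
```

On a real cluster the map and reduce steps run in parallel across many machines, and the grouping happens over the network, which is where those thousands of cores come in.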

I’m still trying to figure out what all this means, to be honest, but Jeremy Zawodny helps to break it down. In this interview, he gets some answers from Arnab Bhattacharjee (manager of the Yahoo! Webmap Team) and Sameer Paranjpye (manager of our Hadoop development):

The Hadoop project is opening up a really interesting discussion around computing scale. A few years ago I never would have imagined that the open source world would be contributing software solutions like this to the market. I don’t know why I had that perception, really. Perhaps all the positioning by enterprise software companies to discredit open source software started to sink in.

As Jeremy said, “It’s not just an experiment or research project. There’s real money on the line.”

For more background on what’s going on here, check out this article by Mark Chu-Carroll, “Databases are hammers; MapReduce is a screwdriver”.

This story is going to get bigger, I’m certain.

Interactive journalism: An amazing homicide mashup

I had the pleasure of interviewing Sean Connelly and Katy Newton for YDN Theater recently with YDN videographer Ricky Montalvo. They created the amazing (and award-winning) crime data mashup Not Just A Number in partnership with The Oakland Tribune.

After getting tired of watching the homicide count for 2006 climb higher and higher, they decided to humanize the issue and talk to the families of the victims directly. They wanted to expose the story beneath the number and give a platform upon which the community could make the issue real.

Statistics can tell effective stories, but death and loss reach emotional depths beyond the power of any numerical exploration.

Sean and Katy posted recordings of the families talking about the sons, daughters, sisters and brothers that they lost. They integrated family photos, message boards, articles and more along with the interactive homicide map on the site to round out the experience, making it much more human than the traditional crime data mashup.

Here is the video (7 min.):

I also asked them if they had trouble getting data to make the site, and they said the Oakland Tribune staff were very supportive. There weren’t any usable open data sets coming out of the city, so they had to collect and enter everything themselves.

This, of course, is a very manual process. Given the challenge of getting the data, Sean and Katy didn’t see how the idea could possibly scale outside of the city of Oakland.

Somebody needs to take that on as a challenge.

I’m hopeful that efforts like Not Just A Number and the Open Government Data organization will be able to surface why it’s important for our government to open up access to the many data repositories they hold. And if the government won’t do it, then it should be the job of journalists and media companies to surface government data so that people can use it in meaningful ways.

This is a great example of how the Internet can empower people who otherwise have no voice or audience despite having profound stories to tell.

Investing in video at YDN

We’ve been playing around with video as a communications mechanism on Yahoo! Developer Network for a while now. Our casual attempts to generate interest in Yahoo! technologies through interviews, screencasts, tech talks, etc. have worked really well.

So, we hired a full time videographer/filmmaker named Ricky Montalvo and got him some decent gear to push the envelope a little further. And today we rolled out YDN Theater on the YDN web site to establish a home for all the work he has been producing.

The journey here started with a pretty lame but surprisingly successful screencast that Dan Theurer and I did to explain how browser-based authentication worked. It was blurry. We made mistakes. The subject matter was pretty abstract. And neither Dan nor I have particularly strong camera presence.

Regardless, it has been viewed over 19,000 times, so far.

We kept pushing with new types of videos such as partner showcases with people like Joyce Park, Adam Rifkin, and Leah Culver. We brought the camera to our various Hack Days and produced a particularly funny recap of the London event. And we recorded tech talks from our own staff at Yahoo! and presentations from guest speakers like Grady Booch, Joe Hewitt and David Weinberger.

By the time we found Ricky, we knew we were building a program that was going to be really interesting. Yet at that point we had hardly spent any money beyond a few cheap cameras and some basic editing tools, including Camtasia.

The success to date I think has been in large part due to the fact that we haven’t tried to pimp out our videos with any professional plastic gloss or staged demos. We also try to have a little fun with them. Jeremy Zawodny is a really good interviewer. His unassuming yet pointed questions get people to say things they otherwise wouldn’t include on any planned script. And the fact that the videos are raw with few cuts or edits makes them feel real, too.

There are some good video program ideas floating around here that could be a lot of fun, but now we’re torn between how much time we want to spend building out the video offering and how much time we want to spend on all the other ways the team can evangelize Yahoo! technologies.

I’m not sure how to measure that decision just yet, but as long as people are consuming these shows we do with such enthusiasm we’ll probably tilt the scale in favor of doing more video whenever possible.

The Hack Day London Video

I’m heading back home from Hack Day London tomorrow. What a spectacular event.

I did my best to capture the behind-the-scenes action this time, as I think the Hack Day event process itself is really interesting, too. Of course, sharing the day-to-day work would be frighteningly boring, but you can at least get a sense of what happens the day or so before Hack Day starts in this video here:

What’s easy to forget is that the event process itself is treated like a hack. We break the rules. We invent on the fly. We don’t know if it will work.

Anyhow, there’s more to come, I’m sure.

(Apologies for the horrible editing in this video…it’s my own hack contribution…unpolished, experimental, and a little bit broken.)

The Yahoo! Mail Web Service screencasts

The Yahoo! Mail team rolled out a groundbreaking service today — the Yahoo! Mail Web Service. As Chad says in the announcement:

“With the Yahoo! Mail Web Service, you can connect to the core mail platform to perform typical mailbox tasks for premium users such as list messages and folders, and compose and send messages (you can also build mail preview tools for free users with limited Web Service functionality). In other words, developers outside of Yahoo! can now build mail tools or applications on the same infrastructure we use to build the highly-scaled Yahoo! Mail service that serves nearly 250 million Yahoo! Mail users today.”

Very cool!

Jeremy Zawodny and I spent some time with both lead engineer Ryan Kennedy and Hack Day hacker Leah Culver to screencast the tools they each built using the Yahoo! Mail Web Service (Mail Search and Flickr Postcard). Jeremy asked the hard questions while I recorded and produced the video. You can see them both below.

With this screencast we decided to also offer downloadable versions in addition to the web-ready and shareable Yahoo! Video versions. We debated a bit about what downloadable format to offer and decided the iPod-friendly M4V was the best choice. The best solution is probably to offer all formats and post on all the video sharing sites, but we didn’t have time for that.

Here is the full download for Ryan’s demo, and here is Leah’s.

And here they are embedded:

Any experience you have or thoughts on how we should share these types of videos would be welcome.

How to layer postproduction visuals in a screencast

Jeremy Zawodny and I produced another screencast last week, a look inside Pipes with Pasha Sadri and Ed Ho. The Pipes guys shared their insights while we asked a few questions and recorded the screen and the audio.

I’ve been trying to improve on each screencast with a new trick or some efficiency. This time I tried to mix in some relevant still shots in the editing process to support the voice over.

Camtasia was a little stickier here but still very easy to use. After setting up the production and editing out some bits, I used SnagIt to capture web site screen shots and crop them to focus on a small area. I imported them into the production. Then I added the screen shots to the Picture-in-picture track. Lastly, I zoomed in on each PIP file so it took up the whole screen and slid it along the timeline to get the right positioning with the audio.

There’s a segment toward the end of the video where Pasha is saying some really interesting stuff, but I didn’t have anything relevant to splice in visually, so I didn’t quite get this right. But you’ll see that it works nicely in certain parts of the video. It keeps the pace going while people are talking. It also allows you to grab additional media that you didn’t think to pull up while recording the original video.

For example, Pasha mentions that there are several sites that have begun creating tutorials for Pipes, so I grabbed screenshots of 3 that I found and layered them in.

I don’t think this is what the software was intended to do, so please tell me if you know a better way to accomplish this same effect. Here is the screencast which is also available on Yahoo! Video:

Some fun visuals

Presentation Zen offered up a great collection of visuals today that I thought were worth pointing to here, including this stunning use of typography running in parallel with Samuel L. Jackson’s briefcase interrogation in Pulp Fiction (not exactly work safe…turn down the volume):

A magazine I would love to read

There’s a magazine that I’d love to read if someone published it (yes, the print kind). Of course, it’s about the Internet. It’s about the stack that makes up the Internet, the platform or, as many people are calling it, the Internet Operating System. It’s mostly technology. But it’s a little bit business. And it’s definitely artful.

It’s not Business 2.0 or Red Herring. It’s not The Industry Standard, though I’d be happy to read that again, too. Those were/are too business-focused and often misunderstand the wider impact of many breakthroughs.

It challenges the people in positions to change things to make changes that matter. It exposes the advances in the market that have negative repercussions for the Internet as a platform for good.

It’s critical and hard-hitting. It’s accurate. And it is therefore trusted and respected.

It isn’t first to report on anything. It might even be last, but it gets the story right.

It dives into services like Pipes, EC2, and Google Apps. It analyzes algorithms, data formats, developer tools, and interactive design. It studies human behaviors, market trends, new business models, leadership strategies and processes.

It’s not about startups, but it may be about why VCs like certain startups. I love the fact that Brad Burnham of Union Square Ventures disclosed the broader motivations for investing in AdaptiveBlue:

“We are particularly excited about the prospect of AdaptiveBlue developing tools that allow users to build the semantic web from the bottom-up to fill in the gaps and correct the top-down approach when necessary.”

This magazine should be printed monthly with lots of possibilities online that may actually be more successful in the long term. (I can imagine the print magazine turning into a sort of marketing vehicle for the web site.)

It includes longer deep-dive articles that have been thoroughly researched and copyedited. The editors are paid very well because they are experienced and talented. It also includes samples from the blogosphere and insights from contributors and participants who care deeply about the subject. There are intelligent interviews of people who are innovating and actually doing important things. There are insightful case studies of both the methods and results of certain technology breakthroughs. And there are columns that remind us to keep it real.

What I want from a new magazine about the Internet Operating System is to understand the technology breakthroughs and their meaning in the context of the history of the Internet. I want to know what we can learn from art and innovation online to understand what lies ahead. The business model breakthroughs matter hugely, but I think they often matter as a result of an innovative technology rather than serve as a driver.

How is the Internet as a platform, operating system, network — whatever you want to call it — evolving? Who and what is influencing change? What are the trends that indicate this progression? How do new online developments impact communication, governments and social organizing principles?

Of course, a lot of this is out on the web in bits and pieces. But I’m too lazy to go through my entire feedreader and follow all the links to all the interesting stories out there. Maybe someone could invent a personalized and distributed Digg that surfaced what mattered to me more efficiently. But even then, I’d still pay a subscription fee and happily browse through endemic advertising for someone to assemble something thoroughly thought through, designed nicely and printed on my favorite portable reading medium — paper (recycled, of course).

And I’d read it in part because I would know everyone in the business would be reading it, too. At least, I suspect I’m not alone in wanting this…?