EMR 4.0.0 in AWS Data Pipeline

I like AWS Data Pipeline and love to start EMR clusters with it. Unfortunately, it is currently not possible to use the EMR-4.0.0 release when starting an EMR cluster from a pipeline. The EmrCluster options currently only offer amiVersion, which is a problem if you want a Hadoop version newer than 2.4.0: nothing newer is supported by the highest amiVersion, 3.9.0.

This is the most up-to-date configuration you can get:

"hadoopVersion": "2.4.0",
"amiVersion": "3.9.0"

What does Amazon say about this? I only found this comment in the forum:

currently emr-4.0.0 is not supported on Datapipeline, we are working on it but at the moment I cannot provide and ETA on this.

Multiple hosts for assets in Rails

  • One other quick win Rails 2.0 will give you is multiple hosts for assets. Browsers will only have two concurrent connections open for any single host, but an easy way around that is to use multiple subdomains that resolve to the same domain.
  • config.action_controller.asset_host = 'assets%d.YOUR_DOMAIN.com'
  • Now, if your page has lots and lots of assets (javascript includes, linked stylesheets, images, and so forth) page download times will decrease when you’re able to fool the browser into thinking it’s talking to multiple hosts (which, again, get 2 concurrent requests each), while it is in fact only talking to a single host.
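
To make the "%d" placeholder less abstract, here is a rough sketch of how a template like the one above could be expanded into one of four hosts per asset path. It is not the Rails implementation: the helper names asset_host_for and asset_url and the example.com domain are made up for illustration, and hashing the path with CRC32 modulo 4 is an assumption about how the bucket is chosen.

```ruby
require "zlib"

# Hypothetical helper, not part of Rails: expands a "%d" host template by
# hashing the asset path into one of four buckets (0..3). The same path
# always lands on the same host, so browser caching keeps working.
def asset_host_for(source, template = "assets%d.example.com")
  return template unless template.include?("%d")
  template % (Zlib.crc32(source) % 4)
end

# Hypothetical helper: builds the full URL for an asset path.
def asset_url(source, template = "assets%d.example.com")
  "http://#{asset_host_for(source, template)}#{source}"
end

%w[/images/logo.png /stylesheets/main.css /javascripts/app.js].each do |path|
  puts asset_url(path)
end
```

All of the generated host names (assets0 through assets3 in this sketch) have to resolve to the same server in DNS. The point of deriving the number from the path rather than picking it at random is that a given asset keeps a stable URL across page loads.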

Google Data

  • GData is a new protocol based on the Atom 1.0 and RSS 2.0 syndication formats, plus the Atom Publishing Protocol.
  • All sorts of services can provide GData feeds, from public services like blog feeds or news syndication feeds to personalized data like email or calendar events or task-list items.
  • GData provides a general model for feeds, queries, and results. You can use it to send queries and updates to any service that has a GData interface.
  • Google Data API Supports JSON