Thursday, July 31, 2014

Microsoft Big Data– looking into the HDInsight Emulator

As outlined in a previous post Introducing Windows Azure HDInsight, HDInsight is a pay-as-you go solution for Hadoop-based big data processing running on Windows Azure. But you don’t even have to use Azure to develop and test your code. Microsoft has an HDInsight Emulator that runs locally on your machine and that simulates the HDInsight environment in Azure locally.

HDInsight emulator is installed through the Microsoft Web Platform Installer (WebPI) a 64-bit version of Windows (Windows 7 SP1, Windows Server 2008 R2 SP1, Windows 8 or Windows Server 2012). It also seems to install without issues on Windows 8.1.


After you have installed it you will see three shortcuts on your desktop: Hadoop command line, Hadoop Name Node status and Hadoop MapReduce Status. You should also see three folders created on your local disk (Hadoop, HadoopFeaturePackSetup and HadoopInstallFiles) as well as a number of new windows services being added. The HDInsight emulator is actually a single node HDInsight cluster including both name and data node and use local CPU for compute.



One of the sample Mapreduce jobs which is included is WordCount. To try it out we will first download the complete works of Shakespeare from project Gutenberg. The file is called pg100.txt and we copy it into the folder c:\hadoop\hadoop-1.1.0-SNAPSHOT – next just follow the steps as outlined in Run a word count MapReduce job. For most HDFS-file specific commands use hadoop fs –<command name>  where most of the commands are Linux system commands. For a full list of support Hadoop file system commands, type hadoop fs at the Hadoop command prompt, for a full explanation check out the Hadoop FS Shell Guide. The Hadoop command line shell can also run other Hadoop applications such Hive, Pig, HCatalog, Sqoop, Oozie,etc … To actually run the Wordcount sample job on the HDInsight emulator run:

hadoop jar hadoop-examples.jar wordcount /user/hdiuser/input /user/hdiuser/output


Once the job is submitted you can track its progress using MapReduce status webpage. When the job is finished, the results are typically stored in a fixed file named part_r_NNNNN where N is the file counter.


A second sample which is also included is estimating PI using Monte Carlo Simulation – check out this excellent video explaining the principle behind estimating PI using Monte Carlo Simulation. For an explanation of how to use this with an actual HDInsight cluster check out the Pi estimator Hadoop sample in HDInsight

hadoop jar hadoop-examples.jar pi "16","10000000"


References:
Tags van Technorati: ,,,,

No comments: