SE6023 Lab1 Tutorial

tags: hadoop

Remote Hadoop Cluster Environment

Let us speak frankly before you get started:

Handy URLs

Recommended Developing Environment

Windows users: search for “Installation of OpenSSH For Windows”, or run a Linux distribution in a VM.
This guide uses Ubuntu Desktop unless another OS is mentioned.

Must-have skills

Basic Linux Commands / VSCode-SFTP Integration

Create Your First Hadoop Program: WordCount

If you are not familiar with the working environment, try a plain text editor first. This section covers all the commands you need to run the example code successfully.
More experienced users may want to use an IDE and Maven to speed up development.

Using Plain Text Editor

For example, I log in with the account apple and get the WordCount.java source code from here.

$ ssh apple@pdc19.csie.ncu.edu.tw

# Paste the WordCount source code into WordCount.java using scp, wget, or your favorite text editor (nano, vim, emacs, etc.)
$ nano WordCount.java
# ^X = Ctrl+X: exit nano
# ...

# Compile your program
$ hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class

# Create a folder in the Hadoop file system (fs)
$ hadoop fs -mkdir -p /user/apple/lab1/input

# Create input files in the local file system and upload them to the Hadoop fs
$ echo "Hello World Bye World" > file1
$ echo "Hello Hadoop Goodbye Hadoop" > file2
$ hadoop fs -copyFromLocal ./file1 /user/apple/lab1/input/file1
$ hadoop fs -copyFromLocal ./file2 /user/apple/lab1/input/file2

# Run the application
$ hadoop jar wc.jar WordCount /user/apple/lab1/input /user/apple/lab1/output

# Cat the result
$ hadoop fs -cat /user/apple/lab1/output/part-r-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
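As a quick sanity check, the same counts can be reproduced locally without a cluster. The sketch below is not how Hadoop executes the job; it only mimics the map, shuffle, and reduce steps with standard Unix tools on the two sample files. The function name local_wordcount and the temporary directory are assumptions made for illustration.

```shell
# Recreate the two sample inputs locally (same content as in the tutorial).
workdir=$(mktemp -d)
echo "Hello World Bye World" > "$workdir/file1"
echo "Hello Hadoop Goodbye Hadoop" > "$workdir/file2"

# Hypothetical helper that imitates the three phases of the WordCount job:
#   "map":     tr splits the text into one word per line
#   "shuffle": sort brings identical words next to each other
#   "reduce":  uniq -c counts each group, awk prints word, a tab, then the count
local_wordcount() {
  cat "$@" \
    | tr -s '[:space:]' '\n' \
    | grep -v '^$' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ printf "%s\t%s\n", $2, $1 }'
}

local_wordcount "$workdir/file1" "$workdir/file2"
```

For the two sample files this prints the same five word counts as the part-r-00000 listing above, so a mismatch on the cluster points at the program or the uploaded input rather than at the expected result.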

Using IntelliJ with Maven

About Maven
Maven dynamically downloads Java libraries and Maven plug-ins from one or more repositories such as the Maven 2 Central Repository, and stores them in a local cache…

Install Maven and IntelliJ

Create a Maven Project

Feel free to fill in

Add the following between <project> and </project> in pom.xml so that Maven automatically downloads the Hadoop-related dependencies.

<repositories>
    <repository>
        <id>apache</id>
        <url>http://maven.apache.org</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.jetbrains</groupId>
        <artifactId>annotations</artifactId>
        <version>RELEASE</version>
        <scope>compile</scope>
    </dependency>
</dependencies>

Create a new class called WordCount

Paste the WordCount code and click the refresh button. You will see Maven downloading the dependencies.

Generate .class (Optional)

Before compiling, we recommend setting the target bytecode version to 8 or higher.

Compile the code with Build > Build Project; WordCount.class will be generated under target/classes.
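To confirm which bytecode version a compiled class actually targets, you can read the version field from the class file header. The sketch below assumes a POSIX shell with od and awk. The helper names class_major_version and class_java_release, and the demo file, are hypothetical and not part of any tool. Major version 52 corresponds to Java 8, and each later Java release adds one.

```shell
# Print the class-file major version of a compiled .class file.
# Header layout: bytes 0-3 magic (0xCAFEBABE), bytes 4-5 minor version,
# bytes 6-7 major version (big-endian).
class_major_version() {
  od -An -tu1 -j6 -N2 "$1" | awk '{ print $1 * 256 + $2 }'
}

# Java release = major version - 44 (52 -> Java 8, 55 -> Java 11).
class_java_release() {
  echo $(( $(class_major_version "$1") - 44 ))
}

# Stand-in for a real class file so the sketch runs without a compiler:
# only the 8 header bytes are written (magic, minor 0, major 52).
demo=$(mktemp)
printf '\312\376\272\276\000\000\000\064' > "$demo"

class_major_version "$demo"   # prints 52
class_java_release "$demo"    # prints 8
```

With a real build you would pass the generated file instead, for example class_major_version target/classes/WordCount.class.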

Generate .jar (Optional)

Create Artifacts first

Then choose Build > Build Artifacts; WordCount.jar will be generated.

Troubleshooting

Using Eclipse

Google is your BFF.

Upload Files to the Remote Server (Optional)

Method 1: Using SCP (for Windows 10/Linux)

# Example
scp file1 file2 apple@pdc19.csie.ncu.edu.tw:./

Use WinSCP to do this graphically.
You can also integrate scp with VSCode.
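If you upload the JAR often, a small wrapper can save typing. This is only a sketch: the function name upload and its -n (dry-run) switch are made up for illustration, and the account and host are the example values used in this tutorial, so replace them with your own.

```shell
# Example values from this tutorial; replace with your own account and host.
REMOTE_USER="apple"
REMOTE_HOST="pdc19.csie.ncu.edu.tw"

# Hypothetical helper: copy the given files to the remote home directory.
# With -n as the first argument the scp command is only printed, not executed,
# which is useful for checking the command before running it.
upload() {
  dry_run=0
  if [ "$1" = "-n" ]; then
    dry_run=1
    shift
  fi
  if [ "$dry_run" = "1" ]; then
    echo scp "$@" "${REMOTE_USER}@${REMOTE_HOST}:./"
  else
    scp "$@" "${REMOTE_USER}@${REMOTE_HOST}:./"
  fi
}

# Show the command that would be run for the JAR built earlier.
upload -n wc.jar
```

Dropping the -n runs the actual copy, which prompts for your password unless you have set up SSH keys.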

Method 2: IntelliJ Ultimate

The Ultimate edition of IntelliJ can upload the compiled JAR directly to an SSH (SFTP) server. To set it up, follow the guides Remote Server Configuration and Uploading and Downloading Files.

However, you need a student account to activate the Ultimate edition. If you do not have one, follow this link and register with your NCU mail (or @g.ncu.edu.tw) address.

Method 3: Copy-and-paste in a text editor

Simple but always useful. Just launch nano/vim and paste with a right-click or Ctrl+Shift+V (in most Linux graphical terminals).

References for some useful tools

Vim

>>> Terminal and Vim Tutorial <<<
>>> Vim Tutorial <<<