SE6023 Lab1 Slides

tags: hadoop
author: rageNami

Lab1: Introduction to Hadoop

Important notices and the detailed tutorial are available at
./lab1


Topology overview


Base Hadoop Framework Modules


Ref:


Map/Reduce Workflow


Compression and Input Splits



Ambari

The login issue has been resolved. You can now log in with username ‘user’ and password ‘resu’.

Hadoop management web UI


YARN

A framework for job scheduling and cluster resource management


HDFS

A distributed file system designed to run on commodity hardware


HDFS Commands Mapping

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html


Common Linux command    Hadoop command
mkdir apple             hadoop fs -mkdir apple
ls apple                hadoop fs -ls apple
cat file.txt            hadoop fs -cat file.txt

Hadoop command                           Description
hadoop fs -copyFromLocal [src] [dest]    upload data from the local file system to HDFS
hadoop fs -put [src] [dest]              similar to -copyFromLocal
hadoop fs -copyToLocal [src] [dest]      download data from HDFS to the local file system
hadoop fs -get [src] [dest]              similar to -copyToLocal
hadoop fs -cp [src] [dest]               copy within HDFS

Reference

Hadoop programs are written in Java.


Lab Assignments

Start date: 2019/03/18 PM 00:00
Deadline: 2019/03/25 PM 12:00 (+2 weeks) -> 2019/04/08 PM 12:00

Grading Policy


Practice 1

Email Count


Description

Count how many email addresses belong to each domain (the part after the @).

For example, input:

abcde@gmail.com
fghijk@yahoo.com.tw
lmnop@msn.com

Output:

gmail.com 1
yahoo.com.tw 1
msn.com 1

Hint: Modify WordCount.java!
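
To make the hint concrete, here is a minimal sketch of the counting logic in plain Java, with no Hadoop classes involved. The class name EmailDomainCountSketch and its methods are hypothetical, and skipping lines without an '@' is an assumption about malformed input. In a real job the map and reduce steps would live in your modified WordCount.java Mapper and Reducer.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the map and reduce steps (hypothetical names, no Hadoop API).
class EmailDomainCountSketch {

    // "Map" step: turn one input line into a (domain, 1) pair.
    // Assumption: lines without a usable '@' are silently skipped.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String trimmed = line.trim();
        int at = trimmed.lastIndexOf('@');
        if (at >= 0 && at < trimmed.length() - 1) {
            out.add(new SimpleEntry<>(trimmed.substring(at + 1), 1));
        }
        return out;
    }

    // "Reduce" step: sum all counts that were emitted for one domain.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Stand-in for the framework's shuffle: group pairs by key, then reduce each group.
    static Map<String, Integer> countDomains(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = List.of("abcde@gmail.com", "fghijk@yahoo.com.tw", "lmnop@msn.com");
        countDomains(input).forEach((domain, n) -> System.out.println(domain + " " + n));
    }
}
```

The only change relative to a word count is the key: instead of emitting every token, the map step emits the text after the last '@'.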


Input File

Input files are located in /user/pdclab/lab1/practice1/input/

$ hadoop fs -ls /user/pdclab/lab1/practice1/input
a12345@gmail.com
dadssa@yahoo.com.tw
dsfsadfd@yahoo.com.tw
aaaaa@msn.com

Practice 2

Inverted Index


For each word, indicate which files it appears in and how many times it occurs in each file.

For example, input:

File1

MapReduce is simple

File2

MapReduce is powerful is simple

File3

Hello MapReduce bye MapReduce

Output:

Hello     file3.txt:1;
MapReduce file3.txt:2;file2.txt:1;file1.txt:1;
bye       file3.txt:1;
is        file2.txt:2;file1.txt:1;
powerful  file2.txt:1;
simple    file2.txt:1;file1.txt:1;

We’ll use our (hidden) test file to verify your program.
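
As a rough illustration of the data structure behind the output above, the following plain-Java sketch builds a word-to-postings map without any Hadoop classes. The class name InvertedIndexSketch, its methods, and whitespace tokenization are assumptions. The order of entries within a line is also an assumption: the sketch sorts by file name ascending, while the example above lists files in the opposite order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for an inverted index job (hypothetical names, no Hadoop API).
class InvertedIndexSketch {

    // Build word -> (file name -> number of occurrences in that file).
    // Assumption: words are separated by whitespace and are case-sensitive.
    static Map<String, Map<String, Integer>> build(Map<String, String> files) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String word : file.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // an empty file yields one empty token
                }
                index.computeIfAbsent(word, k -> new TreeMap<>())
                     .merge(file.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    // Render one posting list in the "file:count;" style used in the example.
    static String format(Map<String, Integer> postings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> p : postings.entrySet()) {
            sb.append(p.getKey()).append(':').append(p.getValue()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("file1.txt", "MapReduce is simple");
        files.put("file2.txt", "MapReduce is powerful is simple");
        files.put("file3.txt", "Hello MapReduce bye MapReduce");
        build(files).forEach((word, postings) ->
                System.out.println(word + "\t" + format(postings)));
    }
}
```

In a MapReduce version, one common approach is for the map step to emit (word, file name) pairs and for the reduce step to count occurrences per file, but other designs are possible.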


How to hand in

Save the output file (part-r-00000) for Practice 1 in the ~/lab1_practice1 directory and the source code (.java) for Practice 2 in the ~/lab1_practice2 directory. Both directories are on the local file system, not in HDFS. We will check them after the deadline.
You can check your Hand-in Status here. If your files do not show up, clean out the two directories and name your files the same way the other submissions do.


Handing in per group and handing in per person are both acceptable,
but handing in per person might cause undefined behavior (UB).


↓↓↓↓Tips↓↓↓↓

Notice

IntelliJ