SE6023 Lab1 Slides
author: rageNami
Lab1: Introduction to Hadoop
Important notices and the detailed tutorial are available at
./lab1
Topology overview
Base Hadoop Framework Modules
Ref:
Map/Reduce Workflow
Compression and Input Splits
- Uncompressed
- Compressed
- 1 GB .gz (DEFLATE) file
- Hadoop recognizes the compression format by looking at the file extension
- gzip does not support splitting, so a single map processes all eight HDFS blocks
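The effect of compression on the number of map tasks can be sketched with simple arithmetic. This is a simplified illustration, not Hadoop's actual split logic: the class and method names are hypothetical, the 128 MB block size is an assumption chosen so that a 1 GB file occupies eight blocks, and details such as configurable split sizes are ignored.

```java
public class SplitEstimate {
    /**
     * Rough number of map tasks for one input file:
     * one per block when the format is splittable, otherwise a single task.
     */
    static long mapTasks(long fileBytes, long blockBytes, boolean splittable) {
        if (fileBytes <= 0) {
            return 0;
        }
        if (!splittable) {
            return 1; // e.g. a .gz file: one map reads every block
        }
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        long block = 128L * 1024 * 1024; // assumed HDFS block size
        System.out.println("uncompressed: " + mapTasks(oneGiB, block, true));
        System.out.println("gzip:         " + mapTasks(oneGiB, block, false));
    }
}
```

Under these assumptions the uncompressed file is handled by eight map tasks, while the gzip file is handled by one.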
Hadoop-related Subprojects
- Pig: a high-level language for data analysis
- HBase: table storage for semi-structured data
- ZooKeeper: coordination for distributed applications
Ambari
Login issue has been resolved. Now you only need to log in with username ‘user’ and password ‘resu’.
Hadoop management web UI
YARN
A framework for job scheduling and cluster resource management
HDFS
A distributed file system designed to run on commodity hardware
HDFS Commands Mapping
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
| Common Linux command | Hadoop command |
| --- | --- |
| mkdir apple | hadoop fs -mkdir apple |
| ls apple | hadoop fs -ls apple |
| cat file.txt | hadoop fs -cat file.txt |

| Hadoop command | Description |
| --- | --- |
| hadoop fs -copyFromLocal [src] [dest] | upload data from the local file system to HDFS |
| hadoop fs -put [src] [dest] | similar to -copyFromLocal |
| hadoop fs -copyToLocal [src] [dest] | download data from HDFS to the local file system |
| hadoop fs -get [src] [dest] | similar to -copyToLocal |
| hadoop fs -cp [src] [dest] | copy within HDFS |
Reference
Hadoop programs can be written in Java.
Lab Assignments
Start date: 2019/03/18 PM 00:00
Deadline: 2019/03/25 PM 12:00, extended by 2 weeks to 2019/04/08 PM 12:00
Grading Policy
- Code Quality
- Plagiarism
- Submission Time
Practice 1
Email Count
Description
Count the number of email addresses that belong to each domain.
For example, input:
abcde@gmail.com
fghijk@yahoo.com.tw
lmnop@msn.com
Output:
gmail.com 1
yahoo.com.tw 1
msn.com 1
Hint: Modify WordCount.java!
Input File
Input files are located in /user/pdclab/lab1/practice1/input/
$ hadoop fs -ls /user/pdclab/lab1/practice1/input
a12345@gmail.com
dadssa@yahoo.com.tw
dsfsadfd@yahoo.com.tw
aaaaa@msn.com
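The counting logic for Practice 1 can be sketched in plain Java before moving it into a mapper and reducer. This is only an illustration that runs without Hadoop: the class and method names are hypothetical, and treating everything after the last '@' as the domain is an assumption about the input format.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EmailDomainCount {
    /** "Map" step: the text after the last '@', or null if the line is not an address. */
    static String domainOf(String line) {
        String s = line.trim();
        int at = s.lastIndexOf('@');
        if (at <= 0 || at == s.length() - 1) {
            return null; // no '@', nothing before it, or nothing after it
        }
        return s.substring(at + 1);
    }

    /** "Reduce" step: sum one occurrence per address, grouped by domain. */
    static Map<String, Integer> countDomains(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String domain = domainOf(line);
            if (domain != null) {
                counts.merge(domain, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
                "a12345@gmail.com",
                "dadssa@yahoo.com.tw",
                "dsfsadfd@yahoo.com.tw",
                "aaaaa@msn.com");
        countDomains(input).forEach((domain, n) -> System.out.println(domain + "\t" + n));
    }
}
```

In a real job, domainOf would correspond to what the mapper emits as its key, and the summation to what the reducer does for each key.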
Practice 2
Inverted Index
For each word, list the files it appears in and its frequency of occurrence in each file.
For example, input:
File1
MapReduce is simple
File2
MapReduce is powerful is simple
File3
Hello MapReduce bye MapReduce
Output:
Hello file3.txt:1;
MapReduce file3.txt:2;file2.txt:1;file1.txt:1;
bye file3.txt:1;
is file2.txt:2;file1.txt:1;
powerful file2.txt:1;
simple file2.txt:1;file1.txt:1;
We’ll use our (hidden) test file to verify your program.
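The expected output format can be explored with a plain-Java sketch that builds the index in memory. It does not use Hadoop, the class and method names are hypothetical, and several details are assumptions: words are split on whitespace, a tab separates the word from its postings, and postings are listed in reverse file-name order to resemble the example above.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {
    /** Builds word -> (file name -> occurrences) from file name -> file content. */
    static Map<String, Map<String, Integer>> build(Map<String, String> files) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String word : file.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(word,
                                w -> new TreeMap<String, Integer>(Comparator.reverseOrder()))
                        .merge(file.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    /** Formats one output line as "word<TAB>file:count;file:count;". */
    static String format(String word, Map<String, Integer> postings) {
        StringBuilder sb = new StringBuilder(word).append('\t');
        postings.forEach((file, n) -> sb.append(file).append(':').append(n).append(';'));
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("file1.txt", "MapReduce is simple");
        files.put("file2.txt", "MapReduce is powerful is simple");
        files.put("file3.txt", "Hello MapReduce bye MapReduce");
        build(files).forEach((word, postings) -> System.out.println(format(word, postings)));
    }
}
```

In a MapReduce version, one possible design is for the mapper to emit the word as the key and the file name as the value, leaving the per-file counting and formatting to the reducer.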
How to hand in
Save the output file (part-r-00000) for Practice 1 and the source code (.java) for Practice 2 in the ~/lab1_practice1 and ~/lab1_practice2 directories, respectively (on the local file system, not in HDFS). We will check them after the deadline.
You can check your Hand-in Status here. If you cannot find your files, clean up the two directories and name your files the way others do.
Handing in per group or per person is acceptable,
but handing in per person might cause undefined behavior (UB).
Notice
- The output types of a combiner must match the output types of the mapper. The declarations below break this rule: the mapper emits <Text, IntWritable>, but the combiner emits <Text, Text>.
public static class Map
    extends Mapper<LongWritable, Text, Text, IntWritable>
public static class Combine
    extends Reducer<Text, IntWritable, Text, Text>
public static class Reduce
    extends Reducer<Text, Text, Text, FloatWritable>
- Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
This error means the key the mapper actually emitted (LongWritable) is not the key class the job expects (Text). A likely cause is that map() is not overridden correctly, so the default map() passes the LongWritable input key through unchanged. Check the mapper and the job setup:
public class WordCount {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, Text> {}
    public static class Combine
            extends Reducer<Text, Text, Text, Text> {}
    public static class Reduce
            extends Reducer<Text, Text, Text, Text> {}
    public static void main(String[] args) throws Exception {
        ...
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        ...
        job.setCombinerClass(Combine.class);
        job.setReducerClass(Reduce.class);
        ...
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        ...
    }
}
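The combiner rule in the first notice above follows from the fact that the framework may run a combiner zero, one, or several times between map and reduce, so its output has to look exactly like map output. The plain-Java sketch below illustrates this with a summing function whose input and output are both (String, Integer) pairs. The class and method names are hypothetical, and no Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    /**
     * Sums the values for each key. Input and output share one pair type,
     * so the function can be applied to its own output any number of times.
     */
    static List<Map.Entry<String, Integer>> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("is", 1), Map.entry("simple", 1), Map.entry("is", 1));

        // Reduce only, with no combiner pass.
        List<Map.Entry<String, Integer>> direct = sumByKey(mapOutput);
        // One combiner pass, then reduce.
        List<Map.Entry<String, Integer>> combined = sumByKey(sumByKey(mapOutput));

        System.out.println(direct);
        System.out.println(direct.equals(combined));
    }
}
```

If the combiner returned a different value type than the mapper, the second call would not even compile here, which mirrors the type error a Hadoop job would report at run time.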
IntelliJ
- Enable quick documentation on mouse move: press Ctrl+Shift+A, then type settings