SE6023 Lab3 HBase

Introducing HBase

Data Model and APIs

Each cell in an HBase table is addressed by four coordinates, which together map to a value:

row  timestamp  columnFamily  columnQualifier  value

For further details, please read the official HBase documentation.

Example of a table

row     timestamp  columnFamily  columnQualifier  value
warmpc  t1         CPU           BRAND            AMD
warmpc  t2         CPU           ITEM             TR2-2990WX
warmpc  t3         CPU           PRICE            57999
warmpc  t1         HD            BRAND            TOSHIBA
warmpc  t2         HD            ITEM             MG06ACA10TE-10TB
warmpc  t3         HD            PRICE            11800
warmpc  t1         RAM           BRAND            GSKILL
warmpc  t2         RAM           ITEM             F4-3200C14Q-32GTZRX
warmpc  t3         RAM           PRICE            15990

HBase Hands-on

Note

HBase interactive Shell

Although we will focus on using Java and MapReduce to work with HBase, you can use the HBase shell for debugging and testing.

$ hbase shell
hbase(main)> help
...
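For example, once you have created the hands-on table (a000000000 below; substitute your own table name), a few standard shell commands let you inspect and manipulate it directly:

hbase(main)> list                                             ## list all tables
hbase(main)> describe 'a000000000'                            ## show the table's column families
hbase(main)> put 'a000000000', 'testrow', 'CPU:BRAND', 'AMD'  ## write one cell
hbase(main)> get 'a000000000', 'testrow'                      ## read back one row
hbase(main)> scan 'a000000000'                                ## dump every cell in the table
hbase(main)> delete 'a000000000', 'testrow', 'CPU:BRAND'      ## remove the test cell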

Introduction to the Available Commands

HBase Cluster Status

You can check the cluster status from the HBase web UI at https://pdc19.csie.ncu.edu.tw:18766 (use your UNIX password to log in).

Create Table and Columns using Java

Upload the source (name it createHBaseTable.java) to your home directory:

/*** createHBaseTable.java ***/
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class createHBaseTable {
	public static void main(String[] args) throws Exception {
		if (args.length < 2) {
			System.out.println("Arguments: [newTableName] [Family1] [Family2] ... [FamilyN]");
			System.out.println("Existing table with the same name will be deleted!");
			System.exit(1);
		}
		TableName tableName = TableName.valueOf(args[0]);
		String newColumnFamilies[] = Arrays.copyOfRange(args, 1, args.length);
		ArrayList<ColumnFamilyDescriptor> newColumnFamilyDescriptors = new ArrayList<>();
		TableDescriptorBuilder tableDescriptorBuilder = TableDescriptorBuilder.newBuilder(tableName);

		Configuration config = HBaseConfiguration.create();
		Connection connection = ConnectionFactory.createConnection(config);
		Admin admin = connection.getAdmin();

		/* Delete the table if it exists */
		if (admin.tableExists(tableName)) {
			System.out.println(tableName + " exists");
			System.out.println("disabling " + tableName + "...");
			admin.disableTable(tableName);
			System.out.println("deleting " + tableName + "...");
			admin.deleteTable(tableName);
		}

		/* Create column families */
		for (String newColumnFamily : newColumnFamilies) {
			newColumnFamilyDescriptors
					.add(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes(newColumnFamily)).build());
		}
		tableDescriptorBuilder.setColumnFamilies(newColumnFamilyDescriptors);

		System.out.println("creating " + tableName + "...");
		TableDescriptor tableDescriptor = tableDescriptorBuilder.build();
		try {
			admin.createTable(tableDescriptor);
		} finally {
			admin.close();
			connection.close();
		}
	}
}

Build it, then execute it with HBase (the backtick ` performs command substitution: the shell replaces `hbase classpath` with that command's output, i.e. the full HBase classpath):

$ javac -cp `hbase classpath` createHBaseTable.java
## Expect createHBaseTable.class to appear in the same directory
$ hbase createHBaseTable a000000000 CPU RAM HD
... INFO  [main] client.HBaseAdmin: Operation: CREATE, Table Name: default:a000000000, .... completed

You will now be able to see the created table at https://pdc19.csie.ncu.edu.tw:18766/tablesDetailed.jsp (log in with your UNIX password!).

Create Table Data

Upload the source (name it putDataToHBaseTable.java) to your home directory:

/*** putDataToHBaseTable.java ***/
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class putDataToHBaseTable {
	public static void main(String[] args) throws Exception {
		if (args.length != 5) {
			System.out.println("Arguments: [TableName] [Row] [Family] [Qualifier] [Value]");
			System.exit(1);
		}
		TableName tableName = TableName.valueOf(args[0]);
		byte[] rowKey = Bytes.toBytes(args[1]);
		byte[] family = Bytes.toBytes(args[2]);
		byte[] qualifier = Bytes.toBytes(args[3]);
		byte[] value = Bytes.toBytes(args[4]);

		Configuration config = HBaseConfiguration.create();
		Connection connection = ConnectionFactory.createConnection(config);
		Table table = connection.getTable(tableName);

		Put put = new Put(rowKey);
		put.addColumn(family, qualifier, value);
		try {
			table.put(put);
		} finally {
			table.close();
			connection.close();
		}
	}
}

Build it:

$ javac -cp `hbase classpath` putDataToHBaseTable.java

Using a text editor, write a simple shell script and save it as putToTable.sh to make the puts easier:

#!/bin/bash
table="a000000000"
hbase putDataToHBaseTable $table warmpc CPU BRAND AMD
hbase putDataToHBaseTable $table warmpc CPU ITEM TR2-2990WX
hbase putDataToHBaseTable $table warmpc CPU PRICE 57999
hbase putDataToHBaseTable $table warmpc HD BRAND TOSHIBA
hbase putDataToHBaseTable $table warmpc HD ITEM MG06ACA10TE-10TB
hbase putDataToHBaseTable $table warmpc HD PRICE 11800
hbase putDataToHBaseTable $table warmpc RAM BRAND GSKILL
hbase putDataToHBaseTable $table warmpc RAM ITEM F4-3200C14Q-32GTZRX
hbase putDataToHBaseTable $table warmpc RAM PRICE 15990

Put:

$ chmod +x ./putToTable.sh
$ ./putToTable.sh

Warmpc has nothing to do with the famous computer components seller in Taiwan.
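A side note on performance: every line of putToTable.sh launches a fresh JVM and opens a new HBase connection. Purely as a sketch (not required for the lab; the class putAllDataToHBaseTable is ours, not part of the handout), the same cells could be written through one connection with the batched Table.put(List<Put>) overload:

/*** putAllDataToHBaseTable.java — hypothetical batched variant of putDataToHBaseTable ***/
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class putAllDataToHBaseTable {
	public static void main(String[] args) throws Exception {
		String[][] records = {               // {row, family, qualifier, value}
				{ "warmpc", "CPU", "BRAND", "AMD" },
				{ "warmpc", "CPU", "ITEM", "TR2-2990WX" },
				{ "warmpc", "CPU", "PRICE", "57999" },
				// ... the remaining records from putToTable.sh ...
		};
		Configuration config = HBaseConfiguration.create();
		try (Connection connection = ConnectionFactory.createConnection(config);
				Table table = connection.getTable(TableName.valueOf("a000000000"))) {
			List<Put> puts = new ArrayList<>();
			for (String[] r : records) {
				Put put = new Put(Bytes.toBytes(r[0]));
				put.addColumn(Bytes.toBytes(r[1]), Bytes.toBytes(r[2]), Bytes.toBytes(r[3]));
				puts.add(put);
			}
			table.put(puts);                 // one batched request instead of one JVM per cell
		}
	}
}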

Get Value in Table

Upload the source (name it getHBaseTableValue.java) to your home directory:

/*** getHBaseTableValue.java ***/
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class getHBaseTableValue {
	public static void main(String[] args) throws Exception {
		if (args.length != 4) {
			System.out.println("Arguments: [TableName] [Row] [Family] [Qualifier]");
			System.exit(1);
		}
		TableName tableName = TableName.valueOf(args[0]);
		byte[] rowKey = Bytes.toBytes(args[1]);
		byte[] family = Bytes.toBytes(args[2]);
		byte[] qualifier = Bytes.toBytes(args[3]);

		Configuration config = HBaseConfiguration.create();
		Connection connection = ConnectionFactory.createConnection(config);
		Table table = connection.getTable(tableName);

		Get get = new Get(rowKey);
		try {
			Result rowResult = table.get(get);
			byte[] value = rowResult.getValue(family, qualifier);
			System.out.println("Value: " + Bytes.toString(value));
		} finally {
			table.close();
			connection.close();
		}
	}
}

Build it, then execute it with HBase:

$ javac -cp `hbase mapredcp`:`hadoop classpath` getHBaseTableValue.java
$ hbase getHBaseTableValue a000000000 warmpc CPU ITEM
...
Value: TR2-2990WX
...
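Get fetches a single row; to walk every row of a table, the client API provides Scan and ResultScanner. Below is a minimal sketch (the class scanHBaseTable is ours, not part of the handout) that prints every cell of the table given as its first argument:

/*** scanHBaseTable.java — hypothetical companion to getHBaseTableValue ***/
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class scanHBaseTable {
	public static void main(String[] args) throws Exception {
		// Usage: hbase scanHBaseTable [TableName]
		Configuration config = HBaseConfiguration.create();
		try (Connection connection = ConnectionFactory.createConnection(config);
				Table table = connection.getTable(TableName.valueOf(args[0]));
				ResultScanner scanner = table.getScanner(new Scan())) {
			for (Result row : scanner) {            // one Result per row key
				for (Cell cell : row.rawCells()) {  // one Cell per stored column version
					System.out.println(Bytes.toString(CellUtil.cloneRow(cell)) + " "
							+ Bytes.toString(CellUtil.cloneFamily(cell)) + ":"
							+ Bytes.toString(CellUtil.cloneQualifier(cell)) + " = "
							+ Bytes.toString(CellUtil.cloneValue(cell)));
				}
			}
		}
	}
}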

Working with MapReduce

HBase can be accessed from Hadoop MapReduce: you can read data from HDFS and write it to HBase, and vice versa.

From HDFS to HBase

Upload the data below (name it hbase.txt; note that the file has one "row family qualifier value" record per line) to your Hadoop home directory on HDFS:

coldpc CPU BRAND AMD
coldpc CPU ITEM R7-2700X
coldpc CPU PRICE 10490
coldpc RAM BRAND KINGSTON
coldpc RAM ITEM HyperX-D43200-8Gx4
coldpc RAM PRICE 6988
coldpc HD BRAND WD
coldpc HD ITEM UltrastarDCHC520
coldpc HD PRICE 15890
## copy-paste using your favorite text editor, or upload the file with sftp
$ nano hbase.txt
$ hadoop fs -put hbase.txt /user/a000000000/hbase.txt

Upload the source (name it hbaseMapredInput.java) to your home directory:

Remember to change the username.

/*** hbaseMapredInput.java ***/
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class hbaseMapredInput {

	/* Parses the input; expects exactly one four-token record
	 * ("row family qualifier value") per input line. */
	public static class hbaseMapredInputMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
		public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			StringTokenizer itr = new StringTokenizer(value.toString());
			int i = 0;
			MapWritable outputMap = new MapWritable();
			Text outputKey = new Text();
			while (itr.hasMoreTokens()) {
				switch (i) {
				case 0:
					outputKey.set(itr.nextToken());
					break;
				case 1:
					outputMap.put(new Text("family"), new Text(itr.nextToken()));
					break;
				case 2:
					outputMap.put(new Text("qualifier"), new Text(itr.nextToken()));
					break;
				case 3:
					outputMap.put(new Text("value"), new Text(itr.nextToken()));
					context.write(outputKey, outputMap);
					outputMap.clear();
					break;
				}
				i++;
			}
		}
	}

	/* Turns each parsed record into a Put against the output table. */
	public static class hbaseMapredInputReducer extends TableReducer<Text, MapWritable, NullWritable> {
		public void reduce(Text key, Iterable<MapWritable> values, Context context)
				throws IOException, InterruptedException {
			for (MapWritable valueObject : values) {
				String family = valueObject.get(new Text("family")).toString();
				String qualifier = valueObject.get(new Text("qualifier")).toString();
				String value = valueObject.get(new Text("value")).toString();
				Put put = new Put(Bytes.toBytes(key.toString()));
				put.addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes.toBytes(value));
				context.write(NullWritable.get(), put);
			}
		}
	}

	public static void main(String[] args) throws Exception {
		if (args.length != 2) {
			System.out.println("please enter input, tablename");
			System.exit(0);
		}
		String input = args[0];
		String tablename = args[1];
		String username = "a000000000"; // TODO change me

		Configuration config = HBaseConfiguration.create();
		config.set(TableOutputFormat.OUTPUT_TABLE, tablename);
		config.set("hbase.zookeeper.quorum", "hadoop-slave1,hadoop-slave2,hadoop-slave3,hadoop-master");
		config.set("zookeeper.znode.parent", "/hbase-unsecure");

		Job job = Job.getInstance(config);
		job.setJobName("hbaseMapredInput_" + username);
		job.setJarByClass(hbaseMapredInput.class);
		job.setMapperClass(hbaseMapredInputMapper.class);
		job.setReducerClass(hbaseMapredInputReducer.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(MapWritable.class);
		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TableOutputFormat.class);
		FileInputFormat.addInputPath(job, new Path(input));
		job.waitForCompletion(true);
	}
}

Build it, pack it into a JAR, then run it with Hadoop:

Note
The program needs to be packaged as a portable JAR executable so that Hadoop can distribute it to the nodes, which is what makes MapReduce possible.

$ javac -cp `hbase classpath` hbaseMapredInput.java 
$ jar cf hbaseMapredInput.jar hbaseMapredInput*.class
## Change username to yours
$ hadoop jar hbaseMapredInput.jar hbaseMapredInput hbase.txt a000000000
...
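For reference, HBase also ships a helper class, org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil, that configures the table-output side of such a job in one call. Purely as a sketch (not required for the lab), the main() of hbaseMapredInput could be reduced to the following, assuming the same mapper/reducer classes and an added import for TableMapReduceUtil:

	// Hypothetical rewrite of hbaseMapredInput.main() using TableMapReduceUtil (HBase 2.x)
	public static void main(String[] args) throws Exception {
		String input = args[0];
		String tablename = args[1];
		String username = "a000000000"; // TODO change me

		Configuration config = HBaseConfiguration.create();
		config.set("hbase.zookeeper.quorum", "hadoop-slave1,hadoop-slave2,hadoop-slave3,hadoop-master");
		config.set("zookeeper.znode.parent", "/hbase-unsecure");

		Job job = Job.getInstance(config);
		job.setJobName("hbaseMapredInput_" + username);
		job.setJarByClass(hbaseMapredInput.class);
		job.setMapperClass(hbaseMapredInputMapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(MapWritable.class);
		job.setInputFormatClass(TextInputFormat.class);
		FileInputFormat.addInputPath(job, new Path(input));

		// One call sets the reducer class, TableOutputFormat, the output table name,
		// and adds the HBase jars to the job's classpath.
		TableMapReduceUtil.initTableReducerJob(tablename, hbaseMapredInputReducer.class, job);
		job.waitForCompletion(true);
	}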

From HBase to HDFS

Upload the source (name it hbaseMapredOutput.java) to your home directory:

/*** hbaseMapredOutput.java ***/
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class hbaseMapredOutput {

	/* Emits one (row, {family, qualifier, value}) pair per cell in the table. */
	public static class hbaseMapredOutputMapper extends TableMapper<Text, MapWritable> {
		public void map(ImmutableBytesWritable key, Result value, Context context)
				throws IOException, InterruptedException {
			MapWritable outputMap = new MapWritable();
			for (Cell kv : value.rawCells()) {
				outputMap.put(new Text("family"), new Text(Arrays.copyOfRange(kv.getFamilyArray(),
						kv.getFamilyOffset(), kv.getFamilyOffset() + kv.getFamilyLength())));
				outputMap.put(new Text("qualifier"), new Text(Arrays.copyOfRange(kv.getQualifierArray(),
						kv.getQualifierOffset(), kv.getQualifierOffset() + kv.getQualifierLength())));
				outputMap.put(new Text("value"), new Text(Arrays.copyOfRange(kv.getValueArray(),
						kv.getValueOffset(), kv.getValueOffset() + kv.getValueLength())));
				context.write(new Text(Arrays.copyOfRange(kv.getRowArray(), kv.getRowOffset(),
						kv.getRowOffset() + kv.getRowLength())), outputMap);
				outputMap.clear();
			}
		}
	}

	/* Writes each cell as a "row <TAB> family qualifier value" text line. */
	public static class hbaseMapredOutputReducer extends Reducer<Text, MapWritable, Text, Text> {
		public void reduce(Text key, Iterable<MapWritable> values, Context context)
				throws IOException, InterruptedException {
			for (MapWritable valueObject : values) {
				String family = valueObject.get(new Text("family")).toString();
				String qualifier = valueObject.get(new Text("qualifier")).toString();
				String value = valueObject.get(new Text("value")).toString();
				context.write(key, new Text(family + " " + qualifier + " " + value));
			}
		}
	}

	public static void main(String[] args) throws Exception {
		if (args.length != 2) {
			System.out.println("please enter output, tablename");
			System.exit(0);
		}
		String output = args[0];
		String tablename = args[1];
		String username = "a000000000"; // TODO change this

		Configuration config = HBaseConfiguration.create();
		config.set(TableInputFormat.INPUT_TABLE, tablename);
		// These two lines are only needed when you run on the server
		config.set("hbase.zookeeper.quorum", "hadoop-slave1,hadoop-slave2,hadoop-slave3,hadoop-master");
		config.set("zookeeper.znode.parent", "/hbase-unsecure");

		Job job = Job.getInstance(config);
		job.setJobName("hbaseMapredOutput_" + username);
		job.setJarByClass(hbaseMapredOutput.class);
		job.setMapperClass(hbaseMapredOutputMapper.class);
		job.setReducerClass(hbaseMapredOutputReducer.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(MapWritable.class);
		job.setInputFormatClass(TableInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);
		FileOutputFormat.setOutputPath(job, new Path(output));
		job.waitForCompletion(true);
	}
}
## Run this line if you have already run the program once. Change the username!
$ hadoop fs -rm -r /user/a000000000/lab3-out
## It's normal to see this error if it is your first time executing the program:
rm: '/user/a000000000/lab3-out': No such file or directory

$ javac -cp `hbase classpath` hbaseMapredOutput.java
$ jar cf hbaseMapredOutput.jar hbaseMapredOutput*.class
$ hadoop jar hbaseMapredOutput.jar hbaseMapredOutput lab3-out a000000000
$ hadoop fs -cat /user/a000000000/lab3-out/part-r-00000
......
coldpc  RAM PRICE 6988
coldpc  RAM ITEM HyperX-D43200-8Gx4
coldpc  RAM BRAND KINGSTON
coldpc  HD PRICE 15890
coldpc  HD ITEM UltrastarDCHC520
coldpc  HD BRAND WD
coldpc  CPU PRICE 10490
coldpc  CPU ITEM R7-2700X
coldpc  CPU BRAND AMD
......
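The table-input side has a matching helper, TableMapReduceUtil.initTableMapperJob, which also accepts a Scan so the mappers only read the rows and columns you ask for. A sketch of hbaseMapredOutput.main() under the same assumptions as the input-side sketch above (imports for TableMapReduceUtil, Scan, and Bytes added):

	// Hypothetical rewrite of hbaseMapredOutput.main() using TableMapReduceUtil (HBase 2.x)
	public static void main(String[] args) throws Exception {
		String output = args[0];
		String tablename = args[1];
		String username = "a000000000"; // TODO change me

		Configuration config = HBaseConfiguration.create();
		config.set("hbase.zookeeper.quorum", "hadoop-slave1,hadoop-slave2,hadoop-slave3,hadoop-master");
		config.set("zookeeper.znode.parent", "/hbase-unsecure");

		Job job = Job.getInstance(config);
		job.setJobName("hbaseMapredOutput_" + username);
		job.setJarByClass(hbaseMapredOutput.class);
		job.setReducerClass(hbaseMapredOutputReducer.class);
		job.setOutputFormatClass(TextOutputFormat.class);
		FileOutputFormat.setOutputPath(job, new Path(output));

		Scan scan = new Scan();
		scan.addFamily(Bytes.toBytes("CPU")); // optional: restrict the scan to one column family
		// One call sets TableInputFormat, the input table, the mapper, and its output types.
		TableMapReduceUtil.initTableMapperJob(tablename, scan,
				hbaseMapredOutputMapper.class, Text.class, MapWritable.class, job);
		job.waitForCompletion(true);
	}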

Practices

Due date
2019/05/31 23:59 (GMT+8:00)

Requirements (Updated 2019/05/27)

Practice 1

Put the following data into the table you created during the hands-on, using the HBase Java APIs. The data may be hard-coded into your Java program (this earns a lower score, since it is easier) or provided through a shell script as we did during the hands-on (this earns the full score).

row columnFamily qualifier value
Yaya MB BRAND ASUS
Yaya MB ITEM RogZenithExtremeAlpha
Yaya MB PRICE 17990

How to hand in
Save your code as username_practice1.java (e.g. s107123456_practice1.java) in the lab3 folder under your home directory (~/lab3), and make sure it compiles successfully under that file name. (The grading script renames the source file after your public class name and then compiles it.)
If you use a shell script to provide the data, name it username_practice1.sh and put it in the same folder as the Java code.

How will we test the program

Tips

Practice 2

Using the MapReduce Java API, add the following data to the table you created during the hands-on. The data must be read from a text file on HDFS, and the program's arguments must be either [programName] [dataFilePathOnHDFS] [tableName] or [programName] [dataFilePathOnHDFS].

row columnFamily qualifier value
P100 MB BRAND ASUS
P100 MB ITEM X399
P100 MB PRICE 8787

How to hand in
Save your code as username_practice2.java in the lab3 folder under your home directory (~/lab3).

How will we test the program
We will modify the username in your code (only when necessary), compile it, run it with the test data (and a test table) as the program's arguments, and check the result.

Tips
To add a new column into the existing table, you may need to:

Tips

Working with IDEs: Maven Project

For IntelliJ/Eclipse/VS Code (and other IDE) users working with Maven, add the related repository and dependencies to pom.xml to get code-completion features:

<repositories>
	...
	<repository>
		<id>mvnrepository</id>
		<url>https://mvnrepository.com/repos/central</url>
	</repository>
	...
</repositories>
<dependencies>
	...
	<dependency>
		<groupId>org.apache.hbase</groupId>
		<artifactId>hbase-client</artifactId>
		<version>2.1.4</version>
	</dependency>
	<dependency>
		<groupId>org.apache.hbase</groupId>
		<artifactId>hbase-common</artifactId>
		<version>2.1.4</version>
	</dependency>
	...
</dependencies>
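The two artifacts above cover the client-only examples. If you also want the MapReduce examples (TableMapper, TableReducer, TableInputFormat/TableOutputFormat, ...) to resolve in your IDE, you will most likely need the hbase-mapreduce artifact as well, since HBase 2.x keeps those classes in a separate module (keep the version in sync with the ones above):

	<dependency>
		<groupId>org.apache.hbase</groupId>
		<artifactId>hbase-mapreduce</artifactId>
		<version>2.1.4</version>
	</dependency>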

LOG NEEDED?

You can see the logs of the slaves at https://pdc19.csie.ncu.edu.tw:18764/cluster/ . Click your programName_studentID entry for more details, then click "logs" on that page.
However, you will get an error/NXDOMAIN page because the slaves are behind a firewall. To work around this, copy the URL of the log page and open it with a text browser from the SSH shell:

## Replace the URL with your log URL
$ lynx http://hadoop-slave1:8042/node/containerlogs/container_e27_1555285318657_0022_01_000001/s107525003
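Alternatively, assuming log aggregation is enabled on the cluster (an assumption; check with the TA if it fails), the YARN CLI can fetch the same container logs without a browser. The application ID is embedded in the container name shown on the logs page, e.g. container_e27_1555285318657_0022_01_000001 belongs to application_1555285318657_0022:

## Replace the application ID with yours
$ yarn logs -applicationId application_1555285318657_0022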
