Step-by-Step Tutorial to Install Apache Hive
(with Detailed Commands and Explanations)
[Note: Here, Hive-2.3.0 is being installed on Ubuntu-16.04.3, but this document can be referred to for installing any version of Hive on any version of Ubuntu (14.04 or above)]
Prerequisites:
i) Log in as hduser by running the command $ su hduser.
ii) Start all Hadoop components, and verify with jps that all of them are running.
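In case the Hadoop daemons are not running yet, they can be started as shown below (a sketch assuming a standard single-node Hadoop setup with Hadoop's sbin scripts on the PATH; daemon names can vary slightly across Hadoop versions):
hduser@Soumitra-PC:~$ start-dfs.sh
hduser@Soumitra-PC:~$ start-yarn.sh
hduser@Soumitra-PC:~$ jps
The jps output should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager, along with Jps itself.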
Please follow the below steps to install Apache Hive on Ubuntu:
Step 1: Download Hive tar.
hduser@Soumitra-PC:/tmp$ wget http://archive.apache.org/dist/hive/hive-2.3.0/apache-hive-2.3.0-bin.tar.gz
Step 2: Extract the tar file.
hduser@Soumitra-PC:/tmp$ sudo tar -xzf apache-hive-2.3.0-bin.tar.gz -C /usr/local
Step 3: Edit the “~/.bashrc” file.
hduser@Soumitra-PC:/tmp$ sudo gedit ~/.bashrc
Note: On systems where sudo gedit does not work, the file can be edited with 'vi' or 'nano' instead.
Add the following at the end of the file:
#Hive Variables Start
export HIVE_HOME=/usr/local/apache-hive-2.3.0-bin
export HIVE_CONF_DIR=/usr/local/apache-hive-2.3.0-bin/conf
export PATH=$HIVE_HOME/bin:$PATH
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/apache-hive-2.3.0-bin/lib/*:.
#Hive Variables End
Also make sure the Hadoop path is set in ~/.bashrc.
Run the command below to make the changes take effect in the same terminal.
hduser@Soumitra-PC:/tmp$ source ~/.bashrc
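A quick sanity check that the variables are visible in the current shell:
hduser@Soumitra-PC:/tmp$ echo $HIVE_HOME
/usr/local/apache-hive-2.3.0-bin
hduser@Soumitra-PC:/tmp$ which hive
/usr/local/apache-hive-2.3.0-bin/bin/hive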
Step 4: Check the Hive version.
hduser@Soumitra-PC:/tmp$ hive --version
Step 5: Create Hive directories within HDFS.
The 'warehouse' directory is the default location where Hive stores its tables and related data.
hduser@Soumitra-PC:~$ hdfs dfs -mkdir -p /user/hive/warehouse
hduser@Soumitra-PC:~$ hdfs dfs -mkdir -p /tmp
Give the group write permission on these directories:
hduser@Soumitra-PC:~$ hdfs dfs -chmod g+w /tmp
hduser@Soumitra-PC:~$ hdfs dfs -chmod g+w /user/hive/warehouse
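To confirm that the directories were created with group write permission:
hduser@Soumitra-PC:~$ hdfs dfs -ls /user/hive
The warehouse entry should be listed with drwxrwxr-x permissions.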
Step 6: Configure Hive
To configure Hive with Hadoop, we need to edit the hive-env.sh file, located in the $HIVE_HOME/conf directory. The following commands change to the Hive conf directory and copy the template file:
hduser@Soumitra-PC:/tmp$ cd $HIVE_HOME/conf
hduser@Soumitra-PC:/usr/local/apache-hive-2.3.0-bin/conf$ sudo cp hive-env.sh.template hive-env.sh
Edit the “hive-env.sh” file to set the Hadoop environment variable for Hive.
hduser@Soumitra-PC:/usr/local/apache-hive-2.3.0-bin/conf$ sudo gedit hive-env.sh
Note: On systems where sudo gedit does not work, the file can be edited with 'vi' or 'nano' instead.
Append the following line to the file:
export HADOOP_HOME=/usr/local/hadoop
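Besides HADOOP_HOME, the hive-env.sh template also exposes a few optional settings that can be uncommented if needed; the values below are illustrative assumptions, not required for a basic setup:
# Optional: heap size (in MB) for Hive client processes
# export HADOOP_HEAPSIZE=512
# Optional: explicitly point Hive at its configuration directory
# export HIVE_CONF_DIR=/usr/local/apache-hive-2.3.0-bin/conf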
Step 7: Edit hive-site.xml
hduser@Soumitra-PC:/tmp$ cd $HIVE_HOME/conf
hduser@Soumitra-PC:/usr/local/apache-hive-2.3.0-bin/conf$ sudo cp hive-default.xml.template hive-site.xml
Edit the “hive-site.xml” file to configure the Hive properties.
hduser@Soumitra-PC:/usr/local/apache-hive-2.3.0-bin/conf$ sudo gedit hive-site.xml
Note: On systems where sudo gedit does not work, the file can be edited with 'vi' or 'nano' instead.
Replace everything inside hive-site.xml with the below code snippet:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements. See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License. You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/usr/local/apache-hive-2.3.0-bin/metastore_db;create=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide a database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for a postgres database.
    </description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value/>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.PersistenceManagerFactoryClass</name>
    <value>org.datanucleus.api.jdo.JDOPersistenceManagerFactory</value>
    <description>class implementing the jdo persistence</description>
  </property>
</configuration>
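Optionally, check that the edited file is well-formed XML before moving on (assumes xmllint is available; on Ubuntu it is provided by the libxml2-utils package):
hduser@Soumitra-PC:/usr/local/apache-hive-2.3.0-bin/conf$ xmllint --noout hive-site.xml
No output means the file parsed cleanly.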
Create a file named jpox.properties inside $HIVE_HOME/conf and add the following lines into it:
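As with the other configuration files, it can be created and edited with gedit (or 'vi'/'nano'):
hduser@Soumitra-PC:/usr/local/apache-hive-2.3.0-bin/conf$ sudo gedit jpox.properties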
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://localhost:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Finally, set ownership of the Hive folder to hduser:
hduser@Soumitra-PC:/usr/local$ sudo chown -R hduser:hadoop apache-hive-2.3.0-bin
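To confirm the ownership change:
hduser@Soumitra-PC:/usr/local$ ls -ld apache-hive-2.3.0-bin
The listing should show hduser and hadoop as the owner and group of the directory.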
Step 8: Download and Install Apache Derby
hduser@Soumitra-PC:/usr/local/apache-hive-2.3.0-bin/conf$ cd /tmp
hduser@Soumitra-PC:/tmp$ wget http://archive.apache.org/dist/db/derby/db-derby-10.13.1.1/db-derby-10.13.1.1-bin.tar.gz
hduser@Soumitra-PC:/tmp$ sudo tar xvzf db-derby-10.13.1.1-bin.tar.gz -C /usr/local
Let's set up the Derby environment by appending the following lines to the ~/.bashrc file:
#DERBY Variables Start
export DERBY_HOME=/usr/local/db-derby-10.13.1.1-bin
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
#DERBY Variables End
Reload ~/.bashrc so that $DERBY_HOME is visible in the current shell, then create a directory named data to store Metastore data.
hduser@Soumitra-PC:/tmp$ source ~/.bashrc
hduser@Soumitra-PC:/tmp$ sudo mkdir $DERBY_HOME/data
Derby installation and environment setup are now complete.
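To confirm that Derby is reachable from the PATH, the sysinfo utility that ships with Derby can be run; it prints the Java and Derby version details:
hduser@Soumitra-PC:/tmp$ sysinfo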
Step 9: Initialize the Derby database. By default, Hive uses Derby as its metastore database.
hduser@Soumitra-PC:/usr/local$ cd $HIVE_HOME/bin
hduser@Soumitra-PC:/usr/local/apache-hive-2.3.0-bin/bin$ schematool -dbType derby -initSchema
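If initialization succeeds, the output ends with a 'schemaTool completed' message. The metastore schema version can also be inspected afterwards:
hduser@Soumitra-PC:/usr/local/apache-hive-2.3.0-bin/bin$ schematool -dbType derby -info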
Step 10: Launch Hive.
hduser@Soumitra-PC:/usr/local/apache-hive-2.3.0-bin$ hive
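Once the hive> prompt appears, a quick smoke test confirms that the metastore and warehouse directory are working (the table name 'demo' below is just an example):
hive> SHOW DATABASES;
hive> CREATE TABLE demo (id INT, name STRING);
hive> SHOW TABLES;
hive> DROP TABLE demo;
hive> quit;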
Document prepared by Mr. Soumitra Ghosh
Assistant Professor, Information Technology,
C.V.Raman College of Engineering, Bhubaneswar
Contact: soumitraghosh@cvrce.edu.in