What is Apache Pig?
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators with which programmers can develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts in the Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.
Why Do We Need Apache Pig?
Programmers who are not well versed in Java often struggle while working with Hadoop, especially while performing MapReduce tasks. Apache Pig is a boon for all such programmers.
• Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex code in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require you to type 200 lines of code (LoC) in Java can be done by typing as few as 10 LoC in Apache Pig. Ultimately, Apache Pig reduces development time almost 16-fold (see the word-count sketch after this list).
• Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it provides nested data types like tuples, bags, and maps that are missing from MapReduce.
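As an illustration of that brevity, a classic word count fits in a handful of Pig Latin statements. This is a minimal sketch, assuming a plain-text input file (the name input.txt is hypothetical):
lines = LOAD 'input.txt' AS (line:chararray);
-- split each line into words and flatten the resulting bag into one word per row
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- group identical words together and count the tuples in each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;
The equivalent hand-written Java MapReduce program typically runs to well over a hundred lines.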
Features of Pig
Apache Pig comes with the following features:
• Rich set of operators: It provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
• Optimization opportunities: The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on the semantics of the language.
• Extensibility: Using the existing operators, users can develop their own functions to read, process, and write data.
• UDFs: Pig provides the facility to create User Defined Functions in other programming languages such as Java and invoke or embed them in Pig scripts.
• Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
Apache Pig Vs MapReduce
Listed below are the major differences between Apache Pig and MapReduce.
Apache Pig | MapReduce
Apache Pig is a data flow language. | MapReduce is a data processing paradigm.
It is a high-level language. | MapReduce is low level and rigid.
Performing a Join operation in Apache Pig is pretty simple. | It is quite difficult in MapReduce to perform a Join operation between datasets.
Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig. | Exposure to Java is a must to work with MapReduce.
Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent. | MapReduce will require almost 20 times more lines to perform the same task.
There is no need for compilation. On execution, every Apache Pig operator is converted internally into a MapReduce job. | MapReduce jobs have a long compilation process.
Apache Pig Vs SQL
Listed below are the major differences between Apache Pig and SQL.
Pig | SQL
Pig Latin is a procedural language. | SQL is a declarative language.
In Apache Pig, schema is optional. We can store data without designing a schema (values are referenced positionally as $0, $1, etc.). | Schema is mandatory in SQL.
The data model in Apache Pig is nested relational. | The data model used in SQL is flat relational.
Apache Pig provides limited opportunity for query optimization. | There is more opportunity for query optimization in SQL.
In addition to the above differences, Apache Pig Latin:
• Allows splits in the pipeline.
• Allows developers to store data anywhere in the pipeline.
• Declares execution plans.
• Provides operators to perform ETL (Extract, Transform, and Load) functions.
Apache Pig Vs Hive
Both Apache Pig and Hive are used to create MapReduce jobs. And in some cases, Hive operates on HDFS in a similar way Apache Pig does. In the following table, we have listed a few significant points that set Apache Pig apart from Hive.
Apache Pig | Hive
Apache Pig uses a language called Pig Latin. It was originally created at Yahoo. | Hive uses a language called HiveQL. It was originally created at Facebook.
Pig Latin is a data flow language. | HiveQL is a query processing language.
Pig Latin is a procedural language and it fits in the pipeline paradigm. | HiveQL is a declarative language.
Apache Pig can handle structured, unstructured, and semi-structured data. | Hive is mostly for structured data.
Applications of Apache Pig
Apache Pig is generally used by data scientists for performing tasks involving ad-hoc processing and quick prototyping. Apache Pig is used:
• To process huge data sources such as web logs.
• To perform data processing for search platforms.
• To process time-sensitive data loads.
Apache Pig Architecture
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.
To perform a particular task, programmers using Pig need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, script file, or embedded mode). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of
MapReduce jobs, and thus, it makes the programmer’s job easy. The architecture
of Apache Pig is shown below.
Apache Pig – Components
As shown in the figure, there are various components in the Apache Pig
framework. Let us take a look at the major components.
Parser
Initially, the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the
nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of
MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order, where they are executed to produce the desired results.
Pig Latin – Data Model
The data model of Pig Latin is fully nested and allows complex non-atomic data types such as map and tuple. Given below is the diagrammatical representation of Pig Latin's data model.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Ex: '001' or 'rajiv' or 'Hyderabad'
Tuple
A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS.
Ex: (001, rajiv, hyd)
Bag
A bag is an unordered set of tuples. In other words, a collection of
tuples (non-unique) is known as a bag. Each tuple can have any number of fields
(flexible schema). A bag is represented by ‘{}’. It is similar to a table in
RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple
contain the same number of fields or that the fields in the same position
(column) have the same type.
Ex: cat emp
ravi,m,10000
rani,f,40000
ram,m,50000
vani,f,60000
mani,m,90000
Bags are of two types:
i) Outer bag
ii) Inner bag
Outer bag: The collection of all tuples of a dataset is called an outer bag. An outer bag is referenced by a "Relation name", simply called the "Alias of the Relation".
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
emp ---> relation
___________________
(ravi,m,10000)
(rani,f,40000)
(ram,m,50000)
(vani,f,60000)
(mani,m,90000)
____________________
Inner bag: A bag placed as a field is called an inner bag.
grp = group emp by sex;
grp
___________________________
group:chararray , emp:bag
________________________________
(f,{(rani,f,40000),(vani,f,60000)})
(m,{(ravi,m,10000),(ram,m,50000),(mani,m,90000)})
{(rani,f,40000),(vani,f,60000)} ---> inner bag
When you group data, you get inner bags.
Pig has two start-up modes:
1. Local mode: pig -x local
2. HDFS (MapReduce) mode: pig -x mapreduce
Pig Latin – Data Model
As discussed above, the data model of Pig is fully nested. A Relation is the outermost structure of the Pig Latin data model, and it is a bag where:
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
Pig Latin – Statements
While processing data using Pig Latin, statements are the basic constructs.
• These statements work with relations. They include expressions and schemas.
• Every statement ends with a semicolon (;).
• We will perform various operations using operators provided by Pig Latin, through statements.
• Except for LOAD and STORE, while performing all other operations, Pig Latin statements take a relation as input and produce another relation as output.
As soon as you enter a Load statement in the Grunt shell, its semantic checking will be carried out. To see the contents of the relation, you need to use the Dump operator. Only after performing the dump operation will the MapReduce job for loading the data from the file system be carried out.
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
Student_data = LOAD 'student_data.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Pig Latin – Data types
Data Type | Description and Example
int | Represents a signed 32-bit integer. Example: 8
long | Represents a signed 64-bit integer. Example: 5L
float | Represents a signed 32-bit floating point. Example: 5.5F
double | Represents a 64-bit floating point. Example: 10.5
chararray | Represents a character array (string) in Unicode UTF-8 format. Example: 'tutorials point'
bytearray | Represents a byte array (blob).
boolean | Represents a Boolean value. Example: true/false
datetime | Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
biginteger | Represents a Java BigInteger. Example: 60708090709
bigdecimal | Represents a Java BigDecimal. Example: 185.98376256272893883
Complex Types
tuple | A tuple is an ordered set of fields. Example: (raja, 30)
bag | A bag is a collection of tuples. Example: {(raju,30),(Mohhammad,45)}
map | A Map is a set of key-value pairs. Example: ['name'#'Raju', 'age'#30]
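These complex types can appear in a LOAD schema as well. The following is a hedged sketch (the file name complex_data.txt and all field names are hypothetical), declaring a tuple, a bag, and an untyped map:
-- t: a tuple of two ints, b: a bag of one-int tuples, m: a map with untyped values
complex_data = LOAD 'complex_data.txt' AS (t:tuple(ta:int, tb:int), b:bag{tp:tuple(x:int)}, m:map[]);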
Pig Latin – Arithmetic Operators
The following table describes the arithmetic operators of Pig Latin. Suppose a = 10 and b = 20.
Operator | Description | Example
+ | Addition: adds values on either side of the operator. | a + b will give 30
- | Subtraction: subtracts the right-hand operand from the left-hand operand. | a - b will give -10
* | Multiplication: multiplies values on either side of the operator. | a * b will give 200
/ | Division: divides the left-hand operand by the right-hand operand. | b / a will give 2
% | Modulus: divides the left-hand operand by the right-hand operand and returns the remainder. | b % a will give 0
? : | Bincond: evaluates a Boolean expression; it has three operands, as in variable x = (expression) ? value1 if true : value2 if false. | b = (a == 1) ? 20 : 30; if a = 1 the value of b is 20; if a != 1 the value of b is 30.
CASE WHEN THEN ELSE END | Case: the case operator is equivalent to nested bincond operators. | CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END
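In practice, bincond and CASE appear inside a FOREACH … GENERATE statement. A minimal sketch, assuming a relation data that has an int field f2 (both names are hypothetical); note that CASE requires Pig 0.12 or later:
grunt> parity = FOREACH data GENERATE f2, (f2 % 2 == 0 ? 'even' : 'odd');
grunt> parity2 = FOREACH data GENERATE f2, (CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END);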
Pig Latin – Comparison Operators
Operator | Description | Example
== | Equal: checks if the values of two operands are equal; if yes, the condition becomes true. | (a == b) is not true.
!= | Not Equal: checks if the values of two operands are equal; if they are not equal, the condition becomes true. | (a != b) is true.
> | Greater than: checks if the value of the left operand is greater than the value of the right operand; if yes, the condition becomes true. | (a > b) is not true.
< | Less than: checks if the value of the left operand is less than the value of the right operand; if yes, the condition becomes true. | (a < b) is true.
>= | Greater than or equal to: checks if the value of the left operand is greater than or equal to the value of the right operand; if yes, the condition becomes true. | (a >= b) is not true.
<= | Less than or equal to: checks if the value of the left operand is less than or equal to the value of the right operand; if yes, the condition becomes true. | (a <= b) is true.
matches | Pattern matching: checks whether the string on the left-hand side matches the constant on the right-hand side. | f1 matches '.*tutorial.*'
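For instance, the matches operator is typically applied inside a FILTER statement. A small sketch, assuming a relation docs with a chararray field f1 (hypothetical names):
grunt> tutorial_rows = FILTER docs BY f1 matches '.*tutorial.*';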
Pig Latin – Relational Operations
The following table describes the relational operators of Pig Latin.
Operator | Description
Loading and Storing
LOAD | To load the data from the file system (local/HDFS) into a relation.
STORE | To save a relation to the file system (local/HDFS).
Filtering
FILTER | To remove unwanted rows from a relation.
DISTINCT | To remove duplicate rows from a relation.
FOREACH… GENERATE | To generate data transformations based on columns of data.
STREAM | To transform a relation using an external program.
Grouping and Joining
JOIN | To join two or more relations.
COGROUP | To group the data in two or more relations.
GROUP | To group the data in a single relation.
CROSS | To create the cross product of two or more relations.
Sorting
ORDER | To arrange a relation in a sorted order based on one or more fields (ascending or descending).
LIMIT | To get a limited number of tuples from a relation.
Combining and Splitting
UNION | To combine two or more relations into a single relation.
SPLIT | To split a single relation into two or more relations.
Diagnostic Operators
DUMP | To print the contents of a relation on the console.
DESCRIBE | To describe the schema of a relation.
EXPLAIN | To view the logical, physical, or MapReduce execution plans to compute a relation.
ILLUSTRATE | To view the step-by-step execution of a series of statements.
The Load Operator
You can load data into Apache Pig from the file system (HDFS/local) using the LOAD operator of Pig Latin.
Syntax
The load statement consists of two parts divided by the "=" operator. On the left-hand side, we need to mention the name of the relation where we want to store the data, and on the right-hand side, we define how we store the data. Given below is the syntax of the Load operator.
Relation_name = LOAD 'Input file path' USING function AS schema;
Schema: (column1 : data type, column2 : data type, column3 : data type);
Ex: cat student_data.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
Pig provides the following storage functions:
1. PigStorage(): TextInputFormat
2. BinStorage(): SequenceInputFormat (binary files)
The default storage function is PigStorage(), and the default delimiter is '\t'.
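So a LOAD without a USING clause splits each line on tab characters. A minimal sketch, assuming a hypothetical tab-delimited file logs.txt:
grunt> logs = LOAD 'logs.txt' AS (level:chararray, message:chararray);
-- equivalent to: LOAD 'logs.txt' USING PigStorage('\t') AS (level:chararray, message:chararray);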
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
grunt> emp = load 'emp' using PigStorage(',') as (ecode:int, ename:chararray, esal:int, sex:chararray, dno:int);
Store Operator
The Store operator is used to save the data of a relation to the file system (local/HDFS).
Syntax
STORE Relation_name INTO 'required_directory_path' [USING function];
Ex: cat student_data.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Let us store the relation in the HDFS directory "pig_Output" as shown below.
grunt> STORE student INTO 'pig_Output/' USING PigStorage(',');
Output
After executing the store statement, you will get the following output. A directory is created with the specified name, and the data is stored in it.
$ hdfs dfs -ls 'pig_Output/'
Found 2 items
-rw-r--r--   1 Hadoop supergroup      0 2015-10-05 13:03 pig_Output/_SUCCESS
-rw-r--r--   1 Hadoop supergroup    224 2015-10-05 13:03 pig_Output/part-m-00000
You can observe that two files were created after executing the store statement. Using the cat command, list the contents of the file named part-m-00000 as shown below.
$ hdfs dfs -cat 'pig_Output/part-m-00000'
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai
Diagnostic Operators
Pig Latin provides four diagnostic operators:
• Dump operator
• Describe operator
• Explain operator
• Illustrate operator
Dump Operator
The Dump operator is used to run Pig Latin statements and display the results on the screen. It is generally used for debugging purposes.
Syntax
grunt> Dump Relation_Name;
Example
Assume we have a file student_data.txt in HDFS with the following content.
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Output
Once you execute the above Pig Latin statement, it will start a MapReduce job to read the data from HDFS and produce the following output on the terminal.
grunt> Dump student;
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
Describe Operator
The describe operator is used to view the schema of a relation.
Syntax
grunt> describe Relation_Name;
grunt> describe student;
Output
Once you execute the above Pig
Latin statement, it will produce the following output.
grunt> student: { id: int, firstname: chararray, lastname: chararray, phone: chararray, city: chararray }
Explain Operator
The explain operator is used to display the logical, physical, and MapReduce execution plans of a relation.
Syntax
Given below is the syntax of the explain operator.
grunt> explain Relation_name;
Example
Assume we have a file student_data.txt in HDFS.
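Following the same pattern as the other examples, you would load the file and then explain the relation; the printed logical, physical, and MapReduce plans are lengthy and are omitted here.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
grunt> explain student;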
Illustrate Operator
The illustrate operator gives you the step-by-step execution of a sequence of statements.
Syntax
grunt> illustrate Relation_name;
Example
Assume we have a file student_data.txt in HDFS.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
grunt> illustrate student;
Output
On executing the above statement, you will get the following output.
INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: student[1,10] C: R:
-------------------------------------------------------------------------------
| student | id:int | firstname:chararray | lastname:chararray | phone:chararray | city:chararray |
-------------------------------------------------------------------------------
|         | 002    | siddarth            | Battacharya        | 9848022338      | Kolkata        |
-------------------------------------------------------------------------------
Group Operator
The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
Syntax
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY key;
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
1,Rajiv,Reddy,21,9848022337,Hyderabad
2,siddarth,Battacharya,22,9848022338,Kolkata
3,Rajesh,Khanna,22,9848022339,Delhi
4,Preethi,Agarwal,21,9848022330,Pune
5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
6,Archana,Mishra,23,9848022335,Chennai
7,Komal,Nayak,24,9848022334,trivendram
8,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Apache Pig with the relation name student_details as shown below.
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
grunt> group_data = GROUP student_details by age;
Output
Verify the relation group_data using the DUMP operator as shown below.
grunt> Dump group_data;
You will get output displaying the contents of the relation named group_data as shown below. Here you can observe that the resulting schema has two columns:
• One is age, by which we have grouped the relation.
• The other is a bag, which contains the group of tuples (the student records) with the respective age.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})
You can see the schema of the relation after grouping the data using the describe command as shown below.
grunt> Describe group_data;
group_data: {group: int, student_details: {(id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray)}}
In the same way, you can get a sample illustration of the schema using the illustrate command as shown below. It will produce the following output:
-------------------------------------------------------------------------------
| group_data | group:int | student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)} |
-------------------------------------------------------------------------------
|            | 21        | {(4,Preethi,Agarwal,21,9848022330,Pune), (1,Rajiv,Reddy,21,9848022337,Hyderabad)} |
|            | 22        | {(2,siddarth,Battacharya,22,9848022338,Kolkata), (3,Rajesh,Khanna,22,9848022339,Delhi)} |
-------------------------------------------------------------------------------
Grouping by Multiple Columns
Let us group the relation by age and city as shown below.
grunt> group_multiple = GROUP student_details by (age, city);
You can verify the content of the relation named group_multiple using the Dump operator as shown below.
grunt> Dump group_multiple;
((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})
Group All
You can group a relation by all the columns as shown below.
grunt> group_all = GROUP student_details All;
Now, verify the content of the relation group_all as shown below.
grunt> Dump group_all;
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram),(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar),(4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
Cogroup Operator
The COGROUP operator is used to group two or more relations.
Example
Assume that we have two files namely student_details.txt and employee_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
employee_details.txt
001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai
And we have loaded these files into Pig with the relation names student_details and employee_details respectively, as shown below.
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.
grunt> cogroup_data = COGROUP student_details by age, employee_details by age;
Output:
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)},{})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)},{(6,Maggy,22,Chennai),(1,Robin,22,newyork)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)},{(5,David,23,Bhuwaneshwar),(3,Maya,23,Tokyo),(2,BOB,23,Kolkata)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)},{})
(25,{},{(4,Sara,25,London)})
The cogroup operator groups the tuples from each relation according to age, where each group depicts a particular age value.
For example, if we consider the 1st tuple of the result, it is grouped by age 21. It contains two bags:
• the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and
• the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.
Join Operator
The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) field(s) from each relation as keys. When these keys match, the two particular tuples are matched, else the records are dropped. Joins can be of the following types:
• Inner join
• Outer join: left join, right join, and full join
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
Load these two files into Pig with the relation names customers and orders, as shown below.
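Assuming both files sit in the /pig_data/ directory of HDFS as in the earlier examples, the load statements would be:
grunt> customers = LOAD 'pig_data/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> orders = LOAD 'pig_data/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int);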
Inner Join
An inner join returns rows when there is a match in both tables.
Syntax
Here is the syntax of performing an inner join operation using the JOIN operator.
Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;
Example
Let us perform an inner join operation on the two relations customers and orders as shown below.
grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
Output:
Verify the relation customer_orders using the DUMP operator as shown below.
grunt> Dump customer_orders;
You will get the following output, displaying the contents of the relation named customer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Outer Join
An outer join operation is carried out in three ways:
• Left outer join
• Right outer join
• Full outer join
Left Outer Join
The left outer join operation returns all rows from the left table, even if there are no matches in the right relation.
Syntax
Given below is the syntax of performing left outer join operation using the JOIN operator.
Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Example
Let us perform left outer join operation on the two relations customers and orders as shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
Output
Verify the relation outer_left using the DUMP operator as shown below.
grunt> Dump outer_left;
It will produce the following output, displaying the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Right Outer Join
The right outer join operation returns all rows from the right table, even if there are no matches in the left table.
Syntax
Given below is the syntax of performing right outer join operation using the JOIN operator.
Relation3_name = JOIN Relation1_name BY id RIGHT OUTER, Relation2_name BY customer_id;
Example
Let us perform right outer join operation on the two relations customers and orders as shown below.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Verify the relation outer_right using the DUMP operator as shown below.
grunt> Dump outer_right;
Output
It will produce the following output, displaying the contents of the
relation outer_right.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Full Outer Join
The full outer join operation returns rows when there is a match in either of the relations.
Syntax
Given below is the syntax of performing full outer join using the JOIN operator.
Relation3_name = JOIN Relation1_name BY id FULL OUTER, Relation2_name BY customer_id;
Example
Let us perform full outer join operation on the two relations customers and orders as shown below.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Output
Verify the relation outer_full using the DUMP operator as shown below.
grunt> Dump outer_full;
It will produce the following output, displaying the contents of the relation outer_full.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Cross Operator
The CROSS operator computes the cross-product of two or more relations. This section explains, with an example, how to use the cross operator in Pig Latin.
Syntax
Given below is the syntax of the CROSS operator.
Relation3_name = CROSS Relation1_name, Relation2_name;
Example
Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
And we have loaded these two files into Pig with the relation names customers and orders as shown below.
customers = LOAD 'pig_data/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
orders = LOAD 'pig_data/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int);
Let us now get the cross-product of these two relations using the CROSS operator, as shown below.
cross_data = CROSS customers, orders;
Output
It will produce the following output, displaying the contents of the relation cross_data (every customer paired with every order, 7 x 4 = 28 tuples).
(7,Muffy,24,Indore,10000,103,2008-05-20 00:00:00,4,2060)
(7,Muffy,24,Indore,10000,101,2009-11-20 00:00:00,2,1560)
(7,Muffy,24,Indore,10000,100,2009-10-08 00:00:00,3,1500)
(7,Muffy,24,Indore,10000,102,2009-10-08 00:00:00,3,3000)
(6,Komal,22,MP,4500,103,2008-05-20 00:00:00,4,2060)
(6,Komal,22,MP,4500,101,2009-11-20 00:00:00,2,1560)
(6,Komal,22,MP,4500,100,2009-10-08 00:00:00,3,1500)
(6,Komal,22,MP,4500,102,2009-10-08 00:00:00,3,3000)
(5,Hardik,27,Bhopal,8500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,101,2009-11-20 00:00:00,2,1560)
(5,Hardik,27,Bhopal,8500,100,2009-10-08 00:00:00,3,1500)
(5,Hardik,27,Bhopal,8500,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(4,Chaitali,25,Mumbai,6500,101,2009-11-20 00:00:00,2,1560)
(4,Chaitali,25,Mumbai,6500,100,2009-10-08 00:00:00,3,1500)
(4,Chaitali,25,Mumbai,6500,102,2009-10-08 00:00:00,3,3000)
(3,kaushik,23,Kota,2000,103,2008-05-20 00:00:00,4,2060)
(3,kaushik,23,Kota,2000,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(2,Khilan,25,Delhi,1500,103,2008-05-20 00:00:00,4,2060)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(2,Khilan,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500)
(2,Khilan,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000)
(1,Ramesh,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060)
(1,Ramesh,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560)
(1,Ramesh,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500)
(1,Ramesh,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)
Union Operator
The UNION operator of Pig Latin is used to merge the content of two relations. To perform a UNION operation on two relations, their columns and domains must be identical.
Syntax
Given below is the syntax of the UNION operator.
grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
Example
Assume that we have two files namely student_data1.txt and student_data2.txt in the /pig_data/ directory of HDFS as shown below.
student_data1.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
student_data2.txt
7,Komal,Nayak,9848022334,trivendram
8,Bharathi,Nambiayar,9848022333,Chennai
And we have loaded these two files into Pig with the relation names student1 and student2 as shown below.
student1 = LOAD 'hdfs://localhost:9000/pig_data/student_data1.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
student2 = LOAD 'hdfs://localhost:9000/pig_data/student_data2.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Let us now merge the contents of these two relations using the UNION operator as shown below.
student = UNION student1, student2;
Output
Verify the relation student using the DUMP operator as shown below.
grunt> Dump student;
It will display the following output, showing the contents of the relation student.
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
(7,Komal,Nayak,9848022334,trivendram)
(8,Bharathi,Nambiayar,9848022333,Chennai)
Split Operator
The SPLIT operator is used to split a relation into two or more relations.
Syntax
Given below is the syntax of the SPLIT operator.
grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now split the relation into two: one listing the students of age less than 23, and the other listing the students having an age between 22 and 25.
SPLIT student_details into student_details1 if age<23, student_details2 if (22<age and age<25);
Output
Verify the relations student_details1 and student_details2 using the DUMP operator. It will produce the following output, displaying the contents of the relations student_details1 and student_details2 respectively.
Dump student_details1;
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)
Dump student_details2;
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,23,9848022335,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
Filter Operator
The FILTER operator is used to select the required tuples from a relation based on a condition.
Syntax
Given below is the syntax of the FILTER operator.
grunt> Relation2_name = FILTER Relation1_name BY (condition);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now use the Filter operator to get the details of the students who belong to the city Chennai.
filter_data = FILTER student_details BY city == 'Chennai';
Output
Verify the relation filter_data using the DUMP operator as shown below.
grunt> Dump filter_data;
It will produce the following output, displaying the contents of the relation filter_data.
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
Distinct Operator
The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.
Syntax
Given below is the syntax of the DISTINCT operator.
grunt> Relation_name2 = DISTINCT Relation_name1;
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
006,Archana,Mishra,9848022335,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Let us now remove the redundant (duplicate) tuples from the relation named student_details using the DISTINCT operator, and store the result as another relation named distinct_data as shown below.
distinct_data = DISTINCT student_details;
Output
Verify the relation distinct_data using the DUMP operator as shown below.
grunt> Dump distinct_data;
It will produce the following output, displaying the contents of the relation distinct_data.
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
Foreach Operator
The FOREACH operator is used to generate specified data transformations based on the column data.
Syntax
Given below is the syntax of the FOREACH operator.
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now get the id, age, and city values of each student from the relation student_details and store them in another relation named foreach_data using the FOREACH operator as shown below.
foreach_data = FOREACH student_details GENERATE id,age,city;
Output
Verify the relation foreach_data using the DUMP operator as shown below.
Dump foreach_data;
(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)
Order By Operator
The ORDER BY operator is used to display the contents of a relation in sorted order based on one or more fields.
Syntax
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now sort the relation in descending order based on the age of the student and store the result in another relation named order_by_data using the ORDER BY operator as shown below.
order_by_data = ORDER student_details BY age DESC;
Output
Verify the relation order_by_data using the DUMP operator as shown below.
Dump order_by_data;
It will produce the following output, displaying the contents of the relation order_by_data.
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
Limit Operator
The LIMIT operator is used to get a limited number of tuples from a relation.
Syntax
Given below is the syntax of the LIMIT operator.
grunt> Result = LIMIT Relation_name required_number_of_tuples;
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Now, let us get the first four tuples of the relation and store them in another relation named limit_data using the LIMIT operator as shown below.
limit_data = LIMIT student_details 4;
Output
Verify the relation limit_data using the DUMP operator as shown below.
Dump limit_data;
It will produce the following output, displaying the contents of the relation limit_data.
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)