
Importing a table from Teradata into Hadoop HDFS/Hive using the TDCH command-line interface (in two different environments)

Explanation of importing a table from Teradata into Hive and HDFS

-----------------------------------------
a) The Teradata database is at 192.168.229.130, and Cloudera/Hortonworks runs on a different operating system.
b) The Teradata Connector for Hadoop (TDCH) is installed on Cloudera; its library files are stored in the lib folder, and its configuration files ("*.xml") are stored in the "conf" directory of the Hadoop home ($HADOOP_HOME).

c) teradata-export-properties.xml.template
teradata-import-properties.xml.template

d) tdgssconfig.jar, teradata-connector-1.4.1.jar and terajdbc4.jar are in the lib folder of TDCH.

e) ConnectorImportTool is used for import jobs.
f) ConnectorExportTool is used for export jobs.
g) ConnectorPluginTool is another way to run the desired import/export, where the job is identified by the command-line parameters sourceplugin/targetplugin, and plugin-specific parameters are passed with the -D<...> option.
 
Step 1:
cli> export TDCH_JAR=/usr/lib/tdch/1.4/lib/teradata-connector-1.4.1.jar

cli> hadoop jar $TDCH_JAR com.teradata.connector.common.tool.ConnectorImportTool \
-classname com.teradata.jdbc.TeraDriver \
-url jdbc:teradata://192.168.229.130/DATABASE=dbc \
-sourcetable tables \
-username dbc -password dbc \
-jobtype hdfs -fileformat textfile \
-method split.by.hash \
-separator "," \
-splitbycolumn DatabaseName \
-targetpaths /user/hadoop/sales_transaction

NOTES:
Database in Teradata: dbc
Table name in Teradata (sourcetable): tables
Username for Teradata: dbc
Password for Teradata: dbc
jobtype: hdfs (the table is being imported into HDFS, hence the job type is hdfs)
targetpaths: the HDFS path where the imported file is to be stored.
separator: the field data is separated by ',' in HDFS.
fileformat: the imported file is stored as a textfile.
method: split.by.hash
splitbycolumn: the table column on which the data is split across mappers/reducers (used as the split key).
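For the second environment, where the target is a Hive table rather than a raw HDFS path, the same ConnectorImportTool can be pointed at Hive by changing the job type. The sketch below is hedged: the Hive database/table name is hypothetical, and the exact parameters (-jobtype hive, -targettable) should be checked against your TDCH version's documentation.

```
cli> hadoop jar $TDCH_JAR com.teradata.connector.common.tool.ConnectorImportTool \
-classname com.teradata.jdbc.TeraDriver \
-url jdbc:teradata://192.168.229.130/DATABASE=dbc \
-username dbc -password dbc \
-sourcetable tables \
-jobtype hive -fileformat textfile \
-targettable default.tables_imported
```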






Output "n" part files instead of one.

Splitting a huge GB-sized file into smaller part files for easier access and export operations (hands-on example)

Comparing two BAG elements, or two fields when not in a BAG: DIFF()

Explanation:

DIFF takes two bags as arguments and compares them. Any tuples that are in one bag but not the other are returned. 

- If the fields are not bags, they are returned if they do not match,
- or an empty bag is returned if the two records match.

The implementation assumes that both bags being passed to this function will fit entirely into memory simultaneously. If that is not the case the UDF will still function, but it will be very slow.
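A minimal sketch of DIFF() in use. DIFF() is typically applied to the two bags produced by a COGROUP; the input files and field names below are hypothetical.

```pig
-- DIFF returns the tuples present in one bag but not the other
x = LOAD 'left.txt'  USING PigStorage(',') AS (id:int, val:chararray);
y = LOAD 'right.txt' USING PigStorage(',') AS (id:int, val:chararray);
g = COGROUP x BY id, y BY id;
d = FOREACH g GENERATE group, DIFF(x, y);
DUMP d;
```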

Counting records including nulls and excluding nulls: COUNT_STAR(), COUNT()


Pig COUNT() and COUNT_STAR() hands-on example
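A hedged sketch of the difference (the input file and field name are hypothetical): COUNT() skips tuples whose first field is null, while COUNT_STAR() counts every tuple.

```pig
-- hypothetical input: one number per line, some lines empty (null)
recs = LOAD 'nums.txt' USING PigStorage(',') AS (n:int);
grp  = GROUP recs ALL;
-- COUNT ignores null first fields; COUNT_STAR counts all tuples
counts = FOREACH grp GENERATE COUNT(recs.n), COUNT_STAR(recs.n);
DUMP counts;
```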

SUBSTRING() of a string in Pig

In Pig, SUBSTRING() positions are zero-based: a start position of zero with an end position of five fetches the first five characters, as in the example above.
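A minimal sketch of zero-based SUBSTRING() (input file and alias names are hypothetical):

```pig
words = LOAD 'words.txt' AS (line:chararray);
-- start index 0, end index 5: the first five characters of each line
first5 = FOREACH words GENERATE SUBSTRING(line, 0, 5);
DUMP first5;
```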

Split a string based on a specified delimiter: STRSPLIT()

PIG STRSPLIT() example

STRSPLIT() splits a string based on the parameters passed to it.
Explanation of STRSPLIT(line, ' ', 5):
a) it splits the record named "line", the first parameter passed to the function;
b) the split happens on the second parameter ' ' (space), the delimiter;
c) the third parameter is a limit, so "line" is split into at most 5 words on the space delimiter.
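A hedged sketch of the call described above (input file and alias names are hypothetical):

```pig
lines = LOAD 'sentences.txt' AS (line:chararray);
-- STRSPLIT returns a tuple of at most 5 fields, split on the space delimiter
parts = FOREACH lines GENERATE STRSPLIT(line, ' ', 5);
DUMP parts;
```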

Splitting a line into individual words (a bag of words) in Pig: TOKENIZE()

Pig TOKENIZE() outputs a bag of words: hands-on example
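A minimal sketch (input file and alias names are hypothetical). TOKENIZE() returns a bag of words; FLATTEN turns that bag into one row per word, the usual first step of a word count.

```pig
lines = LOAD 'text.txt' AS (line:chararray);
-- one output row per word in each input line
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
DUMP words;
```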

Splitting one file into multiple files based on conditions


In Pig, output a number of files based on conditions and prepare a view for easy access in Hive


$PIG_HOME/bin> ./pig -x local  /home/tradernyse/pigscripts/nysesplitout.pig

nysesplitout.pig script

x = load '/home/tradernyse/clientTraderDatasets_12216/nyseflumeprocessed.csv' using PigStorage(',') as (f1:chararray,f2:chararray,f3:chararray,f4:chararray,f5:chararray,f6:chararray,f7:chararray,f8:chararray,f9:float);

split x into x1 if f9 < 50, x12 if f9 >= 50 and f9 < 60, x23 if f9 >= 60 and f9 < 70, x34 if f9 >= 70 and f9 < 80, x45 if f9 >= 80 and f9 < 90, x5 if f9 >= 90;

store x1 into '/home/tradernyse/clientTraderready_12216/nyselessthan50' using PigStorage(',') ;

store x12 into '/home/tradernyse/clientTraderready_12216/nysebtw50_60' using PigStorage(',') ;

store x23 into '/home/tradernyse/clientTraderready_12216/nysebtw60_70' using PigStorage(',') ;

store x34 into '/home/tradernyse/clientTraderDatasets_12216/nysebtw70_80' using PigStorage(',') ;

store x45 into '/home/tradernyse/clientTraderDatasets_12216/nysebtw80_90' using PigStorage(',') ;

store x5 into '/home/tradernyse/clientTraderDatasets_12216/nysegreater90' using PigStorage(',') ;
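The heading above also mentions preparing a view in Hive for easy access. A hedged sketch, assuming the part files have been copied to a hypothetical HDFS location; the table, column and view names are illustrative only.

```sql
-- expose the comma-delimited part files to Hive as an external table
CREATE EXTERNAL TABLE nyse_lessthan50 (
  f1 STRING, f2 STRING, f3 STRING, f4 STRING,
  f5 STRING, f6 STRING, f7 STRING, f8 STRING,
  f9 FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/nyselessthan50';

-- a view over the columns of interest for easy access
CREATE VIEW nyse_under50_view AS
SELECT f1, f2, f9 FROM nyse_lessthan50;
```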

Counting the number of elements in a BAG or TUPLE: SIZE() in Pig

Using Pig SIZE() hands-on example
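A minimal sketch of SIZE() (input file and field names are hypothetical). Applied to a bag, SIZE() returns the number of tuples in it.

```pig
recs = LOAD 'data.txt' USING PigStorage(',') AS (f1:chararray, f2:chararray);
g = GROUP recs BY f1;
-- number of tuples in each group's bag
sizes = FOREACH g GENERATE group, SIZE(recs);
DUMP sizes;
```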

Excluding nulls and zero values: MAX(), SUM(), MIN(), AVG() in Pig

Pig hands-on example: excluding nulls and calculating aggregates
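A hedged sketch (the input file and field names are hypothetical): filter out nulls and zeros first, then aggregate per group.

```pig
recs = LOAD 'prices.txt' USING PigStorage(',') AS (sym:chararray, price:float);
-- drop null and zero prices before aggregating
clean = FILTER recs BY price IS NOT NULL AND price != 0.0F;
g = GROUP clean BY sym;
stats = FOREACH g GENERATE group, MIN(clean.price), MAX(clean.price), AVG(clean.price), SUM(clean.price);
DUMP stats;
```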

Binary condition (bincond) in Pig (if-then-else)
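A minimal sketch of Pig's bincond operator, condition ? value_if_true : value_if_false (the input file and field names are hypothetical):

```pig
recs = LOAD 'scores.txt' USING PigStorage(',') AS (name:chararray, score:int);
-- if-then-else: label each record based on its score
flagged = FOREACH recs GENERATE name, (score >= 50 ? 'pass' : 'fail') AS result;
DUMP flagged;
```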


Cleaning a data file that has braces, brackets or other symbols, replacing them with commas (save as a comma-delimited file)

Replacing the symbols (, ), {, }, \, @ with a specified delimiter in Pig (cleaning a data file into the desired format)


sample records: to clean
(7,98243),7,34842,09K Video,1,Allen
(11,77623),11,34843,J Case 1500,2,John
(19,88734),19,34857,DVD J INT,7,Hubert
(24,45641),24,34856,500 GB HD T,5,Roger
(47,92387),47,34854,J Power 300W,1,Cliff
(64,77624),64,34833,J Case 1501,17,Mello
(64,92387),64,34847,J Power 300W,4,Mello

cli>pig -x local
--loading the entire record as line of datatype chararray
--loading maxtemp* files in path /home/hadoop/datasets

grunt>fstep1 = load '/home/hadoop/datasets/maxtemp*' using PigStorage('\n') as (line:chararray);

grunt>r1 = foreach fstep1 generate REPLACE(line, '\\/', ',') as (line:chararray);
grunt>r2 = foreach r1 generate REPLACE(line, '\\(', ',') as (line:chararray);
grunt>r3 = foreach r2 generate REPLACE(line, '\\)', ',') as (line:chararray);
grunt>r4 = foreach r3 generate REPLACE(line, '\\{', ',') as (line:chararray);
grunt>r5 = foreach r4 generate REPLACE(line, '\\}', ',') as (line:chararray);

grunt>store  r5 into '/user/hadoop/pigresults' using PigStorage(',');