Options to Upload Data to Hadoop Distributed File System (HDFS) Using Oracle R Connector for Hadoop

APPLIES TO:

Oracle R Connector for Hadoop - Version 1.0 to 1.0 [Release 1.0]
Linux x86-64

PURPOSE

This document provides sample code showing how to upload data to the Hadoop Distributed File System (HDFS) from OS files, database tables, and ORE frames or R data frames using Oracle R Connector for Hadoop (ORCH).

Oracle R Connector for Hadoop provides an interface between a local R environment, Oracle Database, and the Hadoop Distributed File System (HDFS), allowing speed-of-thought, interactive analysis on all three platforms.

Oracle R Connector for Hadoop (ORCH) is designed to work independently, but if the enterprise data for your analysis is also stored in Oracle Database, the full power of this connector is achieved when it is used with Oracle R Enterprise (ORE).

REQUIREMENTS

Generic R console with ORE and ORCH packages installed

CONFIGURING

For more information on installing R, ORE, and ORCH on the client server, refer to Doc ID 1477347.1.

INSTRUCTIONS

Open the R command-line console. You can paste the contents of the *.R files into the R console or run them using the source command.
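
The two ways of running a script behave slightly differently: under source(), R does not auto-print the value of a bare top-level expression, which is why the samples wrap most calls in print(). The following is a minimal sketch of that difference using base R only; demo_script and its three-line content are stand-ins invented for illustration, not one of the *.R files in this note.

```r
# Write a tiny stand-in script to a temporary file (hypothetical content).
demo_script <- tempfile(fileext = ".R")
writeLines(c("x <- 1 + 1", "x", "print(x)"), demo_script)

# Plain source(): the bare `x` line is not printed, only print(x) is.
quiet_out <- capture.output(source(demo_script))

# echo = TRUE shows each statement and auto-prints values, much like
# pasting the script into the console.
echo_out <- capture.output(source(demo_script, echo = TRUE))

print(quiet_out)
print(length(echo_out) > length(quiet_out))
```

If a script appears to print less than expected when sourced, running it with echo = TRUE is one way to see every statement and its value.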

CAUTION

This sample code is provided for educational purposes only and is not supported by Oracle Support Services. It has been tested internally, however, and works as documented. We do not guarantee that it will work for you, so be sure to test it in your environment before relying on it.

Proofread this sample code before using it! Due to the differences in the way text editors, e-mail packages and operating systems handle text formatting (spaces, tabs and carriage returns), this sample code may not be in an executable state when you first receive it. Check over the sample code to ensure that errors of this type are corrected.

SAMPLE CODE

SAMPLE TO UPLOAD OS FILE TO HDFS

Here is sample code to upload a .dat file from the OS file system to HDFS.

Note: This sample code is executed as the oracle OS user on Oracle Big Data Appliance (BDA). If you intend to run it as a different OS user, set hdfs.setroot("/user/<OSUserName>") to point to that user's HDFS home directory.
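
As a sketch of the note above, the HDFS root can be derived from the OS user running the R session instead of being hard-coded. This assumes HDFS home directories follow the /user/<OSUserName> layout; os_user and hdfs_home are names chosen for illustration. The ORCH calls run only when ORCH is already loaded, so the snippet is safe to try in plain R.

```r
# Build the HDFS home directory for the OS user running this R session.
# Assumption: HDFS home directories follow the /user/<OSUserName> layout.
os_user   <- Sys.info()[["user"]]
hdfs_home <- paste("/user", os_user, sep = "/")
cat("HDFS root to use:", hdfs_home, "\n")

# Call ORCH only when it is loaded in this session.
if (exists("hdfs.setroot")) {
  hdfs.setroot(hdfs_home)
  print(hdfs.pwd())
}
```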

CUpload.R

cat("Using generic R and ORCH functions.\n")
cat("Check the current OS directory and list the contents ..\n")
print(getwd())
print(list.files())
cat("Create an OS directory ..\n")
dir.create("orchtest")
print(list.files())

cat("cd to the newly created directory ..\n")
setwd("orchtest")
print(getwd())

cat("cars is a sample data frame \n")
class(cars)
print(names(cars))

cat("write cars data frame to an OS File \n")
write.csv(cars, "cars_test.dat", row.names = FALSE)
print(list.files())

cat("Load ORCH library ...\n")
library(ORCH)

cat("Set root directory and list contents ...\n")
hdfs.setroot("/user/oracle")
print(hdfs.pwd())
print(hdfs.ls())

cat("Command to remove csample1 directory on HDFS ...\n")
hdfs.rmdir('csample1')

cat("Create a new csample1 directory on HDFS ...\n")
hdfs.mkdir('csample1', cd=T)
print(hdfs.pwd())

cat("Upload the dat file to HDFS ...\n")
irs.dfs_File <- hdfs.upload('cars_test.dat', dfs.name='cars_F', header=T)

cat("ORCH commands to check the file size and sample data ...\n")
print(hdfs.ls())
print(hdfs.size("cars_F"))
print(hdfs.parts("cars_F"))
print(hdfs.sample("cars_F",lines=3))
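
Before uploading, it can help to record what the local file contains so that the hdfs.size and hdfs.sample results can be compared against it. The following base R sketch writes to a temporary directory rather than orchtest; local_file, local_bytes, and local_lines are names chosen for illustration.

```r
# Write the built-in cars data frame the same way CUpload.R does.
local_file <- file.path(tempdir(), "cars_test.dat")
write.csv(cars, local_file, row.names = FALSE)

# Local facts to compare with the HDFS-side checks after the upload.
local_bytes <- file.info(local_file)$size
local_lines <- readLines(local_file)

cat("Local size in bytes:", local_bytes, "\n")
cat("Header line:", local_lines[1], "\n")
cat("Data rows:", length(local_lines) - 1, "\n")
```

A difference between the local byte count and the value reported by hdfs.size does not by itself indicate a failed upload; the header line, for example, may be handled separately when header=T is passed to hdfs.upload.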

SAMPLE TO UPLOAD OS FILE TO DATA FRAME AND THEN TO HDFS

Here is sample code to read a .dat file from the OS into a data frame and then upload that data frame to HDFS.

CUpload2.R

cat("Using generic R and ORCH functions.\n")
cat("Commands to cd to directory where the .dat/csv file resides ..\n")
getwd()
setwd("orchtest")
print(getwd())
print(list.files())

cat("Create data frame from OS File  \n")
dcars <- read.csv(file="cars_test.dat", header=TRUE, sep=",")
print(names(dcars))

cat("Load ORCH library ...\n")
library(ORCH)

cat("Set root directory and list contents ...\n")
hdfs.setroot("/user/oracle")
print(hdfs.pwd())
print(hdfs.ls())

cat("Command to remove csample2 directory on HDFS ...\n")
hdfs.rmdir('csample2')

cat("Create a new csample2 directory on HDFS ...\n")
hdfs.mkdir('csample2', cd=T)
print(hdfs.pwd())

cat("Upload Data Frame to HDFS ...\n")
myfile <- hdfs.put(dcars, dfs.name='cars_F2')

cat("ORCH commands to check the file size and sample data ...\n")
print(hdfs.exists("cars_F2"))
print(hdfs.size("cars_F2"))
print(hdfs.parts("cars_F2"))
print(hdfs.sample("cars_F2",lines=3))

The data frame (dcars) created from the OS file can also be used with ore.create to create a table in the database. Refer to the Oracle R Enterprise User's Guide for sample code.
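
Because the data frame is rebuilt from a file, a quick round-trip check can confirm that nothing changed between writing and reading before the data is sent on to HDFS or the database. This is a base R sketch only; csv_path and dcars_check are illustrative names, and the file is written to a temporary directory rather than orchtest.

```r
# Round-trip the built-in cars data frame through a CSV file, as
# CUpload.R (write) and CUpload2.R (read) do between them.
csv_path <- file.path(tempdir(), "cars_roundtrip.dat")
write.csv(cars, csv_path, row.names = FALSE)
dcars_check <- read.csv(file = csv_path, header = TRUE, sep = ",")

# Compare shape, column names, and values with the original.
same_shape  <- identical(dim(dcars_check), dim(cars))
same_names  <- identical(names(dcars_check), names(cars))
same_values <- all(dcars_check$speed == cars$speed) &&
               all(dcars_check$dist == cars$dist)
print(c(shape = same_shape, names = same_names, values = same_values))
```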

SAMPLE TO UPLOAD DATA FRAME TO HDFS

Here is the code to upload a Data Frame to HDFS.

DUpload.R

cat("Using generic R and ORCH functions.\n")
cat("cars is a sample data frame \n")
class(cars)

cat("Load ORCH library ...\n")
library(ORCH)

cat("Set root directory and list contents ...\n")
hdfs.setroot("/user/oracle")
print(hdfs.pwd())
print(hdfs.ls())

cat("Command to remove csample3 directory on HDFS ...\n")
hdfs.rmdir('csample3')

cat("Create a new csample3 directory on HDFS ...\n")
hdfs.mkdir('csample3', cd=T)
print(hdfs.pwd())

cat("Upload Data Frame to HDFS ...\n")
myfile <- hdfs.put(cars, dfs.name='cars_D')

cat("ORCH commands to check the file size and sample data ...\n")
print(hdfs.ls())
print(hdfs.size("cars_D"))
print(hdfs.parts("cars_D"))
print(hdfs.sample("cars_D",lines=3))

SAMPLE TO UPLOAD DATABASE TABLE TO HDFS

Here is sample code to create an ORE frame from a database table and then upload that ORE frame to HDFS.

This sample uses Oracle R Enterprise functions along with generic R and ORCH functions.

For sample code on how to create database tables from R data frames using ore.create, refer to the Oracle R Enterprise User's Guide.

Refer to Doc ID 1490291.1 for sample code on how to create the table (DF_TABLE) used in this sample.

Modify dbsid, dbhost, port, and RQPASS to match your environment. RQUSER is the user created using demo_user.sh as part of the ORE Server install. The username and password may differ in your environment.
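
One way to keep these environment-specific values out of the script is to read them from environment variables. The sketch below is only an illustration: the variable names (ORE_USER, ORE_DBSID, ORE_DBHOST, ORE_PASSWORD, ORE_PORT) and the fallback host are hypothetical, and the ore.connect call simply reuses the positional argument order shown in TUpload.R. It runs in plain R and attempts a connection only when ORE is loaded and every setting is present.

```r
# Hypothetical environment variable names; adjust to your own conventions.
conn <- list(
  user     = Sys.getenv("ORE_USER", "RQUSER"),
  sid      = Sys.getenv("ORE_DBSID", "orcl"),
  host     = Sys.getenv("ORE_DBHOST", "localhost"),
  password = Sys.getenv("ORE_PASSWORD", ""),
  port     = as.integer(Sys.getenv("ORE_PORT", "1521"))
)

# Report which settings are still empty before trying to connect.
missing <- names(conn)[!nzchar(vapply(conn, as.character, character(1)))]
if (length(missing) > 0) {
  cat("Missing settings:", paste(missing, collapse = ", "), "\n")
} else if (exists("ore.connect")) {
  # Same argument order as in TUpload.R: user, sid, host, password, port.
  ore.connect(conn$user, conn$sid, conn$host, conn$password, conn$port)
}
```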

Also, when executing this script in an ORE Server environment, uncomment the .libPaths line and change <ORACLE_HOME> to the absolute path of the Oracle home. ORE Server installs the required R libraries/packages in $ORACLE_HOME/R/library, whereas ORE Client installs them in $R_HOME/library.
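
As an alternative to editing the script for each environment, the library path can be added conditionally. This sketch assumes the ORACLE_HOME environment variable is set in the R session on the ORE Server; oracle_home and ore_lib are illustrative names. On a client where the directory does not exist, the library search path is left unchanged.

```r
# Assumption: ORACLE_HOME is set in the environment of the R session.
oracle_home <- Sys.getenv("ORACLE_HOME")
ore_lib     <- file.path(oracle_home, "R", "library")

# Prepend the ORE Server library only when the directory exists, so the
# same script also runs unchanged on an ORE Client.
if (nzchar(oracle_home) && isTRUE(file.info(ore_lib)$isdir)) {
  .libPaths(c(ore_lib, .libPaths()))
}
print(.libPaths())
```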

TUpload.R

cat("Using generic R, ORE and ORCH functions.\n")
cat("Load ORE and connect.\n")
# .libPaths("<ORACLE_HOME>/R/library")
library(ORE)
ore.connect("RQUSER","<dbsid>","<dbhost>","RQPASS", <port>)
ore.sync()
ore.attach()
cat("List the tables in RQUSER schema.\n")
print(ore.ls())

cat("Load ORCH and connect.\n")
library(ORCH)
orch.connect("<dbhost>","RQUSER","<dbsid>","RQPASS", <port> , secure=F)

cat("Set root directory and list contents ...\n")
hdfs.setroot("/user/oracle")
print(hdfs.pwd())
print(hdfs.ls())

cat("Command to remove csample4 directory on HDFS ...\n")
hdfs.rmdir('csample4')

cat("Create a new csample4 directory on HDFS ...\n")
hdfs.mkdir('csample4')
hdfs.cd('csample4')
print(hdfs.pwd())

cat("Create ORE Frame for DF_TABLE \n")
df_t <- DF_TABLE
print(class(df_t))
print(names(df_t))

cat("Upload ORE Frame to HDFS .. \n")
df.dfs <-  hdfs.push(df_t,  dfs.name='df_T', split.by="A")

cat("ORCH commands to check the file size and sample data ...\n")
print(hdfs.exists("df_T"))
print(hdfs.size("df_T"))
print(hdfs.parts("df_T"))
print(hdfs.sample("df_T",lines=3))

SAMPLE OUTPUT

Sample Output of Uploading OS file to HDFS

Open the R command-line console. You can paste the contents of CUpload.R into the R console or run it using the source command.

> dir()
[1] "CUpload.R"
> source("CUpload.R")
Using generic R and ORCH functions.
Check the current OS directory and list the contents ..
[1] "/refresh/home/RTest"
[1] "CUpload.R"
Create an OS directory ..
[1] "CUpload.R" "orchtest" 
cd to the newly created directory ..
[1] "/refresh/home/RTest/orchtest"
cars is a sample data frame 
[1] "speed" "dist" 
write cars data frame to an OS File 
[1] "cars_test.dat"
Load ORCH library ...
Oracle R Connector for Hadoop 0.1.8 (rev. 104)
Hadoop 0.20.2-cdh3u4 is up
Sqoop 1.3.0-cdh3u4 is up
Loading required package: ROracle
Loading required package: DBI

Attaching package: 'OREbase'

The following object(s) are masked from 'package:base':

    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table

Loading required package: MASS
Set root directory and list contents ...
[1] "/user/oracle/RTest"
[1] "csample1"
Command to remove csample1 directory on HDFS ...
Create a new csample1 directory on HDFS ...
[1] "/user/oracle/RTest/csample1"
Upload the dat file to HDFS ...
ORCH commands to check the file size and sample data ...
[1] "cars_F"
[1] 293
[1] 1
  val1 val2
1   24   93
2   24  120
3   25   85

Sample Output of Uploading OS file to Data Frame and then to HDFS

Open the R command-line console. You can paste the contents of CUpload2.R into the R console or run it using the source command.

> dir()
[1] "CUpload2.R" "CUpload.R"  "orchtest"  
> source("CUpload2.R")
Using generic R and ORCH functions.
Commands to cd to directory where the .dat/csv file resides ..
[1] "/refresh/home/RTest/orchtest"
[1] "cars_test.dat"
Create data frame from OS File  
[1] "speed" "dist" 
Load ORCH library ...
Oracle R Connector for Hadoop 0.1.8 (rev. 104)
Hadoop 0.20.2-cdh3u4 is up
Sqoop 1.3.0-cdh3u4 is up
Loading required package: ROracle
Loading required package: DBI

Attaching package: 'OREbase'

The following object(s) are masked from 'package:base':

    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table

Loading required package: MASS
Set root directory and list contents ...
[1] "/user/oracle/RTest"
[1] "csample1" "csample2"
Command to remove csample2 directory on HDFS ...
Create a new csample2 directory on HDFS ...
[1] "/user/oracle/RTest/csample2"
Upload Data Frame to HDFS ...
ORCH commands to check the file size and sample data ...
[1] TRUE
[1] 343
[1] 1
  speed dist
1    24   93
2    24  120
3    25   85

Sample Output of Uploading Data Frame to HDFS

Open the R command-line console. You can paste the contents of DUpload.R into the R console or run it using the source command.

> dir()
[1] "CUpload2.R" "CUpload.R"  "DUpload.R"  "orchtest"  
> source("DUpload.R")
Using generic R and ORCH functions.
cars is a sample data frame 
Load ORCH library ...
Oracle R Connector for Hadoop 0.1.8 (rev. 104)
Hadoop 0.20.2-cdh3u4 is up
Sqoop 1.3.0-cdh3u4 is up
Loading required package: ROracle
Loading required package: DBI

Attaching package: 'OREbase'

The following object(s) are masked from 'package:base':

    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table

Loading required package: MASS
Set root directory and list contents ...
[1] "/user/oracle/RTest"
[1] "csample1" "csample2" "csample3"
Command to remove csample3 directory on HDFS ...
DBG: 21:54:29 [ER] failed to remove "/user/oracle/RTest/csample3"
Create a new csample3 directory on HDFS ...
[1] "/user/oracle/RTest/csample3"
Upload Data Frame to HDFS ...
ORCH commands to check the file size and sample data ...
[1] "cars_D"
[1] 343
[1] 1
  speed dist
1    24   93
2    24  120
3    25   85

Sample Output of Uploading Database Table to HDFS

Open the R command-line console. You can paste the contents of TUpload.R into the R console or run it using the source command.

> dir()
[1] "CTab.R"        "CUpload2.R"    "CUpload.R"     "DUpload.R"    
[5] "orchtest"      "TUpload1.R"    "TUpload.R"     "TUpload.R.old"
> source("TUpload.R")
Using generic R, ORE and ORCH functions.
Load ORE and connect.
Loading required package: OREbase
Loading required package: ROracle
Loading required package: DBI

Attaching package: 'OREbase'

The following object(s) are masked from 'package:base':

    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table

Loading required package: OREstats
Loading required package: MASS
Loading required package: OREgraphics
Loading required package: OREeda
Loading required package: ORExml
List the tables in RQUSER schema.
[1] "CARS_TABLE"   "CARS_VTAB"    "CARS_VTAB1"   "DF_TABLE"     "IRIS_TABLE"  
[6] "ONTIME_S"     "ONTIME_S2000" "WADERS_TABLE"
Load ORCH and connect.
Oracle R Connector for Hadoop 0.1.8 (rev. 104)
Hadoop 0.20.2-cdh3u4 is up
Sqoop 1.3.0-cdh3u4 is up
Connecting ORCH to RDBMS via [sqoop]
    Host: celvpint0603
    Port: 1521
    SID:  orcl
    User: RQUSER
Connected.
Set root directory and list contents ...
[1] "/user/oracle/RTest"
[1] "csample1" "csample2" "csample3" "csample4"
Command to remove csample4 directory on HDFS ...
Create a new csample4 directory on HDFS ...
[1] "/user/oracle/RTest/csample4"
Create ORE Frame for DF_TABLE 
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
[1] "A" "B"
Upload ORE Frame to HDFS .. 
ORCH commands to check the file size and sample data ...
[1] TRUE
[1] 121
[1] 4
    A B
1 13 m
2 26 z
3  7 g
>