Hadoop Frequently Asked Questions

This document provides answers to frequently asked questions about Hadoop as distributed by Cloudera for use on the Oracle Big Data Appliance (BDA).

QUESTIONS AND ANSWERS

Is the environment variable $HADOOP_HOME used in CDH 4.1.2?

On BDA V2.0.1 with CDH 4.1.2, $HADOOP_HOME has been deprecated. It is good practice to unset it if it was previously set.
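
For example, to clear it in a Bourne-style shell such as bash:

## remove the deprecated variable from the current session
$ unset HADOOP_HOME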

In lieu of the environment variable $HADOOP_HOME, what should be used in CDH 4.1.2?

On BDA V2.0.1 with CDH 4.1.2, use $HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce.
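
For example, to set it for the current session:

## point clients and scripts at the MRv1 (MapReduce v1) installation
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce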

Should OS disks (/dev/sda, /dev/sdb) be used to store local data? HDFS data?

No, this is not recommended.

How can data on the OS disks be cleaned up, since storing it there is not recommended?

Simply delete the data and it will automatically be cleaned up on the mirrored disk as well.

Does the Cloudera CDH Client have to be installed on all Exadata DB nodes?

If using Oracle SQL Connector for HDFS then yes, the CDH Client needs to be installed on all Exadata DB nodes. If using Oracle Loader for Hadoop then the CDH Client does not have to be installed on all Exadata DB nodes.

If a disk goes bad and is replaced, can you verify the disk is functional with regard to HDFS?

If a disk goes bad and is replaced, you can verify that the disk is functional with regard to the local file system by doing something like:

## copy a file to filesystem on that disk
# cp /<path>/<file> /u03/
# sync

## check for differences between the 2 files
# echo Checking copied file.
# diff /<path>/<file> /u03/<file>

## copy the file back and check for differences
# cp /u03/<file> /tmp/<file>
# sync
# echo Checking file after copying back.
# diff /<path>/<file> /tmp/<file>

There is no similar sequence which can be done to verify the disk is functional with regard to HDFS. That type of functionality is built into HDFS: each block carries a checksum, which is read back over time by the block scanner on each DataNode and repaired via re-copy of a healthy replica as needed. Nothing else needs to be done to verify the disk is functional with regard to HDFS beyond checking that the disk is writable at the OS level as described above.
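
If an overall report of block health from HDFS's point of view is desired, fsck can be run against the namespace (a quick sketch; scanning / covers the whole filesystem, and the report lists corrupt, missing, or under-replicated blocks):

## report on the health of all blocks in HDFS
$ hdfs fsck /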

If one of the services managed by Cloudera Manager (CM) goes into "BAD" health, is there a recommended order for checking the status of services?

Yes. If any of the services goes into "BAD" health in CM, check the status of the services in the order below:

1) First check the Zookeeper service status. If Zookeeper is in "BAD" health then the cluster will not be stable. The Zookeeper service will need to be fixed prior to fixing any other service. Check the Zookeeper logs for additional details on Zookeeper status (a quick command-line check is shown after this list).

If the Zookeeper service is in "Good" health then continue to (2).

2) Check the status of the HDFS service.
a) First check the Failover Controller status. If the Failover Controller service is in "BAD" health then check the log files.

b) Check the NameNode service status. If the NameNode service is in "BAD" health then check the NameNode logs.

If the Zookeeper and HDFS services are in "Good" health then continue to (3).

3) Check the logs of the service that is bad.  
You can upload the output from "bdadiag" to an Oracle Support SR for review.
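
As a quick command-line sanity check to complement the CM status pages, something like the following can be used. The ZooKeeper port (2181) is the default, and the NameNode service IDs (nn1, nn2) are placeholders; the actual IDs come from the dfs.ha.namenodes.* property in hdfs-site.xml:

## ask a ZooKeeper server whether it is healthy (a healthy server replies "imok")
$ echo ruok | nc <zookeeper_host> 2181

## check which NameNode is active and which is standby
$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -getServiceState nn2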

If the nodes of the BDA cluster have been up for close to 200 days, is a reboot recommended?

Yes. In BDA versions V2.0.1 through V2.2.0, all nodes approaching 200 days of uptime need a reboot.

Generally, first reboot the node where the standby NameNode resides and make sure it is healthy. Once the standby NameNode is healthy, manually fail over the active NameNode to the standby. After the active NameNode has switched to standby, reboot that node.
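
A sketch of the failover step, assuming the HA service IDs are nn1 (currently active) and nn2 (standby); check the dfs.ha.namenodes.* property in hdfs-site.xml for the actual IDs:

## check how long the node has been up
$ uptime

## gracefully fail over from the active NameNode (nn1) to the standby (nn2)
$ hdfs haadmin -failover nn1 nn2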

Can you decommission non-critical nodes from the BDA HDFS cluster in order to install NoSQL?

Removing or decommissioning nodes from a deployed HDFS cluster is not currently supported on the BDA.

For HA testing is it possible to relocate Hive services to a different node after a Hive node failure?

Migration of Hadoop roles (JT, NN, Hive, ZK, etc.) is not currently supported on the BDA. For now you need to stick to the layout of services provided. The software checks will start reporting errors when you move Hadoop roles controlled by Mammoth to different locations.

What options are available for migrating service roles on the BDA?

Since the documentation on Oracle Big Data Appliance Restrictions on Use states that migration of Hadoop roles (NN, Hive, ZK, etc.) is not currently supported on the BDA, what options are available for migrating service roles on the BDA?

The BDA is a fully top-to-bottom supported Hadoop appliance. We do not support arbitrary movement of specific Hadoop roles, since this may result in configurations that are not supportable (e.g. both NameNodes on the same host) or that are less than optimal in terms of performance.

However, our goal is to support enough flexibility to meet most requirements. Clearly there are situations where it is necessary to be able to move master roles off of a particular node:

- when adding a new rack to an existing cluster
- in case of catastrophic failure of a server
- for scheduled maintenance of a rack or of particular servers

We are working on adding the ability in Mammoth to move all master roles (NameNode, JournalNode, MySQL, Cloudera Manager, etc.) off a particular server and onto another server (which was previously a regular slave node). The previous master node would become a regular slave node (DataNode + NodeManager) if it was still up. We believe that this ability will support the three cases listed above (and in the case of adding a new rack to an existing cluster we will automatically distribute the 4 master nodes between the 2 racks). This functionality is planned for a future release.


What are the options for destroying, i.e. performing a non-recoverable delete of, all the data stored on the DataNodes in HDFS?

The fastest way to delete all HDFS data in a Hadoop cluster is to run:

$ hadoop fs -rm -R -skipTrash "/*"

This will remove all HDFS data and bypass the trash, so the deletes are finalized. The NameNode will still have work to do, in that it will have to purge the blocks on all DataNodes after a short period of time.
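
To watch the space actually being reclaimed (block deletion on the DataNodes happens asynchronously after the namespace delete), the cluster-wide usage report can be polled; a quick sketch:

## "DFS Used" should drop toward zero as the DataNodes purge their blocks
$ hdfs dfsadmin -report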

When destroying HDFS data, is there an option for replacing the data blocks on all DataNodes with some random pattern of bytes (0s/1s or something else)? In other words, is there a way to securely delete sensitive data from HDFS by overwriting the physical disk locations with new data, i.e. with randomly generated output?

No, this would be considered a "secure wipe" and this functionality is not present in HDFS.

Running a very long reducer seems to be filling one DataNode. Why would that be?

Reducers write their output to HDFS through the DataNode on which they are running; by default the first replica of each block is placed on that local DataNode, which causes growth on that particular node. The NameNode will have the remaining replicas of those blocks placed on other DataNodes, but the original local replicas remain there.
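
If one DataNode has grown noticeably fuller than the rest, the HDFS balancer can redistribute blocks across the cluster. A quick sketch; the threshold is the allowed deviation, in percent, of each DataNode's utilization from the cluster average:

## move blocks until every DataNode is within 10% of the average utilization
$ hdfs balancer -threshold 10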