Friday, 20 February 2015

Debugging common Pig issues


Configuration Issues

➢  Using an old version of Pig to run the script (0.6 vs 0.7).

Q. How do we find out the version?

A. Pick one of the failed M/R jobs. Look at the jobconf variable “java.class.path” to find the Pig release, and check the version on the command line.
Fix: pig -useversion 0.7 …
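For example, a quick way to confirm which release is in use and to rerun with the newer one (the script name below is hypothetical):
    # on the gateway, check which Pig release is on the PATH
    pig -version
    # in the failed job's jobconf, java.class.path will contain a path such as .../pig-0.6/... for an old release
    # rerun the script against the newer release
    pig -useversion 0.7 myscript.pig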

➢  Job fails to start - too many mappers/reducers

Q: How do we know this?

A: Find the failed job-id from the web UI and grep for it in the JobTracker (JT) logs. The exception tells us the problem.
Fix: Set a higher block/split size (see the sketch below)
          pig -useversion 0.7 -Dmapred.min.split.size=…..
          pig -useversion 0.6 -Dpig.overrideBlockSize=…..
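A minimal sketch of the fix, assuming a hypothetical script myscript.pig and an illustrative 1 GB split size (tune the value to the actual input size and cluster limits):
    # Pig 0.7: raise the minimum split size so fewer map tasks are created
    pig -useversion 0.7 -Dmapred.min.split.size=1073741824 myscript.pig
    # Pig 0.6: the equivalent knob mentioned above
    pig -useversion 0.6 -Dpig.overrideBlockSize=1073741824 myscript.pig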

➢  Mappers/reducers failing with OOM (OutOfMemoryError).

Q: How do we know this is the case?

A: Pick one of the failed M/R jobs. Look at a failed task's stdout/stderr logs to see the stack trace.
Fix: pig -Dmapred.job.(map/reduce).memory.mb=[size in MB]
              -Dmapred.(map/reduce).child.java.opts=“[jvm options]”
Note:-
  1. The heap size in “*.java.opts” should be at least 200 MB less than “*.memory.mb”
  2. Check the cluster config to find the default slot config for “*.memory.mb” – (gateway)$ cat $HADOOP_CONF_DIR/mapred-site.xml | grep -A1 "memory.mb"
Giving higher memory might only be a temporary solution.
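A sketch for a map-side OOM, with illustrative sizes (a 3 GB slot and a 2.5 GB heap, keeping the headroom described in the note above); the script name is hypothetical:
    # request bigger map slots and a correspondingly smaller JVM heap
    pig -Dmapred.job.map.memory.mb=3072 \
        -Dmapred.map.child.java.opts="-Xmx2560m" \
        myscript.pig
    # check the cluster defaults on the gateway
    (gateway)$ cat $HADOOP_CONF_DIR/mapred-site.xml | grep -A1 "memory.mb"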




Script Issues

➢  Using an unsuitable join – FR (fragment-replicate) join

Q. How do we know if this is the problem?

A. The mappers/reducers executing the join fail with OOM. The failing job can be correlated to the join portion of the script using “explain”.
Fix: Use a regular join (see the sketch below). For an FR join, the input data of the “right” relation should be less than 150 MB.
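A Pig Latin sketch of both forms, with hypothetical relation names; in the FR (replicated) join the relation listed last is loaded into memory on every mapper:
    -- FR join: 'small' must comfortably fit in memory (input data well under the limit above)
    J = JOIN big BY key, small BY key USING 'replicated';
    -- regular (reduce-side) join: use this when the smaller input is too large to replicate
    J = JOIN big BY key, small BY key;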

➢  Failing to specify the number of reducers

Q. What happens in this case?

A. A single reducer runs for many hours and receives GBs of data. The job sometimes fails with the reducer running out of memory.
Fix: Set “default_parallel” in the script. Tuning the PARALLEL value for individual M/R-boundary operators is recommended (see the sketch below).
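A Pig Latin sketch, with hypothetical relation names and illustrative parallelism values:
    -- script-wide default number of reducers
    SET default_parallel 50;
    -- override for an individual M/R-boundary operator that is known to be heavy
    G = GROUP logs BY user PARALLEL 200;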

➢  UDF issues - bug in the Java code

Q. What indicates this?

A. The exception stack trace in the stdout/stderr logs of a failed task from the last failed job of the script.
Fix. Code inspection and adding System.out debugging statements to nail down the problem. This requires help from the user.
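Before diving into the Java source, it often helps to reproduce the failure on a small sample in local mode, where the UDF's System.out output is visible on the console. A sketch, with hypothetical file, relation, field, and UDF names:
    -- run with: pig -x local sample.pig
    REGISTER /local/path/to/myudfs.jar;
    raw   = LOAD 'sample_input.txt' AS (field:chararray);
    small = LIMIT raw 100;
    out   = FOREACH small GENERATE com.example.MyUdf(field);
    DUMP out;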

➢  UDF issues - failing to ship dependent files/jars loaded in the UDF

Q. What happens in this case?

A. The M/R job fails with mappers/reducers throwing an IOException or a FileNotFoundException.
Fix. pig -Dpig.additional.jars=/local/path/to/jar
             -Dmapred.cache.files=hdfs://path/to/file#symlink2
             -Dmapred.create.symlink=yes
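A sketch combining the three properties above, with hypothetical jar, file, and script paths:
    pig -Dpig.additional.jars=/home/me/udfs/myudfs.jar \
        -Dmapred.cache.files=hdfs:///user/me/lookup.dat#lookup \
        -Dmapred.create.symlink=yes \
        myscript.pig
Or, the jar alone can be shipped from within the Pig script itself:
    REGISTER /home/me/udfs/myudfs.jar;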



Data Issues

➢  Data is corrupt

Q. How can this be inferred and validated?

A. A very small number (typically one) of mappers fail, leading to job failure and finally script failure.
Fix.
            1.  Get the split info of the failed mapper from the web UI
            2.  Log on to the machine having the split
            3.  Grep the datanode logs for the failed map attempt-id to get the ‘block-id’ info
            4.  On the machine, run fsck to get the file information (illustrated below):
                    hadoop fsck <input-path-in-pig-script> -files -blocks | grep -B1 "block-id"
            5.  Run the script with only this path as input to see it fail.
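An illustrative walk-through of steps 4 and 5, with hypothetical paths and block-id (the real values come from the web UI and datanode logs above); step 5 assumes the script takes its input path as a parameter:
    # step 4: confirm which block/file is bad
    hadoop fsck /user/me/input/part-00042 -files -blocks | grep -B1 "blk_1234567890"
    # step 5: run the script against only the suspect file
    pig -param INPUT=/user/me/input/part-00042 myscript.pig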
