Configuration Issues
• Using an old version of Pig (0.6 vs. 0.7).
Q. How do we find out the version?
A. Pick one of the failed M/R jobs. Look at the jobconf variable "java.class.path" to see which Pig release it ran with, and check the version used on the command line.
Fix: pig -useversion 0.7 …
• Job fails to start - too many mappers/reducers.
Q: How do we know this?
A: Find the failed job-id from the web UI and grep for it in the JobTracker logs; the exception tells us the problem.
Fix: Set a higher block/split size:
pig -useversion 0.7 -Dmapred.min.split.size=…
pig -useversion 0.6 -Dpig.overrideBlockSize=…
• Mappers/reducers failing with OOM.
Q: How do we know this is the case?
A: Pick a failed M/R job and look at a failed task's stdout/stderr logs to see the stack trace.
Fix: pig -Dmapred.job.(map/reduce).memory.mb=[size in MB]
-Dmapred.(map/reduce).child.java.opts="[jvm options]"
Note:
- "*.java.opts" should be at least 200 MB less than "*.memory.mb".
- Check the cluster config to find out the default slot config for "*.memory.mb":
(gateway)$ cat $HADOOP_CONF_DIR/mapred-site.xml | grep -A1 "memory.mb"
- Giving higher memory might only be a temporary solution.
Script Issues
• Using an unsuitable join - FR (fragment-replicate) join.
Q. How do we know if this is the problem?
A. Mappers/reducers executing the join fail with OOM. The failing job can be correlated to the join portion of the script using "explain".
Fix: Use a regular join, as in the sketch below. For an FR join, the input data of the "right" relation should be less than 150 MB.
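A minimal Pig Latin sketch of the two forms; the relation and field names (big, small, id) are made up for illustration. The replicated join loads every relation after the first into each mapper's memory, which is what blows up once that input grows past the ~150 MB guideline above:
big   = LOAD 'big_input'   AS (id:int, val:chararray);
small = LOAD 'small_input' AS (id:int, name:chararray);
-- FR (replicated) join: 'small' is loaded into memory on every mapper
j_fr  = JOIN big BY id, small BY id USING 'replicated';
-- Regular reduce-side join: no in-memory copy, works for any input size
j_reg = JOIN big BY id, small BY id;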
• Failing to specify the number of reducers.
Q. What happens in this case?
A. A single reducer runs for many hours and receives GBs of data. The job sometimes fails with the reducer running OOM.
Fix: Set "default_parallel" in the script. Tuning the PARALLEL value for individual M/R boundary operators (GROUP, JOIN, ORDER, etc.) is recommended; see the sketch below.
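A minimal sketch of both approaches; the relation and field names are made up, and the reducer counts are placeholders that should be sized against the actual data:
SET default_parallel 20;                     -- script-wide default number of reducers
logs    = LOAD 'logs' AS (user:chararray, bytes:long);
grouped = GROUP logs BY user PARALLEL 100;   -- per-operator override for the heavy GROUP
totals  = FOREACH grouped GENERATE group, SUM(logs.bytes) AS total;
STORE totals INTO 'totals_out';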
• UDF issues - bug in the Java code.
Q. What indicates this?
A. The exception stack trace in the stdout/stderr logs of a failed task from the last failed job of the script.
Fix. Code inspection and adding System.out debugging statements to nail down the problem; this requires help from the user. Reproducing the failure on a small sample (see below) keeps the debug cycle short.
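One hedged way to do that, assuming a hypothetical UDF class com.example.MyUdf: cut the input down with LIMIT so the run is quick and the System.out/stack-trace output lands in only a handful of task logs.
REGISTER /local/path/to/udfs.jar;            -- jar containing the suspect UDF
DEFINE MyUdf com.example.MyUdf();            -- hypothetical UDF class
raw    = LOAD 'input' AS (line:chararray);
sample = LIMIT raw 1000;                     -- small slice keeps the run short and the logs small
out    = FOREACH sample GENERATE MyUdf(line);
DUMP out;                                    -- fails (or not) quickly, with the trace easy to find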
• UDF issues - failing to ship dependent files/jars loaded in the UDF.
Q. What happens in this case?
A. The M/R job fails with mappers/reducers throwing an IOException or a FileNotFoundException.
Fix. pig -Dpig.additional.jars=/local/path/to/jar
-Dmapred.cache.files=hdfs://path/to/file#symlink2
-Dmapred.create.symlink=yes
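Alternatively, the jar can be shipped from inside the script with REGISTER, so the flag is not forgotten on the next invocation. A sketch, with placeholder paths and a hypothetical UDF class; the side file still has to be distributed via the -Dmapred.cache.files / -Dmapred.create.symlink options above, and the UDF then opens it by its symlink name ('symlink2') as a local file:
REGISTER /local/path/to/udfs.jar;            -- jar is shipped to every task automatically
DEFINE LookupUdf com.example.LookupUdf();    -- hypothetical UDF that reads 'symlink2' at runtime
raw = LOAD 'input' AS (key:chararray);
out = FOREACH raw GENERATE LookupUdf(key);
STORE out INTO 'output';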
Data is corrupt
Q. How can this be inferred and validated?
A. A very small number (typically 1) of mappers fail, causing the job and finally the whole script to fail.
Fix.
1. Get the split info of the failed mapper from the web UI.
2. Log on to the machine holding the split.
3. Grep the datanode logs for the failed map attempt-id to get the block-id.
4. On that machine, run fsck to map the block back to a file:
hadoop fsck <input-path-in-pig-script> -files -blocks | grep -B1 "block-id"
5. Run the script with only this path as input to see it fail, as in the sketch below.
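A minimal sketch of step 5, with a made-up file name standing in for the path recovered from fsck: point a LOAD at only the suspect file and force a read.
-- hypothetical path: substitute the exact file reported by fsck for the bad block
bad = LOAD '/projects/foo/input/part-00042' AS (line:chararray);
DUMP bad;                                    -- should hit the corrupt block and fail in isolation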