-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathDesignNotes
More file actions
98 lines (69 loc) · 3.73 KB
/
DesignNotes
File metadata and controls
98 lines (69 loc) · 3.73 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
JobConf conf = new JobConf(WordCount.class);
40. conf.setJobName("wordcount");
41.
42. conf.setOutputKeyClass(Text.class);
43. conf.setOutputValueClass(IntWritable.class);
44.
45. conf.setMapperClass(Map.class);
46. conf.setCombinerClass(Reduce.class);
47. conf.setReducerClass(Reduce.class);
48.
49. conf.setInputFormat(TextInputFormat.class);
50. conf.setOutputFormat(TextOutputFormat.class);
51.
52. FileInputFormat.setInputPaths(conf, new Path(args[0]));
53. FileOutputFormat.setOutputPath(conf, new Path(args[1]));
54.
55. JobClient.runJob(conf);
//TODO: Abhi
(Search for this that needs to be implemented)
Split the data -> so that we can work on it independently and
then aggregate the results that we have (sequential)
Put similar things in a common bucket. (Sorted by KEYS)
We need to make sure that BAD input is IGNORED (robutsness is important)
we do not want to run for long time and fuck up for BAD.
Experienece with BATCH processing at SAGITEC
-- Components of this system --
Files/Data - Some input that we need to process/chew upon
System Orchestrator - Accepting input/requests from USER to start some jobs/data crunching work
Job Tracker - Submit the job to this guy - it divides tasks into mappers and reducers (Overall distrubuted Monitoring of the Work)
(Distributed Co-ordinator)
Task Tracker (running as daemon on each of the nodes to do some work)
(Local monitoring of work on each Node) (Local Co-ordinator)
Worker-MapComputation (actually doing the work)
Worker-ReduceComputation (actually doing the work of Reduce)
Polling mechanism needs for the Job's Progress Reporting
Tasks of the Framework:
- Scheduling the Tasks
- Monitorting the Progress
- Re-exeuting the failed tasks
- Handling failure of the nodes
Anatomy of a MapReduce Job Run
Job Submission
Job Initialization
Task Assignment
Task Execution
Progress and Status Updates
Job Completion
Failures
Task Failure
Tasktracker Failure
Jobtracker Failure
Job Scheduling
The Fair Scheduler
Shuffle and Sort
The Map Side
The Reduce Side
Skipping Bad Records
Very Excellent Summary
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System
(see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node.
The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks.
The slaves execute the tasks as directed by the master.
Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate
interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration.
The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which then assumes the
responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.