Are only the initial input and final output stored in HDFS?


1 Answer(s)


Hi Keerthi,

In a real production environment, the data flow in the MapReduce framework looks like this:
1. The input file is stored on HDFS.
2. The map tasks process the data and write their output to the local filesystem.
[If it were stored on HDFS, it would be replicated to three nodes (the default replication factor), which adds storage overhead for what is only a temporary, intermediate result.]
3. The reduce tasks take the map output as their input and process it.
4. The reduce output is stored on HDFS again for further operations.
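
To make the four steps concrete, here is a rough sketch in plain Python. It is only a toy model, not Hadoop code: two ordinary directories stand in for HDFS and for a node's local disk, and the function name run_word_count and the file names input.txt and map_output.txt are made up for illustration (part-r-00000 follows the usual naming of reducer output files). A word count is assumed as the job.

```python
import os
import tempfile
from collections import defaultdict


def run_word_count(input_text, hdfs_dir, local_dir):
    """Toy model of the flow: plain directories stand in for HDFS and local disk."""
    # Step 1: the input file lives on "HDFS".
    input_path = os.path.join(hdfs_dir, "input.txt")
    with open(input_path, "w") as f:
        f.write(input_text)

    # Step 2: map reads from "HDFS" and writes (word, 1) pairs to local disk.
    map_out = os.path.join(local_dir, "map_output.txt")
    with open(input_path) as src, open(map_out, "w") as dst:
        for line in src:
            for word in line.split():
                dst.write(f"{word}\t1\n")

    # Step 3: reduce takes the map output as its input and aggregates it.
    counts = defaultdict(int)
    with open(map_out) as f:
        for line in f:
            word, one = line.rstrip("\n").split("\t")
            counts[word] += int(one)

    # Step 4: the final result is written back to "HDFS".
    final_path = os.path.join(hdfs_dir, "part-r-00000")
    with open(final_path, "w") as f:
        for word in sorted(counts):
            f.write(f"{word}\t{counts[word]}\n")

    # The intermediate file is temporary, so it is removed once the job is done.
    os.remove(map_out)
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as hdfs, tempfile.TemporaryDirectory() as local:
        result = run_word_count("a b a\nb a\n", hdfs, local)
        with open(result) as f:
            print(f.read())
```

After the run, the stand-in HDFS directory holds only the input and the final output, while the stand-in local directory is empty again, which is the point of the flow described above.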

A few more points:
1. In a standalone-mode Hadoop installation, the map input is read from the local filesystem and the reduce output is written back to the local filesystem.
2. In a map-only job, the map output is written to HDFS.
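
The cases so far can be summed up as a small lookup, sketched here in Python. The function storage_plan and its labels "hdfs" and "local" are hypothetical and only restate the rules in this answer; they are not part of any Hadoop API. (In Hadoop's Java API, a map-only job is typically requested by calling setNumReduceTasks(0) on the Job.)

```python
def storage_plan(num_reducers, standalone=False):
    """Return where each stage's data lives, per the rules described in this answer."""
    if standalone:
        # Standalone mode: everything stays on the local filesystem.
        return {"input": "local", "map_output": "local", "final_output": "local"}
    if num_reducers == 0:
        # Map-only job: the map output is the final output, so it goes to HDFS.
        return {"input": "hdfs", "map_output": "hdfs", "final_output": "hdfs"}
    # Usual case: intermediate map output stays on local disk.
    return {"input": "hdfs", "map_output": "local", "final_output": "hdfs"}


if __name__ == "__main__":
    for reducers in (2, 0):
        print(reducers, storage_plan(reducers))
    print("standalone", storage_plan(2, standalone=True))
```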

And here are the direct answers:
Question 1: Are only the initial input and final output stored in HDFS?
Answer 1: Yes, in a real production environment, the initial input and final output are stored in HDFS.

Question 2: Is each data node connected to HDFS?
Answer 2: Yes, each data node is connected to HDFS (the data nodes are where HDFS stores its blocks) and is part of the Hadoop cluster.

Question 3: Is the output of the map stored in HDFS?
Answer 3: No, in the general case, map output is stored on the local filesystem.
Only in a map-only job (no reducer) is the map output stored on HDFS directly, since it is the final result.


Hope this helps.
Vote-up :)
Happy Learning @ Dezyre