MapReduce for Sorting 1PB of Data
Navtej Kohli
is always discovering new technologies and inventions in information
technology and computer science. It is now a challenge for every search
engine to manage data and retrieve the desired results within
milliseconds. A huge amount of data arrives through web publishing,
news, blogs, and websites, and managing that data in a timely fashion
is a major challenge, even as every search engine promises to deliver
live results.
Google, the search engine giant, plays an important role in spreading
and sharing information across the world. Google's engineers are always
working to improve search quality and to minimize the time needed to
fetch search results. Google officials feel that because data is
growing so rapidly, they must keep working on ways to minimize the time
spent sorting and comparing data, especially since Google is committed
to delivering live results.
After a long study of how to sort data quickly, Google introduced a
technology for organizing the world's information, MapReduce, and
claims that it greatly reduces the time needed for sorting. MapReduce
is a key component of Google's software infrastructure that allows
multiple processes to run simultaneously. It is a good fit for any
program that needs many calculations to produce its results, especially
one that runs on a regular basis. Google officials also claim that its
simplicity makes it easy to adopt and run in many environments,
including real-world applications that require heavy computation over
data distributed across a network.
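
To make the idea concrete, here is a minimal single-process sketch of
how a sort can be expressed in the map, shuffle, and reduce style. It
is only an illustration of the general pattern, not Google's
implementation: the function names, the two-character sort key, and the
partition boundaries are all assumptions made for this example.

```python
from bisect import bisect_right
from collections import defaultdict


def map_phase(records):
    """Map: emit a (key, record) pair for each input record."""
    for record in records:
        # Assumption for this sketch: the first 2 characters are the sort key.
        yield record[:2], record


def partition(key, boundaries):
    """Pick the reducer whose key range contains `key`."""
    return bisect_right(boundaries, key)


def mapreduce_sort(records, boundaries):
    # Shuffle: route every mapped pair to the reducer that owns its key range.
    buckets = defaultdict(list)
    for key, record in map_phase(records):
        buckets[partition(key, boundaries)].append((key, record))

    # Reduce: each reducer sorts only its own bucket. Because the key ranges
    # are ordered, concatenating the reducer outputs gives a global sort.
    output = []
    for reducer in range(len(boundaries) + 1):
        output.extend(record for _, record in sorted(buckets[reducer]))
    return output


records = ["pear", "apple", "zebra", "mango", "kiwi", "banana"]
boundaries = ["h", "p"]  # hypothetical split points giving three reducers
result = mapreduce_sort(records, boundaries)
print(result)
```

In a real cluster each bucket would live on a different machine, so the
per-bucket sorts could run at the same time.
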
According to Google officials, Google followed the rules of a standard
terabyte (TB) sort benchmark for this experiment. Such standardized
benchmarks help in understanding and comparing various technologies,
and offer other benefits as well. Google says: “You can think of it as
an Olympic event for computations. By pushing the boundaries of these
types of programs, we learn about the limitations of current
technologies as well as the lessons useful in designing next generation
computing platforms. This, in turn, should help everyone have faster
access to higher-quality information.”
Google proudly announced that it can now sort 1TB (stored on the Google
File System as 10 billion 100-byte records in uncompressed text files)
on about 1,000 computers in 68 seconds, much better than the previous
result of 209 seconds for 1TB on 910 computers.
Google was also curious about what happens when a user wants to sort
more than one terabyte, or even one petabyte. A thousand terabytes make
one petabyte, which is 12 times the amount of archived web data in the
U.S. Library of Congress as of May 2008. In comparison, consider that
the aggregate size of data processed by all instances of MapReduce at
Google was on average 20PB per day in January 2008.
Google shared one more interesting result, described here in its own
words: it took six hours and two minutes to sort 1PB (10 trillion
100-byte records) on 4,000 computers.
We're not aware of any other sorting experiment at this scale and are
obviously very excited to be able to process so much data so quickly.
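
The reported figures can be turned into rough throughput numbers. The
short calculation below uses only the record counts, times, and machine
counts quoted in this article; it assumes decimal units (1TB = 10^12
bytes) and treats "about 1,000 computers" as exactly 1,000, so the
per-machine values are approximations.

```python
RECORD_BYTES = 100  # each record is 100 bytes, as stated in the article


def throughput(records, seconds, machines):
    """Return (total bytes, aggregate bytes/s, bytes/s per machine)."""
    total_bytes = records * RECORD_BYTES
    aggregate = total_bytes / seconds
    return total_bytes, aggregate, aggregate / machines


# 1TB run: 10 billion records, 68 seconds, about 1,000 computers.
tb_bytes, tb_agg, tb_per = throughput(10 * 10**9, 68, 1_000)

# 1PB run: 10 trillion records, 6 hours 2 minutes, 4,000 computers.
pb_seconds = 6 * 3600 + 2 * 60
pb_bytes, pb_agg, pb_per = throughput(10 * 10**12, pb_seconds, 4_000)

for name, agg, per in [("1TB", tb_agg, tb_per), ("1PB", pb_agg, pb_per)]:
    print(f"{name}: {agg / 1e9:.1f} GB/s aggregate, {per / 1e6:.1f} MB/s per machine")
```

This works out to roughly 14.7 GB/s in aggregate for the 1TB run and
roughly 46.0 GB/s for the 1PB run.
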
An interesting question came up while running experiments at such a
scale: Where do you put 1PB of sorted data? We were writing it to
48,000 hard drives (we did not use the full capacity of these disks,
though), and every time we ran our sort, at least one of our disks
managed to break (this is not surprising at all given the duration of
the test, the number of disks involved, and the expected lifetime of
hard disks). To make sure we kept our sorted petabyte safe, we asked
the Google File System to write three copies of each file to three
different disks.
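
A back-of-the-envelope estimate shows why a broken disk in every run is
unsurprising. The disk count and run time below come from the article,
but the 3% annual failure rate is purely an assumption chosen for
illustration, as is the simplification that failures are independent
and spread evenly over the year.

```python
HOURS_PER_YEAR = 365 * 24


def disk_failure_estimate(disks, run_hours, annual_failure_rate):
    """Return (expected failures, probability of at least one failure)."""
    # Chance that one given disk fails during the run (assumed uniform rate).
    p_one = annual_failure_rate * run_hours / HOURS_PER_YEAR
    expected = disks * p_one
    # Probability that at least one of the independent disks fails.
    p_any = 1 - (1 - p_one) ** disks
    return expected, p_any


# 48,000 disks and a run of 6 hours 2 minutes are from the article;
# the 0.03 annual failure rate is a hypothetical value.
expected, p_any = disk_failure_estimate(48_000, 6 + 2 / 60, 0.03)
print(f"expected failures per run: {expected:.2f}")
print(f"chance of at least one failure: {p_any:.0%}")
```

Under these assumptions the estimate is about one failed disk per run.
Real failure rates vary, but at this scale some loss is to be expected,
which is why keeping three copies of each file makes sense.
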
Significantly improved handling of the so-called "stragglers" (parts of
computation that run slower than expected) was a key software technique
that helped sort 1PB. And of course, there are many other factors that
contributed to the result. We'll be discussing all of this and more in
an upcoming publication. And you can also check out the video from our
recent Technology RoundTable Series.
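
The article does not say how the stragglers were handled. One simple
way to spot them, shown here only as a hypothetical sketch, is to
compare each running task against the median duration of the tasks
that have already finished; a scheduler could then start a backup copy
of any task that is flagged. The function name, the task ids, and the
1.5x threshold are assumptions made for this example.

```python
from statistics import median


def find_stragglers(running, finished, slow_factor=1.5):
    """Return ids of running tasks that look much slower than finished ones.

    running:  {task_id: seconds elapsed so far}
    finished: {task_id: total seconds taken}
    """
    if not finished:
        return []  # nothing to compare against yet
    cutoff = slow_factor * median(finished.values())
    return sorted(task for task, secs in running.items() if secs > cutoff)


finished = {"t1": 10, "t2": 12, "t3": 11}
running = {"t4": 9, "t5": 30}
stragglers = find_stragglers(running, finished)
print(stragglers)
```
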
Navtej Kohli
wants to congratulate Google's engineers on this new technology. Data
is growing rapidly, and new units of measurement are appearing to
describe its volume, so further new technologies and continuous work
will be necessary, because in the future we will have to analyze more
data than ever before.
