Is Big Data really Big?

Today, scientists are obsessed with the term “Big Data,” yet it sometimes seems that people who use this phrase don’t have a good grasp of its meaning. Like most good buzzwords, “Big Data” evokes something grand and complicated while still sounding ordinary to listeners, who feel they already have a clear idea of the notion. However, the technical meaning of “Big Data” is a little different from what it appears to be. One can appreciate the tremendous accumulation of data from the fact that the U.S. healthcare system alone reached approximately 150 exabytes in 2011 and is expected to reach yottabyte scale before long. The challenge of analyzing this massive volume of varied data is equally big, and it keeps changing as the number of data-sharing devices grows.

One of the biggest sources of “Big Data” is the rapid accumulation of genomic data driven by inexpensive whole-genome sequencing on next-generation sequencing (NGS) instruments. The world’s sequencing capacity is currently estimated at approximately 13 quadrillion DNA bases per year. This is truly Big Data, requiring 3 or more exabytes of storage. For example, the NIH-funded 1000 Genomes Project deposited around 200 terabytes of raw sequencing data into GenBank during its first 6 months. That is practically twice what had been deposited into GenBank over the previous 30 years!
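A rough back-of-envelope calculation shows how 13 quadrillion bases per year can translate into exabyte-scale storage. The coverage depth and bytes-per-base figures below are illustrative assumptions (raw reads, quality scores, alignments, and downstream files all add overhead), not measured values:

```python
# Back-of-envelope storage estimate for the world's annual sequencing output.
# Assumes the 13-quadrillion figure counts unique genome positions, each
# sequenced at ~30x coverage, with several bytes of total storage per
# sequenced base across raw reads, quality scores, and analysis files.
# These parameters are assumptions for illustration only.

BASES_PER_YEAR = 13e15  # ~13 quadrillion DNA bases per year

def storage_bytes(bases, coverage=30, bytes_per_base=8):
    """Estimate total storage footprint for a given number of genome bases."""
    return bases * coverage * bytes_per_base

EXABYTE = 1e18
print(f"{storage_bytes(BASES_PER_YEAR) / EXABYTE:.1f} EB")  # ~3.1 EB under these assumptions
```

Under these (assumed) parameters the estimate lands near the 3-exabyte figure cited above; halving the coverage or the per-base overhead halves the result, which is why published storage projections vary so widely.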

In 2012, the U.S. government launched a $200 million “Big Data Research and Development Initiative”. This program facilitates the broad use of complex biomedical data by developing standards, novel analytical techniques, software, and training for scientists, as well as by establishing “Centers of Excellence” that develop approaches to essential problems in medicine, computational biology, and informatics.

As a simple example, cancer is a disease that constantly adapts to its environment. To understand its evolution, scientists need to understand the patterns associated with such adaptations. This is where big data scientists can play a major role, using high-performance computing to study the evolution of cancer. With the rapid growth of sequencing technologies, genomic information is accumulating at a tremendous pace. But sequencing a particular genome does not, by itself, yield knowledge, because it contains a large number of mutations. The goal is to identify the mutations that specifically drive tumor growth, and thereby to identify and ultimately kill the tumors. Big data scientists can perform critical quantitative assays to measure the effect of such mutations, which in turn generates enormous amounts of data that must be integrated with the available DNA sequencing data. The future, of course, lies in using these interdisciplinary strategies to develop personalized cancer treatments.
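The filtering step described above can be sketched in a few lines. This is a deliberately simplified, hypothetical illustration (real pipelines work on VCF files with statistical callers); the variant labels, cohort counts, and the `somatic_candidates` helper are all invented for this example:

```python
# Hypothetical sketch: finding candidate tumor-specific (somatic) mutations
# by comparing variants called in tumor tissue against a matched normal
# sample from the same patient. All names and numbers are illustrative.

def somatic_candidates(tumor_variants, normal_variants, recurrence, min_patients=2):
    """Keep variants seen in the tumor but not in the matched normal sample,
    prioritizing those recurring across multiple patients (a crude proxy
    for 'driver' rather than incidental 'passenger' mutations)."""
    somatic = set(tumor_variants) - set(normal_variants)
    return sorted(v for v in somatic if recurrence.get(v, 0) >= min_patients)

tumor  = {"TP53:R175H", "KRAS:G12D", "BRCA1:C61G", "RANDOM:X1Y"}
normal = {"BRCA1:C61G"}  # also in healthy tissue, so germline, not somatic
seen_in_cohort = {"TP53:R175H": 40, "KRAS:G12D": 35, "RANDOM:X1Y": 1}

print(somatic_candidates(tumor, normal, seen_in_cohort))
# ['KRAS:G12D', 'TP53:R175H']
```

The set difference removes germline variants shared with healthy tissue, and the recurrence filter discards one-off passenger mutations, leaving the recurrent candidates worth quantitative follow-up.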

About Somnath Tagore

I am a Computational Systems Biologist, currently working on data mining and quantitative modeling in cancer genomics.