So, the buzz is data science. It is being discussed everywhere, and there has been a lot of hype about it in the media. If you look back to the early 2000s, you can see people starting to talk about it regularly and frequently. However, the idea was originally presented in academia back in the sixties. In October 2012, Harvard Business Review, a prestigious and very popular magazine affiliated with Harvard Business School, named data scientist “The Sexiest Job of the 21st Century”. Now you can imagine how it was able to grab all the recent attention and limelight.
Now, what is data science?
In short, it is a scientific way to look at and analyze data, which in turn opens up a new dimension of available information. It is a sophisticated, complex discipline that requires statistical knowledge and computer programming skills. Data science has become a new and the “sexiest” profession in the world. Behind this success lies a long story: the combination of the mature discipline of statistics with computer science, and in particular a new phase involving vast stores of big data. Making sense of data science has a long history, and it has been examined by scientists, computer scientists, librarians, statisticians and others. The following information gives a brief idea of how the term data science evolved, along with related terms and their use.
Since it was formally introduced, data science has evolved to become the sexiest job in the world. Several mathematicians, scientists and international organizations played a key role, directly or indirectly. Interestingly, their contributions were not always related to data science itself; however, they defined a few building blocks that were very important for the data science discipline.
The International Federation for Information Processing (IFIP) was established in 1960 under UNESCO. It set some key guidelines and concepts on data: how data should be processed and what standards should be maintained. This was not data science at all, but it defined a systematic way of processing and presenting data. Before those guidelines, data processing and presentation were confined to individual domains and were very difficult to interpret in any other domain. It was in this setting that the term “datalogy” was first introduced, in 1968, to formalize the practice of data analysis.
John W. Tukey was an American mathematician, famous for the development of the FFT algorithm and the box plot. He wrote the paper ‘The Future of Data Analysis’ in 1962. He was the first to bring up the idea of a relationship between statistics and analysis or, more precisely, data analysis. Earlier, data analysis used to be considered an “applied” discipline of statistics, which made its scope very limited and confined to the business area. In this paper he writes:
“For a long time, I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt…
I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier…
Large parts of data analysis are inferential in the sample-to-population sense, but these are only parts, not the whole… Data analysis is a larger and more varied field than inference, or incisive procedures, or allocation.”
This paper has been cited many times in research papers as the formal introduction of data analysis outside the discipline of statistics. At a later stage, researchers came up with several hypotheses to derive further dimensions from the same data, resulting in better decision making.
One important thing to notice is that Tukey was trained as a mathematician rather than as a statistician; he blended statistical analysis and mathematics together to make data analysis more “scientific” and acceptable.
In 1977, Tukey published another major work: Exploratory Data Analysis. He brought in another major idea about how “exploratory” and “confirmatory” data analysis should be done, and he stressed a “side by side” approach. That means new or revised hypotheses are needed so that the two kinds of analysis can proceed side by side. But why was this idea so important? As mentioned before, data analysis at the time was a statistical discipline and was limited to specific domains. For example, a particular hypothesis may be useful for finding a type of health issue in a population, but the same hypothesis might not be applicable to another area, such as identifying the quality of a particular crop.
Another important name that came after Tukey is Peter Naur. In 1974 he published “Concise Survey of Computer Methods” in Sweden and the United States. The book is a collection of modern data processing methods from various domains, used worldwide in a variety of applications. Another fundamental aspect of the book was its use of the data standards and guidelines defined by the International Federation for Information Processing, which made its ideas more acceptable and interpretable to a wide range of audiences. In fact, the ideas detailed in the book come with a short survey or example of data processing. In this book, he used the term “data science” several times. At a later stage, Naur produced a new, formal definition of data science: “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.” From this time on, the term ‘data science’ was used more and more frequently, although it really took a long time to catch on.
In 1977, the International Association for Statistical Computing (IASC) was founded as a section of the International Statistical Institute (ISI). The main aim of the IASC is to connect people and exchange knowledge on statistical computing worldwide among statisticians, computer professionals, educational institutes, researchers and governments, across various subjects and domains. It began publishing a monthly journal named “Computational Statistics & Data Analysis”. This was a tremendous move, as it helped with sharing knowledge and new ideas on computational statistics and data analysis. Notice that by this time data analysis had become, and had been accepted as, an important discipline.
In 1989, the first ‘Knowledge Discovery in Databases’ workshop, also known as KDD-89, was organized by Gregory Piatetsky-Shapiro. KDD-89 discussed these areas:
- Expert Database Systems
- Scientific Discovery
- Fuzzy Rules
- Using Domain Knowledge
- Learning from Relational (Structured) Data
- Dealing with Text and other Complex Data
- Discovery Tools
- Better Presentation Methods
- Integrated Systems