An introduction to Big Data at TCC, GNDEC

First of all, this is a very old post that I drafted back in 2013. A lot has changed since then: the way I write, the way I think, and much more. So lately, I have decided to publish such drafts (mainly related to my technical learning). Going with this one first, because back then I never knew that one day I would be doing my thesis on the same topic.

Well, a very good session related to the buzzword ‘big data’ took place in our TCC in 2013. A lot of concepts were discussed, and not even for a single minute did I feel bored or tired. All thanks to Rai sir, Navjot sir, Navdeep sir and Jagdeep sir. I had been asked to keep a brief record of the session, so here we go.

We live in a big data scenario. Everything has grown to such a large extent: we are busy using social networking sites, multimedia, online video streaming, commenting and liking stuff on Facebook, Instagram, and what not, all of it happening in real time. So a hell of a lot of data gets generated through all such activities at a very high speed. This heterogeneous data comes under the category of ‘big data’ as it is big in every sense (size, speed, variety, etc.), and it is very difficult, almost impossible, to deal with this kind of data using traditional database systems and technologies.

The session started with a brief introduction to big data and big data analytics, including the various types of databases and data structures involved. Technical terms like NoSQL (Not Only SQL), Cassandra, Apache Hadoop, HBase, Hive, Flume, SAP HANA, etc. were introduced. The speaker then explained their applications in fields like healthcare and transportation from a business point of view.

In brief, big data and big data analytics can be dealt with in the following steps:

Firstly, we have to acquire data in large amounts for analysis, e.g., millions of tweets from Twitter.
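
To make the acquisition step concrete, here is a minimal sketch of pulling tweets with the tweepy library. The credentials and the search query are placeholders of my own, not anything shown in the session.

```python
# Minimal "acquire" sketch, assuming tweepy (v4+) and placeholder Twitter API credentials.
import tweepy

# Hypothetical credentials -- replace with real keys from the Twitter developer portal
auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET",
    "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET",
)
api = tweepy.API(auth)

# Pull a small batch of recent English tweets mentioning "big data"
tweets = [status.text for status in
          tweepy.Cursor(api.search_tweets, q="big data", lang="en").items(100)]
print(len(tweets), "tweets collected")
```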

Next, using a NoSQL store we can save the acquired data for further processing. Take the example of Cassandra, which stores data in a tabular fashion (which ultimately enhances processing efficiency).
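
As a rough illustration of the storage step, the sketch below writes a few tweets into Cassandra using the DataStax Python driver. The keyspace, table name and single-node localhost cluster are assumptions made just for this example.

```python
# Minimal "store" sketch, assuming the cassandra-driver package and a local Cassandra node.
from uuid import uuid4
from cassandra.cluster import Cluster

# In practice this would be the batch collected in the acquisition step
tweets = ["sample tweet about big data", "another tweet about hadoop"]

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Hypothetical keyspace and table for raw tweets
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS tweets_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS tweets_ks.tweets (id uuid PRIMARY KEY, body text)
""")

for text in tweets:
    session.execute(
        "INSERT INTO tweets_ks.tweets (id, body) VALUES (%s, %s)",
        (uuid4(), text),
    )
```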

Now, to process the data, Hadoop (an open-source framework built around the MapReduce model, which originated at Google) lets us run parallelizable problems across huge data sets using a large number of computers (nodes), collectively referred to as a cluster. Processing can occur on data stored either in a file system (unstructured) or in a database (structured). Hadoop can take advantage of data locality, processing data on or near the storage nodes to reduce data transmission.

In the Hadoop architecture, the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node. The master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve.
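
To make the master/worker idea concrete, here is a minimal word-count job written in Python for Hadoop Streaming: the mapper emits (word, 1) pairs from its share of the input, and the reducer combines the counts per word. The file names, paths and cluster setup are only illustrative; this is not code from the session.

```python
#!/usr/bin/env python3
# mapper.py -- emits (word, 1) pairs; Hadoop sorts them by key before the reduce step
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts for each word (input arrives sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print(current_word + "\t" + str(current_count))

# Submit with Hadoop Streaming, e.g. (paths are illustrative):
# hadoop jar hadoop-streaming.jar \
#   -input /data/tweets -output /data/wordcounts \
#   -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
```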

After this, machine learning, NLP (natural language processing) and artificial intelligence were the other terms and technologies explained to us. NLP algorithms analyze the data, and predictions are then made on that basis, for example, polarity-based sentiment analytics.
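
As a small illustration of polarity-based analysis, the sketch below uses the TextBlob library to score a couple of sample sentences. The library choice and the sample text are my own assumptions, not what was demonstrated in the session.

```python
# Minimal polarity-based sentiment sketch, assuming the textblob package is installed.
from textblob import TextBlob

samples = [
    "I really enjoyed this session on big data!",
    "The traffic in the city was terrible today.",
]

for text in samples:
    # polarity ranges from -1 (most negative) to +1 (most positive)
    polarity = TextBlob(text).sentiment.polarity
    label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    print(f"{label:8} ({polarity:+.2f})  {text}")
```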

Then Navdeep sir made things clearer by explaining the three main roles involved in companies like IBM:
1) Administration
2) Development
3) Analysis

And finally, there was a hands-on session on sentiment analysis using Python programming.
