Wednesday, April 11, 2012

BIGDATA


I started working with Semantic technologies in 2008 and was intrigued by their potential to create meaning out of unstructured data. I built POCs and developed libraries using GATE and LingPipe (NLP parsers) to parse free English text and break the sentences down into RDF triples. I could then run SPARQL queries over the entire text, much like SQL queries over an RDBMS.
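To give a flavor of that idea, here is a toy sketch (not GATE or LingPipe, just a naive pattern I am inventing for illustration) that turns simple "X is Y" sentences into subject/predicate/object triples:

```python
import re

def extract_triples(text):
    """Naively split simple 'X is Y' sentences into (subject, predicate, object)
    triples. Real NLP parsers like GATE or LingPipe do far more than this;
    this is only a sketch of the output shape."""
    triples = []
    for sentence in re.split(r"[.!?]", text):
        match = re.match(r"\s*(.+?)\s+is\s+(.+)", sentence)
        if match:
            triples.append((match.group(1), "is", match.group(2)))
    return triples

print(extract_triples("Paris is the capital of France. Hadoop is a distributed framework."))
```

Each tuple is one fact; a real pipeline would normalize the entities and store the triples in an RDF store for querying.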

Suddenly, in 2012, the concept has come back under a new name: BigData.

Major software vendors are making their products BigData-compliant. The trend I see is strongest among Content Management product companies, because they host unstructured text in their CMS.

What is BigData? 
A layman's take:
1)      When data gets very big, does it become BigData? Doesn't simply adding disks solve the problem?
2)      Is it about Hadoop? Hadoop has HDFS, similar to the Google File System, and Map-Reduce, which Google used for processing large files. So does Hadoop solve my problem?
Let me list a few things that characterize BigData as the industry speaks about it:
1)      BigData is a massive scale of data being generated in the organization, on the order of petabytes and zettabytes.
a.       Human-generated data such as journals, reports, text, documents (text or otherwise), pictures, videos, slideware, chat, blogs, etc.
b.      Machine-generated data such as logs, GPS output, sensor outputs, output from the interfaces of medical devices, cameras, etc.
c.       A single file (video/audio/corpus) can itself be terabytes in size; hence the need to split the file for storage.
2)      Individually, the data is not very valuable or even interesting, because
a.       The data is not so critical that losing some of it will crash the system or cause revenue loss.
b.      The data starts to provide value only through aggregation or summarization of the whole.
We would not spend millions of dollars to house this in traditional data tools like expensive databases.
So BigData is not transactional data.
Transactional data is very critical and requires ACID properties for maintaining its integrity. Any loss or corruption of such data can mean a huge revenue loss for the company.

Enter HADOOP
Hadoop Distributed File System (HDFS) - Hence enters Hadoop, a cluster of low-cost commodity workstations. The distributed-architecture challenges of synchronization, tolerance of network failures, and redundancy of data are built into Hadoop. It solves the problem of storing this inexpensive, large data on a grid of commodity machines.
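To get a feel for the storage idea, here is a minimal sketch of splitting a large file into fixed-size blocks and assigning each block to several nodes for redundancy. The block size, node names, and round-robin placement are my own simplifications; real HDFS placement is rack-aware and far more sophisticated.

```python
def split_into_blocks(data, block_size):
    """Split a byte string into fixed-size blocks, as HDFS does with large files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.
    Round-robin here; real HDFS placement is rack-aware."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [nodes[(block_id + r) % len(nodes)]
                               for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 300, block_size=128)  # 3 blocks: 128, 128, 44 bytes
print(len(blocks))
print(assign_replicas(len(blocks), ["node1", "node2", "node3", "node4"]))
```

Because every block lives on multiple nodes, losing one cheap machine loses no data, which is exactly why a grid of commodity hardware becomes viable.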
Map-Reduce - Next, the data is so large that processing it in reasonable time is itself a very large problem. Again, Hadoop solves this by using the famous Map-Reduce paradigm: break the large data into smaller sets, process them in parallel, and aggregate the results into meaningful data.
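The steps above can be sketched in miniature as a word count, the canonical Map-Reduce example. This is a single-process simulation of the paradigm, not actual Hadoop code:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the values for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big", "data is big"]
print(reduce_phase(shuffle(map_phase(docs))))  # {'big': 3, 'data': 2, 'is': 1}
```

In real Hadoop, the map and reduce functions run on many machines at once and the shuffle happens over the network, but the shape of the computation is exactly this.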

Hadoop may be able to solve the infrastructure problem for BigData.
Vendors are now implementing their products on the Hadoop stack, building grids out of commodity servers.
However, the real problem lies in how we actually extract meaning out of unstructured free text. Is regular-expression matching over Hadoop enough to extract meaning from this free, unstructured text?
Log files are fine, because they have a well-defined structure and can be parsed meaningfully with regular-expression patterns.
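For instance, a structured log line yields readily to a regular expression. The Apache-style access-log format below is my assumption, chosen just to illustrate the point:

```python
import re

# Hypothetical Apache-style access-log line; the exact format is an assumption.
line = '192.168.1.10 - - [11/Apr/2012:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

match = pattern.match(line)
if match:
    print(match.group("ip"), match.group("path"), match.group("status"))
```

Every field falls out of the pattern because the log format never varies; free prose offers no such fixed anchors.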
But free text like journals, chat, blogs, reports, books, audio files, etc. is a different ballgame.
Enter Semantic technology
Analysis through aggregation -
The first step towards analyzing such text has been cluster analysis over a corpus, by way of machine learning.
Training, or Machine Learning - A corpus is created by feeding the analyzer a large body of text, from which it builds a dictionary of sorts, associating words with concepts. This phase is called training.
The next phase is the actual analysis. The analyzer is fed free text and, based on its dictionary of words and concepts, predicts the concept of the input text, returning the matching concept with some confidence percentage.
Example -
If the text contains words like football, stadium, match, score... the analyzer will most probably predict the category as Sports, sub-category Football-Match.
If it contains words like Obama, speech, audience, ties... the analyzer may indicate something like a 15% probability of the concept being Politics.
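The training-then-prediction flow above can be sketched with a tiny bag-of-words classifier. This is a simplified multinomial Naive Bayes; the training sentences and category names are invented for illustration, and real analyzers use much larger corpora and richer features:

```python
import math
from collections import Counter, defaultdict

class TinyTextClassifier:
    """Simplified multinomial Naive Bayes over bags of words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of documents
        self.vocab = set()

    def train(self, text, category):
        """Training phase: build the per-category word dictionary."""
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def predict(self, text):
        """Analysis phase: score each category and return the best match."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for category in self.doc_counts:
            # Log prior plus log likelihood with add-one smoothing.
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in words:
                count = self.word_counts[category][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            scores[category] = score
        return max(scores, key=scores.get)

clf = TinyTextClassifier()
clf.train("football stadium match score goal", "Sports")
clf.train("obama speech audience politics ties", "Politics")
print(clf.predict("the match score at the stadium"))  # Sports
```

The scores themselves can be converted to the kind of confidence percentage the analyzer reports; here we simply return the highest-scoring category.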
Advanced Phase -
The free text is processed and analyzed, concepts (entities: places, persons, things, etc.) are extracted, and these are linked with the tons of informative data on the web (similar to the dumb page links we have in HTML today, but intelligent linking this time).
The entire text can be broken into RDF statements of fact: subject, predicate, and object, well expressed as RDF triples (serialized as RDF/XML or N-Triples).
This will allow SQL-like queries (SPARQL) over the RDF, and the world's entire information (BigData) will become one big, meaningful database accessible to web agents.
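As a minimal sketch of that end state, here is a toy in-memory triple store with a SPARQL-like pattern query. This is pure Python, not a real RDF engine, and the facts are invented:

```python
# Toy in-memory triple store; a stand-in for a real RDF store with SPARQL support.
triples = [
    ("Obama", "gave", "speech"),
    ("Obama", "is-a", "Politician"),
    ("Messi", "plays", "football"),
    ("Messi", "is-a", "Footballer"),
]

def query(pattern):
    """Match a (subject, predicate, object) pattern against the store.
    None acts as a wildcard, like a variable in SPARQL."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Roughly "SELECT ?s WHERE { ?s is-a Politician }" in SPARQL terms:
print(query((None, "is-a", "Politician")))  # [('Obama', 'is-a', 'Politician')]
```

Once free text has been reduced to triples like these, any agent can ask structured questions of it, which is the promise the post is pointing at.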

