MyLearnings: Search

Information in an enterprise lives in many places. Some of it is structured, as in databases. Some is unstructured, as in emails, documents, contracts, and circulars.

Search in Structured Information

The structured information space is very mature. Database products such as Oracle and SQL Server provide RDBMS-based data infrastructure that lets applications store data according to a predefined schema. This schema defines the structure of the information.

This structure also defines how the data is retrieved. A SQL query is the general way to retrieve information from an RDBMS database. A SQL query generally has the format:

SELECT (ColumnNames) FROM (Table) WHERE {Expression (= ColumnName Relation Value)}
This format describes how the data is retrieved.

SELECT: chooses the columns to be retrieved.
FROM: names the source table to retrieve from.
WHERE: filters the rows as required.

So for a table,
SELECT EmployeeAddress FROM Employee WHERE EmployeeName = 'Kapil';

will retrieve the employee address from the Employee table, but only for the employee with the given name.
This space is very mature.
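
As a rough illustration of the query above, here is a minimal sketch using Python's built-in sqlite3 module. The table contents and the helper name address_of are made up for this example; only the Employee table, its two columns, and the name Kapil come from the text.

```python
import sqlite3

# In-memory database; the rows below are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeName TEXT, EmployeeAddress TEXT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Kapil", "12 Example Road"), ("Asha", "34 Sample Street")],
)


def address_of(name):
    """Return every address stored for the given employee name."""
    rows = conn.execute(
        "SELECT EmployeeAddress FROM Employee WHERE EmployeeName = ?",
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


print(address_of("Kapil"))
```

A name that was never entered in the table simply returns an empty list, which is the limitation of structured search: it only sees what the schema holds.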

However, it is limited to information that has been entered in the database according to the table or database schema. Information that cannot be, or simply is not, entered in a database cannot be retrieved this way.

Searching in Unstructured Information

Enterprise information, however, does not exist only in databases. A great deal of it, by some estimates more than 80%, sits in unstructured sources such as email, Word documents, PDF documents, Excel sheets, chat, circulars…

• 90% of office communication and information exchange happens over email.
• Circulars are text or Word documents which contain a lot of event and decision-making information.
• Minutes of meetings, which contain a lot of important information, exist as Word or text documents.
• Vendor contracts and SLAs exist as Word documents.
• User guides and user manuals for systems and applications exist as documents.

This is the typical information of an enterprise office.

There is also information that is specific to particular domains. In education, for example, school books, courses, technical books and documents, and journals all exist in Word or PDF format, in an unstructured manner.

In non-connected PCs …

All this unstructured information has traditionally been stored in the file folders on our machines.

How do we retrieve them?
-On a Windows system, we use the search interface to search on a keyword and specify other parameters such as the drive name, the file extension…

But this retrieves a list of documents, often large ones, which may contain the keyword. We then manually read each document to find the actual information.

-On a UNIX system, we use grep, which searches through a list of documents for a keyword or a regular expression.
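
To make the grep idea concrete, here is a small sketch in Python that scans documents line by line for a regular expression. It is not grep itself, only an imitation of the principle; the document names and contents are hypothetical and are held in memory rather than read from disk.

```python
import re

# Hypothetical documents held in memory; on a real machine these would be files.
documents = {
    "minutes.txt": "Budget approved.\nNext review in March.",
    "contract.txt": "Vendor SLA: response within 4 hours.\nBudget cap applies.",
}


def grep(pattern, docs):
    """Return (name, line number, line) for every line matching the regex."""
    regex = re.compile(pattern)
    hits = []
    for name, text in docs.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((name, number, line))
    return hits


for hit in grep(r"[Bb]udget", documents):
    print(hit)
```

Note that every search rescans every line of every document, which is why larger systems index the content first.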

Keyword Search

This kind of searching is keyword search, which relies on the system indexing the information offline. Indexing inverts the linkage: every word and phrase becomes an index entry pointing to the documents that contain it.

So when we search, the system actually searches the index. When our keyword matches a word in the index, it returns the list of documents that word points to.
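
The inverted index described above can be sketched in a few lines of Python. This is a simplified illustration that indexes single words only (no phrases); the documents and the function names build_index and search are assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical documents to be indexed.
documents = {
    "circular.txt": "Office closed on Friday",
    "minutes.txt": "Friday meeting moved to Monday",
}


def build_index(docs):
    """Map every lower-cased word to the set of documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index, keyword):
    """Look the keyword up in the index; the documents are not rescanned."""
    return sorted(index.get(keyword.lower(), set()))


index = build_index(documents)  # done once, "offline"
print(search(index, "Friday"))
```

The index is built once ahead of time, so each search is a single lookup rather than a pass over all the content.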

However, in an interconnected world like an intranet

All this enterprise information can be stored in file folders on a common centralized machine. This is how it still happens…

The retrieval method is almost the same. We specify the keyword, and the documents that contain that keyword and match the filter expressions are listed. We then manually go in and access the information.

But lately, enterprise systems built solely for managing this type of content, known as Enterprise Content Management (ECM) products, have come into use. These ECM products also come in other flavours such as Document Management Systems and Digital Asset Management (digital assets being audio, video, text documents) …

They manage the entire lifecycle of these information assets: ingest, manage, and access/deliver. These products also come with capabilities for search, policy management, preservation, and storage.

So to retrieve an asset from an ECM, we use the ECM search with the same keyword principle. It returns the list of documents matching the criteria, and we can then retrieve the document.

Search here works on the same principle of keyword search. The ECM crawls the entire content and creates an index of it.

Metadata Search

An ECM, however, allows another type of search, metadata search, which is based on the metadata of the content. Metadata is information about the document: abstract, author, date, etc. So we can also search on metadata to get the list of documents we want.
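
A metadata search can be pictured as filtering on fields that describe the document rather than on its text. The sketch below assumes a hypothetical catalog with title, author, and date fields; it does not reflect the API of any particular ECM product.

```python
from datetime import date

# Hypothetical metadata records; a real ECM would keep these with each asset.
catalog = [
    {"title": "Vendor SLA", "author": "Legal", "date": date(2010, 3, 1)},
    {"title": "User Guide", "author": "IT", "date": date(2010, 6, 15)},
    {"title": "Leave Policy", "author": "Legal", "date": date(2009, 11, 20)},
]


def metadata_search(records, author=None, since=None):
    """Return titles whose metadata matches every filter that was supplied."""
    results = []
    for record in records:
        if author is not None and record["author"] != author:
            continue
        if since is not None and record["date"] < since:
            continue
        results.append(record["title"])
    return results


print(metadata_search(catalog, author="Legal", since=date(2010, 1, 1)))
```

The document body is never read here; only the descriptive fields are compared.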

However, in an interconnected world like the internet

The same keyword search is also the principle used for searching documents or information at the larger scale of the internet.

Google, the main search engine, crawls the information published across the world using crawlers and indexes it on its own large GFS (Google File System) clusters to manage the huge volume of the index.

But with the huge volume of content across the world (possibly more than a billion distinct web sites), keyword-based search has reached its limits. Google search has been pushed to its fullest capability to provide information at a fast pace. However, with thousands of pages being returned for a keyword, the result is an overload of data with little useful information.

People have to manually search through the list of documents to get the information they want…

Now comes the next phase of search: linguistic

Linguistic search is further divided into shallow linguistic and deep linguistic (semantic).

Shallow Linguistic

Shallow linguistic search does more than simple keyword search by linking the keyword with its lemma and synonyms.
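
As a sketch of what linking a keyword to its lemma and synonyms could look like, the example below expands a query before matching. The LEMMAS and SYNONYMS tables are tiny hand-made assumptions; a real system would take them from a linguistic resource.

```python
# Tiny hand-made tables, assumed for illustration only.
LEMMAS = {"buying": "buy", "bought": "buy", "purchases": "purchase", "purchased": "purchase"}
SYNONYMS = {"buy": {"purchase"}, "purchase": {"buy"}}

# Hypothetical documents.
documents = {
    "po.txt": "we purchased ten laptops",
    "memo.txt": "buying is on hold",
    "note.txt": "laptops arrived today",
}


def lemma(word):
    """Reduce a word to its base form if the table knows it."""
    word = word.lower()
    return LEMMAS.get(word, word)


def expand(keyword):
    """Return the keyword's lemma together with the lemma's synonyms."""
    base = lemma(keyword)
    return {base} | SYNONYMS.get(base, set())


def search(keyword, docs):
    """Return documents containing any word whose lemma is in the expanded query."""
    wanted = expand(keyword)
    hits = []
    for name, text in docs.items():
        words = {lemma(w) for w in text.split()}
        if wanted & words:
            hits.append(name)
    return hits


print(search("bought", documents))
```

A plain keyword search for "bought" would match none of these documents, since the word itself never appears in them.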

It also retrieves documents with summarization and sentiment analysis.
This is made possible by a training corpus that contains similar text and maps the content to particular sentiments. The system is intelligent in that it matches sets of keywords with sentiments and categories.

So when new content is fed to this intelligent sentiment analyzer, it matches the content against its information and scoring index to return the sentiment of the information. It is also able to abstract the content by removing repeated sentences, using word frequency and some linguistic algorithms.
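
The idea of a scoring index learned from a training corpus can be illustrated with a very small word-scoring sketch. This is a deliberately naive stand-in, not how production sentiment analyzers work; the training sentences and the function names train and sentiment are invented for the example.

```python
from collections import Counter

# Hypothetical labelled training corpus.
training = [
    ("great service and quick response", "positive"),
    ("quick and helpful support", "positive"),
    ("slow response and poor service", "negative"),
    ("poor quality and slow delivery", "negative"),
]


def train(corpus):
    """Build a scoring index: +1 per positive occurrence, -1 per negative one."""
    scores = Counter()
    for text, label in corpus:
        step = 1 if label == "positive" else -1
        for word in text.lower().split():
            scores[word] += step
    return scores


def sentiment(text, scores):
    """Sum the scores of the words in new content and report the overall sign."""
    total = sum(scores[word] for word in text.lower().split())
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


scores = train(training)
print(sentiment("helpful and quick delivery", scores))
```

Words seen equally often in both kinds of text, such as "and", end up with a score of zero and do not sway the result.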

Search is now narrower; however, the documents still need to be analyzed by a human.

Deep Linguistic (Semantic)

Semantic analysis includes grammatical, logical, and morphological analysis and identification of the content. The analysis follows the context of the sentence, identifying meaning and relations.

Once semantic analysis has been done, the words are organized into relations using semantic technology (RDF). RDF allows a statement to be defined as a triple: subject, predicate, and object.

The subject is the resource that the statement is about. The predicate depicts the relationship of this resource to the object. The object can be another resource or a text value (some description of the subject).

Once the entire document is converted into RDF triples, it is in a structured format. This structure can be explained by analogy to the RDBMS: each statement is broken into three columns, subject, predicate, and object.
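
The RDBMS analogy can be made literal with a three-column table. The sketch below stores a few made-up statements in an in-memory SQLite table; the predicates worksFor, livesIn, and locatedIn and all the values are hypothetical, and a real RDF store would use URIs rather than bare strings.

```python
import sqlite3

# Hypothetical statements, each broken into subject, predicate, object.
triples = [
    ("Kapil", "worksFor", "Acme"),
    ("Kapil", "livesIn", "Pune"),
    ("Acme", "locatedIn", "Mumbai"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Triple (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO Triple VALUES (?, ?, ?)", triples)


def objects_of(subject, predicate):
    """Return every object linked to the subject by the given predicate."""
    rows = conn.execute(
        "SELECT object FROM Triple WHERE subject = ? AND predicate = ?",
        (subject, predicate),
    ).fetchall()
    return [row[0] for row in rows]


print(objects_of("Kapil", "livesIn"))
```

Unlike the Employee table earlier, this one table can hold any kind of statement without a schema change, because the predicate column carries the relationship.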

SPARQL (similar to SQL for an RDBMS) is used to retrieve information from the RDF triples.

SPARQL has a format similar to SQL:

SELECT (Variables) FROM (Graph, optional) WHERE {Expression: subject predicate object patterns}

This differs from the other searches in that it directly answers queries based on the semantic relations existing in the document.
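
To show how a query over triples can answer a question directly, here is a minimal pattern matcher in plain Python. It only imitates the basic idea behind a SPARQL WHERE clause (terms starting with ? are variables) and is not a SPARQL engine; the triples and the functions match and query are assumptions made for this sketch.

```python
# Hypothetical statements as (subject, predicate, object).
triples = [
    ("Kapil", "worksFor", "Acme"),
    ("Kapil", "livesIn", "Pune"),
    ("Acme", "locatedIn", "Mumbai"),
]


def match(pattern, triple, bindings):
    """Match one pattern against one triple; return extended bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, data):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        extended_solutions = []
        for bindings in solutions:
            for triple in data:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    extended_solutions.append(extended)
        solutions = extended_solutions
    return solutions


# "In which city is Kapil's employer located?" Roughly, in SPARQL terms:
# SELECT ?city WHERE { :Kapil :worksFor ?company . ?company :locatedIn ?city }
answers = query(
    [("Kapil", "worksFor", "?company"), ("?company", "locatedIn", "?city")],
    triples,
)
print([a["?city"] for a in answers])
```

The answer comes from following two relations, not from finding a document that happens to contain the keywords.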

More on the advantages of semantic search in a different post…

Monday, August 2, 2010
