Topic-based Indexing and Search Relevance

Document Vector Demo



Rethinking Search

The traditional and most commonly used method of determining document relevance is TF-IDF (term frequency-inverse document frequency). However, this method considers only how often a term occurs, in a document and across the collection, and ignores the grammatical, semantic, and contextual information associated with each term in the content. A tremendous amount of information is lost as a result, and search results are often inaccurate.
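As a rough illustration of what a frequency-only score looks like, here is a minimal sketch of one common TF-IDF variant. The function names and the three-document collection are hypothetical and chosen only for this example; real search engines use different normalizations.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(term, doc_tokens, corpus_tokens):
    """Score one term in one document with a common TF-IDF variant.

    tf  = occurrences of the term / number of tokens in the document
    idf = log(N / number of documents containing the term)
    """
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus_tokens) / df)


# Hypothetical three-document collection, used only for illustration.
corpus = [
    "the camera uses a rechargeable battery",
    "the tripod holds the camera steady",
    "the battery charger is sold separately",
]
corpus_tokens = [tokenize(doc) for doc in corpus]
scores = [tf_idf("camera", tokens, corpus_tokens) for tokens in corpus_tokens]
```

The first two documents receive the same score for "camera" because the score depends only on counts; nothing in the formula looks at the role the word plays in the sentence.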

A second generation of text analysis methods, such as Latent Semantic Indexing (LSI), relies heavily on statistical machine learning. Starting with a large matrix of terms and documents, LSI uses singular value decomposition to find the "latent" relationships between terms. Although these methods begin to take contextual information into consideration, like most matrix methods they are difficult to scale to large document collections, because the term-document matrix grows with both the vocabulary and the number of documents.
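In standard notation (the symbols are ours and are not taken from any particular LSI implementation), let X be the term-document matrix with m terms and n documents. LSI factors X and keeps only the k largest singular values:

```latex
X = U \Sigma V^{\mathsf{T}}, \qquad
X_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad k \ll \min(m, n)
```

Terms and documents are then compared in the k-dimensional space spanned by the retained singular vectors. A full decomposition of a dense m-by-n matrix costs on the order of min(mn², m²n) operations, which is one reason the approach becomes expensive as collections grow.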

Other methods for relevance ranking in Web search include Google's well-known PageRank algorithm, which takes into account the hyperlink relationships between webpages. However, this method cannot be used for searching enterprise and personal document databases, including emails, where there are usually no hyperlinks.

Based on our research and unique insights into the relationships between language and information, we've developed a smarter, scalable, and more robust framework for topic discovery and relevance ranking.

Quantitative Topic Prominence

Natural languages have an internal grammatical and semantic structure. Our theory is that different parts of this structure carry different amounts of information and represent different degrees of information focus. Terms with a high degree of information focus can represent the topics of a document.

The major differences between our theory and traditional methods can be illustrated with a few examples, without going into too much technical detail. Suppose a user is searching with the keyword "camera", and consider the two short documents below:

  1. The camera uses a rechargeable battery. It comes with a charger.
  2. My camera is out of battery. I did not bring a charger.

These two documents have very similar sentence structure and word count, and the word "camera" occurs once in each. Because the frequencies are the same, a TF-IDF method has essentially no basis for deciding which document is more relevant.
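The counts for documents (1) and (2) can be checked with a short sketch. The helper name and the simple regex tokenizer are assumptions made for illustration; they are not part of any particular search engine.

```python
import re
from collections import Counter

doc1 = "The camera uses a rechargeable battery. It comes with a charger."
doc2 = "My camera is out of battery. I did not bring a charger."


def count_and_length(term, text):
    """Return (occurrences of the term, total number of word tokens)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)[term], len(tokens)


count1, len1 = count_and_length("camera", doc1)
count2, len2 = count_and_length("camera", doc2)
```

Both documents contain "camera" exactly once, and their lengths differ by a single word. Any small gap a length-normalized score might show comes from that one extra word, not from how "camera" is used in either text.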

However, the grammatical, semantic, and contextual attributes associated with the word "camera" in (1) are very different from those in (2). These differences represent different degrees of information focus, or topic prominence, for the word "camera". In our research, many people agreed that document (1) is more informative about the topic of "camera" than document (2).

In (1), the word "camera" represents the object in a general form, and the rest of the document provides information about that object. In (2), the word "camera" represents a specific instance of a camera, and the rest of the document describes a specific event that is focused on the speaker "I" rather than on the object "camera". Thus the text in (2) provides less information about a "camera" than the text in (1). Unlike a traditional TF-IDF method, our Quantitative Topic Prominence model will identify the content in (1) as representing a more prominent topic for "camera" than the content in (2), based on the way information about "camera" is encoded in the documents, and will rank the search results accordingly. Try comparing the two documents in the demo box above.

Consider the next example:

  3. A camera is a device that can take pictures.
  4. John and Mary are photographers. John has two cameras, and Mary has three cameras.

Example (4) contains two occurrences of the word "camera", while (3) contains only one. Using the traditional TF-IDF method, (4) will be ranked higher than (3). In contrast, our topic model and linguistic analysis methods will rank (3) higher than (4). In (3), the word "camera" represents an object in a general form, and the rest of the sentence provides a certain amount of information about it. In (4), neither of the two instances of the word "camera" is in a prominent topical position, and the rest of the content does not provide as much information about "camera" as (3) does. Thus, the word "camera" is a more prominent topic in (3) than in (4), even though its frequency is higher in (4).
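A frequency-only ranking of documents (3) and (4) can be sketched as follows. The plural handling (accepting a trailing "s") is a deliberately crude assumption standing in for a real stemmer.

```python
import re

doc3 = "A camera is a device that can take pictures."
doc4 = ("John and Mary are photographers. John has two cameras, "
        "and Mary has three cameras.")


def count_term(term, text):
    """Count occurrences of a term, treating a trailing 's' as its plural."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for tok in tokens if tok in (term, term + "s"))


counts = {"(3)": count_term("camera", doc3), "(4)": count_term("camera", doc4)}
# Rank by raw count, highest first, as a frequency-only method would.
ranked = sorted(counts, key=counts.get, reverse=True)
```

Ranked by count alone, (4) comes first, even though (3) is the document that actually defines what a camera is.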

Here's another example:

  5. A camera is a device that can take pictures. There are two common types of cameras. One is a digital camera, and the other is a film camera.
  6. Digital cameras capture images using a semiconductor chip called CCD. Digital cameras became popular about a decade ago. With a digital camera, a user can store images onto a memory card or computer. It is one of the advantages of a digital camera over a film camera.

Document (5) contains 4 occurrences of the word "camera", and document (6) contains 5. Again, the traditional TF-IDF method will rank (6) higher than (5) for a query of "camera". However, it's clear that the main topic of (5) is "camera", while the main topic of (6) is actually "digital camera". As before, in document (5) the word "camera" represents an object in its general form, while in (6) the word "camera" is modified by the word "digital", and a digital camera is a subclass of camera. Thus, if the search term is "digital camera", our algorithm will consider (6) more relevant than (5). However, when the search term is "camera", our algorithm will consider (5) more relevant than (6), even though the frequency of "camera" is higher in (6) than in (5). Try comparing the two documents for yourself in the demo box at the top of this page.
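The counts quoted for documents (5) and (6) can be reproduced with a small phrase counter. The helper below is a hypothetical illustration that matches a word or phrase with an optional plural "s"; it is not how any specific engine tokenizes text.

```python
import re

doc5 = ("A camera is a device that can take pictures. There are two common "
        "types of cameras. One is a digital camera, and the other is a film "
        "camera.")
doc6 = ("Digital cameras capture images using a semiconductor chip called "
        "CCD. Digital cameras became popular about a decade ago. With a "
        "digital camera, a user can store images onto a memory card or "
        "computer. It is one of the advantages of a digital camera over a "
        "film camera.")


def count_phrase(phrase, text):
    """Count a word or phrase in the text, allowing a plural 's' at the end."""
    pattern = r"\b" + re.escape(phrase) + r"s?\b"
    return len(re.findall(pattern, text.lower()))


camera_counts = (count_phrase("camera", doc5), count_phrase("camera", doc6))
digital_counts = (count_phrase("digital camera", doc5),
                  count_phrase("digital camera", doc6))
```

By count, (6) leads for both queries. That agrees with intuition for "digital camera", but for the query "camera" a count cannot show that most occurrences in (6) are part of the narrower phrase.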

We also encourage you to see for yourself how tools built on traditional methods, such as Google Desktop Search, Windows Search, or an open source search engine like Lucene, handle the above examples, and to compare them with our results.

Get in touch!

We provide document indexing services using our patented and patent-pending technologies, and we license our technology to partners. You can either use our term vectors alone for indexing and retrieval, or combine them with your existing methods to improve search results.

To request access to our API for evaluating our Quantitative Topic Prominence model, please email us at info@LinfoResearch.com. We'd love to hear from you!