Researching a topic online yields a wealth of information from many sources. Unfortunately, much of that information is redundant, and reading the same material over and over wastes time.

The holy grail of information hunting is a single document that covers the topic sufficiently and contains all the relevant data, with no duplicated text or images.

Air Force researchers at the Information Directorate in Rome, New York, have tackled this problem and developed a system for removing redundant text and images from digital documents.

The method comprises the steps of organizing text into sentences and paragraphs; analyzing the sentences and paragraphs; comparing them with those of other documents; and identifying redundancies between the documents. Further steps include:

  • Extracting statistical features such as the size of a paragraph in characters; character histograms; the number of sentences; the number of words in each sentence; word histograms; and the starting and ending word of each sentence
  • Determining whether similar statistical features exist and, if so, treating the paragraphs as similar
  • Removing redundant paragraphs
  • Comparing sentences and paragraphs with other documents
  • Determining whether the paragraphs are placed in a different order and, if so, analyzing the starting word and the length of each sentence
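
The feature extraction and comparison steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: the function names, the naive sentence splitter, the histogram-overlap score, and the 0.9 threshold are all hypothetical choices made for the example, and the reordering check is one possible reading of the last step.

```python
import re
from collections import Counter

def split_sentences(paragraph):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def paragraph_features(paragraph):
    # Collect the statistical features named in the list above.
    sentences = split_sentences(paragraph)
    words = [re.findall(r"[A-Za-z0-9']+", s.lower()) for s in sentences]
    return {
        "size": len(paragraph),                        # size in characters
        "char_hist": Counter(paragraph.lower()),       # character histogram
        "num_sentences": len(sentences),
        "sentence_lengths": [len(ws) for ws in words], # words per sentence
        "word_hist": Counter(w for ws in words for w in ws),
        "first_last": [(ws[0], ws[-1]) for ws in words if ws],
    }

def histogram_overlap(a, b):
    # Shared counts divided by the larger total; 1.0 for identical histograms.
    total = max(sum(a.values()), sum(b.values()))
    if total == 0:
        return 1.0
    return sum((a & b).values()) / total

def paragraphs_similar(p, q, threshold=0.9):
    # Hypothetical rule: both histograms must overlap by at least `threshold`.
    fp, fq = paragraph_features(p), paragraph_features(q)
    return (histogram_overlap(fp["word_hist"], fq["word_hist"]) >= threshold
            and histogram_overlap(fp["char_hist"], fq["char_hist"]) >= threshold)

def sentences_reordered(p, q):
    # Same (starting word, length) pairs, but appearing in a different order.
    fp, fq = paragraph_features(p), paragraph_features(q)
    sig_p = [(fl[0], n) for fl, n in zip(fp["first_last"], fp["sentence_lengths"])]
    sig_q = [(fl[0], n) for fl, n in zip(fq["first_last"], fq["sentence_lengths"])]
    return sig_p != sig_q and sorted(sig_p) == sorted(sig_q)

def remove_redundant(paragraphs, threshold=0.9):
    # Keep the first occurrence of each group of similar paragraphs.
    kept = []
    for p in paragraphs:
        if not any(paragraphs_similar(p, k, threshold) for k in kept):
            kept.append(p)
    return kept
```

Because histograms ignore word order, two paragraphs containing the same sentences in a different sequence still score as similar in this sketch; the separate reordering check then confirms the match from the starting words and sentence lengths.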

The invention also examines all the images associated with the set of original documents and removes identical or similar images, while keeping pointers that could assist in a future reconstruction of the original documents. Finally, it merges the remaining text paragraphs and images to create the new document.
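
As a rough illustration of the image step, the sketch below keeps one copy of each image and records a pointer for every original occurrence. It rests on assumptions not found in the source: it detects only byte-identical images by SHA-256 digest, whereas catching merely similar images would need some form of perceptual comparison, and the function name and tuple layout are hypothetical.

```python
import hashlib

def dedupe_images(images):
    """images: list of (doc_id, image_name, raw_bytes) tuples (assumed layout)."""
    unique = {}    # digest -> raw bytes, each distinct image stored once
    pointers = []  # (doc_id, image_name, digest) for every original occurrence
    for doc_id, name, data in images:
        digest = hashlib.sha256(data).hexdigest()
        unique.setdefault(digest, data)           # keep only the first copy
        pointers.append((doc_id, name, digest))   # remember where it came from
    return unique, pointers
```

The pointer list is what would allow each original document to be rebuilt later: looking up a pointer's digest in `unique` returns the image that belonged at that position.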

