OCR and Imaging
dtSearch supports the PDF "image with hidden text" format, and can highlight hits directly on the scanned image in this format.
dtSearch also supports combined text and image displays in HTML.
dtSearch Desktop and Network include a built-in image viewer.
dtSearch recommends using fuzzy searching for sifting through possible OCR errors.
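
To illustrate why fuzzy searching helps with OCR errors, here is a minimal sketch of edit-distance matching. It is a generic illustration, not dtSearch's own fuzzy search algorithm; the function names `edit_distance` and `fuzzy_matches` and the `max_edits` parameter are hypothetical and chosen only for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


def fuzzy_matches(query: str, words: list[str], max_edits: int = 1) -> list[str]:
    """Return the words within max_edits of the query, ignoring case.
    Hypothetical helper for illustration only."""
    q = query.lower()
    return [w for w in words if edit_distance(q, w.lower()) <= max_edits]


# OCR output in which "alphabet" was misread as "alphaqet"
ocr_words = ["alphaqet", "alphabet", "alpine"]
found = fuzzy_matches("alphabet", ocr_words, max_edits=1)
```

With a tolerance of one edit, a search for "alphabet" would also retrieve the misread "alphaqet". A common OCR confusion such as "m" read as "rn" costs two edits, so a larger tolerance would be needed to catch it.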
OCR and PDF
The Adobe PDF file format provides two ways to combine, in a single
file, images and OCRed text, that is, text extracted from images by
Optical Character Recognition (OCR) software.
(1) The "image with hidden text" format stores the complete
original image of a scanned document, along with the text obtained through
OCR. The text is "hidden" in the sense that simply opening
the PDF file displays only the scanned image, not the underlying OCRed
text. Because the OCRed text is still stored in the file, however,
dtSearch can index and search it.

After a search, when a user clicks on an "image with hidden text"
PDF document, the dtSearch product will display the scanned
image. Because the actual OCRed text is "hidden," the
display will appear to highlight hits directly on the image. A dtSearch
Web demo shows hidden text highlighting.
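
The idea of highlighting hits on a scanned image can be sketched as follows. This is a simplified illustration under stated assumptions, not dtSearch's implementation and not the PDF file structure itself: it assumes the OCR step produced one record per word with a bounding box in page coordinates, and the `OcrWord` type and `hit_rectangles` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    """One OCRed word and where it sits on the scanned page (assumed layout)."""
    text: str
    x: float
    y: float
    width: float
    height: float


def hit_rectangles(words: list[OcrWord], query: str) -> list[tuple[float, float, float, float]]:
    """Return (x, y, width, height) boxes for every word matching the query.
    A viewer could draw these boxes over the image to mark the hits."""
    q = query.lower()
    return [
        (w.x, w.y, w.width, w.height)
        for w in words
        if w.text.lower().strip(".,;:!?\"'") == q
    ]


# Hypothetical OCR output for one line of a scanned page
page_words = [
    OcrWord("Invoice", 72.0, 700.0, 48.0, 12.0),
    OcrWord("number", 124.0, 700.0, 44.0, 12.0),
    OcrWord("1042.", 172.0, 700.0, 30.0, 12.0),
]
boxes = hit_rectangles(page_words, "invoice")
```

Because the search runs against the hidden text while the boxes are drawn at the matching positions on the image, the highlight appears to sit on the scanned page itself.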
(2) Another option for combining scanned images and OCRed text
in a single PDF file uses "small images" for the parts of each
scanned page that do not appear to be text. For example, the format would
store a picture or a signature as a small image embedded in the page.
The format would store the non-picture portion of the page only as OCRed
text.
While the "small images" alternative does not preserve the
true image of the original document, it does produce much more compact
files than the "image with hidden text" option. The "small
images" PDF file usually stores only a few images for each page,
instead of a complete image of the whole document. The text detected
through OCR in the "small images" format can also be more readable
because the resulting PDF file stores it as text with font information
rather than as an image.
For more information on both PDF/OCR options, including a list of some
additional third-party products that OCR into the PDF format, click here.
The dtSearch product line can instantly search terabytes of text across a desktop, network, Internet or Intranet site.
dtSearch products also serve as tools for publishing large document collections, with instant text searching, to Web sites or CD/DVDs.
over two dozen indexed, unindexed, fielded and full-text search options
highlights hits in HTML, XML and PDF, while displaying embedded links, formatting and images
converts other file types (word processor, database, spreadsheet, email and full-text of email attachments, ZIP, Unicode, etc.) to HTML for display with highlighted hits
built-in Spider adds a third-party or other Web site (public, secure content, password accessible, etc.) to your searchable database
Spider supports Web-based content (HTML, PDF, XML, etc.) as well as dynamically-generated content (ASP.NET, MS CMS, SharePoint, etc.)