Detailed instructions for use are in the User's Guide.
[. . . ] BusinessObjects ThingFinder™ SDK Getting Started Guide
BusinessObjects ThingFinder™ SDK 4.3
Copyright
© 2008 Business Objects, an SAP company. Business Objects owns the following U.S. patents, which may cover products that are offered and licensed by Business Objects: 5,295,243; 5,339,390; 5,555,403; 5,590,250; 5,619,632; 5,632,009; 5,857,205; 5,880,742; 5,883,635; 6,085,202; 6,108,698; 6,247,008; 6,289,352; 6,300,957; 6,377,259; 6,490,593; 6,578,027; 6,581,068; 6,628,312; 6,654,761; 6,768,986; 6,772,409; 6,831,668; 6,882,998; 6,892,189; 6,901,555; 7,089,238; 7,107,266; 7,139,766; 7,178,099; 7,181,435; 7,181,440; 7,194,465; 7,222,130; 7,299,419; 7,320,122 and 7,356,779.
Business Objects and its logos, BusinessObjects, Business Objects Crystal Vision, Business Process On Demand, BusinessQuery, Cartesis, Crystal Analysis, Crystal Applications, Crystal Decisions, Crystal Enterprise, Crystal Insider, Crystal Reports, Crystal Vision, Desktop Intelligence, Inxight and its logos, LinguistX, Star Tree, Table Lens, ThingFinder, Timewall, Let There Be Light, Metify, NSite, Rapid Marts, RapidMarts, the Spectrum Design, Web Intelligence, Workmail and Xcelsius are trademarks or registered trademarks in the United States and/or other countries of Business Objects and/or affiliated companies.
[. . . ] For information about using the tf.langid-config file, see "Language and Encoding Settings".
Byte Order Marks
In Unicode, the scalar value 0xFEFF is the zero-width no-break space (ZWNBSP) character. Under a little-endian serialization, its bytes appear in the order 0xFF 0xFE; read as a big-endian value, that is 0xFFFE, which is not a legal Unicode character. The character is treated as a byte order mark (BOM) only when it occurs at the very start of a Unicode input stream, such as a stream encoded in UTF-8, UTF-16, UCS-2 or UCS-4. At any other location, it is the ZWNBSP character.
The BOM serves two purposes. First, it may serve as a signature for Unicode streams. Second, it indicates the serialization of the Unicode input. In both cases, ThingFinder handles the BOM in a straightforward way: when a BOM is detected, it is used to determine whether the input is little-endian or big-endian, and it is then stripped from the input and not processed any further. Input that does not include a BOM is assumed to have the byte order of the current machine.
Note: ThingFinder serializes its output using the native byte order (little-endian or big-endian) of the host machine.
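The BOM handling described above can be illustrated with a short Python sketch. It is not ThingFinder code and does not use the ThingFinder API; the function name split_bom and its return convention are assumptions made for this example. It checks the start of a byte stream for a BOM, reports the byte order the BOM implies, strips the BOM, and falls back to the byte order of the current machine when no BOM is present.

```python
import codecs
import sys

# Hypothetical helper for illustration only; not part of ThingFinder.
# Longer marks are listed first because the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "little"),
    (codecs.BOM_UTF32_BE, "big"),
    (codecs.BOM_UTF8, None),        # UTF-8 has no byte order; the BOM is only a signature
    (codecs.BOM_UTF16_LE, "little"),
    (codecs.BOM_UTF16_BE, "big"),
]


def split_bom(data: bytes):
    """Return (byte_order, payload) with any leading BOM removed.

    byte_order is "little", "big", or None for a UTF-8 signature.
    Without a BOM, the byte order of the current machine is assumed.
    """
    for bom, order in _BOMS:
        if data.startswith(bom):
            return order, data[len(bom):]
    return sys.byteorder, data


order, payload = split_bom(b"\xff\xfeh\x00i\x00")
print(order, payload.decode("utf-16-le"))
```

One limitation of this sketch: UTF-16 little-endian input that starts with a BOM followed by U+0000 is indistinguishable from a UTF-32 little-endian BOM, so a real detector needs additional context to separate the two.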
Language Guide and Reference
Language Module Overview Document Properties
File Formats
ThingFinder processes HTML and plain-text input. Text in other formats should be converted before processing, using a separate conversion product.
Chapter 3: Configuring ThingFinder
This chapter describes the configurable features of ThingFinder. Information is presented in the following sections:
Language and Encoding Settings
Text Processing
Entity Type Weights
Sub-entities
Custom Extraction Rules
Post-processing Configuration
Language and Encoding Settings
This section describes the configurable language and encoding settings.
Detecting Language and Encoding
ThingFinder can automatically determine the language and encoding of input documents. To do this, it consults a matrix of encoding-language pairs, listed in the tf.langid-config file, during the language and encoding identification process. For example:
<encodings-languages-covered>
  <list key = "cp_1252">
    <item key = "english" />
    <item key = "french" />
    <item key = "german" />
    <item key = "spanish" />
  </list>
  <list key = "cp_1256">
    <item key = "arabic" />
  </list>
  <list key = "iso_8859_6">
    <item key = "arabic" />
  </list>
  <list key = "utf_8">
    <item key = "english" />
    <item key = "french" />
    <item key = "german" />
    <item key = "spanish" />
    <item key = "arabic" />
  </list>
</encodings-languages-covered>
This list should include, for each encoding, every language that could occur in the input text. Encoding-language pairs not listed here are not considered during detection. For instance, if the list includes only "cp_1252", the input is always identified as "cp_1252", regardless of its actual encoding.
The Unicode encodings UCS-2, UCS-4 and UTF-16 are not included by default because very few documents use them. If you process documents in these encodings, add them to the tf.langid-config file, located in the lx-3/lang directory: open the file in a text editor and add lines in the format shown above.
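To make the effect of the matrix concrete, the following Python sketch parses a shortened copy of the example and collects the encoding-language pairs it contains. This is only an illustration of how the list restricts the candidates; it is not how ThingFinder reads the file, and the function name candidate_pairs and the shortened sample are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example matrix, used here only for illustration.
SAMPLE = """
<encodings-languages-covered>
  <list key = "cp_1252">
    <item key = "english" />
    <item key = "french" />
  </list>
  <list key = "cp_1256">
    <item key = "arabic" />
  </list>
</encodings-languages-covered>
"""


def candidate_pairs(xml_text):
    """Return the set of (encoding, language) pairs listed in the matrix."""
    root = ET.fromstring(xml_text)
    return {
        (enc.get("key"), lang.get("key"))
        for enc in root.findall("list")
        for lang in enc.findall("item")
    }


pairs = candidate_pairs(SAMPLE)

# A pair that is listed is a detection candidate; one that is not listed
# would never be considered, whatever the input actually contains.
print(("cp_1252", "english") in pairs)
print(("cp_1252", "arabic") in pairs)
```

Listing the pairs this way is a quick check that every encoding and language expected in the input is covered before processing begins.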
Configuring the Names of Languages and Encodings
You can configure variant names for languages and encodings. The tf.language-encoding-config configuration file, located in the lx-3/lang directory, contains the standard language and encoding names in the <list> tag. Each of these has a corresponding list of accepted variant names in the <item key> tag.
[. . . ] The cgv utility enables you to display the linguistic analysis of a specific sentence, including, but not limited to, the CGUL STEM, POS, NP, TE, and CL marker information for the input data. The cgv utility accepts input either directly from the console or from a file. Either way, the cgv utility's output displays the analysis results of the entire input. The cgv utility is found in the same place as tfdemo (.\lx3\[platform]). [. . . ]