Index Structure

From ATIRE

Jump to: navigation, search

Each index file begins with the string "ANT Search Engine Index File\n\0\0". Performing head -n1 index.aspt will show this string on all valid ATIRE index files.

In general the index file contains the postings followed by the dictionary and then the header, although the dictionary second level and postings may be intertwined.

Contents

Header

Despite the name, the header for each index file is at the end of the document. The header contains the following data in the given order (with the file signature occupying the last 4 bytes of the file):

Variable Number of Bytes Notes
Position in the file of the first dictionary level 8
Length of the longest term 4
Number of unique terms 4 This only exists in version 0.4 of the index file (IMPACT_HEADER defined)
Length of the longest compressed postings list in bytes 4
Highest document frequency 8
Reserved 8 Possible future use: checksum
Reserved 8 Possible future use: collection name, currently set to 0x494e444558000000, or "INDEX" in Intel byte order
Version of the index 4 Used to make sure search engine will be able to process index
File signature 4 Used to test the validity of the index

Dictionary

There are two levels to the dictionary in ATIRE. On disk the second level is stored first, then the top level. The top level is always loaded into memory on start-up.

Top Level

There are 8 bytes dedicated to storing the number of nodes that are contained in the top level of the tree.

Then for each node, the first B_TREE_PREFIX_SIZE bytes (currently set to 4), or the whole term if smaller, are stored followed by the NUL terminator. Then 8 bytes for the position on disk of the second level for terms that begin with the prefix in this node (for instance, shee -> sheen, sheep, sheepish, sheet etc.)

Second Level

Each section of the second level begins with 4 bytes containing the number of terms in this section. Then for each term in the node:

Variable Number of Bytes Notes
Collection frequency 4
Document frequency 4
Position of postings on disk 8 DOCID_1 << 32 | IMPACT_1
Length of postings 4 DOCID_2
Length of postings in bytes 4 IMPACT_2
Term local max impact 1 Only if TERM_LOCAL_MAX_IMPACT is defined.
Position of string in string block 4

Then the strings for the tails of the terms in this node, NUL terminated (following the example above, n, p, pish, t etc.

In the case where SPECIAL_COMPRESSION is defined, then the postings position and postings length variables get replaced by the respective docid and impact values, as shown in the above table.

Postings

The structure of the postings depends largely on whether or not IMPACT_HEADER is defined, and whether the term is special (begins with "~")

~

Terms that begin with ~ denote variables to be used within the search engine and are not directly searchable.

These terms are neither difference encoded, or impacted ordered. The search engine allows these variables to be get and set with {get,set}_variable methods, which allows storage of arbitrary 64-bit values.

See known issues for some known issues with regards to ~ variables.

Impact Header

Each postings list is preceded by an impact header:

Variable Number of Bytes Notes
Postings chain 8 For chaining postings lists together, currently unused
Chain length 8 As above
Quantum count 4
Beginning of the postings 4 Pointer to the beginning of the postings on disk

Following this data is a compressed block of 3 * quantum_count values the contain the impact values, document counts and postings start.

The impact values are the tf values, or the option specified by -Q when indexing. The document counts contain the number of documents for each impact value. The postings start is a location on disk for the postings for that impact value. That is, there are document_counts[i] documents with impact impact_values[i] and the docids start at postings_start[i]. The docids are difference encoded for each impact. Each docid list for each impact value is compressed separately.

Non Impact Header

Postings lists are impact ordered, and stored in decreasing order of impact, with 0 identifying the end of the docid list for a given impact. For example: impact, docid, 0, impact, docid, docid, 0. Docid's are difference encoded for each impact, and the whole impact ordering is compressed.

See Also