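http://hi.baidu.com/masel (5 posts in total)
Learning Lucene 1: The Field Info File (.fnm)
Today I start studying the file formats of Lucene's index files.
To keep Lucene from writing the index as compound files, call IndexWriter.setUseCompoundFile(false). The resulting index then contains most of the individual Lucene index file types, such as .fdx, .fdt, .fnm and .frq.
The rest of this post covers the .fnm file.
Lucene's index file formats document (docs/fileformats.html) describes it as follows: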
Field Info
Field names are stored in the field info file, with suffix .fnm.
FieldInfos (.fnm) --> FieldsCount, <FieldName, FieldBits>^FieldsCount
FieldsCount --> VInt
FieldName --> String
FieldBits --> Byte
- The low-order bit is one for indexed fields, and zero for non-indexed fields.
- The second lowest-order bit is one for fields that have term vectors stored, and zero for fields without term vectors.
Lucene >= 1.9:
- If the third lowest-order bit is set (0x04), term positions are stored with the term vectors.
- If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors.
- If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field.
- If the sixth lowest-order bit is set (0x20), payloads are stored for the indexed field.
Fields are numbered by their order in this file. Thus field zero is the first field in the file, field one the next, and so on. Note that, like document numbers, field numbers are segment relative.
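An example:

Byte 0x00 (0x02) says there are 2 field infos (id and content).
Byte 0x01 (0x02) says the name of the first field is 2 characters long (i and d).
Bytes 0x02 and 0x03 (0x69 0x64) hold the name of the first field (id).
Byte 0x04 (0x01) is the FieldBits of the first field: only the first flag (indexed) is set.
Bytes 0x05-0x0d follow the same layout as bytes 0x01-0x04: a length byte (0x07), the 7 bytes of the name content, and a final byte 0x0d (0x0f) holding the FieldBits with flags 1, 2, 3 and 4 set (indexed, term vectors stored, with positions, with offsets).
The .fnm file corresponds to two classes in the source code: org.apache.lucene.index.FieldInfo and org.apache.lucene.index.FieldInfos.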
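Learning Lucene 2: The Stored Field Files (.fdx and .fdt)
Stored field data lives mainly in two kinds of file: the field index (.fdx) and the field data (.fdt).
Lucene's index file formats document (docs/fileformats.html) describes them as follows: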
Stored Fields
Stored fields are represented by two files:
- The field index, or .fdx file.
This contains, for each document, a pointer to its field data, as follows:
FieldIndex (.fdx) --> <FieldValuesPosition>^SegSize
FieldValuesPosition --> Uint64
This is used to find the location within the field data file of the fields of a particular document. Because it contains fixed-length data, this file may be easily randomly accessed. The position of document n's field data is the Uint64 at n*8 in this file.
- The field data, or .fdt file.
This contains the stored fields of each document, as follows:
FieldData (.fdt) --> <DocFieldData>^SegSize
DocFieldData --> FieldCount, <FieldNum, Bits, Value>^FieldCount
FieldCount --> VInt
FieldNum --> VInt
Lucene <= 1.4:
Bits --> Byte
Value --> String
Only the low-order bit of Bits is used. It is one for tokenized fields, and zero for non-tokenized fields.
Lucene >= 1.9:
Bits --> Byte
- low order bit is one for tokenized fields
- second bit is one for fields containing binary data
- third bit is one for fields with compression option enabled (if compression is enabled, the algorithm used is ZLIB)
Value --> String | BinaryValue (depending on Bits)
BinaryValue --> ValueSize, <Byte>^ValueSize
ValueSize --> VInt
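Put simply, .fdx stores, for each document, the position in .fdt where that document's field data starts (each position is a Uint64, 8 bytes), and .fdt stores the field data document by document. What .fdt stores for each document is the field count followed by the entries for all of its fields. Each field entry consists of the field number (FieldNum), the flag byte (Bits) and the value. Values come in two types, string and binary. A string value is stored as its character count (not its byte count) followed by the string contents, UTF-8 encoded.
An example:
The .fdx file in the figure:

It describes 2 documents: the field data of document 0 starts at byte 0x00 of .fdt, and the field data of document 1 starts at byte 0x65 of .fdt.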
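Because every .fdx entry is a fixed 8 bytes, the lookup is a single indexed read. A minimal sketch, assuming the whole file is already in a byte array and has no header (as in the format quoted above); FdxSketch is a made-up name, not a Lucene class:

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene: random access into an in-memory .fdx image.
final class FdxSketch {
    /** Number of documents: one 8-byte entry per document. */
    static int docCount(byte[] fdx) {
        return fdx.length / 8;
    }

    /** Offset in .fdt of document n's field data: the big-endian Uint64 at byte n*8. */
    static long fieldDataPosition(byte[] fdx, int docId) {
        // ByteBuffer reads big-endian by default, matching the byte order in the example.
        return ByteBuffer.wrap(fdx).getLong(docId * 8);
    }
}
```

For the 16-byte example file, docCount gives 2 and fieldDataPosition gives 0x00 for document 0 and 0x65 for document 1.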
Now look at the corresponding .fdt data:

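The shaded bytes in the figure are the field data of document 1. Byte 0x65 is the FieldCount: this document has 3 fields. Bytes 0x66-0x69 are the entry for field 0: byte 0x66 is the field number (FieldNum), byte 0x67 is the flag byte (Bits), byte 0x68 is the length of the value in characters (1), and byte 0x69 is the value itself, "4". Likewise, bytes 0x6a-0x86 are the entry for field 1: byte 0x6c is the length of the value in characters (0x0e), and bytes 0x6d-0x86 are the value. Because values are UTF-8 encoded and a character may take 1, 2 or 3 bytes, Lucene has to inspect each character as it reads in order to work out how many bytes it occupies; see the readChars(char[] buffer, int start, int length) method of org.apache.lucene.store.IndexInput. Why is the stored length the number of characters before encoding rather than the number of bytes after encoding? Possibly this has to do with how term positions and offsets are stored; I leave that question for later.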
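The per-character decision described above can be sketched like this. It is written from the description of the encoding, not copied from IndexInput, and ReadCharsSketch is a made-up name; it assumes well-formed input in which every character takes at most 3 bytes.

```java
// Hypothetical helper, not part of Lucene: decodes a character-counted UTF-8 string.
final class ReadCharsSketch {
    /** Decodes `length` characters (not bytes) starting at byte `offset`. */
    static String readChars(byte[] data, int offset, int length) {
        char[] out = new char[length];
        int pos = offset;
        for (int i = 0; i < length; i++) {
            int b = data[pos++] & 0xFF;
            if ((b & 0x80) == 0) {
                // 0xxxxxxx: one byte
                out[i] = (char) b;
            } else if ((b & 0xE0) != 0xE0) {
                // 110xxxxx 10xxxxxx: two bytes
                out[i] = (char) (((b & 0x1F) << 6) | (data[pos++] & 0x3F));
            } else {
                // 1110xxxx 10xxxxxx 10xxxxxx: three bytes
                out[i] = (char) (((b & 0x0F) << 12)
                        | ((data[pos++] & 0x3F) << 6)
                        | (data[pos++] & 0x3F));
            }
        }
        return new String(out);
    }
}
```

The reader cannot know how many bytes to consume until it has looked at the lead byte of each character, which is why the length prefix alone does not tell you where the value ends.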
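Learning Lucene 3: The Term Dictionary Files (.tis and .tii), the Frequency File (.frq) and the Position File (.prx)
The term dictionary consists of two files: the term info file (.tis) and the term info index file (.tii). The term info file records where each term's data starts in the frequency file (.frq) and in the position file (.prx).
Lucene's index file formats document (docs/fileformats.html) describes them as follows: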
Term Dictionary
The term dictionary is represented as two files:
- The term infos, or tis file.
TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos
TIVersion --> UInt32
TermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
MaxSkipLevels --> UInt32
TermInfos --> <TermInfo>^TermCount
TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
Term --> <PrefixLength, Suffix, FieldNum>
Suffix --> String
PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt
This file is sorted by Term. Terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's text.
TIVersion names the version of the format of this file and is -2 in Lucene 1.4.
Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
FieldNumber determines the term's field, whose name is stored in the .fnm file.
DocFreq is the count of documents which contain the term.
FreqDelta determines the position of this term's TermFreqs within the .frq file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).
ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).
SkipDelta determines the position of this term's SkipData within the .frq file. In particular, it is the number of bytes after TermFreqs that the SkipData starts. In other words, it is the length of the TermFreq data. SkipDelta is only stored if DocFreq is not smaller than SkipInterval.
- The term info index, or .tii file.
This contains every IndexInterval th entry from the .tis file, along with its location in the "tis" file. This is designed to be read entirely into memory and used to provide random access to the "tis" file.
The structure of this file is very similar to the .tis file, with the addition of one item per record, the IndexDelta.
TermInfoIndex (.tii)--> TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices
TIVersion --> UInt32
IndexTermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
TermIndices --> <TermInfo, IndexDelta>^IndexTermCount
IndexDelta --> VLong
IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.
SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int). Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more accelerable cases.
MaxSkipLevels is the max. number of skip levels stored for each term in the .frq file. A low value results in smaller indexes but less acceleration, a larger value results in slightly larger indexes but greater acceleration. See format of .frq file for more information about skip levels.
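The prefix sharing described for the .tis file ("bone" followed by "boy") comes down to two small string operations. A minimal sketch with made-up names (TermPrefixSketch is not a Lucene class), working on in-memory strings rather than on the file:

```java
// Hypothetical helper, not part of Lucene: shared-prefix coding of consecutive terms.
final class TermPrefixSketch {
    /** Writer side: how many leading characters the current term shares with the previous one. */
    static int sharedPrefixLength(String previous, String current) {
        int limit = Math.min(previous.length(), current.length());
        int i = 0;
        while (i < limit && previous.charAt(i) == current.charAt(i)) {
            i++;
        }
        return i;
    }

    /** Reader side: rebuild the term text from the previous term, PrefixLength and Suffix. */
    static String rebuild(String previous, int prefixLength, String suffix) {
        return previous.substring(0, prefixLength) + suffix;
    }
}
```

With previous term "bone" and current term "boy", sharedPrefixLength gives 2, and rebuild("bone", 2, "y") gives back "boy". The first term in the file has nothing before it, so its PrefixLength is 0 and its suffix is the whole text.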
A simple example:
An index holds 3 documents, and each document has a single field, "content".
doc0.add(new Field("content", "中国国家主席中国", Field.Store.YES,Field.Index.TOKENIZED));
doc1.add(new Field("content", "Lucene原理", Field.Store.YES,Field.Index.TOKENIZED));
doc2.add(new Field("content", "中国四川", Field.Store.YES, Field.Index.TOKENIZED));
That gives 6 terms (term), listed in the figure in the order of their UTF-8 encoded bytes.
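Ordering by encoded bytes can be sketched as an unsigned byte-wise comparison. This is only an illustration of that ordering, not the comparison code Lucene itself runs, and TermOrderSketch is a made-up name; it ignores the field name, since this example has a single field.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: orders term texts by their UTF-8 bytes.
final class TermOrderSketch {
    /** Compares two strings by their UTF-8 bytes, treating each byte as unsigned. */
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return x.length - y.length;
    }

    /** Returns a sorted copy of the given terms. */
    static String[] sorted(String... terms) {
        String[] copy = terms.clone();
        Arrays.sort(copy, TermOrderSketch::compareUtf8);
        return copy;
    }
}
```

ASCII letters encode to single bytes below 0x80, while Chinese characters start with a byte of 0xE4 or higher, so lucene sorts ahead of 中国.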

The corresponding tis file:

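Take one term as an example: the shaded bytes in the figure are the term info of term 1, "中国". Byte 0x24 is the PrefixLength, that is, the position at which this term's own text starts relative to the previous term (0 here, since 中国 shares nothing with the preceding term lucene); byte 0x25 is the number of characters in the term; bytes 0x26-0x2b are the term text (中国); byte 0x2c says the term belongs to field 0; byte 0x2d says the term occurs in 2 documents (documents 0 and 2); bytes 0x2e and 0x2f give, as deltas, where the term's data starts in the frequency file (.frq) and the position file (.prx), namely at 0x00+0x01 and 0x00+0x01. By the same rule, the data of term 2 starts at 0x00+0x01+0x03 in the .frq file and at 0x00+0x01+0x03 in the .prx file. This example has no SkipDelta.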
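Turning the stored deltas back into absolute file positions is a running sum. A minimal sketch (PointerDeltaSketch is a made-up name), using the FreqDelta values 0, 1 and 3 of the first three terms in this example:

```java
// Hypothetical helper, not part of Lucene: converts FreqDelta/ProxDelta values to file positions.
final class PointerDeltaSketch {
    /** deltas[i] is the delta stored for term i; the result holds each term's absolute position. */
    static long[] absolutePointers(int[] deltas) {
        long[] pointers = new long[deltas.length];
        long current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            pointers[i] = current;
        }
        return pointers;
    }
}
```

For the deltas {0, 1, 3} the positions come out as 0x00, 0x01 and 0x04, matching the sums written out above. Storing deltas keeps most values under 128, so each fits in a one-byte VInt.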
Next, the frq file:

The shaded bytes in the figure are the frq data of term 1 (中国). To see how these bytes are produced, consult Lucene's index file formats document:
Frequencies
The .frq file contains the lists of documents which contain each term, along with the frequency of the term in that document.
FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount
TermFreqs --> <TermFreq>^DocFreq
TermFreq --> DocDelta, Freq?
SkipData --> <<SkipLevelLength, SkipLevel>^(NumSkipLevels-1), SkipLevel> <SkipDatum>
SkipLevel --> <SkipDatum>^(DocFreq/(SkipInterval^(Level + 1)))
SkipDatum --> DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?
DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip --> VInt
SkipChildLevelPointer --> VLong
TermFreqs are ordered by term (the term is implicit, from the .tis file).
TermFreq entries are ordered by increasing document number.
DocDelta determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as another VInt.
For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven would be the following sequence of VInts:
15, 8, 3
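The DocDelta rule just quoted can be sketched in a few lines. The sketch takes the VInt values already expanded to ints, and TermFreqSketch is a made-up name rather than a Lucene class:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: decodes one term's TermFreqs.
final class TermFreqSketch {
    /** Returns {document number, frequency} pairs in increasing document order. */
    static List<int[]> decode(int[] vints) {
        List<int[]> postings = new ArrayList<>();
        int doc = 0;
        int i = 0;
        while (i < vints.length) {
            int docDelta = vints[i++];
            doc += docDelta >>> 1;                      // DocDelta/2 is the gap to the previous document
            int freq = (docDelta & 1) != 0 ? 1          // odd: the frequency is one
                                           : vints[i++]; // even: the frequency follows as its own VInt
            postings.add(new int[] {doc, freq});
        }
        return postings;
    }
}
```

The sequence 15, 8, 3 decodes to document 7 with frequency 1 (15 is odd, 15/2 = 7) and document 11 with frequency 3 (8 is even, 7 + 8/2 = 11, then the 3 is read as the frequency).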
DocSkip records the document number before every SkipInterval th document in TermFreqs. If payloads are disabled for the term's field, then DocSkip represents the difference from the previous value in the sequence. If payloads are enabled for the term's field, then DocSkip/2 represents the difference from the previous value in the sequence. If payloads are enabled and DocSkip is odd, then PayloadLength is stored indicating the length of the last payload before the SkipIntervalth document in TermPositions. FreqSkip and ProxSkip record the position of every SkipInterval th entry in FreqFile and ProxFile, respectively. File positions are relative to the start of TermFreqs and Positions, to the previous SkipDatum in the sequence.
For example, if DocFreq=35 and SkipInterval=16, then there are two SkipData entries, containing the 15 th and 31 st document numbers in TermFreqs. The first FreqSkip names the number of bytes after the beginning of TermFreqs that the 16 th SkipDatum starts, and the second the number of bytes after that that the 32 nd starts. The first ProxSkip names the number of bytes after the beginning of Positions that the 16 th SkipDatum starts, and the second the number of bytes after that that the 32 nd starts.
Lucene 2.2 introduces the notion of skip levels. Each term can have multiple skip levels. The amount of skip levels for a term is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))). The number of SkipData entries for a skip level is DocFreq/(SkipInterval^(Level + 1)), whereas the lowest skip level is Level=0.
Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries, containing the 3rd, 7th, 11th, 15th, 19th, 23rd, 27th, and 31st document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the 15th and 31st document numbers in TermFreqs.
The SkipData entries on all upper levels > 0 contain a SkipChildLevelPointer referencing the corresponding SkipData entry in level-1. In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer to entry 31 on level 0.
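The level arithmetic in the example can be checked with integer math alone. This is a minimal sketch, not the code Lucene uses when writing skip lists; SkipLevelSketch is a made-up name, and the loop stands in for floor(log(DocFreq)/log(SkipInterval)) so that no floating point is involved.

```java
// Hypothetical helper, not part of Lucene: skip-level bookkeeping for one term.
final class SkipLevelSketch {
    /** Min(maxSkipLevels, number of times skipInterval divides into docFreq). */
    static int numSkipLevels(int docFreq, int skipInterval, int maxSkipLevels) {
        int levels = 0;
        for (long span = skipInterval; span <= docFreq && levels < maxSkipLevels; span *= skipInterval) {
            levels++;
        }
        return levels;
    }

    /** Number of SkipData entries on a level: docFreq / skipInterval^(level + 1). */
    static int entriesAtLevel(int docFreq, int skipInterval, int level) {
        long span = 1;
        for (int i = 0; i <= level; i++) {
            span *= skipInterval;
        }
        return (int) (docFreq / span);
    }
}
```

For SkipInterval = 4, MaxSkipLevels = 2 and DocFreq = 35 this gives 2 levels, with 8 entries on level 0 (35/4) and 2 entries on level 1 (35/16), the same numbers as in the quoted example.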
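This is complicated, and I have not fully understood it myself. What follows covers only the simplest case. With the description above we can work out what the bytes mean and how they were produced. Take bytes 0x01-0x03 in detail. Byte 0x01 is 0, an even number, so we must also read the next byte, at 0x02, whose value is 0x02; together the two bytes mean: in document 0/2=0 the term occurs 0x02 times. Byte 0x03 is 5, an odd number, so it means: in document 2 (the previous document number 0 plus 5/2=2) the term occurs once. Now look at byte 0x00, which holds the frq data of the term lucene. lucene occurs only in document 1, and only once, so the byte is (1-0)*2+1=3. The prx file works much like the frq file; in Lucene's index file formats document, the following passage deserves attention: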
PositionDelta is, if payloads are disabled for the term's field, the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document). If payloads are enabled for the term's field, then PositionDelta/2 is the difference between the current and the previous position. If payloads are enabled and PositionDelta is odd, then PayloadLength is stored, indicating the length of the payload at the current term position.

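In this example (payloads disabled), the shaded bytes in the figure give the 2 positions of term 0x01 (中国) in document 0: 0x00 and 0x00+0x03; byte 0x03 gives the position at which the term occurs in document 2: 0x00.
Now back to the tii file: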

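The .tis file produces one index point per index interval: whenever a term's number in .tis (counting from 0) is divisible by IndexInterval (Lucene's default interval is 0x80 = 128 terms), the term preceding it is saved in the .tii file as an index point (the first index point is the empty string "").
This example has fewer than 0x80 terms, so there is only one index point. Its block starts at byte 0x18 of the tis file.
The .tii file holds pointers into the .tis file. At search time the .tii file is loaded into memory up front; a binary search then finds the nearest index term that is less than or equal to the query term, and starting from the .tis position that this index term points to, the terms in .tis are scanned sequentially until either the query term is found or a term that sorts after the query term is reached (which means the index does not contain the query term).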
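The two-step lookup (binary search over the index terms, then a sequential scan) can be sketched on in-memory arrays. Everything here is a stand-in of my own: TermIndexSketch is not a Lucene class, term ordinals replace file pointers, field names are ignored, and plain String.compareTo is assumed as the term order.

```java
// Hypothetical helper, not part of Lucene: .tii-style lookup over in-memory arrays.
final class TermIndexSketch {
    /** Binary search: position of the last index term that is <= query. */
    static int findBlock(String[] indexTerms, String query) {
        int lo = 0;
        int hi = indexTerms.length - 1;
        while (hi >= lo) {
            int mid = (lo + hi) >>> 1;
            int cmp = query.compareTo(indexTerms[mid]);
            if (cmp < 0) {
                hi = mid - 1;
            } else if (cmp > 0) {
                lo = mid + 1;
            } else {
                return mid;
            }
        }
        return hi;
    }

    /**
     * indexTerms[0] is "" and indexTerms[k] is sortedTerms[k * interval - 1].
     * Returns the ordinal of query in sortedTerms, or -1 if it is absent.
     */
    static int lookup(String[] sortedTerms, String[] indexTerms, int interval, String query) {
        int block = findBlock(indexTerms, query);
        int start = Math.max(0, block * interval - 1); // ordinal of the index term itself
        for (int i = start; i < sortedTerms.length; i++) {
            int cmp = sortedTerms[i].compareTo(query);
            if (cmp == 0) {
                return i;
            }
            if (cmp > 0) {
                break; // passed the place where query would be
            }
        }
        return -1;
    }
}
```

The design keeps only every interval-th term in memory; the binary search narrows the search to one block, so the sequential scan never reads more than about one interval's worth of terms.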
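Learning Lucene 4: The Normalization Factors File (.nrm)
Because the form and notation of the formula resemble the normal distribution from mathematical statistics, I translate Normalization Factors into Chinese as 正态化因子.
Lucene's index file formats document says: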
Normalization Factors
Pre-2.1: There's a norm file for each indexed field with a byte for each document. The .f[0-9]* file contains, for each document, a byte that encodes a value that is multiplied into the score for hits on that field:
Norms (.f[0-9]*) --> <Byte>^SegSize
2.1 and above: There's a single .nrm file containing all norms:
AllNorms (.nrm) --> NormsHeader, <Norms>^NumFieldsWithNorms
Norms --> <Byte>^SegSize
NormsHeader --> 'N','R','M',Version
Version --> Byte
NormsHeader has 4 bytes, last of which is the format version for this file, currently -1.
Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-8 contain the 5-bit exponent.
These are converted to an IEEE single float value as follows:
- If the byte is zero, use a zero float.
- Otherwise, set the sign bit of the float to zero;
- add 48 to the exponent and use this as the float's exponent;
- map the mantissa to the high-order 3 bits of the float's mantissa; and
- set the low-order 21 bits of the float's mantissa to zero.
A separate norm file is created when the norm values of an existing segment are modified. When field N is modified, a separate norm file .sN is created, to maintain the norm values for that field.
Pre-2.1: Separate norm files are created only for compound segments.
2.1 and above: Separate norm files are created (when adequate) for both compound and non compound segments.
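Lucene stores one normalization factor for each indexed field of each document. When building the index, Lucene computes the normalization factor of each field as a floating-point value, converts that float to a single byte using the IEEE-based conversion rules above, and stores the byte in the .nrm file. At search time the byte is read back from .nrm and converted to a float again. The conversion is implemented in SmallFloat.
The conversion rules above seem to differ from IEEE_754: in IEEE_754 a single-precision float has 23 fraction bits, whereas the Lucene authors appear to assume 24.
For how the normalization factor is computed, see the Similarity section of the API documentation.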

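Here lengthNorm(field) is computed in DefaultSimilarity as (float)(1.0 / Math.sqrt(numTerms)), where numTerms is the number of terms the field yields after tokenization.
For example, take a document whose boost is 1.0f and which has a field (with a unique name) containing 中国国家主席. Tokenizing this field yields 3 terms: 中国, 国家 and 主席. The normalization factor of this field is therefore norm = 1.0*1.0/3^0.5, and SmallFloat.floatToByte315 converts it to the byte 0x78; that byte is what the .nrm file stores for this field.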
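The calculation can be reproduced with a small stand-alone sketch. The two conversion methods are my reconstruction of what SmallFloat does (3 mantissa bits, 5 exponent bits, exponent offset of 48), not a copy checked against the Lucene source, and NormByteSketch and encodeNorm are made-up names.

```java
// Hypothetical helper modelled on SmallFloat; not the Lucene class itself.
final class NormByteSketch {
    /** Float to byte: keep the 8 bits just above bit 21 and rebase the exponent by 48. */
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small < (63 - 15) << 3) {
            // Too small to represent: zero and negatives map to 0, tiny positives to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            return -1; // too large: saturate at the biggest byte value
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    /** Byte to float: the inverse mapping, with byte zero meaning exactly 0.0f. */
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xFF) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /** norm = document boost * field boost * 1/sqrt(numTerms), encoded as one byte. */
    static byte encodeNorm(float docBoost, float fieldBoost, int numTerms) {
        float norm = docBoost * fieldBoost * (float) (1.0 / Math.sqrt(numTerms));
        return floatToByte315(norm);
    }
}
```

encodeNorm(1.0f, 1.0f, 3) yields 0x78, as in the example. Decoding 0x78 gives 0.5f rather than 0.577, which shows how much precision the single byte gives up.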
Learning Lucene 5: The Segments Files (segments_N and segments.gen)
The active segments in the index are stored in the segment info file, segments_N. There may be one or more segments_N files in the index; however, the one with the largest generation is the active one (when older segments_N files are present it's because they temporarily cannot be deleted, or, a writer is in the process of committing, or a custom IndexDeletionPolicy is in use). This file lists each segment by name, has details about the separate norms and deletion files, and also contains the size of each segment.
As of 2.1, there is also a file segments.gen. This file contains the current generation (the _N in segments_N) of the index. This is used only as a fallback in case the current generation cannot be accurately determined by directory listing alone (as is the case for some NFS clients with time-based directory cache expiration). This file simply contains an Int32 version header (SegmentInfos.FORMAT_LOCKLESS = -2), followed by the generation recorded as Int64, written twice.
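The segments.gen layout is small enough to read in one go. A minimal sketch, assuming the 20 bytes are already in memory and are big-endian; SegmentsGenSketch is a made-up name, and returning -1 on any mismatch is my own convention, not Lucene's behaviour:

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene: reads an in-memory segments.gen image.
final class SegmentsGenSketch {
    static final int FORMAT_LOCKLESS = -2;

    /** Returns the generation, or -1 if the header is wrong or the two copies disagree. */
    static long readGeneration(byte[] file) {
        if (file.length < 20) {
            return -1;
        }
        ByteBuffer in = ByteBuffer.wrap(file); // big-endian by default
        if (in.getInt() != FORMAT_LOCKLESS) {
            return -1;
        }
        long first = in.getLong();
        long second = in.getLong();
        return first == second ? first : -1;
    }
}
```

Writing the generation twice lets a reader detect a file that was caught halfway through being written: if the two copies differ, the value cannot be trusted.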
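Pre-2.1: Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize>^SegCount
2.1 and above: Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize, DelGen, HasSingleNormFile, NumField, NormGen^NumField, IsCompoundFile>^SegCount
2.3 and above: Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize, DelGen, DocStoreOffset, [DocStoreSegment, DocStoreIsCompoundFile], HasSingleNormFile, NumField, NormGen^NumField, IsCompoundFile>^SegCount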
Format, NameCounter, SegCount, SegSize, NumField, DocStoreOffset --> Int32
Version, DelGen, NormGen --> Int64
SegName, DocStoreSegment --> String
IsCompoundFile, HasSingleNormFile, DocStoreIsCompoundFile --> Int8
Format is -1 as of Lucene 1.4, -3 (SegmentInfos.FORMAT_SINGLE_NORM_FILE) as of Lucene 2.1 and 2.2, and -4 (SegmentInfos.FORMAT_SHARED_DOC_STORE) as of Lucene 2.3
Version counts how often the index has been changed by adding or deleting documents.
NameCounter is used to generate names for new segment files.
SegName is the name of the segment, and is used as the file name prefix for all of the files that compose the segment's index.
SegSize is the number of documents contained in the segment index.
DelGen is the generation count of the separate deletes file. If this is -1, there are no separate deletes. If it is 0, this is a pre-2.1 segment and you must check filesystem for the existence of _X.del. Anything above zero means there are separate deletes (_X_N.del).
NumField is the size of the array for NormGen, or -1 if there are no NormGens stored.
NormGen records the generation of the separate norms files. If NumField is -1, there are no normGens stored and they are all assumed to be 0 when the segment file was written pre-2.1 and all assumed to be -1 when the segments file is 2.1 or above. The generation then has the same meaning as delGen (above).
IsCompoundFile records whether the segment is written as a compound file or not. If this is -1, the segment is not a compound file. If it is 1, the segment is a compound file. Else it is 0, which means we check filesystem to see if _X.cfs exists.
If HasSingleNormFile is 1, then the field norms are written as a single joined file (with extension .nrm); if it is 0 then each field's norms are stored as separate .fN files. See "Normalization Factors" below for details.
DocStoreOffset, DocStoreSegment, DocStoreIsCompoundFile: If DocStoreOffset is -1, this segment has its own doc store (stored fields values and term vectors) files and DocStoreSegment and DocStoreIsCompoundFile are not stored. In this case all files for stored field values (*.fdt and *.fdx) and term vectors (*.tvf, *.tvd and *.tvx) will be stored with this segment. Otherwise, DocStoreSegment is the name of the segment that has the shared doc store files; DocStoreIsCompoundFile is 1 if that segment is stored in compound file format (as a .cfx file); and DocStoreOffset is the starting document in the shared doc store files where this segment's documents begin. In this case, this segment does not store its own doc store files but instead shares a single set of these files with other segments.