Mailing List Archive

Modifying Lucene to make it an inverted index suitable for cloud-native environments
Now that lxdb has shipped, I keep feeling that something is missing. Before lxdb started, I took a pen and sketched what the perfect database looked like in my mind. The main design goals have now been met, yet I still feel it is not perfect. The root cause is that Lucene, the core of lxdb, is not perfect enough: it is not splittable, and it still has shortcomings in a cloud-native environment.

In essence, lxdb integrates Spark, HBase and Lucene into one product, much like the gourd baby cartoon (Calabash Brothers) I watched as a child, in which the seven gourd brothers merge into one big Diamond Gourd Baby that is more powerful than any of them. It has the strong OLAP analysis ability of Spark, the real-time update ability of HBase, and fast multi-dimensional filtering with the help of the Lucene index. Roughly speaking, it is close to a perfect product and can cover most scenarios in the big data field: distributed storage, distributed computing, high concurrency and flexibility. Few products on the market match lxdb technically in combining the timeliness of Kudu, the OLAP performance of Spark, the full-text retrieval of ES and the high concurrency of HBase. In real use, however, some things are quite unpleasant. Let me go through them one by one.

==Existing problems==
1. The processes must be resident and cannot be started on demand
The drawback of Lucene and HBase is that once the service is started, its processes must stay resident. Whether or not there are any queries or data imports, those processes have to keep running.
What I would prefer is something like native Spark: processes are started when SQL queries arrive, and are gradually reclaimed when they are no longer in use.

2. Different computations on the same data are not separated, so computing resources cannot be isolated
Another drawback of resident processes is that all computation must read the data through them. Most of the time tasks have different priorities: ad hoc queries are expected to respond much faster than batch queries. We want to give more and faster resources to ad hoc queries and let batch tasks run slowly in the background.

Resident processes cause us a lot of trouble here. We would rather separate the computation, assigning different types of tasks to different processes or even to different compute nodes, so that they do not affect each other.

3. Not splittable: the computing resources used on the same data cannot be adjusted flexibly
For the same piece of data, we often want a very important query to return quickly, so I would like to allocate a lot of computing resources to it and get the result as soon as possible, while unimportant tasks get only a few processes and run slowly. Today that is not possible: computing resources cannot be adjusted or sliced dynamically, and computation is bound to a fixed set of processes.

4. Multiple systems cannot interoperate
Most of the time I wish the index format of lxdb were more open and could be used directly by other systems without any change. Take Hive: I create a table and declare the Parquet format, and besides Hive itself, Impala can access its data directly, and so can Presto and Spark. Such a system is more flexible.

Because HBase and Lucene are currently bound to processes, other systems can reach the data in lxdb only by going through the lxdb service and its resident processes. This greatly hurts efficiency and makes interoperation between systems more complex. We would rather interoperate directly at the file layer through a Parquet-like format, with no relay service.

==How we plan to solve these problems==
1. We don't plan to abandon Lucene

Lucene is still the king of full-text retrieval and multi-dimensional retrieval. I have benchmarked various data formats and database systems, and on the performance indicators nothing compares: in this field nothing I tested surpasses Lucene, and it is no exaggeration to call it king-level. The popular Solr and Elasticsearch also rely directly or indirectly on Lucene.

2. We plan to transform Lucene

Lucene's core is the inverted index, which involves both inverted and forward storage formats. We intend to keep these concepts and the API, and the logic stays unchanged.

What changes is the implementation of the inverted and forward storage: the original BlockTree, the block-compressed fdt files and the column-stored docvalues are replaced with Parquet's nested columnar storage. In fact, we find that the nested columnar format is very similar to an inverted index. It is just that when most people use Parquet, the data is stored in arbitrary order. Once we move Parquet into the Lucene framework, the ordered nature of the inverted table (postings list) makes Parquet perform particularly well. Moreover, Lucene's original inverted table can only be single-dimensional. After we replace it with Parquet, it becomes multi-dimensional, that is, stored by column.
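
As a rough illustration of why a sorted inverted table maps well onto columnar storage, the sketch below keeps postings as parallel columns sorted by (term, doc ID) and finds a term's rows with a binary search. Plain Python lists stand in for Parquet column chunks. The names `build_columns` and `lookup` and the extra `freq` column are hypothetical, chosen only for this example, and are not part of lxdb, Lucene or Parquet.

```python
from bisect import bisect_left, bisect_right

def build_columns(postings):
    """Turn (term, doc_id, freq) rows into three parallel, sorted columns."""
    rows = sorted(postings)  # sort by term, then by doc_id
    terms = [r[0] for r in rows]
    doc_ids = [r[1] for r in rows]
    freqs = [r[2] for r in rows]
    return terms, doc_ids, freqs

def lookup(columns, term):
    """Return (doc_id, freq) pairs for one term via binary search on the term column."""
    terms, doc_ids, freqs = columns
    lo = bisect_left(terms, term)
    hi = bisect_right(terms, term)
    # Rows for one term are contiguous, so each column is read as one slice.
    return list(zip(doc_ids[lo:hi], freqs[lo:hi]))

# Hypothetical sample postings, deliberately given out of order.
columns = build_columns([
    ("lucene", 7, 2), ("spark", 3, 1), ("lucene", 2, 5), ("hbase", 2, 1),
])
print(lookup(columns, "lucene"))
```

Because the rows are sorted, all entries for a term sit next to each other in every column, so a reader touches one contiguous slice per column and no scattered positions.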

3. After Lucene is modified, opening an index no longer loads it into memory in advance. The Parquet format involves no preloading, and an index opens much faster without preload. It can be loaded dynamically in different processes according to computing needs.

4. After Lucene's inverted table is reimplemented on Parquet, it becomes a multi-dimensional inverted table that can also be used for statistical analysis. Lucene is not good at traversing the inverted table, because traversal involves coordinating multiple files such as tip, tim, doc, pay and pos, and following about 6 to 9 different file pointers. With the original Lucene, a * query or a similar match-everything retrieval drives system resource usage very high, or even exhausts it. With the Parquet implementation these files are merged into one file, and Parquet traverses very well. On top of the Lucene index, Parquet also has a layer of coarse index and can skip block by block, which makes multi-condition merging and combined statistical analysis more efficient than the original skip list. The original payload in Lucene has no filtering ability. With Parquet it can now be filtered. For statistical analysis, Parquet will outperform Lucene because it does less random reading.
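
The block-level skipping described in item 4 can be sketched roughly as follows. This is a minimal model, assuming the coarse index is a per-block minimum and maximum (similar in spirit to the min/max statistics Parquet keeps for its column chunks); the actual lxdb layout is not described in this post. The names `build_blocks` and `scan_range` are hypothetical.

```python
def build_blocks(sorted_doc_ids, block_size):
    """Cut a sorted doc-ID list into blocks and record each block's min and max."""
    blocks = []
    for i in range(0, len(sorted_doc_ids), block_size):
        chunk = sorted_doc_ids[i:i + block_size]
        blocks.append({"min": chunk[0], "max": chunk[-1], "ids": chunk})
    return blocks

def scan_range(blocks, lo, hi):
    """Return the doc IDs in [lo, hi] and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for block in blocks:
        if block["max"] < lo or block["min"] > hi:
            continue  # coarse index shows no overlap, so the block is skipped unread
        blocks_read += 1
        hits.extend(d for d in block["ids"] if lo <= d <= hi)
    return hits, blocks_read

# Hypothetical data: 20 doc IDs (0, 5, ..., 95) in blocks of 4.
blocks = build_blocks(list(range(0, 100, 5)), block_size=4)
hits, blocks_read = scan_range(blocks, 42, 58)
print(hits, blocks_read, len(blocks))
```

In this toy run only one of the five blocks overlaps the requested range, so the other four are never decoded. That is the saving the coarse index is meant to provide.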

5. Splitting is easy with Parquet. Take a single index with 10 million records: I can split it into 10 parts of 1 million records each and hand them to 10 different processes to compute, which is impossible with the original Lucene.
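
The arithmetic behind the split in item 5 can be sketched as below. It assumes splits are contiguous record ranges of near-equal size, which is only one possible policy; real Parquet splits usually follow row-group boundaries. The function name `plan_splits` is hypothetical.

```python
def plan_splits(total_records, num_splits):
    """Divide [0, total_records) into num_splits contiguous half-open ranges."""
    if num_splits <= 0:
        raise ValueError("num_splits must be positive")
    base, extra = divmod(total_records, num_splits)
    splits, start = [], 0
    for i in range(num_splits):
        # The first `extra` splits take one more record to absorb the remainder.
        size = base + (1 if i < extra else 0)
        splits.append((start, start + size))
        start += size
    return splits

# The example from the text: 10 million records over 10 processes.
splits = plan_splits(10_000_000, 10)
print(len(splits), splits[0], splits[-1])
```

Each (start, end) pair could then be handed to a separate process. Changing `num_splits` is all it takes to give an important query more parallelism than a background batch job.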

6. True read-write separation. After the transformation, HBase is used only to handle data writes and Lucene index merges. It also creates lightweight real-time snapshots for queries, so that old segments are not deleted while an index merge is in progress. All query requests are separated out: they can run in the query processes of lxdb, or in Hive, Spark, or even Impala. Because the index is Parquet, it can be read directly by wrapping it in an InputFormat or an RDD.
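
One common way to keep old segments alive while a query still reads them is reference counting, sketched below. This is an assumption about how such snapshots could work, not a description of lxdb's implementation. The class `SegmentStore` and its methods are hypothetical, and "deletion" is modelled as appending a name to a list.

```python
class SegmentStore:
    """Tracks live segments and defers deletion while snapshots still pin them."""

    def __init__(self, segments):
        self.live = list(segments)   # segments visible to new snapshots
        self.refs = {}               # segment name -> number of open snapshots
        self.pending_delete = set()  # merged-away segments still pinned
        self.deleted = []            # segments whose files would have been removed

    def acquire_snapshot(self):
        """Pin the current set of live segments for one query."""
        snapshot = tuple(self.live)
        for seg in snapshot:
            self.refs[seg] = self.refs.get(seg, 0) + 1
        return snapshot

    def release_snapshot(self, snapshot):
        """Unpin a snapshot and delete segments that nothing references any more."""
        for seg in snapshot:
            self.refs[seg] -= 1
            if self.refs[seg] == 0 and seg in self.pending_delete:
                self.pending_delete.discard(seg)
                self.deleted.append(seg)

    def merge(self, old_segments, merged_name):
        """Swap old_segments for one merged segment; delete old ones only if unpinned."""
        self.live = [s for s in self.live if s not in old_segments] + [merged_name]
        for seg in old_segments:
            if self.refs.get(seg, 0) > 0:
                self.pending_delete.add(seg)
            else:
                self.deleted.append(seg)

store = SegmentStore(["seg_0", "seg_1"])
snap = store.acquire_snapshot()           # a query starts reading
store.merge(["seg_0", "seg_1"], "seg_2")  # the writer merges in the meantime
print(store.deleted)                      # nothing deleted while the query runs
store.release_snapshot(snap)              # the query finishes
print(store.deleted)
```

The writer can merge at any time without coordinating with readers. An old segment's files go away only after the last snapshot that pinned them is released.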

7. It is better suited to the cloud-native environment

For real-time writes, we only need to keep a certain number of HBase processes running so that the Lucene index is written continuously in real time, while all queries are started on demand in the cloud-native environment and their resources are released when there are no queries. However, while Lucene is generating data, some of it sits temporarily in memory. Here we will have to consider the freshness of this short-lived data: this part of the index could be kept in a distributed in-memory file system such as Alluxio using NRT techniques, or stored directly in a KV system, but this is not a near-term focus.







fp@lucene.cn yannian mu
www.lucene.xin