A little about database design for search engines

Without a database — in fact, without several radically different ones — such a project is impossible, so we will devote some time to this question.

At the very least we will need a database serving ordinary "flat" (2D) data — i.e., where an ID is mapped to a single data field.
Why do I consider one data field enough? Because:

  • selection is done only by the ID field; searching through the data itself is not possible — for that there is a specialized index, otherwise such volumes of information would be of little use
  • any number of fields can be packed into one; for this I put together "on the knee" a small set of helper libraries (sketched below); in particular, a CRC of the data is stored with the packed record, so that, God forbid, broken data is never used

If you are not trying to minimize the number of lines of code working with the data, and can give up a bit of convenience, almost any task can be reduced to one where these two points are sufficient. Given such strict requirements on optimality and speed, this trade-off is, in my opinion, justified.
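As an illustration, here is a minimal sketch of the field-packing idea: several fields are serialized into one blob together with a CRC, and unpacking refuses to return a record whose checksum does not match. This is not the author's actual library — the names and the on-disk layout are assumptions.

#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Plain CRC-32 (reflected, polynomial 0xEDB88320), computed bytewise.
static uint32_t crc32(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Pack fields as [u32 count][u32 len][bytes]... with a trailing CRC.
std::vector<uint8_t> pack(const std::vector<std::string>& fields) {
    std::vector<uint8_t> out;
    auto put32 = [&out](uint32_t v) {
        for (int i = 0; i < 4; ++i) out.push_back(uint8_t(v >> (8 * i)));
    };
    put32(uint32_t(fields.size()));
    for (const std::string& f : fields) {
        put32(uint32_t(f.size()));
        out.insert(out.end(), f.begin(), f.end());
    }
    put32(crc32(out.data(), out.size()));   // checksum covers all bytes above
    return out;
}

// Unpack; a record with a CRC mismatch is reported as broken, never used.
std::vector<std::string> unpack(const std::vector<uint8_t>& blob) {
    auto get32 = [&blob](size_t pos) {
        uint32_t v = 0;
        for (int i = 0; i < 4; ++i) v |= uint32_t(blob[pos + i]) << (8 * i);
        return v;
    };
    if (blob.size() < 8 ||
        get32(blob.size() - 4) != crc32(blob.data(), blob.size() - 4))
        throw std::runtime_error("broken record: CRC mismatch");
    std::vector<std::string> fields(get32(0));
    size_t pos = 4;
    for (std::string& f : fields) {
        uint32_t len = get32(pos);
        pos += 4;
        f.assign(blob.begin() + pos, blob.begin() + pos + len);
        pos += len;
    }
    return fields;
}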

The basic operations on these DB tables are:

  • select by ID
  • read the entire table sequentially (into memory or into a hash)
  • update a record by ID
  • insert a new record at the end, obtaining its ID


I found it optimal to use tables of the "page type", where the data is stored in, written to, and read from disk in pages, with a fixed number of records per page. If the record size is also fixed and known in advance, the table works even faster; however, nothing fundamentally changes when record sizes vary significantly in practice. Updates and appends at the end are done within a page in memory, after which the page is written to disk. Within the file, the table's pages are stored sequentially.
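A minimal sketch of such a page table, assuming the fast case of a fixed record size; the constants, names, and single-file layout here are illustrative, not the author's code:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <vector>

constexpr size_t RECORD_SIZE      = 64;    // assumed fixed record size
constexpr size_t RECORDS_PER_PAGE = 1024;  // fixed number of records per page
constexpr size_t PAGE_SIZE        = RECORD_SIZE * RECORDS_PER_PAGE;

class PageTable {
public:
    explicit PageTable(const char* path) {
        f_ = std::fopen(path, "r+b");
        if (!f_) f_ = std::fopen(path, "w+b");   // create on first use
        if (!f_) throw std::runtime_error("cannot open table file");
    }
    ~PageTable() { std::fclose(f_); }

    // Update a record by ID: load its page, patch the record in memory,
    // then write the whole page back; pages sit sequentially in the file.
    void update(uint64_t id, const uint8_t rec[RECORD_SIZE]) {
        uint64_t page = id / RECORDS_PER_PAGE;
        std::vector<uint8_t> buf(PAGE_SIZE, 0);
        readPage(page, buf.data());
        std::memcpy(buf.data() + (id % RECORDS_PER_PAGE) * RECORD_SIZE,
                    rec, RECORD_SIZE);
        writePage(page, buf.data());
    }

private:
    void readPage(uint64_t page, uint8_t* buf) {
        std::fseek(f_, long(page * PAGE_SIZE), SEEK_SET);
        (void)std::fread(buf, 1, PAGE_SIZE, f_);  // short read past EOF leaves zeros
    }
    void writePage(uint64_t page, const uint8_t* buf) {
        std::fseek(f_, long(page * PAGE_SIZE), SEEK_SET);
        std::fwrite(buf, 1, PAGE_SIZE, f_);
    }
    std::FILE* f_;
};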

The question is how to update records in the middle of the table when their size changes — after all, if the whole table is 10-20-200 GB or more, copying half of it to a temporary file and back would take hours. I pushed this question down to the file system by dividing all pages into blocks. One block is a single file on disk, and the number of files per table is unlimited; each file stores a fixed number of pages sequentially. Then, to update a record in the middle of the table, I only have to rewrite one file, which is much smaller and usually of limited size. The responsibility of not asking the file system to do stupid things — write at the beginning, then at the end, then at the beginning again — I took upon myself: the server always writes in batches, the corresponding functionality is optimized as far as possible, and everything happens in memory. And of course the entire set of search-engine modules is built on the assumption that writing 1000 records at the end is faster than writing 1 at the beginning — so when something must be written at the beginning, it is sometimes simpler to make a copy of the table.
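The resulting addressing is simple arithmetic; here is a sketch of mapping a page number to its block file and offset (the block size and naming scheme are made up for illustration):

#include <cstdint>
#include <cstdio>
#include <string>

constexpr uint64_t PAGES_PER_BLOCK = 256;  // fixed number of pages per file

struct PageLocation {
    std::string file;    // which block file on disk holds the page
    uint64_t    offset;  // byte offset of the page inside that file
};

// Updating a record in the middle of the table touches only one such file,
// which is much smaller than the table and usually of limited size.
PageLocation locate(uint64_t pageNo, uint64_t pageSize,
                    const std::string& tableDir) {
    uint64_t block = pageNo / PAGES_PER_BLOCK;  // file number, unlimited
    uint64_t local = pageNo % PAGES_PER_BLOCK;  // page index within the file
    char name[32];
    std::snprintf(name, sizeof(name), "block_%08llu.dat",
                  static_cast<unsigned long long>(block));
    return { tableDir + "/" + name, local * pageSize };
}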

So much for ordinary tables. The database described above performs very well — in particular, during search it processes 35 GB chunks of texts with arbitrary selection.

But it has its limits: storing the match lists in such a table — for each word, the list of documents in which that word was found (with additional information) — is practically impossible, because each word occurs in, well, a lot of documents, and so the volume would be huge.

So what operations have to be performed on such a database:

  • sequential reading of the list for the desired word, from its beginning and for as long as needed
  • it should be easy to change the list of documents for a word — but here you can pull a feint and only ever insert at the end of the database

How is the index updated? Obviously, if the index is empty and we start inserting the document lists from the first word to the last, we will only ever be writing at the end of the file. Moreover, whether to write a physically separate block on disk for each word is the developer's business — in either case it is enough to remember where each block ended and its length, and to keep this in a simple list. The sequential-read procedure then becomes: seek in the file to the top of the list for the desired word and read sequentially until the list for the next word begins. One seek and the minimum required number of reads — victory. (Here I deliberately leave aside the behavior of the file system itself — its optimizations can be dealt with separately.)
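In code, that simple list of block positions and the one-seek read might look like this (a sketch; the posting format, plain uint32_t document IDs, is an assumption):

#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <vector>

struct WordSpan {
    uint64_t offset;  // where this word's block starts in the index file
    uint64_t length;  // length of the block in bytes
};

// spans[wordId] is remembered while the index is written out word by word.
// Reading a word's documents costs one seek plus purely sequential reads.
std::vector<uint32_t> readPostings(std::FILE* index,
                                   const std::vector<WordSpan>& spans,
                                   uint32_t wordId) {
    const WordSpan& s = spans.at(wordId);
    std::vector<uint32_t> docs(s.length / sizeof(uint32_t));
    std::fseek(index, long(s.offset), SEEK_SET);             // the single seek
    if (std::fread(docs.data(), sizeof(uint32_t), docs.size(), index)
            != docs.size())
        throw std::runtime_error("truncated posting list");
    return docs;                                             // sequential read only
}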
Now, obviously, when we want to add information about newly indexed pages to the index, we can store the new information separately, open the current index for reading and a freshly created one for writing, and then read and write sequentially, merging in the information to be added, from the first word to the last. The question of the order in which documents are placed within a list I will discuss later, when I talk about building the index.
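A sketch of that merge, reusing WordSpan and readPostings from the previous snippet (the map of freshly indexed documents per word is a stand-in for whatever structure the indexer actually produces):

#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// The old index is read sequentially, the new index is written sequentially;
// the freshly indexed documents are spliced in word by word, so all disk I/O
// stays sequential and the new file is only ever appended to.
void mergeIndex(std::FILE* oldIdx, std::FILE* newIdx,
                std::vector<WordSpan>& spans,  // updated to the new offsets
                const std::map<uint32_t, std::vector<uint32_t>>& fresh) {
    uint64_t outPos = 0;
    for (uint32_t w = 0; w < spans.size(); ++w) {
        // 1. read the existing postings for word w from the current index
        std::vector<uint32_t> docs = readPostings(oldIdx, spans, w);
        // 2. append the new documents for this word, if there are any
        auto it = fresh.find(w);
        if (it != fresh.end())
            docs.insert(docs.end(), it->second.begin(), it->second.end());
        // 3. write sequentially to the end of the new index, remember where
        std::fwrite(docs.data(), sizeof(uint32_t), docs.size(), newIdx);
        spans[w] = { outPos, docs.size() * sizeof(uint32_t) };
        outPos += docs.size() * sizeof(uint32_t);
    }
}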

The full table of contents and the list of my articles about the search engine will be kept up to date here: http://habrahabr.ru/blogs/search_engines/123671/