CHAPTER-3, CONT’D ‘STORAGE AND RETRIEVAL SYSTEM BY JAVED ASHRAF- 1970′
CHAPTER THREE
COLLECTION OF DATA
3.1 Disc File Description for Programming
The first bucket of the disc file consists of the File Directory. This directory provides the following information:-
(a) the number of the bucket where the first non-processed document is stored,
(b) the total number of buckets used for processed and non-processed documents. These buckets need not all be together in the disc file,
(c) the number of the bucket available to hold keywords in a case when a bucket of the first section of the Thesaurus overflows. It is actually the first unused bucket of the second section of the Thesaurus,
(d) the number of non-processed documents in the system,
(e) the address of the first available space for storing entry lists.
Each of the above items of information is stored in one computer word starting from the first word of the file directory bucket.
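The directory layout just described can be sketched in Python as follows. This is only an illustration: the field names, and the representation of a bucket as a list of integer words, are assumptions made here and do not come from the original PLAN programs.

```python
from typing import NamedTuple, Sequence


class FileDirectory(NamedTuple):
    """The five directory items, one per computer word (names are hypothetical)."""
    first_unprocessed_bucket: int   # item (a)
    total_document_buckets: int     # item (b)
    next_overflow_bucket: int       # item (c)
    unprocessed_documents: int      # item (d)
    next_entry_list_address: int    # item (e)


def read_file_directory(bucket: Sequence[int]) -> FileDirectory:
    """Interpret the first five words of the directory bucket."""
    if len(bucket) < 5:
        raise ValueError("directory bucket must hold at least five words")
    return FileDirectory(*bucket[:5])


# Made-up word values, for illustration only.
directory = read_file_directory([257, 300, 452, 12, 30000, 0, 0])
print(directory.next_overflow_bucket)
```

Reading the items by position rather than by name mirrors the text: each item occupies one word, counted from the first word of the bucket.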
As mentioned in section 2.2, the document record descriptions are stored in three parts. The first of these is a fixed-length directory of the document description, contained in the first ten words of the document description. Each word can be considered as consisting of two equal parts, each 12 bits long. The second 12 bits of the seventh word are, as shown in Diagram 1, used for holding the number of buckets required for that document record description. This half-word is of interest to us and will be used in an algorithm to work out the number of buckets, the serial number of the last bucket and the number of computer words used for the document description.
DIAGRAM 1
FIXED DIRECTORY OF EACH DOCUMENT
Word No. | First 12 bits              | Second 12 bits
1        | Length of Authors          | Starting Address of Authors
2        | Length of Title            | Starting Address of Title
3        | Length of Source           | Starting Address of Source
4        | Length of Descriptions     | Starting Address of Descriptions
5        | Length of Chapter Headings | Starting Address of Chapters
6        | Length of Chosen Keywords  | Starting Address of these Words
7        | BLANK                      | No. of Buckets used by the Document
8        | BLANK                      | No. of Chosen Keywords in Book
9        | Code reference Book        | Code continued
10       | Code continued             | Code continued
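As an illustration of how the bucket count might be extracted from the fixed directory, here is a minimal Python sketch. It assumes that the "first" 12 bits are the more significant half of a 24-bit word, and that the buckets of one document are consecutive. Neither assumption is stated in the text, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

HALF_BITS = 12
HALF_MASK = (1 << HALF_BITS) - 1   # twelve one-bits


def split_word(word: int) -> Tuple[int, int]:
    """Return (first half, second half) of a 24-bit word."""
    if not 0 <= word < (1 << (2 * HALF_BITS)):
        raise ValueError("expected a 24-bit word")
    return word >> HALF_BITS, word & HALF_MASK


def buckets_used(fixed_directory: Sequence[int]) -> int:
    """Second half of the seventh word: number of buckets used by the document."""
    _, second_half = split_word(fixed_directory[6])
    return second_half


def last_bucket(first_bucket: int, fixed_directory: Sequence[int]) -> int:
    """Serial number of the last bucket, assuming consecutive buckets."""
    return first_bucket + buckets_used(fixed_directory) - 1


# A ten-word directory whose seventh word records three buckets.
example = [0] * 10
example[6] = 3
print(buckets_used(example), last_bucket(17, example))   # prints "3 19"
```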
As stated in section 2.2, the third part of the disc file is the Thesaurus. This part uses a fixed number of buckets and is divided into two sections. The first section is where keywords land after hash indexing, and where they are stored provided the bucket they hit is not overflowing. This section uses 119 consecutive buckets. The second section holds the keywords of overflowing buckets and uses 40 consecutive buckets. With each keyword is stored the link address of the first element of its associated entry list.
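The two-section arrangement can be illustrated with a rough Python sketch. The chapter gives neither the hash function nor the capacity of a bucket, nor does it say how an overflowing bucket is tied to its overflow bucket. The hash, BUCKET_CAPACITY and the overflow_of table below are therefore placeholders chosen only to make the example run; overflow buckets are also treated as unbounded for simplicity.

```python
PRIMARY_FIRST, PRIMARY_COUNT = 321, 119    # first section (from the text)
OVERFLOW_FIRST, OVERFLOW_COUNT = 440, 40   # second section (from the text)
BUCKET_CAPACITY = 4                        # assumption: not given in the text


def home_bucket(keyword: str) -> int:
    """Placeholder hash; the real hash function is not specified."""
    return PRIMARY_FIRST + sum(keyword.encode("ascii")) % PRIMARY_COUNT


class Thesaurus:
    def __init__(self):
        self.buckets = {}       # bucket number -> {keyword: link address}
        self.overflow_of = {}   # home bucket -> overflow bucket (hypothetical)
        self.next_overflow = OVERFLOW_FIRST   # role of directory item (c)

    def insert(self, keyword, link_address):
        """Store a keyword and return the bucket it was placed in."""
        home = home_bucket(keyword)
        bucket = self.buckets.setdefault(home, {})
        if keyword in bucket or len(bucket) < BUCKET_CAPACITY:
            bucket[keyword] = link_address
            return home
        # The home bucket overflows: use a bucket of the second section.
        if home not in self.overflow_of:
            if self.next_overflow >= OVERFLOW_FIRST + OVERFLOW_COUNT:
                raise RuntimeError("second section of the Thesaurus is full")
            self.overflow_of[home] = self.next_overflow
            self.next_overflow += 1
        spill = self.overflow_of[home]
        self.buckets.setdefault(spill, {})[keyword] = link_address
        return spill

    def lookup(self, keyword):
        """Return the stored link address, or None if the word is absent."""
        home = home_bucket(keyword)
        for number in (home, self.overflow_of.get(home)):
            if number is not None and keyword in self.buckets.get(number, {}):
                return self.buckets[number][keyword]
        return None


thesaurus = Thesaurus()
# Anagrams collide under the placeholder hash, which forces an overflow.
placed = [thesaurus.insert(w, i) for i, w in
          enumerate(["ABCDE", "BACDE", "CABDE", "DABCE", "EABCD"], start=100)]
print(placed)
```

With a capacity of four, the fifth colliding keyword is placed in the first bucket of the second section, which is the behaviour the text describes for overflowing buckets.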
No basic parameters are involved in exploring the fourth part of the disc file, i.e. the associated entry list, except the length of an element of the list. As mentioned in section 2.2, the first element consists of two entries, and thereafter the length of each element is doubled, subject to a maximum of 63. This list can be accessed only through the Thesaurus. Diagram 2 illustrates the constitution of the elements of an entry list chain. The elements of a chain need not be adjacent, or even in the same bucket.
DIAGRAM 2
CHAIN STRUCTURE OF THE ASSOCIATED ENTRY LIST
FIRST ELEMENT: A set of five words in a bucket,
| 2nd LINK ADDRESS | BUCKET NUMBER | WORD NUMBER | BUCKET NUMBER | WORD NUMBER |
|                  |        1st Reference        |        2nd Reference        |
SECOND ELEMENT: A set of nine words in a bucket,
| 3rd LINK ADDRESS | BUCKET NUMBER | WORD NUMBER | BUCKET NUMBER | WORD NUMBER | BUCKET NUMBER | WORD NUMBER | BUCKET NUMBER | WORD NUMBER |
|                  |        3rd Reference        |        4th Reference        |        5th Reference        |        6th Reference        |
and so on. The 2nd link address provides the link for the 2nd element and the 3rd link address is the link to the 3rd element of the list and so on.
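The chain structure can be sketched in Python as below. The sketch assumes that the maximum of 63 counts entries rather than words, and that the end of a chain is marked by an empty link (represented here as None). Both are assumptions, since the chapter does not settle either point. The store dictionary stands in for the disc file.

```python
MAX_ENTRIES = 63   # assumption: the cap of 63 counts entries, not words


def element_capacity(position: int) -> int:
    """Entries held by the element at a 1-based position in the chain."""
    if position < 1:
        raise ValueError("positions are counted from 1")
    return min(2 ** position, MAX_ENTRIES)


def element_words(position: int) -> int:
    """One link-address word plus two words (bucket, word) per entry."""
    return 1 + 2 * element_capacity(position)


def collect_references(store, first_address):
    """Follow the link addresses and gather every (bucket, word) reference.

    `store` maps a hypothetical element address to a pair
    (next link address or None, list of references).
    """
    references = []
    address = first_address
    while address is not None:
        next_link, entries = store[address]
        references.extend(entries)
        address = next_link
    return references


# A two-element chain whose elements sit at unrelated addresses.
store = {
    "A": ("B", [(12, 40), (15, 8)]),
    "B": (None, [(20, 3), (21, 77), (30, 5)]),
}
print(element_words(1), element_words(2))   # prints "5 9"
print(collect_references(store, "A"))
```

The word counts of 5 and 9 for the first two elements agree with Diagram 2.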
In addition to keywords, the Thesaurus also contains non-significant words, although these do not have a link address. This mixed dictionary of keywords and non-significant words makes possible the construction of an algorithm which, when documents are being processed, can deal with keywords and non-significant words that have already been introduced to the system. This is the basis of the automatic processing of documents in the system.
Diagram 3 shows the overall layout of all the parts of the system on the disc file. The programmes which have been built to collect the desired statistical data are based on this data structure. The present data structure is likely to be reorganized so that more economic use can be made of the available storage. However, the basic parts of the system will remain the same. The Disc File uses 16 cylinders, consisting of 1280 single-bucket blocks, on an exchangeable disc.
DIAGRAM 3
DATA STRUCTURE ON THE DISC FILE
BUCKET NUMBERS | CONTENTS
1              | FILE DIRECTORY
2 to 320       | DOCUMENTS (BOOKS) RECORD DESCRIPTION (Cont’d on No. 896)
321 to 439     | THESAURUS, FIRST SECTION
440 to 480     | THESAURUS, SECOND SECTION
481 to 895     | ASSOCIATED ENTRY LISTS
896 to 1040    | DOCUMENT BOOK DESCRIPTION
1041 to 1280   | UNUSED
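The layout of Diagram 3 can be expressed as a small lookup table. The Python sketch below is illustrative only. It takes the continuation of the document descriptions to begin at bucket 896, as the "Cont’d on No. 896" note in the diagram indicates, and the region names are shortened for readability.

```python
# (first bucket, last bucket, contents), following Diagram 3.
REGIONS = [
    (1, 1, "file directory"),
    (2, 320, "document record descriptions"),
    (321, 439, "thesaurus, first section"),
    (440, 480, "thesaurus, second section"),
    (481, 895, "associated entry lists"),
    (896, 1040, "document record descriptions (continued)"),
    (1041, 1280, "unused"),
]


def region_of(bucket: int) -> str:
    """Name the part of the disc file that a bucket number falls in."""
    for first, last, name in REGIONS:
        if first <= bucket <= last:
            return name
    raise ValueError("bucket number outside the disc file: %d" % bucket)


def layout_is_contiguous() -> bool:
    """Check that the regions cover buckets 1 to 1280 with no gap or overlap."""
    expected = 1
    for first, last, _ in REGIONS:
        if first != expected or last < first:
            return False
        expected = last + 1
    return expected == 1281


print(region_of(900))
print(layout_is_contiguous())
```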
3.2 Programmes
The first phase of programming is to obtain the basic parameters of the data structure on the disc file, from which statistical information about any stage of the system can be derived. Our main concern is to know the distribution at the stages when the system contained fewer processed document description records than it does now. At present the system contains 256 processed documents. In the later programs, data were obtained for the stage at which the system contained 40 processed documents, and for the stages reached as the number of processed documents is increased in regular intervals of 20.
The program which computes the number of computer words, the number of buckets and the last bucket number used by the document description records is given, coded in PLAN, in Appendix 1 along with its detailed flow chart. Its general flow chart is shown overleaf. The data obtained are presented in Appendix 1A. Appendices 1B and 1C give the distribution tables.
To obtain the rest of the required data, the disc file has to be accessed through the Thesaurus. Further scanning and manipulation are then possible by using the references found there. The keywords and their entries which existed in the system at any stage can be found by comparing the bucket numbers in the entries, which refer to the buckets used by the document descriptions, with the number of the last bucket used for storing the last document description of that stage. We now describe some of the programs which were built using this approach. The general flow charts of these programs are given later in this chapter.
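The comparison described above can be sketched in Python as follows. The sketch assumes that the bucket numbers of document descriptions increase in the order in which documents were added, so that a single comparison against the last bucket of a stage is enough. The function names and the sample data are hypothetical.

```python
def entries_at_stage(entries, last_bucket):
    """Keep the (bucket, word) references that already existed at a stage."""
    return [(bucket, word) for bucket, word in entries if bucket <= last_bucket]


def keywords_at_stage(entry_lists, last_bucket):
    """Keywords with at least one entry at the stage, with those entries."""
    result = {}
    for keyword, entries in entry_lists.items():
        kept = entries_at_stage(entries, last_bucket)
        if kept:
            result[keyword] = kept
    return result


# Made-up entry lists, as they might be gathered through the Thesaurus.
entry_lists = {
    "DISC": [(5, 10), (40, 3)],
    "HASH": [(60, 1)],
}
print(keywords_at_stage(entry_lists, 45))   # prints {'DISC': [(5, 10), (40, 3)]}
```

A keyword whose entries all refer to later buckets drops out of the result, which corresponds to a keyword that had not yet entered the system at that stage.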
GENERAL FLOW CHART OF PROGRAM IN APPENDIX 1
To compute the number of words, the number of buckets and the last bucket number used, per document