Thursday, 23 February 2017

CHAPTER-2, CONT’D ‘STORAGE AND RETRIEVAL SYSTEM BY JAVED ASHRAF- 1970′ Posted on December 5, 2012


CHAPTER-2, CONT’D ‘STORAGE AND RETRIEVAL SYSTEM BY  JAVED ASHRAF- 1970′

CHAPTER-2, CONT’D ‘STORAGE AND RETRIEVAL SYSTEM BY JAVED ASHRAF- 1970′


CHAPTER TWO

DESCRIPTION OF PILOT PROJECT
2.1   Introduction
The existing pilot scheme is similar to the “SPIRES” systems installed at Stanford (Parker, 1968; Veaner and Fasana, 1968), the “ORIBT” system developed at Systems Development Corporation, Santo Monica (Nance end Lathrop, 1968) and to the system developed at the Moore School Laboratory, Philadelphia (Rubinoff, 1968),   It is most like the SPIRES system.  The input media is similar to N.I.C. (nineteen hundred indexing and cataloguing, 1967), a software package for the 1900 series by I.C.L.
Walston (1965) has pointed out that indexing provides the key to retrieval efficiency.  Thus, although one might be ambitious to include a large number of keywords to describe a document, this may endanger the economies of the system without making significant additions to the fruitfulness of the responses.    For this reason we should be careful when deciding on a method for choosing keyword fields and when we are choosing a strategy on how to pick them.  The system under investigation uses titles and chapter headings of documents, to pick out all those words which can be considered relevant or significant for retrievals. These are called keywords.     The total number of keywords for a limited subject is itself limited.  These keywords are selected as they appear for the first time during processing the list of documents.
Synonyms of a keyword are treated equally to that keyword in the system; they are dealt with by giving to them the same entry list as the keyword itself. The operator and the user are allowed to suggest suitable synonyms.  A separate synonym dictionary has yet to be developed which would be used in updating the system and then retrieving.    We only store stems in the thesaurus provided they have at least four letters end at the moat eight letters. Words with smaller stems are stored whole and their different forms stored as synonyms.
In a sum bibliographic information system the quick response of queries about documents can only be obtained by allowing ultimate referencing from an invested file.  The main advantage of an inverted file is the tremendous economy when making searches. A matrix gives in Table 1 reveals the improved achievements of an inverted file ever a serial file. The inverted file is organized by means of keywords which have appeared in the documents.  The appearance of a keyword is a document description indicates the need of its retrieval with respect to that keyword, if the keyword is given in a search request.

A MATRIX FOR THE FIRST 20 DOCUMENTS AND A FEW SELECTED KEYWORDS.  (PRESENCE IS REPRESENTED BY 1 AND ABSENCE BY 0)
SERIAL FILE
DOCUMENTS
KEYWORDS
STEMS
1234567891011121314151617181920
DEGREE00010000000001000000
AUTOMAT10010110000000010010
INVERTED FILE
PATTERN
10000010010000000001
MANIPULA00000000100100100000
DIFFER11000000000011000000
MACHINE00010100001000100000
INPUT10000000000000011000
LANGUAGE00000100000000110100
ALGEBRA11000000010001000000
INTRODUC10101111000000101010
SYSTEM11101011011100000001
PURPOSE01010001000001000000
PROBLEM00000101000100000000
OPERAT10001000010010000010
PROGRAM00001011000100111101
DESIGN10000010010000000001
THEOR00000010010010000010
BINARY00100000001100000000
DIGIT11000010010100100000
ARITHMET10100001011100110000
2.2       Data Structure gad Storing
            All data is organized on a disk file of 16 cylinders. A cylinder consists of 80 blocks, each block containing a single 128 word bucket. The file is organized in the following parts:-
1.         The File Directory.
2.         The Document Records.
3.         The Thesaurus consists of keywords.
4.         Associated inverted files.
The first bucket on the disk file holds a general record of the data file.  From the second bucket onwards the document record description is stored. Every new document starts from a fresh bucket.
These records are constituted by three separate parts which are as follows:-
(a)       a fixed length directory.
(b)       a variable length directory.
(c)       variable length data fields.
The third part of the disc files, the Thesaurus, engages a fixed number of buckets for its first section.  It is used to store the keywords by means of hash indexing.  The number of buckets for this part should essentially be prime.  The second section of the Thesaurus is used for overflowing from the first section, i.e. if a keyword hits a bucket of the first section to be stored there and the bucket has already been exhausted that would be regarded as the start of overflowing for that bucket.  Each bucket of the Thesaurus has a capacity of holding a maximum of 40 keywords. The first 8 computer words out of 128 computer words of a bucket are reserved to hold link addresses of the buckets, which share its keywords after overflowing.    The remaining 120 computer words of the bucket are treated as groups of three consecutive words. The first two words are used to store the keyword or its stem.  The third word provides the initial address of the Associated entry list which is the associated inverted file for a keyword.
Storing the Thesaurus
Higgins and Smith (1970) present a detailed account of hash Indexing. When a bucket of overflows, it is dumped back into the core store.  All these keywords and the new keywords hitting this bucket are rehashed for the prime 3.   Those keywords leaving remainder 0 are stored in the same bucket and the keywords whose remainders are 1 and 2 are stored in two separate buckets of the second section of the Thesaurus, leaving link addresses in the overflowed bucket.  This is known as rehashing.  At the moment the hash indexing algorithm consists of a stem of the first four characters of the keywords.    For the computation of the bucket number in which it is to be stored, the algorithm regards the decimal representation of the string of the four characters which is divided by the number constituting the first part of the Thesaurus buckets.  The remainder thus obtained, is the desired relative address.
The fourth part of the system consists of the Associated entry lists. For each keyword there is a list of references to the documents, in which that keyword appears in a title or chapter heading. This list is stored in a chain structure, the first element of each chain holding two entries (references), the second four entries.  An element can hold up to a maximum of 63 entries. This exponential   chain system enables the storage allocated to a particular list to be determined by this frequency of the keyword use.
The first word of each element stores the link address or the next clement of that chain, if any.  Each entry is stored in a pair of words; the first stores the number of the first bucket in which the record is stored while the second word gives the starting character location, within the document description, of the phrase from which the keyword was taken. The maximum number of entries 63 in an element of the chain is due to the bucket size limitation.
The general layout of the system on the disc file is arbitrary since the apace allocated for each part of the system varies with the number of documents.  This flexibility is necessary to keep the updating and maintenance sample.
To update the system document descriptions, which have been stored on the disc in an off line run, are processed by the system.  The words are read one by one from the document description and the keywords are separated from the non-significant words.  These keywords then arranged in alphabetic order using a library routine called FLSORT (Hoare, 1962) and stored in the Thesaurus using the hash indexing routine.   When the word appears for the first time, the system asks the operator on the console if it is significant or note, and deals with it accordingly.     The system makes an additional entry list for old keywords; in case of a new keywords it creates a new associated entry list.
The retrieval of information is, with respect to a keyword or keywords given by the user, in the course of a simple conversation with the computer on a console connected to the computer.  At every step a user is provided with a set of queries and by making the appropriate response he indicates to the computer which query he would like to put to it.     Although the system is search orientated, batch processing retrievals was also be carried out.
Search strategy of the system which responds to the various types of queries remains to be described.   The data file is the source of the responses to search requests and the reliability and efficiency of a query response is determined by the data structure.
2.3   The Search Strategy
As mentioned in the previous section the retrieval of a document by a keyword will occur if the keyword appears in the document description. The search strategy can be described as follows.
The documents and the few selected keywords, respectively, could be considered to be arranged sequentially over the columns and the rows of a matrix.    The matrix as shown in Table 1 can be framed consisting of elements of either I or 0, representing the presence or the absence of a keyword in a document description respectively.   The matrix, so formed, can be used to describe the way in which the system response to search requests, which are essentially the manipulations of the row vectors.
The possible search requests include simple, Boolean, linear, statistical and hierarchical searches.  Their responses are the direct outcome of matching of the proposed keywords with the document description. The outcome consists of a vector and the console prints out only information about the documents which correspond to the elements with value 1.  One can regard the elements as the coefficients of association of the keywords with documents, the coefficient values in the present system being 0 and 1.     There is a scope for generation at this stage, in more heterogeneous circumstances, to what Salton (1968) calls a “higher level scheme”. This system is, in Salton’s notation, a scheme of “level one”.
2.4  Hardware Description
The retrieval software is developed, within the limitations of an operations system called M.C.S. available at Queen’s University. M.C.S. is an on-line and time sharing system.  It has a fixed allocation of core store with total access to a number of on-line consoles and shared access to disc and drum storage.     These facilities are available with an I.C.L, 1907 computer, a high speed line printer and a card reader, paper tape peripherals, three exchangeable disc storage units, five Tope decks, one graph plotter and twenty remote control tele-type consoles which are linked to the central processor through a multiplexer.
2.5  Data Available

The disc file does not contain any record of statistical information about the system.  An evaluation study requires observations of the way in which various parts of the system change with increasing number of documents.  In particular information on the storage requirement and its effect on retrieval efficiency is required.
The major problem involved in obtaining the required data, was the determination of the way in which the statistical distributions of various quantities in the file varied as the number of documents increased.     These distributions included:-
(a)       the distribution of the number of computer words used by the document descriptions,
(b)       the distribution of the keywords in the buckets which contained the Thesaurus,
(c)       the distribution of the associated entry list lengths,
(d)       the distribution of the keywords per document,
(e)       the distribution of the new keywords introduced to the system,
(f)        the distribution of the number of entries per document.
To extract out the data for the desired distributions the prior knowledge of the data structure is essential. This includes:-
(a)       the exact data structure on the disc file, and the distribution within parts  of the disc file such as the serial number of last buckets used by each document description,
(b)       a knowledge of basic parameters involved, such as hash indexing using the prime number 119 to calculate relative bucket addresses to store and retrieve a keyword in the first section of the Thesaurus.
The next chapter describes the programs and the methods used to collect the data.  This data will be used in the analysis establishing the final conclusions.
****************
********

 MESSAGE: DEDICATED TO ANONYMOUS COMBATANTS OF KNOWLEDGE

مکتب علم الل مہد منل لحد   Learning continues from birth to death

FOR PROMOTION OF LEARNEDNESS SHARE WITH FRIENDS ABOUT;
  

  1. WHETHER ‘BRAIN-POWER’ OR ‘HEART-IMPULSE’ PLINTHS ‘HUMAN GEN’? http://sunedu.wordpress.com/2012/09/17/whether-brain-power-or-heart-impulse-plinths-human-gen/

Advertisements

No comments:

Post a Comment