Core Concepts in Data Processing and Information Retrieval
Fundamental Concepts in Knowledge and Data
- The most fundamental combiner is the unification of the Self with itself, which gives rise to the knower, the process of knowing, and the known.
- Normalization is a mathematically interesting, bottom-up technique for producing a set of relations with desirable properties from a set of functional dependencies in the data.
- A graphical technique employed by Vedic science is the unified field chart, which provides a holistic overview of a discipline and links all knowledge with the Self.
Data Compression Techniques
Variable-Length Integer Coding (VarInt)
A simple approach to compression is to use only as many bytes as necessary to represent an integer. This technique is known as variable-length integer coding (or VarInt). It is accomplished by using the high-order bit of every byte as a continuation bit. This bit is set to one in the last (lowest) byte and zero elsewhere.
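The scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `varint_encode` and `varint_decode` are hypothetical, and the sketch assumes non-negative integers written in 7-bit groups with the most significant group first, so that the flagged byte is the last (lowest) one.

```python
from typing import List


def varint_encode(n: int) -> bytes:
    """Encode a non-negative integer using 7 payload bits per byte.

    Assumption: groups are written most significant first, and the
    high-order bit is 1 only on the final (lowest) byte.
    """
    if n < 0:
        raise ValueError("varint_encode expects a non-negative integer")
    groups = [n & 0x7F]           # lowest 7 bits
    n >>= 7
    while n:
        groups.append(n & 0x7F)   # next 7 bits
        n >>= 7
    groups.reverse()              # most significant group first
    groups[-1] |= 0x80            # set the high bit on the last byte only
    return bytes(groups)


def varint_decode(data: bytes) -> List[int]:
    """Decode a byte string holding zero or more concatenated integers."""
    values = []
    current = 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:           # high bit set: this integer is complete
            values.append(current)
            current = 0
    return values


encoded = varint_encode(5) + varint_encode(300)
decoded = varint_decode(encoded)  # [5, 300]
```

Under these assumptions, 5 fits in one byte and 300 needs two, so the pair occupies three bytes instead of the eight that two fixed 32-bit integers would take.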
Principles of Data Processing Frameworks
Postings and Payloads
Statement Assessment: False
For simple Boolean retrieval, the simplest payload is nothing, as no additional information beyond the document ID is needed in the posting. The existence of the posting itself indicates the presence of the term in the document.
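As an illustration of a payload-free posting, the following sketch builds a tiny inverted index in which each posting is just a document ID. The documents, the whitespace tokenization, and the names `build_index` and `boolean_and` are all hypothetical and chosen only for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a sorted list of document IDs (no payload)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)   # the posting is the doc ID alone
    return {term: sorted(ids) for term, ids in index.items()}


def boolean_and(index, term_a, term_b):
    """Return the IDs of documents containing both terms."""
    postings_a = set(index.get(term_a, []))
    postings_b = set(index.get(term_b, []))
    return sorted(postings_a & postings_b)


# Hypothetical toy collection, not taken from the problem set.
docs = {1: "fox deer", 2: "deer rabbit", 3: "fox rabbit deer"}
index = build_index(docs)
both = boolean_and(index, "fox", "rabbit")  # [3]
```

A Boolean query only asks whether a term occurs in a document, so the presence of a document ID in a postings list is all the query needs.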
Combiner Functionality
Statement Assessment: False
A combiner is an optional part of the process, so it might not be invoked by the framework at all. If used, it could be invoked once or many times. The purpose of an in-mapper combiner is to force its execution within the mapper, reducing the number of key-value pairs that need to be shipped across the network (for each key and all its associated values).
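The difference can be sketched outside any real framework. The code below is a plain-Python simulation of word counting, assuming a mapper object with a `map` method called once per input pair and a `close` method called once at the end of the input split. The class and function names are hypothetical and do not correspond to a specific framework API.

```python
from collections import defaultdict


def naive_map(key, value):
    """Plain mapper: emits one (word, 1) pair per token."""
    for word in value.split():
        yield word, 1


class InMapperCombiningMapper:
    """Mapper that aggregates counts in memory before emitting anything."""

    def __init__(self):
        # State preserved across input key-value pairs.
        self.counts = defaultdict(int)

    def map(self, key, value):
        for word in value.split():
            self.counts[word] += 1    # aggregate locally, emit nothing yet

    def close(self):
        # Emit one pair per distinct word once the split is exhausted.
        for word, count in self.counts.items():
            yield word, count


# Hypothetical input split: (offset, line) pairs.
split = [(0, "a b a"), (1, "b a")]

naive_pairs = [pair for k, v in split for pair in naive_map(k, v)]

mapper = InMapperCombiningMapper()
for k, v in split:
    mapper.map(k, v)
combined_pairs = list(mapper.close())
```

In this toy run the plain mapper emits five pairs while the in-mapper version emits two, one per distinct word. The dictionary held in `self.counts` is also the state that must fit in memory for the whole split.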
In-Mapper Combiner Drawbacks
Statement Assessment: True
- It breaks functional programming rules, as state is preserved across multiple input key-value pairs.
- It can become a scalability bottleneck, as it depends on having sufficient memory to store intermediate results until the mapper has processed all key-value pairs in an input split.
Data Analysis and Frequency Calculation Examples
Problem Set Answers
- a) 2 postings are identified (covering items 1, and then 2-9).
- b) Selected items: b, d, e, f.
Total Count Calculations
- (Iowa, *) = Σw’ N(Iowa, w’) = 9+4+25+7+6+10+11+5+3 = 80
- (Nebraska, *) = Σw’ N(Nebraska, w’) = 16+5+33+5+12+7+10+14+8 = 110
- (Florida, *): No answer provided.
- (Illinois, *) = Σw’ N(Illinois, w’) = 12+8+9+8+2+1 = 40
Relative Frequency Calculations
Below are the calculations for relative frequency:
- f(fox | Iowa) = (6+10) / 80 = 0.2
- f(deer | Illinois) = 8 / 40 = 0.2
- f(rabbit | Nebraska) = (10+14) / 110 ≈ 0.22
- f(cat | Illinois) = (8+2) / 40 = 0.25
- f(mouse | Iowa): N/A, as 'mouse' is not associated with 'Iowa' in the dataset.
- (Illinois, [dog, fox]): Incomplete.
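The arithmetic above can be checked with a short script. It assumes that the relative frequency is the joint count divided by the marginal total, f(B | A) = N(A, B) / N(A, *), which is how the listed answers are computed. The variable and function names are hypothetical, and the underlying dataset is not reproduced here, so only the counts that appear in the sums above are used.

```python
# Counts summed in the totals above, one list per left-hand key.
count_terms = {
    "Iowa": [9, 4, 25, 7, 6, 10, 11, 5, 3],
    "Nebraska": [16, 5, 33, 5, 12, 7, 10, 14, 8],
    "Illinois": [12, 8, 9, 8, 2, 1],
}

# Marginal N(A, *) is the sum of all counts that share the key A.
marginals = {key: sum(terms) for key, terms in count_terms.items()}


def relative_frequency(joint_counts, marginal):
    """f(B | A) = N(A, B) / N(A, *), with N(A, B) given as partial counts."""
    return sum(joint_counts) / marginal


f_fox_iowa = relative_frequency([6, 10], marginals["Iowa"])
f_deer_illinois = relative_frequency([8], marginals["Illinois"])
f_rabbit_nebraska = relative_frequency([10, 14], marginals["Nebraska"])
f_cat_illinois = relative_frequency([8, 2], marginals["Illinois"])

for name, value in [
    ("f(fox | Iowa)", f_fox_iowa),
    ("f(deer | Illinois)", f_deer_illinois),
    ("f(rabbit | Nebraska)", f_rabbit_nebraska),
    ("f(cat | Illinois)", f_cat_illinois),
]:
    print(f"{name} = {value:.2f}")
```

The marginal has to be known before any relative frequency for that key can be computed, which is why the totals are worked out first.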