Information Retrieval

Objective: This test computes a set of information retrieval (IR) metrics that measure how effectively a search algorithm retrieves and ranks relevant documents.

Required Parameters: prompt, context

Result Interpretation: The scores reflect the search system's ability to identify and rank relevant documents effectively. Higher scores indicate better performance.

Types of measures in the IR Metrics Test (a computation sketch for several of these follows the list):

  1. Accuracy: Reports the probability that a relevant document is ranked before a non-relevant one.

  2. AP (Average Precision): The mean of the precision scores at each relevant item returned in a search results list.

  3. BPM (Bejeweled Player Model): A measure for evaluating web search using a player-based model.

  4. Bpref (Binary Preference): Examines the relative ranks of judged relevant and non-relevant documents.

  5. Compat (Compatibility measure): Assesses top-k preferences in a ranking.

  6. infAP (Inferred AP): AP implementation that accounts for pooled-but-unjudged documents by assuming that they are relevant at the same proportion as other judged documents.

  7. INSQ: A measure for IR evaluation as a user process.

  8. INST: A variant of INSQ.

  9. IPrec (Interpolated Precision): Precision at a given recall cutoff used for precision-recall graphs.

  10. Judged: Percentage of top results with relevance judgments.

  11. nDCG (Normalized Discounted Cumulative Gain): Evaluates ranked lists with graded relevance labels.

  12. NERR10 (Not (but Nearly) Expected Reciprocal Rank): Version 10 of the NERR measure.

  13. NERR11 (Not (but Nearly) Expected Reciprocal Rank): Version 11 of the NERR measure.

  14. NERR8 (Not (but Nearly) Expected Reciprocal Rank): Version 8 of the NERR measure.

  15. NERR9 (Not (but Nearly) Expected Reciprocal Rank): Version 9 of the NERR measure.

  16. NumQ (Number of Queries): Total number of queries.

  17. NumRel (Number of Relevant Documents): Number of relevant documents for a query.

  18. NumRet (Number of Retrieved Documents): Number of documents returned.

  19. P (Precision): Percentage of relevant documents in the top results.

  20. R (Recall): Fraction of relevant documents retrieved.

  21. Rprec (Precision at R): Precision at rank R, where R is the number of relevant documents for the query.

  22. SDCG (Scaled Discounted Cumulative Gain): A variant of nDCG accounting for unjudged documents.

  23. SETAP: The unranked Set AP (SetAP); i.e., SetP * SetR.

  24. SETF: The Set F measure (SetF); i.e., the harmonic mean of SetP and SetR.

  25. SetP: The Set Precision (SetP); i.e., the number of relevant docs divided by the total number retrieved.

  26. SetR: The Set Recall (SetR); i.e., the number of relevant docs divided by the total number of relevant documents.

  27. Success: Indicates if a relevant document is found in the top results.
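
A minimal sketch of how several of the measures above (P, R, AP, nDCG, Success) are computed for a single query. This is illustrative only and is not part of the library; it assumes a ranked list of document IDs with binary relevance judgments (graded judgments for nDCG).

import math

def precision_at_k(ranking, relevant, k):
    # P@k: fraction of the top-k retrieved documents that are relevant
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def recall(ranking, relevant):
    # R: fraction of all relevant documents that were retrieved
    return sum(1 for doc in ranking if doc in relevant) / len(relevant)

def average_precision(ranking, relevant):
    # AP: sum of P@k at each rank k where a relevant document appears,
    # divided by the total number of relevant documents
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(ranking, gains, k):
    # nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking
    def dcg(gain_list):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gain_list, start=1))
    actual = dcg([gains.get(doc, 0) for doc in ranking[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

def success_at_k(ranking, relevant, k):
    # Success@k: 1 if at least one relevant document appears in the top k
    return int(any(doc in relevant for doc in ranking[:k]))

ranking = ["d3", "d1", "d7", "d2"]      # system output, best first
relevant = {"d1", "d2", "d5"}           # binary relevance judgments
gains = {"d1": 2, "d2": 1, "d5": 3}     # graded judgments for nDCG

print(precision_at_k(ranking, relevant, 4))   # 0.5
print(recall(ranking, relevant))              # 0.666...
print(average_precision(ranking, relevant))   # (1/2 + 2/4) / 3 = 0.333...
print(ndcg_at_k(ranking, gains, 4))           # ~0.355
print(success_at_k(ranking, relevant, 4))     # 1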

Example using the "Success" metric:

evaluator.add_test(
    test_names=["ir_metrics_test"],
    data={
        "prompt": "What is the capital of France?",  # the search query
        "context": ["London is a city.", "Mumbai is the capital of Maharashtra."],  # retrieved documents to judge
    },
    arguments={"metric_name": "Success", "cutoff": 4, "max_rel": 7},
).run()

evaluator.print_results()
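
The same test can target any of the measures listed above by changing metric_name. A minimal sketch for nDCG, assuming the same evaluator instance and that the metric name matches the label used in the list:

evaluator.add_test(
    test_names=["ir_metrics_test"],
    data={
        "prompt": "What is the capital of France?",
        "context": ["Paris is the capital of France.", "London is a city."],
    },
    arguments={"metric_name": "nDCG", "cutoff": 4, "max_rel": 7},
).run()

evaluator.print_results()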
