Leveraging AI for Index Recommendations and Regular Path Queries

Classified in Computers

Written on December 29, 2023 in English with a size of 4.45 KB

AI Meets AI: Leveraging Query Executions to Improve Index Recommendations

How artificial intelligence (AI) can benefit automated indexing (AI)?: Comparing the execution cost of two plans (different index configurations of same query) is key for index tuning. Instead of using optimizer’s estimates for such comparison, formulating it as a classification task in machine learning results in significantly higher accuracy.

MOTIVATION:

Being able to fully automate index recommendation and implementation is a significant value-add for improving query execution cost.
Requirement: creating or dropping indexes should not cause significant query performance regressions - users enforce a no query regression constraint.
Using the optimizer’s estimates to enforce the constraint can result in significant errors.

THE APPROACH: Use execution history to train an ML model to predict execution cost by formulating the problem as a regression task. The index tuner can then use the ML model’s predicted cost instead of the query optimizer’s estimated cost.

LEARNING THE CLASSIFIER: Offline model: Once the query plan pairs have been featurized, use off-the-shelf ML packages to train the classifier. The simplest approach is to use linear or tree models to train one offline model. Used models:

Logistic Regression (simplicity and training speed)
Random Forest
Gradient-boosted Trees
Light GBM with gradient-boosted decision trees.

EXPERIMENTAL EVALUATIONS:

Regression vs. classification: 2× −5× reduction in fraction of errors for picking the cheaper one between a pair of plans when using the classifier compared to using other learned or analytical models to predict cost or cost ratio. Offline model: RF-based models outperform others in accuracy and training efficiency.

Efficiently Answering Regular Simple Path Queries on Large Labeled Networks

Regular Path Queries (RPQ): A fundamental query in labeled graphs is to determine if there exists a path from a given source vertex to a specified target vertex, such that the labels encountered along the path form a string that is from a given regular language. Specified with a starting node, a destination node and a regular expression over the node and edge labels.

LIMITATIONS ON EXISTING TECHNIQUES FOR RSP QUERIES:

They handle a very small subclass of regular expressions
Costs grow exponentially with number of labels
Unable to scale to networks containing more than a handful of labels
Do not support query-time label definition
They assume that the label set space is known
Unable to handle dynamic networks.

PROPOSED SOLUTION - ARRIVAL: Approximate Regular-simple-path Reachability In Vertex and Arc (directed edges) Labeled Graphs: an index-free, near-optimal algorithm with linear storage and time complexity. Based on the simple idea of sampling a small number of paths and then answering the RSP query by processing only this sampled set.

Entity Matching Meets Data Science: A Progress Report from the Magellan Project

WHAT IS MAGELLAN PROJECT? A system that tries to solve Entity Matching problems. Started at the University of Wisconsin-Madison. Aim is creating an Entity Matching system that can advance EM field like R or Hadoop in Big Data field. Ideas from Data Science.

WHAT IS ENTITY MATCHING? Find data instances that refer to the same real-world entity. Dave Smith, Madison & David D. Smith, Madison are same entities. Matching tuples from two different tables or within a single table.

ENTITY MATCHING STEPS: Blocking Step: Using heuristics to remove non-matching pairs at the beginning. Matching Step.

RELATED WORK on Data Science Magellan borrows many trends and ideas from DS. Ecosystem of interoperable tools such as R, PyData etc. Microservice architecture. Using Machine Learning techniques.

CONCLUSION: Magellan borrows ideas from Data Science. PyMatcher for power users, on-premise Python packages. CloudMatcher for lay users, on future cloud-native microservices. Planning to apply Magellan to other data integration problems. Schema matching, data exploration and information extraction.

Related entries:

Tags: