Database Security, AutoML, and Data Lake Table Joining
ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications
Motivation:
- 12 popular self-hosted e-commerce applications were studied (deployed on over 2M websites, representing over 50% of all e-commerce websites)
- 22 critical ACIDRain attacks identified and verified
- Real-world precedent: the Bitcoin exchange Flexcoin was bankrupted by a concurrency-related attack
Problem Definition:
An application is vulnerable if:
- Anomalies Possible: Under concurrent API access, the application may exhibit behaviors (i.e., anomalies) that could not have arisen under serial execution.
- Sensitive Invariants: The anomalies arising from concurrent access lead to violations of application invariants.
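The two conditions above can be illustrated with a small deterministic simulation. This is a minimal sketch, not taken from the paper: the voucher scenario, the two-step read-then-write model of an API call, and all names are assumptions made for illustration. It enumerates every interleaving of two calls and compares the outcomes with those reachable under serial execution.

```python
from itertools import combinations

INITIAL_BALANCE = 100  # hypothetical gift voucher balance

def run(schedule, amounts):
    """Run a schedule: each appearance of a call id executes that call's next step."""
    balance, spent = INITIAL_BALANCE, 0
    observed = {}  # call id -> balance seen by that call's read step
    for call in schedule:
        if call not in observed:
            observed[call] = balance                       # step 1: read
        elif observed[call] >= amounts[call]:              # validate the (possibly stale) read
            balance = observed[call] - amounts[call]       # step 2: write
            spent += amounts[call]
    return balance, spent

def all_schedules():
    """All interleavings of two calls with two steps each."""
    return [tuple(0 if i in slots else 1 for i in range(4))
            for slots in combinations(range(4), 2)]

amounts = {0: 100, 1: 100}
serial = {run(s, amounts) for s in [(0, 0, 1, 1), (1, 1, 0, 0)]}
concurrent = {run(s, amounts) for s in all_schedules()}

# Anomalies: outcomes that no serial execution can produce.
anomalies = concurrent - serial
# Invariant: total spending must never exceed the voucher balance.
violations = {outcome for outcome in concurrent if outcome[1] > INITIAL_BALANCE}
print("anomalies:", anomalies, "violations:", violations)
```

In this toy model the anomaly is also an invariant violation, which is exactly the combination that makes an application vulnerable.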
Solution:
- Execute API calls against a live application and database to generate a (possibly sequential) trace of database activity.
- Analyze the trace for potential anomalies that could arise under concurrent execution.
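As a rough illustration of the trace analysis step, the sketch below scans a sequential trace and flags keys that an API call reads without a lock and later writes. This is a deliberately simplified heuristic, not the analysis used in the paper; the `Op` record, its fields, and the example trace are hypothetical.

```python
from collections import namedtuple

# Hypothetical trace record: which API call issued the operation,
# whether it was a read or a write, the key touched, and whether
# the read held a lock (for example via SELECT ... FOR UPDATE).
Op = namedtuple("Op", "api_call kind key locked")

def potential_anomalies(trace):
    """Return (api_call, key) pairs that are read unlocked and later written."""
    unlocked_reads = set()
    flagged = []
    for op in trace:
        pair = (op.api_call, op.key)
        if op.kind == "read" and not op.locked:
            unlocked_reads.add(pair)
        elif op.kind == "write" and pair in unlocked_reads and pair not in flagged:
            flagged.append(pair)
    return flagged

trace = [
    Op("checkout", "read", "voucher:42", False),
    Op("checkout", "write", "voucher:42", False),
    Op("add_to_cart", "read", "cart:7", True),
    Op("add_to_cart", "write", "cart:7", False),
]
flagged = potential_anomalies(trace)
print(flagged)
```

A flagged pair only marks a candidate: whether it matters still depends on whether the resulting anomaly can break an application invariant.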
Prevention:
- SELECT ... FOR UPDATE (lock the rows that are read for validation until the transaction commits)
- User-level concurrency control:
- Prevent concurrent calls in the same session
- Single read of data
- Multiple validations
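The locking approach can be sketched as follows. SQLite has no SELECT ... FOR UPDATE, so this example uses `BEGIN IMMEDIATE`, which takes the database write lock before the read, as a stand-in; the `vouchers` table and the `redeem` function are hypothetical. On PostgreSQL or MySQL the read would instead be `SELECT balance FROM vouchers WHERE id = ... FOR UPDATE` inside a transaction.

```python
import sqlite3

def redeem(conn, voucher_id, amount):
    """Read, validate, and update inside one transaction that holds the write lock."""
    conn.execute("BEGIN IMMEDIATE")  # stand-in for SELECT ... FOR UPDATE
    try:
        row = conn.execute(
            "SELECT balance FROM vouchers WHERE id = ?", (voucher_id,)
        ).fetchone()
        if row is None or row[0] < amount:
            return False  # validation failed; the finally block rolls back
        conn.execute(
            "UPDATE vouchers SET balance = balance - ? WHERE id = ?",
            (amount, voucher_id),
        )
        conn.execute("COMMIT")
        return True
    finally:
        if conn.in_transaction:
            conn.execute("ROLLBACK")

# isolation_level=None lets the code issue BEGIN/COMMIT explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE vouchers (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO vouchers VALUES (42, 100)")

results = [redeem(conn, 42, 100), redeem(conn, 42, 100)]
balance = conn.execute("SELECT balance FROM vouchers WHERE id = 42").fetchone()[0]
print(results, balance)
```

Because the validation read and the write happen under the same lock, a second redemption can only observe the balance after the first one has committed.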
The authors opened issues on GitHub:
- 18 different vulnerabilities were reported
- 7 of them were confirmed
- One report received this response: "use your brain! it's not hard to come up with a solution that does not involve coding."
Democratizing Data Science through Interactive Curation of ML Pipelines
Introduction to Alpine Meadow:
Alpine Meadow is a new automated machine learning (AutoML) tool:
- The tool is interactive, which the authors claim is important for fast operation.
- The tool selects the best possible setting for a given ML task among the available options and constructs a final pipeline automatically.
- It is claimed to outperform many comparable systems in several cases.
AutoML Approach:
AutoML, or "Learning to Learn", is an approach in ML systems that aims to produce the best prediction result as quickly as possible by automatically selecting the best possible:
- Data processing steps
- Learning algorithm
- Hyperparameters
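The three choices listed above can be pictured as a search space. The following is a minimal sketch of an exhaustive search over that space on toy data; it is not how Alpine Meadow searches, and the dataset, the two preprocessors, the two learners, and the hyperparameter grids are all assumptions made for illustration.

```python
import math
from itertools import product

# Toy one-dimensional dataset: (feature, label) pairs.
TRAIN = [(1.0, 0), (2.0, 0), (3.0, 0), (10.0, 1), (11.0, 1), (12.0, 1)]
VALID = [(2.5, 0), (10.5, 1), (4.0, 0), (9.0, 1)]

PREPROCESSORS = {"identity": lambda x: x, "log": math.log}

def build_knn(train, k):
    """k-nearest-neighbour classifier with majority vote."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return 1 if 2 * votes > k else 0
    return predict

def build_threshold(train, _hp):
    """Cut halfway between the class means (assumes class 1 lies above class 0)."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    cut = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > cut else 0

# Learner name -> (builder, hyperparameter grid).
LEARNERS = {"knn": (build_knn, [1, 3]), "threshold": (build_threshold, [None])}

def search():
    """Score every (preprocessor, learner, hyperparameter) pipeline on VALID."""
    results = []
    for (pname, prep), (lname, (build, grid)) in product(
        PREPROCESSORS.items(), LEARNERS.items()
    ):
        train = [(prep(x), y) for x, y in TRAIN]
        for hp in grid:
            model = build(train, hp)
            accuracy = sum(model(prep(x)) == y for x, y in VALID) / len(VALID)
            results.append(((pname, lname, hp), accuracy))
    return results

results = search()
best_pipeline, best_score = max(results, key=lambda r: r[1])
print(best_pipeline, best_score)
```

Exhaustive enumeration only works for a space this small; realistic search spaces are what make automated and faster selection strategies necessary.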
The Need for Improvements in AutoML:
Current AutoML systems take days or weeks to complete and usually do not allow user intervention. A quick response is important, and when computing power is limited, user interaction can save time.
Contributions of Alpine Meadow:
The tool is interactive and demonstrates the benefits of user interaction. Its pipeline selection method provides new insights for AutoML approaches.
System Overview:
The approach starts by mimicking a data scientist:
- Optimize and automate all steps as much as possible.
- Build many pipelines; select the most plausible one; show the results of the best one.
- The selection process is analogous to query optimization.
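One way to picture the query-optimization analogy is a cost-based ranking of candidate pipelines that reports the best result found so far, so a user sees answers early. The sketch below is an assumption-laden illustration, not Alpine Meadow's actual selection algorithm: the candidate names, the estimated scores and costs, the budget, and the `evaluate` callback are all hypothetical.

```python
def interactive_search(candidates, evaluate, budget):
    """Yield (pipeline, score) whenever a better pipeline is found within the budget."""
    spent = 0.0
    best_score = float("-inf")
    # Rank by estimated score per unit of cost, much as a query optimizer
    # ranks plans by estimated cost before executing any of them.
    ranked = sorted(
        candidates, key=lambda c: c["est_score"] / c["est_cost"], reverse=True
    )
    for cand in ranked:
        if spent + cand["est_cost"] > budget:
            continue  # too expensive for the remaining budget
        spent += cand["est_cost"]
        score = evaluate(cand["name"])
        if score > best_score:
            best_score = score
            yield cand["name"], score

# Hypothetical candidates with estimated quality and cost (arbitrary units).
candidates = [
    {"name": "scale+logreg", "est_score": 0.80, "est_cost": 1},
    {"name": "pca+svm", "est_score": 0.85, "est_cost": 5},
    {"name": "raw+forest", "est_score": 0.90, "est_cost": 3},
    {"name": "embed+deepnet", "est_score": 0.95, "est_cost": 50},
]
# Hypothetical measured scores standing in for real training runs.
measured = {"scale+logreg": 0.78, "pca+svm": 0.83, "raw+forest": 0.88, "embed+deepnet": 0.97}

updates = list(interactive_search(candidates, measured.get, budget=10))
for name, score in updates:
    print(name, score)
```

Using a generator means each improvement can be shown to the user as soon as it is available instead of after the whole search finishes.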
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
Introduction:
JOSIE is a proposed solution for finding joinable tables:
- Given a table and a join column, the system finds joinable tables with a cheaper method: overlap set similarity search, which ranks candidate columns by the number of distinct values they share with the query column.
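Overlap set similarity search can be sketched with an inverted index that maps each value to the columns containing it. The example below is a plain exact baseline, not JOSIE's cost-based algorithm; the column names, the sample values, and the function names are hypothetical.

```python
import heapq
from collections import Counter, defaultdict

def build_index(columns):
    """Inverted index: value -> set of column names that contain it."""
    index = defaultdict(set)
    for name, values in columns.items():
        for value in set(values):
            index[value].add(name)
    return index

def top_k_overlap(index, query_values, k):
    """Return the k columns sharing the most distinct values with the query column."""
    counts = Counter()
    for value in set(query_values):
        for name in index.get(value, ()):
            counts[name] += 1
    # Highest overlap first; ties broken by column name for determinism.
    return heapq.nsmallest(k, counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical data lake columns.
columns = {
    "cities.name": ["Boston", "Toronto", "Berlin", "Paris"],
    "offices.city": ["Boston", "Toronto", "Lima"],
    "airports.city": ["Paris", "Tokyo"],
    "products.sku": ["A1", "B2"],
}
index = build_index(columns)
matches = top_k_overlap(index, ["Boston", "Toronto", "Paris", "Rome"], k=2)
print(matches)
```

This baseline reads every posting list for the query values, which is the kind of work a cost-aware method tries to avoid on large data lakes.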
Conclusion:
- Thanks to its cost function, JOSIE adapts to new data distributions.
- Suitable for large data lakes.
- Outperforms an approximate algorithm.
Future Work:
- Estimation of set intersection size based on token frequencies.
- Automated selection of query columns using past statistics.
- Fuzzy join.