Titian: Data Provenance Support in Spark.

Interlandi M, Shah K, Tetali SD, Gulzar MA, Yoo S, Kim M, Millstein T, Condie T - Proceedings of the VLDB Endowment (2015)

Bottom Line: Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.


Affiliation: University of California, Los Angeles.

ABSTRACT

Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.

No MeSH data available.
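
As a rough illustration of the record-level lineage idea described in the abstract (and of the RAMP-style capture evaluated in Figure 3 below), the following hand-rolled Scala sketch tags each input line of a plain Spark word count with its offset and propagates those tags through every transformation, so that an outlier output can be traced back to the input lines that produced it. This is a conceptual sketch only, not Titian's API or implementation; the input path and the outlier threshold are placeholders, and carrying lineage sets inline like this is far costlier than Titian's built-in instrumentation.

    import org.apache.spark.{SparkConf, SparkContext}

    // Hand-rolled lineage capture for word count: every record carries the
    // set of input line offsets it derives from. Illustrative sketch only.
    object ManualLineageWordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("lineage-sketch"))

        // Tag each input line with its offset, a simple lineage identifier.
        val tagged = sc.textFile("hdfs://input/words.txt") // placeholder path
          .zipWithIndex() // (line, lineId)

        // Propagate the tag through tokenization: (word, (count, lineage)).
        val pairs = tagged.flatMap { case (line, id) =>
          line.split(" ").map(word => (word, (1, Set(id))))
        }

        // Sum the counts and union the lineage sets during the reduce.
        val counts = pairs.reduceByKey { case ((c1, l1), (c2, l2)) =>
          (c1 + c2, l1 ++ l2)
        }

        // Trace suspiciously frequent words back to their input lines.
        counts.filter { case (_, (n, _)) => n > 1000 } // placeholder threshold
          .collect()
          .foreach { case (word, (n, lineage)) =>
            println(s"$word: $n (from input lines ${lineage.mkString(", ")})")
          }
      }
    }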



Figure 3: Run time of Newt and RAMP data lineage capture in a Spark word count job. The table summarizes the plot results at four dataset sizes, and indicates the run time as a multiplier of the native Spark job execution time.

Mentions: Figure 3 gives a quantitative assessment of the additional time needed to execute a word count job when capturing lineage with Newt. The results also include a version of the RAMP design that we built in the Titian framework. In this experiment, only RAMP is able to complete the workload in all cases, and it incurs a fairly reasonable amount of overhead: on average, RAMP runs at 2.3× the Spark execution time. The overhead observed with Newt is considerably worse (up to 86× the Spark run time), and Newt is unable to complete the 500 GB workload; simply put, MySQL could not sustain the data lineage throughput generated by this job. A more detailed description is available in Section 5.
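
To make the multipliers concrete: the reported figures are lineage-capture wall-clock time divided by native Spark wall-clock time for the same job and input size, so RAMP's 2.3× average means a capture run takes 2.3 times as long as the unmodified job. A minimal timing sketch in Scala, where runWordCount is an assumed placeholder for the job under test rather than part of any of these systems:

    object OverheadMultiplier {
      // Wall-clock seconds taken by a block of code.
      def time[T](body: => T): Double = {
        val start = System.nanoTime()
        body
        (System.nanoTime() - start) / 1e9
      }

      def main(args: Array[String]): Unit = {
        // Placeholder: run the same word count job with lineage capture
        // switched on or off.
        def runWordCount(captureLineage: Boolean): Unit = ???

        val nativeSecs  = time(runWordCount(captureLineage = false))
        val captureSecs = time(runWordCount(captureLineage = true))
        // e.g. RAMP averaging 2.3x means captureSecs / nativeSecs ~= 2.3
        println(f"overhead multiplier: ${captureSecs / nativeSecs}%.1fx")
      }
    }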

