Limits...
Efficient and scalable graph similarity joins in MapReduce.

Chen Y, Zhao X, Xiao C, Zhang W, Tang J - ScientificWorldJournal (2014)

Bottom Line: With the potential issue of too many key-value pairs in the filtering phase, spectral Bloom filters are introduced to reduce the number of key-value pairs.Furthermore, we integrate the multiway join strategy to boost the verification, where a MapReduce-based method is proposed for GED calculation.The superior efficiency and scalability of the proposed algorithms are demonstrated by extensive experimental results.

View Article: PubMed Central - PubMed

Affiliation: College of Information System and Management, National University of Defense Technology, Changsha 410073, China.

ABSTRACT
Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning, and near duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given threshold. Leveraging the MapReduce programming model, we propose MGSJoin, a scalable algorithm following the filtering-verification framework for efficient graph similarity joins. It relies on counting overlapping graph signatures for filtering out nonpromising candidates. With the potential issue of too many key-value pairs in the filtering phase, spectral Bloom filters are introduced to reduce the number of key-value pairs. Furthermore, we integrate the multiway join strategy to boost the verification, where a MapReduce-based method is proposed for GED calculation. The superior efficiency and scalability of the proposed algorithms are demonstrated by extensive experimental results.

Show MeSH
Evaluating verification.
© Copyright Policy - open-access
Related In: Results  -  Collection


getmorefigures.php?uid=PMC4121100&req=5

fig3: Evaluating verification.

Mentions: The verification was evaluated with candidate pairs generated by Basic. Term “+MJ” denotes applying multiway join in verification, while “+MRGED” denotes adapting alternative MapReduce-based GED calculation. We use term “MGSJoin” to indicate the basic algorithm incorporating both techniques. Figure 3(a) shows the runtime comparison between +MRGED and MGSJoin. The result illustrates the superiority of applying multiway join. When τ = 5, algorithm with multiway join is about 6,000 seconds faster than with ordinary joins. Figure 3(b) presents the result of evaluating MRGED, where Basic and +MRGED are compared. It can be observed that the runtime of both algorithms grows exponentially. When τ equals 1, the Basic finishes quicker than +MRGED (162 s and 204 s, resp.), while τ equals 2; the +MRGED outweighs Basic (the runtime is 712 s and 580 s, resp.). This is because the calculation required for τ = 1 is small, where the MRGED is clumsy compared with A*, whereas when τ = 2 more calculation is required so that the advantage of MRGED comes out. When τ is within the range of 3–5 Basic is unable to finish because the large calculation drives Basic out of memory.


Efficient and scalable graph similarity joins in MapReduce.

Chen Y, Zhao X, Xiao C, Zhang W, Tang J - ScientificWorldJournal (2014)

Evaluating verification.
© Copyright Policy - open-access
Related In: Results  -  Collection

Show All Figures
getmorefigures.php?uid=PMC4121100&req=5

fig3: Evaluating verification.
Mentions: The verification was evaluated with candidate pairs generated by Basic. Term “+MJ” denotes applying multiway join in verification, while “+MRGED” denotes adapting alternative MapReduce-based GED calculation. We use term “MGSJoin” to indicate the basic algorithm incorporating both techniques. Figure 3(a) shows the runtime comparison between +MRGED and MGSJoin. The result illustrates the superiority of applying multiway join. When τ = 5, algorithm with multiway join is about 6,000 seconds faster than with ordinary joins. Figure 3(b) presents the result of evaluating MRGED, where Basic and +MRGED are compared. It can be observed that the runtime of both algorithms grows exponentially. When τ equals 1, the Basic finishes quicker than +MRGED (162 s and 204 s, resp.), while τ equals 2; the +MRGED outweighs Basic (the runtime is 712 s and 580 s, resp.). This is because the calculation required for τ = 1 is small, where the MRGED is clumsy compared with A*, whereas when τ = 2 more calculation is required so that the advantage of MRGED comes out. When τ is within the range of 3–5 Basic is unable to finish because the large calculation drives Basic out of memory.

Bottom Line: With the potential issue of too many key-value pairs in the filtering phase, spectral Bloom filters are introduced to reduce the number of key-value pairs.Furthermore, we integrate the multiway join strategy to boost the verification, where a MapReduce-based method is proposed for GED calculation.The superior efficiency and scalability of the proposed algorithms are demonstrated by extensive experimental results.

View Article: PubMed Central - PubMed

Affiliation: College of Information System and Management, National University of Defense Technology, Changsha 410073, China.

ABSTRACT
Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning, and near duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given threshold. Leveraging the MapReduce programming model, we propose MGSJoin, a scalable algorithm following the filtering-verification framework for efficient graph similarity joins. It relies on counting overlapping graph signatures for filtering out nonpromising candidates. With the potential issue of too many key-value pairs in the filtering phase, spectral Bloom filters are introduced to reduce the number of key-value pairs. Furthermore, we integrate the multiway join strategy to boost the verification, where a MapReduce-based method is proposed for GED calculation. The superior efficiency and scalability of the proposed algorithms are demonstrated by extensive experimental results.

Show MeSH