Min hash is a probabilistic method for estimation the similarity of two sets in terms of their Jaccard index. We demonstrate that this method performs best when the sets under consideration are of similar size and the performance degrades considerably when the sets are of very disparate size. In this manuscript, we introduce a new and efficient approach, called the containment min hash approach, that is more suitable for sets of very different size. We accomplish this by leveraging another probabilistic method (in particular, Bloom filters) for fast membership queries. We derive bounds on the probability of deviation for both the classic min hash technique and our containment approach and demonstrate the significant improvement in terms of both relative error and time/space complexity. We then use this method to analyze metagenomic data (that is, data given by the totality of DNA from a given environmental sample of microorganisms) and show how it can be used to detect the presence of very small, low abundance microorganisms.
Hooman's major professor is Prof. David Koslicki.