python, finding recurring pairs of data

I would like to find all pairs of data that appear together at least 10 times.

For example, given a large input file of keywords:

>>> foo, bar, spam, eggs, durian, stinky tofu, ...
>>> fruit, meat, vinegar, sphere, foo, ...
>>> merlot, red, hearty, spam, oranges, durian, ...
>>> ...

"durian" and "spam" appear together twice and all other combinations appear together only once.

Rather than a brute-force approach counting through all possible keyword pairs, we can use python set operations to quickly determine the number of times two keywords appeared together.

#!/usr/bin/env python
'''
Processes the given input and returns all pairs that appear together in 
at least 10 different lines.

Algorithm Explaination:

Hash each value such that each value maps a 'set' of row indexes 
(i.e., each line in the file)

Next, rather than compare all possible combinations of values, create a 
shorter list of candidate values who have 10 or more row indexes.

The set-intersection will contain all rows that BOTH values were listed in, 
if the size of the set-intersection is greater than 10, output this pair
'''

import codecs, csv, argparse, sys

def setIntersection(infile):
    datafile = codecs.open(infile, 'r', 'utf-8')
    csv_data = csv.reader(datafile)
    data_sets = {}
    i = 0
    for row in csv_data:
        i = i+1
        for a in row:
            if a in data_sets:
                data_sets[a].add(i)
            else:
                data_sets[a] = set([i])
    candidates = {a:s for a,s in data_sets.items() if len(s) > 10}
    pairc = {a for a in candidates.keys()}
    for a,s in candidates.items():
        pairc.remove(a)
        for c in pairc:
            if len(s & candidates[c]) > 10:
                print('%s, %s, %s' % (a,c,len(s & candidates[c])))

def main():
    p = argparse.ArgumentParser()
    p.add_argument('--output-file', '-o', default='')
    p.add_argument('--bitmap', '-b', action="store_true")
    p.add_argument('files', nargs='+')
    args = p.parse_args()

    if (args.output_file != ''):
        sout = sys.stdout
        fh = open(options.output_file, 'w')
        sys.stdout = fh

    for infile in args.files:
        setIntersection(infile)

    if (args.output_file != ''):
        sys.stdout = sout
        fh.close()

if __name__ == '__main__':
    main()

Alternatively, we could use a bitmap comparison and compute the Hamming Weight, e.g.,

def bitmapIntersection(infile):
    '''
    The bitmap example is more general purpose and easily ported to C.  In fact, 
    the bitCount function is a python version of computing Hamming Weight from K&R.

    A simple example illustrates the bitmap,

      keyword1 : [0 0 0 0 0 0 1 0 1 0 ... ]
      keyword2 : [0 1 1 0 0 0 1 1 1 0 ... ]
             & : [0 0 0 0 0 0 1 0 1 0 ... ] <- bitwise-AND
    '''
    datafile = codecs.open(infile, 'r', 'utf-8')
    csv_data = csv.reader(datafile)
    data_maps = {}
    i = 0
    for row in csv_data:
        for a in row:
            if a in data_maps:
                data_maps[a] += (2 ** i)
            else:
                data_maps[a] = (2 ** i)
        i = i+1
    candidates = {a:b for a,b in data_maps.items() if bitCount(b) > 10}
    pairc = {a for a in candidates.keys()}
    for a,b in candidates.items():
        pairc.remove(a)
        for c in pairc:
            bits = bitCount(b & candidates[c])
            if bits > 10:
                print('%s, %s, %s' % (a,c,bits))

def bitCount(int_type):
    ''' ported from Kernighan & Ritchie's "The C Programming Language"
    '''
    count = 0
    while(int_type):
        int_type &= int_type - 1
        count += 1
    return(count)

A bitmap comparison is functionally identical to the set intersection but provides potential optimization enhancements such as using encoded vectors like the BitVector module.

This entry was posted in python. Bookmark the permalink.

Comments are closed.