The line-by-line feature in Python allows it to count hard disk-bound data. The most frequently used data structures in Python are list and dictionary. Many cases the dictionary has advantages since it is a basically a hash table that many realizes O(1) operations.
However, for the tasks of counting values, the two options make no much difference and we can choose any of them for convenience. I listed two examples below.
Use a dictionary as a counter
There is a question to count the strings in Excel.
Count the unique values in one column in EXCEL 2010. The worksheet has 1 million rows and 10 columns.
or numbers.
For example,
A5389579_10
A1543848_6
A5389579_8
Need to cut off the part after (including) underscore such as from A5389579_10 to A5389579
Commonly Excel on a desktop can’t handle this size of data, while Python would easily handle the job.
# Load the Excel file by the xlrd package
import xlrd
book = xlrd.open_workbook("test.xlsx")
sh = book.sheet_by_index(0)
print sh.name, sh.nrows, sh.ncols
print "Cell D30 is", sh.cell_value(rowx=29, colx=3)
# Count the unique values in a dictionary
c = {}
for rx in range(sh.nrows):
word = str(sh.row(rx)[1].value)[:-3]
try:
c[word] += 1
except:
c[word] = 1
print c
Use a list as a counter
There is a question to count emails.
The challenge is to scale up the small sample data to larger size. The solution I have has the complexity of O(nlogn), which is only limited by the sorting step.A 3-column data set includes sender, receiver and timestamp. How to calculate the time between the sender sends the email
and the receiver sends the reply email?
raw_data = """
SENDER|RECEIVER|TIMESTAMP
A B 56
A A 7
A C 5
C D 9
B B 12
B A 8
F G 12
B A 18
G F 2
A B 20
"""
# Transform the raw data to a nested list
data = raw_data.split()
data.pop(0) # Remove the Head
data = zip(data[0::3], data[1::3], map(lambda x: int(x), data[2::3]))
# Sort the nested list by the timestamp
from operator import itemgetter
data.sort(key=itemgetter(2))
for r in data:
print r
# Count the time difference in a list
c = []
while len(data) != 1:
y = data.pop(0)
for x in data:
if x[0] == y[1] and x[1] == y[0]:
diff = x[2] - y[2]
print y, x, '---->', diff
c.append(diff)
break # Only find the quickest time to respond
print c
P.S.
I come up with the O(n) solution below, which utilizes two hash tables to decrease the complexity.
__author__ = 'dapangmao'
def find_duration(data):
# Construct two hash tables
h1 = {}
h2 = {}
# Find the starting time for each ID pair
for x in data:
if x[0] != x[1]:
key = x[0] + x[1]
try:
h1[key] = x[2]
except:
h1[key] = min(h1[key], x[2])
# Find the minimum duration for each ID pair
for x in data:
key = x[1] + x[0]
if h1.has_key(key):
duration = x[2] - h1[key]
try:
h2[key] = duration
except:
h2[key] = min(h2[key], duration)
return h2
if __name__ == "__main__":
raw_data = """
SENDER|RECEIVER|TIMESTAMP
A B 56
A A 7
A C 5
C D 9
B B 12
B A 8
F G 12
B A 18
G F 2
A B 20
"""
# Transform the raw data to a nested list
data = raw_data.split()
data.pop(0) # Remove the Head
data = zip(data[0::3], data[1::3], map(lambda x: int(x), data[2::3]))
# Verify the result
print find_duration(data)