Function or method is the most basic unit in Python programming. Test-driven development is a key for a developer to assure the code quality of those units. In his book, Harry Percival illustrated a few great examples about how to use TDD with Python and Django. It seems that for web development, TDD including unit testing and integration testing is the cornerstone for every success. For data analysis, coding mostly relies on built-in packages instead large framework like Django, which makes TDD easier. In my opnion, TDD in data analysis could have three steps.
- Step 1: requirement analysisBefore writing any code for data analysis, the programmer should seriously ask the customer or himself about the requirements.
- What are the input parameter?
- what if the input data doesn’t fit the assumptions?
- What is the purpose of this funtction or method? what are the desired outputs?
- For example, there is a recent coding challenge called Maximum Product Subarray.
> Find the contiguous subarray within an array (containing at least one number) which has the largest product.
For example, given the array [2,3,-2,4],
the contiguous subarray [2,3] has the largest product = 6. - OK, understanding this question is quite straight-forward. Given a array(or a list in Python), you return the integer that is the maximum product from a continuous subarry out of the input array.
def maxProduct(A):
""" A function to find the maximum product value for a continuous subarray.
:param A: an array or list
:type A: list
"""
if A is None or not isinstance(A, list):
return None
if len(A) == 1:
return A[0]
pass- A production version of the codes above should be more like:
class FunctionInputError(Exception):
pass
def maxProduct(A):
""" A function to find the maximum product value for a continuous subarray.
:param A: an array or list
:type A: list
"""
if A is None or not isinstance(A, list):
raise FunctionInputError('must give a list as input')
if len(A) == 1:
return A[0]
pass
- Step 2: write test casesGiven not a single line of logic codes has been writen yet, I call the current step as black-box testing, which means that I want this funtion to fail every test cases. Python has a built-in module doctest, which allows embedding the test cases within the docstring. I write six test cases, run the psedu-function below and arbitrarily specify the result to be -1. As expected, it fails all the six test cases with horrible red warnings under the Python shell. That is a good thing: it proves that the testing works.
import doctest
def maxProduct(A):
""" A function to find the maximum product value for a continuous subarray.
:param A: an array or list
:type A: list
- testcase1
>>> maxProduct([0, 2, 3,-2,4, 1, -1])
48
- testcase2
>>> maxProduct([2, 3, 0, 2, 4, 0, 3, 4])
12
- testcase3
>>> maxProduct([0, 5, 3, -1, -2, 0, -2, 4, 0, 3, 4])
30
- testcase4
>>> maxProduct([0, 1, -3, -4, -5, -1, 1, 2, 1, -1, -100, 0, -100000])
12000
- testcase5
>>> maxProduct([0, -3000, -2, 0, -100, -100, 0, -9, -8, 1, 1, 2])
10000
- testcase6
>>> maxProduct([0, -2, 0])
0
"""
if A is None or not isinstance(A, list):
return None
if len(A) == 1:
return A[0]
return -1
doctest.testmod() - Step 3: implement the logicIt’s time to tackle the most difficult part: write the real function. Think about time complexity (it is best to use only one iteration around the input array which means
O(n)
), and space complexity (it is best not to use extra space). Runtestmod()
again and again to find mistakes and modify the codes accordingly. Finally I come with a solution with a helper shadow function _maxProduct. And it passes the six test cases. Althoug I am not sure that this function does not have any bug, at least it works now.import doctest
from sys import maxint
def maxProduct(A):
""" A function to find the maximum product value for a continuous subarray.
:param A: an array or list
:type A: list
- testcase1
>>> maxProduct([0, 2, 3,-2,4, 1, -1])
48
- testcase2
>>> maxProduct([2, 3, 0, 2, 4, 0, 3, 4])
12
- testcase3
>>> maxProduct([0, 5, 3, -1, -2, 0, -2, 4, 0, 3, 4])
30
- testcase4
>>> maxProduct([0, 1, -3, -4, -5, -1, 1, 2, 1, -1, -100, 0, -100000])
12000
- testcase5
>>> maxProduct([0, -3000, -2, 0, -100, -100, 0, -9, -8, 1, 1, 2])
10000
- testcase6
>>> maxProduct([0, -2, 0])
0
"""
if A is None or not isinstance(A, list):
return None
if len(A) == 1:
return A[0]
return max(_maxProduct(A), _maxProduct([a for a in reversed(A)]))
def _maxProduct(A):
max_val_forward = 1
rst = -maxint
for a in A:
if a != 0:
max_val_forward *= a
rst = max(rst, max_val_forward)
else:
rst = max(0, rst)
max_val_forward = 1
return rst
if __name__ == "__main__":
doctest.testmod()
In conclusion, the most important thing about TDD in data analysis is writing test cases, which really needs a lot of training and exercises.