Python is a language universally praised for cutting down development time, but using it efficiently for data analysis is not without its pitfalls. The same features that make development easy early on (a dynamic, permissive type system) can be the downfall of large systems, while confusing libraries, slow running times and a failure to design with data integrity in mind can quickly eat up development time instead.
This article will cover the most common time wasters encountered when working with Python and Big Data and provide suggestions to get back on track and spend time on what really matters: using creativity and scientific methods to generate insights from vast amounts and diverse types of data.
## Mistake #1: Reinventing the wheel
The Python community is in a very good state when it comes to data analysis libraries that have rich functionality and have been extensively tested. So, why reinvent the wheel?
I see this a lot during code challenges where the candidate needs to load a CSV file into memory in order to work with it. Quite a few spend a large chunk of time writing custom CSV loading functions and invariably end up with a dictionary of dictionaries which is slow to query and difficult to transform. That leaves them with little time to impress with their ability to generate insights from data.
There truly is no reason to get bogged down solving already-solved problems: spend a few minutes on a Google search, or ask a more experienced developer to suggest a data analysis library.
Incidentally, one such library that's widely used at the time of this writing is Python Pandas. It comes with useful abstractions for dealing with large datasets, a lot of functionality for ETL (extract, transform, load) and good performance. It cuts down developer time by enabling the succinct expression of data transformations and providing functions to load, unify and store data from different sources and formats.
To illustrate this, let's say we have a CSV file with the header `Product,ItemsSold`, and we want to find the ten most popular products. Let's compare a reasonable implementation in vanilla Python with one using the powerful abstractions of pandas:
### Vanilla Python

```python
from collections import defaultdict

header_skipped = False
sales = defaultdict(lambda: 0)
with open(filename, 'r') as f:
    for line in f:
        if not header_skipped:
            header_skipped = True
            continue
        line = line.split(",")
        product = line[0]
        num_sales = int(line[1])
        sales[product] += num_sales
top10 = sorted(sales.items(), key=lambda x: x[1], reverse=True)[:10]
```
### Pandas

```python
import pandas as pd

data = pd.read_csv(filename)  # the header is conveniently inferred by default
top10 = data.groupby("Product")["ItemsSold"].sum().sort_values(ascending=False)[:10]
```
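Pandas also delivers on the "load, unify and store data from different sources and formats" promise. A minimal sketch, where the in-memory CSV and JSON snippets are hypothetical stand-ins for real files:

```python
import io

import pandas as pd

# Hypothetical sales data arriving in two different formats; pandas loads
# both into DataFrames with the same shape, so concat() can unify them:
csv_part = io.StringIO("Product,ItemsSold\nwidget,5\ngadget,3\n")
json_part = io.StringIO('[{"Product": "widget", "ItemsSold": 2}]')

data = pd.concat([pd.read_csv(csv_part), pd.read_json(json_part)],
                 ignore_index=True)

# The same groupby pipeline works on the unified data:
top = data.groupby("Product")["ItemsSold"].sum().sort_values(ascending=False)
print(top.head(10))
```

From here, a single `data.to_csv(...)` or `data.to_parquet(...)` call stores the unified result in whichever format downstream consumers need.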
## Mistake #2: Not tuning for performance
When programs take too long to produce any output, the developer's rhythm and focus are broken. Slow programs also limit the amount of experimentation a developer can do: if your program takes ten minutes to produce results for even a small dataset, you can only tweak and re-execute it around thirty times per working day.
So, when you find yourself sitting idly waiting for the code to execute, it may be time to try and identify bottlenecks. There are specialized utilities available to help developers profile and speedup their code. Most of these work within the IPython interactive shell.
The simplest way to profile code within IPython is the %timeit magic command, which reports the runtime of a Python statement. A more sophisticated tool is the line profiler, which you can install with pip. After launching IPython, type:
```
%load_ext line_profiler
%lprun -f function_to_profile statement_that_invokes_the_function
```

which prints a per-line timing report with the following header:

```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
```
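When IPython is not available, for example in a batch script, the standard library's timeit module gives the same kind of measurement that %timeit provides. A minimal sketch timing an arbitrary statement:

```python
import timeit

# Time how long sorting a small list takes, averaged over many runs;
# this mirrors what the %timeit magic does under the hood:
n_runs = 10000
elapsed = timeit.timeit("sorted(data)",
                        setup="data = list(range(1000))",
                        number=n_runs)
print(f"{elapsed / n_runs * 1e6:.1f} µs per call")
```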
However, if you've reached the point where you are sure the algorithm is of optimal complexity and implementation for the problem, it may pay off to cythonize parts of your code for performance. Using the %timeit command from above we can compare the runtime of the uncythonized version and the cythonized one.
### Uncythonized version
Paste this into IPython:
```python
def sum_uncythonized():
    a = 0
    for i in range(100000):
        a += i
    return a
```
### Cythonized version

Install Cython if you don't have it already, and within IPython type:

```
%load_ext cythonmagic
```

Then copy-paste the following text as a single block:
```
%%cython
def sum_cythonized():
    cdef long a = 0  # the cdef directive declares a C type for the variable
    cdef int i = 0
    for i in range(100000):
        a += i
    return a
```
```
%timeit sum_cythonized()
10000 loops, best of 3: 52.3 µs per loop

%timeit sum_uncythonized()
100 loops, best of 3: 3.28 ms per loop
```

That is a speed-up of roughly 60x.
## Mistake #3: Not understanding time and timezones
When programmers first work with time, Epoch time can be a strange concept to grasp. The key insight is that the Epoch timestamp is the same number all around the world at any given instant, but how that number translates into hours and minutes of the day depends on the timezone and on the time of year (because of daylight saving time). In Python, these translations are handled by the datetime and pytz modules.
Python's built-in time handling modules can be confusing, both in how their functions are named and in how to convert between different representations of time, much like the C time libraries they are internally based on. It is important to be able to work with them correctly, because data in time-series format is very common. One pain point is timezone handling. A common misconception concerns the following statement:

```python
dt = datetime.fromtimestamp(utc_number)
```
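The misconception is that this returns the UTC time: in fact, `fromtimestamp` converts the timestamp to the local timezone of the machine it runs on, so the same code yields different wall-clock values on differently configured hosts. A short sketch contrasting the two (the timestamp value is arbitrary):

```python
from datetime import datetime, timezone

ts = 954181200  # an arbitrary Unix timestamp: 2000-03-27 18:20:00 UTC

# fromtimestamp() converts to the *local* timezone of the machine,
# so this naive wall-clock value differs from host to host:
local_dt = datetime.fromtimestamp(ts)

# Passing an explicit timezone yields an unambiguous, aware datetime:
utc_dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(utc_dt)  # 2000-03-27 18:20:00+00:00
```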
For timezone support, the pytz module works in tandem with the datetime module. It can construct the local time when the UTC time is known, and it also respects daylight saving rules. We can verify this as follows:

```python
from datetime import datetime
import pytz

ams = pytz.timezone('Europe/Amsterdam')
winter = datetime(2000, 11, 1, 10, 0, tzinfo=pytz.utc)
summer = datetime(2000, 6, 1, 10, 0, tzinfo=pytz.utc)

print(summer.astimezone(ams))  # CEST (summer) time is +2 hours
# 2000-06-01 12:00:00+02:00
print(winter.astimezone(ams))  # CET (winter) time is +1 hour
# 2000-11-01 11:00:00+01:00

td = datetime(2000, 11, 1, 10, 0, tzinfo=pytz.utc) - datetime(2000, 11, 1, 10, 0, tzinfo=ams)
print(td.total_seconds())
# 1200.0 -- 20 minutes? (passing tzinfo=ams makes pytz fall back to a long
# outdated timezone setting for Amsterdam)
```
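The 20-minute surprise comes from constructing the datetime with `tzinfo=ams` directly, which makes pytz pick the zone's first historical offset instead of the one valid on that date. pytz's remedy is its `localize` method; a minimal sketch:

```python
from datetime import datetime
import pytz

ams = pytz.timezone('Europe/Amsterdam')

# Wrong: attaching a pytz zone via tzinfo= silently uses the zone's first
# historical offset (Amsterdam's 19th-century local mean time, ~+00:20),
# not the CET/CEST offset in effect on that date:
wrong = datetime(2000, 11, 1, 10, 0, tzinfo=ams)

# Right: let pytz compute the offset actually in effect on that date:
right = ams.localize(datetime(2000, 11, 1, 10, 0))

print(wrong.utcoffset())  # roughly 0:20:00
print(right.utcoffset())  # 1:00:00 (CET in November)
```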
Recent libraries that tackle the problem of providing a more intuitive API on top of the maze that is Python time handling include Delorean, times and arrow.