Wednesday, August 17, 2016

The Top Mistakes Developers Make When Using Python for Big Data Analytics (Part 1)


##1 Introduction

Python is a language universally praised for cutting down development time, but using it efficiently for data analysis is not without its pitfalls. The same features that make development easy in the beginning (a dynamic, permissive type system) can be the downfall of large systems, and confusing libraries, slow running times and a lack of attention to data integrity can quickly eat up development time instead.

This article will cover the most common time wasters encountered when working with Python and Big Data and provide suggestions to get back on track and spend time on what really matters: using creativity and scientific methods to generate insights from vast amounts and diverse types of data.

##2 Mistake #1: Reinventing the wheel

The Python community is in a very good state when it comes to data analysis libraries that have rich functionality and have been extensively tested. So, why reinvent the wheel?
I see this a lot during code challenges where the candidate needs to load a CSV file into memory in order to work with it. Quite a few spend a large chunk of time writing custom CSV loading functions and invariably end up with a dictionary of dictionaries, which is slow to query and difficult to transform. That leaves them with little time to demonstrate what actually matters: their ability to generate insights from data.

There truly is no reason to get bogged down with solving already-solved problems - spend a few minutes to do a Google search or ask a more experienced developer for suggestions for a data analysis library.

Incidentally, one such library that's widely used at the time of this writing is Python Pandas. It comes with useful abstractions for dealing with large datasets, a lot of functionality for ETL (extract, transform, load) and good performance. It cuts down developer time by enabling the succinct expression of data transformations and providing functions to load, unify and store data from different sources and formats.

To illustrate the former, let's say we have a CSV file with the header Product, ItemsSold and we want to find the ten most popular products. Let's compare a reasonable implementation in vanilla Python and an implementation using the powerful abstractions of Python Pandas:

###2.1 Vanilla Python
    from collections import defaultdict

    sales = defaultdict(int)  # unseen products start at 0 sales
    header_skipped = False
    with open(filename, 'r') as f:
        for line in f:
            if not header_skipped:  # skip the "Product,ItemsSold" header row
                header_skipped = True
                continue
            fields = line.split(",")
            product = fields[0]
            num_sales = int(fields[1])
            sales[product] += num_sales
    top10 = sorted(sales.items(), key=lambda x: x[1], reverse=True)[:10]
###2.2 Pandas
    import pandas as pd

    data = pd.read_csv(filename)  # the header is conveniently inferred by default
    # sort_values replaces the older, now-removed Series.order method
    top10 = data.groupby("Product")["ItemsSold"].sum().sort_values(ascending=False)[:10]
Notes: Doing the task in vanilla Python does have the advantage of not needing to load the whole file into memory - however, pandas does a lot behind the scenes to optimize I/O and performance, and the in-memory sales dictionary of the vanilla Python solution is not exactly lightweight either. If memory really is a concern, pandas can also process the file in chunks, as sketched below.
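A minimal sketch of chunked processing, assuming the same Product, ItemsSold layout and a hypothetical sales.csv file; the chunk size is arbitrary:

    import pandas as pd

    totals = None
    # chunksize makes read_csv yield DataFrames of at most 100,000 rows each
    for chunk in pd.read_csv("sales.csv", chunksize=100000):
        partial = chunk.groupby("Product")["ItemsSold"].sum()
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    top10 = totals.sort_values(ascending=False)[:10]

Each chunk is aggregated on its own and the partial sums are combined, so only one chunk has to be resident in memory at a time.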

##3 Mistake #2: Not tuning for performance

When programs take too much time to produce any output, the developer's rhythm and focus are broken. Slow programs also limit the amount of experimentation a developer can do - if your program takes ten minutes to output results for a small dataset, you can only tweak and re-run it around thirty times per day.

So, when you find yourself sitting idly waiting for the code to execute, it may be time to try and identify bottlenecks. There are specialized utilities available to help developers profile and speed up their code. Most of these work within the IPython interactive shell.

The simplest way to profile code within IPython is to use the %timeit magic command to get the runtime of a Python statement. A more sophisticated tool is the line profiler (the line_profiler package), which you can install separately, for example with pip. After launching IPython, type:
    %load_ext line_profiler
    %lprun -f function_to_profile statement_that_invokes_the_function
Subsequently, you get an output of this form, which describes how many times each line of the function was hit and what percentage of the execution time was spent on it:
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
Using the line profiler has personally helped me identify bottlenecks in code using the aforementioned Python Pandas library and achieve tenfold speedups by tweaking the implementation. A typical example of such a tweak is sketched below.
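To make this concrete, here is a hedged sketch of the kind of bottleneck the line profiler tends to surface in pandas code: a row-by-row loop that can be replaced by a vectorized expression. The function names and the UnitPrice column are made up for illustration:

    import pandas as pd

    def revenue_slow(df):
        # row-by-row iteration; line_profiler typically shows most of the time spent here
        total = 0.0
        for _, row in df.iterrows():
            total += row["ItemsSold"] * row["UnitPrice"]
        return total

    def revenue_fast(df):
        # the vectorized equivalent, usually dramatically faster
        return (df["ItemsSold"] * df["UnitPrice"]).sum()

    # inside IPython:
    # %load_ext line_profiler
    # %lprun -f revenue_slow revenue_slow(df)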

However, if you've reached the point where you are sure the algorithm and its implementation are as good as they can get for the problem, it may pay off to cythonize parts of your code for performance. Using the %timeit command from above, we can compare the runtime of the uncythonized version and the cythonized one.

###3.1 Uncythonized version

Paste this into IPython:
    def sum_uncythonized():
        a = 0
        for i in range(100000):
            a += i
        return a
###3.2 Cythonized version

Install Cython if you don't have it already, and within IPython type:

    %load_ext cythonmagic

(in recent versions of Cython, the magic is loaded with %load_ext Cython instead). Then copy-paste the following text as a single block:

    %%cython
    def sum_cythonized():
        cdef long a = 0  # the cdef directive declares a C type for the variable
        cdef int i = 0
        for i in range(100000):
            a += i
        return a
Then view the results:
    %timeit sum_cythonized()
    >>>10000 loops, best of 3: 52.3 µs per loop

    %timeit sum_uncythonized()
    >>>100 loops, best of 3: 3.28 ms per loop
We achieve a speed-up of roughly 60x - close to two orders of magnitude - just by declaring types. Cythonmagic, indeed.

##4 Mistake #3: Not understanding time and timezones

When programmers first work with time, Epoch time can be a strange concept to grasp. The basic concept to understand is that Epoch time is the same number around the world at any given instant, but how this number is translated into hours and minutes of the day depends on the timezone and the time of the year (because of daylight saving time). When working with Python, these translations are handled by the datetime and pytz modules.
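A quick illustration with datetime and pytz (the timestamp value is just an example): the same epoch number corresponds to different wall-clock times depending on the timezone it is viewed in.

    from datetime import datetime
    import pytz

    ts = 1471392000  # one instant in time: 2016-08-17 00:00:00 UTC
    print(datetime.fromtimestamp(ts, pytz.utc))                           # 2016-08-17 00:00:00+00:00
    print(datetime.fromtimestamp(ts, pytz.timezone('Europe/Amsterdam')))  # 02:00 the same day (CEST, +2)
    print(datetime.fromtimestamp(ts, pytz.timezone('US/Pacific')))        # 17:00 the previous day (PDT, -7)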

Python's built-in time handling modules can be confusing both in how the functions are named and in how to convert between different representations of time, much like the C time libraries they are based on internally. It is important to be able to work with them correctly, because time series data is very common. One pain point is timezone handling. A common misconception concerns the following statement:
    dt = datetime.fromtimestamp(utc_number)
When dealing with timezones for the first time, people think this returns the date and time in UTC format. In reality, it returns the date and time in the timezone of the machine the command is being run on. This essentially makes the code non-portable. I learned this the hard way by deploying the same code on a local machine and on a machine in another country and seeing my resulting plots oddly shifted. The portable alternatives are shown below.
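A minimal sketch of the portable alternatives, assuming utc_number holds an epoch timestamp: either ask for UTC explicitly, or attach an explicit timezone.

    from datetime import datetime
    import pytz

    dt_naive = datetime.utcfromtimestamp(utc_number)        # naive datetime, but always in UTC
    dt_aware = datetime.fromtimestamp(utc_number, pytz.utc)  # timezone-aware UTC datetime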

For timezone support, the pytz module works together with the datetime module. It can produce the local time when the UTC time is known, and it also respects daylight saving rules. We can verify this by doing the following:
    from datetime import datetime
    import pytz

    ams = pytz.timezone('Europe/Amsterdam')
    winter = datetime(2000, 11, 1, 10, 0, tzinfo=pytz.utc)
    summer = datetime(2000, 6, 1, 10, 0, tzinfo=pytz.utc)
    print(summer.astimezone(ams))  # CEST (summer time) is +2 hours
    >>>2000-06-01 12:00:00+02:00
    print(winter.astimezone(ams))  # CET (winter time) is +1 hour
    >>>2000-11-01 11:00:00+01:00
However, as described in the official documentation, there are several ways to use the module that give unexpected results, because interoperability with the datetime module is not perfect. In the example below, we'd expect the time difference of the same date and time between UTC and the Amsterdam timezone to be one hour in winter, but it's not:
    td = datetime(2000, 11, 1, 10, 0, tzinfo=pytz.utc) - datetime(2000, 11, 1, 10, 0, tzinfo=ams)
    print(td.total_seconds())
    >>>1200 # 20 minutes?! pytz attached the zone's earliest, long-outdated offset for Amsterdam (+00:20) instead of CET
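The way to get the expected result, as recommended by the pytz documentation, is to construct the local time with the timezone's localize() method instead of passing tzinfo= to the datetime constructor:

    td = datetime(2000, 11, 1, 10, 0, tzinfo=pytz.utc) - ams.localize(datetime(2000, 11, 1, 10, 0))
    print(td.total_seconds())
    >>>3600.0 # one hour, as expected for CET in winter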
Ultimately, Python's native time support is at times counter-intuitive and at times lacking. Current solutions can be made to work, but the exposed API is confusing. Until a library of the caliber and adoption of Java's JodaTime emerges in Python, developers are advised to tread very carefully: test extensively that time methods do what you think they do, always check whether a method returns time in UTC or in local machine time, and prefer storing and transforming times in UTC wherever possible.
Recent libraries that tackle the problem of providing a more intuitive API on top of the maze that is Python time handling include Delorean, times and arrow.
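As a taste of what a more intuitive API looks like, here is a minimal sketch using arrow (assuming the package is installed; the timestamp value is just an example):

    import arrow

    utc = arrow.get(1471392000)         # epoch timestamps are interpreted as UTC
    local = utc.to('Europe/Amsterdam')  # convert to a named timezone
    print(utc)                          # 2016-08-17T00:00:00+00:00
    print(local)                        # 2016-08-17T02:00:00+02:00

(To be continued in part 2.)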


