Wednesday, August 17, 2016

The Top Mistakes Developers Make When Using Python for Big Data Analytics (Part 1)


##1 Introduction

Python is a language universally praised for cutting down development time, but using it efficiently for data analysis is not without its pitfalls. The same features that make development easy in the beginning (dynamic, permissive type system) can be the downfall of large systems; and confusing libraries, slow running times and not designing with data integrity in mind can quickly eat up development time instead.

This article will cover the most common time wasters encountered when working with Python and Big Data and provide suggestions to get back on track and spend time on what really matters: using creativity and scientific methods to generate insights from vast amounts and diverse types of data.

##2 Mistake #1: Reinventing the wheel

The Python community is in a very good state when it comes to data analysis libraries that have rich functionality and have been extensively tested. So, why reinvent the wheel?
I see this a lot during code challenges where the candidate needs to load a CSV file into memory in order to work with it. Quite a few spend a large chunk of time writing custom CSV loading functions and invariably end up with a dictionary of dictionaries which is slow to query and difficult to transform. That leaves them with little time to impress with their ability to generate insights from data.

There truly is no reason to get bogged down with solving already-solved problems - spend a few minutes to do a Google search or ask a more experienced developer for suggestions for a data analysis library.

Incidentally, one such library that's widely used at the time of this writing is Python Pandas. It comes with useful abstractions for dealing with large datasets, a lot of functionality for ETL (extract, transform, load) and good performance. It cuts down developer time by enabling the succinct expression of data transformations and providing functions to load, unify and store data from different sources and formats.

To illustrate the former, let's say we have a CSV file with the header Product, ItemsSold and we want to find the ten most popular products. Let's compare a reasonable implementation in vanilla Python and an implementation using the powerful abstractions of Python Pandas:

###2.1 Vanilla Python
from collections import defaultdict

header_skipped = False
sales = defaultdict(lambda: 0)
with open(filename, 'r') as f:
    for line in f:
        if not header_skipped:
            header_skipped = True
            continue
        line = line.split(",")
        product = line[0]
        num_sales = int(line[1])
        sales[product] += num_sales
top10 = sorted(sales.items(), key=lambda x: x[1], reverse=True)[:10]
###2.2 Pandas
import pandas as pd

data = pd.read_csv(filename)  # the header is conveniently inferred by default
# sort_values replaces the now-deprecated Series.order
top10 = data.groupby("Product")["ItemsSold"].sum().sort_values(ascending=False)[:10]
Notes: Doing the task in vanilla Python does have the advantage of not needing to load the whole file in memory - however, pandas does things behind the scenes to optimize I/O and performance. Additionally, the in-memory sales dictionary of the vanilla Python solution is not lightweight either.

##3 Mistake #2: Not tuning for performance

When programs take too much time to produce any output, the developer's rhythm and focus are broken. Slow programs also limit the amount of experimentation a developer can do - if your program takes ten minutes to output results for a small dataset, you have the possibility to tweak and execute your program only around thirty times per day.

So, when you find yourself sitting idly waiting for the code to execute, it may be time to try and identify bottlenecks. There are specialized utilities available to help developers profile and speed up their code. Most of these work within the IPython interactive shell.

The simplest way to profile code within IPython is to use the %timeit magic command to get the runtime of a Python statement. A more sophisticated tool is the line profiler which you can download. After launching IPython, type:
%load_ext line_profiler
%lprun -f function_to_profile statement_that_invokes_the_function
Subsequently, you get an output of this form which describes what percentage of execution time was spent on which line of the function:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
Using the line profiler has helped me personally identify bottlenecks with the use of the aforementioned Python Pandas library and achieve tenfold speedups by tweaking the implementation.

However, if you've reached the point where you are sure the algorithm is of optimal complexity and implementation for the problem, it may pay off to cythonize parts of your code for performance. Using the %timeit command from above we can compare the runtime of the uncythonized version and the cythonized one.

###3.1 Uncythonized version

Paste this into IPython:
def sum_uncythonized():
    a = 0
    for i in range(100000):
        a += i
    return a
###3.2 Cythonized version

Install cythonmagic if you don't have it already, and within IPython type:
%load_ext cythonmagic

Then copy-paste the following text as a single block:

%%cython
def sum_cythonized():
    cdef long a = 0  # this directive defines a C type for the variable
    cdef int i = 0
    for i in range(100000):
        a += i
    return a
Then view the results:
%timeit sum_cythonized()
>>>10000 loops, best of 3: 52.3 µs per loop

%timeit sum_uncythonized()
>>>100 loops, best of 3: 3.28 ms per loop
We achieve a speed-up of roughly 60 times (close to two orders of magnitude) just by declaring types. Cythonmagic, indeed.

##4 Mistake #3: Not understanding time and timezones

When programmers first work with time, Epoch time can be a strange concept to grasp. The basic concept to understand is that Epoch time is the same number around the world at any given instant but how this number is translated into hours and minutes of the day depends on the timezone and the time of the year (because of daylight savings). When working with Python, these translations are handled by the datetime and pytz modules.

Python's built-in time handling modules can be confusing, both in how the functions are named and in how to convert between different representations of time, much like the C time libraries they are based on internally. It is important to be able to work with them correctly, because time series data is very common. One pain point is timezone handling. A common misconception concerns the following statement:
dt = datetime.fromtimestamp(utc_number)
When dealing with timezones for the first time, people think this returns the date and time in UTC format. In reality, it returns the date and time in the timezone of the machine the command is being run on. This essentially makes the code non-portable. I learned this the hard way by deploying the same code on a local machine and on a machine in another country and seeing my resulting plots oddly shifted.
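To make the intent explicit, spell out which conversion you want. A minimal sketch (the epoch value is just an illustrative number):

from datetime import datetime
import pytz

ts = 1471392000  # an arbitrary example epoch timestamp

print datetime.fromtimestamp(ts)               # local machine time - not portable
print datetime.utcfromtimestamp(ts)            # UTC, but returns a naive datetime
print datetime.fromtimestamp(ts, tz=pytz.utc)  # UTC with tzinfo attached - portable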

For timezone support, the pytz module is a solution which can operate with the datetime module. It can handle creation of the local time when the UTC time is known, and it also respects daylight savings rules. We can verify this by doing the following:
from datetime import datetime
import pytz

ams = pytz.timezone('Europe/Amsterdam')
winter = datetime(2000, 11, 1, 10, 0, tzinfo=pytz.utc)
summer = datetime(2000, 6, 1, 10, 0, tzinfo=pytz.utc)
print summer.astimezone(ams) # CEST (summer time) is +2 hours
>>>2000-06-01 12:00:00+02:00
print winter.astimezone(ams) # CET (winter time) is +1 hour
>>>2000-11-01 11:00:00+01:00
However, as described in the official documentation, there are several ways to use the module that give unexpected results, because interoperability with the datetime module is not perfect. In the example below, we'd expect the time difference of the same date and time between UTC and the Amsterdam timezone to be one hour in winter, but it's not:
td = datetime(2000, 11, 1, 10, 0, tzinfo=pytz.utc) - datetime(2000, 11, 1, 10, 0, tzinfo=ams)
print td.total_seconds()
>>>1200 # 20 minutes? passing tzinfo=ams directly makes pytz fall back to Amsterdam's long-outdated +00:20 local mean time offset
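A safer pattern, recommended by the pytz documentation, is to keep datetimes in UTC and attach local timezones with localize() or astimezone() instead of passing tzinfo directly. A short sketch:

from datetime import datetime
import pytz

ams = pytz.timezone('Europe/Amsterdam')
utc_dt = datetime(2000, 11, 1, 10, 0, tzinfo=pytz.utc)
local_dt = ams.localize(datetime(2000, 11, 1, 10, 0))  # attach the timezone to a naive local time
td = utc_dt - local_dt
print td.total_seconds()  # 3600.0 - the one hour difference we expected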
Ultimately, Python's native time support is at times counterintuitive and at times lacking. Current solutions can be made to work, but the exposed API is confusing. Until a library of the caliber and adoption of Java's JodaTime exists in Python, developers are advised to tread carefully: test extensively that time methods do what you think they do, check whether each method returns UTC or local machine time, and store and compute in UTC wherever possible.
Recent libraries that tackle the issue of providing a more intuitive API on top of the maze that is Python time handling are Delorean, times and arrow.



Monday, August 15, 2016

Scalable Architecture DR CoN: Docker, Registrator, Consul, Consul Template and Nginx


Docker is great fun when you start building things by plugging useful containers together. Recently I have been playing with Consul and trying to plug things together to make a truly horizontally scalable web application architecture. Consul is a Service Discovery and Configuration application, made by HashiCorp, the people who brought us Vagrant.

Previously I experimented with Consul by using SRV records (described here) to create a scalable architecture, but I found this approach a little complicated, and I am all about simple. Then I found Consul Template, which hooks into Consul to update configurations and restart applications when services come up or go down.

In this post I will describe how to use Docker to plug together Consul, Consul Template, Registrator and Nginx into a truly scalable architecture that I am calling DR CoN. Once all plugged together, DR CoN lets you add and remove services from the architecture without having to rewrite any configuration or restart any services, and everything just works!

Docker

Docker is an API wrapper around LXC (Linux containers), so it will only run on Linux. Since I am on OSX (as many of you probably are), I have written a post about how to get Docker running on OSX using boot2docker. This is briefly described below:
  1. brew install boot2docker
  2. boot2docker init  
  3. boot2docker up
This will start a virtual machine running a lightweight Linux distribution with a Docker daemon inside it. To attach to the daemon you can run:
  1. export DOCKER_IP=`boot2docker ip`  
  2. export DOCKER_HOST=`boot2docker socket`
You can test Docker is correctly installed using:
  1. docker ps
Build a very simple Web Service with Docker

To test the DR CoN architecture we will need a service. For this, let's create the simplest service I know how to build (further described here). Create a file called Dockerfile with the contents:
  1. FROM  python:3  
  2. EXPOSE  80  
  3. CMD ["python", "-m", "http.server"]
In the same directory as this file execute:
  1. docker build -t python/server .
This will build the Docker image and tag it python/server, which can be run with:
  1. docker run -it \
  2. -p 8000:80 python/server
To test that it is running we can call the service with curl:
  1. curl $DOCKER_IP:8000
Consul

Consul is best described as a service that has a DNS and an HTTP API. It also has many other features like health checking services, clustering across multiple machines and acting as a key-value store. To run Consul in a Docker container execute:
  1. docker run -it -h node \
  2.  -p 8500:8500 \
  3.  -p 8600:53/udp \
  4.  progrium/consul \
  5.  -server \
  6.  -bootstrap \
  7.  -advertise $DOCKER_IP \
  8.  -log-level debug
If you browse to $DOCKER_IP:8500 there is a dashboard to see the services that are registered in Consul.
To register a service via Consul's HTTP API we can use curl:
  1. curl -XPUT \
  2. $DOCKER_IP:8500/v1/agent/service/register \
  3. -d '{
  4.  "ID": "simple_instance_1",
  5.  "Name":"simple",
  6.  "Port": 8000, 
  7.  "tags": ["tag"]
  8. }'
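As an aside, the same registration can be done from Python; here is a quick sketch using the requests library (an assumption -- any HTTP client works), with DOCKER_IP exported as above:

import json
import os

import requests

consul_addr = os.environ["DOCKER_IP"]
service = {"ID": "simple_instance_1", "Name": "simple", "Port": 8000, "tags": ["tag"]}
resp = requests.put("http://{0}:8500/v1/agent/service/register".format(consul_addr),
                    data=json.dumps(service))
print(resp.status_code)  # 200 means Consul accepted the registration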
Then we can query Consul's DNS API for the service using dig:
  1. dig @$DOCKER_IP -p 8600 simple.service.consul
  1. ; <<>> DiG 9.8.3-P1 <<>> simple.service.consul
  2. ;; global options: +cmd
  3. ;; Got answer:
  4. ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39614
  5. ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

  6. ;; QUESTION SECTION:
  7. ;simple.service.consul.        IN    A

  8. ;; ANSWER SECTION:
  9. simple.service.consul.    0    IN    A    192.168.59.103

  10. ;; Query time: 1 msec
  11. ;; SERVER: 192.168.59.103#53(192.168.59.103)
  12. ;; WHEN: Mon Jan 12 15:35:01 2015
  13. ;; MSG SIZE  rcvd: 76
Hold on, there is a problem: where is the port of the service? Unfortunately, DNS A records do not return the port of a service; to get that we must query the SRV records:
  1. dig @$DOCKER_IP -p 8600 SRV simple.service.consul
  1. ; <<>> DiG 9.8.3-P1 <<>> SRV simple.service.consul
  2. ;; global options: +cmd
  3. ;; Got answer:
  4. ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3613
  5. ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

  6. ;; QUESTION SECTION:
  7. ;simple.service.consul.        IN    SRV

  8. ;; ANSWER SECTION:
  9. simple.service.consul.    0    IN    SRV    1 1 8000 node.node.dc1.consul.

  10. ;; ADDITIONAL SECTION:
  11. node.node.dc1.consul.    0    IN    A    192.168.59.103

  12. ;; Query time: 1 msec
  13. ;; SERVER: 192.168.59.103#53(192.168.59.103)
  14. ;; WHEN: Mon Jan 12 15:36:54 2015
  15. ;; MSG SIZE  rcvd: 136
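For completeness, the same SRV lookup can be done programmatically; here is a small Python sketch using the dnspython package (an assumption -- it is not part of this setup):

import dns.resolver  # pip install dnspython

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["192.168.59.103"]  # $DOCKER_IP
resolver.port = 8600                       # Consul's DNS port mapped above
for srv in resolver.query("simple.service.consul", "SRV"):
    print("{0} {1}".format(srv.target, srv.port))  # node.node.dc1.consul. 8000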
SRV records are difficult to use because many technologies do not support them.
The srv-router container can be used with Consul and nginx to route incoming calls to the correct services, as described here. However, there is an easier way to have nginx route to services.

Registrator

Registrator takes environment variables defined when a Docker container is started and automatically registers it with Consul. For example:
  1. docker run -it \
  2. -v /var/run/docker.sock:/tmp/docker.sock \
  3. -h $DOCKER_IP progrium/registrator \
  4. consul://$DOCKER_IP:8500
Starting a service with:
  1. docker run -it \
  2. -e "SERVICE_NAME=simple" \
  3. -p 8000:80 python/server
Will automatically add the service to Consul, and stopping it will remove it. This is the first piece to plug into DR CoN, as it means no more manually registering services with Consul.
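To convince yourself that Registrator did its job, you can ask Consul's catalog for the service; a rough sketch in Python with the requests library (again an assumption, any HTTP client works):

import os

import requests

consul_addr = os.environ["DOCKER_IP"]
entries = requests.get("http://{0}:8500/v1/catalog/service/simple".format(consul_addr)).json()
for entry in entries:
    # each entry describes one registered instance of the "simple" service
    print("{0} {1} {2}".format(entry["ServiceID"], entry["Address"], entry["ServicePort"]))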

##Consul Template

Consul Template uses Consul to update files and execute commands when it detects the services in Consul have changed.

For example, it can rewrite an nginx.conf file to include all the routing information of the services then reload the nginx configuration to load-balance many similar services or provide a single end-point to multiple services.

I modified the Docker container from https://github.com/bellycard/docker-loadbalancer for this example
  1. FROM nginx:1.7

  2. #Install Curl
  3. RUN apt-get update -qq && apt-get -y install curl

  4. #Download and Install Consul Template
  5. ENV CT_URL http://bit.ly/15uhv24
  6. RUN curl -L $CT_URL | \
  7. tar -C /usr/local/bin --strip-components 1 -zxf -

  8. #Setup Consul Template Files
  9. RUN mkdir /etc/consul-templates
  10. ENV CT_FILE /etc/consul-templates/nginx.conf

  11. #Setup Nginx File
  12. ENV NX_FILE /etc/nginx/conf.d/app.conf

  13. #Default Variables
  14. ENV CONSUL consul:8500
  15. ENV SERVICE consul-8500

  16. # Command will
  17. # 1. Write Consul Template File
  18. # 2. Start Nginx
  19. # 3. Start Consul Template

  20. CMD echo "upstream app {                 \n\
  21.   least_conn;                            \n\
  22.   {{range service \"$SERVICE\"}}         \n\
  23.   server  {{.Address}}:{{.Port}};        \n\
  24.   {{else}}server 127.0.0.1:65535;{{end}} \n\
  25. }                                        \n\
  26. server {                                 \n\
  27.   listen 80 default_server;              \n\
  28.   location / {                           \n\
  29.     proxy_pass http://app;               \n\
  30.   }                                      \n\
  31. }" > $CT_FILE; \
  32. /usr/sbin/nginx -c /etc/nginx/nginx.conf \
  33. & CONSUL_TEMPLATE_LOG=debug consul-template \
  34.   -consul=$CONSUL \
  35.   -template "$CT_FILE:$NX_FILE:/usr/sbin/nginx -s reload";
The repository for this file is here.

NOTE: the \n\ writes a newline into the file and escapes the line break so the multiline Docker CMD can continue
This Docker container will run both Consul Template and nginx, and when the services change it will rewrite the nginx app.conf file, then reload nginx.

This container can be built with:
  1. docker build -t drcon .
and run with:
  1. docker run -it \
  2. -e "CONSUL=$DOCKER_IP:8500" \
  3. -e "SERVICE=simple" \
  4. -p 80:80 drcon
SERVICE is the query used to select which services to include from Consul, so this DR CoN container will now load balance across all services named simple.

##All Together

Let's now plug everything together!

Run Consul
  1. docker run -it -h node \
  2.  -p 8500:8500 \
  3.  -p 53:53/udp \
  4.  progrium/consul \
  5.  -server \
  6.  -bootstrap \
  7.  -advertise $DOCKER_IP
Run Registrator
  1. docker run -it \
  2. -v /var/run/docker.sock:/tmp/docker.sock \
  3. -h $DOCKER_IP progrium/registrator \
  4. consul://$DOCKER_IP:8500
Run DR CoN
  1. docker run -it \
  2. -e "CONSUL=$DOCKER_IP:8500" \
  3. -e "SERVICE=simple" \
  4. -p 80:80 drcon
Running curl $DOCKER_IP:80 will return:
  1. curl: (52) Empty reply from server
Now start a service named simple
  1. docker run -it \
  2. -e "SERVICE_NAME=simple" \
  3. -p 8000:80 python/server
This will cause:
  • Registrator to register the service with Consul
  • Consul Template to rewrite the nginx.conf then reload the configuration
Now curl $DOCKER_IP:80 will be routed successfully to the service.

If we then start another simple service on a different port with:
  1. docker run -it \
  2. -e "SERVICE_NAME=simple" \
  3. -p 8001:80 python/server
Requests will now be load balanced across the two services.

A fun thing to do is to run while true; do curl $DOCKER_IP:80; sleep 1; done while killing and starting simple services, and watch how this all happens so fast that no requests get dropped.

Conclusion

Architectures like DR CoN are much easier to describe, distribute and implement using Docker, and are impossible without good tools like Consul. Plugging things together and playing with Docker's ever more powerful tools is fun and useful. Now I can create a horizontally scalable architecture and have everything just work.
Written by Graham Jenson


Sunday, August 14, 2016

Optimizing Python - a Case Study


Developers love Python because it's quick to write, easy to learn, and is -- mostly -- fast enough. The qualifier there means you'll have situations where it just isn't. There's good news -- You have plenty of options to make your code faster.
  1. Profile and optimize your existing code
  2. Use a C module (or write your own)
  3. Try a JIT-enabled interpreter like Jython or PyPy
  4. Parallelize your workload
The difficulty of each depends heavily on your program. We'll be talking about profiling and optimizing pure Python code with the help of a few tools. In this article, we'll see how to use profilers to improve disq's performance by about a third.

Optimizing Python -- The How and Why

Before we get down to business, let's talk about optimization. Optimization isn't about flailing around tossing @lru_cache around all your functions. Instead, it's a series of simple steps repeated until your program is "fast enough".
First, know your program's behavior. The easiest way to make a program 100% faster is to echo 'exit(99)' > server.py, but that takes the "viable" out of your MVP. If the work weren't necessary, we wouldn't be optimizing it. You need to be able to repeatably verify your application is correct with automated -- ideally fast -- tests at the unit and system level.

Next, pick a metric to optimize for. You'll be trading between several dimensions, so know what matters to you. Memory usage, CPU time, bandwidth, disk I/O, and storage space are all factors you'll find yourself trading between in a sufficiently large application. Build measurements into your automated tests, or make a separate system. The closer to real world usage you can make these performance metrics, the better off you will be. Performance is sensitive to more factors than you can count: CPU cache size, network buffer size, kernel version, operating system, dependency versions, and more can all skew your numbers.

You have a goal: use less blank, or do blank faster. You know how much blank you're using and are ready to start changing things. Look at the metrics and find where you can have the biggest impact. Where is the hottest loop, the biggest object, or the most expensive operation?

The software we'll be optimizing is disq, a client for Disque. While Disque is still in alpha and hasn't been rigorously benchmarked, profiling the client is still worthwhile. We'll follow the steps outlined earlier: verify, measure, alter, repeat. The "verify" step is already handled for this client, so let's dive into the measuring.

Optimize Slow Code First

The use case we'll be optimizing for is the fast consumption and acknowledgement of jobs by workers. Disque is built for many small jobs, so it makes sense that each worker would be consuming huge numbers of them. We'll pretend we have a worker program that follows (roughly) these steps.
  1. Get a job (or wait for one)
  2. Do some work
  3. Acknowledge the job is done (so disque doesn't retry)
  4. Repeat forever
We're already doing the work in literally no time, so now we want to trim the fat from our Disque client so we can process more jobs.

benchmarking_script.py
  1. from disq import Disque
  2. @profile
  3. def read_jobs():
  4.     client = Disque()
  5.     while True:
  6.         job = client.getjob('q', timeout_ms=1)
  7.         if job is None:
  8.             break
  9.         """normally you'd do work here"""
  10.         client.ackjob(job[1])
Line profiler can show where execution time is spent line by line. Note the "@profile" decorator on the read_jobs benchmark function. line_profiler provides the kernprof command that will collect information about the code.
  1. // -l forces line-by-line profiling
  2. // -v prints the results of the profiling immediately
  3. $ kernprof -l -v first_script.py
  4. Timer unit: 1e-06 s (microsecond)
  5. Function: read_jobs

  6. Line num    Hits         Time  Per Hit   % Time  Line Contents
  7.      5                                           @profile
  8.      6                                           def read_jobs():
  9.      7         1         2273   2273.0      0.3      client = Disque()
  10.      8      1001         1776      1.8      0.2      while True:
  11.      9      1001       530698    530.2     65.6          job = client.getjob('q', timeout_ms=1)
  12.     10      1001         1282      1.3      0.2          if job is None:
  13.     11         1            2      2.0      0.0              break
  14.     12                                                   // normally you'd do work here
  15.     13      1000       273414    273.4     33.8          client.ackjob(job[1])
Immediately, we can see that acknowledging a job takes half as long as retrieving one. We need to add the @profile decorator to the getjob function in the disq client.py. This turns out to be uninformative because getjob just calls self._getjob.

Interactive Profiling for Python

We could continue decorating each level and viewing the results; instead let's try a different tool. There's an interactive profiler for Python that covers our needs a bit better.


We can drill right down to see where the most time is being spent. A full 11 percent of the time is being spent just getting a connection. No network action, just pulling a connection to use from the pool.
That time is being spent in this snippet of rolling_counter (full code available here).
  1. def _expire(self):
  2.     """ called when a connection is retrieved """
  3.     """ cast key iterable to list because this loop can delete keys"""
  4.     for k in list(six.iterkeys(self._counts)):
  5.         """ find the location where all times are less than (current - ttl)
  6.         and delete all lesser elements """
  7.         del self._counts[k][
  8.             :self._counts[k].bisect(time.time() - self._ttl_seconds)
  9.         ]
  10.         if len(self._counts[k]) == 0:
  11.             self.remove(k)
See what takes so long? We bisect a sorted list then slice it to remove times older than the sliding window of messages. Why is that there?
If a consumer sees a high message rate received from foreign nodes, it may optionally have logic in order to retrieve messages directly from the nodes where producers are producing the messages for a given topic. The consumer can easily check the source of the messages by checking the Node ID prefix in the messages IDs.
-- disque docs
It's an optional behavior that accounts for 11% of the time it takes to send a message out. Turns out that's an expensive default, but I had implemented it without checking its impact on performance (there's a lesson there). Let's make it optional, since not all users will want to take the performance penalty.

With the option in place, let's see the difference between enabled and disabled.

With connection counting 

We pay almost a full second of execution time over 1000 messages to count how many jobs come from each node in the cluster. If you're keeping score at home, that's a full millisecond per message.


Without connection counting 

Without job source counting, the total runtime decreases from 3.87 to 2.88 seconds. This is definitely worth a change to the library's default behavior.

Speeding Up RollingCounter

Now let's try to improve connection counting for users that do want it. Here's a starting point (courtesy of lineprof).
  1. File: counting_profiler.py
  2. Function: count_incoming at line 7

  3. Line num    Hits         Time  Per Hit   % Time  Line Contents
  4.      7                                           @profile
  5.      8                                           def count_incoming():
  6.      9                                               // Set the count lifetime really low
  7.     10         1           11     11.0      0.0      rc = RollingCounter(ttl_secs=0.1)
  8.     11     10001        10424      1.0      2.5      for i, _ in izip(cycle('1234567890'), xrange(10000)):
  9.     12     10000       306433     30.6     73.4          rc.add(i)
  10.     13
  11.     14        66        29167    441.9      7.0      while rc.max() is not None:
  12.     15        65        71697   1103.0     17.2          sleep(0.001)
Ok, so adding takes a hefty 73% of our runtime, it's going to be the most frequently called method, and most of that time is spent inserting the timestamp into the sortedlist of times messages were received. Think for a second: time only ever increases, so we can safely switch to an unsorted list and use append, skipping the cost of keeping values sorted.

Switching from blist.sortedlist to list only required 3 changed lines; here's the commit that made the change.
  1. File: counting_profiler.py
  2. Function: count_incoming at line 7

  3. Line num    Hits         Time  Per Hit   % Time  Line Contents
  4.      7                                           @profile
  5.      8                                           def count_incoming():
  6.      9                                               // Set the count lifetime really low
  7.     10         1           11     11.0      0.0      rc = RollingCounter(ttl_secs=0.1)
  8.     11     10001         8098      0.8      6.3      for i, _ in izip(cycle('1234567890'), xrange(10000)):
  9.     12     10000        18626      1.9     14.6          rc.add(i)
  10.     13
  11.     14        79        11993    151.8      9.4      while rc.max() is not None:
  12.     15        78        88966   1140.6     69.7          sleep(0.001)
After switching to list, the add function is 30 times faster, an enormous savings. Even better, switching to Python's stdlib bisect function cut the time it takes to find the most frequent node by 75 percent.
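To make the idea concrete, here is a minimal sketch of the approach (not disq's actual implementation): because timestamps only ever increase, appending to a plain list keeps it sorted for free, and bisect finds the expiry cutoff cheaply.

import time
from bisect import bisect_left
from collections import defaultdict

class RollingCounterSketch(object):
    """Counts events per key within a sliding time window."""

    def __init__(self, ttl_secs=5.0):
        self._ttl = ttl_secs
        self._counts = defaultdict(list)

    def add(self, key):
        # time.time() is non-decreasing, so append keeps each list sorted
        self._counts[key].append(time.time())

    def _expire(self):
        cutoff = time.time() - self._ttl
        for key in list(self._counts):
            times = self._counts[key]
            del times[:bisect_left(times, cutoff)]  # drop everything older than the window
            if not times:
                del self._counts[key]

    def max(self):
        # key with the most events inside the window, or None if the window is empty
        self._expire()
        if not self._counts:
            return None
        return max(self._counts, key=lambda k: len(self._counts[k]))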

Python Performance in Practice

Building performant systems is hard work. Duh: that's why there are so many systems that aren't performant. The first step to improving your system is to have measurements in place that are easy to test between changes. For my projects, I use tox as a test runner because it provides the flexibility to define any environments you need -- not just unittest/py.test/nose commands.

To track performance, I use pytest-bench in a tox benchmarking environment that's as simple as tox -ebenchmark and spits out the results for several test workloads. The tox.ini file below is excerpted, and available in full here.
  1. [testenv]
  2. # exit after 2 failures, report fail info, log the 3 slowest tests, display test coverage within the module
  3. commands = py.test --maxfail=2 -rf --durations=3
  4.            --cov disq
  5.            --cov-report html
  6.            --cov-report term
  7.            --benchmark-skip {posargs}
  8. setenv = VIRTUAL_ENV={envdir}
  9.          PYTHONHASHSEED=0
  10. deps = -r{toxinidir}/requirements.txt
  11.        -r{toxinidir}/test-requirements.txt

  12. [testenv:benchmark]
  13. # benchmarking environment that skips all non-benchmark tests
  14. commands = py.test -rf --benchmark-only {posargs}

  15. [testenv:pep8]
  16. # PEP8 environment that forces me to have consistent style
  17. commands = pep8
  18. setenv = VIRTUAL_ENV={envdir}
  19.          PYTHONHASHSEED=0
The output of a benchmark run is simple enough; compare these two results from before and after the changes discussed here.
  1. ## Before
  2. --- benchmark: 4 tests, min 5 rounds (of min 25.00us), 1.00s max time --
  3. Name (time in us)                      Min        Max      Mean   StdDev

  4. test_addjob_benchmarks             72.0024   323.0572  103.5393  16.0487
  5. test_addjob_no_replicate_bench     40.7696   173.0919   45.2473   7.1136
  6. test_addjob_async_bench            42.9153  1088.8577   51.9152  12.2333
  7. test_getjob_bench                  79.8702   191.9270   92.0623   9.3458

  8. ## After
  9. --- benchmark: 4 tests, min 5 rounds (of min 25.00us), 1.00s max time --
  10. Name (time in us)                      Min       Max      Mean    StdDev

  11. test_addjob_benchmarks             71.0487  856.8764  103.6685   21.6713
  12. test_addjob_no_replicate_bench     39.8159  432.0145   45.0144    7.7360
  13. test_addjob_async_bench            41.0080  276.8040   44.5564    6.9397
  14. test_getjob_bench                  66.9956  213.1462   73.8683    9.1130
The real savings was in getjob, which is where we were bisecting the sorted list, and now use the stdlib bisect function. And now we can see when performance regresses for new changes, since benchmarks are a part of test runs.

Optimization Lessons

Any application (Python or not) can benefit from the verify-measure-alter-repeat pattern we used to improve the performance of disq by more than 30 percent. Next time you get a feeling that something is doing more work than necessary, follow your judgement and find out just how much you're paying for the layer of indirection you just added--you might be surprised.
All of the tools outlined here: line_profiler, profiling, tox, and pytest-bench are worthy additions to your toolbox. Use them. Try them out on your pet project and get comfortable with changing code to improve performance. Write code that does as little work as possible. The disq library turned out to be spending huge amounts of time doing unnecessary work, and our profilers were able to guide us to it.
Written by Ryan Brown



Thursday, August 11, 2016

All you need to know about servers as a Python developer


##Introduction

Servers have confused me quite a bit. I'm sure there are many Python developers who share the same predicament. So let me try to clear the mist of confusion by sharing everything I know about servers.

##HTTP: The protocol which rules the world wide web!

HTTP (Hypertext Transfer Protocol) is a communication protocol used to send and receive webpages and other data files on the internet; it is the set of rules and specs which govern those transfers.

A browser is an HTTP client because it sends requests to an HTTP server (Web server), which then sends responses back to the client. The standard (and default) port for HTTP servers to listen on is 80, though they can use any port. This article explains HTTP really well. Please do go through it.
If you want to be geeky about it, check the HTTP/1.1 spec, which was superseded by multiple RFCs (7230-7237). Search for these RFCs at the IETF.

##HTTP Server

So HTTP requests and responses have a format! When a user visits a web site, their browser makes a connection to the site's web server (this is called the request). The server looks up the file in the file system and sends it back to the user's browser, which displays it (this is the response). This is roughly how the underlying protocol, HTTP, works. Seems simple?

Dynamic web sites are not based on files in the file system, but rather on programs which are run by the web server when a request comes in, and which generate the content that is returned to the user. They can do all sorts of useful things, like display the postings of a bulletin board, show your email, configure software, or just display the current time.

Irrespective of how the client or the server has been implemented, there will always be a way to form a valid HTTP Request for the client to work and similarly the server needs to be capable of understanding the HTTP Requests sent to it and form valid HTTP Responses to all the arrived HTTP Requests. Both the client and the server machines should also be equipped with the capability of establishing the connection to each other (in this case it'll be a TCP reliable connection) to be able to transfer the HTTP Requests (client -> server) and HTTP Responses (server -> client).

The HTTP server (a program) will accept this request and will let your Python script get the HTTP request method and URI. The HTTP server will handle many requests for images and static resources. What about the dynamically generated URLs?
  1. @app.route('/displaynews/<name_of_category>', methods=['GET'])
You might have used this decorator in Flask, a microframework for Python. Flask will pattern-match this route against the request from the browser. But how does Flask get the HTTP request from the browser in the first place? The HTTP server passes the dynamically generated URLs to the application server. Whoa! Wait... what are application servers now?

Apache HTTPD and nginx are the two common web servers used with Python.

##Application Servers

Most HTTP servers are written in C or C++, so they cannot execute Python code directly – a bridge is needed between the server and the program. These bridges, or rather interfaces, define how programs interact with the server. This is the application server. The dynamically generated URLs are passed from the web server to the application server. The application server matches the URL to a route and runs the script for that route. It then returns the response to the web server, which formulates an HTTP response and returns it to the client.

Common Gateway Interface (CGI)

This interface, most commonly referred to as “CGI”, is the oldest, and is supported by nearly every web server out of the box. Programs using CGI to communicate with their web server need to be started by the server for every request. So, every request starts a new Python interpreter – which takes some time to start up – thus making the whole interface only usable for low load situations.
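For illustration, a CGI program can be as small as the sketch below (a hypothetical hello.py, not tied to any framework); the server runs it as a fresh process for every request, which is exactly where the startup cost comes from.

#!/usr/bin/env python
# hello.py - a minimal CGI script: the web server starts a new Python process
# for every request that reaches it; it prints headers, a blank line, then the body
import sys

sys.stdout.write("Content-Type: text/plain\r\n\r\n")
sys.stdout.write("Hello from CGI\n")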

####mod_python

mod_python is an Apache HTTP Server module that integrates the Python programming language with the server. For several years in the late 1990s and early 2000s, Apache configured with mod_python ran most Python web applications. However, mod_python wasn't a standard specification and there were many issues while using it. A consistent way to execute Python code for web applications was needed.

FastCGI and SCGI were other low-level gateways used for deployment. They tried to solve the performance problems of CGI.

These low-level gateway interfaces are language agnostic.

##Rise of WSGI

A Web Server Gateway Interface (WSGI) server implements the web server side of the WSGI interface for running Python web applications. WSGI scales and can work in both multithreaded and multi-process environments. We can also write middleware with WSGI; middleware is useful for session handling, authentication and much more. You can read about how to write your own WSGI implementation on Armin's blog. A comparison between different WSGI implementations is given at this link.
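To make the interface concrete, here is a minimal WSGI application sketch; any WSGI server (Gunicorn, uWSGI, or the stdlib wsgiref used below for local testing) can serve a callable with this signature.

def application(environ, start_response):
    # environ is a dict describing the request; start_response sets status and headers
    body = b"Hello from a WSGI app\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]

if __name__ == "__main__":
    # quick local test without Gunicorn or uWSGI
    from wsgiref.simple_server import make_server
    make_server("", 8000, application).serve_forever()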

####Gunicorn and uWSGI

Gunicorn and uWSGI are two different application servers.
Gunicorn 'Green Unicorn' is a Python WSGI HTTP server for UNIX. It is very simple to configure, compatible with many web frameworks and it's fairly speedy. This article by DigitalOcean shows how to configure Gunicorn with nginx.

uWSGI is another option for an application server; it is a high-performance, powerful WSGI server with many configuration options. This article by DigitalOcean shows how to configure uWSGI with nginx.

##Apache vs Nginx

Anturis has explained quite lucidly the differences between the two on their blog. This post explains how Apache and nginx work.

To summarize:
  • Apache creates processes and threads to handle additional connections, while Nginx is event-driven, asynchronous, and non-blocking.
  • Apache is powerful but Nginx is fast. Nginx serves up static content quicker.
  • Nginx includes advanced load balancing and caching abilities.
  • Nginx is a lot lighter than Apache
The organicagency benchmarked the performance of Apache and nginx. The results are available here.
##What I use

I use Nginx because it is fast, light and I find the configuration quite easy. Gunicorn is very simple to configure, so I use Gunicorn. uWSGI is also often used instead of Gunicorn.
Written by Nupur Gulalkari


Wednesday, August 10, 2016

Getting Started with Behavior Testing in Python with Behave (Part 2, end)

Writing Tableized Tests

Often when writing tests we want to test the same behavior against many different parameters and check the results. Behave makes this easier to do by providing tools to create a tableized test instead of writing out each test separately. The next game logic to test is that the dealer knows the point value of its hand. Here is a test that checks several scenarios:
  1. Scenario Outline: Get hand total
  2.   Given a <hand>
  3.   When the dealer sums the cards
  4.   Then the <total> is correct
  5.   Examples: Hands
  6.   | hand          | total |
  7.   | 5,7           | 12    |
  8.   | 5,Q           | 15    |
  9.   | Q,Q,A         | 21    |
  10.   | Q,A           | 21    |
  11.   | A,A,A         | 13    |
You should recognize the familiar "given, when, then" pattern, but there are a lot of differences in this test. First, it is called a "Scenario Outline". Next, it uses parameters in angle brackets that correspond to the headers of the table. Finally, there's a table of inputs ("hand") and outputs ("total").

The steps will be similar to what we've seen before, but we'll now get to use the parameterized steps feature of Behave.

Here's how to implement the new "given" step:
  1. @given('a {hand}')
  2. def step_impl(context, hand):
  3.     context.dealer = Dealer()
  4.     context.dealer.hand = hand.split(',')
The angle brackets in the dealer.feature file are replaced with braces, and the hand parameter becomes an object that is passed to the step, along with the context.

Just like before, we create a new Dealer object, but this time we manually set the dealer's cards instead of generating them randomly. Since the hand parameter is a simple string, we split the parameter to get a list.

Next, add the remaining steps:
  1. @when('the dealer sums the cards')
  2. def step_impl(context):
  3.     context.dealer_total = context.dealer.get_hand_total()
  4. @then('the {total:d} is correct')
  5. def step_impl(context, total):
  6.     assert (context.dealer_total == total)
The "when" step is nothing new, and the "then" step should look familiar. If you're wondering about the ":d" after the total parameter, that is a shortcut to tell Behave to treat the parameter as an integer. It saves us from manually casting with the int() function. Here's a complete list of patterns that Behave accepts and if you need advanced parsing, you can define your own pattern.

There are many different approaches to summing the values of cards, but here's one solution to find the total of the dealer's hand. Create this as a top-level function in the twentyone.py module:
  1. def _hand_total(hand):
  2.     values = [None, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 10]
  3.     value_map = {k: v for k, v in zip(_cards, values)}
  4.     total = sum([value_map[card] for card in hand if card != 'A'])
  5.     ace_count = hand.count('A')

  6.     for i in range(ace_count, -1, -1):
  7.         if i == 0:
  8.             total = total + ace_count
  9.         elif total + (i * 11) + (ace_count - i) <= 21:
  10.             total = total + (i * 11) + ace_count - i
  11.             break
  12.     return total
In short, the function maps the card character strings to point values and sums the values. However, aces have to be handled separately because they can be worth 1 or 11 points.
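As a quick sanity check (in a plain Python shell, assuming twentyone.py is importable and its _cards list starts with the ace, as the value list above implies), the function reproduces the examples from the table:

from twentyone import _hand_total

assert _hand_total(['5', 'Q']) == 15
assert _hand_total(['Q', 'A']) == 21       # one ace counted as 11
assert _hand_total(['A', 'A', 'A']) == 13  # one ace as 11, two as 1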

We also need to give the dealer the ability to total its cards. Add this function to the Dealer class:
  1. def get_hand_total(self):
  2.     return _hand_total(self.hand)
If you run behave now, you'll see that each example in the table runs as its own scenario. This saves a lot of space in the features file, but still gives us rigorous tests that pass or fail individually.

We'll add one more tableized test, this time to test that the dealer plays by the rules. Traditionally, the dealer must play "hit" until he or she has 17 or more points. Add this scenario outline to test that behavior:
  1. Scenario Outline: Dealer plays by the rules
  2.   Given a hand <total>
  3.    when the dealer determines a play
  4.    then the <play> is correct

  5.   Examples: Hands
  6.   | total  | play   |
  7.   | 10     | hit    |
  8.   | 15     | hit    |
  9.   | 16     | hit    |
  10.   | 17     | stand  |
  11.   | 18     | stand  |
  12.   | 19     | stand  |
  13.   | 20     | stand  |
  14.   | 21     | stand  |
  15.   | 22     | stand  |
Before we add the next steps, it's important to understand that when using parameters, the order matters. Parameterized steps should be ordered from most restrictive to least restrictive. If you do not do this, the correct step may not be matched by Behave. To make this easier, group your steps by type. Here is the new given step, ordered properly:
  1. @given('a dealer')
  2. def step_impl(context):
  3.     context.dealer = Dealer()
  1. ## NEW STEP
  2. @given('a hand {total:d}')
  3. def step_impl(context, total):
  4.     context.dealer = Dealer()
  5.     context.total = total
  1. @given('a {hand}')
  2. def step_impl(context, hand):
  3.     context.dealer = Dealer()
  4.     context.dealer.hand = hand.split(',')
The typed parameter {total:d} is more restrictive than the untyped {hand}, so it must come earlier in the file.

The new "when" step is not parameterized and can be placed anywhere, but, for readability, should be grouped with the other when steps:
  1. @when('the dealer determines a play')
  2. def step_impl(context):
  3.     context.dealer_play = context.dealer.determine_play(context.total)
Notice that this test expects a determine_play() method, which we can add to the Dealer class:
  1. def determine_play(self, total):
  2.     if total < 17:
  3.         return 'hit'
  4.     else:
  5.         return 'stand'
Last, the "then" step is parameterized so it needs to also be ordered properly:
  1. @then('the dealer gives itself two cards')
  2. def step_impl(context):
  3.     assert (len(context.dealer.hand) == 2)
  1. @then('the {total:d} is correct')
  2. def step_impl(context, total):
  3.     assert (context.dealer_total == total)
  1. ## NEW STEP
  2. @then('the {play} is correct')
  3. def step_impl(context, play):
  4.     assert (context.dealer_play == play)
Putting Everything Together

We're going to add one final test that will tie together all of the code we've just written. We've proven to ourselves with tests that the dealer can deal itself cards, determine its hand total, and make a play separately, but there's no code to tie this together. Since we are emphasizing test-driven development, let's add a test for this behavior.
  1. Scenario: A Dealer can always play
  2.   Given a dealer
  3.   When the round starts
  4.   Then the dealer chooses a play
We already wrote steps for the "given" and "when" statements, but we need to add a step for "the dealer chooses a play." Add this new step, and be sure to order it properly:
  1. @then('the dealer gives itself two cards')
  2. def step_impl(context):
  3.     assert (len(context.dealer.hand) == 2)
  1. #NEW STEP
  2. @then('the dealer chooses a play')
  3. def step_impl(context):
  4.     assert (context.dealer.make_play() in ['stand', 'hit'])
  1. @then('the {total:d} is correct')
  2. def step_impl(context, total):
  3.     assert (context.dealer_total == total)
This test relies on a new method make_play() that you should now add to the Dealer class:
  1. def make_play(self):
  2.     return self.determine_play(self.get_hand_total())
This method isn't critical, but makes it easier to use the Dealer class.

If you've done everything correctly, running behave should display all of the tests and give a summary similar to this:
  1. 1 feature passed, 0 failed, 0 skipped
  2. 16 scenarios passed, 0 failed, 0 skipped
  3. 48 steps passed, 0 failed, 0 skipped, 0 undefined
  4. Took 0m0.007s
Conclusion

This tutorial walked you through setting up a new project with the Behave library and using test-driven development to build the code based off of behavioral tests.

If you would like to get experience writing more tests with this project, try implementing a Player class and player.feature that plays with some basic strategy.
Written by Phillip Johnson




Monday, August 8, 2016

Using Python and QGIS for geospatial visualizations - a Case Study


##Introduction

This tutorial will guide you through a typical day in the life of a Data Scientist who needs to obtain, clean, augment and visualize a geospatial dataset. Our tools will be Python, the BeautifulSoup, pandas and Nominatim libraries and also the open source mapping software QGIS which is widely used in GIS organizations.

Our dataset will be the reports of UFO sightings across the United States which can be found here from the National UFO Reporting Center. Our goal will be to create a visualization on the map of the UFOs seen the past 12 months. The aim of the visualization will be to showcase the dataset, explore it, and understand better the behavior of the alleged UFOs. The visualization will be within the mapping program, because QGIS is particularly suited for quick exploratory analysis of geospatial data. We will also have the ability to export the visualization as a video or animation and share it with other users of the program.

##First task: Extracting the data from the web

Visiting the website of the UFO sighting reports, we see that the data is available in a fairly structured format - every month has its own page, and every monthly page has a row per UFO sighting with time, city, state, shape and description as the columns. So we just need to download the last 12 monthly pages and extract the data from each page's HTML. One appropriate library which can parse the DOM and can then be queried easily is BeautifulSoup, so we will use it to get the links to the last 12 months from the main page and subsequently extract the information.

###Code
from datetime import datetime
from itertools import chain
from urllib import urlopen
from bs4 import BeautifulSoup
import pandas as pd

base_url = "http://www.nuforc.org/webreports/"
index_url = "http://www.nuforc.org/webreports/ndxevent.html"

def can_cast_as_dt(dateStr, fmt):
    try:
        datetime.strptime(dateStr, fmt)
        return True
    except ValueError:
        return False

# get the index page
source = BeautifulSoup(urlopen(index_url), "html5lib")
# get all the links in the index page
monthly_urls = map(lambda x: (x.text, base_url + x['href']), source('a'))
# get the last 12 links that have a text like 06/2015
last_year_urls = filter(lambda x: can_cast_as_dt(x[0], "%m/%Y"), monthly_urls)[:12]
# extract the data from each of the last 12 monthly pages and flatten the lists of tuples
# (get_data_from_url is defined below)
last_year_ufos = list(chain(*map(lambda x: get_data_from_url(x[1]), last_year_urls)))
# initialize a pandas DataFrame with the list of tuples
ufos_df = pd.DataFrame(last_year_ufos, columns=["start", "city", "state", "shape", "duration_description"])
One thing I'd like to point out about the code above is the use of html5lib. It turns out the HTML in the NUFORC website is not formed perfectly after all - it contains two HTML closing tags instead of one. The default parser of BeautifulSoup doesn't handle this case, so that's why we pass the "html5lib" argument. Below you can see the function used to extract the data from each monthly page, again taking advantage of the structure of the page and using BeautifulSoup. We also convert the date string into a Python datetime. The format of the dates on the page is actually ambiguous, generally of the form 01/21/15, but for the scope of this tutorial, datetime's inference of the century works fine - 2015 is correctly inferred.
def get_data_from_url(url):
    print "Processing {}".format(url)
    data = []
    source = BeautifulSoup(urlopen(url), "html5lib")
    for row in source('tr'):
        if not row('td'):
            continue  # header row
        row_data = row('td')
        # parse the datetime from the string
        dt = parse_dt(row_data[0].text)
        city = row_data[1].text
        state = row_data[2].text
        shape = row_data[3].text
        duration = row_data[4].text
        data.append((dt, city, state, shape, duration))
    return data

def parse_dt(dateStr):
    # the data in the website comes in two different formats, try both
    for fmt in ["%m/%d/%y %H:%M", "%m/%d/%y"]:
        try:
            return datetime.strptime(dateStr, fmt)
        except ValueError:
            continue
##Second task: Structuring and Augmenting the Data

We now have a DataFrame object (ufos_df) with all the UFO sighting information we could extract from the web. We chose to ignore the description information, since it is not structured enough to be useful for our use case. We now have two main problems: the city and state information is probably enough to reason about which city in the world it is, but we don't have the coordinates of the locations. Additionally, the duration information extracted does not follow very precise rules - it can be formatted as "25 minutes" or "25+ Min" or "approximately 25 mins"; therefore it needs some cleaning and standardization before it can be useful.

Extracting duration and end time of UFO sighting
from datetime import timedelta
import re

# function that infers the duration in seconds from the free-text description
def infer_duration_in_seconds(text):
    # try different regexps to extract the total seconds;
    # one multiplier per metric string below
    metric_text = ["second", "s", "Second", "segundo", "minute", "m", "min", "Minute", "hour", "h", "Hour"]
    metric_seconds = [1, 1, 1, 1, 60, 60, 60, 60, 3600, 3600, 3600]
    for metric, mult in zip(metric_text, metric_seconds):
        regex = r"\s*(\d+)\+?\s*{}s?".format(metric)
        res = re.findall(regex, text)
        if len(res) > 0:
            return int(float(res[0]) * mult)
    return None

# extract the duration in seconds
ufos_df["duration_secs"] = ufos_df["duration_description"].apply(infer_duration_in_seconds)
# now we can infer the end time of the UFO sighting as well,
# which will be useful for the animation later
ufos_df["end"] = ufos_df.apply(lambda x: x["start"] + timedelta(seconds=x["duration_secs"]), axis=1)
In the code above, we apply the function to return the inferred duration in seconds to every row. Usually, the reports of the duration contain a description of the form "(optional qualifier like approximately, probably, about) + a number + an optional plus sign + (a word to express the time unit, which is in non-standard format and may be in plural form or not)".

To parse this, we can use regular expressions. We can try different possibilities for time units and return the number captured by the regex, multiplied by the number of seconds the matched time unit contains. Here it is interesting to note, that, while regular expressions were not a good tool to parse HTML in the previous step, they are appropriate for extracting the duration information from the string. The presence of optional characters and special classes of characters (ie, digits) make regular expressions a good tool for the job. About 85-90% of the duration descriptions in the data can be parsed with the simple function above; that is, the function doesn't return None.

Geocoding locations

To turn the city and state information into more useful geographical information we need to use a geocoding service. I chose Nominatim because of its straightforward API.
from nominatim import Nominatim

geolocator = Nominatim()
geolocator.query("Houston, TX")
The query returns a list of possible results. Each result is a dictionary that contains a "lat" and a "lon" key. We decide to take the coordinates of the first result, or return None if the query does not succeed in identifying the location. It is also a good idea to cache the results with a simple Python dictionary, so as to not send a query for "Houston, TX" multiple times, because Nominatim needs to query a REST API every time and that takes a while. The exact code used is not shown so as not to clutter the tutorial, but we're still making use of the "apply" functionality given by pandas to use the city and state columns as inputs to create new columns with the coordinates.
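The caching wrapper can be as simple as the sketch below (illustrative, not the exact code used), combined with pandas' apply to fill the new columns:

from nominatim import Nominatim

geolocator = Nominatim()
geo_cache = {}

def geocode(city, state):
    key = u"{0}, {1}".format(city, state)
    if key not in geo_cache:
        results = geolocator.query(key)
        if results:
            geo_cache[key] = (float(results[0]["lat"]), float(results[0]["lon"]))
        else:
            geo_cache[key] = (None, None)
    return geo_cache[key]

ufos_df["lat"] = ufos_df.apply(lambda row: geocode(row["city"], row["state"])[0], axis=1)
ufos_df["lon"] = ufos_df.apply(lambda row: geocode(row["city"], row["state"])[1], axis=1)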

At the end, we end up having a "lat" and "lon" column for every row in our DataFrame. Of course, these coordinates reflect the city the UFO sighting was reported in and not the exact location of the reporter, because more precise information was not recorded in the first place. We assume the aliens don't care.

Putting it all together

We can export the data that we have into CSV format for further processing.
# Note: dropna will drop any rows with None values, which is desirable
ufos_df[["start", "end", "lon", "lat", "shape"]].dropna().to_csv("ufo_data.csv", index=False, encoding="utf-8")
##Third task: Visualization with QGIS

Setup QGIS

We now have a CSV file with our cleaned UFO sightings data. To do the visualization with QGIS, we need to additionally install the OpenLayers plugin (for map backgrounds) and the TimeManager plugin>=2.0 (for stepping through the data in time). To install new plugins go to Plugins>Manage and install plugins, and search for "OpenLayers" and "TimeManager". If you can, use QGIS>=2.9. The TimeManager plugin was originally developed by Anita Graser and over the past year I, carolinux, have contributed a lot with refactoring, maintaining and developing new features.
The next step after the plugins are installed is to go to Web>OpenLayers Plugin and choose the underlying map that we want. In this tutorial, we have chosen OpenStreetMaps. If the TimeManager slider is not visible at the bottom of the program, it's necessary to go to Plugins>TimeManager and click Toggle visibility.

Load the data

We can go to Layer>Add Layer>Add Delimited Text Layer and choose the CSV file with the following settings.

Then, we click on the Settings button in the Time Manager and click Add Layer. We use the settings below and click OK.


Now we can slide through our UFO sightings and also choose the time frame size - we can, for instance, choose a time frame size of two hours to see on the map the UFOs that were reported during a two hour period from the current time. Using the slider, we can explore the dataset and see that on certain occasions multiple people reported seeing a UFO at the same time or in close succession from different locations. The night of March 6, 2015 sure was a busy night for our alleged extraterrestrial visitors.


A cool feature of QGIS in use here is the use of a custom SVG marker (the alien head) for visualization. Follow this link for instructions on how to setup custom markers in QGIS.

Data-defined styling

So far, so good, but we're not yet using one of the very useful features in QGIS which is data-defined styling (and is a direction I see many visualization libraries taking). We will take advantage of this capability to take into account the shape information of the UFOs for the visualization. It is possible to define "rules" within QGIS that say that if the attributes of a point fulfill a condition, a specific marker style will be used to render it.

To set these rules, right-click on the ufo_data layer, select Properties, go to Styling and select "Rule-based styling". Click on the plus symbol to add a new rule. The condition the attributes need to fulfill is defined in the Filter text box. To define a style for all UFO sightings that reported seeing a circular object, write " shape='circle' " as the filter and choose a circular marker. See the rules I defined below.



Exporting & Sharing

By clicking on "Export Video" from the TimeManager plugin, it is possible to export frames of the data - each frame being a snapshot of a time interval. These frames can then be turned into an animation with a command line utility such as ImageMagick or into a video with ffmpeg. By checking "Do not export empty frames" in the settings of the TimeManager plugin, you can avoid exporting frames where no UFO activity is present. You can also check "Display frame start time on map" to have a text label show the current time. This label can also be customized. Here is an animation I generated for two weeks in March. If the gif doesn't play, you can also load it with this direct link.


It is also possible to share the QGIS project file itself. It contains all the information to recreate the project in XML format and it also contains a relative path to the data source, so it can be shared as a folder which has the CSV and the .qgs file among a team and also put in source control.

Summary & Improvements

We extracted some semi-structured data from a website, transformed it, cleaned it and then loaded it into QGIS to step through it in time and export it as animation/video. Using a program which already had the functionality to visualize geospatial data and handle the geotemporal aspect was a good choice to save time since most of the logic of putting points on a map and handling the time had already been implemented. Feel free to suggest alternatives in the comments. A two-week snapshot of the cleaned dataset to play with can be downloaded from here.

One obvious improvement to this workflow would be to automate the data transformation pipeline fully, which is something data scientists should well keep in mind. I have written about this topic before. There is some manual work needed to launch QGIS and click around and also to turn the exported frames into an animation or a video. For the first issue, if we were to make QGIS project generation a part of a data pipeline which would need to run several times (which has happened in my workplace) we would need to script QGIS and also use a pre-made QGIS project template to load most of the settings from there. Luckily, it is possible to do so in Python using the PyQGIS bindings and take care of everything programmatically. For the second issue, I have indeed received several requests to make video and animation exporting possible with one click through the TimeManager plugin, so expect this feature to appear in the upcoming versions.
Written by Karolina Alexiou
