I was trying to use pandas to analyze a fairly large data set (~5 GB). I wanted to divide the data set into groups, perform a Cartesian product on each group, and then aggregate the result.
The apply operation of pandas is quite expressive: I could first group, then do the Cartesian product on each group using apply, and then aggregate the result using sum. The problem with this approach, however, is that apply is not lazy. It computes all the intermediate results before the aggregation, and the intermediate results (the Cartesian product of each group) are very large.
I was looking at Apache Spark and found one very interesting operator called cogroup. The definition is here:
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
This seems to be exactly what I want. If I could first cogroup and then do a sum, then the intermediate results won't be expanded (assuming cogroup works in the same lazy fashion as group).
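To make sure I am reading that definition correctly, here is a rough pure-Python sketch of the semantics. It is not Spark's API: `cogroup` below is a hypothetical helper over in-memory lists of (key, value) pairs, the sample data is made up, and unlike what I am hoping for it is eager rather than lazy.

```python
from collections import defaultdict

def cogroup(left, right):
    """Group two iterables of (key, value) pairs by key.

    Returns {key: (values_from_left, values_from_right)}.
    Hypothetical helper for illustration only; not Spark's implementation.
    """
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)

# Made-up sample data: keys play the role of K, values of V and W.
left = [("i1", 2), ("i1", 3), ("i2", 1)]
right = [("i1", "A"), ("i3", "B")]
result = cogroup(left, right)
print(result)
# {'i1': ([2, 3], ['A']), 'i2': ([1], []), 'i3': ([], ['B'])}
```

Each key ends up with one pair of collections, so the per-key Cartesian product never has to be written out unless I ask for it.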
Is there an operation similar to cogroup in pandas, or is there another way to achieve my goal efficiently?
Here is my example:
I want to group the data by id, and then do a Cartesian product for each group, and then group by cluster_x and cluster_y and aggregate the count_x and count_y using sum. The following code works, but is extremely slow and consumes too much memory.
import pandas

# add dummy_key to do Cartesian product by merge
df['dummy_key'] = 1

def join_group(g):
    # self-merge on the constant key gives the Cartesian product within the group
    return pandas.merge(g, g, on='dummy_key')[
        ['cluster_x', 'count_x', 'cluster_y', 'count_y']]

df_count_stats = (
    df.groupby(['id'], as_index=True)
      .apply(join_group)
      .groupby(['cluster_x', 'cluster_y'], as_index=False)
      [['count_x', 'count_y']]
      .sum()
)

A toy data set
   id cluster  count
0  i1       A      2
1  i1       B      3
2  i2       A      1
3  i2       B      4

Intermediate result after the apply (can be large)
     cluster_x  count_x cluster_y  count_y
id
i1 0         A        2         A        2
   1         A        2         B        3
   2         B        3         A        2
   3         B        3         B        3
i2 0         A        1         A        1
   1         A        1         B        4
   2         B        4         A        1
   3         B        4         B        4

The desired final result
  cluster_x cluster_y  count_x  count_y
0         A         A        3        3
1         A         B        3        7
2         B         A        7        3
3         B         B        7        7

1 Answer
My first attempt failed, sort of: while I was able to limit the memory use (by summing over the Cartesian product within each group), it was considerably slower than the original. But for your particular desired output, I think we can simplify the problem considerably:
import numpy as np, pandas as pd

def fake_data(nids, nclusters, ntile):
    ids = ["i{}".format(i) for i in range(1, nids+1)]
    clusters = ["A{}".format(i) for i in range(nclusters)]
    df = pd.DataFrame(index=pd.MultiIndex.from_product([ids, clusters], names=["id", "cluster"]))
    df = df.reset_index()
    df = pd.concat([df]*ntile)
    df["count"] = np.random.randint(0, 10, size=len(df))
    return df

def join_group(g):
    m = pd.merge(g, g, on='dummy_key')
    return m[['cluster_x', 'count_x', 'cluster_y', 'count_y']]

def old_method(df):
    df["dummy_key"] = 1
    h1 = df.groupby(['id'], as_index=True).apply(join_group)
    h2 = h1.groupby(['cluster_x', 'cluster_y'], as_index=False)
    h3 = h2[['count_x', 'count_y']].sum()
    return h3

def new_method1(df):
    m1 = df.groupby("cluster", as_index=False)["count"].sum()
    m1["dummy_key"] = 1
    m2 = m1.merge(m1, on="dummy_key")
    m2 = m2.sort_index(axis=1).drop(["dummy_key"], axis=1)
    return m2

which gives (with df as your toy frame):
>>> new_method1(df)
  cluster_x cluster_y  count_x  count_y
0         A         A        3        3
1         A         B        3        7
2         B         A        7        3
3         B         B        7        7
>>> df2 = fake_data(100, 100, 1)
>>> %timeit old_method(df2)
1 loops, best of 3: 954 ms per loop
>>> %timeit new_method1(df2)
100 loops, best of 3: 8.58 ms per loop
>>> (old_method(df2) == new_method1(df2)).all().all()
True

and even
>>> df2 = fake_data(100, 100, 100)
>>> %timeit new_method1(df2)
10 loops, best of 3: 88.8 ms per loop

Whether this will be enough of an improvement to handle your actual case, I'm not sure.
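A note on scope: new_method1 reproduces old_method only when every id contains every cluster exactly once, which holds for the toy frame and for fake_data(..., ntile=1). With repeated or missing clusters per id, the two results diverge. Below is a minimal sketch for that more general case. It is not part of the timings above, and `pairwise_sums` is a hypothetical name. The idea is that, within one id, the Cartesian product contributes (sum of counts for cluster a) times (number of rows for cluster b) to count_x, so two id-by-cluster tables and two matrix products are enough and the product itself is never built. The sketch assumes those id-by-cluster tables fit in memory as dense arrays.

```python
import numpy as np
import pandas as pd

def pairwise_sums(df):
    """Sums over the per-id Cartesian product, grouped by (cluster_x, cluster_y),
    computed without materialising the product.

    Hypothetical helper; expects columns "id", "cluster" and "count".
    """
    grouped = df.groupby(["id", "cluster"])["count"]
    S = grouped.sum().unstack(fill_value=0)    # S[i, a]: total count of cluster a in id i
    N = grouped.size().unstack(fill_value=0)   # N[i, a]: number of rows of cluster a in id i
    clusters = S.columns.to_numpy()
    s, n = S.to_numpy(), N.to_numpy()

    count_x = s.T @ n   # [a, b] = sum over ids of S[i, a] * N[i, b]
    count_y = n.T @ s   # [a, b] = sum over ids of N[i, a] * S[i, b]

    # Keep only pairs that occur together in at least one id,
    # since only those appear in the merged frame.
    present = (n > 0).astype(int)
    a_idx, b_idx = np.nonzero((present.T @ present) > 0)

    return pd.DataFrame({
        "cluster_x": clusters[a_idx],
        "cluster_y": clusters[b_idx],
        "count_x": count_x[a_idx, b_idx],
        "count_y": count_y[a_idx, b_idx],
    })

# The toy frame from the question.
toy = pd.DataFrame({
    "id": ["i1", "i1", "i2", "i2"],
    "cluster": ["A", "B", "A", "B"],
    "count": [2, 3, 1, 4],
})
print(pairwise_sums(toy))
```

On the toy frame this gives the same four rows as the desired final result. With many ids and clusters, the dense tables could probably be swapped for sparse matrices, but that variant is untested here.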