Analysis with pandas DataFrames

matplotlib and numpy are always imported

Full guide: matplotlib.pyplot.plot
Full docs: numpy (numpy is part of the SciPy ecosystem)
For those coming from MATLAB: numpy for matlab users

The pandas module (needed to use the DataFrame)

Article on DataCamp: Pandas Tutorial: DataFrames in Python

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

A convenient approach is to import the data with numpy (np.loadtxt converts the text columns to floating-point numbers) and then create the DataFrame. Consider a typical ASCII file, for example from the SELDOM 2018 test beam. The folder axial_example contains 164 files from run 300097. File name: run300097_multi_000062.dat

The file contains 50 columns:

In [2]:
folder   = 'axial_example/'
filename = 'run300097_multi_000062.dat'
data = np.loadtxt(folder+filename)

In practice, we have imported the data into one large "matrix" (a 2D numpy array, one row per event):

In [3]:
print(data)
[[1.33500e+00 8.91110e-01 1.19500e+00 ... 6.20000e+01 1.50000e+01
  3.35441e+05]
 [1.35500e+00 7.90330e-01 1.22790e+00 ... 6.20000e+01 1.50000e+01
  3.35443e+05]
 [1.37000e+00 1.11690e+00 1.30110e+00 ... 6.20000e+01 1.50000e+01
  3.35444e+05]
 ...
 [8.20000e-01 1.85970e-01 7.71070e-01 ... 6.20000e+01 1.50000e+01
  3.40936e+05]
 [1.33120e+00 8.81090e-01 1.42390e+00 ... 6.20000e+01 1.50000e+01
  3.40937e+05]
 [1.23760e+00 8.05840e-01 1.08570e+00 ... 6.20000e+01 1.50000e+01
  3.40938e+05]]
In [4]:
print(data[0])
[ 1.3350000e+00  8.9111000e-01  1.1950000e+00  1.1750000e+00
  4.5738000e+00  3.7521000e+00  1.0000000e+00  2.0000000e+00
  1.0000000e+00  1.0000000e+00  1.0000000e+00  2.0000000e+00
  1.0000000e+00  1.0000000e+00  1.0000000e+00  1.0000000e+00
  1.0000000e+00  1.0000000e+00  1.5371000e+04  1.5469000e+04
  1.5419000e+04  1.5444000e+04  1.5463000e+04  1.5647000e+04
  1.5452000e+04  1.5659000e+04  1.6000000e+01  1.2000000e+01
  2.4540000e+03  1.6000000e+01  2.3000000e+01  2.1000000e+01
  2.3000000e+01  1.7000000e+01  3.6400000e+02  2.1000000e+02
  2.5000000e+02  4.7000000e+02  4.5300000e+02  2.0900000e+02
  3.8200000e+02  7.2000000e+01  1.0983357e+03 -1.1000601e+04
  2.5199999e+01  0.0000000e+00  0.0000000e+00  6.2000000e+01
  1.5000000e+01  3.3544100e+05]
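
Printing the whole array is not the quickest way to check what was loaded. A minimal sketch of checking the shape and type instead, using a small synthetic file (the file name `example.dat`, the variable `sample` and the five-column content are made up for illustration, not taken from the run):

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for a run file: 3 events (rows) x 5 columns.
# The real file in this tutorial has 50 columns.
text = (
    "1.335 0.891 1.195 62 335441\n"
    "1.355 0.790 1.228 62 335443\n"
    "1.370 1.117 1.301 62 335444\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dat")  # hypothetical file name
    with open(path, "w") as f:
        f.write(text)
    sample = np.loadtxt(path)  # whitespace-separated columns by default

n_events, n_columns = sample.shape  # rows are events, columns are variables
print(n_events, n_columns)
print(sample.dtype)  # loadtxt returns floats unless dtype is specified
```

On the real file, `data.shape` should report 50 as its second element if the column count stated above is right.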

Let's name the columns by building a list of names.

In [5]:
nomi = []
for i in range(4):
    nomi.append('tele%d' % i)
In [6]:
for i in range(2):
    nomi.append('bc%d'%i)
print(nomi)
['tele0', 'tele1', 'tele2', 'tele3', 'bc0', 'bc1']
In [7]:
for i in range(12):
    nomi.append('clu%d'%i)
print(nomi)
['tele0', 'tele1', 'tele2', 'tele3', 'bc0', 'bc1', 'clu0', 'clu1', 'clu2', 'clu3', 'clu4', 'clu5', 'clu6', 'clu7', 'clu8', 'clu9', 'clu10', 'clu11']
In [8]:
for v in ['baseline', 'ph', 'time']:
    for i in range(8):
        nomi.append('%s%d'%(v,i))
    
print(nomi)
['tele0', 'tele1', 'tele2', 'tele3', 'bc0', 'bc1', 'clu0', 'clu1', 'clu2', 'clu3', 'clu4', 'clu5', 'clu6', 'clu7', 'clu8', 'clu9', 'clu10', 'clu11', 'baseline0', 'baseline1', 'baseline2', 'baseline3', 'baseline4', 'baseline5', 'baseline6', 'baseline7', 'ph0', 'ph1', 'ph2', 'ph3', 'ph4', 'ph5', 'ph6', 'ph7', 'time0', 'time1', 'time2', 'time3', 'time4', 'time5', 'time6', 'time7']
In [9]:
for i in range(5):
    nomi.append('gonio%d'%i)
print(nomi)
['tele0', 'tele1', 'tele2', 'tele3', 'bc0', 'bc1', 'clu0', 'clu1', 'clu2', 'clu3', 'clu4', 'clu5', 'clu6', 'clu7', 'clu8', 'clu9', 'clu10', 'clu11', 'baseline0', 'baseline1', 'baseline2', 'baseline3', 'baseline4', 'baseline5', 'baseline6', 'baseline7', 'ph0', 'ph1', 'ph2', 'ph3', 'ph4', 'ph5', 'ph6', 'ph7', 'time0', 'time1', 'time2', 'time3', 'time4', 'time5', 'time6', 'time7', 'gonio0', 'gonio1', 'gonio2', 'gonio3', 'gonio4']
In [10]:
nomi += ['step', 'event_n', 'event_time']
print('La lista nomi ha lunghezza %d'%len(nomi))
La lista nomi ha lunghezza 50
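
The same list can be built in one pass from a table of (prefix, count) pairs. This is only a sketch of an alternative to the loops above; the variable names `groups` and `names` are introduced here for illustration, while the prefixes and counts are the ones used in the cells above.

```python
# Column groups in file order: (prefix, number of columns).
groups = [
    ("tele", 4),
    ("bc", 2),
    ("clu", 12),
    ("baseline", 8),
    ("ph", 8),
    ("time", 8),
    ("gonio", 5),
]

# Expand every group into prefix0, prefix1, ... and append the last three names.
names = [f"{prefix}{i}" for prefix, count in groups for i in range(count)]
names += ["step", "event_n", "event_time"]

# Sanity checks: one name per column, and no duplicates.
assert len(names) == 50
assert len(set(names)) == len(names)
```

Keeping the counts in one table makes it easier to see that they add up to the 50 columns of the file.
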
In [11]:
df = pd.DataFrame(data, columns=nomi)
df.head(10)
Out[11]:
tele0 tele1 tele2 tele3 bc0 bc1 clu0 clu1 clu2 clu3 ... time6 time7 gonio0 gonio1 gonio2 gonio3 gonio4 step event_n event_time
0 1.3350 0.891110 1.19500 1.17500 4.5738 3.7521 1.0 2.0 1.0 1.0 ... 382.0 72.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335441.0
1 1.3550 0.790330 1.22790 1.05500 4.4654 3.5108 1.0 3.0 2.0 1.0 ... 84.0 346.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335443.0
2 1.3700 1.116900 1.30110 1.44850 4.5738 3.9340 1.0 2.0 2.0 2.0 ... 341.0 511.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335444.0
3 1.1800 0.674790 0.95734 0.91291 4.1492 3.3532 1.0 3.0 2.0 2.0 ... 366.0 82.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335448.0
4 1.2043 1.124600 1.15700 1.43160 4.4528 3.9205 2.0 3.0 2.0 2.0 ... 463.0 506.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335449.0
5 0.6550 0.087493 0.50262 0.21500 3.7268 2.6030 1.0 2.0 2.0 1.0 ... 118.0 114.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335453.0
6 1.2019 0.541240 1.23290 0.80000 4.5496 3.2564 2.0 2.0 2.0 1.0 ... 242.0 178.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335455.0
7 1.3215 0.241380 1.32230 0.49500 4.5738 2.9176 2.0 2.0 2.0 1.0 ... 287.0 144.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335457.0
8 1.1600 0.906180 1.06500 1.19500 4.3318 3.6678 1.0 2.0 1.0 1.0 ... 100.0 383.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335461.0
9 1.0863 0.992490 0.80256 1.34000 3.9579 3.8231 2.0 2.0 2.0 1.0 ... 156.0 251.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335462.0

10 rows × 50 columns
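
Because the array comes from np.loadtxt, every column of the DataFrame is a float, including counters such as step. A minimal sketch of converting selected columns to integers with `DataFrame.astype`, on a tiny synthetic frame (`sample_df` is a made-up name, and treating `step` and `event_time` as whole numbers is an assumption based on the values printed above):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with three of the tutorial's column names.
sample = np.array([
    [1.3350, 62.0, 335441.0],
    [1.3550, 62.0, 335443.0],
])
sample_df = pd.DataFrame(sample, columns=["tele0", "step", "event_time"])

# Assumption: 'step' and 'event_time' only ever hold whole numbers.
sample_df = sample_df.astype({"step": "int64", "event_time": "int64"})
print(sample_df.dtypes)
```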

Description (summary statistics)

In [12]:
df.describe()
Out[12]:
tele0 tele1 tele2 tele3 bc0 bc1 clu0 clu1 clu2 clu3 ... time6 time7 gonio0 gonio1 gonio2 gonio3 gonio4 step event_n event_time
count 1889.000000 1889.000000 1889.000000 1889.000000 1889.000000 1889.000000 1889.000000 1889.000000 1889.000000 1889.000000 ... 1889.000000 1889.000000 1.889000e+03 1.889000e+03 1889.000000 1889.0 1889.0 1889.0 1889.0 1889.000000
mean -7.855440 0.817931 -0.529481 -3.142273 -27.466914 6.382615 1.633139 2.292218 1.682372 1.499206 ... 266.594494 267.001588 1.098336e+03 -1.100060e+04 25.199999 0.0 0.0 62.0 15.0 338163.289571
std 94.572032 0.290830 69.049776 130.156029 435.823624 115.177870 0.627227 0.652040 0.612988 0.626164 ... 136.037720 129.768746 2.274339e-13 3.638942e-12 0.000000 0.0 0.0 0.0 0.0 1616.836296
min -1000.000000 0.086048 -3000.000000 -4000.000000 -6000.000000 0.309730 0.000000 1.000000 0.000000 0.000000 ... 51.000000 51.000000 1.098336e+03 -1.100060e+04 25.199999 0.0 0.0 62.0 15.0 335441.000000
25% 1.090000 0.598750 0.948340 0.833410 4.089800 3.256400 1.000000 2.000000 1.000000 1.000000 ... 146.000000 154.000000 1.098336e+03 -1.100060e+04 25.199999 0.0 0.0 62.0 15.0 336760.000000
50% 1.185000 0.821710 1.092100 1.099000 4.337500 3.595200 2.000000 2.000000 2.000000 1.000000 ... 266.000000 264.000000 1.098336e+03 -1.100060e+04 25.199999 0.0 0.0 62.0 15.0 338146.000000
75% 1.280000 1.036300 1.217600 1.362500 4.573800 3.958200 2.000000 3.000000 2.000000 2.000000 ... 383.000000 370.000000 1.098336e+03 -1.100060e+04 25.199999 0.0 0.0 62.0 15.0 339550.000000
max 1.894600 1.846000 1.895800 1.900900 8.727800 5009.500000 5.000000 7.000000 6.000000 6.000000 ... 511.000000 511.000000 1.098336e+03 -1.100060e+04 25.199999 0.0 0.0 62.0 15.0 340938.000000

8 rows × 50 columns
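
The min row shows large negative values (for example -1000 for tele0), far from the bulk of the distribution. A quick way to see how many rows are affected is to count the entries below a threshold, column by column. The sketch below uses a small synthetic frame; `sample_df` and `n_bad` are illustrative names, and -5 is the threshold used for the cut later in this tutorial.

```python
import pandas as pd

# Synthetic stand-in: -1000 plays the role of an invalid entry.
sample_df = pd.DataFrame({
    "tele0": [1.335, -1000.0, 1.370, 1.180],
    "tele1": [0.891, 0.790, 1.117, 0.675],
})

# The comparison gives a boolean frame; summing counts the True values per column.
n_bad = (sample_df < -5).sum()
print(n_bad)
```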

In [13]:
# histograms
for i in range(4):
    plt.subplot(221+i)
    plt.hist(df[nomi[i]],100)

Clean up wrong telescope entries (the large negative values, such as -1000, visible in the min row of describe() above)

In [14]:
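# .copy() makes the filtered frame independent, so columns can be added later without a SettingWithCopyWarning
df = df[(df.tele0 > -5) & (df.tele1 > -5) & (df.tele2 > -5) & (df.tele3 > -5)].copy()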
df.head()
Out[14]:
tele0 tele1 tele2 tele3 bc0 bc1 clu0 clu1 clu2 clu3 ... time6 time7 gonio0 gonio1 gonio2 gonio3 gonio4 step event_n event_time
0 1.3350 0.89111 1.19500 1.17500 4.5738 3.7521 1.0 2.0 1.0 1.0 ... 382.0 72.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335441.0
1 1.3550 0.79033 1.22790 1.05500 4.4654 3.5108 1.0 3.0 2.0 1.0 ... 84.0 346.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335443.0
2 1.3700 1.11690 1.30110 1.44850 4.5738 3.9340 1.0 2.0 2.0 2.0 ... 341.0 511.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335444.0
3 1.1800 0.67479 0.95734 0.91291 4.1492 3.3532 1.0 3.0 2.0 2.0 ... 366.0 82.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335448.0
4 1.2043 1.12460 1.15700 1.43160 4.4528 3.9205 2.0 3.0 2.0 2.0 ... 463.0 506.0 1098.3357 -11000.601 25.199999 0.0 0.0 62.0 15.0 335449.0

5 rows × 50 columns
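
The same cut can be written without repeating the condition for each column: compare all telescope columns at once and require every one of them to pass. A minimal sketch on synthetic data (`sample_df`, `tele_cols`, `good` and `clean` are illustrative names):

```python
import pandas as pd

# Synthetic stand-in: rows 1 and 2 each contain one invalid telescope entry.
sample_df = pd.DataFrame({
    "tele0": [1.335, -1000.0, 1.370],
    "tele1": [0.891, 0.790, 1.117],
    "tele2": [1.195, 1.228, -3000.0],
    "tele3": [1.175, 1.055, 1.449],
    "bc0": [4.57, 4.47, 4.57],
})

tele_cols = [f"tele{i}" for i in range(4)]

# True only for rows where all four telescope columns are above the threshold.
good = (sample_df[tele_cols] > -5).all(axis=1)
clean = sample_df[good].copy()  # copy so new columns can be added safely
print(clean)
```

Note that the filtered frame keeps the original index labels, so the row numbers are no longer guaranteed to be consecutive.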

Adding aliases (new dataframe columns)

In [19]:
df['teleDx'] = df['tele2'] - df['tele0']
df['teleDy'] = df['tele3'] - df['tele1']
df['divx'] = df['teleDx']/18.44
df['divy'] = df['teleDy']/18.44
In [20]:
df.tail()
Out[20]:
tele0 tele1 tele2 tele3 bc0 bc1 clu0 clu1 clu2 clu3 ... gonio2 gonio3 gonio4 step event_n event_time teleDx teleDy divx divy
1884 1.2400 1.02100 1.16180 1.3113 4.4278 3.7746 1.0 2.0 2.0 2.0 ... 25.199999 0.0 0.0 62.0 15.0 340928.0 -0.07820 0.29030 -0.004241 0.015743
1885 1.3072 0.80519 1.19000 1.0800 6.2505 4.1760 2.0 5.0 1.0 1.0 ... 25.199999 0.0 0.0 62.0 15.0 340931.0 -0.11720 0.27481 -0.006356 0.014903
1886 0.8200 0.18597 0.77107 0.1950 3.7356 3.2564 1.0 2.0 2.0 1.0 ... 25.199999 0.0 0.0 62.0 15.0 340936.0 -0.04893 0.00903 -0.002653 0.000490
1887 1.3312 0.88109 1.42390 1.1300 4.7916 3.5107 2.0 2.0 2.0 1.0 ... 25.199999 0.0 0.0 62.0 15.0 340937.0 0.09270 0.24891 0.005027 0.013498
1888 1.2376 0.80584 1.08570 1.0100 4.3197 3.4500 2.0 2.0 3.0 1.0 ... 25.199999 0.0 0.0 62.0 15.0 340938.0 -0.15190 0.20416 -0.008238 0.011072

5 rows × 54 columns
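
As an alternative to assigning the columns one by one, `DataFrame.assign` returns a new frame with the extra columns and leaves the original untouched. The sketch below reuses the values of row 1884 shown above; `sample_df` and `DIVISOR` are illustrative names, and 18.44 is simply the constant from the cell above (its meaning and unit are not stated here).

```python
import pandas as pd

DIVISOR = 18.44  # same constant as in the cell above; unit not stated in the source

sample_df = pd.DataFrame({
    "tele0": [1.2400], "tele1": [1.02100],
    "tele2": [1.16180], "tele3": [1.3113],
})

# Each lambda receives the frame built so far, so divx can use teleDx.
sample_df = sample_df.assign(
    teleDx=lambda d: d["tele2"] - d["tele0"],
    teleDy=lambda d: d["tele3"] - d["tele1"],
    divx=lambda d: d["teleDx"] / DIVISOR,
    divy=lambda d: d["teleDy"] / DIVISOR,
)
print(sample_df)
```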

In [21]:
# histograms
for i in range(4):
    plt.subplot(221+i)
    plt.hist(df['tele%i'%i],100)
In [22]:
plt.figure(figsize=(12,5))
for i in range(2):
    plt.subplot(121+i)
    plt.hist2d(df['tele%d'%i],df['tele%d'%(i+1)],100)
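
The plots above carry no axis labels, so it is hard to tell the panels apart. A possible variant of the histogram cell uses `plt.subplots` and labels each panel with its column name. This is a sketch on random synthetic data, not the real run; `sample_df` is an illustrative name, and the `Agg` backend line is only an assumption for running outside a notebook (inside the notebook, `%matplotlib inline` takes its place).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: running as a script, without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in for the four telescope columns.
rng = np.random.default_rng(0)
sample_df = pd.DataFrame(
    rng.normal(1.2, 0.2, size=(500, 4)),
    columns=[f"tele{i}" for i in range(4)],
)

# One panel per column on a 2x2 grid, each labelled with the column name.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, name in zip(axes.flat, sample_df.columns):
    ax.hist(sample_df[name], bins=100)
    ax.set_xlabel(name)
    ax.set_ylabel("entries")
fig.tight_layout()
```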

Pause. So far we have opened a single file. Creating the DataFrame is a task that gets repeated on many files, many times.

Let's see how to package everything into a single function.