Data Programming Course  Check-in [6561f8c970]


Overview
Comment: finished lesson 7
SHA1:6561f8c9701cad8bf3f36d683529210c17049b7a
User & Date: EnricoGiampieri 2017-03-13 23:29:25
Context
2017-03-13
23:30
added the SVG image required for lesson 7 check-in: 08339bafe1 user: EnricoGiampieri tags: trunk
23:29
finished lesson 7 check-in: 6561f8c970 user: EnricoGiampieri tags: trunk
2017-03-12
10:55
slides for lesson 7, forgotten in the previous commit check-in: 65c2141b7d user: EnricoGiampieri tags: trunk
Changes

Changes to Lezione 7 - Data pipeline e Snakemake.html.

<p>This structure is called <strong>pull</strong> (I specify the point of arrival), and it takes some time to get comfortable with (the norm in programming is <strong>push</strong>, where I specify the starting point).</p>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[28]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">cd</span> /home/enrico/lavoro/DataProgrammingCourse/
<span class="o">%</span><span class="k">mkdir</span> ../snakemake_lesson/
<span class="o">%</span><span class="k">cd</span> ../snakemake_lesson/
<span class="o">%</span><span class="k">pwd</span>
</pre></div>
................................................................................


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>/home/enrico/lavoro/DataProgrammingCourse

/home/enrico/lavoro/snakemake_lesson
</pre>
</div>
</div>

<div class="output_area">
<div class="prompt output_prompt">Out[28]:</div>



<div class="output_text output_subarea output_execute_result">
<pre>&#39;/home/enrico/lavoro/snakemake_lesson&#39;</pre>
</div>

</div>

</div>
</div>



















</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[41]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................
</pre>
</div>
</div>

</div>
</div>













</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[45]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................
</pre>
</div>
</div>

</div>
</div>
















</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[49]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................
</pre>
</div>
</div>

</div>
</div>











</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[53]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If the intermediate results already exist, do not run them again</p>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[59]:</div>
................................................................................
</pre>
</div>
</div>

</div>
</div>













</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[71]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................
<span class="o">%</span><span class="k">rm</span> result.txt
</pre></div>

</div>
</div>
</div>











</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[77]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --dag <span class="p">|</span> dot -Tsvg &gt; dag.svg
................................................................................
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="./immagini/snakemake_dag.svg" alt="pipeline visualization"></p>













</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[78]:</div>
<div class="inner_cell">
................................................................................
</pre>
</div>
</div>

</div>
</div>












</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[85]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --cores <span class="m">6</span>
................................................................................
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> *.txt
</pre></div>

</div>
</div>
</div>












</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[92]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="configurazioni">Configuration<a class="anchor-link" href="#configurazioni">&#182;</a></h3>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span> 
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span> 
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">



<div class="prompt input_prompt">






















































</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="sentinel-files">Sentinel files<a class="anchor-link" href="#sentinel-files">&#182;</a></h3><p>A very simple concept: I create empty files to act as markers, then delete them when I no longer need them.</p>
<p>I can refresh them with <code>touch</code> to make them newer.</p>
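<p>As a minimal sketch of the sentinel idea, assuming plain Python rather than any Snakemake feature: the file name <code>step_done.flag</code> is hypothetical, and the old timestamp is set artificially only to make the effect of the refresh visible.</p>

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    sentinel = Path(folder) / "step_done.flag"   # hypothetical file name

    sentinel.touch()                        # create the empty marker file
    size = sentinel.stat().st_size          # it carries no content

    os.utime(sentinel, (1000, 1000))        # pretend it was created long ago
    old_time = sentinel.stat().st_mtime
    sentinel.touch()                        # refresh: modification time = now
    refreshed = sentinel.stat().st_mtime > old_time

    sentinel.unlink()                       # delete it once it is not needed
    removed = not sentinel.exists()

print(size, refreshed, removed)
```

<p>Only the existence of the file and its modification time carry information, which is why an empty file is enough.</p>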


</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Esercizio">Exercise<a class="anchor-link" href="#Esercizio">&#182;</a></h2><p>On the website you will find a link to some files for this lesson, each containing a simple table that lists a sequence of payments made by some people.</p>







<p>This structure is called <strong>pull</strong> (I specify the point of arrival), and it takes some time to get comfortable with (the norm in programming is <strong>push</strong>, where I specify the starting point).</p>
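<p>The pull idea can be sketched in a few lines of plain Python. This is an illustration only, not how Snakemake is implemented: the <code>RULES</code> table, the file names and the <code>plan</code> function are all hypothetical. I name the target I want, and the order of the steps is derived by walking back through what each target needs.</p>

```python
# Hypothetical toy rules: each target maps to the list of targets it needs.
RULES = {
    "result.txt": ["partial_a.txt", "partial_b.txt"],
    "partial_a.txt": [],
    "partial_b.txt": [],
}


def plan(target, rules, done=None):
    """Return the build order needed to obtain `target` (dependencies first)."""
    if done is None:
        done = []
    for dependency in rules[target]:
        if dependency not in done:
            plan(dependency, rules, done)
    if target not in done:
        done.append(target)
    return done


order = plan("result.txt", RULES)
print(order)
```

<p>A push-style script would instead list the three steps by hand in the order in which they must run.</p>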

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[1]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">cd</span> /home/enrico/lavoro/DataProgrammingCourse/
<span class="o">%</span><span class="k">mkdir</span> ../snakemake_lesson/
<span class="o">%</span><span class="k">cd</span> ../snakemake_lesson/
<span class="o">%</span><span class="k">pwd</span>
</pre></div>
................................................................................


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>/home/enrico/lavoro/DataProgrammingCourse
mkdir: cannot create directory ‘../snakemake_lesson/’: File exists
/home/enrico/lavoro/snakemake_lesson
</pre>
</div>
</div>

<div class="output_area">
<div class="prompt output_prompt">Out[1]:</div>



<div class="output_text output_subarea output_execute_result">
<pre>&#39;/home/enrico/lavoro/snakemake_lesson&#39;</pre>
</div>

</div>

</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A Snakefile is divided into rules.</p>
<p>Each rule is executed as a new, separate Python process.</p>
<p>A rule has several subsections, the most important of which are:</p>
<ul>
<li>output: the list of files that the rule will generate (it is a promise: the files must then actually be created)</li>
<li>input: the list of files that the rule requires</li>
<li>shell/run: runs one or more shell commands, or runs arbitrary Python code</li>
</ul>
<p>By default, if nothing else is specified, Snakemake executes the first rule in the file, which by convention is named <strong>all</strong>.</p>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[41]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If a rule's output file already exists and is more recent than its input files, the rule is not executed.</p>
<p>This behavior is called <strong>idempotence</strong>, and it makes the script's execution more predictable.</p>
<p>It is still possible to force Snakemake's hand in various ways (if you need to, you will find them in the manual).</p>
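<p>The timestamp comparison described here can be sketched as follows. This is a simplified illustration under my own assumptions, not Snakemake's actual decision logic: <code>needs_run</code> is a hypothetical helper, and the modification times are set by hand so that the three cases are reproducible.</p>

```python
import os
import tempfile


def needs_run(inputs, outputs):
    """True if any output is missing or older than the newest input."""
    if not all(os.path.exists(path) for path in outputs):
        return True
    if not inputs:
        return False
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return oldest_output < newest_input


with tempfile.TemporaryDirectory() as folder:
    source = os.path.join(folder, "input.txt")
    target = os.path.join(folder, "output.txt")
    open(source, "w").close()
    missing_output = needs_run([source], [target])   # output not there yet
    open(target, "w").close()
    os.utime(source, (1000, 1000))   # pretend the input is old
    os.utime(target, (2000, 2000))   # and the output is newer
    fresh_output = needs_run([source], [target])
    os.utime(source, (3000, 3000))   # now the input changed afterwards
    stale_output = needs_run([source], [target])

print(missing_output, fresh_output, stale_output)
```

<p>Running the check twice without touching the files gives the same answer both times, which is the predictability the text refers to.</p>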

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[45]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If the input files are specified and:</p>
<ol>
<li>they do not exist</li>
<li>there is no rule that produces them as output</li>
</ol>
<p>then Snakemake will return an error.</p>
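<p>The two conditions above can be written as a small check. This is a sketch in plain Python: <code>MissingInputError</code>, <code>check_inputs</code> and the file names are hypothetical and do not correspond to Snakemake's own classes or messages.</p>

```python
import os


class MissingInputError(Exception):
    """Hypothetical error type for this sketch (not Snakemake's own class)."""


def check_inputs(inputs, producible):
    """Raise if an input neither exists on disk nor is some rule's output."""
    missing = [
        path for path in inputs
        if not os.path.exists(path) and path not in producible
    ]
    if missing:
        raise MissingInputError("no file and no rule for: " + ", ".join(missing))


producible = {"partial_1.txt"}          # outputs promised by the toy rules
check_inputs(["partial_1.txt"], producible)   # fine: a rule can create it

try:
    # Assumes no file with this name exists in the working directory.
    check_inputs(["surely_missing_file_for_sketch.txt"], producible)
    outcome = "ok"
except MissingInputError as error:
    outcome = str(error)

print(outcome)
```

<p>The check can run before any rule is executed, so the problem is reported without doing any partial work.</p>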

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[49]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's see what a script looks like with two distinct rules, one to create the two partial files and one to process them.</p>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[53]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If the intermediate results already exist, Snakemake does not produce them again.</p>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[59]:</div>
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To use wildcards I have to call the <code>expand</code> function somewhere, which assigns the various possible values to the wildcards.</p>
<p>I can have several wildcards at the same time; what matters is that all of them are initialized.</p>
<p>There are mechanisms for inferring wildcards automatically, but I recommend getting comfortable with explicit declaration first.</p>
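<p>To get a feeling for what such an expansion produces, here is a rough stand-in written in plain Python. <code>expand_pattern</code> is a hypothetical helper, not Snakemake's function, and the second pattern with its <code>sample</code> and <code>step</code> wildcards is invented for the example. It assumes that every combination of the given values is wanted.</p>

```python
from itertools import product


def expand_pattern(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    combinations = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, values))) for values in combinations]


files = expand_pattern("parziali{numero}.txt", numero=[0, 1, 2])
pairs = expand_pattern("{sample}_{step}.txt", sample=["a", "b"], step=[1, 2])
print(files)
print(pairs)
```

<p>With two wildcards the result has one file name per pair of values, which is why every wildcard in the pattern needs a list of values.</p>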

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[71]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................
<span class="o">%</span><span class="k">rm</span> result.txt
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Snakemake also lets me create a flow chart that shows everything that has to be done, as well as what has already been done and does not need to run again.</p>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[77]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --dag <span class="p">|</span> dot -Tsvg &gt; dag.svg
................................................................................
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="./immagini/snakemake_dag.svg" alt="pipeline visualization"></p>

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Another extremely useful feature is the creation of a provenance record, which tells me which files were created by which rule and with which parameters.</p>
<p>This makes it simple to keep track of the origin of each file.</p>
<p>It is also easy to set it up so that the provenance is appended to a complete log, giving the history of all the files created and modified over time.</p>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[78]:</div>
<div class="inner_cell">
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If I want to run several rules in parallel (respecting, of course, the required execution order within each branch), I just pass the option <code>--cores &lt;N&gt;</code> and Snakemake will automatically run everything it can in parallel.</p>
<p>There is an equivalent for launching the pipeline on a computing cluster, which makes distributed computation very simple.</p>
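<p>The principle, independent jobs overlapping while dependent ones wait, can be sketched with the standard library. This is only an analogy under my own assumptions: <code>make_partial</code> is a hypothetical stand-in for a rule, and a thread pool replaces the separate processes that a real pipeline would use.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def make_partial(number):
    """Stand-in for an independent job; returns the text it would write."""
    return "risultato di parziali{}.txt".format(number)


# The four partial jobs do not depend on each other, so they can overlap;
# max_workers plays the role of the number of cores granted to the pipeline.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(make_partial, range(4)))

# The final step needs every partial, so it only runs after all are finished.
result = "\n".join(partials)
print(result)
```

<p>The final join corresponds to a rule whose inputs are all the partial files: it cannot start until the last of them is ready, however many workers are available.</p>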

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[85]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --cores <span class="m">6</span>
................................................................................
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> *.txt
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>I can also declare limited resources (beyond processors) so that the pipeline does not use more than is available.</p>
<p>For example, if I have rules that require a large amount of memory, I can state the expected usage in the rule and then state the available memory on the command line.</p>
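<p>The effect of such a limit can be sketched with a toy scheduler. This is a deliberately simplified illustration: <code>schedule</code> is hypothetical and groups jobs into fixed batches, whereas a real scheduler would presumably start a new job as soon as enough of the resource is free. The figures, 6 units per job and 12 available, are the ones used in this lesson.</p>

```python
def schedule(jobs, available):
    """Group jobs into batches whose total resource use fits `available`."""
    batches = []
    current, used = [], 0
    for name, cost in jobs:
        if cost > available:
            raise ValueError("{} can never fit".format(name))
        if used + cost > available:
            # The next job would exceed the limit: close the current batch.
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        batches.append(current)
    return batches


jobs = [("parziali{}.txt".format(i), 6) for i in range(4)]
batches = schedule(jobs, available=12)
print(batches)
```

<p>Even with six cores available, only two of these jobs fit at the same time, because the declared memory is the tighter limit.</p>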

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[92]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="configurazioni">Configuration<a class="anchor-link" href="#configurazioni">&#182;</a></h3><p>Any configuration parameters can be given on the command line or loaded from a configuration file in YAML or JSON format.</p>
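<p>The combination of a configuration file and explicit values can be sketched like this. It is an illustration in plain Python with the standard <code>json</code> module: <code>load_config</code> is a hypothetical helper, and letting the explicit values win over the file is an assumption of this sketch.</p>

```python
import json
import os
import tempfile


def load_config(path, overrides=None):
    """Read a JSON config file, then let explicit overrides win."""
    with open(path) as handle:
        config = json.load(handle)
    config.update(overrides or {})
    return config


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "config.json")
    with open(path, "w") as handle:
        json.dump({"number": 4}, handle)
    from_file = load_config(path)
    # Values typed by hand may arrive as text, hence the int() below.
    overridden = load_config(path, {"number": "2"})

numeri = list(range(int(from_file["number"])))
print(numeri, int(overridden["number"]))
```

<p>Converting with <code>int(...)</code> at the point of use keeps the rest of the code independent of whether the value came from the file or was typed by hand.</p>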

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[14]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

numeri = [i for i in range(int(config[&#39;number&#39;]))]

rule all:
    input:
        expand(&quot;parziali{numero}.txt&quot;, numero=numeri)
    output:
        &quot;result.txt&quot;
    shell:
        &quot;cat {input} &gt; {output}&quot;
        
rule crea_parziali:
    output:
        out = &quot;parziali{numero}.txt&quot;
    resources: 
        memory = 6
    run:
        filename = output.out
        with open(filename, &#39;w&#39;) as file:
            print(&quot;risultato di {}&quot;.format(filename), file=file)
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>Overwriting Snakefile
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[15]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> *.txt
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[16]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --cores <span class="m">6</span> --resources <span class="nv">memory</span><span class="o">=</span><span class="m">12</span> --config <span class="nv">number</span><span class="o">=</span><span class="m">4</span>
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre><span class="ansi-yellow-fg">Provided cores: 6</span>
<span class="ansi-yellow-fg">Rules claiming more threads will be scaled down.</span>
<span class="ansi-yellow-fg">Provided resources: memory=12</span>
<span class="ansi-yellow-fg">Job counts:
	count	jobs
	1	all
	4	crea_parziali
	5</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali1.txt
    jobid: 4
    wildcards: numero=1
    resources: memory=6</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali2.txt
    jobid: 2
    wildcards: numero=2
    resources: memory=6</span>

<span class="ansi-green-fg">Finished job 2.</span>
<span class="ansi-green-fg">1 of 5 steps (20%) done</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali3.txt
    jobid: 3
    wildcards: numero=3
    resources: memory=6</span>

<span class="ansi-green-fg">Finished job 4.</span>
<span class="ansi-green-fg">2 of 5 steps (40%) done</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali0.txt
    jobid: 1
    wildcards: numero=0
    resources: memory=6</span>

<span class="ansi-green-fg">Finished job 3.</span>
<span class="ansi-green-fg">3 of 5 steps (60%) done</span>
<span class="ansi-green-fg">Finished job 1.</span>
<span class="ansi-green-fg">4 of 5 steps (80%) done</span>

<span class="ansi-green-fg">rule all:
    input: parziali0.txt, parziali1.txt, parziali2.txt, parziali3.txt
    output: result.txt
    jobid: 0</span>

<span class="ansi-green-fg">Finished job 0.</span>
<span class="ansi-green-fg">5 of 5 steps (100%) done</span>
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[17]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> config.yaml
number: 4
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>Overwriting config.yaml
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[18]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

configfile: &quot;./config.yaml&quot;

numeri = [i for i in range(int(config[&#39;number&#39;]))]

rule all:
    input:
        expand(&quot;parziali{numero}.txt&quot;, numero=numeri)
    output:
        &quot;result.txt&quot;
    shell:
        &quot;cat {input} &gt; {output}&quot;
        
rule crea_parziali:
    output:
        out = &quot;parziali{numero}.txt&quot;
    resources: 
        memory = 6
    run:
        filename = output.out
        with open(filename, &#39;w&#39;) as file:
            print(&quot;risultato di {}&quot;.format(filename), file=file)
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>Overwriting Snakefile
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[19]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> *.txt
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[20]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --cores <span class="m">6</span> --resources <span class="nv">memory</span><span class="o">=</span><span class="m">12</span>
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre><span class="ansi-yellow-fg">Provided cores: 6</span>
<span class="ansi-yellow-fg">Rules claiming more threads will be scaled down.</span>
<span class="ansi-yellow-fg">Provided resources: memory=12</span>
<span class="ansi-yellow-fg">Job counts:
	count	jobs
	1	all
	4	crea_parziali
	5</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali0.txt
    jobid: 4
    wildcards: numero=0
    resources: memory=6</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali3.txt
    jobid: 3
    wildcards: numero=3
    resources: memory=6</span>

<span class="ansi-green-fg">Finished job 3.</span>
<span class="ansi-green-fg">1 of 5 steps (20%) done</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali2.txt
    jobid: 1
    wildcards: numero=2
    resources: memory=6</span>

<span class="ansi-green-fg">Finished job 4.</span>
<span class="ansi-green-fg">2 of 5 steps (40%) done</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali1.txt
    jobid: 2
    wildcards: numero=1
    resources: memory=6</span>

<span class="ansi-green-fg">Finished job 1.</span>
<span class="ansi-green-fg">3 of 5 steps (60%) done</span>
<span class="ansi-green-fg">Finished job 2.</span>
<span class="ansi-green-fg">4 of 5 steps (80%) done</span>

<span class="ansi-green-fg">rule all:
    input: parziali0.txt, parziali1.txt, parziali2.txt, parziali3.txt
    output: result.txt
    jobid: 0</span>

<span class="ansi-green-fg">Finished job 0.</span>
<span class="ansi-green-fg">5 of 5 steps (100%) done</span>
</pre>
</div>




</div>

</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Esercizio">Exercise<a class="anchor-link" href="#Esercizio">&#182;</a></h2><p>On the site you will find a link to the files for this lesson; each one contains a simple table listing a sequence of payments made by a number of people.</p>

Changes to Lezione 7 - Data pipeline e Snakemake.ipynb.

    "Se il file di input della regola non esiste, snakemake cerca un'altra regola che abbia quel file come output e la esegue prima.\n",
    "\n",
    "Questa è una struttura chiamata **pull** (in cui specifico il punto di arrivo), e richiede un po' di tempo per prenderci confidenza (lo standard della programmazione è di tipo **push**, in cui specifico il punto di partenza)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "skip"
    }
................................................................................
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/enrico/lavoro/DataProgrammingCourse\n",

      "/home/enrico/lavoro/snakemake_lesson\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'/home/enrico/lavoro/snakemake_lesson'"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%cd /home/enrico/lavoro/DataProgrammingCourse/\n",
    "%mkdir ../snakemake_lesson/\n",
    "%cd ../snakemake_lesson/\n",
    "%pwd"
   ]
  },





















  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
................................................................................
     ]
    }
   ],
   "source": [
    "%ls"
   ]
  },















  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
................................................................................
     ]
    }
   ],
   "source": [
    "!snakemake"
   ]
  },












  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
................................................................................
     ]
    }
   ],
   "source": [
    "!snakemake"
   ]
  },











  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
................................................................................
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Se i risultati intermedi esistono già, non eseguirli di nuovo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": true,
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true



   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting Snakefile\n"
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true



   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[31mWorkflowError in line 2 of /home/enrico/lavoro/snakemake_lesson/Snakefile:\r\n",
................................................................................
     ]
    }
   ],
   "source": [
    "!snakemake"
   ]
  },















  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true



   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting Snakefile\n"
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true



   },
   "outputs": [],
   "source": [
    "%rm result.txt "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true



   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "parziali1.txt  parziali2.txt  Snakefile\r\n"
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true



   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mProvided cores: 1\u001b[0m\n",
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true



   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "risultato di parziali1.txt\r\n",
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true



   },
   "outputs": [],
   "source": [
    "%rm parziali1.txt\n",
    "%rm result.txt"
   ]
  },











  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true



   },
   "outputs": [],
   "source": [
    "!snakemake --dag | dot -Tsvg > dag.svg"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true



   },
   "outputs": [],
   "source": [
    "%cp dag.svg ../DataProgrammingCourse/immagini/snakemake_dag.svg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true



   },
   "source": [
    "![visualizzazione pipeline](./immagini/snakemake_dag.svg)"
   ]
  },















  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true



   },
   "outputs": [],
   "source": [
    "!snakemake --detailed-summary > provenance.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true



   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true



   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rm: cannot remove ‘*.txt’: No such file or directory\n",
................................................................................
     ]
    }
   ],
   "source": [
    "%rm *.txt"
   ]
  },













  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true



   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mProvided cores: 6\u001b[0m\n",
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true



   },
   "outputs": [],
   "source": [
    "%rm *.txt"
   ]
  },













  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true



   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting Snakefile\n"
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true



   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mProvided cores: 6\u001b[0m\n",
................................................................................
    "!snakemake --cores 6 --resources memory=12"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true



   },
   "source": [
    "### configurazioni\n"



















































































































































































































   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true

   },
   "outputs": [],
   "source": []
  },
  {

   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },

   "outputs": [],
   "source": []
  },
  {


   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },









































   "source": [
    "### sentinel files\n",
    "\n",
    "Concetto molto semplice, creo file vuori come controlli, poi li cancello quando non mi servono più.\n",
    "\n",
    "Posso aggiornarli con un `touch` per renderli più nuovi."

   ]
  },
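The sentinel-file idea above can be sketched in plain Python. This is only an illustrative sketch: the helper names `touch_sentinel`, `sentinel_is_newer` and `remove_sentinel` are hypothetical, not part of snakemake, and `missing_ok` assumes Python 3.8 or later.

```python
from pathlib import Path


def touch_sentinel(path):
    """Create the empty sentinel file, or refresh its modification time if it exists."""
    Path(path).touch()


def sentinel_is_newer(sentinel, reference):
    """Return True if the sentinel exists and is at least as recent as `reference`."""
    sentinel, reference = Path(sentinel), Path(reference)
    return sentinel.exists() and sentinel.stat().st_mtime >= reference.stat().st_mtime


def remove_sentinel(path):
    """Delete the sentinel once it is no longer needed (no error if it is already gone)."""
    Path(path).unlink(missing_ok=True)
```

In a pipeline, a step would call `touch_sentinel` after it succeeds, and a later step could use `sentinel_is_newer` to decide whether the check is still valid.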
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true



   },
   "source": [
    "## Esercizio\n",
    "\n",
    "Nel sito trovate un link a dei file per questa lezione, ciascuno con dentro una semplice tabella che indica una sequenza di versamenti fatti da delle persone.\n",
    "\n",
    "Ci sarà anche un file che indica gli hash md5 per ciascuno di questi file.\n",
................................................................................
    "* La cartella la trovate all'indirizzo `https://chiselapp.com/user/EnricoGiampieri/repository/DataProgrammingCourse/doc/tip/snakemake_exercise/`\n",
    "* ci sono 50 file chiamati transazioni_{}.tsv con l'indice da 00 a 49\n",
    "* il file di controllo degli hash è `md5sums.tsv`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},




   "source": [
    "### suggerimenti\n",
    "\n",
    "* la funzione di hash può essere implementata in python o con il comando da terminale `md5sum`\n",
    "* i file possono essere scaricati da terminale con `wget` oppure da python con la libreria `requests`\n",
    "* usate le wildcard per ottenere i file, o non finite più!"
   ]
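For the first hint, one possible way to compute the md5 hash in Python uses the standard `hashlib` module. This is a minimal sketch: the function name `md5_of_file` and the chunk size are illustrative choices, not something prescribed by the exercise.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal md5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting string has the same hexadecimal form that the `md5sum` command prints, so it can be compared with the values listed in the hash file.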







    "Se il file di input della regola non esiste, snakemake cerca un'altra regola che abbia quel file come output e la esegue prima.\n",
    "\n",
    "Questa è una struttura chiamata **pull** (in cui specifico il punto di arrivo), e richiede un po' di tempo per prenderci confidenza (lo standard della programmazione è di tipo **push**, in cui specifico il punto di partenza)."
   ]
  },
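The pull model described above can be illustrated with a toy resolver in plain Python. This is a simplified sketch and not how snakemake is implemented: the function `resolve` and the `rules` dictionary (mapping each output file to its input files) are hypothetical, and timestamps are ignored.

```python
def resolve(target, rules, existing, plan=None):
    """Return the ordered list of files to build so that `target` can be produced."""
    plan = [] if plan is None else plan
    if target in existing or target in plan:
        return plan  # nothing to do: already present or already scheduled
    if target not in rules:
        # the file is missing and no rule has it as output
        raise FileNotFoundError(f"no rule produces {target!r}")
    for dependency in rules[target]:
        resolve(dependency, rules, existing, plan)  # build the inputs first
    plan.append(target)
    return plan


# Each key is an output file, each value the list of inputs it requires.
rules = {
    "result.txt": ["parziali1.txt", "parziali2.txt"],
    "parziali1.txt": [],
    "parziali2.txt": [],
}

print(resolve("result.txt", rules, existing={"parziali2.txt"}))
# → ['parziali1.txt', 'result.txt']
```

Starting from the requested end point, the resolver walks back through the inputs, schedules only what is missing, and fails when a missing file has no rule that produces it.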
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "skip"
    }
................................................................................
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/enrico/lavoro/DataProgrammingCourse\n",
      "mkdir: cannot create directory ‘../snakemake_lesson/’: File exists\n",
      "/home/enrico/lavoro/snakemake_lesson\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'/home/enrico/lavoro/snakemake_lesson'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%cd /home/enrico/lavoro/DataProgrammingCourse/\n",
    "%mkdir ../snakemake_lesson/\n",
    "%cd ../snakemake_lesson/\n",
    "%pwd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Uno snakefile è diviso in regole.\n",
    "\n",
    "Ciascuna regola viene eseguita come un nuovo processo python a se stante.\n",
    "\n",
    "Una regola ha delle sotto sezioni, di cui le più importanti sono:\n",
    "\n",
    "* output: la lista dei file che la regola genererà in output (è una promessa, vanno poi effettivamente creati)\n",
    "* input: la lista dei file che la regola richiede\n",
    "* shell/run: esegue uno o più comandi di shell oppure esegue del codice python arbitrario\n",
    "\n",
    "Di default, se non si specifica altro, snakemake cerca di eseguire la regola **all**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
................................................................................
     ]
    }
   ],
   "source": [
    "%ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Se il file di output di una regola esiste già, ed è più recente dei file di input, la regola non viene eseguita.\n",
    "\n",
    "Questo comportamento è detto **idempotenza**, e rende l'esecuzione dello script più prevedibile.\n",
    "\n",
    "È comunque possibile forzare la mano a snakamake in vari modi (se vi servisse, li trovate sul manuale)"
   ]
  },
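The timestamp check described above can be approximated in a few lines of Python. This is a simplified model for illustration only: `needs_rerun` is a hypothetical helper, and snakemake's real decision logic considers more than modification times.

```python
import os


def needs_rerun(output_path, input_paths):
    """Return True if the output is missing or older than at least one input."""
    if not os.path.exists(output_path):
        return True
    output_mtime = os.path.getmtime(output_path)
    # the rule must run again if any input was modified after the output
    return any(os.path.getmtime(path) > output_mtime for path in input_paths)
```

Calling the function twice without touching any file gives the same answer both times, which is the predictability that the idempotence property refers to.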
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
................................................................................
     ]
    }
   ],
   "source": [
    "!snakemake"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Se i fil di input sono specificati e:\n",
    "\n",
    "1. non esistono\n",
    "2. non esiste una regola che li produca in output\n",
    "\n",
    "Allora snakemake ritornerà un errore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
................................................................................
     ]
    }
   ],
   "source": [
    "!snakemake"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Vediamo come appare uno script con due regole distinte, una per creare i due parziali ed una per processarli."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
................................................................................
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Se i risultati intermedi esistono già, non li esegue di nuovo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": true,
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting Snakefile\n"
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[31mWorkflowError in line 2 of /home/enrico/lavoro/snakemake_lesson/Snakefile:\r\n",
................................................................................
     ]
    }
   ],
   "source": [
    "!snakemake"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Per usare le wildcard devo dare da qualche parte il comando expand, che assegna di vari possibili valori alle wildcards.\n",
    "\n",
    "Posso avere più wildcards allo stesso momento, l'importante è inizializarle tutte.\n",
    "\n",
    "Ci sono dei meccanismi per fare inferenza automatica delle wildcard, ma vi consiglio di prendere prima confidenza con la dichiarazione esplicita."
   ]
  },
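To make the role of expand concrete, here is a pure-Python approximation of what it produces. It is a sketch for illustration, not snakemake's own code: `expand_sketch` is a hypothetical name, and it assumes the default behavior of combining every value of every wildcard.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    combinations = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combinations]


print(expand_sketch("parziali{numero}.txt", numero=[0, 1, 2, 3]))
# → ['parziali0.txt', 'parziali1.txt', 'parziali2.txt', 'parziali3.txt']
```

With two wildcards, for example `expand_sketch("{a}_{b}.txt", a=["x", "y"], b=[1, 2])`, every pair of values is generated, which is why each wildcard needs its list of values.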
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting Snakefile\n"
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "%rm result.txt "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "parziali1.txt  parziali2.txt  Snakefile\r\n"
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mProvided cores: 1\u001b[0m\n",
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "risultato di parziali1.txt\r\n",
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "%rm parziali1.txt\n",
    "%rm result.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Snakamake mi permette anche di creare un grafico di flusso che visualizza tutto ciò che deve essere fatto, e ciò che invece è stato già fatto e non ha bisogno di una nuova esecuzione."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "!snakemake --dag | dot -Tsvg > dag.svg"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "%cp dag.svg ../DataProgrammingCourse/immagini/snakemake_dag.svg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "![visualizzazione pipeline](./immagini/snakemake_dag.svg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Un'altra funzione estramamente utile è la creazione di un registro di provenance, che mi indica quali file sono stati creati da quale regola e con che parametri.\n",
    "\n",
    "Questo permette di tenere traccia dell'origine di ciascun file in modo semplice.\n",
    "\n",
    "È anche facile impostarlo in modo da appendere la provenance ad un log completo, dando così la storia di tutti i file creati e modificati nel tempo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "!snakemake --detailed-summary > provenance.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rm: cannot remove ‘*.txt’: No such file or directory\n",
................................................................................
     ]
    }
   ],
   "source": [
    "%rm *.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Se voglio eseguire più regole in parallelo (ovviamente rispettando l'ordine necessario di esecuzione di ciascun ramo), mi basta dare il comando `--cores <N>` e snakemake eseguirà in automatico tutto quello che riesce in parallelo.\n",
    "\n",
    "Esiste un equivalente anche per lanciare la pipeline in un cluster di calcolo, rendendo molto semplice il calcolo distribuito."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mProvided cores: 6\u001b[0m\n",
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "%rm *.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Posso anche specificare delle risorse limitate (oltre i processori) in modo che la pipeline non ecceda nell'uso.\n",
    "\n",
    "Ad esempio, se ho delle regole che richiedono una gran quantità di memoria, posso specificare il livello atteso di occupazione nella regola e poi specificare la memoria disponibile da linea di comando."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting Snakefile\n"
................................................................................
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mProvided cores: 6\u001b[0m\n",
................................................................................
    "!snakemake --cores 6 --resources memory=12"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### configurazioni\n",
    "\n",
    "Eventuali parametri di configurazione possono essere dati da linea di comando oppure caricati da un file di configurazione in formato YAML o JSON"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting Snakefile\n"
     ]
    }
   ],
   "source": [
    "%%file Snakefile\n",
    "\n",
    "numeri = [i for i in range(int(config['number']))]\n",
    "\n",
    "rule all:\n",
    "    input:\n",
    "        expand(\"parziali{numero}.txt\", numero=numeri)\n",
    "    output:\n",
    "        \"result.txt\"\n",
    "    shell:\n",
    "        \"cat {input} > {output}\"\n",
    "        \n",
    "rule crea_parziali:\n",
    "    output:\n",
    "        out = \"parziali{numero}.txt\"\n",
    "    resources: \n",
    "        memory = 6\n",
    "    run:\n",
    "        filename = output.out\n",
    "        with open(filename, 'w') as file:\n",
    "            print(\"risultato di {}\".format(filename), file=file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "%rm *.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mProvided cores: 6\u001b[0m\n",
      "\u001b[33mRules claiming more threads will be scaled down.\u001b[0m\n",
      "\u001b[33mProvided resources: memory=12\u001b[0m\n",
      "\u001b[33mJob counts:\n",
      "\tcount\tjobs\n",
      "\t1\tall\n",
      "\t4\tcrea_parziali\n",
      "\t5\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mrule crea_parziali:\n",
      "    output: parziali1.txt\n",
      "    jobid: 4\n",
      "    wildcards: numero=1\n",
      "    resources: memory=6\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mrule crea_parziali:\n",
      "    output: parziali2.txt\n",
      "    jobid: 2\n",
      "    wildcards: numero=2\n",
      "    resources: memory=6\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mFinished job 2.\u001b[0m\n",
      "\u001b[32m1 of 5 steps (20%) done\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mrule crea_parziali:\n",
      "    output: parziali3.txt\n",
      "    jobid: 3\n",
      "    wildcards: numero=3\n",
      "    resources: memory=6\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mFinished job 4.\u001b[0m\n",
      "\u001b[32m2 of 5 steps (40%) done\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mrule crea_parziali:\n",
      "    output: parziali0.txt\n",
      "    jobid: 1\n",
      "    wildcards: numero=0\n",
      "    resources: memory=6\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mFinished job 3.\u001b[0m\n",
      "\u001b[32m3 of 5 steps (60%) done\u001b[0m\n",
      "\u001b[32mFinished job 1.\u001b[0m\n",
      "\u001b[32m4 of 5 steps (80%) done\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mrule all:\n",
      "    input: parziali0.txt, parziali1.txt, parziali2.txt, parziali3.txt\n",
      "    output: result.txt\n",
      "    jobid: 0\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mFinished job 0.\u001b[0m\n",
      "\u001b[32m5 of 5 steps (100%) done\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!snakemake --cores 6 --resources memory=12 --config number=4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting config.yaml\n"
     ]
    }
   ],
   "source": [
    "%%file config.yaml\n",
    "number: 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting Snakefile\n"
     ]
    }
   ],
   "source": [
    "%%file Snakefile\n",
    "\n",
    "configfile: \"./config.yaml\"\n",
    "\n",
    "numeri = [i for i in range(int(config['number']))]\n",
    "\n",
    "rule all:\n",
    "    input:\n",
    "        expand(\"parziali{numero}.txt\", numero=numeri)\n",
    "    output:\n",
    "        \"result.txt\"\n",
    "    shell:\n",
    "        \"cat {input} > {output}\"\n",
    "        \n",
    "rule crea_parziali:\n",
    "    output:\n",
    "        out = \"parziali{numero}.txt\"\n",
    "    resources: \n",
    "        memory = 6\n",
    "    run:\n",
    "        filename = output.out\n",
    "        with open(filename, 'w') as file:\n",
    "            print(\"risultato di {}\".format(filename), file=file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "%rm *.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mProvided cores: 6\u001b[0m\n",
      "\u001b[33mRules claiming more threads will be scaled down.\u001b[0m\n",
      "\u001b[33mProvided resources: memory=12\u001b[0m\n",
      "\u001b[33mJob counts:\n",
      "\tcount\tjobs\n",
      "\t1\tall\n",
      "\t4\tcrea_parziali\n",
      "\t5\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mrule crea_parziali:\n",
      "    output: parziali0.txt\n",
      "    jobid: 4\n",
      "    wildcards: numero=0\n",
      "    resources: memory=6\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mrule crea_parziali:\n",
      "    output: parziali3.txt\n",
      "    jobid: 3\n",
      "    wildcards: numero=3\n",
      "    resources: memory=6\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mFinished job 3.\u001b[0m\n",
      "\u001b[32m1 of 5 steps (20%) done\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mrule crea_parziali:\n",
      "    output: parziali2.txt\n",
      "    jobid: 1\n",
      "    wildcards: numero=2\n",
      "    resources: memory=6\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mFinished job 4.\u001b[0m\n",
      "\u001b[32m2 of 5 steps (40%) done\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mrule crea_parziali:\n",
      "    output: parziali1.txt\n",
      "    jobid: 2\n",
      "    wildcards: numero=1\n",
      "    resources: memory=6\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mFinished job 1.\u001b[0m\n",
      "\u001b[32m3 of 5 steps (60%) done\u001b[0m\n",
      "\u001b[32mFinished job 2.\u001b[0m\n",
      "\u001b[32m4 of 5 steps (80%) done\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mrule all:\n",
      "    input: parziali0.txt, parziali1.txt, parziali2.txt, parziali3.txt\n",
      "    output: result.txt\n",
      "    jobid: 0\u001b[0m\n",
      "\u001b[32m\u001b[0m\n",
      "\u001b[32mFinished job 0.\u001b[0m\n",
      "\u001b[32m5 of 5 steps (100%) done\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!snakemake --cores 6 --resources memory=12"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Esercizio\n",
    "\n",
    "Nel sito trovate un link a dei file per questa lezione, ciascuno con dentro una semplice tabella che indica una sequenza di versamenti fatti da delle persone.\n",
    "\n",
    "Ci sarà anche un file che indica gli hash md5 per ciascuno di questi file.\n",
................................................................................
    "* La cartella la trovate all'indirizzo `https://chiselapp.com/user/EnricoGiampieri/repository/DataProgrammingCourse/doc/tip/snakemake_exercise/`\n",
    "* ci sono 50 file chiamati transazioni_{}.tsv con l'indice da 00 a 49\n",
    "* il file di controllo degli hash è `md5sums.tsv`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### suggerimenti\n",
    "\n",
    "* la funzione di hash può essere implementata in python o con il comando da terminale `md5sum`\n",
    "* i file possono essere scaricati da terminale con `wget` oppure da python con la libreria `requests`\n",
    "* usate le wildcard per ottenere i file, o non finite più!"
   ]

Changes to Lezione 7 - Data pipeline e Snakemake.slides.html.

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The fundamental building block of Snakemake is a <strong>rule</strong>, which represents a program (typically a script), the files it needs as input and the files it will produce as output.</p>
<p>If the output file of a rule already exists, the rule is not executed (unless it is forced to run).</p>
<p>If an input file of a rule does not exist, snakemake looks for another rule that has that file as output and runs it first.</p>
<p>This structure is called <strong>pull</strong> (I specify the point of arrival) and takes some time to get used to (the standard in programming is the <strong>push</strong> style, in which I specify the starting point).</p>
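<p>A minimal sketch of the pull idea in plain Python, not Snakemake's own implementation: the rule table, the file names <code>partial.txt</code> and <code>result.txt</code> and the helper <code>pull</code> are all hypothetical, and timestamps are ignored (only the existence of the output is checked).</p>

```python
import os
import tempfile


def make_rules(workdir):
    """Toy rule table (hypothetical): output name -> (input names, action)."""
    def write(name, text):
        with open(os.path.join(workdir, name), "w") as handle:
            handle.write(text)

    def read(name):
        with open(os.path.join(workdir, name)) as handle:
            return handle.read()

    return {
        "partial.txt": ([], lambda: write("partial.txt", "partial\n")),
        "result.txt": (["partial.txt"],
                       lambda: write("result.txt", read("partial.txt").upper())),
    }


def pull(target, rules, workdir, log):
    """Build `target`, first pulling in every input that is missing."""
    if os.path.exists(os.path.join(workdir, target)):
        return  # the output already exists: the rule is skipped
    inputs, action = rules[target]
    for name in inputs:
        pull(name, rules, workdir, log)  # run the producing rule first
    action()
    log.append(target)


workdir = tempfile.mkdtemp()
rules = make_rules(workdir)
log = []
pull("result.txt", rules, workdir, log)  # runs both rules, inputs first
pull("result.txt", rules, workdir, log)  # nothing left to do
print(log)
```

<p>Only the point of arrival (<code>result.txt</code>) is requested; the order of the other steps follows from the declared inputs, and the second call does not add anything to the log.</p>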
</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[41]:</div>
<div class="inner_cell">
................................................................................
</div>
</div>

</div>
</div>

</div></div></section></section><section><section>












<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[45]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

................................................................................
</pre>
</div>
</div>

</div>
</div>
















</div></div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[49]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................
</div>
</div>

</div>
</div>

</div></div></section></section><section><section>










<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[53]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

................................................................................

</div></div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If the intermediate results already exist, they are not computed again</p>

</div>
</div>
</div><div class="fragment">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[59]:</div>
................................................................................
<div class="text_cell_render border-box-sizing rendered_html">
<p>In this script the same python function creates all the files one at a time, independently of each other.</p>
<p>They could actually be created concurrently, without waiting for one another. To do this I can use wildcards.</p>
<p>Wildcards can twist your grey matter, but let us stick to the simplest case</p>
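<p>To make the idea concrete, here is a rough sketch of how a pattern with a wildcard can be matched against a requested file name. The helper <code>match_wildcards</code> is hypothetical and only illustrates the principle; Snakemake's real matching has more rules (constraints, repeated names) that are not covered here.</p>

```python
import re


def match_wildcards(pattern, filename):
    """Return the wildcard values if `filename` fits `pattern`, else None."""
    # Splitting on {name} gives literal text at even positions
    # and wildcard names at odd positions.
    parts = re.split(r"\{(\w+)\}", pattern)
    literals, names = parts[0::2], parts[1::2]
    # Each wildcard becomes a group that matches one or more characters.
    regex = "(.+)".join(re.escape(text) for text in literals)
    found = re.fullmatch(regex, filename)
    if found is None:
        return None
    return dict(zip(names, found.groups()))


print(match_wildcards("parziali{numero}.txt", "parziali3.txt"))
print(match_wildcards("parziali{numero}.txt", "result.txt"))
```

<p>The first call gives the value of <code>numero</code> for that file, the second gives <code>None</code> because the name does not fit the pattern. Since every file name resolves to its own independent job, the jobs can run side by side.</p>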

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[61]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[62]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>




</div>









<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[71]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[66]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> result.txt 
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[72]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">ls</span>
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[73]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[74]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">cat</span> result.txt
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[75]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> parziali1.txt
<span class="o">%</span><span class="k">rm</span> result.txt
</pre></div>

</div>
</div>
</div>




</div>







<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[77]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --dag <span class="p">|</span> dot -Tsvg &gt; dag.svg
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">cp</span> dag.svg ../DataProgrammingCourse/immagini/snakemake_dag.svg
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="./immagini/snakemake_dag.svg" alt="pipeline visualization"></p>

</div>
</div>



</div>









<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[78]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --detailed-summary &gt; provenance.tsv
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[81]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">&quot;provenance.tsv&quot;</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
................................................................................
</div>

</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[83]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> *.txt
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>




</div>








<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[85]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --cores <span class="m">6</span>
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[91]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> *.txt
</pre></div>

</div>
</div>
</div>




</div>








<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[92]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[93]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --cores <span class="m">6</span> --resources <span class="nv">memory</span><span class="o">=</span><span class="m">12</span>
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="configurazioni">configuration<a class="anchor-link" href="#configurazioni">&#182;</a></h3>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span> 
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span> 
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>


<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="sentinel-files">sentinel files<a class="anchor-link" href="#sentinel-files">&#182;</a></h3><p>A very simple concept: I create empty files as checkpoints, then delete them when I no longer need them.</p>

<p>I can update them with a <code>touch</code> to make them newer.</p>
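<p>A small sketch of the life cycle of a sentinel file in Python. The file name <code>download.done</code> is a hypothetical example; the modification time is set explicitly with <code>os.utime</code> only to make the effect visible, whereas a plain <code>touch</code> would set it to the current time.</p>

```python
import os
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
sentinel = workdir / "download.done"  # hypothetical sentinel name

sentinel.touch()                      # create the empty marker file
created = sentinel.exists()
size = sentinel.stat().st_size        # an empty file has size 0

# Refresh the marker by giving it a later modification time.
before = sentinel.stat().st_mtime
os.utime(sentinel, (before + 10, before + 10))
after = sentinel.stat().st_mtime

sentinel.unlink()                     # delete it once it is no longer needed
removed = not sentinel.exists()
print(created, size, removed)
```

<p>A tool that compares modification times will treat the refreshed sentinel as newer than anything produced before the update.</p>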
</div>
</div>

</div>



<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Esercizio">Exercise<a class="anchor-link" href="#Esercizio">&#182;</a></h2><p>On the site you will find a link to some files for this lesson, each containing a simple table that lists a sequence of payments made by some people.</p>
<p>There will also be a file that lists the md5 hash of each of these files.</p>
................................................................................
<li>The folder is at <code>https://chiselapp.com/user/EnricoGiampieri/repository/DataProgrammingCourse/doc/tip/snakemake_exercise/</code></li>
<li>there are 50 files named transazioni_{}.tsv, with the index going from 00 to 49</li>
<li>the hash check file is <code>md5sums.tsv</code></li>
</ul>
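<p>As a starting point for the hash check, this is one possible way to compute the md5 of a file with the standard <code>hashlib</code> module. The helper <code>md5_of</code> is hypothetical, the sample file is a local stand-in created on the spot, and the layout of <code>md5sums.tsv</code> is not assumed here.</p>

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path, chunk_size=65536):
    """Return the hexadecimal md5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in file, so the sketch runs without downloading anything.
sample = Path(tempfile.mkdtemp()) / "transazioni_00.tsv"
sample.write_bytes(b"hello\n")

# In the exercise the expected value would come from the hash check file.
expected = hashlib.md5(b"hello\n").hexdigest()
matches = md5_of(sample) == expected
print(matches)
```

<p>The same comparison can be done from the terminal with <code>md5sum</code>; the Python version is convenient inside a <code>run:</code> block of a rule.</p>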

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="suggerimenti">hints<a class="anchor-link" href="#suggerimenti">&#182;</a></h3><ul>
<li>the hash function can be implemented in python or with the terminal command <code>md5sum</code></li>







<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The basic building block of Snakemake is a <strong>rule</strong>, which describes a program (typically a script), the files it needs as input and the files it will produce as output.</p>
<p>If the rule's output file already exists, the rule is not executed (unless you force it).</p>
<p>If one of the rule's input files does not exist, Snakemake looks for another rule that has that file as output and runs it first.</p>
<p>This is called a <strong>pull</strong> structure (I specify the point of arrival), and it takes a while to get used to: ordinary programming is <strong>push</strong> style, in which I specify the starting point.</p>
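<p>The pull idea can be sketched in a few lines of plain Python. This is only an illustration, not how Snakemake is implemented: the <code>RULES</code> table and the <code>build</code> function are hypothetical names, and timestamps are ignored. I ask only for the final file and the function works backwards to whatever is missing.</p>

```python
import os
import tempfile

# Hypothetical rule table: each output file maps to the inputs it needs.
RULES = {
    "result.txt": ["parziali1.txt", "parziali2.txt"],
    "parziali1.txt": [],
    "parziali2.txt": [],
}


def build(target, workdir, executed):
    """Pull-style build: start from the target and work backwards."""
    path = os.path.join(workdir, target)
    if os.path.exists(path):
        return  # the file is already there, nothing to do
    if target not in RULES:
        raise FileNotFoundError("no file and no rule for " + target)
    for dependency in RULES[target]:
        build(dependency, workdir, executed)  # inputs come first
    with open(path, "w") as handle:
        handle.write("produced " + target + "\n")
    executed.append(target)


def demo():
    """Request only the final file and return the order of execution."""
    executed = []
    with tempfile.TemporaryDirectory() as workdir:
        build("result.txt", workdir, executed)
    return executed


print(demo())
```

<p>The two partial files are produced before <code>result.txt</code> even though only <code>result.txt</code> was requested. Asking for a file that neither exists nor has a rule raises an error.</p>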

</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A Snakefile is divided into rules.</p>
<p>Each rule is executed as a new, separate Python process.</p>
<p>A rule has several subsections, the most important of which are:</p>
<ul>
<li>output: the list of files that the rule will generate (it is a promise: the files must then actually be created)</li>
<li>input: the list of files that the rule requires</li>
<li>shell/run: executes one or more shell commands, or arbitrary Python code</li>
</ul>
<p>By default, if no target is specified, Snakemake executes the first rule in the Snakefile, which by convention is named <strong>all</strong></p>

</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[41]:</div>
<div class="inner_cell">
................................................................................
</div>
</div>

</div>
</div>

</div></div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If a rule's output file already exists and is more recent than its input files, the rule is not executed.</p>
<p>This behaviour is called <strong>idempotence</strong>, and it makes the execution of the script more predictable.</p>
<p>It is still possible to force Snakemake's hand in various ways (if you need them, they are described in the manual).</p>
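<p>The timestamp check described above can be approximated as follows. This is a minimal sketch under the assumption that only modification times matter; <code>needs_run</code> is a hypothetical helper, and Snakemake itself may apply further criteria.</p>

```python
import os
import tempfile


def needs_run(output_path, input_paths):
    """Return True when the output is missing or older than any input."""
    if not os.path.exists(output_path):
        return True
    output_time = os.path.getmtime(output_path)
    return any(os.path.getmtime(path) > output_time for path in input_paths)


def demo():
    """Create an input and an output, then touch the input again."""
    with tempfile.TemporaryDirectory() as workdir:
        source = os.path.join(workdir, "input.txt")
        target = os.path.join(workdir, "output.txt")
        for path in (source, target):
            with open(path, "w") as handle:
                handle.write("data\n")
        os.utime(source, (1000, 1000))  # input modified first
        os.utime(target, (2000, 2000))  # output written later
        fresh = needs_run(target, [source])
        os.utime(source, (3000, 3000))  # input changed after the output
        stale = needs_run(target, [source])
    return fresh, stale


print(demo())
```

<p>The first check reports that nothing needs to run; after the input is modified, the second check reports that the output is out of date.</p>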

</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[45]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

................................................................................
</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If the input files are specified and:</p>
<ol>
<li>they do not exist</li>
<li>there is no rule that produces them as output</li>
</ol>
<p>then Snakemake will return an error.</p>

</div>
</div>
</div></div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[49]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile
................................................................................
</div>
</div>

</div>
</div>

</div></div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's see what a script looks like with two distinct rules, one that creates the two partial files and one that processes them.</p>

</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[53]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

................................................................................

</div></div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If the intermediate results already exist, Snakemake does not compute them again.</p>

</div>
</div>
</div><div class="fragment">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[59]:</div>
................................................................................
<div class="text_cell_render border-box-sizing rendered_html">
<p>In this script the same Python function creates all the files one at a time, each independently of the others.</p>
<p>In fact they could be created concurrently, without one having to wait for another. To do this I can use wildcards.</p>
<p>They can tie your brain in knots, so let's stick to the simplest case.</p>
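<p>To get a feeling for what a wildcard does, here is a small sketch that matches a file name against a pattern and recovers the wildcard value. <code>match_wildcards</code> is a hypothetical helper written for this note, and it assumes that a wildcard may match any non-empty text.</p>

```python
import re


def match_wildcards(pattern, filename):
    """Match a filename against a pattern such as 'parziali{numero}.txt'."""
    names = re.findall(r"\{(\w+)\}", pattern)
    # Escape the literal parts and turn every {name} into a capture group.
    pieces = re.split(r"\{\w+\}", pattern)
    regex = "(.+)".join(re.escape(piece) for piece in pieces)
    found = re.fullmatch(regex, filename)
    if found is None:
        return None
    return dict(zip(names, found.groups()))


print(match_wildcards("parziali{numero}.txt", "parziali3.txt"))
print(match_wildcards("parziali{numero}.txt", "result.txt"))
```

<p>A single rule written with the pattern can therefore stand for every file that matches it, and each match is an independent job.</p>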

</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[61]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

................................................................................
</pre>
</div>
</div>

</div>
</div>

</div><div class="fragment">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[62]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div></div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To use wildcards I have to call the <code>expand</code> function somewhere, which assigns the possible values to the wildcards.</p>
<p>I can have several wildcards at the same time; the important thing is to give values to all of them.</p>
<p>There are mechanisms for inferring wildcards automatically, but I recommend getting comfortable with explicit declaration first.</p>
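<p>As a rough mental model, expanding a pattern means filling it with every combination of the values given for its wildcards. The sketch below imitates that behaviour in plain Python; <code>expand_sketch</code> and the <code>sample</code> wildcard are hypothetical names, and this is not Snakemake's own code.</p>

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill the pattern with every combination of the given values."""
    names = list(wildcards)
    combinations = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combinations]


print(expand_sketch("parziali{numero}.txt", numero=[0, 1, 2]))
print(expand_sketch("{sample}_{numero}.txt", sample=["a", "b"], numero=[0, 1]))
```

<p>With two wildcards the result contains one file name for each pair of values, which is why every wildcard in the pattern must receive a list of values.</p>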

</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[71]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

................................................................................
</pre>
</div>
</div>

</div>
</div>

</div><div class="fragment">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[66]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> result.txt 
</pre></div>

</div>
</div>
</div>

</div></div><div class="fragment">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[72]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">ls</span>
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div></div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[73]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[74]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">cat</span> result.txt
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div><div class="fragment">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[75]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> parziali1.txt
<span class="o">%</span><span class="k">rm</span> result.txt
</pre></div>

</div>
</div>
</div>

</div></div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Snakemake also lets me draw a flow chart showing everything that has to be done, as well as what has already been done and does not need to run again.</p>

</div>
</div>
</div><div class="fragment">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[77]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --dag <span class="p">|</span> dot -Tsvg &gt; dag.svg
</pre></div>

</div>
</div>
</div>



</div></div><div class="fragment">











<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="./immagini/snakemake_dag.svg" alt="pipeline visualization"></p>

</div>
</div>
</div></div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Another extremely useful feature is the creation of a provenance record, which tells me which files were created by which rule and with which parameters.</p>
<p>This makes it easy to keep track of where each file came from.</p>
<p>It is also easy to set things up so that the provenance is appended to a complete log, giving the history of all the files created and modified over time.</p>
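<p>One possible way to keep such a cumulative log is sketched below. It assumes only that the summary is a tab-separated file with a header row; the file names, the <code>recorded_at</code> column and the two-column demo content are made up for illustration and do not reproduce the real summary layout.</p>

```python
import csv
import os
import tempfile
from datetime import datetime, timezone


def append_provenance(summary_path, history_path):
    """Append the rows of a TSV summary to a cumulative history file."""
    stamp = datetime.now(timezone.utc).isoformat()
    with open(summary_path, newline="") as source:
        rows = list(csv.reader(source, delimiter="\t"))
    if not rows:
        return 0
    header, records = rows[0], rows[1:]
    is_new = not os.path.exists(history_path)
    with open(history_path, "a", newline="") as target:
        writer = csv.writer(target, delimiter="\t")
        if is_new:
            writer.writerow(["recorded_at"] + header)  # header written once
        for record in records:
            writer.writerow([stamp] + record)
    return len(records)


def demo():
    """Append the same made-up summary twice and read the history back."""
    with tempfile.TemporaryDirectory() as workdir:
        summary = os.path.join(workdir, "provenance.tsv")
        history = os.path.join(workdir, "provenance_history.tsv")
        with open(summary, "w", newline="") as handle:
            # Made-up two-column summary, only for illustration.
            handle.write("output_file\trule\nresult.txt\tall\n")
        append_provenance(summary, history)
        append_provenance(summary, history)
        with open(history, newline="") as handle:
            return list(csv.reader(handle, delimiter="\t"))


lines = demo()
print(len(lines))
```

<p>Each call adds the current records with a timestamp, so the history file grows by one block per run while the header appears only once.</p>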

</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[78]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --detailed-summary &gt; provenance.tsv
</pre></div>

</div>
</div>
</div>

</div><div class="fragment">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[81]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">&quot;provenance.tsv&quot;</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
................................................................................
</div>

</div>

</div>
</div>

</div></div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[83]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> *.txt
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If I want to run several rules in parallel (respecting, of course, the required order of execution within each branch), I only need to pass the <code>--cores &lt;N&gt;</code> option and Snakemake will automatically run in parallel everything it can.</p>
<p>There is an equivalent for launching the pipeline on a computing cluster, which makes distributed computation very simple.</p>
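<p>The same pattern, independent jobs first and the step that needs all of them afterwards, can be imitated with the standard library. This is a sketch with a thread pool, chosen only for brevity; <code>run_pipeline</code> and <code>crea_parziale</code> are hypothetical names and the scheduling is far simpler than Snakemake's.</p>

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def crea_parziale(path):
    """Independent job: write one partial file."""
    with open(path, "w") as handle:
        handle.write("risultato di " + os.path.basename(path) + "\n")
    return path


def run_pipeline(workdir, count, cores):
    """Run the independent jobs on a pool, then the step that needs them all."""
    paths = [os.path.join(workdir, "parziali{}.txt".format(i)) for i in range(count)]
    with ThreadPoolExecutor(max_workers=cores) as pool:
        done = list(pool.map(crea_parziale, paths))  # waits for every job
    result = os.path.join(workdir, "result.txt")
    with open(result, "w") as out:
        for path in done:
            with open(path) as handle:
                out.write(handle.read())
    return result


def demo():
    """Build four partial files with two workers and read the result."""
    with tempfile.TemporaryDirectory() as workdir:
        result = run_pipeline(workdir, count=4, cores=2)
        with open(result) as handle:
            return handle.read().splitlines()


print(demo())
```

<p>The final step starts only after every partial file exists, which is the ordering constraint mentioned above.</p>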

</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[85]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --cores <span class="m">6</span>
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[91]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> *.txt
</pre></div>

</div>
</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>I can also declare limited resources (other than processors) so that the pipeline does not use more than is available.</p>
<p>For example, if some rules need a large amount of memory, I can state the expected usage in the rule and then give the available memory on the command line.</p>
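<p>The effect of a resource limit can be illustrated with a simple greedy plan. This is a sketch only: <code>plan_batches</code> is a hypothetical function that works in fixed batches, whereas a real scheduler can start a new job as soon as another one frees its share.</p>

```python
def plan_batches(jobs, available):
    """Group jobs greedily so that each batch fits in the available amount."""
    batches, current, used = [], [], 0
    for name, needed in jobs:
        if needed > available:
            raise ValueError(name + " can never fit in the available resources")
        if used + needed > available:
            batches.append(current)  # close the batch that is full
            current, used = [], 0
        current.append(name)
        used += needed
    if current:
        batches.append(current)
    return batches


# Four jobs that each declare 6 units of memory, with 12 units available.
jobs = [("parziali{}.txt".format(i), 6) for i in range(4)]
print(plan_batches(jobs, available=12))
```

<p>With 6 units per job and 12 available, no more than two jobs are planned at the same time, however many cores are free.</p>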

</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[92]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

................................................................................
</pre>
</div>
</div>

</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[93]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --cores <span class="m">6</span> --resources <span class="nv">memory</span><span class="o">=</span><span class="m">12</span>
</pre></div>
................................................................................
</pre>
</div>
</div>

</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="configurazioni">Configuration<a class="anchor-link" href="#configurazioni">&#182;</a></h3><p>Configuration parameters can be given on the command line or loaded from a configuration file in YAML or JSON format.</p>
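<p>The interplay between the two sources can be pictured like this. The sketch assumes that values given on the command line take precedence over those in the file, and it uses JSON only because the parser is in the standard library; <code>load_config</code> and <code>config.json</code> are hypothetical names.</p>

```python
import json
import os
import tempfile


def load_config(path, overrides=None):
    """Read a JSON configuration file and apply command-line style overrides."""
    with open(path) as handle:
        config = json.load(handle)
    config.update(overrides or {})  # overrides win over the file
    return config


def demo():
    """Load the same file with and without an override."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "config.json")
        with open(path, "w") as handle:
            json.dump({"number": 4}, handle)
        from_file = load_config(path)
        overridden = load_config(path, {"number": 2})
    return from_file, overridden


print(demo())
```

<p>The file supplies the default and the override replaces it for a single run, without editing the file.</p>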

</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[14]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

# How many partial files to create; the value comes from the configuration
numeri = list(range(int(config[&#39;number&#39;])))

# Target rule: collects all the partial files into result.txt
rule all:
    input:
        expand(&quot;parziali{numero}.txt&quot;, numero=numeri)
    output:
        &quot;result.txt&quot;
    shell:
        &quot;cat {input} &gt; {output}&quot;

# Creates one partial file for each value of the numero wildcard
rule crea_parziali:
    output:
        out = &quot;parziali{numero}.txt&quot;
    resources:
        memory = 6
    run:
        filename = output.out
        with open(filename, &#39;w&#39;) as file:
            print(&quot;risultato di {}&quot;.format(filename), file=file)
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>Overwriting Snakefile
</pre>
</div>
</div>

</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[15]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> *.txt
</pre></div>

</div>
</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[16]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --cores <span class="m">6</span> --resources <span class="nv">memory</span><span class="o">=</span><span class="m">12</span> --config <span class="nv">number</span><span class="o">=</span><span class="m">4</span>
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre><span class="ansi-yellow-fg">Provided cores: 6</span>
<span class="ansi-yellow-fg">Rules claiming more threads will be scaled down.</span>
<span class="ansi-yellow-fg">Provided resources: memory=12</span>
<span class="ansi-yellow-fg">Job counts:
	count	jobs
	1	all
	4	crea_parziali
	5</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali1.txt
    jobid: 4
    wildcards: numero=1
    resources: memory=6</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali2.txt
    jobid: 2
    wildcards: numero=2
    resources: memory=6</span>

<span class="ansi-green-fg">Finished job 2.</span>
<span class="ansi-green-fg">1 of 5 steps (20%) done</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali3.txt
    jobid: 3
    wildcards: numero=3
    resources: memory=6</span>

<span class="ansi-green-fg">Finished job 4.</span>
<span class="ansi-green-fg">2 of 5 steps (40%) done</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali0.txt
    jobid: 1
    wildcards: numero=0
    resources: memory=6</span>

<span class="ansi-green-fg">Finished job 3.</span>
<span class="ansi-green-fg">3 of 5 steps (60%) done</span>
<span class="ansi-green-fg">Finished job 1.</span>
<span class="ansi-green-fg">4 of 5 steps (80%) done</span>

<span class="ansi-green-fg">rule all:
    input: parziali0.txt, parziali1.txt, parziali2.txt, parziali3.txt
    output: result.txt
    jobid: 0</span>

<span class="ansi-green-fg">Finished job 0.</span>
<span class="ansi-green-fg">5 of 5 steps (100%) done</span>
</pre>
</div>
</div>

</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[17]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> config.yaml
number: 4
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>Overwriting config.yaml
</pre>
</div>
</div>

</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[18]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%</span><span class="k">file</span> Snakefile

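# Configuration values are read from this file
configfile: &quot;./config.yaml&quot;

# How many partial files to create; the value comes from the configuration
numeri = list(range(int(config[&#39;number&#39;])))

# Target rule: collects all the partial files into result.txt
rule all:
    input:
        expand(&quot;parziali{numero}.txt&quot;, numero=numeri)
    output:
        &quot;result.txt&quot;
    shell:
        &quot;cat {input} &gt; {output}&quot;

# Creates one partial file for each value of the numero wildcard
rule crea_parziali:
    output:
        out = &quot;parziali{numero}.txt&quot;
    resources:
        memory = 6
    run:
        filename = output.out
        with open(filename, &#39;w&#39;) as file:
            print(&quot;risultato di {}&quot;.format(filename), file=file)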
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>Overwriting Snakefile
</pre>
</div>
</div>

</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[19]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">rm</span> *.txt
</pre></div>

</div>
</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[20]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>snakemake --cores <span class="m">6</span> --resources <span class="nv">memory</span><span class="o">=</span><span class="m">12</span>
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">

<div class="output">



<div class="output_area">
<div class="prompt"></div>


<div class="output_subarea output_stream output_stdout output_text">
<pre><span class="ansi-yellow-fg">Provided cores: 6</span>
<span class="ansi-yellow-fg">Rules claiming more threads will be scaled down.</span>
<span class="ansi-yellow-fg">Provided resources: memory=12</span>
<span class="ansi-yellow-fg">Job counts:
	count	jobs
	1	all
	4	crea_parziali
	5</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali0.txt
    jobid: 4
    wildcards: numero=0
    resources: memory=6</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali3.txt
    jobid: 3
    wildcards: numero=3
    resources: memory=6</span>

<span class="ansi-green-fg">Finished job 3.</span>
<span class="ansi-green-fg">1 of 5 steps (20%) done</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali2.txt
    jobid: 1
    wildcards: numero=2
    resources: memory=6</span>

<span class="ansi-green-fg">Finished job 4.</span>
<span class="ansi-green-fg">2 of 5 steps (40%) done</span>

<span class="ansi-green-fg">rule crea_parziali:
    output: parziali1.txt
    jobid: 2
    wildcards: numero=1
    resources: memory=6</span>

<span class="ansi-green-fg">Finished job 1.</span>
<span class="ansi-green-fg">3 of 5 steps (60%) done</span>
<span class="ansi-green-fg">Finished job 2.</span>
<span class="ansi-green-fg">4 of 5 steps (80%) done</span>

<span class="ansi-green-fg">rule all:
    input: parziali0.txt, parziali1.txt, parziali2.txt, parziali3.txt
    output: result.txt
    jobid: 0</span>

<span class="ansi-green-fg">Finished job 0.</span>
<span class="ansi-green-fg">5 of 5 steps (100%) done</span>
</pre>
</div>
</div>

</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Esercizio">Exercise<a class="anchor-link" href="#Esercizio">&#182;</a></h2><p>On the site you will find a link to some files for this lesson, each containing a simple table that records a sequence of payments made by a number of people.</p>
<p>There will also be a file listing the MD5 hash of each of these files.</p>
................................................................................
<li>The folder is available at <code>https://chiselapp.com/user/EnricoGiampieri/repository/DataProgrammingCourse/doc/tip/snakemake_exercise/</code></li>
<li>there are 50 files named <code>transazioni_{}.tsv</code>, with the index running from 00 to 49</li>
<li>the hash control file is <code>md5sums.tsv</code></li>
</ul>
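<p>As a starting point for the hash check, the MD5 digest of a file can be computed with the standard library. This is a minimal sketch: <code>md5_of_file</code> is a hypothetical helper, the demo file content is invented, and nothing is assumed about the layout of <code>md5sums.tsv</code>.</p>

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Hash a small temporary file that stands in for one of the tables."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "transazioni_00.tsv")
        with open(path, "wb") as handle:
            handle.write(b"hello")
        return md5_of_file(path)


print(demo())
```

<p>Reading in chunks keeps memory use constant even for large files; the resulting string can then be compared with the value recorded for that file.</p>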

</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="suggerimenti">Hints<a class="anchor-link" href="#suggerimenti">&#182;</a></h3><ul>
<li>the hash function can be implemented in Python or with the <code>md5sum</code> terminal command</li>