Data Programming Course  Check-in [57209fe163]


Overview
Comment: finished the slides for the final exercise
SHA1: 57209fe1637674983661c65c413264f224c78945
User & Date: EnricoGiampieri 2017-03-17 09:17:07.108
Context
2017-03-17
09:20
fixed the display of the slides for the final exercise Leaf check-in: 41e97fa514 user: EnricoGiampieri tags: trunk
09:17
finished the slides for the final exercise check-in: 57209fe163 user: EnricoGiampieri tags: trunk
2017-03-16
15:47
added the solution scripts for the old lessons and the demonstration scripts that were used check-in: 5cabac5e1f user: EnricoGiampieri tags: trunk
Changes
Changes to Esercitazione finale.html.
<div class="text_cell_render border-box-sizing rendered_html">
<p>For the more daring: there is also the citation network of the theoretical papers</p>
<h2 id="http://snap.stanford.edu/data/cit-HepTh.html"><a href="http://snap.stanford.edu/data/cit-HepTh.html">http://snap.stanford.edu/data/cit-HepTh.html</a><a class="anchor-link" href="#http://snap.stanford.edu/data/cit-HepTh.html">&#182;</a></h2><p>which also contains the article metadata, including the authors' names.</p>
<p>Try parsing the authors file and extracting the number of authors per paper, then correlate it with the success of that paper!</p>

</div>
</div>














</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[41]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">mkdir</span> esercitazione
<span class="o">%</span><span class="k">cd</span> esercitazione
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">







<div class="text_cell_render border-box-sizing rendered_html">
<p>For the more daring: there is also the citation network of the theoretical papers</p>
<h2 id="http://snap.stanford.edu/data/cit-HepTh.html"><a href="http://snap.stanford.edu/data/cit-HepTh.html">http://snap.stanford.edu/data/cit-HepTh.html</a><a class="anchor-link" href="#http://snap.stanford.edu/data/cit-HepTh.html">&#182;</a></h2><p>which also contains the article metadata, including the authors' names.</p>
<p>Try parsing the authors file and extracting the number of authors per paper, then correlate it with the success of that paper!</p>

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="file-di-riferimento">reference files<a class="anchor-link" href="#file-di-riferimento">&#182;</a></h2><h3 id="fisica-delle-alte-energie,-sperimentale">high-energy physics, phenomenology (hep-ph)<a class="anchor-link" href="#fisica-delle-alte-energie,-sperimentale">&#182;</a></h3><h4 id="collegamenti-fra-gli-ID-degli-articoli">links between article IDs<a class="anchor-link" href="#collegamenti-fra-gli-ID-degli-articoli">&#182;</a></h4><p><a href="http://snap.stanford.edu/data/cit-HepPh.txt.gz">http://snap.stanford.edu/data/cit-HepPh.txt.gz</a></p>
<h4 id="data-di-pubblicazione-degli-articoli">publication date of the articles<a class="anchor-link" href="#data-di-pubblicazione-degli-articoli">&#182;</a></h4><p><a href="http://snap.stanford.edu/data/cit-HepPh-dates.txt.gz">http://snap.stanford.edu/data/cit-HepPh-dates.txt.gz</a></p>
<h3 id="fisica-delle-alte-energie,-teorica">high-energy physics, theory (hep-th)<a class="anchor-link" href="#fisica-delle-alte-energie,-teorica">&#182;</a></h3><h4 id="collegamenti-fra-gli-ID-degli-articoli">links between article IDs<a class="anchor-link" href="#collegamenti-fra-gli-ID-degli-articoli">&#182;</a></h4><p><a href="http://snap.stanford.edu/data/cit-HepTh.txt.gz">http://snap.stanford.edu/data/cit-HepTh.txt.gz</a></p>
<h4 id="data-di-pubblicazione-degli-articoli">publication date of the articles<a class="anchor-link" href="#data-di-pubblicazione-degli-articoli">&#182;</a></h4><p><a href="http://snap.stanford.edu/data/cit-HepTh-dates.txt.gz">http://snap.stanford.edu/data/cit-HepTh-dates.txt.gz</a></p>
<h4 id="abstract-degli-articoli-con-gli-autori">article abstracts with their authors<a class="anchor-link" href="#abstract-degli-articoli-con-gli-autori">&#182;</a></h4><p><a href="http://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz">http://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz</a></p>

</div>
</div>
</div>
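As a starting point for working with the link files listed above, here is a minimal sketch of counting how often each paper is cited. It assumes (this is not confirmed by the notebook) that each data line holds two whitespace-separated IDs with the citing paper first, and that lines starting with `#` are comments. The function name `count_citations` and the sample data are made up for illustration.

```python
from collections import Counter
import io


def count_citations(lines):
    """Count how many times each paper ID appears as the cited paper.

    Assumption: every data line is "<citing_id> <cited_id>" separated by
    whitespace, and lines starting with '#' are comments to skip.
    """
    cited = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        _citing_id, cited_id = line.split()[:2]
        cited[cited_id] += 1
    return cited


# Tiny made-up sample standing in for the extracted link file
sample = io.StringIO("# FromNodeId\tToNodeId\n1\t2\n3\t2\n3\t1\n")
citations = count_citations(sample)
print(citations.most_common(1))
```

With the real data, the same function could be called on an open file, for example `with open("cit-HepPh.txt") as handle: count_citations(handle)`, once the archive has been downloaded and extracted.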
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[1]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">cd</span> esercitazione

</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[42]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepPh.txt.gz
<span class="o">!</span>gunzip -k cit-HepPh.txt.gz
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">











































<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>--2017-03-16 15:45:34--  http://snap.stanford.edu/data/cit-HepPh.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1664504 (1.6M) [application/x-gzip]
Saving to: ‘cit-HepPh.txt.gz.1’



cit-HepPh.txt.gz.1  100%[=====================&gt;]   1.59M   805KB/s   in 2.0s   

2017-03-16 15:45:47 (805 KB/s) - ‘cit-HepPh.txt.gz.1’ saved [1664504/1664504]

















</pre>

</div>
</div>










</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[43]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepPh-dates.txt.gz
<span class="o">!</span>gunzip -k cit-HepPh-dates.txt.gz
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>--2017-03-16 15:45:48--  http://snap.stanford.edu/data/cit-HepPh-dates.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96569 (94K) [application/x-gzip]
Saving to: ‘cit-HepPh-dates.txt.gz.1’

cit-HepPh-dates.txt 100%[=====================&gt;]  94.31K   174KB/s   in 0.5s   

2017-03-16 15:45:49 (174 KB/s) - ‘cit-HepPh-dates.txt.gz.1’ saved [96569/96569]

</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[44]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepTh-dates.txt.gz













<span class="o">!</span>gunzip -k cit-HepTh-dates.txt.gz
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>--2017-03-16 15:45:50--  http://snap.stanford.edu/data/cit-HepTh-dates.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96569 (94K) [application/x-gzip]
Saving to: ‘cit-HepTh-dates.txt.gz’

cit-HepTh-dates.txt 100%[=====================&gt;]  94.31K   175KB/s   in 0.5s   

2017-03-16 15:45:51 (175 KB/s) - ‘cit-HepTh-dates.txt.gz’ saved [96569/96569]

</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[45]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepTh.txt.gz
<span class="o">!</span>gunzip -k cit-HepTh.txt.gz
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>--2017-03-16 15:45:53--  http://snap.stanford.edu/data/cit-HepTh.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1317497 (1.3M) [application/x-gzip]
Saving to: ‘cit-HepTh.txt.gz’

cit-HepTh.txt.gz    100%[=====================&gt;]   1.26M   709KB/s   in 1.8s   

2017-03-16 15:45:55 (709 KB/s) - ‘cit-HepTh.txt.gz’ saved [1317497/1317497]

</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz







</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[2]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">os</span>
<span class="n">os</span><span class="o">.</span><span class="n">getcwd</span><span class="p">()</span>
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt output_prompt">Out[2]:</div>



<div class="output_text output_subarea output_execute_result">
<pre>&#39;/home/enrico/lavoro/DataProgrammingCourse/esercitazione&#39;</pre>
</div>

</div>

</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's see how to download and extract the files with Python 3, using only the standard library.</p>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[3]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">urllib.request</span> <span class="k">import</span> <span class="n">urlretrieve</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">&quot;http://snap.stanford.edu/data/&quot;</span>
<span class="n">filename</span> <span class="o">=</span> <span class="s2">&quot;cit-HepPh.txt.gz&quot;</span>
<span class="n">local_filename</span><span class="p">,</span> <span class="n">headers</span> <span class="o">=</span> <span class="n">urlretrieve</span><span class="p">(</span><span class="n">url</span><span class="o">+</span><span class="n">filename</span><span class="p">,</span> <span class="n">filename</span><span class="p">)</span>
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[9]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">gzip</span>
<span class="k">with</span> <span class="n">gzip</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">&#39;cit-HepPh.txt.gz&#39;</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">source</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;cit-HepPh_python.txt&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">destination</span><span class="p">:</span>
        <span class="n">destination</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">source</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</pre></div>



</div>
</div>
</div>

</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In general, archives can be extracted more easily than this, but for some reason unclear to me the plain ".gz" format does not seem to be supported directly.</p>

</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">shutil</span> <span class="k">import</span> <span class="n">unpack_archive</span>
<span class="n">unpack_archive</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
</pre></div>

</div>
</div>
</div>

</div>
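One possible workaround, sketched here as an assumption rather than as the course's method, is to teach `shutil.unpack_archive` about single-file gzip archives with `shutil.register_unpack_format`. The format name `plain_gz` and the helper `unpack_plain_gz` are hypothetical names chosen for this example.

```python
import gzip
import os
import shutil


def unpack_plain_gz(filename, extract_dir):
    """Decompress one .gz file into extract_dir, dropping the .gz suffix."""
    target_name = os.path.basename(filename)[: -len(".gz")]
    target_path = os.path.join(extract_dir, target_name)
    with gzip.open(filename, "rb") as source, open(target_path, "wb") as destination:
        # Copy in chunks so the whole file never has to fit in memory
        shutil.copyfileobj(source, destination)


# "plain_gz" is a hypothetical format name; register it only once per session
registered = [name for name, _extensions, _description in shutil.get_unpack_formats()]
if "plain_gz" not in registered:
    shutil.register_unpack_format(
        "plain_gz", [".gz"], unpack_plain_gz, description="single gzip-compressed file"
    )
```

After this registration, a call such as `shutil.unpack_archive("cit-HepPh.txt.gz")` should write `cit-HepPh.txt` to the current directory, while `.tar.gz` files are still handled by the built-in tar support because that format is matched first.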
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>From the command line, I can use the wget and gunzip commands.</p>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[5]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepPh.txt.gz

</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>--2017-03-17 09:48:12--  http://snap.stanford.edu/data/cit-HepPh.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1664504 (1.6M) [application/x-gzip]
Saving to: ‘cit-HepPh.txt.gz’

cit-HepPh.txt.gz    100%[=====================&gt;]   1.59M   555KB/s   in 2.9s   

2017-03-17 09:48:25 (555 KB/s) - ‘cit-HepPh.txt.gz’ saved [1664504/1664504]

</pre>
</div>
</div>

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[6]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>gunzip -k cit-HepPh.txt.gz
</pre></div>

</div>
</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[10]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>md5sum cit-HepPh.txt
<span class="o">!</span>md5sum cit-HepPh_python.txt
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">






<pre>e79f6ef17a4b0a2e94959af6fa88de72  cit-HepPh.txt

e79f6ef17a4b0a2e94959af6fa88de72  cit-HepPh_python.txt


</pre>
</div>
</div>

</div>
</div>

</div>
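The `md5sum` comparison above can also be done without leaving Python. The following is a minimal sketch using `hashlib` from the standard library; the helper name `md5_of_file` and the chunk size are choices made for this example, not something taken from the notebook.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `md5_of_file("cit-HepPh.txt")` with `md5_of_file("cit-HepPh_python.txt")` should then tell whether the file extracted with `gunzip` and the one extracted with the `gzip` module are identical.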
<div class="cell border-box-sizing text_cell rendered">

<div class="prompt input_prompt">






</div>




<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>For those who want to download the abstracts file: since it is a tar archive rather than a plain gz file, you need to use the <code>tar</code> command instead of <code>gunzip</code>.</p>
















</div>
</div>




</div>
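For completeness, the same extraction can be sketched in Python with the standard `tarfile` module. The helper name `extract_tar_gz` is hypothetical, and the `filter="data"` argument is only passed when the running Python version provides extraction filters.

```python
import tarfile


def extract_tar_gz(archive_path, extract_dir="."):
    """Extract a gzip-compressed tar archive and return its member names."""
    # Newer Python versions offer extraction filters; use the conservative
    # "data" filter when it is available, otherwise fall back to the default.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(extract_dir, **extra)
    return names
```

Calling `extract_tar_gz("cit-HepTh-abstracts.tar.gz")` after the download should recreate the archive's folder structure in the current directory, much like `tar -xzf` does.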
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz
Changes to Esercitazione finale.ipynb.
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {


    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# ArXiv citation network for particle physics\n",
    "\n",
    "## http://snap.stanford.edu/data/cit-HepPh.html\n",
    "\n",
    "Contains a database of citations between articles on high-energy physics published on arXiv.\n",
    "\n",
    "Two tables are provided:\n",
    "* the citing-cited paper pairs\n",
    "* the publication date of each paper"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {


    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Steps of the exercise:\n",
    "\n",





{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# ArXiv citation network for particle physics\n",
    "\n",
    "## http://snap.stanford.edu/data/cit-HepPh.html\n",
    "\n",
    "Contains a database of citations between articles on high-energy physics published on arXiv.\n",
    "\n",
    "Two tables are provided:\n",
    "* the citing-cited paper pairs\n",
    "* the publication date of each paper"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Steps of the exercise:\n",
    "\n",
    "* evolution of citations over time\n",
    "    - how does the number of citations of each paper change over time?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {


    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "For the more daring: there is also the citation network of the theoretical papers\n",
    "\n",
    "## http://snap.stanford.edu/data/cit-HepTh.html\n",
    "\n",
    "which also contains the article metadata, including the authors' names.\n",
    "\n",
    "Try parsing the authors file and extracting the number of authors per paper, then correlate it with the success of that paper!"
   ]
  },
  {



































   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false,


    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/enrico/lavoro/DataProgrammingCourse/esercitazione\n"
     ]
    }
   ],
   "source": [
    "%mkdir esercitazione\n",
    "%cd esercitazione"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2017-03-16 15:45:34--  http://snap.stanford.edu/data/cit-HepPh.txt.gz\n",
      "Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80\n",
      "Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 1664504 (1.6M) [application/x-gzip]\n",
      "Saving to: ‘cit-HepPh.txt.gz.1’\n",
      "\n",
      "cit-HepPh.txt.gz.1  100%[=====================>]   1.59M   805KB/s   in 2.0s   \n",
      "\n",
      "2017-03-16 15:45:47 (805 KB/s) - ‘cit-HepPh.txt.gz.1’ saved [1664504/1664504]\n",
      "\n"

     ]




    }
   ],
   "source": [



























    "!wget http://snap.stanford.edu/data/cit-HepPh.txt.gz\n",
    "!gunzip -k cit-HepPh.txt.gz"

   ]
  },
  {
   "cell_type": "code",




























   "execution_count": 43,























   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2017-03-16 15:45:48--  http://snap.stanford.edu/data/cit-HepPh-dates.txt.gz\n",
      "Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80\n",
      "Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 96569 (94K) [application/x-gzip]\n",
      "Saving to: ‘cit-HepPh-dates.txt.gz.1’\n",
      "\n",
      "cit-HepPh-dates.txt 100%[=====================>]  94.31K   174KB/s   in 0.5s   \n",
      "\n",
      "2017-03-16 15:45:49 (174 KB/s) - ‘cit-HepPh-dates.txt.gz.1’ saved [96569/96569]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget http://snap.stanford.edu/data/cit-HepPh-dates.txt.gz\n",















    "!gunzip -k cit-HepPh-dates.txt.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2017-03-16 15:45:50--  http://snap.stanford.edu/data/cit-HepTh-dates.txt.gz\n",
      "Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80\n",
      "Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 96569 (94K) [application/x-gzip]\n",
      "Saving to: ‘cit-HepTh-dates.txt.gz’\n",
      "\n",
      "cit-HepTh-dates.txt 100%[=====================>]  94.31K   175KB/s   in 0.5s   \n",
      "\n",
      "2017-03-16 15:45:51 (175 KB/s) - ‘cit-HepTh-dates.txt.gz’ saved [96569/96569]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget http://snap.stanford.edu/data/cit-HepTh-dates.txt.gz\n",
    "!gunzip -k cit-HepTh-dates.txt.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2017-03-16 15:45:53--  http://snap.stanford.edu/data/cit-HepTh.txt.gz\n",
      "Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80\n",
      "Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 1317497 (1.3M) [application/x-gzip]\n",
      "Saving to: ‘cit-HepTh.txt.gz’\n",
      "\n",
      "cit-HepTh.txt.gz    100%[=====================>]   1.26M   709KB/s   in 1.8s   \n",
      "\n",
      "2017-03-16 15:45:55 (709 KB/s) - ‘cit-HepTh.txt.gz’ saved [1317497/1317497]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget http://snap.stanford.edu/data/cit-HepTh.txt.gz\n",
    "!gunzip -k cit-HepTh.txt.gz"

   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,


    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "!wget http://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz\n",
    "# this command will create several folders containing the individual abstract files\n",
    "!tar -xzf cit-HepTh-abstracts.tar.gz"







    "* evolution of citations over time\n",
    "    - how does the number of citations of each paper change over time?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "For the more daring: there is also the citation network of the theoretical papers\n",
    "\n",
    "## http://snap.stanford.edu/data/cit-HepTh.html\n",
    "\n",
    "which also contains the article metadata, including the authors' names.\n",
    "\n",
    "Try parsing the authors file and extracting the number of authors per paper, then correlate it with the success of that paper!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## reference files\n",
    "\n",
    "### high-energy physics, phenomenology (hep-ph)\n",
    "\n",
    "#### links between article IDs\n",
    "\n",
    "http://snap.stanford.edu/data/cit-HepPh.txt.gz\n",
    "\n",
    "#### publication date of the articles\n",
    "\n",
    "http://snap.stanford.edu/data/cit-HepPh-dates.txt.gz\n",
    "\n",
    "### high-energy physics, theory (hep-th)\n",
    "\n",
    "#### links between article IDs\n",
    "\n",
    "http://snap.stanford.edu/data/cit-HepTh.txt.gz\n",
    "\n",
    "#### publication date of the articles\n",
    "\n",
    "http://snap.stanford.edu/data/cit-HepTh-dates.txt.gz\n",
    "\n",
    "#### article abstracts with their authors\n",
    "\n",
    "http://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/enrico/lavoro/DataProgrammingCourse/esercitazione\n"
     ]
    }
   ],
   "source": [

    "%cd esercitazione"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'/home/enrico/lavoro/DataProgrammingCourse/esercitazione'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "os.getcwd()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Let's see how to download and extract the files with Python 3, using only the standard library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from urllib.request import urlretrieve\n",
    "url = \"http://snap.stanford.edu/data/\"\n",
    "filename = \"cit-HepPh.txt.gz\"\n",
    "local_filename, headers = urlretrieve(url+filename, filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import gzip\n",
    "with gzip.open('cit-HepPh.txt.gz', 'rb') as source:\n",
    "    with open('cit-HepPh_python.txt', 'wb') as destination:\n",
    "        destination.write(source.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "In general, archives can be extracted more easily with `shutil.unpack_archive`, but a bare \".gz\" file is not supported directly: gzip compresses a single stream and is not an archive format, so `shutil` only registers container formats such as zip and tar (including .tar.gz)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from shutil import unpack_archive\n",
    "# expected to fail with shutil.ReadError: a bare .gz file is not a registered archive format\n",
    "unpack_archive(filename)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "From the command line, the wget and gunzip commands do the same job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2017-03-17 09:48:12--  http://snap.stanford.edu/data/cit-HepPh.txt.gz\n",
      "Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80\n",
      "Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 1664504 (1.6M) [application/x-gzip]\n",
      "Saving to: ‘cit-HepPh.txt.gz’\n",
      "\n",
      "cit-HepPh.txt.gz    100%[=====================>]   1.59M   555KB/s   in 2.9s   \n",
      "\n",
      "2017-03-17 09:48:25 (555 KB/s) - ‘cit-HepPh.txt.gz’ saved [1664504/1664504]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget http://snap.stanford.edu/data/cit-HepPh.txt.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "!gunzip -k cit-HepPh.txt.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "e79f6ef17a4b0a2e94959af6fa88de72  cit-HepPh.txt\n",
      "e79f6ef17a4b0a2e94959af6fa88de72  cit-HepPh_python.txt\n"
     ]
    }
   ],
   "source": [
    "!md5sum cit-HepPh.txt\n",
    "!md5sum cit-HepPh_python.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "To download the abstracts file as well: it is a gzip-compressed tar archive (.tar.gz) rather than a single gzipped file, so it has to be extracted with `tar` instead of `gunzip`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "!wget http://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz\n",
    "# extracting the archive creates several folders containing the individual abstract files\n",
    "!tar -xzf cit-HepTh-abstracts.tar.gz"
Changes to Esercitazione finale.slides.html.
<p>For the most daring: there is also the citation network of the theory papers (hep-th)</p>
<h2 id="http://snap.stanford.edu/data/cit-HepTh.html"><a href="http://snap.stanford.edu/data/cit-HepTh.html">http://snap.stanford.edu/data/cit-HepTh.html</a><a class="anchor-link" href="#http://snap.stanford.edu/data/cit-HepTh.html">&#182;</a></h2><p>which also includes metadata about the articles, including the authors' names.</p>
<p>Try parsing the authors file to extract the number of authors per paper, then correlate it with that paper's success!</p>

</div>
</div>
</div></section></section><section><section>
























<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[41]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">mkdir</span> esercitazione




<span class="o">%</span><span class="k">cd</span> esercitazione














</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>/home/enrico/lavoro/DataProgrammingCourse/esercitazione
</pre>
</div>
</div>



</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[42]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepPh.txt.gz
<span class="o">!</span>gunzip -k cit-HepPh.txt.gz
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>--2017-03-16 15:45:34--  http://snap.stanford.edu/data/cit-HepPh.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1664504 (1.6M) [application/x-gzip]
Saving to: ‘cit-HepPh.txt.gz.1’

cit-HepPh.txt.gz.1  100%[=====================&gt;]   1.59M   805KB/s   in 2.0s   

2017-03-16 15:45:47 (805 KB/s) - ‘cit-HepPh.txt.gz.1’ saved [1664504/1664504]

</pre>
</div>
</div>

</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[43]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepPh-dates.txt.gz
<span class="o">!</span>gunzip -k cit-HepPh-dates.txt.gz
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>--2017-03-16 15:45:48--  http://snap.stanford.edu/data/cit-HepPh-dates.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96569 (94K) [application/x-gzip]
Saving to: ‘cit-HepPh-dates.txt.gz.1’

cit-HepPh-dates.txt 100%[=====================&gt;]  94.31K   174KB/s   in 0.5s   

2017-03-16 15:45:49 (174 KB/s) - ‘cit-HepPh-dates.txt.gz.1’ saved [96569/96569]

</pre>
</div>
</div>

</div>
</div>










</div></section></section><section><section>




<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[44]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepTh-dates.txt.gz
<span class="o">!</span>gunzip -k cit-HepTh-dates.txt.gz
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>--2017-03-16 15:45:50--  http://snap.stanford.edu/data/cit-HepTh-dates.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96569 (94K) [application/x-gzip]
Saving to: ‘cit-HepTh-dates.txt.gz’

cit-HepTh-dates.txt 100%[=====================&gt;]  94.31K   175KB/s   in 0.5s   

2017-03-16 15:45:51 (175 KB/s) - ‘cit-HepTh-dates.txt.gz’ saved [96569/96569]

</pre>
</div>
</div>

</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[45]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepTh.txt.gz
<span class="o">!</span>gunzip -k cit-HepTh.txt.gz
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>--2017-03-16 15:45:53--  http://snap.stanford.edu/data/cit-HepTh.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1317497 (1.3M) [application/x-gzip]
Saving to: ‘cit-HepTh.txt.gz’

cit-HepTh.txt.gz    100%[=====================&gt;]   1.26M   709KB/s   in 1.8s   

2017-03-16 15:45:55 (709 KB/s) - ‘cit-HepTh.txt.gz’ saved [1317497/1317497]

</pre>
</div>
</div>

</div>
</div>

</div></section></section><section><section>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz
<span class="c1"># extracting the archive creates several folders containing the individual abstract files</span>







<p>For the most daring: there is also the citation network of the theory papers (hep-th)</p>
<h2 id="http://snap.stanford.edu/data/cit-HepTh.html"><a href="http://snap.stanford.edu/data/cit-HepTh.html">http://snap.stanford.edu/data/cit-HepTh.html</a><a class="anchor-link" href="#http://snap.stanford.edu/data/cit-HepTh.html">&#182;</a></h2><p>which also includes metadata about the articles, including the authors' names.</p>
<p>Try parsing the authors file to extract the number of authors per paper, then correlate it with that paper's success!</p>

</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="file-di-riferimento">Reference files<a class="anchor-link" href="#file-di-riferimento">&#182;</a></h2><h3 id="fisica-delle-alte-energie,-sperimentale">High-energy physics, phenomenology (hep-ph)<a class="anchor-link" href="#fisica-delle-alte-energie,-sperimentale">&#182;</a></h3><h4 id="collegamenti-fra-gli-ID-degli-articoli">Citation links between article IDs<a class="anchor-link" href="#collegamenti-fra-gli-ID-degli-articoli">&#182;</a></h4><p><a href="http://snap.stanford.edu/data/cit-HepPh.txt.gz">http://snap.stanford.edu/data/cit-HepPh.txt.gz</a></p>
<h4 id="data-di-pubblicazione-degli-articoli">Publication dates of the articles<a class="anchor-link" href="#data-di-pubblicazione-degli-articoli">&#182;</a></h4><p><a href="http://snap.stanford.edu/data/cit-HepPh-dates.txt.gz">http://snap.stanford.edu/data/cit-HepPh-dates.txt.gz</a></p>
<h3 id="fisica-delle-alte-energie,-teorica">High-energy physics, theory (hep-th)<a class="anchor-link" href="#fisica-delle-alte-energie,-teorica">&#182;</a></h3><h4 id="collegamenti-fra-gli-ID-degli-articoli">Citation links between article IDs<a class="anchor-link" href="#collegamenti-fra-gli-ID-degli-articoli">&#182;</a></h4><p><a href="http://snap.stanford.edu/data/cit-HepTh.txt.gz">http://snap.stanford.edu/data/cit-HepTh.txt.gz</a></p>
<h4 id="data-di-pubblicazione-degli-articoli">Publication dates of the articles<a class="anchor-link" href="#data-di-pubblicazione-degli-articoli">&#182;</a></h4><p><a href="http://snap.stanford.edu/data/cit-HepTh-dates.txt.gz">http://snap.stanford.edu/data/cit-HepTh-dates.txt.gz</a></p>
<h4 id="abstract-degli-articoli-con-gli-autori">Article abstracts, including the authors<a class="anchor-link" href="#abstract-degli-articoli-con-gli-autori">&#182;</a></h4><p><a href="http://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz">http://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz</a></p>

</div>
</div>
</div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's see how to download and extract the files with Python 3, using only the standard library.</p>

</div>
</div>
</div><div class="fragment">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[3]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">urllib.request</span> <span class="k">import</span> <span class="n">urlretrieve</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">&quot;http://snap.stanford.edu/data/&quot;</span>
<span class="n">filename</span> <span class="o">=</span> <span class="s2">&quot;cit-HepPh.txt.gz&quot;</span>
<span class="n">local_filename</span><span class="p">,</span> <span class="n">headers</span> <span class="o">=</span> <span class="n">urlretrieve</span><span class="p">(</span><span class="n">url</span><span class="o">+</span><span class="n">filename</span><span class="p">,</span> <span class="n">filename</span><span class="p">)</span>
</pre></div>

</div>
</div>
</div>

</div></div><div class="fragment">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[9]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">gzip</span>
<span class="k">with</span> <span class="n">gzip</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">&#39;cit-HepPh.txt.gz&#39;</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">source</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;cit-HepPh_python.txt&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">destination</span><span class="p">:</span>
        <span class="n">destination</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">source</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</pre></div>

</div>
</div>
</div>

</div></div><div class="fragment">
<div class="cell border-box-sizing text_cell rendered">



<div class="prompt input_prompt">




</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
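<p>In general, archives can be extracted more easily with <code>shutil.unpack_archive</code>, but a bare ".gz" file is not supported directly: gzip compresses a single stream and is not an archive format, so <code>shutil</code> only registers container formats such as zip and tar (including .tar.gz).</p>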

</div>
</div>

</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">shutil</span> <span class="k">import</span> <span class="n">unpack_archive</span>
<span class="n">unpack_archive</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
</pre></div>

</div>
</div>
</div>



</div></div></section></section><section><section>
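One possible workaround for the remark about bare ".gz" files, offered as a minimal sketch and not as the course's own solution: `shutil.register_unpack_format` can teach `unpack_archive` to handle an extra extension. The format name `plain_gz` and the helpers `gunzip_file` and `register_plain_gz` are hypothetical names introduced here for illustration.

```python
import gzip
import os
import shutil


def gunzip_file(archive_path, extract_dir):
    """Decompress a single-file .gz archive into extract_dir."""
    os.makedirs(extract_dir, exist_ok=True)
    # "data.txt.gz" -> "data.txt"
    target_name = os.path.basename(archive_path)[: -len(".gz")]
    target_path = os.path.join(extract_dir, target_name)
    with gzip.open(archive_path, "rb") as source:
        with open(target_path, "wb") as destination:
            # copy in chunks instead of loading the whole file in memory
            shutil.copyfileobj(source, destination)


def register_plain_gz():
    """Register the hypothetical 'plain_gz' format once per session."""
    registered = {name for name, _, _ in shutil.get_unpack_formats()}
    if "plain_gz" not in registered:
        shutil.register_unpack_format(
            "plain_gz", [".gz"], gunzip_file,
            description="single gzip-compressed file",
        )
```

After calling `register_plain_gz()`, a call such as `shutil.unpack_archive("cit-HepPh.txt.gz")` should write `cit-HepPh.txt` into the current directory. Files ending in `.tar.gz` should still be handled by the built-in tar support, since that format is registered first.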
<div class="cell border-box-sizing text_cell rendered">

<div class="prompt input_prompt">
</div>
<div class="inner_cell">






<div class="text_cell_render border-box-sizing rendered_html">

<p>From the command line, the wget and gunzip commands do the same job.</p>



</div>
</div>

</div><div class="fragment">



<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[5]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepPh.txt.gz

</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">
<pre>--2017-03-17 09:48:12--  http://snap.stanford.edu/data/cit-HepPh.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1664504 (1.6M) [application/x-gzip]
Saving to: ‘cit-HepPh.txt.gz’

cit-HepPh.txt.gz    100%[=====================&gt;]   1.59M   555KB/s   in 2.9s   

2017-03-17 09:48:25 (555 KB/s) - ‘cit-HepPh.txt.gz’ saved [1664504/1664504]

</pre>
</div>
</div>

</div>
</div>

</div></div><div class="fragment">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[6]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>gunzip -k cit-HepPh.txt.gz
</pre></div>

</div>
</div>
</div>

</div></div><div class="fragment">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[10]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>md5sum cit-HepPh.txt
<span class="o">!</span>md5sum cit-HepPh_python.txt
</pre></div>

</div>
</div>
</div>

<div class="output_wrapper">
<div class="output">


<div class="output_area">
<div class="prompt"></div>

<div class="output_subarea output_stream output_stdout output_text">






<pre>e79f6ef17a4b0a2e94959af6fa88de72  cit-HepPh.txt
e79f6ef17a4b0a2e94959af6fa88de72  cit-HepPh_python.txt
</pre>
</div>
</div>

</div>
</div>

</div></div></section></section><section><section>
<div class="cell border-box-sizing text_cell rendered">

<div class="prompt input_prompt">






</div>




<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To download the abstracts file as well: it is a gzip-compressed tar archive (.tar.gz) rather than a single gzipped file, so it has to be extracted with <code>tar</code> instead of <code>gunzip</code>.</p>
















</div>
</div>

</div>
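For those who prefer to stay in Python, the standard-library `tarfile` module can play the role of the `tar` command. The sketch below assumes the archive has already been downloaded; `extract_tar_gz` and `count_extracted_files` are hypothetical helper names, and the layout of the extracted folders is not assumed. As with `tar`, only extract archives from a source you trust.

```python
import os
import tarfile


def extract_tar_gz(archive_path, extract_dir="."):
    """Extract a gzip-compressed tar archive and return its member names."""
    with tarfile.open(archive_path, "r:gz") as archive:
        member_names = archive.getnames()
        archive.extractall(extract_dir)
    return member_names


def count_extracted_files(extract_dir):
    """Count the regular files found anywhere below extract_dir."""
    return sum(len(files) for _, _, files in os.walk(extract_dir))
```

A call such as `extract_tar_gz("cit-HepTh-abstracts.tar.gz", "abstracts")` would unpack everything under a local `abstracts` folder (a name chosen here only as an example), and `count_extracted_files("abstracts")` gives a quick sanity check on how many files arrived.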



<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget http://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz
<span class="c1"># extracting the archive creates several folders containing the individual abstract files</span>