Evolution of Collaboration in Rap


Bachelor thesis Information Science

Faculty of Science

Evolution of Collaboration in Rap

Author: Mathijs Parmentier (10788557)
Supervisor: dr. Jacobijn Sandberg
Examiner: dr. Maarten Marx
July 2018


Abstract

This paper investigated whether the structure of the co-authorship network of rappers has changed significantly over time. It was hypothesised that new technologies have made it easier for rappers to collaborate with each other and that this may have produced a denser collaboration network. The network, in which two rappers are linked when they have collaborated on a song, was created using data retrieved with the Genius API (n.d.) and a list of rappers from The Original Hip-Hop Lyrics Archive. First, network theory was used to characterise the full network. Then, the network was split up into a sequence of cumulative yearly networks, with each subnetwork containing data from a year and all years prior. Plotting the clustering coefficient for each of these years showed a stable graph that did not increase or decrease significantly. The degree increased steadily as rappers continued releasing new songs with features. These results show that new technologies have not changed the structure of the co-authorship network, and the hypothesis was therefore rejected.


Contents

Abstract
1 Introduction
2 Related Work
  2.1 Networks
    2.1.1 Pathfinding
  2.2 Co-Authorship
    2.2.1 Co-Authorship in Rap
  2.3 Rap
3 Methodology
4 Results
  4.1 General Results
  4.2 Yearly Results
5 Discussion
  5.1 Conclusion
  5.2 Limitations
  5.3 Further Research
References
A Enlarged egographs
B Haiti subnetwork
C Code


Chapter 1

Introduction

Hip-hop, or rap, is a musical genre that has become more and more popular over the years. In 2017 it even surpassed rock as the most popular genre in the United States (Nielsen, 2018). What makes rap stand out against other genres is the way rap artists collaborate. Songs often include one or more guest features, in which other rappers perform a verse on the song. These features offer listeners a taste of the featured artists' catalogues. Linking rappers together when they have collaborated creates a complex structure. This web of connections allows listeners to easily explore the genre and discover new music. The same structure also allows researchers to study the genre quantitatively. The structure of this network is similar to that of collaborating scientists, in which two scientists are linked if they are authors of the same research paper. These types of networks are called co-authorship networks, and scientific co-authorship networks are a popular research subject.

Co-authorship studies in music are less common than those in science. Gleiser and Danon (2003) were the first researchers to publish a paper on music co-authorship. They analysed the collaboration structure of jazz musicians. Smith (2006) then did a similar study of rappers. He analysed the structure and compared it to other types of co-authorship networks. In the paper he also showed that the most well-connected rapper was Snoop Dogg. There is, however, much more potential in this kind of research. In particular, the evolution of these kinds of networks is what inspired this paper. Since the inception of rap, the genre has undergone many changes. Whereas in the early days music distribution happened mainly through physical CDs, digital sales have now taken over as the main type of distribution (IFPI, 2017). Not only has the distribution side of music been radically changed by technology, the production side has also undergone major changes. Nowadays it is possible to collaborate across the world in real time by sending files back and forth.


To find out whether there has truly been a change in the way rappers collaborate, this research constructs the aforementioned co-authorship network for rappers and analyses it. The central question to be answered is:

Have new technologies made an impact on the structure of collaboration in the rap genre?

A co-authorship network can be used to detect changes in the structure. A change in the structure does not necessarily mean that new technologies have caused this change, though. To make sure that detected changes are actually relevant for this paper, care must be taken when selecting and using the data.


Chapter 2

Related Work

2.1 Networks

Networks are a popular method of modelling relationships and interactions between actors. Such networks can be constructed by representing actors with nodes. An interaction between two actors in the network is then signified by linking these nodes with an edge. The resulting network can be used to research many different kinds of social or technological phenomena. The methods used to study these networks are collectively called network theory. In a network, the number of connections a node has (signified by edges) is called the degree. Nodes with a high degree are thus better connected and are assumed to be of greater importance to the integrity of the network. To get an idea of the connectedness of a network, one could calculate its average degree. Denser networks logically have a higher average degree. A better way to get an idea of the structure of a network is to look at the degree distribution, a graph that shows how many nodes in the network have each degree. The shape of the distribution is distinctive for different types of networks (Albert & Barabási, 2002). Random networks, for example, which are formed by randomly adding edges to a set of nodes, have a degree distribution that looks like a bell curve. Outliers in this curve are nodes with either many more or many fewer edges than the average node.
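As a minimal sketch of these two measures, the average degree and degree distribution of a toy random network can be computed with the networkx library (also used in appendix C); the node and edge counts below are arbitrary illustration values, not the rapper data.

import collections
import networkx as nx

# Toy random network; the sizes are arbitrary and only serve as an illustration.
G = nx.gnm_random_graph(n=1000, m=5000, seed=42)

# Degree of every node and the average degree of the network.
degrees = [d for _, d in G.degree()]
print('average degree:', sum(degrees) / len(degrees))

# Degree distribution: how many nodes have each degree.
distribution = collections.Counter(degrees)
for degree, count in sorted(distribution.items()):
    print(degree, count)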

Most real-life networks don't follow the rules for random networks, however. In social networks, co-authorship networks, or internet networks, the degree distribution usually displays a peak with a long tail. In these types of networks, there are a few nodes that amass most of the edges. Plotting this distribution with two logarithmic axes will result in a straight diagonal line, or sometimes a slightly curved line (Newman, 2001), indicating that the networks scale logarithmically. Networks that have these characteristics are called scale-free networks. As the name suggests, these characteristics can show up at any scale. The cause of the long tail is often attributed to the concept of preferential attachment, also known as the rich-get-richer phenomenon (Albert & Barabási, 2002). When high degree nodes have a higher chance of getting new incoming edges, the inequality in the network will grow. This concept plays an important role in real-life social networks. A Twitter user with many followers will gain followers more quickly than one with fewer followers, because their tweets have more exposure to potential new followers.

The degree of a node isn't the only factor influencing whether another node connects to it. Newman (2002) shows that the degrees of both nodes matter when new edges are formed. This process, which he calls assortative mixing, happens when high degree nodes have a tendency to connect to other high degree nodes. With this definition, Newman split networks up into assortative, disassortative, and non-assortative networks. Social networks tend to be assortative, while technological and biological networks tend to be more disassortative. In disassortative networks, nodes have a higher chance of connecting to nodes with a different degree. Additionally, there are networks in which nodes have no preference for the degree of their neighbours. McPherson, Smith-Lovin, and Cook (2001) take the definition of assortative mixing further by stating that nodes tend to connect to nodes that share characteristics. These characteristics can be sociodemographic, geographic, or behavioural. Hence, preferential attachment is a type of assortative attachment; one where nodes have a higher chance of connecting to high degree nodes.
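A short sketch of how this preference can be measured in practice, assuming networkx and a toy scale-free graph rather than the rapper data:

import networkx as nx

# Toy scale-free graph. A coefficient close to +1 means high degree nodes
# prefer each other (assortative), close to -1 means they prefer low degree
# nodes (disassortative), and values around 0 mean there is no preference.
G = nx.barabasi_albert_graph(n=2000, m=3, seed=1)
r = nx.degree_assortativity_coefficient(G)
print('degree assortativity r = {:.3f}'.format(r))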

Looking at degrees isn't the only way of analysing networks. Calculating the average clustering coefficient is another method (Albert & Barabási, 2002). This coefficient is a measure of the connectedness of a node. It is calculated by determining how many of the node's neighbours are connected to each other. If all neighbours are connected to all other neighbours, the coefficient is 1. In a network with a high average clustering coefficient there is a high amount of connectivity, and nodes are connected to many other nodes. Lower clustering coefficients indicate a sparser network.
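As an illustration of this definition, a minimal networkx sketch on a four-node toy graph (not the thesis data):

import networkx as nx

# Node A has neighbours B, C and D. Of the three possible pairs of
# neighbours, only B-C is an edge, so A's clustering coefficient is 1/3.
G = nx.Graph()
G.add_edges_from([('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C')])

print(nx.clustering(G, 'A'))        # 0.333...
print(nx.average_clustering(G))     # average over all nodes of the graph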

2.1.1 Pathfinding

In the structure of a network, you can traverse between nodes by following the edges they are connected to. The set of edges between a pair of nodes is called a path. There are often many different paths between two nodes, but the most important one is the shortest path. Travers and Milgram (1967) showed that the shortest paths between randomly chosen nodes are generally much shorter than expected. This is also known as the small world phenomenon. In Travers and Milgram's experiment, it was found that the average distance between two people in America was five acquaintances. The average shortest path can also be used in other types of networks to get an idea of their connectedness. In addition to the average shortest path, the longest shortest path is another measure used to characterise networks. The longest shortest path is usually called the diameter. In some networks it isn't possible to find a path between every pair of nodes. Networks can consist of different detached components. Random networks, for example, are likely to consist of multiple components (Erdős, 1959; Albert & Barabási, 2002). The size of these components usually varies greatly. A random network has a high probability of containing one giant component and multiple much smaller components.
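A short sketch of these path measures on a five-node cycle, again assuming networkx:

import networkx as nx

# A path 0-1-2-3-4 closed into a cycle of five nodes.
G = nx.path_graph(5)
G.add_edge(0, 4)

print(nx.shortest_path_length(G, 0, 3))     # 2: going via node 4 is shortest
print(nx.average_shortest_path_length(G))   # 1.5 for this cycle
print(nx.diameter(G))                       # 2: the longest shortest path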

2.2 Co-Authorship

Collaboration in a group of people can be studied using networks by having the edges in a network denote a joint effort towards something. For example, if two scientists work together to write a paper, they are connected via an edge. The network that arises from this is called a co-authorship network. This concept can be employed in many different fields to study the collaboration in that field. Some examples are scientists writing papers (Newman, 2004), jazz musicians collaborating (Gleiser & Danon, 2003), and film actors acting in the same film. Studying co-authorship networks can reveal information about the way people collaborate with each other. Gleiser and Danon (2003) revealed, for example, that there is racial segregation in the community structure of jazz musicians.

Newman (2001) showed that, for scientific co-authorship networks, degree distributions with both axes scaled logarithmically did not form straight diagonal lines. Instead, these distributions had an exponential cutoff. He suggested that this might have been caused by the timescale of the data used. In addition to this, he also showed that scientific co-authorship networks have one giant component that makes up most of the nodes, alongside multiple smaller components, similar to the structure of random networks.

2.2.1 Co-Authorship in Rap

In an article by Smith (2006), a co-authorship network for rappers is constructed in which rappers are connected via guest verses on songs. This network showed no assortativity; well-connected rappers had no preference for whom they worked with. Furthermore, the most well-connected rapper was sought out based on both betweenness centrality and degree. With both methods, the rapper Snoop Dogg turned out to be the most well-connected.

2.3 Rap

Hip-hop as a genre differs from other musical genres in its song structure. Rappers deliver their vocals in the form of verses over beats made by producers. In this way, multiple different rappers can be featured on the same song. There are multiple reasons why an artist might be featured on someone else's song. It could merely be an artistic choice, but there are also situations in which a feature might benefit either of the rappers. When there is a disparity in the popularity of the rappers involved, the less popular artist gets exposed to the larger audience of the more popular artist. Popular artists can feature less popular artists as a way of adding variety to their songs, or as a way to help them with their career.

From the beginning, hip-hop has been a highly territorial genre (Forman, 2002; Hess, 2009). Many rappers represent the neighbourhood they grew up in by making references in their verses to local places or landmarks. These references are made to increase the credibility of the stories told in the verses. This territorialism can sometimes lead to a war between neighbourhoods or areas. The most well-known example of this is the battle between the west coast and the east coast during the 90s. Such battles are often expressed through songs in which the opposing area is denounced. Although there were tensions between areas, this did not mean that there was absolutely no collaboration between rappers from the two areas. Not all rappers adhered to the rules imposed by the battles (Hess, 2009).

The introduction of new technologies has allowed rappers to collaborate remotely. With the internet, it has become possible to send beats and verses across the world and make songs that way (Hess, 2009). This raises the question of whether it has made the structure of collaboration more tight-knit and less geographically localised. In terms of network theory, a more recent rapper network may have a higher global clustering coefficient than an earlier one. Hess also claims that the future of rap collaboration places less emphasis on geographical spaces and has more geographically diverse artists collaborating with each other.

This paper addresses the question of whether new technologies have made an impact on the structure of collaboration in the rap genre by looking at its features over different time periods. It is hypothesised that the degree of collaboration has increased along with the ease of collaboration.


Chapter 3

Methodology

Testing the hypothesis was done by creating a co-authorship network of rappers linked by their collaborations. The network had to be of sufficient size and have songs from a diversity of time periods to see if there have been any changes.

The data was obtained from the Genius API (Genius, n.d.). The dataset is a continuously growing archive of songs that are annotated by users of the website. These users earn points by annotating song lyrics or by adding information such as release dates and featured artists to songs. The resulting database can be queried with the Genius API to retrieve information about artists or songs. Originally, the website focused solely on rap songs, but it has grown to also include songs from other genres (Genius, 2009). A major limitation of using this API was that there is no central list of rappers on the website. The workaround for this problem was using a separate list of rappers. This list came from The Original Hip-Hop Lyrics Archive (www.ohhla.com), hereafter referred to as ohhla. In order to find all songs of each artist in the Genius API, this external list of rappers had to be linked to the internal IDs used in the Genius database. An attempt was made to obtain these IDs by using the API search function with rapper names as queries. Because the search function looks through songs in the dataset instead of artists, this attempt was not perfect, however: many rappers were wrongly linked to other rappers' IDs. An attempt was made to manually restore some of the artist IDs by going through the list and correcting the IDs of major artists. In total, there were 3381 artists listed on ohhla. 2374 of these were correctly linked to their IDs, which means there was a success rate of 70.2%.

The successfully linked IDs were used to retrieve lists of all songs in which the artist in question was the primary artist. The Genius API returns at most 50 songs at once, so multiple queries had to be sent for each artist in order to get all songs. From these songs, only the songs that featured other artists were used. For each of these songs, another query was sent to get more specific information about it. For the purpose of this paper, the song ID, the full title of the song, the title without artist names, the title with featured artists, the main artist along with their ID, a list of the featured artists along with their IDs, the release date of the song, and the recording location of the song (if available) were queried. Each song was stored in this way in a CSV file with tabs as separators. In this file, each row represents one song.

To get information about all artists and their songs, a large number of queries had to be sent. To do this, the list of rappers was split into 23 parts and each part was retrieved separately. This was done so that the algorithm would not have to start all over again in case it encountered an error while querying the API. After all 23 parts of the data were retrieved, they were appended and combined into a single file.

A second limitation of the API was that not all results from the queries were actual songs. It is not uncommon for users on the website to annotate interviews or series of tweets by artists. These annotations showed up along with actual songs in the search results. This resulted in some interviewers or otherwise irrelevant people showing up in the network as nodes. This limitation was overcome by simply filtering the retrieved collaborators against the original rapper list from ohhla, as sketched below.
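A minimal sketch of this filtering step, with hypothetical names standing in for the real lists:

# The ohhla list acts as a whitelist of known rappers.
ohhla_names = {'Rapper A', 'Rapper B'}                        # hypothetical
collaborators = ['Rapper A', 'Some Interviewer', 'Rapper B']  # hypothetical
rappers_only = [name for name in collaborators if name in ohhla_names]
print(rappers_only)   # ['Rapper A', 'Rapper B']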

In order to create the edges between artists, the primary artist of each song was appended to the list of featured artists of that song. The result was a list of all collaborators for each song. For each of these lists, all combinations of two artists were added to a main list of edges. Thus, if a song has 4 collaborators, 6 edges get added to this main list. A list of all songs between each pair of artists was kept as an attribute, along with the total number of collaborations between them. The total number of collaborations could be used as the weight of each edge. From this list of edges, the set of artists can be derived in order to get the nodes required for a network.
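The following sketch illustrates this edge-building step on two made-up songs; the names and titles are hypothetical, and the full implementation is in appendix C.

import itertools
import networkx as nx

songs = [
    {'title': 'Song 1', 'artists': ['Rapper A', 'Rapper B', 'Rapper C', 'Rapper D']},
    {'title': 'Song 2', 'artists': ['Rapper A', 'Rapper B']},
]

G = nx.Graph()
for song in songs:
    # a song with 4 collaborators yields C(4, 2) = 6 artist pairs
    for u, v in itertools.combinations(song['artists'], 2):
        if G.has_edge(u, v):
            G[u][v]['weight'] += 1
            G[u][v]['songs'].append(song['title'])
        else:
            G.add_edge(u, v, weight=1, songs=[song['title']])

print(G.number_of_edges())         # 6 edges; the pair A-B now has weight 2
print(G['Rapper A']['Rapper B'])   # {'weight': 2, 'songs': ['Song 1', 'Song 2']}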

After creating the full network, the attributes of the edges were used to create a number of subnetworks. Each subnetwork contains only the collaborations from a single year and all prior years, so that the evolution of collaboration in rap can be studied. These subnetworks were created by first creating a graph for each year with only the rappers as nodes. Then only the edges that were formed in the year corresponding to each graph were added, and all nodes with a degree of 0 were removed from each network. Some cleaning of the data also had to be done: a dozen or so songs were wrongly annotated with the years 0001 and 0099, and these two subnetworks were simply omitted from the list. The separate years were finally combined so that each year also contained all nodes and edges from every prior year, making the sequence cumulative.
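A minimal sketch of how such a cumulative sequence can be built with networkx, assuming a hypothetical dictionary that maps each year to the edges formed in it (the full version is in appendix C):

import networkx as nx

edges_per_year = {                                   # hypothetical toy data
    2016: [('Rapper A', 'Rapper B', {'weight': 1})],
    2017: [('Rapper B', 'Rapper C', {'weight': 2})],
}

cumulative = {}
running = nx.Graph()
for year in sorted(edges_per_year):
    yearly = nx.Graph()
    yearly.add_edges_from(edges_per_year[year])
    # merge this year's nodes and edges into everything seen so far
    running = nx.compose(running, yearly)
    cumulative[year] = running.copy()

print(cumulative[2017].number_of_nodes())   # 3: Rapper A, B and C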


Chapter 4

Results

Reviewing the quality of the network is easiest by visualising it. With a visualisation, one can see at a glance whether the connections and nodes make sense and whether anything is missing, given that the observer is familiar with the genre. The size of the network made it impossible to visualise in its entirety, though. To get around this, visualisations were made of small subnetworks centred on single nodes, usually called egographs. Egographs show a node and all its neighbours. Additionally, the radius can be increased to show more nodes in that vicinity. A radius of 2 would not only show the central node's neighbours, but also the neighbours of those neighbours. Increasing the radius can quickly clutter the visualisation and make it unreadable. Figure 4.1 depicts an egograph of radius 1, centred on Frank Ocean. With these egographs it is easy to see whether the network makes sense or whether something went wrong during its creation. They can also reveal interesting patterns in the data. In this case, for example, you can see that Frank Ocean acts as a gateway between two clusters. The cluster on the right is that of the rap group Odd Future, and the one on the left contains various other popular rappers Frank Ocean has worked with. After inspecting several egographs, the network seemed to represent the collaboration structure of rappers well. No glaring errors could be found.
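As a sketch of how such an egograph is obtained: the code in appendix C uses (id, name) tuples as nodes, while the toy graph below uses plain names for brevity.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edges_from([('Frank Ocean', 'Tyler, The Creator'),
                  ('Frank Ocean', 'Earl Sweatshirt'),
                  ('Tyler, The Creator', 'Earl Sweatshirt'),
                  ('Earl Sweatshirt', 'Vince Staples')])

# radius=1 keeps the centre and its direct neighbours; radius=2 would also
# pull in the neighbours of those neighbours (here: Vince Staples).
ego = nx.ego_graph(G, 'Frank Ocean', radius=1)

pos = nx.spring_layout(ego, seed=3)
nx.draw_networkx(ego, pos, node_color='pink', alpha=0.8)
plt.show()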

The results are split into two parts. The first part considers the network as a whole. All collaborations are included in this network, regardless of whether the songs have dates. These results show the characteristics of collaboration in rap as a whole. By comparing these characteristics to those of similar co-authorship networks from previous research, the reliability of the rapper network can be tested, to see whether repeated experiments yield the same results.

In the second section of the results, each year is dealt with separately. The same measures that were used in the first section are calculated here per year. This method can show trends in the evolution of collaboration in rap and expose developments in the genre.


Figure 4.1: Larger version in appendix A. This visualisation shows Frank Ocean and all his direct neighbours. Thin dashed lines indicate that two artists collaborated on 1 song. The slightly thicker lines indicate 2 or 3 collaborations, and the thickest lines are for 4 or more collaborations.

The second part also provides the evidence that supports or rejects the central hypothesis of an increase in collaboration between rappers.

4.1 General Results

The final network consisted of 6314 rappers and 43643 collaborations between these rappers. The latter number is not the number of songs, but rather the number of rapper pairs. Each edge can represent multiple songs, which is indicated by its weight. In total, 76195 songs were used to create the network.

On average, each artist worked with 13.8 different artists, on 24.1 songs with features. These results are heavily skewed by a few prolific rappers that have many collaborations. The first graph in figure 4.2 shows the degree distribution of the whole network. The distribution has a high peak at the lower degrees and a long tail towards the higher degrees, indicating that there are many rappers with only a handful of collaborations. This shape is reminiscent of scale-free networks, which means that the same type of structure would show up in similar networks of smaller or bigger scales. The same shape has also been observed in other co-authorship networks.


Figure 4.2: (a) The degree distribution of the full network. The degree was measured by the number of other rappers a rapper has worked with, not the number of songs. (b) The same distribution as (a), but scaled logarithmically on both axes.

Curiously, there is another peak at a degree of 57. Upon further inspection, the cause for this peak was found. In 2010, a group of roughly 60 Canadian artists came together to make a song to help Haiti after it was struck by a large earthquake (HiphopDX, 2010). This song gave each of the collaborators 57 neighbours, which created a noticeable peak in the distribution. Appendix B shows the subnetwork for these collaborators. In the second graph in figure 4.2, both axes of the degree distribution are scaled logarithmically. If the network is indeed scale-free, this should look like a diagonal line. In this case, the line is not fully diagonal, because there are outliers present in the data. This does not necessarily mean, though, that the network is not scale-free.

Not all nodes were connected in the network. Table 4.1 shows the sizes of the various components that arose. As can be seen, the size of the giant component greatly outweighed that of the other components. The nodes of the giant component made up 99.0% of the network. For the purpose of analysing the connectedness of the network, only the nodes and edges located in this component were used for the subsequent calculations. Without this restriction, it is not possible to calculate measures like the average shortest path between all nodes or the diameter.
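A short sketch of this restriction, assuming networkx; the graph below is a toy stand-in for the rapper network.

import networkx as nx

G = nx.gnm_random_graph(n=200, m=180, seed=11)   # sparse toy graph, disconnected

# keep only the largest connected component
giant_nodes = max(nx.connected_components(G), key=len)
giant = G.subgraph(giant_nodes).copy()

print('nodes in giant component:', giant.number_of_nodes())
print('average shortest path:', nx.average_shortest_path_length(giant))
print('diameter:', nx.diameter(giant))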

The largest component showed no signs of assortativity or disassortativity: the assortativity was measured to be 0.059. This means that rappers did not have a preference for working with similarly or differently connected rappers. High degree rappers did not tend to work more with other high degree rappers, and at the same time they also did not work more with low degree rappers. This result was also found in Smith's (2006) rapper collaboration network.

The average shortest path length found in the largest component was 3.569. This means that it takes on average 3.569 steps to reach a random other rapper when following the edges. The longest shortest path that was found was 9.


Table 4.1: The size of each component in the network.

Amount of Nodes   Amount of Edges
6253              43469
15                106
7                 17
7                 11
6                 15
5                 6
3                 3
2                 3
2                 1
2                 2
2                 1
2                 1
2                 2
1                 1
1                 1
1                 1
1                 1
1                 1
1                 1

4.2 Yearly Results

When making the yearly graphs, only songs that had a date attribute were taken into account. Due to incompleteness in a lot of the data, many songs were discarded in this way. Furthermore, the years were added cumulatively, so each successive year contains the data from the previous years as well. This was done to make the measures for the later years more robust. Figure 4.3 shows the cumulative number of nodes and edges for each year. In the graph, there is a sudden upward spurt starting at 2009. This is likely because Genius was launched in 2009. The lower number of songs prior to this could be either because there simply were fewer songs or because these songs are not properly annotated. Either way, this makes the measures for these years less accurate. To make the further graphs clearer, all songs released before 1993 are ignored.

The first graph in figure 4.4 shows the average clustering coefficient for each year. The graph stays relatively stable over the years. This suggests that rappers have not begun working with more or fewer other rappers. It cannot be concluded from this graph whether rappers have also not increased the number of songs with features that they make, because the clustering coefficient in figure 4.4 only takes edges into account, not their weights.

Figure 4.3: Cumulative size of the network each year.

To see if the average number of songs a rapper records with other artists has changed, we can look at the average degree. Earlier, the average degree was calculated to be 13.8 for all years combined. The average degree with the weights of the edges taken into account was 24.1. This means that, on average, a rapper works with 13.8 other rappers on 24.1 songs. The second graph in figure 4.4 shows these measures for each year since 1993.

In contrast to the clustering coefficient in figure 4.4(a), there are some fluctuations in this graph. The degree increased somewhat steadily, which is to be expected. When a rapper with just one collaboration joins the network, the average degree declines, but at the same time, rappers that were already in the network keep making new songs and increasing their own degrees. Under normal circumstances, these two opposing effects should keep the cumulative average degree in a steady increase or decline. Similar to the sudden jump in new rappers and edges in figure 4.3, this graph shows a spike after 2009. This can most likely be attributed to the nature of the dataset. The website was created in 2009 and thus more songs were annotated for these years, because there is a higher tendency to annotate songs that have recently been released. An increase in the number of annotated songs also means an increase in the number of possible songs a rapper is affiliated with, and in that way also an increase in the degree.

The average degree with the weights taken into account is much lower than the one found in the first results section, which calculated it for the whole dataset. The reason for this is that there were a lot of dateless songs in the whole dataset that are not used for the cumulative measure.


Figure 4.4: (a) The clustering coefficient of the cumulative network per year. (b) The average degree of the cumulative network per year. The orange line takes the weights of each edge into account; the other line does not.


Chapter 5

Discussion

5.1 Conclusion

The central question in this paper was whether there has been an increase in collaboration in the rap genre. It was hypothesised that there has indeed been an increase, and this increase was thought to be supported by the fact that new technologies have afforded easier collaboration between rappers. For example, the internet allows rappers and producers to send and receive beats, verses, and ideas instantaneously without having to be in the same place in person. It allows artists from different areas in America or even different countries to work together on songs.

An attempt to answer this central question was made by using network theory to analyse a co-authorship network of rap collaboration. The dataset, retrieved from Genius (n.d.), contained information about songs from a large number of rappers, and with this dataset a number of networks were constructed. First, there was the full network, which contained all rappers and songs from the dataset; second, there was the sequence of cumulative networks, which contained all songs with information about their date. The full network could be used to compare collaboration in rap to the other types of collaboration found in different papers. The sequence of cumulative networks was used to show a possible evolution of the network. In this sequence, each network contains data from a year and all years prior.

The amount of collaboration, as measured by the clustering coefficient, did not change dramatically. The stagnancy of the coefficient contradicts the hypothesis and suggests that there has been no increase in collaboration in the genre. A possible reason is the nature of rap itself: artists have limits to the extent to which they collaborate with others. This conclusion does not mean, however, that technology has had no influence on the genre whatsoever. While the amount of collaboration might have stayed unchanged, the type of collaboration might have changed. It is possible that rappers nowadays are more prone to work with artists that are further away than in the early years of rap.

These findings provide quantitative evidence for hip-hop historians to work with. In particular, they give insight into the claims made by Hess (2009) that the future of rap places less emphasis on space. While there might be more geographically diverse collaborations, there has been no increase in the amount of collaboration.

5.2 Limitations

The dataset that was used limited the research in some ways. Firstly, there was the problem that there was no central list of rappers. Therefore a secondary data source had to be used, and linking the two together brought up some issues. The Genius API had no way to look up an artist directly to get their internal ID. The solution to this, a query that looks up artists via their songs, was not perfect; many artists were not correctly linked to their IDs, which meant that 29.8% of artists were discarded from the list.

A second limitation of the dataset was its bias towards years after 2009. Because the website was created in that year, the number of songs annotated for 2009 and later was much greater than for the years before. This caused some spikes to show up at or after 2009. In addition to this, the early years of the rap genre had very little representation in the dataset.

Future research should be cautious of these limitations when making use of this API. Aside from these disadvantages, the data was very thorough and contained much information that was beyond the scope of this paper. The database also contains, for example, the producers of songs and the recording locations.

5.3 Further Research

Several questions remain, and some new questions have emerged during the research presented in this paper. Firstly, it would be interesting to find out more about the nature of the collaboration that is taking place. It has been found that the amount of collaboration has not changed. Further research could give more insight into the new types of collaboration that new technologies have brought with them. Provided that the proper data is available, this research could look at the average geographical distance between collaborating artists over the years, or the distribution of nationalities in the genre.

References

Albert, R., & Barabási, A.-L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1), 47.

Erdős, P. (1959). On random graphs. Publicationes Mathematicae, 6, 290–297.

Forman, M. (2002). The 'hood comes first: Race, space, and place in rap and hip-hop. Wesleyan University Press.

Genius. (n.d.). Genius API. Retrieved from docs.genius.com

Genius. (2009). About Genius. Retrieved from https://genius.com/Genius-about-genius-annotated

Gleiser, P. M., & Danon, L. (2003). Community structure in jazz. Advances in Complex Systems, 6(04), 565–573.

Hess, M. (2009). Hip hop in America: A regional guide. ABC-CLIO.

IFPI. (2017, April). IFPI Global Music Report 2017. Retrieved from http://www.ifpi.org/news/IFPI-GLOBAL-MUSIC-REPORT-2017

McPherson, M., Smith-Lovin, L., & Cook, J. M. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1), 415–444.

Newman, M. E. (2001). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2), 404–409.

Newman, M. E. (2002). Assortative mixing in networks. Physical Review Letters, 89(20), 208701.

Newman, M. E. (2004). Who is the best connected scientist? A study of scientific coauthorship networks. In Complex Networks (pp. 337–370). Springer.

Nielsen. (2018, January). 2017 U.S. music year-end report. Retrieved from http://www.nielsen.com/us/en/insights/reports/2018/2017-music-us-year-end-report.html

Smith, R. D. (2006). The network of collaboration among rappers and its community structure. Journal of Statistical Mechanics: Theory and Experiment, 2006(02), P02006.

Travers, J., & Milgram, S. (1967). The small world problem. Psychology Today, 1(1), 61–67.


Appendix A

Enlarged egographs


Appendix B

Haiti subnetwork


Appendix C

Code


June 28, 2018

1 imports

In [8]: import pandas as pd
        import numpy as np
        import networkx as nx
        import matplotlib.pyplot as plt
        import matplotlib.ticker as ticker

        import ast
        import itertools
        import requests
        import csv
        import sys
        import collections
        from lxml import html

        from requests.adapters import HTTPAdapter
        from requests.packages.urllib3.util.retry import Retry

        %matplotlib inline

2 data gathering

2.1 artist list from ohhla.com

In [3]: artist_urls = ['http://ohhla.com/all_five.html', 'http://ohhla.com/all_four.html',
                       'http://ohhla.com/all_three.html', 'http://ohhla.com/all_two.html',
                       'http://ohhla.com/all.html']

        artist_names = []

        for url in artist_urls:
            page = requests.get(url)
            tree = html.fromstring(page.content)
            root = tree.getroottree()

            # every artist is listed as a link inside a <pre> block on ohhla.com
            for artist in root.findall('//pre/a'):
                # skip the non-breaking-space placeholder entries
                if artist.text != '\xa0':
                    artist_names.append(artist.text)

In [4]: len(artist_names)

Out[4]: 3421

2.2 set up genius API connection

In [ ]: # retry set-up taken from https://stackoverflow.com/a/35636367
        s = requests.Session()
        retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
        s.mount('http://', HTTPAdapter(max_retries=retries))
        # the Genius API is served over https, so mount the retry adapter there as well
        s.mount('https://', HTTPAdapter(max_retries=retries))

        client_access_token = ''  # TOKEN HERE
        url = 'https://api.genius.com/'

2.3 match ohhla artists to genius ids

Because of limitations in the API, we have to use the search function to find artist IDs. This method is flawed, because we don't always find the right artist.

In [ ]: with open('artists.csv', 'w') as artistfile:
            artistwriter = csv.writer(artistfile, delimiter='\t', quotechar='|')

            for artist in artist_names:
                ## 1. search for artist id
                path1 = 'search/'
                request_uri1 = url + path1
                params1 = {'q': artist[0]}
                token = 'Bearer {}'.format(client_access_token)
                headers = {'Authorization': token}

                # search for artist name
                r1 = s.get(request_uri1, params=params1, headers=headers)
                # get id of the first response (if it exists)
                try:
                    artist_id = \
                        r1.json()['response']['hits'][0]['result']['primary_artist']['id']
                    artist_name = \
                        r1.json()['response']['hits'][0]['result']['primary_artist']['name']
                    # the final column flags whether the ohhla and Genius names agree
                    # (the end of this line was cut off in the extraction; this is an assumption)
                    artistwriter.writerow([artist[1], artist[0], artist_id, artist_name,
                                           artist[1] == artist_name])
                # the except clause was also cut off; an IndexError occurs when the
                # search returns no hits
                except IndexError:
                    print(artist[1] + ' : None found')

After doing this, I manually corrected some of the wrongly assigned IDs in the CSV file.

In [6]: artists = pd.read_csv('artists.csv',
                              sep='\t', header=None,
                              names=['ohhla_name', 'ohhla_name_dash',
                                     'genius_id', 'genius_name', 'names_match'])

2.4 put all song information in a csv

This process takes approximately 24 hours.

In [ ]: %%time
        # 'batches' holds the list of artists split into 23 parts (see chapter 3);
        # the defining cell is assumed to be e.g. batches = np.array_split(artists, 23)
        batchnr = 10

        for batch in batches:
            i = 1
            batchname = 'batch' + str(batchnr) + '.csv'
            print(batchname)

            with open(batchname, 'w') as csvfile:
                for _, current_artist in batch[batch.names_match].iterrows():
                    current_id = str(current_artist.genius_id)

                    ## 1. search for songs
                    path2 = 'artists/'
                    path2b = '/songs'
                    request_uri2 = url + path2 + current_id + path2b
                    token = 'Bearer {}'.format(client_access_token)
                    headers = {'Authorization': token}

                    song_idlist = []
                    next_page = 1

                    # go through every page
                    while next_page:
                        try:
                            params2 = {'id': current_id, 'per_page': '50',
                                       'page': str(next_page)}
                            r2 = s.get(request_uri2, params=params2, headers=headers)
                            next_page = r2.json()['response']['next_page']
                            for song in r2.json()['response']['songs']:
                                # only take songs where current artist is the primary artist
                                if str(song['primary_artist']['id']) == current_id:
                                    song_idlist.append(song['id'])
                        except KeyError:
                            break
                        except:
                            print(sys.exc_info())

                    print(str(i) + ': ' + str(len(song_idlist)) + ' songs by '
                          + current_artist['ohhla_name'] + ' found')
                    i += 1

                    songwriter = csv.writer(csvfile, delimiter='\t', quotechar='|')

                    ## 2. get song information
                    for song_id in song_idlist:
                        try:
                            path3 = 'songs/'
                            request_uri3 = url + path3 + str(song_id)
                            params3 = {'id': song_id}
                            r3 = s.get(request_uri3, params=params3, headers=headers)

                            # only take songs with features
                            if len(r3.json()['response']['song']['featured_artists']) > 0:
                                song_json = r3.json()['response']['song']
                                # song information
                                song_id = song_json['id']
                                song_full_title = song_json['full_title']
                                song_title = song_json['title']
                                song_title_with_featured = song_json['title_with_featured']
                                song_main_artist = (song_json['primary_artist']['id'],
                                                    song_json['primary_artist']['name'])
                                song_featured_artists = []
                                for featured_artist in song_json['featured_artists']:
                                    song_featured_artists.append((featured_artist['id'],
                                                                  featured_artist['name']))
                                song_release_date = song_json['release_date']
                                song_recording_location = song_json['recording_location']
                                songwriter.writerow([song_id, song_full_title, song_title,
                                                     song_title_with_featured,
                                                     song_main_artist, song_featured_artists,
                                                     song_release_date,
                                                     song_recording_location])
                        # this except header was cut off in the extraction; songs that
                        # fail are printed and skipped
                        except:
                            print(str(song_id))
                            continue

            batchnr += 1

2.5 concatenate the various batches

In [ ]: songs = []

        for i in range(23):
            filename = 'batches/batch' + str(i+1) + '.csv'
            batch = pd.read_csv(filename, sep='\t',
                                header=None, names=['song_id', 'song_full_title',
                                                    'song_title', 'song_title_with_featured',
                                                    'song_main_artist', 'song_featured_artists',
                                                    'song_release_date', 'song_recording_location'])
            songs.append(batch)

        # concatenate all batches into one dataframe and write it to a single file
        allsongs = pd.concat(songs)
        allsongs.to_csv('allsongs.csv', sep='\t', header=False)

3 creating the network

3.1 song csv to dataframe

In [9]: all_songs = pd.read_csv('allsongs.csv',
                                sep='\t', header=None,
                                names=['song_id', 'song_full_title',
                                       'song_title', 'song_title_with_featured',
                                       'song_main_artist', 'song_featured_artists',
                                       'song_release_date', 'song_recording_location'])

        # the artist tuples were written as strings; parse them back into Python objects
        all_songs.song_featured_artists = all_songs.song_featured_artists.apply(ast.literal_eval)
        all_songs.song_main_artist = all_songs.song_main_artist.apply(ast.literal_eval)

In [10]: all_songs.head()

Out[10]:    song_id                                     song_full_title  \
         0  3427566  After the Storm by Kali Uchis (Ft. Bootsy Coll...
         1  3581988      Just a Stranger by Kali Uchis (Ft. Steve Lacy)
         2  3581989                       Miami by Kali Uchis (Ft. BIA)
         3  3190066                       MULTI by Kali Uchis (Ft. JFK)
         4  3189716          Nuestro Planeta by Kali Uchis (Ft. Reykon)

                 song_title                            song_title_with_featured  \
         0  After the Storm  After the Storm (Ft. Bootsy Collins & Tyler, T...
         1  Just a Stranger                     Just a Stranger (Ft. Steve Lacy)
         2            Miami                                      Miami (Ft. BIA)
         3            MULTI                                      MULTI (Ft. JFK)
         4  Nuestro Planeta                         Nuestro Planeta (Ft. Reykon)

                song_main_artist                              song_featured_artists  \
         0  (160041, Kali Uchis)  [(685, Tyler, The Creator), (6471, Bootsy Coll...
         1  (160041, Kali Uchis)                             [(512850, Steve Lacy)]
         2  (160041, Kali Uchis)                                    [(273084, BIA)]
         3  (160041, Kali Uchis)                                     [(28671, JFK)]
         4  (160041, Kali Uchis)                                 [(406029, Reykon)]

           song_release_date                        song_recording_location
         0        2018-01-12                                            NaN
         1        2018-04-06  Interscope Studios (Santa Monica, California)
         2        2018-04-06                   Whitelines (Los Angeles, CA)
         3        2013-01-01                                            NaN
         4        2017-08-25                                            NaN

3.2 artist csv to dataframe

In [11]: artists = pd.read_csv('artists.csv', sep='\t',
                               header=None, names=['ohhla_name', 'ohhla_name_dash',
                                                   'genius_id', 'genius_name', 'names_match'])

In [12]: artists.head()

Out[12]:                      ohhla_name               ohhla_name_dash  genius_id  \
         0                    Kali Uchis                    Kali-Uchis     160041
         1                           UGK                           UGK        401
         2                Ugly Ducklings                Ugly-Ducklings     376717
         3                         U-God                         U-God       1060
         4  Ultra (Kool Keith & Tim Dog)  Ultra-(Kool-Keith-&-Tim-Dog)       9579

                   genius_name  names_match
         0          Kali Uchis         True
         1                 UGK         True
         2  The Ugly Ducklings         True
         3               U-God         True
         4  Ultramagnetic MC's         True

4 make pairs of artists

In [13]: main_artists = list(all_songs.song_main_artist.unique())

         pairs = {}

         for _, row in all_songs.iterrows():
             # some songs have wrongly been annotated and have
             # the primary artist as one of the featured artists
             # these are checked for and removed
             for featured_artist in row.song_featured_artists:
                 if featured_artist[1] in list(artists.genius_name):
                     # append primary artist to featured artists
                     # to get a list of all artists involved per song
                     involved_artists = [row.song_main_artist] + row.song_featured_artists
                     rapper_pairs = list(itertools.combinations(involved_artists, 2))
                     rapper_pairs_sorted = [tuple(sorted(pair)) for pair in rapper_pairs]
                     attributes = {'song_title': row.song_full_title,
                                   'song_release_date': row.song_release_date,
                                   'song_recording_location': row.song_recording_location}

                     for pair in rapper_pairs_sorted:
                         if pair in pairs.keys():
                             pairs[pair]['songs'][row.song_title] = attributes
                         else:
                             pairs[pair] = {'songs': {row.song_title: attributes}}

In [14]: # add the amount of songs to the dictionary
         for pair in pairs:
             pairs[pair]['amount'] = len(pairs[pair]['songs'].keys())

In [15]: artist_list = []

         for pair in pairs.keys():
             artist_list.append(pair[0])
             artist_list.append(pair[1])

         artist_list = list(set(artist_list))

len(artist_list)

Out[15]: 6314

4.1 creating the network

In [16]: rappers = nx.Graph()

rappers.add_nodes_from(artist_list)

rappers.add_edges_from(((k[0],k[1],d) for k,d in pairs.items()))

5 visualising the network

In [17]: subgraph = nx.ego_graph(rappers, (1450, 'YG'), radius=1)
         plt.figure(figsize=(30, 20))
         pos = nx.spring_layout(subgraph)
         nx.draw_networkx_nodes(subgraph, pos, alpha=0.5)
         nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
         nx.draw_networkx_labels(subgraph, pos)
         plt.show()

In [18]: def ego(graph, name, r):
             matching_artists = artists[artists['names_match'] == True]
             genius_id = matching_artists[matching_artists['ohhla_name'] ==
                                          name].iloc[0]['genius_id']
             if not genius_id:
                 print('no artist found with that name')
                 return

             subgraph = nx.ego_graph(graph, (genius_id, name), radius=r)
             plt.figure(figsize=(30, 20))
             pos = nx.spring_layout(subgraph)

             # split the edges into three weight classes for the visualisation
             elargest = [(u, v) for (u, v, d) in subgraph.edges(data=True) if d['amount'] > 3]
             elarge = [(u, v) for (u, v, d) in subgraph.edges(data=True) if
                       (d['amount'] > 1 and d['amount'] <= 3)]
             esmall = [(u, v) for (u, v, d) in subgraph.edges(data=True) if d['amount'] <= 1]

             nx.draw_networkx_nodes(subgraph, pos, alpha=0.5, node_color='pink')
             # the elargest draw call was lost at a page break and is reconstructed
             # from the identical cell In [50] below
             nx.draw_networkx_edges(subgraph, pos, edgelist=elargest,
                                    alpha=0.5, edge_color='r', width=6)
             nx.draw_networkx_edges(subgraph, pos, edgelist=elarge,
                                    alpha=0.5, edge_color='r', width=3)
             nx.draw_networkx_edges(subgraph, pos, edgelist=esmall,
                                    alpha=0.5, edge_color='r', style='dashed')
             nx.draw_networkx_labels(subgraph, pos, alpha=0.8)
             plt.show()

In [19]: ego(rappers, 'Frank Ocean', 1)

6 basic information about the full network

In [33]: print(nx.info(rappers))

Name:
Type: Graph
Number of nodes: 6314
Number of edges: 43643
Average degree: 13.8242


In [42]: # amount of songs with features per rapper
         amount_degree = sorted(rappers.degree(weight='amount'), reverse=True,
                                key=lambda x: x[1])

In [35]: amount_degree[:10]

Out[35]: [((4, 'Lil Wayne'), 1553),
          ((88, 'Rick Ross'), 1302),
          ((46, 'Snoop Dogg'), 1190),
          ((13, 'Gucci Mane'), 962),
          ((85, 'T.I.'), 937),
          ((1583, 'French Montana'), 857),
          ((338, 'Busta Rhymes'), 814),
          ((42, 'The Game'), 796),
          ((405, 'E-40'), 775),
          ((142, 'Master P'), 764)]

In [36]: np.mean([x[1] for x in amount_degree])
Out[36]: 24.135254988913527

In [37]: sum(x[1] for x in amount_degree)/2

Out[37]: 76195.0

In [43]: # amount of unique featured artists per rapper

degree = sorted(rappers.degree(), reverse=True, key=lambda x: x[1])

In [39]: degree[:10]

Out[39]: [((46, 'Snoop Dogg'), 421),
          ((338, 'Busta Rhymes'), 345),
          ((405, 'E-40'), 333),
          ((4, 'Lil Wayne'), 328),
          ((88, 'Rick Ross'), 319),
          ((42, 'The Game'), 299),
          ((288, 'Bun B'), 297),
          ((14325, '2 Chainz'), 289),
          ((85, 'T.I.'), 275),
          ((56, 'Nas'), 269)]

In [40]: np.mean([x[1] for x in degree])
Out[40]: 13.824200190053849

In [44]: ego(rappers, 'Lil Wayne', 1)


In [46]: degree_sequence = sorted([d for n, d in rappers.degree()], reverse=True)
         degree_distr = pd.Series(collections.Counter(degree_sequence))

         degree_sequence_weighted = sorted([d for n, d in rappers.degree(weight='amount')],
                                           reverse=True)
         degree_distr_weighted = pd.Series(collections.Counter(degree_sequence_weighted))

         degree_distr.plot(kind='bar', figsize=(10, 5))

Out[46]: <matplotlib.axes._subplots.AxesSubplot at 0x7f658f544da0>

In [47]: degree_distr_weighted.plot(figsize=(5, 4),
                                    title='Weighted degree distribution of the rap network')

Out[47]: <matplotlib.axes._subplots.AxesSubplot at 0x7f658f532940>


Out[48]: <matplotlib.axes._subplots.AxesSubplot at 0x7f658f2d9be0>


In [50]: subgraph = nx.ego_graph(rappers, (290415, 'Stephan Moccio'), radius=1)
         plt.figure(figsize=(30, 20))
         pos = nx.spring_layout(subgraph)

         elargest = [(u, v) for (u, v, d) in subgraph.edges(data=True) if d['amount'] > 3]
         elarge = [(u, v) for (u, v, d) in subgraph.edges(data=True) if
                   (d['amount'] > 1 and d['amount'] <= 3)]
         esmall = [(u, v) for (u, v, d) in subgraph.edges(data=True) if d['amount'] <= 1]

         nx.draw_networkx_nodes(subgraph, pos, alpha=0.5, node_color='pink')
         nx.draw_networkx_edges(subgraph, pos, edgelist=elargest, alpha=0.5,
                                edge_color='r', width=6)
         nx.draw_networkx_edges(subgraph, pos, edgelist=elarge, alpha=0.5,
                                edge_color='r', width=3)
         nx.draw_networkx_edges(subgraph, pos, edgelist=esmall, alpha=0.5,
                                edge_color='r', style='dashed')
         nx.draw_networkx_labels(subgraph, pos, alpha=0.8)
         plt.show()


In [53]: nx.average_clustering(rappers_giant)
Out[53]: 0.7097085518936848

In [54]: np.mean([x[1] for x in nx.degree(rappers_giant)])
Out[54]: 13.903406364944827

In [57]: nx.average_shortest_path_length(rappers_giant)
Out[57]: 3.5688932524160637

In [56]: nx.degree_assortativity_coefficient(rappers_giant)
Out[56]: 0.058960174223529845

In [55]: nx.diameter(rappers_giant)
Out[55]: 9

7 creating yearly networks

In [20]: per_year = {}

         # first loop through each rapper pair
         for edge in list(rappers.edges(data=True)):
             pair_years = {}
             # then check each song
             for song in edge[2]['songs'].keys():
                 # if there's a release date
                 if isinstance(edge[2]['songs'][song]['song_release_date'], str):
                     # take only the year ([:4])
                     year = edge[2]['songs'][song]['song_release_date'][:4]
                     # add the songs that were released in that year to a dictionary
                     if year in pair_years.keys():
                         pair_years[year]['songs'][song] = edge[2]['songs'][song]
                     else:
                         pair_years[year] = {'songs': {song: edge[2]['songs'][song]}}

             # then add each year for that pair to the main dictionary
             for year in pair_years.keys():
                 # also add the 'amount' attribute
                 pair_years[year]['amount'] = len(pair_years[year]['songs'].keys())
                 if year in per_year.keys():
                     per_year[year][edge[:2]] = pair_years[year]
                 else:
                     per_year[year] = {edge[:2]: pair_years[year]}

In [21]: print(sorted(list(per_year.keys())))

['0001', '0099', '1977', '1982', '1984', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018']

In [23]: yearly_graphs = []

         # create an empty graph for each year
         for year in per_year.keys():
             temp = nx.Graph()
             temp.name = year
             yearly_graphs.append(temp)

         # add the nodes and edges to each year's graph
         for i, year in enumerate(per_year.keys()):
             yearly_graphs[i].add_nodes_from(nodes=artist_list)
             yearly_graphs[i].add_edges_from(((k[0], k[1], d) for k, d in per_year[year].items()))
             # remove nodes with no edges
             yearly_graphs[i].remove_nodes_from(nx.isolates(yearly_graphs[i].copy()))

         # sort the graphs by year and remove the first two (input errors: years 0001 and 0099)
         yearly_graphs = sorted(yearly_graphs, key=lambda x: x.name)[2:]

In [24]: yearly_info = [(len(graph.nodes()), graph.size()) for graph in yearly_graphs]
         yearly_years = [graph.name for graph in yearly_graphs]

         yearly_info_df = pd.DataFrame(data=yearly_info, index=yearly_years,
                                       columns=['amount of nodes', 'amount of edges'])
         yearly_info_df.plot(kind='bar')

Out[24]: <matplotlib.axes._subplots.AxesSubplot at 0x7f65a105da90>


         yearly_clustering = pd.Series(clustering[1], index=clustering[0])
         yearly_clustering.plot(yticks=np.arange(0, 1.1, 0.2))

Out[26]: <matplotlib.axes._subplots.AxesSubplot at 0x7f658ecd06d8>


In [27]: ego(yearly_graphs[-6], 'Kendrick Lamar', 1)

8 making the years cumulative

In [28]: cumulative_years = [yearly_graphs[0]]

for i, year in enumerate(yearly_graphs[1:]):

cumulative_years.append(nx.compose(cumulative_years[i], year))

In [29]: good_yearly_info = [(len(graph.nodes()), graph.size()) for graph in cumulative_years]
         good_yearly_years = [graph.name for graph in cumulative_years]

         good_yearly_info_df = pd.DataFrame(data=good_yearly_info,
                                            index=good_yearly_years,
                                            columns=['amount of nodes',
                                                     'amount of edges'])
         good_yearly_info_df.plot(kind='bar')

Out[29]: <matplotlib.axes._subplots.AxesSubplot at 0x7f65b7a22a90>


         yearly_clustering = pd.Series(clustering[1], index=clustering[0])
         yearly_clustering.plot(yticks=np.arange(0, 1.1, 0.2))

Out[30]: <matplotlib.axes._subplots.AxesSubplot at 0x7f659017b668>


In [31]: assortativity = list(zip(*[(graph.name, nx.degree_assortativity_coefficient(graph))
                                    for graph in cumulative_years]))
         yearly_assortativity = pd.Series(assortativity[1], index=assortativity[0])
         yearly_assortativity.plot(yticks=np.arange(0, 1.1, 0.2))

/home/mathijs/anaconda3/lib/python3.6/site-packages/networkx/algorithms/assortativity/correlation.py:287: RuntimeWarning: invalid value encountered in double_scalars
  return (xy*(M-ab)).sum()/numpy.sqrt(vara*varb)

Out[31]: <matplotlib.axes._subplots.AxesSubplot at 0x7f658ffc9ba8>

In [32]: degree = list(zip(*[(graph.name, np.mean([x[1] for x in
                                 nx.degree(max(nx.connected_component_subgraphs(graph),
                                               key=len))]))
                             for graph in cumulative_years[10:]]))
         yearly_degree = pd.Series(degree[1], index=degree[0])
         yearly_degree.plot(label='unweighted', legend=True)

         degreew = list(zip(*[(graph.name, np.mean([x[1] for x in
                                  nx.degree(max(nx.connected_component_subgraphs(graph),
                                                key=len),
                                            weight='amount')]))
                              for graph in cumulative_years[10:]]))
         yearly_degreew = pd.Series(degreew[1], index=degreew[0])
         yearly_degreew.plot(label='weighted', legend=True)
