How ‘selfie’ got into the dictionary: an examination of Internet linguistics and language change online


By Nora McLeese

University of Amsterdam
MA New Media and Digital Culture
June 26, 2015
Supervisor: dhr. M.D. (Marc) Tuters
Second Reader: mw. dr. S. C. (Sabrina) Sauer


INTRODUCTION
RESEARCH QUESTION
CHAPTER ONE: THE SCOPE OF THE RESEARCH
  I. Research Overview
  II. New Media Positioning
  III. The Corpus
    A. Corpus Limitations
    B. The Corpus of the Dictionary
CHAPTER TWO: LANGUAGE VARIATION
  I. Language Change and Language Variation
  II. Sociolinguistics
  III. Speech Communities
CHAPTER THREE: INTERNET LINGUISTICS
  I. Computer-mediated Communication
  II. Internet Linguistics and David Crystal
  III. Vocabulary Change
CHAPTER FOUR: THE DICTIONARY
  I. The Dictionary and Its Authority
  II. UrbanDictionary
CHAPTER FIVE: THE CASE OF THE ‘SELFIE’
  I. Origins and Etymology
  II. Tracking [selfie]
    A. Methodology and Results
CONCLUSION
FURTHER RESEARCH
BIBLIOGRAPHY


ABSTRACT

Internet linguistics is an emerging field that deals with how language is impacted by online communication. Media environments and social forces combine to determine what is appropriate and to drive language change. An email is written differently than a Tweet, and the character limits of the latter may prompt the creation of abbreviations and neologisms. What Internet linguists, led by British linguist David Crystal, OBE, have shown is that much of the slang and jargon associated with new media forms in fact has historical roots. Even if the new language is not revolutionary, when a certain word associated with the Internet and the Millennial generation catches on, there tend to be cries of the debasement of the English language. Empirical research at the end of this study follows the journey that the word [selfie] made from its first recorded appearance, through its growth in popularity, to its acceptance into the lexicon of the Oxford English Dictionary, which is considered the official record keeper of the English language.

I hereby allow the library of the University of Amsterdam to archive a digital copy of this thesis in a repository, and to publicise and make it available for consultation upon request.

Nora McLeese
June 26, 2015


INTRODUCTION

William Shakespeare has been credited with adding more than 1700 words to the English lexicon. The merits of his contribution have been endlessly debated. Criticisms range from questioning his authorship to alleging that he merely provided the most comprehensive record of language that was easily accessible at the time the first dictionaries were being compiled.[1] Nevertheless, the specter of his influence looms large. No other person, group or force has been as heavily associated with English language change as Shakespeare, that is, until the Internet emerged and its jargon became a perceived threat to the purity of the English language. In recent years, dictionaries have been adding large numbers of new words and meanings that have origins in digital culture and new media technologies. What this study seeks to examine is whether Internet-powered language change can be seen as having a similar impact. Are we in truth seeing a new wave of rapid language change, or do we just perceive it as such because the use of language has never before been so comprehensively recorded as it is in the corpus of the Internet?

The topic for this thesis stems from my interest in language and how it is used differently in different settings, especially the patterns of use emerging online. In 2013, The Toast, a humorous feminist site with a literary slant, published a blog post by Tia Baheri called “Your Ability To Can Even: A Defense of Internet Linguistics”, which argued that language used online was not a degradation of the language but rather a creative means of expression under new technological restrictions. The titular example, the phrase “I can’t even”, has over a dozen variants that all express an emotion so overwhelming that the speaker can no longer speak. In contrast to real-life, in-person communication, the limited dimensions of text on screen require words to do more heavy lifting.
A sentence fragment can convey something profound on a social media site such as Tumblr but sound like nonsense in a real-world setting. This is the anchor to which this study is tied. In sociolinguistics, language variation can emerge as a result of social forces, such as the culture of an online community. With the virtual borders of the Internet becoming more porous, its language is being increasingly used and remarked upon in the mainstream.

[1] The Oxford English Dictionary is responsible for citing Shakespeare’s works as the first recorded and


Cultural writers and critics have for years been musing on the abbreviations, acronyms, and erased punctuation and capitalization that have become the markers of Internet slang. However, up until now linguistic examination of the circumstances surrounding these changes and the impact they have had has been sparse. This is in part due to the relative novelty of the Internet. Linguistics, as a field, rewards time. The Internet is temporal. However, there is an emerging field focused on studying the new language forms and styles shaped under the influence of the Internet and other new media formats: Internet linguistics. British linguist David Crystal, OBE, is the preeminent scholar in this field, having coined the term in a 2005 paper. Crystal has written over two dozen books and articles on Internet language, a majority of which speak in its defense and rail against the moral panic that surrounds Internet-charged language change. Other works provide more pedantic resources, identifying, recording and defining words that have emerged or taken on new meaning because of the Internet and digital culture.

In this paper, the emerging field of Internet linguistics will be examined, considered in relation to media theory, and discussed with regard to the integration of Internet language into mainstream discourse and the English lexicon, represented by the dictionary. One of the most visible aspects of Internet linguistics is vocabulary change and the emergence of new words. This study is interested in seeing why and how certain words make it into the dictionary while others disappear or are not accepted. Specifically, empirical research has been conducted to track the journey of the word [selfie] from genesis to its acceptance in the Oxford English Dictionary (OED) and to observe its impact on digital discourse.
By using tools such as Google Trends and the Google News scraper, this study seeks to represent the word’s growth in popularity and to discuss how this growth demonstrates a need for this particular word to be integrated into the lexicon.

Lexicography is the process of writing or compiling dictionaries, which act as a record of language. The discipline is considered separate from linguistics, whose object of study is language rather than the dictionary. However, the dictionary is the recognized authority on what is considered a part of the language. With the Internet wielding influence over the creation of seemingly endless neologisms, i.e. new words not yet accepted into the mainstream, the arrival of these words in the dictionary offers legitimacy. In the case of [selfie], even the validity of an OED entry did not deter its critics. Internet language is frequently considered a degradation of the language overall, but the mainstream reaction to [selfie]’s inclusion in the dictionary was particularly acerbic. This is in part due to its Internet origins, and in part due to what the word represents. Its definition may be recorded in the lexicon, but its cultural impact illustrates the sociolinguistic perspective of Internet linguistics. Social processes and social forces can shape words and their meanings. In Internet linguistics, this applies not only to the social norms that are developed with regard to different forms of communication (the difference between a blog post and a Tweet, for example) but also to the way society reacts and relates to the impact of the Internet on language. In the case of the word [selfie], the perception of the word is colored by its association with the narcissism and superficiality that some attribute to the Millennial generation. The cultural practice of taking selfies is possibly more condemned than the word itself. Its inclusion in the dictionary acts as an invitation for criticism. This study will further examine Internet linguistics and how its inventions and influences move from the idiosyncratic new media landscape into the mainstream, through newspapers and the dictionary.

The emerging field of Internet linguistics draws from a vast corpus of material, but also faces the challenges of a relatively shallow field. Linguistics as a whole, its predecessor and reference point for history and methodology, operates over a much wider scope. Internet linguistics is effectively a subset, utilizing the various frameworks laid out over the years and applying them to a specific, new media corpus. However, therein lies the challenge. The influence of new media is omnipresent, touching not only the evolution of vocabulary and grammar, but the reasons they change, the catalysts for change, the methods of change, and the cultural effects of these changes. Sociolinguistics provides a framework through which to examine language variation and change.
American linguist William Labov is credited with creating the discipline of variationist sociolinguistics and outlining a methodology for wider sociolinguistic research, including correlating linguistic variants with sociological categories such as age, class, location and gender. According to Labov, a complex society begets complex linguistic variation. Sociolinguistic variables can have an impact on language when they are frequent, not consciously suppressed, easily quantifiable and scalable, and integral to the structure of the language (Social Stratification 3). While phonetic variation is the most common and the most easily measured against these criteria, the focus in this study will be on lexical and grammatical variation. Online text strips language of its phonetic dimension.


On the Internet and in digital culture, one of the most influential sociological categories in language variation is age. A generation of digital natives is more likely to create and adopt words related to new media technologies. This can be seen in the [selfie] case, where there is a generational divide in the frequency of taking selfies (Faw; Blow). However, age is not the only influencer. Language change emerges because of a combination of social factors. When considered in the context of media, social processes intersect with materiality and the requirements of the particular medium.

Welsh academic Raymond Williams championed the notion of social determinism, as a critique of the technological determinism of media theorists such as Marshall McLuhan, whose mantra “the medium is the message” is a bellwether for media theory (McLuhan and Fiore). While McLuhan believes that the social contexts of media are relevant, Williams asserts that a complex set of social factors defines the meaning of media processes. In the context of sociolinguistics and Internet linguistics, social factors still have the most impact on the medium of language. However, Internet linguistics does allow for some media determinism, as new digital forms are influencing the language through features such as character limits.

This paper will be laid out in five chapters. Chapter One deals with the scope of the study, examining why Internet linguistics is worth studying, how it relates to the academic definition of new media, and the challenges that are faced when dealing with the Internet as a corpus for language. Never before has there been such an expansive record of language, but phenomena such as the Deep Web and corporate interests can place limitations on access and research methods. Chapter Two will delve into language change and language variation. Change can be examined diachronically (historically) or synchronically, which deals with a moment in time and a particular generation. Internet linguistics is heavily associated with the Millennial generation. While the medium in which language is used can have influence in shaping language, social processes and social factors play a vital part, as demonstrated in the study of sociolinguistics. In Chapter Three, the new field of Internet linguistics and its field leaders, including Crystal, will be discussed. The area most relevant to this study is the examination of vocabulary change, showing how new words emerge and that they are often rooted historically.
Chapter Four will discuss the dictionary, in particular the Oxford English Dictionary, which is the leading authority on the English language. Because inclusion in the lexicon can signal official acceptance, it serves as a measure of whether Internet lingo will infiltrate mainstream discourse. Finally, Chapter Five will examine the case of [selfie] as it evolved from Australian slang online to one of the most popular words to come out of the Internet and into the dictionary.

RESEARCH QUESTION

How does the inclusion of ‘selfie’ in the Oxford English Dictionary demonstrate language change in the age of the Internet? To what extent has Internet Linguistics influenced mainstream discourse and the lexicon?

CHAPTER ONE: THE SCOPE OF RESEARCH

I. Research Overview

Internet linguistics is a field that has emerged from the application of linguistic principles to the corpus of language available online. The Internet, its technical requirements, its ecosystem, its uses, and its communities all have influence on language and how it is used. This study is particularly interested in language variation and how the Internet has played a role in the creation of new language forms such as vocabulary and grammatical structures.

This paper will begin by positioning the study and the field of Internet linguistics in relation to new media studies. In colloquial use, the term new media has a different connotation than when used in academia and the humanities. Defining new media studies as an academic field will help demonstrate which linguistic theories coalesce with media theories and which have a more fractious relationship. Even though new media studies is its own discipline, the approach to new media from other academic disciplines often focuses only on how particular new media objects impact their area of examination. Internet linguistics has been guilty of this as well. It tends to focus on language through the prism of new media forms, such as email or text messages, and the related social processes. In this paper, new media theory will be used to develop these linguistic examinations further, positioning and challenging established theories about language variation.


While Internet linguistics spans all linguistic practices observed in digital discourse, this paper is most interested in how new media forms have influenced language change. In linguistics, language change is closely associated with (socio)linguistic variation, which deals with differences in language through elemental variations such as pronunciation, word choice and syntax (Tagliamonte; Holmes). Word choice refers to vocabulary and lexicon. Language can vary distinctly from person to person (an idiolect is an “individual’s way of speaking, including sounds, words, grammar and style”), from group to group, and from environment to environment (Wardhaugh 4). However, variation has limits; altering pronunciation, inflection, grammar or syntax can result in gibberish and communication breakdown (Wardhaugh 5). This is a key idea to remember when examining language influence emerging from the Internet. There is a popular feeling that the Internet has had a major impact on how we use language. However, one observes that new language forms and neologisms that have been accepted into the mainstream have generally not strayed too far from what already existed (Crystal, Language and the Internet 122).

The empirical case study in Chapter Five looks at how the term [selfie] made the journey from Internet slang to an Oxford English Dictionary entry, a hallmark of official acceptance into the lexicon. Starting as an Australian English slang term, it used a common diminutive convention to modify an already existing word. It then existed online for over a decade before its popularity started rising. The OED took note in 2013. Mainstream media sources, represented by the news media in this study, would eventually adopt the word, with most only incorporating it into articles after it was already in the official lexicon.
Even after that milestone, mainstream usage still remains low, which can be attributed to a combination of perceptions of the word’s meaning, common usages, and its continued characterization as informal slang (Zimmer).

The informality of Internet language, both officially recognized and not, receives a lot of negative attention (Chayka; Poole; McCrum; Crystal, Txtng: the Gr8 Db8). Some linguists have pointed out that new language conventions (such as text speak) do not hinder a child’s acquisition of formal language. In fact, children must be aware of the established rules of language in order to incorporate these variations (Crystal, The Scope of Internet Linguistics; Crystal, Txtng). Written prose, also, can remain unaffected or even thrive under new creative challenges imposed by technology and software requirements. The web allows for more writing to be produced, without the limitations of space and material that existed


previously (Poole). Slang is often associated with class, even though it “indicates a style of language rather than a level, formality or cultivation” (Jones; Drake 64). There exist language purists who echo the Orwellian sentiment that written use of the English language has become “ugly and inaccurate” under overwhelming political power. These people may, in the case of the Internet, frame these new, informal ways of speaking and writing as debasements instead of acknowledging that certain new words or language forms may have inherent value. They may also ignore the social forces and institutions of language that are policing the Internet frontier (Orwell; McCrum; Chayka).

Over the years, the Internet has become multilingual, which poses its own set of challenges for linguistic study, but this study will be restricted to English (Crystal, Internet Linguistics 78-91). The most-used language on the web is technically not known, but as of June 2015, “English is used by 55.4% of all websites whose content language [is known]” (W3Tech). Therefore the Oxford English Dictionary will be considered the authority that determines when a word is officially accepted into the lexicon. While the Internet, as a public sphere, has become a democratic place for language innovation, societal power structures still exist and the dictionary has taken on the role of oligarchical arbiter. Language can be used to wield or challenge power, and to establish power relationships. Foucauldian discourse analysis applies the theories of Michel Foucault to examining these power relationships in society through language, analyzing how the social world, and the language through which it is expressed, are influenced by power sources (Given 248). This power dynamic can be applied to the relationship between the dictionary and language itself, and between the dictionary and the speakers of the language.

II. New Media Positioning

As a new media study, this paper requires a definition of ‘new media’ and its positioning within academia. Many of the most influential media theories were developed before the emergence of modern digital mediums and thus were based on older media such as the printed book, film and television. This calls for a reexamination and reevaluation in order to compensate for the new challenges of engaging with digital forms (Blotter, 2003: 16). In media studies, and in the humanities as its umbrella field, the approach to theory can be broken down five ways: (1) theory as an unproven explanation, rather than fact or law; (2) theory as an explanation of empirical facts, such as a social phenomenon; (3) theory as a technique for analysis rather than explanation, often across disciplines; (4) normative or


critical theories that are concerned not only with what is but with what ought to be, vital in media studies as media are seen as critical for power, influence and domination; and (5) theory as a genre, source and resource for understanding, interpretation, conceptualization, and criticism (Rieder 47). Theory as a resource can be harnessed when performing interdisciplinary studies, as is the case in this paper. Here, media studies and new media studies will provide a backdrop against which linguistic and sociolinguistic theories will be examined and utilized, a combination which has already helped develop the emerging field of Internet linguistics.

It is media that “form[s] the infrastructural basis, the quasitranscendental condition, for experience and understanding” (Mitchell and Hansen vii). Mitchell and Hansen go on to claim that media is and should be central to all humanities and humanistic social science research (vii). Media studies continue to be associated with a set of approaches, rather than a unified field, allowing for integration of many other disciplines and areas of research (Mitchell and Hansen vii). Language and media are deeply entangled. However, in part due to the ambivalent, miscellaneous nature of media studies, there is no established discipline for examining the two together, in contrast to disciplines that fall firmly under the umbrella of linguistics such as sociolinguistics, pragmatics, and stylistics (Durant and Lambrou 2). Media theories should be approached as conceptual resources “for the active interpretive practice of making sense of and critiquing the world” (Rieder 10).

Marshall McLuhan, a pioneer of media theory, is best known for two main ideas: “the medium is the message” and the “global village” (Understanding Media; The Gutenberg Galaxy).
The latter may have predicted the web almost 30 years before it achieved wide acceptance:

The next medium, whatever it is – it may be the extension of consciousness – will include television as its content, not as its environment, and will transform television into an art form. A computer as a research and communication instrument could enhance retrieval, obsolesce mass library organisation, retrieve the individual’s encyclopedic function and flip it into a private line to speedily tailored data of a saleable kind (McLuhan The Gutenberg Galaxy, 293).

The term “global village” describes the contraction of the world through electronic technology, allowing the instantaneous transfer of information and “bringing all social and political functions together in a sudden implosion” that “has heightened human awareness of responsibility to an intense degree” (McLuhan Understanding Media, 5). The world is continually shrinking and expanding simultaneously. Communication facilitated by the Internet has diminished barriers caused by physical distance and fostered the creation of communities online while folding in more and more individuals as they get connected. This language, when appropriated by linguists, can be applied to studying the language of new media and sociolinguistic variation through speech communities.

McLuhan’s theories fall under the concept of technological determinism, meaning that media shape society and human experience independent of their content (Rieder 17). The opposite end of this particular spectrum of debate is social determinism, championed by Raymond Williams, an ardent critic of technological determinism. Social determinism acknowledges that the materiality of media is important but social practices do not wholly flow from it (Rieder 17); technology, its rate and direction of innovation, are influenced by the social processes in which they are created (Wei). “Determinism is a real social process, but never… a wholly controlling, wholly predicting set of causes” (Williams 133). Determination is not controlled by a single factor, but rather a set of forces, including “the distribution of power or capital, social and physical inheritance, relations of scale and size between groups” and how they intersect (Williams 133). When looking at Internet linguistics and language change, Williams’ approach and social determinism can be more confidently applied. While a major input into the study of language online is the material that influences how communication and communication processes are performed, already established social factors also have great influence. Language is inherently social; the study of social factors impacting language, and vice versa, is the content of sociolinguistics. This is not to negate the role of technology, media and materiality in shaping communication and language. Internet linguistics, as the field of this research, is heavily influenced by new technology, both in the creation of new vocabulary as well as new environments for communication that shape communication norms and determine how language is used based on technological factors and requirements (Beard). For example, the way an email is written is distinctly different from how a text message is written or a blog post is written (Beard). 
With the emphasis on how certain new media forms and ecologies can impact language, it is necessary to characterize what new media is. New media forms and tools will be referenced throughout the study, but the field extends beyond a popular definition of digital technology.


The academic definition of new media calls for much more specificity than does its colloquial usage (Hansen 173). Because this study is inherently concerned with how words are formed and defined, it is important to draw the distinction. Just as the empirical section of this study will demonstrate colloquial usage of a word originating out of digital culture versus its acceptance into the dictionary and the English lexicon as official language, so will the same distinction be drawn for new media. Popular use of new media invokes the digital and is a catchall term for anything related to the Internet and its intersection with technology, images and sound (“What is New Media?”).

Academically, this interpretation is challenged. Lev Manovich says that terms such as “digital” and “interactivity” can only be applied to new media with qualification (The Language of New Media 68-74). Digital representation, commonly thought of as the way media has been redefined, is a complex idea which can be broken down into three unrelated concepts: “analog-to-digital conversion (digitization), a common representational code, and numerical representation” (Manovich, The Language of New Media 68). When something is claimed to be new media because it is digital, one must also specify which of these concepts is applied. This can lead to ambiguity (Manovich, The Language of New Media 68). Numerical representation is what turns media into programmable data and is the interpretation of digital that can most closely be aligned with the common perception of radical reinvention of media (Manovich, The Language of New Media 68). It is what builds the environment in which this study can take place.

Most significant still, for this study, is Manovich’s view of interactivity, which, like digital, requires qualification. “Used in relation to computer-based media, the concept of interactivity is a tautology” because human-computer interfaces are inherently interactive (Manovich, The Language of New Media 71). In the study of Internet language and computer-mediated communication, interactivity is viewed from a sociopsychological perspective; however, when used in regard to new media studies, it intersects with other concepts, such as open or closed interactivity (Manovich, The Language of New Media 71). For this study of Internet language, interactivity is inherent, as Manovich says is the case in all human-computer interaction, and language can be shaped by the technical requirements of a new media form, established through the numerical representation of digital.


III. The Corpus

As a field of study, Internet linguistics relates to corpus linguistics, which takes preexisting and naturally occurring texts as its objects of study and analysis, that is, “real life language use” (McEnery and Wilson 1). These texts can be derived from film, literature or indeed the Internet, with its bountiful record of language and communication. Historically, corpora were collected by hand, but now automated processes can scrape extremely large amounts of data. The first computerized corpus was a study of grammatical variation from transcribed spoken language, done by the Montreal French Project in 1971 (Sankoff and Sankoff). Methods of research and analysis in corpus linguistics can be divided into three perspectives: Annotation, which consists of the application of a scheme to a text, including part-of-speech tags and sentence or word structure markups; Abstraction, which takes the annotation scheme and translates it into a theory or model; and Analysis, in which the scheme or dataset is probed or manipulated to expose statistical patterns or rule discoveries (Wallis and Nelson).

One of the most prominent linguists of the modern era, Noam Chomsky, argued against corpus linguistics, saying “it doesn’t mean anything” (Andor 97). He was vehement in his view that we can learn more about language by taking a purely scientific approach. Chomsky compared linguistics to physics, pointing out that physicists do not simply videotape happenings in the world and then derive insights from hours of footage (Andor 97). He urged well-defined experimentation instead. While some corpus linguists have challenged Chomsky’s stance, it would be naive to deny his influence. The clash between these two perspectives represents the essence of a rational-empirical divide that exists in linguistics as well as in other academic fields (McEnery and Wilson 5).
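The three perspectives named above (annotation, abstraction and analysis) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three-sentence mini-corpus is invented, the tag lexicon is a toy scheme rather than an established tagset, and the function names are hypothetical. It is not drawn from Wallis and Nelson or from the data used in this study.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; these sentences are invented for illustration.
CORPUS = [
    "I took a selfie at the beach",
    "She posted the selfie online",
    "They took a photo of the beach",
]

# Toy annotation scheme (an assumption, not a real part-of-speech tagset).
LEXICON = {
    "i": "PRON", "she": "PRON", "they": "PRON",
    "took": "VERB", "posted": "VERB",
    "a": "DET", "the": "DET",
    "at": "PREP", "of": "PREP",
    "selfie": "NOUN", "beach": "NOUN", "photo": "NOUN",
    "online": "ADV",
}


def annotate(sentence):
    """Annotation: attach a part-of-speech tag to every token."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [(token, LEXICON.get(token, "UNK")) for token in tokens]


def abstract(annotated_corpus):
    """Abstraction: reduce the annotated text to a simple model,
    here a count of how often each (tag, word) pair occurs."""
    model = Counter()
    for sentence in annotated_corpus:
        for token, tag in sentence:
            model[(tag, token)] += 1
    return model


def analyse(model, tag):
    """Analysis: probe the model for a pattern, here the relative
    frequency of each word that carries the given tag."""
    counts = {token: n for (t, token), n in model.items() if t == tag}
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


annotated = [annotate(sentence) for sentence in CORPUS]
model = abstract(annotated)
noun_shares = analyse(model, "NOUN")
```

In a real project the toy lexicon would be replaced by an established tagger and the three invented sentences by a scraped corpus; the sketch only shows how each step feeds the next.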
In linguistics, the rationalist theory is “based on artificial behavioural [sic] data and conscious introspective judgements” and has the fundamental goal of cognitive plausibility (McEnery and Wilson 5). “The aim is to develop a theory of language that not only emulates the external effects of human language processing, but actively seeks to claim that it represents how the processing is actually undertaken” (McEnery and Wilson 5). The empirical approach, on the other hand, is less concerned with the cognitive processes of language, instead looking at naturally occurring data (grammar and spelling, for example) to compare one specific occurrence to the larger corpus in order to determine its validity.

Since the basis of Chomsky’s linguistic theory is that the cognitive principles underlying the structure of language, making communication through its prescribed rules possible, are biologically determined and innate, his contribution falls in the realm of language acquisition and understanding (Lyons 4; McEnery and Wilson 5). He argues that all humans have the same basic underlying linguistic structure, regardless of social or cultural influences (Lyons 6). People are able to understand a wide array of messages expressed through varying formal grammar structures due to humans’ innate grammatical cognizance. While Chomsky’s assertions are valid, this study is concerned with examining the changes that happen to a corpus superficially, through social and cultural influence. Chomsky was interested in competence rather than performance, the latter being the hallmark of corpus linguistics (McEnery and Wilson 6).

A. Corpus Limitations

One of the challenges of corpus linguistics, independent of Chomsky’s critique, is the practicality of access. Prior to the Internet, access was limited physically through destroyed or distant texts. A corpus has definable physical boundaries in the pages of a book, temporal boundaries in the spoken word, and easily established authorship (Crystal, Brave New World 1). On the Internet, these boundaries are thrown into question. Authorship, and where a text starts and ends, may be unclear, which is problematic when putting together a corpus (Crystal, Brave New World 1).

In the field of Internet linguistics, the corpus, and to what extent it can be constructed, is determined by what is available on the eponymous Internet. However, the Internet is impossibly large and there are many layers of restriction that determine to what extent a researcher might have access to desired data. The Deep Web, also known as the invisible web, “comprises all the information sources available on the World Wide Web that are overlooked by conventional search engines, including Google” (Devine and Egger-Sider).
It is the largest share of the web and the fastest growing category of information and information sources on the web (Devine and Egger-Sider). This content, not indexed by search engines, spans the spectrum of information and activity, including both academic databases and illegal marketplaces (Devine and Egger-Sider; Poladian and Stone). While much of the Deep Web is accessible to anyone, navigating it requires know-how, which makes building a research corpus that includes this invisible content more challenging (Poladian and Stone). Tracking language change accurately is compromised when access is impeded.


More specifically, the Deep Web is a challenge to the research for this paper, since the empirical research seeks to track a word’s origins and growth over time. The scraped data relied on Google’s search engine and algorithms. Deep Web content is not surfaced by search engines, resulting in a significant gap in the scrapes that were returned. In this study, this problem was addressed by scraping major news sources to examine mainstream adoption of Internet jargon, represented by the word [selfie], and by using Google Trends to visualize the growth of its popularity. While the first recorded usage of [selfie] was identified on the public web, it could have been used previously in areas that restrict access. The true genesis of new words is often ambiguous, and the limiting of the corpus as a result of unindexed web content or restricted access makes it even more difficult to track and analyze language change online. The Deep Web contains innumerable chat threads, online forums and other modes of recorded communication that might be missed in linguistic research, and that in theory could have been relevant to this study. However, to a certain extent this can be balanced out by the principle of relevance (Crystal, Internet Linguistics 140). The notion that “human cognition is geared to the maximisation of relevance” and that “utterances create expectation of optimal relevance” illuminates how the context of what is written online can determine what we choose to examine in linguistic research (Wilson and Sperber 249). Coherence and the identification of spam contribute to determining relevance and building a corpus, as does, crucially, the process of web indexing that underpins all online linguistic research (Crystal, Internet Linguistics 141). Not only do web crawlers and web indexers miss a majority of web content, but different tools and algorithms can produce different results, affecting the makeup of the corpus. David Crystal provides an example in his book Internet Linguistics showing how a search for [linguistics] and [phonetics] in four different search engines (Yahoo, Google, Bing, Ask) produced results that varied from 62.2 million hits to 3.2 million for [linguistics] and from 8.1 million to 0.4 million for [phonetics] (141). Each of these search engines has determined relevance in a different way and built its algorithms to reflect that. This paper’s empirical research relies on Google exclusively because of the limitation of the tools available. While the results produced reflect a certain pattern and progression that fits with the narrative and research of language change, it must be acknowledged that results might have varied if other search engines or web crawlers had been used.


B. The Corpus of the Dictionary

The history and power of the dictionary will be further discussed in Chapter Four, but it is appropriate to briefly comment here on its corpus and on lexicological practices. Oxford Dictionaries and Oxford University Press, the publishing house that puts out the Oxford English Dictionary, have a carefully curated corpus made up of collected materials including web pages, printed texts and certain academic journals to supplement subject areas which lack materials elsewhere. While I was unable to gain direct access to their corpus or processes beyond what they publish publicly, Oxford Dictionaries has, on its website, a fairly comprehensive overview of its requirements for new words and its criteria for inclusion. This, along with its Words Blog, provided sufficient insight for this study. Oxford employs lexicographers whose job is to sift through text and identify new words, new uses for words or new meanings (“How do you decide if new words should enter Oxford Dictionaries?”). Due to the pervasiveness of the Internet in today’s communication and archiving practices, many of these texts are now hosted online and in databases (“The Oxford English Corpus”). For this reason, Oxford must deal with the challenges of the Deep Web and corporate influences. However, because of their perceived prestige and authority as record keepers of the English language, Oxford’s lexicographers are likely granted access to data out of reach for researchers such as myself. For example, with their resources, they would be able to make use of tools that scrape and sort language in social media but come with a hefty price tag, whereas I would not have these systems available.

CHAPTER TWO: LANGUAGE VARIATION

I. Language Change and Language Variation

English is a living language and is constantly evolving. Therefore, studying how it is used in real-life social contexts is challenging (Ahearn 1). The types of language change that this study is most concerned with are lexical changes, semantic changes, and to some extent syntax changes. While phonetic and phonological changes, dealing with pronunciation and accents, are significant, this study focuses on text-based communication online, eliminating the oral dimension. Word-formation, the main driver of lexical change, can be broken down into two approaches: onomasiological and semasiological (Stekauer 207). The first, which studies the act of naming, has historically been sidelined in the examination of


word-formation in English (Stekauer 207). The second focuses on “analysis of the already-existing word stock”, moving from form to meaning rather than the other way around (Stekauer 207). While onomasiology asks which word should be applied to an object or concept, semasiology starts with words and then applies meaning. When examining lexical changes, especially in a time of rapid technological change, the onomasiological approach seems to be a better framework through which to examine new words. Lexical changes can be diachronic, occurring historically or over time, or synchronic, occurring at a specific point in time. Diachronic linguistics studies the succession of and links between synchronic periods, comparing the grammar of at least two stages of language (Lightfoot 5). Since Internet linguistics has thus far been bound to the recent era, beginning with the dawn of the World Wide Web in the early 1990s, a diachronic study would be superficial. With the Internet’s penchant for neologisms, a synchronic examination is more appropriate (Crystal 59). The case study in Chapter Five will further examine a word coined as a result of a specific need at a specific moment in time. Semantic and syntax changes, however, happen over longer periods of time, with few exceptions. When new grammatical constructions have emerged online, they overwhelmingly affect punctuation and capitalization, and have not always been accepted as legitimate rules. For example, the question mark is often left out in instant message chat and Twitter feeds where Standard English rules dictate there ought to be one (Cohen). This new grammatical form likely emerged as a matter of convenience, with Internet users not wanting to hit the Shift key during a conversational exchange, which is an example of how new media technology has impacted language. Caleb Melby, writing for Forbes Magazine, asserted that leaving off the question mark has no effect on the inference that a question is being asked.
He does, however, note that there may be a generational divide in this understanding. These minor grammatical adjustments can add up to ultimately change what language looks like over the course of time. Syntactic change, however, alters natural sentence structure, affecting language forms such as how verbs are conjugated, how they agree with their subject and in what order parts of speech appear (Kroch). Such changes are among the most impactful to language. It may appear that new syntactical structures are rampant online. However, one of the few tangible changes that has emerged from the Internet and has been integrated into mainstream language is the evolution of the word [because] from serving merely as a subordinating conjunction to also acting as a preposition (Garber). As a subordinating


conjunction, [because] had two distinct forms, followed either by a finite clause or by a prepositional phrase. The prepositional phrase has become optional, allowing [because] to fulfill its function as the prepositional-because (Garber). This is how [because + x] clauses such as [because reasons] have become an accepted grammatical form. The origins of this construction are disputed, like those of many linguistic forms, and are traced back to the Three Word Phrase comic in 2011, according to Internet and pop culture linguist Gretchen McCulloch, a 1987 Saturday Night Live sketch, according to linguist Neal Whitman, or a combination of uses already existing in the linguistic landscape (Soloman). While not yet officially sanctioned under the rules of Standard English, the dictionary definition of [preposition]² allows for the new sentence construction to be employed in informal language (McCulloch, Because). McCulloch challenges the idea that [because] is a true preposition, observing that it has canonically been used with pronouns, interjections and certain verbs, unlike other prepositions. Geoffrey K. Pullum, writing for the University of Pennsylvania’s Language Log, maintains that [because] has always been a preposition, stating that prepositions historically have paired with pronouns, interjections and verbs. The American Dialect Society named [because] their word of the year in 2013, in line with this grammatical evolution (“‘Because’ is the 2013 Word of the Year”). Incidentally, [because] beat out [selfie], the object word examined in this study’s empirical research, which was named word of the year by the Oxford English Dictionary that same year. Linguists still disagree about whether [because] should be given the designation of preposition, yet its mainstream and academic recognition demonstrates that the Internet can have syntactical influence, even if it is minor and incremental. This example demonstrates how difficult it is to determine the root of a change, yet the causes of language change play a significant role. Language economy, language contact, migration and the medium of communication all can create language change (Vicentini 37; Thomas and Kaufman).
The Internet is home to all of these factors: users tend to abbreviate for speed and character economy; the Internet is multilingual, allowing both for language contact and cross-pollination; and new media formats have built-in rules that shape how text is put in and presented (Crystal, Internet Linguistics 5; 78-82; Crystal, Language and the Internet). Language variation in all its iterations drives language change.

² McCulloch cites the Merriam-Webster definition of [preposition], which reads: “a word or group of words that is used with a noun, pronoun, or noun phrase to show direction, location, or time, or to introduce an object”


American linguist William Labov is widely regarded as the foremost expert on language variation, and in particular variationist sociolinguistics, which became internationally recognized after Labov presented the first sociolinguistics research report to the annual meeting of the Linguistic Society of America in December 1962 (Chambers and Schilling 2). His success was due in part to the comprehensive framework for sociolinguistic research he laid out, including “correlating linguistic variants with class, age, sex, and other social attributes”, considering style as a separate variable and tracking changes generationally, over apparent time (Chambers and Schilling 2). In the context of this study, the millennial generation is the preeminent group tied to most of the linguistic innovations coming out of the Internet. Millennial speech is considered so distinct and generational that a survey by John Sutherland, professor of English at University College, reported that over 80% of parents could not correctly define the argot their children used online (Press Association, “ICYMI, English language is changing faster than ever, says expert”). This generational division cannot be solely attributed to the Internet. Every generation demonstrates a “distribution of linguistic behavior through the various age levels of the population” (Labov 133). Age is the social attribute that gets the most attention when discussing Internet language because a growing percentage of the population, in the developed world in particular, has grown up as digital natives, inherently embedding vocabulary that has to do with new media technology into their individual lexicons (Prensky 1).

II. Sociolinguistics

Sociolinguistics is treated by some practitioners as “complementary to ‘linguistics proper’” in reaction to the latter’s pervasive neglect of socially conditioned variation in language (Fairclough, Language and Power 6).
Its study explores the “systematic correlation between variations in linguistic form … and social variables,” such as age, class, the nature of relationships between speakers and different social settings (Fairclough, Language and Power 6). But sociolinguistics is too often concerned with the ‘what’ rather than the ‘why’ or ‘how’ when observing sociolinguistic variation (Fairclough, Language and Power 6). When a variation is observed, sociolinguistics is weak in recording what factors led to the change, what relationships and power structures wielded influence, how they came to exist, how they are being sustained and to what extent language is a weapon of influence therein (Fairclough, Language and Power 6). This paper does not seek to rectify the flaws of sociolinguistic study, but by drawing on new media theory it can give insight into how the social influences present online affect Internet linguistic variation, and how these influences were built and maintained.


Words are socially charged. “Language is not a neutral medium for communication but rather a set of socially embedded practices” (Ahearn 8). As Raymond Williams proposed, social influence has more weight in shaping human processes than material and technology. Technological determinism and McLuhan’s notion of the medium as the message can give insight into how any object can contain meaning, and how the package in which a message is sent or received can alter its implications. However, when isolating language as its own medium, it is evident that its use and meaning are intrinsically tied to social processes and perceptions rather than just their objects of presentation. It follows that language should be considered through social determinism, with an acknowledgement that language is shaped by a collection of complex factors (Williams, Television 133). Williams was concerned with language in the context of cultural discussions too, defining the vocabulary used in discourse in his book Keywords: A Vocabulary of Culture and Society. Specifically, he was preoccupied with the eponymous ‘culture’, in which he observed a shift in use and meaning. At first he encountered the word used either as a marker of a kind of social superiority or as a word describing the creation of artifacts such as books and film (Williams, Keywords 12). Later, he saw it used in reference to “some central formation of values” and in a more general sense which nearly equated it with the word ‘society’ or a particular way of life (Williams, Keywords 12). He noted how these different meanings surfaced from his experience with various groups of people and concluded that ‘culture’, like so many other words, is difficult to grasp and symbolizes language change and the effort to understand it (Williams, Keywords 13).

III. Speech Communities

As noted in Williams’ example of ‘culture’, the meaning and use of the word changed based on the community that was using it.
Language shift is defined as the “process by which a speech community in a contact situation (i.e. consisting of bilingual speakers) gradually stops using one of its two languages in favor of the other” (Ravindranath v). Speech communities in contact with each other are the main catalyst of language change. William Labov’s definition of a speech community is one “not defined by any marked agreement in the use of language elements, so much as by participation in a set of shared norms” (Sociolinguistic Patterns 120). Most groups that demonstrate any sustainability or


66). For language change to occur in these speech communities, they cannot be completely homogeneous (Chomsky, Aspects of the Theory of Syntax 3). The social variables identified through sociolinguistics can all influence the disruption of homogeneity in these speech communities, allowing for language change. The defining characteristic that determines whether a speech community lends itself to language change is whether it is considered to have strong-tie or weak-tie networks. In social network theory, strong ties are considered the drivers of change (Barnes 215-228). However, weak ties are more conducive to language change because people who are not strongly tied are more likely to move around and migrate between groups, which exposes language to variation (Granovetter; J. Milroy and L. Milroy). Online communities are difficult to define due to their virtual borders, yet if we accept Labov’s assertion that speech communities merely need to share a set of language norms, and Gumperz’s view that they merely need the notion of permanence to be considered a community, many would qualify. The virtual borders of the Internet generally allow for free movement across the web, and online platforms of linguistic innovation, such as social media, could be considered to have many weak-tie members. The Internet is too broad to be considered a singular community, yet it is host to many communities that interact and drive language change.

CHAPTER THREE: INTERNET LINGUISTICS

Internet linguistics, as a field of study, emerged when email and chat rooms started to produce a pattern of blatant informal language usage in the early to mid-1990s (Crystal, Internet Linguistics). Since then, the corpus of language change and invention that has come out of Internet culture has been met with a consistent chorus of disapproval. The advent of text messaging between mobile phones brought with it the much-reviled text speak, which heavily featured abbreviations and the replacement of words with numerical characters (Crystal, Internet Linguistics 4). Social media has followed in its footsteps and become an environment that both produces and spreads slang terms at rapid speed, and exacerbates a speech pattern now synonymous with the Internet; its impact on language is, of course, also heavily criticized. At its most basic, Internet Linguistics can be defined as a field that studies the new or changed language styles and forms that have emerged from the Internet and various other new


media objects such as text messaging and email, as well as the culture that they influence (Crystal, The Scope of Internet Linguistics; University Of Wales Bangor). It covers any linguistic development with clear new media influence, which ranges from the avalanche of acronyms spawned by character limits in text messages and Twitter updates to words that describe a technological function, including names and trademarks that have become genericized, such as [google]³ (Crystal, Internet Linguistics; Noon).

I. Computer-mediated Communication

In order to understand how Internet linguistics came to be, we must look at computer-mediated communication and all the new ways of interaction that have come from the introduction of computers. While computer-mediated communication is too broad a view when looking at written language as a whole, it is the correct framework in which the current research can take place. CMC has been around for decades, and is defined as any human communication in which two or more electronic devices are involved (McQuail). This includes email, instant messaging, chat rooms, text messages, social media, and the like (Thurlow, et al.). The bulk of research in the area of computer-mediated communication focuses on the social norms and social effects of technology and practices that fall under the CMC umbrella. People behave and communicate differently in professional settings versus social settings. The method of communication adds another variable to interaction (Barab, et al.). One benefit of examining these mediated spaces is that “people engage in socially meaningful activities online in a way that typically leaves a textual trace, making the interactions more accessible to scrutiny and reflection than is the case in ephemeral spoken communication, and enabling researchers to employ empirical, micro-level methods to shed light on macro-level phenomena” (Barab, et al. 1).
Even with such records, studying online behavior still relies heavily on anecdotes and speculation, rather than on empirical evidence (Barab, et al.; Herring). This study does not seek to examine the evidential behaviors of groups and influencers online, even if they do have an obvious impact on language.

³ Google, the company, initially fought to keep its trademarked moniker out of the dictionary. Both the Oxford English Dictionary and the Merriam-Webster Collegiate Dictionary added the neologism [google], defined as a verb, in 2006. The word was kept lowercase due to copyright concerns (Noon). It is those copyright objections that allowed Google to successfully block the Swedish Language Council from adding [ogooglebar] – translated as ungoogleable – to its list of new words. As the proposed definition was “something that cannot be found with any search engine”, Google objected because it demands that words derived from its name relate only to its specific search engine, not to generic search (Fanning).


However, behavior and communication are intrinsically tied. On the one hand, behavioral communication studies examine how individual, indirect expressions of feelings, needs, thoughts and identities can substitute for more open and direct communication (Ivanov and Werner); on the other hand, there is a behavioral hive mind⁴ that can dictate which social phenomena gain traction. This is how memes, a form of computer-mediated communication, along with new forms of Internet jargon, can spread like wildfire across the Internet (O’Brien, Bowie and Johnston). When looking at language, CMC research can be applied in examining the differing habits of communication, depending on the medium. For example, the formality of language, addressing conventions, syntax and jargon usage vary greatly between a business email and a text message sent to a friend (Beard). It is this observation that underscores how certain customs or jargon present in one environment make the jump to being acceptable in another. In this study of Internet linguistics and e-lexicography, we are looking at words that originated or became popular in a specific environment, seeing their cultural impact and how they made the jump – or not – to being integrated in a different environment, differently mediated and with different social forces. Computer-mediated communication has long been accused of being a contributing factor to the deformalization of language (Crystal, Language and the Internet 2), or of creating a new, inferior language (Collot and Belmore). Emails, for example, have a distinct format – “a structure dictated by [...] mailer software” that requires a header, sender and receiver addresses and body text – but the purpose and what constitutes appropriate language for that environment is not always straightforward (Crystal, Language and the Internet 94). This results in part from huge user saturation, with an estimated 2.5 billion email users worldwide at the end of 2014.
The language in email is not as consistent as its linguistic features (Radicati; Crystal, Language and the Internet 94). There exists a tension “between the nature of the medium and the aims and expectations of its users” (Crystal, Language and the Internet 24). This derives from ambiguity over the Internet’s default style of “written speech,” and the fact that “electronic discourse is writing that very often reads as if it were being spoken – that is, as if the sender were writing talking” (Davis and Brewer 2; Crystal, Language and the Internet 25).

⁴ A notional entity consisting of a large number of people who share their knowledge or opinions with one another, regarded as producing either uncritical conformity or collective intelligence. (“Hive mind” Oxford


Typed messages sent through CMC challenge the binary notion of language being either written or spoken, not wholly falling within the linguistic characteristics for one or the other. Language in the situational constraints of CMC may flow with ease, be informal and switch topics rapidly, as is the case in speech. However, if the communication is typed out, the participants may not see or hear each other. Conversely, the immediacy of mediums such as instant messaging denies the communicator the “planning and editing strategies at the disposal of even the most informal writer” (Collot and Belmore 14). A part of CMC studies and Internet linguistics would seek to examine to what extent it is possible to ‘write speech’ given the restrictions of the medium and the materiality of a keyboard with set letters, numbers and symbols. In this study, the interest lies with the result of habits that formed from working within these CMC environments and what new language this produces. The speed of digital communication, and its custom of decreased revision, can sometimes result in an error or malapropism inadvertently coining a new term. Only rarely do errors from fast typing or a lack of editorial revision lead to true miscommunication, as the social functions and expectations of the environment are more forgiving (Crystal, Language and the Internet 111). Incidentally, writing speech is a factor in building the e-lexicon because people participating in computer-mediated communication have only a limited set of keystrokes with which to convey their meaning, demonstrating that necessity is the mother of invention.

II. Internet Linguistics and David Crystal

Computer-mediated communication studies often take a socio-psychological perspective. Internet linguistics focuses on the language and language forms, while considering the social processes that influence them and the social effects they have. However, the psychological dimension of communication receives less attention.
CMC is a building block for Internet linguistics, which is still a relatively new field of research without a large population of academics and linguists working in it. The leader of the field is the prominent British linguist David Crystal, OBE, who has been very prolific in publishing works that deal with the intersection of language and new media. Any research on Internet linguistics, including this study, will inevitably arrive back at his work. Here, his works will be discussed in relation to the field, what he defines as Internet linguistics and how other linguists approach it. Crystal coined the term Internet Linguistics and established it as a new field of research in his February 2005 paper, ‘The Scope of Internet Linguistics’, which he presented to the American Association for the
