Latent Semantic Analysis, Corpus stylistics and Machine Learning. Stylometry for Translational and Authorial Style Analysis: The Case of Denys


Latent Semantic Analysis, Corpus stylistics and Machine Learning Stylometry for Translational and Authorial Style Analysis: The Case of Denys Johnson-Davies' Translations into English

A dissertation submitted to Kent State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

by Mohammed Al Batineh

May, 2015

Dissertation written by
Mohammed Al Batineh
B.A., Yarmouk University, Jordan, 2008
M.A., Yarmouk University, Jordan, 2010

APPROVED BY
Dr. Françoise Massardier-Kenney (advisor), Chair, Doctoral Dissertation Committee
Dr. Carol Maier, Dr. Gregory M. Shreve, Dr. Jonathan I. Maletic, Dr. Katherine Rawson, Members, Doctoral Dissertation Committee

ACCEPTED BY
Dr. Keiran J Dunne, Interim Chair, Modern and Classical Language Studies
Dr. James L. Blank, Dean, College of Arts and Sciences

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
DEDICATION
ABSTRACT

CHAPTER 1: INTRODUCTION
  Introduction
  Denys Johnson-Davies
  Research Hypotheses
  Research Method
  Significance of the Study
  Summary of Chapters

CHAPTER 2: LITERATURE REVIEW
  A Brief History of Literary Stylistics
  Approaches to Style in Translation Studies
  Text-Oriented Approaches
  Comparative Approach
  Target-Oriented Approach
  Translator-Oriented Approaches
  Cognitive-Oriented Approach
  Conclusion

CHAPTER 3: METHODOLOGY
  Introduction
  Data Collection
  Corpus Database
  Corpus Compilation and Pre-processing
  Latent Semantic Analysis
  LSA Similarity Query
  LSA Similarity Cutoff
  LSA Output Evaluation
  Corpus Stylistics
  Standardized Type-Token Ratio (STTR)
  Mean Sentence Length
  Punctuation Marks
  Statistical Testing
  Machine Learning Approach
  Character n-grams
  Part of Speech (POS) n-grams
  Word n-grams
  Tools Used in the Dissertation
  Conclusion

CHAPTER 4: LATENT SEMANTIC ANALYSIS RESULTS
  Introduction
  4.2. LSA Similarity Analysis
  LSA Similarity Query on J-D's Translation before Creative Writing
  LSA Results with V=
  LSA Similarity Query on J-D's Translation after Creative Writing
  LSA Results with V=
  Conclusion

CHAPTER 5: CORPUS STYLISTICS AND MACHINE LEARNING ANALYSIS RESULTS
  Introduction
  Corpus Analysis
  Textual Analysis
  Standardized Type-Token Ratio
  Mean Sentence Length
  Punctuation Marks Analysis
  Standardized Hyphen Analysis
  Standardized Comma Analysis
  Standardized Semicolon Analysis
  SPSS Statistical Analysis
  Textual Analysis
  Standardized Type-Token Ratios (STTRs)
  Mean Sentence Length
  Punctuation Marks Analysis
  Standardized Comma Analysis
  Standardized Hyphen Analysis
  Standardized Semicolon Analysis
  Machine Learning Stylometry
  JGAAP Tool
  Corpus Pre-processing
  JGAAP Analysis Method
  Style Markers Analysis
  Character n-gram Analysis
  Part-of-Speech (POS) Analysis
  Word n-gram Analysis
  Conclusion

CHAPTER 6: DISCUSSION
  Introduction
  Zooming into the Results
  Thematic Analysis
  Textual Analysis
  STTR
  Mean Sentence Length
  Punctuation Marks
  Syntactic Analysis
  6.7. Word n-gram Analysis
  Character n-gram Analysis
  On the Translating and Writing Activities of J-D
  A Framework for Studying Translator Style
  Corpus Compilation and Control
  Digital Corpus Preparation: Corpus Preprocessing
  Style Markers Selection: Corpus Analysis Method
  Conclusion

CHAPTER 7: CONCLUSION
  Summary of Results
  Limitations of the Study
  Implication of LSA method for Translation Studies
  Future Directions

GLOSSARY OF ACRONYMS
REFERENCES
APPENDIX A: List of Denys Johnson-Davies Translated Short Stories
APPENDIX B: List of Denys Johnson-Davies Creative Writing Short Stories

LIST OF FIGURES

Figure 1: Matrix of made up documents
Figure 2: Made up documents in 2-dimensional space
Figure 3: LSA similarity query
Figure 4: One-to-many similarity query process
Figure 5: LSA Analysis of TBCRW_raw
Figure 6: LSA experiment 1 results (Q1--Q5)
Figure 7: LSA experiment 1 results (Q6--Q10)
Figure 8: LSA experiment 1 results (Q11--Q15)
Figure 9: LSA analysis of TACRW_raw
Figure 10: LSA experiment 2 results (Q1--Q5)
Figure 11: LSA experiment 2 results (Q6--Q10)
Figure 12: LSA experiment 2 results (Q11--Q15)
Figure 13: Machine Learning Translator style detection (Adopted from Efstathios Stamatatos)
Figure 14: JGAAP tool interface
Figure 15: Vectors of made up documents
Figure 16: Machine Learning Character 3-gram analysis
Figure 17: Machine Learning POS n-gram analysis
Figure 18: Machine Learning Word n-gram analysis
Figure 19: Framework for translator style analysis

LIST OF TABLES

Table 1: The corpora of the present study
Table 2: Penn Treebank tag set (Adopted from Marcus, Santorini, and Marcinkiewicz)
Table 3: Tools used in the dissertation
Table 4: CRW and TBCRW_raw size
Table 5: LSA output of similarity query analysis on TBCRW_raw
Table 6: CRW and TACRW_raw size
Table 7: LSA output of similarity query analysis on TACRW_raw
Table 8: Study corpora from the LSA results
Table 9: STTR score in the three corpora
Table 10: Mean Sentence Length score in the three corpora
Table 11: Standardized Hyphen score in the three corpora
Table 12: Standardized Comma score in the three corpora
Table 13: Standardized Semicolon score in the three corpora

DEDICATION

To my parents, Sabri and Aisha
To my family and friends

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my supervisor, Dr. Françoise Massardier-Kenney, for her guidance and insightful feedback throughout the writing of this dissertation. Dr. Kenney's support is beyond what words can express. I would also like to thank the committee members, Dr. Carol Maier, Dr. Gregory M. Shreve, Dr. Jonathan Maletic and Dr. Katherine Rawson, for their time and for all their comments and suggestions. Many thanks to the faculty members of the Department of Modern and Classical Language Studies for helping me grow as a graduate student and a researcher in translation. I would like to particularly thank Dr. Isabel Lacruz for her help refining the experimental component of this study. I am also grateful to Dr. Judy Wakabayashi for helping me present some of my research at two conferences and for always being encouraging and supportive. Special thanks to Dr. Nouh Al-Hindawi, Brian Bartman, Loubna Bilali, Mohammed Al-Rawashdeh, Adriana Di Biase, Bilal Sayaheen and Ibrahim Al-Omar for their help, support and friendship.

ABSTRACT

The analysis of style in the translation discipline typically relies on methods borrowed from literary studies. Most of the style-related research conducted in translation studies has focused either on the style of the author or on the text type as manifested in the translation, as opposed to the style of the translator. The few studies of translator style that have been carried out using corpus methodologies present some methodological limitations related to corpus compilation and control, which affect the analysis of style. To address these limitations, the present study adopts an interdisciplinary approach combining Latent Semantic Analysis (LSA) with methods from Corpus Stylistics and Machine Learning Stylometry in order to develop a rigorous framework for studying translator style. The suggested framework is developed based on the investigation of the translations and creative writings of Denys Johnson-Davies (J-D), a British creative writer and an Arabic-English translator. This study attempts to trace instances where the style of J-D the translator intersects with the style of J-D the author. It investigates the effect of J-D's translating activity on his own writing and vice versa in order to determine the extent to which the two activities influence each other.

The computational stylistic (corpus and machine learning) and the thematic (LSA) analyses suggest that J-D's style as a translator impacted his style as a writer. In addition, it was evident that translation helped J-D to develop his writing skills and style. Indeed, the translating activity served as a source of inspiration and intertextuality for his creative writing. As for the interaction between J-D's creative writing and the post-creative writing translations, the findings show that J-D's creative writing impacted the selection of short stories he translated after the production of his creative writing, which revolved around themes he developed as a creative writer.

CHAPTER 1: INTRODUCTION

1.1. Introduction

This dissertation investigates the interaction between the translation and creative writing activities of translators by focusing on the case of the translator-writer Denys Johnson-Davies (J-D). J-D was the first and one of the most influential Arabic-English literary translators of Modern Arabic Literature. The study focuses on J-D's style as displayed in his translations and his creative writings in an attempt to determine whether his translating activity influenced or was influenced by his creative writing. It discusses the history of stylistics within literary studies and its impact on the study of style in translation studies. It also reviews the literature related to translator style within translation studies. The study also highlights the limitations of previous research in this area; it establishes the need for more empirical investigations of translator style and proposes an interdisciplinary framework for translator style analysis combining methods from corpus stylistics, computer science and stylometry.

The interdisciplinary nature of translation studies has encouraged translation scholars to adopt methods from literary studies in order to study style. However, the application of such methods in translation studies is still debatable and problematic. Within literary studies, Geoffrey Leech defines style as "the sum of linguistic features associated with texts or textual samples defined by some set of contextual parameters" (55). In translation, style refers to the Target Text (TT) style as

manifested in its type [1] or to the translator style. The former has been studied either by tracing the manifestation of the Source Text (ST) style in the Target Text (TT) (Reiss, Translation Criticism - Potentials and Limitations: Categories and Criteria for Translation Quality Assessment), by finding out the extent to which the TT follows the conventions and the norms of the target culture (Nord) or by examining the realization of the translation brief in the TT (Vermeer). The latter (translator style) has been studied through the analysis of the stylistic patterns of a specific translator that distinguish him/her from other translators.

[1] Typology of texts based on their linguistic and stylistic characteristics. Based on these features, texts are divided into types depending on their function, which could be narrative, descriptive, expository or argumentative.

Kirsten Malmkjær defines translational stylistics as "the study of why, given the source text, the translation has been shaped in such a way that it comes to mean what it does" [emphasis in the original] (39). Works on translational stylistics have focused either on source texts or on target texts as their object of analysis. For instance, Eugene Nida, the leading theoretician of the linguistic school, focuses on the notions of content and form, message and style, as a way to talk about equivalence, placing more emphasis on the style of the source text as a reference for producing the style of the target text. In contrast, the functionalist approach has shifted the focus from source-oriented to target-oriented stylistics. The study of style has become more concerned with the function of the TT and has departed from any stylistic constraints imposed by the style of the ST. Katharina Reiss, for instance, has taken style as a

point of departure for her work on translation criticism. She relies on text types and the way the stylistic features of each text type should be exhibited in the target culture and the target text (TT). According to the functionalist approach, a poem needs to function as a poem in the target literary system. In the same manner, a novel needs to function as a novel based on the norms and the conventions of the target culture's literature (Nord).

With the emergence of the cultural turn, which placed more emphasis on translators as cultural mediators and translation as rewriting and manipulation, some scholars have started to argue for the importance of recognizing the translator's voice and presence in his/her translations and for considering translators as one of the main agents in the translation process. Theo Hermans argues, for example, that the TT always implies more than one voice in the text, more than one discursive presence. He also indicates that the illusion of transparency and the illusion of one voice blind the reader to the presence of the other voice (27). In the same vein, Lawrence Venuti in The Translator's Invisibility advocates making the translator more visible so as to resist and change the conditions under which translation is theorized and practiced (17). These arguments for a more sophisticated understanding of the translator's voice or visibility have implications for the study of style in translation studies.

The method of close reading, i.e., the careful reading of passages with an emphasis on lexical choices, figures of speech and the syntax of a specific text, has dominated the study of style or voice in translation. Close reading might be a good

tool when analyzing style in isolated texts; however, it is not applicable to a large corpus of texts. In this kind of corpus, close reading becomes "untenable as an exhaustive or definitive method of evidence gathering [; and in following it,] something important will inevitably be missed" (Jockers 9). Tracing the stylistic patterns of a specific translator in a large corpus of different translations through close reading is not only extremely time-consuming, but it is also likely that some stylistic patterns will be missed.

However, compiling and analyzing a corpus of texts for the purpose of analyzing translator style is not an easy task. Studies on this topic are rare and are mostly derived from literary studies (e.g. Mason and Abdullah). Gabriela Saldanha argues that stylistics in literary studies, as defined by Leech and other scholars, is meant to study the textual style of the translation by focusing on the reproduction of the source-text style in the target text, and not the personal style of the translator, which is a way of translating that distinguishes one translator's work from that of others, and is felt to be recognisable across a range of translations by the same translator (Saldanha 28). In order to determine this style across different TTs by the same translator, an interdisciplinary approach that goes beyond traditional close reading might be useful to allow translation scholars to analyze and reveal a translator's stylistic patterns of choice.

Recently, translation scholars have begun to use computational methods in the analysis of style. One of the first studies that makes use of computational methods to analyze translator style is Mona Baker's "Towards a Methodology". In her study,

Baker uses corpus stylistics, which is a method that makes use of the computer for extracting stylistic patterns in a large corpus of digital texts. Another seminal work on translator style is that of Saldanha. Following in Baker's steps, Saldanha makes use of corpus stylistics to analyze the style of two translators in an attempt to develop a methodology and to propose a working definition of translator style. However, the two studies present some methodological problems related to corpus compilation, analysis and control. These issues are discussed in more detail in Chapter two.

What might be needed is a broader perspective that adopts interdisciplinary approaches to reveal the personal attributes of the translator. Within computer science, non-traditional authorship attribution (AA) and stylometry have exclusively worked on developing computational methods, relying on artificial intelligence and on statistical analysis to analyze authorial style. AA is a domain aimed at automatically analyzing texts based on their author's style (Cristani et al.). Stylometry is also a field of study that draws on the use of computers to statistically analyze the style of one author or the variation in style between two or more authors. It builds on the assumption that writers have unique unconscious writing habits or features. These unique writing features are measured to create an author profile against which other texts or authors can be compared (Schulstad et al.). Tony McEnery and Michael Oakes define stylometric analysis as "an attempt to capture the essence of the style of a particular author by reference to a variety of quantitative criteria" (548). Different scholars in these varying fields have proven that methods used in such research are, to

a great extent, accurate (Grieve; Houvardas and Stamatatos; McEnery and Oakes) and outperform manual analysis of patterns of personal style of text producers. Applied to the personal style of authors, methods from authorship attribution and stylometry could in turn be applied to the study of translator style. In this regard, Meng Ji argues that stylometry is one of the best-established methodologies for studying the authorial style of any document author, but it has rarely been practiced on translation texts in exploring literary stylistics or authorship attribution. As a result, many scholarly works on individual translators' style seem to have based their judgments on limited excerpts randomly and irregularly selected from parallel source/target texts (Ji 79). Thus, the present study draws on the notion of translator style and adopts methods from authorship attribution and stylometry in order to analyze and compare the authorial and the translational style of Denys Johnson-Davies (J-D) and to propose an interdisciplinary framework for translator style analysis.

1.2. Denys Johnson-Davies

Denys Johnson-Davies is one of the most influential Arabic-English translators of modern Arabic literature. J-D was born in 1922 in Canada to English parents. He spent his childhood in Cairo, then in Sudan (Johnson-Davies 1). At the age of fourteen, J-D attended the School of Oriental Studies in London. In the summer of that year, he went to Cairo to learn Arabic; this was his first academic encounter with the Arabic language and culture. During his stay, he attempted to

immerse himself in the Arabic Egyptian culture, frequenting traditional cafés in Cairo, where he made several friends. This gave him the chance to become familiar with Egyptian culture, daily life and dialect, which had a great effect on his translation work and text selection. The following year, Johnson-Davies went to Cambridge, where he began reading Arabic literary works in addition to extracts from the Qur'an. This experience increased his knowledge of Arabic language, literary style and tradition. After graduation, he was employed by BBC radio. His duties included checking translations made into Arabic of talks to be broadcasted (Johnson-Davies 9). This was his first encounter with translation. At that time, he became eager to know more about Arabic language and literature, and he contacted famous Arabic short story writers, such as Mahmud Teymour, Naguib Mahfouz, Tayeb Salih and Tawfiq al-Hakim.

After meeting a number of Arab writers, Johnson-Davies took his first step towards translation in 1946 by translating two of Teymour's short stories into English. He published translations in literary magazines, with the help of a friend. At that time, Johnson-Davies was the only translator who was interested in translating Modern Arabic literature into English. He continued his career as an Arabic-English translator and he was the first to recognize and translate Naguib Mahfouz, the Nobel laureate. J-D has translated more than thirty Arabic volumes including short stories, novels and biographies. He also won many translation prizes.

After he had been translating for almost forty years, J-D began to write his

own short stories. He published his own collection in English under the title Fate of a Prisoner and Other Stories in The collection contains fifteen short stories discussing themes related to Arab culture and narrates stories taking place in different Arab countries such as Egypt, Sudan, Lebanon and the Arabian Gulf. However, J-D the author was not as successful as J-D the translator. His translations received more attention than his creative writing. He mentions in his autobiography that his work, Fate of a Prisoner and Other Stories, received no notices except for two reviews: one in Al-Ahram Weekly by John Rodenbeck, J-D's friend, and the other in The Literary Review by Francis King.

Few works analyze the notions of creative writing and translating as two activities done by a single author/translator. To the best of my knowledge and based on the literature I consulted that dealt with the notion of translator style, the only study that investigated those two activities performed by a writer-translator is Walder's A Timbre of Its Own: investigating style in translation and original writing. Thus, J-D's translating and creative writing activities make a valuable case study for addressing the stylistic and thematic influence of translating on creative writing and vice versa.

1.3. Research Hypotheses

The present study focuses on the translating and the creative writing activities of J-D and their impact on each other. It investigates whether the short stories J-D translated before the production of the creative writing have a thematic or

stylistic impact on his creative writing short stories and whether J-D's creative writing short stories impacted his style and selection of short stories translated after the production of creative writing. To this end, this study tests the following hypotheses:

1- The short stories J-D translated before the production of creative writing are close in theme to his creative writing short stories.
2- The short stories J-D translated after the production of creative writing are close in theme to his creative writing short stories.
3- The short stories J-D translated before the production of creative writing will display a number of stylistic characteristics similar to those visible in J-D's creative writing.
4- The stylistic characteristics of J-D's creative writing are similar to those which are visible in the short stories he translated after the production of creative writing.

1.4. Research Method

Three corpora were built to analyze the translational and the authorial style of J-D. The first corpus contains short stories translated by J-D before the production of his creative writing. The second corpus contains his creative writing short stories and the third corpus contains the short stories J-D translated after the production of his creative writing. This study only analyzes short stories translated from Arabic into English and written in English by J-D, given the fact that J-D only wrote short stories. Including only short stories helps control the corpus for genre to make sure that genre

does not interfere with the stylistic analysis in this study.

For the data analysis, this study applies three computational methods to analyze the style of J-D in his translations and writings. First, Latent Semantic Analysis (LSA) will be used as a fully automated methodology, taken from computer science, to conduct thematic similarity or relevancy analysis. The thematic analysis using LSA serves the current study in two ways: (1) it helps control the three corpora in this study for theme, and (2) it helps reveal the thematic relation between J-D's translations before and after his creative writing and his creative writing themes.

The second method makes use of corpus stylistics, a sub-field of corpus linguistics that applies corpus methods and tools to analyze style. Baker defines a corpus as "any collection of running texts, held in electronic form and analyzable automatically or semi-automatically" ("Corpora" 225). Corpus stylistics is used to analyze a set of style markers including Standardized Type-Token Ratio (STTR), average sentence length and punctuation marks.

The third method that will be used in this study is the machine learning Profile-Based (PB) approach. The PB approach, adopted from stylometry, is used to build the authorial and the translational style profiles of J-D in his translations and creative writing. Such profiles contain different lexical and stylistic variables that show the personal stylistic attributes of the text producer. Machine learning PB is used to analyze the following style markers: word n-gram, character n-gram and Part-of-Speech (POS) n-gram. In this study, the style-markers profile analysis attempts to capture the lexical, semantic and syntactic variables in the writing and the translation of J-D in the three

corpora: short stories translated before creative writing, creative writing short stories and the short stories translated after creative writing.

1.5. Significance of the Study

This study addresses translator style as one of the largely ignored topics in translation studies. Saldanha points out that most work in translation studies focuses on the style of translations, as opposed to the style of translators (27). The study of translation style is source-oriented and focuses on the reproduction of the source-text style in the translation, while the study of the style of translators is target-oriented and focuses on stylistic patterns found in different translations produced by the translator. In addition, this study empirically investigates the relation between creative writing and translation as two activities carried out by one translator in an attempt to show the extent to which translating affects creative writing and vice versa. No systematic studies have been conducted in this area, and the only existing studies remain anecdotal.

In addition, an important issue in the study of translator style is the lack of a solid methodological framework to conduct this type of analysis. Even works on translator style done using corpus methodologies by translation scholars present some methodological issues related to corpus compilation and control (e.g. Baker "Towards a Methodology" and Saldanha). Furthermore, existing definitions of style markers or style variables are problematic. To address these issues, this dissertation adopts an interdisciplinary approach combining computational methods from corpus linguistics, computer science, stylometry and authorship attribution to develop a rigorous

framework to study translator style. In addition, it proposes a style-marker profile that includes style variables which have been tested and approved by different scholars, such as Stamatatos, Fakotakis, and Kokkinakis; Kim and Walter; Zhou, Xu, and Tan; Coyotl-Morales et al.; and Zheng et al., to capture the personal style of text authors. Lastly, this study discusses best practices related to corpus compilation and control for translator style analysis purposes.

1.6. Summary of Chapters

The present dissertation includes seven chapters. Chapter one provides a general overview of the research on translator style in the field of translation studies. It also addresses the significance of the present study and lays down the research hypotheses. Chapter two provides a review of the related literature. It starts with a general overview of stylistics in literary studies; it shows the connection between stylistics in literary studies and the study of style in translation studies, and also provides an overview of the development of the study of style in the translation studies field. Chapter three describes the research methodology, data collection and analysis. Chapter four reports the results of the Latent Semantic Analysis method that reveals the thematic connection between J-D's translations and creative writing. Chapter five provides the results of the corpus stylistics analysis of Standardized Type-Token Ratio (STTR), sentence length and punctuation marks. This chapter also reports the results of the second method of analysis, which relied on the Machine-Learning Profile-Based approach to analyze word n-grams, Part-of-Speech n-grams

and character n-grams in the three corpora in this study. Chapter six discusses the results of this study with reference to the research hypotheses that motivated it. The discussion is framed within translation theories related to the relation between translated and non-translated texts and the relation between creative writing and translation in order to reflect on the patterns discovered and to show the stylistic and thematic relations between creative writing and translation in the case of J-D. It also provides a proposed framework for translator style analysis and best practices related to corpus compilation and control for translator style analysis purposes. The last chapter in this study, chapter seven, summarizes the dissertation's findings, discusses its limitations and proposes directions for future research.
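The corpus-stylistic and profile-based style markers named in Section 1.4 can be illustrated with a minimal Python sketch. It is not the procedure used in this dissertation, which relies on dedicated tools described in Chapter three. The naive tokenizer and sentence splitter, the 1,000-token chunk size for STTR, the per-1,000-token standardization of punctuation counts, and all function names are assumptions made here for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercased word tokens; a deliberately naive tokenizer (assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def sttr(tokens, chunk_size=1000):
    """Standardized type-token ratio: mean TTR over consecutive full chunks.

    chunk_size=1000 is an assumed default, not a value taken from the study.
    Trailing tokens that do not fill a whole chunk are ignored.
    """
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:
        # Text shorter than one chunk: fall back to the plain type-token ratio.
        return len(set(tokens)) / len(tokens) if tokens else 0.0
    return sum(len(set(chunk)) / len(chunk) for chunk in chunks) / len(chunks)


def mean_sentence_length(text):
    """Average number of word tokens per sentence (split on ., ! and ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def per_thousand(text, mark):
    """Occurrences of a punctuation mark per 1,000 word tokens (assumed scaling)."""
    n_tokens = len(tokenize(text))
    return 1000 * text.count(mark) / n_tokens if n_tokens else 0.0


def char_ngram_profile(text, n=3):
    """Relative frequencies of overlapping character n-grams in the text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}
```

Under these assumptions, each of the three corpora could be reduced to a small vector of marker values (STTR, mean sentence length, standardized comma, hyphen and semicolon counts) plus an n-gram profile, and the vectors and profiles compared across corpora. Word and POS n-gram profiles would follow the same pattern as `char_ngram_profile`, applied to token or tag sequences instead of characters.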

CHAPTER 2: LITERATURE REVIEW

2.1. A Brief History of Literary Stylistics

Literary style has attracted the attention of many scholars and thinkers. Ancient Roman and Greek philosophers such as Aristotle, Cicero, Demetrius and Quintilian treated style as the proper adornment of thought (Augustyn 1). They focused on the aesthetic function of literary style; the analysis of style was conducted to reveal the beauty of the literary piece. In this approach, stylistic analysis concentrates mainly on metaphors, images and symbols. This view of style was held for many centuries and shifted only in the early nineteenth century, when style came to mean the message carried by the frequency-distributions and transitional probabilities of its linguistic features (Bloch 42).

In modern stylistics, style has become a component of literary analysis revealing what a text means (Culler 906). This modern view of style can be traced back to the work of the Russian formalist Roman Jakobson, who argues that the goal of stylistic analysis is to reveal the literariness of a verbal message (Jakobson 360). Jakobson was the first to propose a framework to conduct stylistic analysis of literary works. His framework includes analyzing six text-related elements in order to reveal the function of a particular communicative act. These elements are the context, addresser, addressee, contact, code, and message of the verbal act. Jakobson indicates that these six elements constitute six functions of language including a referential, an emotive, a conative, a phatic, a metalingual and a

poetic function. Jakobson's thoughts on literary stylistics have impacted the different literary schools and scholars in Europe and in America.

Following in Jakobson's steps, Jan Mukařovský, a Czech literary and aesthetic theorist, addressed the notion of style in literature and argued that the style of a literary work is different from the style in standard language. His argument is one of the first arguments that discussed the style of non-literary works. Mukařovský explains that what differentiates literary language from standard language is the use of some patterns that go against the norms of standard language (Allen). This in turn creates what Victor Shklovsky calls a defamiliarizing effect on the reader and attracts his/her attention (12). This deviation from ordinary language forces the reader to perceive the communicative act in a different way.

Building on the ideas of Jakobson and Mukařovský, the New Criticism appeared in the United States during the early twentieth century. It focused on the close reading of texts as a way to analyze their style. New Criticism "disconnect[s] the literary text from its social and historical context" (Jancovich 200). That is, the text itself is the independent unit of meaning and meaning is inside the text. Another literary school, Practical Criticism, appeared in Britain in the 1920s. This school adopted a psychological approach to stylistics and focused on the psychological effect of the interaction between the text and the reader (Richards 7). Michael Riffaterre, a member of this school, called for a new reader-oriented theory of style. In this theory he emphasized the importance of including an analysis of the reader's response in the stylistic analysis of the act of communication (Clayton and Rothstein 24).

In turn, some stylistics scholars, such as David Crystal and Derek Davy (1969), Nils Enkvist (1973) and Michael Halliday (1978), criticized previous approaches to literary stylistics for not considering the social context in the stylistic analysis of texts. Crystal and Davy called for studying the social impact and the social context in style analysis. They were specifically concerned with the question of how a certain context or social event could restrict the stylistic choices of a writer or a speaker (Sharndama and Mohammed). Halliday also emphasized the importance of the social function of discourse. He propounded a theory of stylistics that combines the language system and the social dimension of language. Halliday argued that social structure and language are two inseparable entities in any act of communication (Bawarshi and Reiff 30). Thus, the analysis of any linguistic act of communication should consider the social structure that is embedded in and transmitted through language.

The development and combination of the social and the linguistic approaches to stylistics have paved the way for a new trend in literary stylistics known as discourse analysis. The discourse analysis approach focuses on context as an important component in the stylistic analysis of discourse. Ronald Carter and Paul Simpson argue that issues such as gender, class, sociopolitical determinations and ideology cannot be ignored in the analysis of the discourse in any communicative event, written or spoken (14). This approach argues that stylistic analysis should be concerned not simply with the micro-contexts of the effects of words across sentences or conversational turns but also with the macro-contexts of larger social

patterns (Carter and Simpson 14). The analysis of the micro- and macro-contexts and of the contextual elements of the discourse, as proposed by this approach, takes into consideration the linguistic, social, ideological and political aspects of the communicative act.

The development of the stylistic approaches to literary style has influenced the approaches to style in translation studies. This influence and the development of the stylistic approaches to style in translation are discussed in the following sections.

Approaches to Style in Translation Studies

Translation scholars who focus on literary translation have adopted approaches from literary studies to study and discuss style in translation. For instance, the notions of literariness (Jakobson) and defamiliarizing effect (Shklovsky) have influenced the work of translation scholars on stylistics and generated a text-oriented view of style. This particular approach focused on the stylistic peculiarities of the source text and how they should be manifested in the target text. Halliday's theory of stylistics, which focuses on the importance of the social function of discourse, has also shifted the focus of translational stylistics toward a functional turn that considers the stylistic function of the TT in its new context, the target culture. Carter and Simpson's discourse-analysis approach to literary style, which draws on the critical and close reading of texts to reveal the hidden intended meaning that lies beyond the physical representation of the signs in the verbal discourse, has generated a descriptive approach to translational stylistics with a focus on the translator as a

cultural agent. The reader-oriented theory of literary stylistics proposed by Michael Riffaterre has paved the way for a reader-oriented and cognitive approach to style in translation studies. This approach, as proposed by Jean Boase-Beier, argues that the style of the text is determined by the cognitive state of the reader. It also argues that the translator, as a reader, attempts to understand the mind in the ST (the attitude of the ST author) in order to reflect on this mind style and then to recreate it in the translation.

Translation scholars and theorists have followed different approaches to discuss the notion of style in translation. The study of style in translation studies has taken three main forms. The first is text-oriented. This approach focuses either on the style of the ST and its manifestation in the TT (comparative approach) or on the style of the TT and its adaptation in the target culture (functionalist approach). The second approach to translation stylistics focuses on the translator's style or presence in his/her translations, while the third approach takes a cognitive turn. This latter approach builds on cognitive linguistics to study the relation between text and mind. It views style as a cognitive entity rather than a textual one (Boase-Beier, Stylistic Approaches to Translation). The following sections review and discuss in some detail the literature pertaining to the study of style in translation studies.

Text-Oriented Approaches

As mentioned earlier, the first approaches to style in literary studies focused on the stylistic features of texts as manifested in the linguistic choices of the

author. These approaches ignore the cognitive and the ideological dimensions of texts. Similarly, text-oriented approaches to style in translation have focused on the textual attributes of the ST and/or the TT. Scholars who adopt a textual approach to translation stylistics are divided into two main groups. The first group, which adopts a comparative approach, is concerned with a comparative analysis of the ST style and the TT style. These scholars focus either on the divergence of the TT style from that of the ST (Boase-Beier, "Knowing and Not Knowing") or on how the ST style should be rendered in the TT (Nida; Vinay and Darbelnet). The second group adopts a functionalist approach, which stresses the importance of adapting the TT style to the norms and conventions of the target culture. This approach places minimal emphasis on the style of the ST (Nord; Vermeer).

Comparative Approach

Text-oriented comparative approaches to style in translation focus on the style of the ST and on the manifestation of this style in the TT. Scholars within this tradition, such as Eugene Nida and Charles Taber, and John Catford, argue that the translator should always take the style of the ST as the point of departure for creating the style of the TT. They argue that the style of the ST should be rendered as closely as possible in the TT. This view has generated a comparative approach to style in translation, which focuses on the style of the ST and on the manifestation of this style in the TT.

Vinay and Darbelnet pioneered the study of comparative stylistics in translation. In their book, Comparative Stylistics of French and English², they stress the importance of conducting a stylistic analysis of the ST to identify potential difficulties and problems and to propose solutions, so as ultimately to reach a good translation. Vinay and Darbelnet identify some stylistic difficulties that face English-French translators and classify them into categories in order to find systematic solutions to those problems. They also discuss some basic linguistic notions, such as servitude and option, and relate them to stylistic analysis in translation. They indicate that servitude belongs to the grammar system of a certain language, whereas option belongs to the domain of stylistics. They argue that "In the analysis of the SL, translators must pay particular attention to the options which constitutes the style of the ST author. In the TL, translators must pay attention to the servitudes, which limit their freedom of action" (16).

Vinay and Darbelnet also discuss two types of stylistics: internal stylistics, which seeks to isolate the means of expression of a given language by contrasting the affective with the intellectual elements, and comparative stylistics, which seeks to identify the expressive means of two languages by contrasting them (17). In their model, they indicate that translators are concerned with comparative stylistics but should not ignore internal stylistics. As for the relation between the ST and the TT style, they point out that translators must preserve the tone of the texts they translate, if possible. For Vinay and Darbelnet, the

² Their book was first published in French in 1958 then the English version came out in

focal point in the stylistic analysis is the ST, and the TT should be produced in light of the stylistic analysis of the ST.

Vinay and Darbelnet's work on translation and stylistics relies on analyzing the stylistic peculiarities of the ST to produce a stylistically equivalent TT. They adopt a purely linguistic approach that ignores some important elements in the translation process, such as the discourse itself, the target culture, the target reader and the translator. They consider the ST an independent unit and hold that the stylistic analysis of the linguistic features of texts is enough for the translator to reach a good translation.

This view of style is also discussed by Eugene Nida, the pioneer of the linguistic turn. Nida points out that translators always face the conflict between form (style) and meaning (message). If they attempt to approximately maintain the stylistic qualities of the text, translators are likely to sacrifice much of the meaning (2). As for the translation of style, he argues that the message in the receptor language should match as closely as possible the different elements [including message and style] in the source language (159). Nida discusses his view of style by distinguishing two types of equivalence. The first is dynamic equivalence, in which the original text is translated into a target language and the response of the receptor in the target culture is essentially like that of the original receptors. In this type of translation, the form of the original text [style] is changed but is still resemble that of the original (200). The second type is formal equivalence, a type of translation in which the features of the form of the source text have been mechanically reproduced in the receptor language

(201). In both types, the style and the message of the ST are constantly compared with those in the TT to determine the standards of accuracy and correctness. In his approach, Nida stresses the importance of the ST as a point of departure for producing a faithful translation, which also entails the production of the same style in the TT, if possible. His discussion of style in translation does not account for the stylistic function of the TT in the target culture or for the stylistic preferences of the target culture.

This view of style is rejected by Katharina Reiss. She was one of the first translation scholars to discuss a functionalist approach to style. Her approach is also comparative; however, it places more emphasis on the target text and culture than Nida's and Vinay and Darbelnet's approaches do. In her approach, she builds on the stylistic features of texts and on the way the style of each text type should be exhibited in translation. Reiss argues that the style of written language is determined by the function of the text. She stresses the importance of the stylistic analysis of the ST as a way to identify the communicative function of the ST in order to recreate the same function in the TT.

Reiss argues that there are three main text types³ depending on the function of the language in the text. First, in content-focused texts, the function of the language is to deliver content or a message. Second, form-focused texts, which have a unique form or style, show the peculiarities of a specific genre or the stylistic characteristics of a specific author. Third, appeal-focused texts represent the persuasive function of language. Reiss believes that equivalence is reached by paying more attention to text type and to how each text type should be translated to fulfill its function in the target language. For instance, an absurd play should function as an absurd play in the target culture. Thus, the translation process takes the function of the ST as a point of departure in creating its counterpart in the target culture.

³ The notion of text types, as discussed by Reiss, is borrowed from linguistics.

Reiss proposes a model for translation criticism based on text style. She argues that the analysis and evaluation of a translated text can serve as the first stage, but it must be followed by the second and indispensable stage of comparison with the source text (10). Reiss criticizes translation critics and reviewers for not comparing the translation to its original text. Her model also sets the rules for this comparison. She argues that in a content-focused text it is always appropriate to eliminate obvious errors and compensate for stylistic defects. In a form-focused text, on the other hand, a translator's stylistic or other faults should not be ignored (64). Reiss places more emphasis on the style of form-focused texts than on that of other text types. She argues that in this type of text, the translator will not mimic slavishly (adopt) the forms of the source language, but rather appreciate the form of the source language and be inspired by it to discover an analogous form in the target language (33).

Reiss' approach to style is different from that of Nida. She looks at the style of the ST as a source of inspiration for the translator in creating the style of the TT. From a functionalist point of view, Reiss argues that the translator should appreciate and be

inspired by the ST style to produce a TT that blends the ST style with the stylistic conventions of the target culture to reach stylistic textual equivalence.

Boase-Beier also adopts a comparative approach and discusses the notion of style in poetry. She argues that style in poetry has the power to create cognitive effects in the reader where content alone can fail to ("Knowing and Not Knowing" 34). From this point, Boase-Beier stresses the importance of understanding and reproducing the source poem's poetic effect, which is embedded in its style. In her discussion, Boase-Beier analyzes the English translation of a German poem about the Holocaust to show how the ambiguity manifested in the style of the source poem is rendered into English. Boase-Beier's discussion concentrates on the fact that style is a very important component of any poem and that translators should preserve the essential characteristics of the original in producing the translation. Her approach to style is comparative; however, Boase-Beier places more emphasis on the translator as the reader and the creator of the TT (compared to Nida, Vinay and Darbelnet, and Reiss). Boase-Beier points out that the style of the ST carries clues to the author's intention and devices that have a particular effect on the reader. That is, the ST should be the main focus of the translator's efforts if the translator wants to create the same poetic effect as the ST.

Like Vinay and Darbelnet, Boase-Beier argues that the stylistic analysis of the source poem/text is the first step towards understanding it. She also agrees with Reiss that the translator should seek to create the same effect as the ST in the translation. Boase-Beier points out that creating a poetic effect that is close or similar to that of the source poem can be achieved by revealing the author's intention,

which has been a very problematic notion in literature.

The source-oriented approach to style in translation has placed more emphasis on the ST as the main reference for the translation process than on the target text and culture, which may have a considerable effect on the style of the TT. The functionalist turn in translation studies, as proposed by Nord and Vermeer among others, has shifted this view by placing more emphasis on the function or the purpose of the TT in the target culture. This shift has generated a target-oriented approach to the study of style. This approach argues that the function or the purpose of the TT determines the style of the translation. It rejects any stylistic or textual constraints imposed by the ST.

Target-Oriented Approach

In the functionalist approach to translation studies, as represented by Christiane Nord and Hans Vermeer, among others, the translation process is viewed as always carried out with a function and a purpose (skopos) in mind. That is, the function of the TT, whether literary or non-literary, and the skopos⁴ of the translation are the components that govern the production of any translation. This view has generated a target-oriented functional view of translation that values the function or the purpose of the translation and refuses to treat the ST as the reference for the translation process. In this regard, Christina Schäffner argues that the starting point for any

⁴ It means purpose, aim or goal. It is derived from the Greek word skopós.

translation is therefore not the (linguistic surface structure of the) source text (ST), but the purpose of the target text. The Skopos of the ST and the Skopos of the TT can be either identical or different (133). This target-oriented view of translation has radically changed the long-held view of translation as a process of linguistic matching that seeks equivalence and faithfulness.

Vermeer discusses skopos theory and its application in translation theory. He argues that translation is a type of human action. This action is an intentional, purposeful behavior that takes place in a given situation; it is part of the situation at the same time as it modifies the situation (qtd. in Nord 11). Vermeer also argues that the translator is the expert in the translation project and is the one who is responsible for the translation. This also indicates that the translator, as an expert in the ST and the TT contexts, is the one who chooses the best way and the best strategy to fulfill the translation skopos, which also determines the degree of the TT quality (Nord 28). Unlike the previous linguistic approaches that draw on the notion of equivalence, where the ST is the reference for the relationship between the ST and the TT, the functionalist approach argues that the quality and quantity of [the ST and the TT] relationship are specified by the translation skopos (Nord 28).

According to Nord, the skopos of the translation can be embedded in the translation brief. She argues that adequacy refers to the qualities of a target text with regard to its translation brief (35). As discussed earlier, Reiss's functional view of style addresses the style of the text as a whole. Skopos theory, on the other hand, argues that style is applicable not only to entire texts but also to text segments (Nord 33).

Target-oriented functionalist scholars link style to the function or the purpose of the TT. This has shifted the view of style from an ST component that should be taken into consideration in creating a TT to a component of the TT that needs to adapt to the skopos or the function of the TT. Thus, the translation brief, skopos or function determines the style of the TT. However, Nord's and Vermeer's arguments for a functional target-oriented view of translation have ignored the human agent in the translation process, the translator. Translation scholars such as Harish Trivedi and Susan Bassnett, Venuti, and Hermans, among others, have called for paying more attention to the presence of translators in their translations. This view has influenced the study of style by paving the way for translator style to emerge as a major current in translation stylistics.

Translator-Oriented Approaches

The text-oriented approaches to style have placed more emphasis on texts and have ignored the human agent of the translation process, the translator. Recently, more attention has been paid to translators and their presence and voice in their translations. One of the first works to discuss translators in their translations is The Translator's Invisibility (1995) by Venuti. His work underlines the importance of the translator's presence in translated texts. Venuti challenges the previous functionalist approach to translation, which stresses the importance of making the target text read fluently as an original text and not as a translation. He indicates that the more fluent the translation, the more invisible the translator, and, presumably, the more visible the

writer or meaning of the foreign text (1). He argues that this act of making the TT transparent, through domestication⁵, is an act of violence, which stems from the fact that a transparent translation reconstructs the foreign text in accordance with the value and the beliefs that preexists in the target language (18). Venuti argues against domestication and calls for adopting foreignization⁶ as a translation strategy to restrain the ethnocentric violence of translation. He stresses the importance of keeping the style of the ST in translation even if the style of the ST seems unfamiliar in the target culture. He refuses any stylistic adaptation that could lead to a fluent TT and result in a transparent translation in which the TT does not read as a translation.

Theo Hermans also discusses the presence of translators in their translations. He indicates that the translator's voice is present in every translation he/she produces. Venuti's and Hermans' notions of translator voice and presence can be embedded in the translator's style in the TT.

Baker took Hermans' and Venuti's arguments a step further and empirically investigated the translator's voice and presence in the target text. Her work took the TT as a point of departure to trace the translator's style across his/her different translations, which she refers to as the translator's fingerprint (Baker, "Towards a Methodology"). In her study, she used corpus stylistics, which is a method that makes use of the

⁵ Domestication is a translation strategy that involves the translation and the adaptation of the ST to the values and literary system of the domestic culture, which results in a fluent, original-like translated text. Adopting this strategy may result in a loss of the stylistic or linguistic peculiarities of the ST.
⁶ A translation strategy that involves keeping the foreign stylistic and linguistic peculiarities of the ST in the TT as a way to break the conventions of the target culture and produce a text that sounds foreign.

computer for extracting stylistic patterns from a large corpus of digital texts. Baker built two corpora to analyze the style of two translators into English, Peter Bush and Peter Clark. Peter Bush's corpus contains a translation of one work of fiction from Portuguese, Turbulence (1994) by Chico Buarque, and translations of two works of fiction from Spanish, Quarantine (1994) by Juan Goytisolo and Strawberry and Chocolate (1995) by Senel Paz. It also contains translations of two Spanish autobiographies, Forbidden Territory (1989) and Realms of Strife (1990), written by one author, Juan Goytisolo. Peter Clark's corpus contains translations of one collection of Arabic tales, Dubai Tales (1991) by Muhammad al Murr, and of a collection of tales, Grandfather's Tale (1999), and a novel, Sabriya (1997), both written by the same author, Ulfat Idilbi. For the corpus analysis, Baker looked for patterns related to Standardized Type/Token Ratio (STTR)⁷, average sentence length, and the frequency of the word SAY in all its forms (say, says, said, saying).

Baker's model is an innovation not only in its approach to style but also in its method of analysis. Baker's study, in fact, has opened the way for further empirical studies that can address the limitations of her study. For instance, the two corpora in Baker's study were not homogeneous; they included genres ranging from novels and tales to autobiographies and short stories. One could argue that translator style may vary from one genre to another and that this, in turn, affects style analysis. In

⁷ Type/token ratio is a measure of the range and diversity of vocabulary used by a writer, or in a given corpus. It is the ratio of different words to the overall number of words in a text or collection of texts (M. Baker, "Towards a Methodology for Investigating the Style of a Literary Translator"). STTR is calculated for the first 1000 running words (tokens), and then calculated a fresh TTR for the next

addition, a given stylistic pattern in the TT may reflect the conventions of a certain genre and is not necessarily related to the translator's style. Another issue arising from Baker's corpora is the use of two works written by the same author. In this case, there is a higher chance of finding patterns in the translated texts that are related to the ST author's style and not to the translator's.

Baker's use of STTR as a style marker poses a methodological question. STTR captures the degree of an author's lexical creativity; in translation, it also shows whether the translator has a rich or a limited lexical capacity. However, to conduct an STTR analysis with a higher level of accuracy, the corpus should be controlled for theme and genre. Some themes may offer a limited number of lexical choices compared to others, which in turn limits the lexical creativity of the text's author. In this regard, George Mikros and Eleni Argiri empirically investigated the influence of theme on the analysis of an author's style, using STTR and other style variables, and reported that text theme has a considerable influence on the analysis of the author's style. Genre also affects STTR analysis; for instance, biographies would differ from tales not only in lexical density but also in style. This proves true for tales in Arabic, which are traditional oral stories relying on repetition to build the story line (Mohamed and Omer). Any STTR analysis of tales would show this characteristic, and any comparison between the STTR of tales and the STTR of other genres, such as novels and short stories, would show such a distinction in lexical richness. Baker's corpus also included English translations of texts from different languages, including Arabic, Portuguese, and Spanish. This may also affect STTR analysis or what is called lexical density.
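The sensitivity of these measures to repetition and text length can be illustrated with a minimal sketch. It is not the tool used in the studies discussed here. It assumes a deliberately simple regular-expression tokenizer, consecutive chunks of 1000 running words as in the footnote on STTR, and, as in the usual definition, an STTR equal to the mean of the per-chunk TTR values with any trailing partial chunk discarded. The function names `tokenize`, `ttr` and `sttr` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return word tokens (a simple, assumed rule)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def ttr(tokens):
    """Type/token ratio: number of distinct word forms / number of running words."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty token list")
    return len(set(tokens)) / len(tokens)


def sttr(tokens, chunk_size=1000):
    """Standardized TTR: mean TTR over consecutive full chunks of chunk_size tokens.

    Assumption: a trailing chunk shorter than chunk_size is discarded.
    """
    n_full = len(tokens) // chunk_size
    if n_full == 0:
        raise ValueError("text is shorter than one chunk")
    ratios = [
        ttr(tokens[i * chunk_size:(i + 1) * chunk_size])
        for i in range(n_full)
    ]
    return sum(ratios) / n_full


if __name__ == "__main__":
    # Toy input only; a real analysis would read a full translation from disk.
    sample = "Once there was a king, and the king had a son, and the son had a horse."
    words = tokenize(sample)
    print(len(words), "tokens;", len(set(words)), "types; TTR =", round(ttr(words), 3))
    # A small chunk size is used here only because the toy text is short.
    print("STTR (chunks of 6 tokens) =", round(sttr(words, chunk_size=6), 3))
```

Because plain TTR falls as a text grows longer, only the chunked measure is comparable across texts of different lengths, and a genre built on repetition will still score lower on it, which is the confound raised above.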

Some languages, such as Arabic, value repetition as a matter of stylistic elegance, and this might be reflected in the target text.

Another example of a study of translator style that uses corpus analysis is Diva De Camargo's. In her study, she analyzed translator style in an attempt to find the extent to which the style of the ST author is reflected in the style of the translator and whether the target text shows distinctive, recurring and preferred marks of the linguistic behavior of that translator. De Camargo analyzed one Portuguese literary work, Tocaia Grande: a face obscura (1984) by Jorge Amado (the original text (OT)), and its translation into English, Showdown (1988), translated by Gregory Rabassa (the TT subcorpus). She used corpus stylistics and analyzed the number of tokens and types, the type/token ratio (TTR) and STTR. De Camargo also used control corpora: the British National Corpus (BNC)⁸, the BNC fiction corpus (BNC fn) and the Banco do Português (BP)⁹.

She conducted her experiment in four steps. First, using the WordSmith tool (Scott 1998), she retrieved statistics related to linguistic pattern distribution in terms of TTR and STTR in both texts, TT and OT. Second, she conducted TT/OT comparisons by tokens (frequency of words) and types (word forms). Third, she compared the TT's TTR and STTR to those of the British National Corpus (BNC). Finally, she compared the OT's TTR and STTR with the TTR and STTR scores of the Banco do Português (BP). De Camargo's results show that the English translation of Tocaia Grande,

⁸ A 100-million-word corpus of written and spoken English texts taken from different domains.
⁹ A 240-million-word corpus of written and spoken Portuguese texts from different domains.

Showdown, registers a lower number of tokens and types in relation to its original text. She also found that the translator has a lower TTR and STTR compared to the original author. As a second stage of analysis, she compared the translator's language use (in his translation of Tocaia Grande, a corpus containing 141,608 words) with the language use in the BNC, a 90,748,880-word corpus containing general English texts originally written in English. She also compared the TT to the BNC fn corpus, which contains 19,444,150 words of fiction from different genres written between [...] in English. She found that the translator shows a richer and more varied language use, with a higher TTR and a higher STTR, compared to the BNC and the BNC fn. De Camargo also compared the TTR scores of the OT (with 159,440 words) and the BP (with 230,460,560 words). Her analysis shows that the OT has a higher TTR than the BP. The same applies to the STTR analysis. De Camargo concludes that since the TTR and STTR analyses show that Rabassa has a lower score than Amado, this can be an indicator of the translator's divergence from the OT. She also concluded that Rabassa presents a much more diversified use of linguistic patterns and much less vocabulary repetition than what is found in the variety of text types represented in the BNC and in the BNC fn.

It is clear from the above discussion that the use of corpus methodology in De Camargo's study reveals interesting statistical information about the textual and the personal attributes of both the TT translator and the OT author. The use of STTR and TTR to trace the authorial style of Amado in Rabassa's translation shows that the two texts have different scores, which, as De Camargo argues, indicates the difference between

the two styles. However, translation is not a process of linguistic matching of words. Thus, style divergence can never be measured simply by means of statistical linguistic analysis and comparison of the number of words (tokens) and the number of forms (types) in the ST and the TT. In other words, the comparison of the number of types and tokens in the ST and the TT does not reveal the extent to which the style of the TT diverges from the style of the ST. In addition, only a word-for-word¹⁰ translation of the OT would produce a close or similar STTR and TTR. In this regard, Baker questions the argument that translators should reproduce the style of the ST in their translations. She indicates that "it is as impossible to produce a stretch of language in a totally impersonal way as it is to handle an object without leaving one's fingerprints on it" (244). Thus, comparing the STTR and TTR scores of the OT and the TT can neither reveal the extent to which the translator reproduces the style of the OT author nor show the translator's distinctive manner of translating.

De Camargo's use of the control corpora, BNC, BNC fn and BP, is also questionable. First, she compared the STTR and TTR of Rabassa's translation (translational English) to those of the BNC and the BNC fn, which include original English texts. One could argue that comparing the STTR and TTR of a translated text to those of non-translated texts, the BNC and BNC fn in this case, is not a useful method of analysis since the two languages have different styles and peculiarities. In addition, the BNC contains different English text types from different

¹⁰ A translation strategy that involves a literal translation of texts.

domains, and this could affect the STTR and the TTR analysis. Some genres draw on a more limited vocabulary than others. De Camargo also makes use of the BP corpus, which contains original spoken and written Portuguese texts from different domains. She compared the STTR and TTR of Amado's OT to the STTR and TTR of the BP and reported that Amado scores a higher STTR and TTR than the BP. This analysis also presents a methodological issue, because comparing the STTR and TTR of a work of fiction (Amado's OT) with those of a great number of written and spoken texts from different domains (the BP corpus) may not yield accurate results. Genre or text type and the size of the corpus do affect STTR and TTR analysis.

Marion Winters also used corpus-based methodology and studied translator style by comparing two German translations, by Hans-Christian Oeser and Renate Orth-Guttmann, of the novel The Beautiful and Damned (1998) by Francis Scott Fitzgerald. The researcher looked for patterns in the use of modal particles¹¹ by the translators. Winters argued that modal particles reveal the micro-level of the translators' linguistic choices. She relied on two methods of analysis to trace the use of modal particles in the two translations. First, she used the keyword list functionality to retrieve the eight most frequent modal particles in the two translations. Then, using the concordance¹² search function, Winters retrieved concordance lines of those eight

¹¹ In German, modal particles are words used in spoken language and in colloquial registers; these words show the attitude of the speaker or narrator.
¹² Concordance search functionality shows key words in their immediate context in the corpus.

modal particles in the two translations in order to explore the translators' individual styles in using them. She then traced the effect of these micro-level linguistic choices on the macro-level of the novel. To do so, she referred to the ST by running a bilingual concordance search on the two German translations and the ST.

Winters found that each of the two translators has an individual fingerprint when using modal particles. The difference between the two translators lies in the frequency and in the usage of the modal particles. Using the concordance function, she analyzed the instances of the eight modal particles in the two translations and in the ST. She reported that in some instances the two translators use the same modal particle for the same source-text sentence, which she argues is an effect of the ST. In most cases, however, the two translators do not use a modal particle for the same source-text sentence (82). Winters pointed out that the two translators use modal particles differently and that this reveals possible differences in their styles.

Winters analyzed the effect of the translators' micro-level linguistic style in using modal particles on the macro-level of the novel, and she pointed out that Oeser's translation is source-text oriented and takes the reader to the ST and its culture. Orth-Guttmann's translation, on the other hand, creates a more casual or colloquial tone. It moves the source text and the author's world closer to the reader, while Oeser expects the reader to move to the source culture/text (93). Winters concluded that micro-level linguistic differences in the translators' styles affect the macro-level of the novel.

As shown, Winters analyzed the style of the translators in two stages. The first

stage traces the individual style of each translator by looking for linguistic patterns in the two translations and then comparing the two translations to identify differences in style. This stage involves the target texts only. The second stage consists of a bilingual comparison of the ST and the two translations. Winters carries out this ST-TT analysis for two reasons: first, to see whether the ST has any effect on the TT and, second, to see the effect of the translator's style on the overall effect of the novel.

Unlike Baker, who sees translator style as a recurring pattern of a translator's linguistic choices that can be traced across his/her different translations of the same author or of different authors, Winters sees translator style as a divergence from the ST. Winters also discusses the effect of translator style on the overall translation. In other words, she looks at why certain ST components are translated the way they are and how this translation affects the overall meaning, taking into account the relation between the ST and the TT. This approach takes translator style analysis a step further by revealing the impact of the translator's individual style on the overall translation of a certain text. This model can be useful for conducting a comparative analysis of a number of translations of one source text to reveal the individual translator style in each translation. However, this approach might not be useful for analyzing the individual styles of different translators in different translations of different texts.

Following Baker, Gabriela Saldanha conducted an interesting study in an attempt to develop a methodology and to propose a working definition for translator style. Saldanha makes use of corpus methodology and focuses her

approach on looking for consistent patterns in various works translated by the same translator as a way to reveal translator style. To do so, she built three corpora: the Corpus of Translations by Peter Bush (CTPB), which includes four translations from Spanish (Forbidden Territory: The Memoirs of Juan Goytisolo (1989), Tonight (1991), The Wolf, the Woods, The New Man (1995) and The Old Man Who Read Love Stories (1993)) and one from Portuguese (Turbulence (1992)); the Corpus of Translations by Margaret Jull Costa (CTMJC), which includes three translations from Spanish (Adventures of the Ingenious Alfanhui (2000), Bedside Manner (1995) and Spring Sonata (1997)) and two from Portuguese (The Mandarin (1993) and Lúcio's Confession (1993)); and a third corpus of translated works (COMPARA), used as a reference corpus. In her analysis she searched for the use of emphatic italics [13] and of foreign words in the TT, that is, lexical borrowing. She also examined the translators' use of the connective that after the reporting verbs say and tell as a distinctive feature of translator style.

Saldanha's study exhibits the limitations noted earlier. The first is related to corpus compilation. Her corpora contain different genres and this, as explained earlier, could affect the translator style analysis. The second limitation is related to corpus control. The CTPB corpus contains translations of works published from 1980 onwards, whereas the other corpus, CTMJC, contains translations of works published in [...]. In other words, the first corpus covers works written in a ten-year

[13] Italicizing a word, a sentence or a certain part of the translated text to indicate that this part is emphasized.

time period and the other one covers works written over 113 years. This could also affect translator style analysis. Some of the patterns that may distinguish the CTMJC corpus from the CTPB could be related to the writing style or conventions of a historical period.

Saldanha also examined the use of the connective that after the reporting verbs say and tell. She reported that Bush shows an overall preference for the zero connective, while Jull Costa shows a preference for the optional that after say and tell. One could argue that the use of the connective that after the reporting verbs say and tell might not be a style marker. This analysis could reveal distinctive and consistent patterns of choice (Saldanha 28), but it cannot be used to distinguish one translator's style from that of others, simply because different translators may use that after say and tell in a similar manner.

In an interesting study on the translator's "fingerprint" or style, Qing Wang and Defeng Li added the notion of the translator's authorial style to the discussion. They analyzed not only the style of the translator in his/her translation, but also his/her style in his/her creative writing. In their project, they adopted a corpus-based approach to trace the translator's fingerprint in two translations of Ulysses into Chinese, by Qian Xiao (1994) and Di Jin (1997). The researchers built a Bilingual Corpus of Ulysses (BCU) consisting of the two translations as well as the original text. They also built a comparable subcorpus, a reference corpus (RC), which includes Xiao's original writings in Chinese: short stories and novels. The researchers also used this corpus to see whether there were any instances where the authorial and the translational

style of Xiao intersect. The researchers analyzed lexical idiosyncrasy, that is, the individualized habitual use of words. They argued that lexical idiosyncrasy is a style marker that reveals the translator's style or his/her unique manner of translating. To carry out this lexical analysis, they compared the keyword lists of the two translations of Ulysses. The researchers analyzed high- and low-frequency words in the two translations and in the reference corpus (RC). The analysis revealed that Xiao, the literary writer and translator, leaves some traces of lexical idiosyncrasy in his translation. The researchers reported that Xiao tries to make his translation more emotional and as colloquial as possible, which is reflected in his lexical choices. Jin, on the other hand, chooses to use standard Mandarin to make the translation look more neutral and impersonal (85). The researchers also found that Xiao's habitual wording style in creative writing has an influence on his translation. They argued that the reason might be that the translator consciously or subconsciously reverts to his own language habits and shows a tendency to use preferred expressions over other alternatives (89).

Wang and Li also used the syntactic sequence of sentences (the positions of clauses) as a style marker to trace the individual style of the two translators in their translations. The authors reported that there is a similarity in post-positioned adverbial clauses in the two translations, which, after referring to the ST, they identified as a reflection of the ST style. As for post-positioned adverbial clauses in Xiao's creative writing, the analysis showed that this stylistic feature is more common in Xiao's translation. This feature, they argued, distinguishes the translated text from non-translated original writing. Wang and Li's analysis thus indicated that tracing the style of translators by comparing two translations of one source text would reveal each translator's distinctive manner of translating. In addition, comparing two translations of one ST to their original would help reveal any instances where the style of the translators is affected by the style of the ST.

Wang and Li's study is one of the first to tackle the notion of the authorial and the translational style of one translator. However, their analysis of Xiao's translational and authorial style is very limited. They analyzed only one translation produced by Xiao (Ulysses) and compared it to his creative writing, a novel and twenty-three short stories. Wang and Li's conclusion that the authorial wording style has an influence on the translational wording style needs further investigation. Using more than one method of analysis to study the translational and the authorial style of a particular translator would help reach more solid conclusions, which is what the current study attempts to do by making use of methods used in stylometry and authorship attribution studies.

Like Winters and Wang and Li, Iraklis Pantopoulos analyzed the style of two translators in producing translations of one source text. In his study, Pantopoulos introduced new style markers for the analysis of translator style, including function words and contracted forms. He analyzed translator style in two translations of C.P. Cavafy's The Canon, by Rae Dalven (1961) and by Edmund Keeley and Philip Sherrard (1992). He built a corpus of the two translations aligned to their source text. He traced stylistic patterns

used by the translators by retrieving the Type/Token Ratio (TTR), the Standardized Type/Token Ratio (STTR), the number of types and tokens, and the number of lexical words and function words in the two translations. In addition, he traced the translators' use of contractions, or contracted forms, in the two translations. Pantopoulos reported that Dalven uses more tokens and more types than Keeley and Sherrard. He also found that both translations exhibit a very similar overall TTR, while Keeley and Sherrard show a higher STTR. As for the analysis of lexical (open-class) and function (closed-class) words, the researcher reported that Dalven uses a greater variety of lexical and function words than Keeley and Sherrard. The analysis of contractions in the two translations revealed that Dalven uses fewer apostrophes in the contracted forms of words in the translation of Cavafy's The Canon compared to Keeley and Sherrard in their translations of the same poems (100).

Pantopoulos also carried out a qualitative analysis by comparing the way the translators rendered different terms in the poems and how different translations could lead to different meanings. At this stage, he queried the parallel corpus using the ParaConc [14] tool, which makes it possible to search for terms in a parallel corpus and retrieve concordance lines of the source aligned to the target text(s). Pantopoulos found that the two translators have translated some terms differently, which has an impact on the meaning of some parts of the TT. He also compared the style of the ST to the

[14] ParaConc is a bilingual or multilingual concordancer that can be used in contrastive analyses, language learning, and translation studies/training.

style of the two TTs. He reported that Dalven's version is closer to the structure of the ST in both form and effect (102). Keeley and Sherrard, on the other hand, appear to follow the structure of the ST more loosely. Pantopoulos discussed translator style as the unique recurring patterns of a translator's linguistic choices compared to those of other translator(s) of the same ST, and this provides more pertinent analyses than Baker's and Saldanha's.

Other scholars who discuss translator style in both creative writing and translations include Maeve Olohan. She suggested a model to find the extent to which translated texts display the translator's linguistic habits by analyzing texts written by the translators that are not translations (150). Following Olohan, Claudia Walder discussed the style of the translator in his/her translations of different texts and his/her style in creative writing. She tried to see whether there are stylistic similarities and/or differences in the two types of texts produced by the same translator. Walder used corpus methodology and built three sub-corpora to study the style of Donal McLaughlin, an English writer and a German-English translator, in his translations and in his creative writing. The first sub-corpus contains McLaughlin's translations from German into English of 52 texts by 47 source-text authors and consists of 29,672 words. The second comprises McLaughlin's original writing from his collection An Allergic Reaction to National Anthems and Other Stories; it contains 20 short stories (61,028 words). Walder divides this corpus into two sub-corpora of approximately similar length: (ow1), which contains 31,[...] words, and (ow2), which contains 29,297 words. Walder also built a reference corpus

consisting of 15 different German texts translated into English by 19 different translators. For the data analysis, she used the standardized type-token ratio [15] (STTR) and mean sentence length, where sentence length is the number of words in a sentence. She defines a sentence boundary as a sequence of characters followed by a space and an initial capital letter (59). She also looked at the use of dashes and at language variation in McLaughlin's translations and creative writing. The researcher reported that the STTR of McLaughlin's translations is close to that of his creative writing. As for mean sentence length, Walder found that sentences tend to be longer in McLaughlin's translations than in his creative writing. She also reported that McLaughlin uses more dashes in his creative writing. Walder also traced the occurrence of dashes per 1,000 words and found that the scores of McLaughlin's original and translated texts are closer to each other than to that of the control corpus (62). As for language variation, Walder reported that McLaughlin uses borrowings from the languages used in Switzerland (German, Rumantsch, and Italian) in his translations; this feature is also found in his original writings (63). She identified some instances where untranslated German (and French) expressions and sentences were used in McLaughlin's creative writing, and she concluded that the translational and authorial styles of the translator influence each other.

Walder's study is the most granular, but some of her parameters may not be

[15] STTR is usually calculated for every 1000 words. In this study, the researcher lowers STTR parameter from 1000 words to 200 and 500 because the average length of the short stories in the translation sub-corpus is

useful when it comes to the analysis of translator style. For instance, the sentence length analysis may fail because defining a sentence boundary as a space followed by an initial capital letter would treat "Mr." in "Mr. Sam is happy." as a sentence. This would distort the mean sentence length. It is worth mentioning that titles such as Mr. and Mrs. are common in short stories and in literature generally.

In addition to these text-oriented and translator-oriented approaches to style, the emergence of cognitive approaches to translation studies has led to the inclusion of cognitive processes in the study of style. In this approach, style is considered a cognitive entity that is constructed in the mind of the ST author and in the mind of the TT creator. That is, style is a representation of the writer's mind. Consequently, the translator's task is to capture evidence of the mind in the ST and try to recreate it in the TT. The following section discusses cognitive approaches to the study of style.

Cognitive-Oriented Approach

The cognitive approach to style, also called cognitive stylistics, has also led to recent changes in the study of style. This new approach builds on the concepts of cognitive linguistics and emphasizes the relation between text and mind. It looks at the style of the source text as a vehicle that conveys and creates a cognitive state or a mind style. It argues that the role of the translator is to capture and convey this cognitive state in the TT; otherwise the target text will have less effect on its readers'

minds (Ghazala 77).

Boase-Beier's book Stylistic Approaches to Translation is one of the first works to discuss cognitive stylistics in translation studies. She views style as a cognitive entity rather than as a purely textual one. Boase-Beier indicates that style is determined by the cognitive state of the reader, which is shaped by his/her historical, ideological and cultural background. She argues that a cognitive stylistic approach views translators as readers, sees style as a reflection of mind, and tries to grasp and recreate that mind in the TT. Boase-Beier points out that the mind in the text represents a cognitive state, which may have two aspects. First, it is influenced by ideology, it takes a particular attitude, or it embodies a particular feeling. Second, it carries an attitude conveyed by the style (79). She argues that style conveys this attitude in the text. Thus, it is very important that translators understand the attitude of mind that appears in the ST. If this attitude is lost or misunderstood, the translation is affected. Boase-Beier argues that the translator should first recognize the cognitive state of the ST in order to recreate it in the TT. She points out that if the translation fails to capture such a cognitive state, the target text will have less effect on its readers' minds (77). Boase-Beier also indicates that knowledge of style helps translators understand how style works and helps them interpret the stylistic features of the source text. She points out that stylistically aware translation, which begins with a stylistically aware reading of the ST, can make a more reasonable case for its interpretation of the source text than any other sort of translation can (111). She also argues that the TT reader reacts to choices made by the translator, which reflect the translator's cognitive state at the

moment of translating. This makes reading a translation different from reading a piece of creative writing because translation involves a reflection of two mind states: the original author's and its reflection by the translator in the TT.

Cognitive stylistics is also discussed by Hasan Ghazala, who applied it to the translation of metaphors from English into Arabic. Ghazala sets aside the traditional view of translating metaphors in terms of creating an equivalent to the ST metaphor in the TT. He argues that metaphor should be understood as a cognitive process that conceptualizes people's minds and thoughts linguistically in similar or different ways across languages. That is, he treats metaphors as a conceptual feature of texts that has two domains: the target domain (the concept to be described by the metaphor) and the source domain (the concept drawn upon, or used to create the metaphorical construction) (60). Ghazala argues that all metaphors are reflections and constructions of concepts, attitudes, mentalities and ideologies on the part of the writer/speaker (57). He adds that speakers or writers do not use metaphors only for aesthetic purposes; they use them as a vehicle for ideological and cultural concepts, meanings and perceptions of the world. On this basis, Ghazala calls for conceptualizing ST metaphors in their cultural, political, ideological, social and mental environment. Doing so helps translators understand and respond to the ST metaphors in their translations.

Ghazala's cognitive approach to style does not touch upon translation as a mental process. Rather, it takes the conceptualization of certain textual elements as a key component of discourse comprehension, which is a very important step in

creating a TT. Both Boase-Beier and Ghazala argue that the mind of the ST author is what the translator seeks to render and recreate in the TT. The mind of the ST author is embedded in the style of the text. That is, the translator should try to reveal this mind style and convey it in the translation. The cognitive approach to style in translation, as discussed by Boase-Beier and Ghazala, has not yet been well developed. This approach does not rely on any empirical methods, such as keystroke logging [16] or eye tracking [17], in studying and discussing style in translation. The future of stylistics in translation studies may rely heavily on the empirical study of translators' style using cognitive approaches.

Conclusion

The above discussion of the related literature shows that the development of stylistic approaches to translation has been affected by the approaches to the study of style in literary studies as well as by the different turns in translation studies. The first approaches to style in literary studies focused on texts and placed minimal emphasis on the ideological and cognitive dimensions of style in a text. This approach generated a text-oriented comparative view of style in translation studies, placing more emphasis on the stylistic peculiarities of the ST and on their manifestation in the TT. Scholars within the linguistic turn in translation studies

[16] Software that records the keys struck on a keyboard by a user while typing.
[17] The activity of recording the eye movements and fixations of a computer user while he/she executes a specific task on a computer.

adopted this approach and discussed style in translation as a textual component that does not go beyond the linguistic signs in the text. This view reflects the main approach of the New Criticism school, for which texts are self-referential and self-contained and meaning is always inside the text. Halliday's theory of stylistics, which focuses on the importance of the social function of discourse, has also shifted the focus of translational stylistics toward a functional turn that considers the stylistic function of the TT in its new context (i.e., the target culture). This view of style is also text-oriented and does not account for the ideological or cognitive components of style; however, it takes the TT culture and its norms as a point of departure in the stylistic analysis of translated texts.

The discourse-analysis approach to literary style, as discussed by Carter and Simpson, draws on the critical and close reading of texts to reveal hidden intended meaning that lies beyond the physical representation of the signs in the verbal discourse. This approach has generated a descriptive method in translational stylistics with a focus on the translator's voice and presence in the TT. The focus on translators as cultural agents has paved the way for translator-oriented approaches to style in translation to emerge as a new turn in translational stylistics. In the same manner, the reader-oriented theory of literary stylistics, as proposed by Michael Riffaterre, as well as the cognitive turn in translation studies, have played a key role in the emergence of the cognitive approach to style in translation studies.

As this literature review makes clear, translation scholars have approached the topic differently. Some scholars analyzed translator style by referring to his/her

translations of different authors (Saldanha, Baker "Towards a Methodology"), while others traced translator style by comparing the changes he/she makes to the target text and how these changes could affect the macro-level of the TT (Pantopoulos, Winters). Another group of scholars compared different translations of one source text as a way to reveal translator style (Pantopoulos, De Camargo), while a more recent group discussed translator style as a fingerprint that is traceable in both translation and creative writing (Walder; Wang and Li).

Most of the previous works on translator style have used corpus methodology to carry out the analysis, but their use of corpus methodology presents some limitations related to study design, corpus compilation, control and analysis. As we have seen, some researchers analyzed translator style in his/her translations of different genres and ignored the fact that a number of the stylistic patterns might reflect a certain genre [18]. Other researchers studied a translator's style in his/her translations from different languages. As discussed earlier, including texts from different languages affects style analysis because some stylistic patterns could be related to the stylistic or linguistic constraints of a particular source language. In addition, the design of some of the previous studies is problematic. For instance, comparing translated to non-translated text is one such issue. Comparing a corpus of written and spoken text to a translational corpus can also be considered a flaw in study design.

The gap between what corpus tools can do and what to consider as a style

[18] See the above discussion of Baker's use of tales, autobiographies and novels in one corpus to analyze translator style.

marker is one of the main problems in several of the previous studies. Building on methods used in previous studies without questioning their viability for translator style analysis is another. The next chapter proposes a methodology and a style-marker profile for analyzing translator style.

CHAPTER 3: METHODOLOGY

3.1. Introduction

This chapter discusses the methods used in this study. It also discusses the data collection process and corpus compilation and control. As mentioned earlier, the study's principal purpose is to investigate the thematic and stylistic relation between Denys Johnson-Davies's (J-D) creative writing and his translations in an attempt to address existing limitations in current style-related research in translation studies. To do so, the study relies on three corpora of short stories produced by him: 1- a corpus of short stories translated by J-D before the production of his own writing; 2- a corpus of J-D's own short stories; and 3- a corpus of short stories J-D translated after the production of his own writing.

This study uses three computational methods to reveal the thematic and stylistic relation between J-D's creative writings and translations. First, Latent Semantic Analysis (LSA) is used as a fully automated method to conduct a thematic similarity analysis of J-D's creative writing and translations. This study makes use of the LSA similarity query, an information retrieval technique that is applied to Natural Language (NL) understanding problems. This technique relies on the latent semantic relation between the concepts and terms in a document to determine the similarity between different documents in a corpus and to classify them accordingly. This particular method is used to reveal the extent to which the themes in the translated short stories are similar to those in J-D's creative writing.
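The idea behind a similarity query can be illustrated with a small sketch. It is not the study's implementation: it assumes NumPy, a toy corpus of four invented "stories", raw term counts without weighting, and two latent dimensions. The function names (build_term_document_matrix, lsa_document_vectors, cosine_similarities) are hypothetical and chosen only for this illustration.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Count how often each term occurs in each document (terms x documents)."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    row = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.lower().split():
            matrix[row[word], j] += 1.0
    return matrix


def lsa_document_vectors(matrix, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Columns of S_k * Vt_k are the documents; transpose to one row per document.
    return (np.diag(s[:k]) @ vt[:k]).T


def cosine_similarities(doc_vectors, query_index):
    """Cosine similarity between one document and every document in the corpus."""
    norms = np.linalg.norm(doc_vectors, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    unit = doc_vectors / norms[:, None]
    return unit @ unit[query_index]


# Toy data (assumption): document 0 stands in for a creative-writing story,
# documents 1-3 for translated stories.
docs = [
    "sea boat fisherman net sea",
    "boat fisherman harbour net",
    "market merchant coin bread",
    "merchant coin market debt",
]
vectors = lsa_document_vectors(build_term_document_matrix(docs), k=2)
sims = cosine_similarities(vectors, query_index=0)
for index, score in enumerate(sims):
    print(index, round(float(score), 3))
```

In this toy setting the second document shares vocabulary with the query and ends up close to it in the latent space, while the two "market" documents do not. A real query would typically add term weighting and stop-word removal and use many more dimensions; those choices are left open here.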

The second computational method makes use of corpus stylistics, a sub-field of corpus linguistics that applies corpus methods and tools to the analysis of style. Baker defines a corpus as any collection of running texts, held in electronic form and analyzable automatically or semi-automatically ("Corpora in Translation Studies" 225). Using corpus stylistics, three style markers will be analyzed: Standardized Type-Token Ratio (STTR), sentence length and punctuation marks (comma, semicolon and hyphen). The third method of analysis applies a machine-learning stylometry technique. The machine learning approach to style analysis is one of the most advanced approaches in this kind of research. This method is built on the notion of Artificial Intelligence (AI) and on the fact that a machine, i.e., a computer, can learn from data. Machine learning style analysis is used in this study to analyze three style markers: word n-grams, character n-grams and Part-of-Speech n-grams.

The present study applies multiple analysis methods to the same set of data. This is known as data analysis triangulation. Analysis triangulation is defined as a situation whereby two or more analysis techniques are used for the same data set (Ziyani, King, and Ehlers 12). In the present study, J-D's creative writing and translations are analyzed using three different methods of analysis to investigate the relation between, and the mutual impact of, his translating and creative writing activities. Ashatu Hussein argues that the importance of triangulation stems from two facts: it is used for increasing the wider and deep understanding of the study phenomenon, and it is also used to increase the study accuracy, in this case triangulation is one of the

validity measures (1). Therefore, this study mixes different analysis methods to reach a deeper understanding of the interaction between J-D's authorial and translational style and to validate the findings of the methods adopted. The following sections provide an overview of the data collection and of corpus compilation and control, and discuss the three analysis methods used in this study.

3.2. Data Collection

The data of this study include all short stories written and translated by J-D. Only short stories are included, for two reasons: first, J-D wrote short stories and did not produce work in any other genre; second, restricting the data to short stories controls the corpus for genre. This helps achieve adequate results when comparing J-D's style in his creative writing and in his translations of short stories. Full lists of J-D's own and translated short stories are provided in Appendixes A and B.

3.3. Corpus Database

The corpus of this study is stored in an OpenOffice database; OpenOffice is an open-source productivity suite. The database includes the following metadata for each short story: translation year, collection title and ID, short story title, publisher, source text author and notes. Each collection of short stories has a unique four-character alphanumeric identifier (ID). The IDs of the translated collections were predefined as follows: the collections translated by J-D before writing his own collection of short stories are given TB as a prefix followed by a number, whereas the collections that were

translated after writing his short stories are given TA as a prefix followed by a number. Assigning predefined IDs in this way makes it possible to retrieve a list of collections and short stories based on their translation date. While entering the titles of the short stories in the database, I noticed that some of the translated short stories were published in more than one collection. Thus, the database was set up so that it does not allow duplicates. When a short story is published in more than one collection, a note is entered in the notes field of the database.

3.4. Corpus Compilation and Pre-processing

The first step in corpus compilation is to have all texts in a machine-readable format (MRF), which is an electronic format that can be read and analyzed by the computer. Most of the short stories included in this study were scanned and converted into MRF, plain-text format with the .txt [19] extension, using an Optical Character Recognition (OCR) tool. This tool converts scanned images into text that can be read by computers. Other short stories were converted from either Portable Document Format (PDF) or Electronic Publication (EPUB) format into plain text. The .txt files were pre-processed and cleaned to make sure that the body of the texts does not contain any running heads, page numbers, footnotes or any characters that could affect the analysis results.

[19] Such files contain very little or no formatting. This format is used to make sure that text formatting has no effect on the corpus analysis.
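The kind of cleaning described above can be sketched as follows. The source does not describe the actual cleaning procedure, so this is only an illustration under stated assumptions: page numbers appear as lines containing nothing but digits, running heads are known in advance and supplied by the user, and words split across line breaks by a hyphen should be rejoined. The function name clean_ocr_text and the sample text are hypothetical.

```python
import re


def clean_ocr_text(raw, running_heads=()):
    """Remove page-number lines and known running heads, then repair line-break hyphens."""
    heads = {head.strip().lower() for head in running_heads}
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"\d{1,4}", stripped):
            continue  # a line holding only digits is treated as a page number
        if stripped.lower() in heads:
            continue  # a line equal to a known running head is dropped
        kept.append(stripped)
    text = "\n".join(kept)
    # Rejoin words split across lines ("har-\nbour" -> "harbour").
    # Caveat: this would also merge a genuine hyphenated compound that happens
    # to break at the end of a line, so the output still needs a manual check.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces and tabs
    return text.strip()


# Invented sample standing in for OCR output of a scanned page spread.
sample = (
    "THE LAST HOUR\n12\nHe walked to the har-\nbour at dawn.\n"
    "13\nTHE LAST HOUR\nThe sea was calm."
)
cleaned = clean_ocr_text(sample, running_heads=["The Last Hour"])
print(cleaned)
```

Footnotes and OCR misreadings are not handled by this sketch; they usually require inspection of the individual files.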

The second step involves text selection. As Stamatatos has argued, "any good evaluation corpus for authorship attribution [author's personal style] should be controlled for genre and topic" (21). In other words, a corpus built for the purpose of analyzing authorial style should be controlled for genre and topic. Thus, this study controls for genre and only considers short stories translated and written by J-D. It also controls for theme and builds three corpora that are, to a certain extent, related to each other in theme. LSA Similarity query is used as a computerized text selection method to control the corpora for theme. Similarity query is discussed in more detail in the section LSA Similarity Query. The initial three corpora in this study are:

TABLE 1: THE CORPORA OF THE PRESENT STUDY
Columns: Corpus Name | Description | Number of Short Stories Included | Overall Size in Words | Production Date
TBCRW_raw | Translational corpus; contains all short stories translated from Arabic into English by J-D before writing his own short stories
CRW | Creative writing corpus; contains the short stories written by J-D
TACRW_raw | Translational corpus; contains all short stories translated from Arabic into English by J-D after writing his own short stories

3.5. Latent Semantic Analysis

As mentioned earlier, Similarity query using Latent Semantic Analysis (LSA) is used in this study to reveal the thematic relation between the short stories in the three corpora and to build two translational sub-corpora that are close in theme to the creative writing corpus. Before discussing the notion of Similarity query, it is useful to define LSA and to lay out the main processes through which it works.

LSA is "a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text" (Landauer and Dumais). Landauer, Foltz, and Laham argue that LSA is an automated text analysis method that can "approximate human judgments of meaning similarity between words and can objectively predict the consequence of overall word-based similarity between passages" (5). LSA is regarded as an advanced information retrieval method that addresses the problems caused by synonymy and polysemy. It goes beyond traditional Information Retrieval (IR) methods that rely on keyword matching techniques and rather deals with the concept, and carries out a search on this basis (Antai, Fox, and Kruschwitz 161). In

retrieving information from textual data, LSA takes advantage of the conceptual and semantic content of texts 20. LSA makes use of the relation between words and concepts to reveal the latent (hidden) relations between words. For instance, LSA would differentiate Apple the fruit from Apple the company based on the relation of these two terms with other terms in the same document. When the word Apple accompanies terms such as iMac and OS X, the document is more likely describing the company, whereas when the term Apple accompanies terms such as calories and fruit, the document is more likely discussing a topic related to the fruit.

LSA relies on complex mathematical algorithms that convert texts into matrices, and the analysis goes through several steps. The first step is the creation of a term-document matrix from the terms and documents in a target corpus 21. In this matrix the terms are placed on the rows and the documents on the columns. The entries in the matrix are the frequencies of each term in the corresponding document. The following figure shows three documents and their terms in a matrix:

20 The LSA method works by categorizing terms and concepts in a document in what is called a concept space. That is, related terms will be mapped onto their concept space and other concepts can be retrieved from a particular concept. LSA uses Singular Value Decomposition (SVD) to create this concept space (Antai, Fox, and Kruschwitz 161). SVD is a mathematical technique for dimensionality reduction. That is, given a large vector space, SVD attempts to reduce the number of dimensions in the space by combining terms (this process is explained in the following paragraphs). This allows patterns to be revealed between terms and concepts in a corpus of (un)structured documents.

21 A corpus on which LSA is applied.
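The construction of a term-document matrix as described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the three short documents are invented for the example (they only borrow vocabulary from the made-up documents in this chapter), tokenization is a plain whitespace split, and `term_document_matrix` is a hypothetical helper, not part of any tool used in the study.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix) with terms on the rows and documents on the columns."""
    # Count term frequencies per document (lowercased, split on whitespace).
    counts = [Counter(text.lower().split()) for text in docs]
    # The vocabulary is the sorted set of all terms seen in any document.
    terms = sorted(set().union(*counts))
    # Each row holds the frequency of one term in every document.
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

# Hypothetical documents, for illustration only.
docs = [
    "Kent State graduate campus",
    "Kent UK Canterbury",
    "European education community education",
]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(term, row)
```

In a full LSA pipeline, stop word removal, term weighting and SVD would follow this step.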

FIGURE 1: MATRIX OF MADE UP DOCUMENTS (columns: Terms, Doc1, Doc2, Doc3; row terms: The, State, Kent, graduate, community, European, UK, education, Canterbury, campus)

The first document in the above matrix, Doc 1, is more likely to discuss topics related to Kent State, the educational institution, whereas the second document, Doc 2, is more likely to discuss topics related to Kent, a county in the United Kingdom.

The matrix creation process is followed by a pre-processing step, which includes stop word removal and the assignment of weights to terms (Antai, Fox, and Kruschwitz 162). An important step in using LSA is creating a stop word list containing prepositions, conjunctions, articles and names. The importance of including such a list stems from the fact that LSA should capture the relation between meaningful words in order to reach a good level of accuracy. Term weight depends on the score or the frequency of terms in a set of documents and on applying Singular Value Decomposition (SVD) 22 on the term-document matrix as a method for data reduction. SVD captures the strong relations between terms and removes other terms

22 SVD takes a high dimensional, highly variable set of data points and reduces it to a lower dimensional space that exposes the substructure of the original data more clearly and orders it from most variation to the least (K. Baker 16). To do so, SVD takes the term-document matrix, X, and decomposes it into three matrices: "[o]ne component matrix describes the original row entities [of X] as vectors of derived orthogonal factor values, another describes the original column entities in the same way, and the third is a diagonal matrix containing scaling values such that when the three components are matrix-multiplied, the original matrix is reconstructed" (Landauer, Foltz, and Laham 8).

with unimportant value to the document. In other words, it reconstructs the initial matrix by only including terms that have a strong relation with other terms in the document. Then, each document is represented as a vector 23 in a dimensional space, and the similarity between words and/or documents is computed by measuring the similarity between their vectors (see Figure 2: Made up documents in 2-dimensional space). Usually, the cosine of the angle between the vectors in Dim-space is used to determine the similarity between terms and between documents. The following figure explains this process:

FIGURE 2: MADE UP DOCUMENTS IN 2-DIMENSIONAL SPACE (axes: Dim 1 and Dim 2; plotted vectors: Doc 1, Doc 2, Doc 3, Doc 4, Doc 5)

In the above figure, five documents are represented as vectors (Doc1, Doc2, Doc3,

23 A vector is a mathematical representation of the latent topics/themes in the document. The researcher should decide on the vector value, or V value, which is the number of latent topics that a vector should include. The V value varies from one study to another depending on the corpus size and the research questions.

Doc4, Doc5). The distance between vectors represents the distance between the terms and concepts in one document and the terms and concepts in other documents. The similarity between documents is measured by how close their vectors are to each other. LSA measures the angle between the vectors (which represent documents) in a dimensional space, Dim, to determine their similarity. The closer the vectors are to each other, the more similar the documents are. For instance, Doc 1 is more like Doc 4 compared to Doc

LSA Similarity Query

LSA similarity query, or similarity analysis, is one of the applications of LSA. It is used to determine the thematic similarity between a query document and a number of other documents in one or more corpora. A query document is like a query term that a user enters in the Google.com search bar; however, the query document is a whole document that can be thousands of words long. The following figure explains this process:

FIGURE 3: LSA SIMILARITY QUERY (steps: 1 Query Document, 2 Target Corpus, 3 LSA Processing, 4 Query Results)
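The similarity query outlined above can be illustrated with a small sketch that ranks the documents of a target corpus by the cosine of the angle between their vectors and the query vector. The two-dimensional vectors, the document names and the functions `cosine` and `similarity_query` are all hypothetical and serve only to show the mechanics; in the study itself the vectors come from the LSA space.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def similarity_query(query_vec, target_vecs):
    """Rank target documents from most to least similar to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in target_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical 2-dimensional document vectors.
target = {"Doc A": [0.9, 0.1], "Doc B": [0.1, 0.9], "Doc C": [0.7, 0.7]}
ranking = similarity_query([1.0, 0.0], target)
for name, score in ranking:
    print(name, round(score, 3))
```

A cosine value close to 1 indicates that two vectors point in nearly the same direction, which is read here as thematic closeness.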

LSA similarity query is used in this dissertation to retrieve the translated short stories that are most thematically relevant to their counterparts in the creative writing corpus. The short stories (in the two translational corpora) that are most thematically relevant to each query document (each creative writing short story) are retrieved to build (TBCRW) and (TACRW). The one-to-many 24 approach is followed to apply similarity query on the corpora of this study. This approach takes a single document as a query document (each short story from the creative writing corpus (CRW)), compares it to a number of other documents in a target corpus, (TACRW_raw) and (TBCRW_raw) in this case, and then retrieves the document(s) that are most thematically relevant to the initial query document (the creative writing short story). See Figure 4 below:

FIGURE 4: ONE-TO-MANY SIMILARITY QUERY PROCESS (query documents: Doc 1 to Doc 15 in CRW; target corpora: Doc A, Doc B, Doc C ... Doc n; results: TACRW and TBCRW)

This process results in the creation of the two translational sub-corpora (TACRW and TBCRW), which are used in the second and third data analysis methods, namely corpus analysis and machine learning.

24 One-to-many is one of the LSA applications. It is also used for vocabulary testing and essay grading by some universities such as the University of Colorado at Boulder. (see
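One way to picture how the one-to-many results are turned into a sub-corpus is the sketch below. It assumes that similarity scores have already been computed for every query document, that the k best hits per query are kept, and that the sub-corpus is the union of those hits without duplicates. The identifiers, the scores and the function `build_subcorpus` are hypothetical.

```python
def build_subcorpus(scores, k):
    """Return the sorted union of the k most similar documents per query.

    scores maps each query document ID to a dict {target document ID: cosine}.
    """
    selected = set()
    for query_id, doc_scores in scores.items():
        # Rank the target documents for this query by descending similarity.
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
        selected.update(ranked[:k])
    return sorted(selected)

# Hypothetical similarity scores for two query documents.
scores = {
    "CRW_01": {"TB_a": 0.82, "TB_b": 0.40, "TB_c": 0.75},
    "CRW_02": {"TB_a": 0.30, "TB_b": 0.91, "TB_c": 0.77},
}
subcorpus = build_subcorpus(scores, k=2)
print(subcorpus)
```

Taking the union means that a translated short story retrieved for several query documents appears only once in the resulting sub-corpus.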

LSA Similarity Cutoff

The output of the LSA analysis provides the degree of similarity between the query document and the other documents in the corpus by giving the cosine of the angle between their vectors. The researcher should determine the similarity cutoff, which is the cosine value above which a retrieved document is considered relevant to the query document. Scholars indicate that there is no defined rule for determining this value; however, most studies reported that this value occurs between (Graesser et al., Penumatsa et al.). The present study sets the similarity value to 0.70 and considers a translated short story to be similar to a creative writing short story if their vectors have a cosine value of 0.70 or greater (See Table 5 page 84 for an example of LSA output).

LSA Output Evaluation

The LSA output is evaluated manually. This manual evaluation includes reading each query short story (from the creative writing corpus) and its first five thematically relevant translated short stories as retrieved by the LSA analysis. This gives the researcher the chance to evaluate the LSA results and to become familiar with the kinds of themes, characters and settings in the three corpora (TBCRW, CRW, and TACRW).

Corpus Stylistics

The second method used in this study adopts a corpus stylistic approach, a corpus-based approach that makes use of computational methods to analyze style in a collection of digital texts. Corpus stylistics relies on the statistical analysis of style based on the frequencies of stylistic features, or style markers, such as sentence length and word length. This study traces the stylistic influence of J-D's translating activity on his creative writing and vice versa by analyzing three style markers (Standardized Type-Token Ratio, Mean Sentence Length and Punctuation Marks) in the three corpora produced by J-D (translation before creative writing, creative writing and translation after creative writing). To do so, the study uses the WordSmith (WS) tool (Scott), a corpus analysis tool that retrieves stylistic statistical information from digital texts. The corpus analysis output of WS is also tested using one-way independent samples Analysis of Variance (ANOVA) to compare the effect of text production activity (translation before creative writing, creative writing and translation after creative writing) on the three style markers. The following sections explain the three style markers analyzed in this study as well as the statistical analysis applied to the data.

Standardized Type-Token Ratio (STTR)

Type-Token Ratio (TTR) reveals what is called lexical density or vocabulary richness by calculating the ratio of types (words without repetition) to tokens (the overall number of words in a document) in a specific text. In other words,

it measures the diversity of vocabulary used by a writer, or in a given corpus (250). TTR is affected by text length; it is suitable for measuring vocabulary richness in equal-sized corpora, but it is of little use if the corpora in the study are not of equal size. Standardized Type-Token Ratio (STTR), on the other hand, is used when the corpora under study are not of equal size. STTR calculates the average TTR over consecutive word chunks of a text; it calculates, for example, the TTR in each consecutive 500 or 1000 tokens (Kubát and Milička 341).

Mean Sentence Length

Mean sentence length is the average number of words in a sentence. As Udny Yule argues, it has been established in the literature that sentence length, or the number of words per sentence, is a characteristic of an author's style (370). Jon Patton and Fazil Can also use sentence length as a style marker, among others, to analyze authorial style. Their results reveal that sentence length is one of the best style markers for distinguishing an author's style. A sentence is defined in the WordSmith tool's User Guide as "the full-stop, question-mark or exclamation-mark (.?!) and immediately followed by one or more word separators and then a number or a currency symbol, or a letter in the current language which isn't lower-case" (Scott 317). The analysis of mean sentence length in this dissertation was preceded by the find short sentence method, a function in the WordSmith tool that lists all short sentences. The importance of applying the find short sentence method before measuring the mean sentence length lies in making sure that sentences such as

I like Mr. John. are not considered as two sentences. The find short sentence method makes it possible to trace all the forms that may affect sentence length analysis and convert them to forms that have no effect on it. For example, when this method retrieves a form such as Mr. as a sentence, the researcher manually removes the period to make it Mr, so that the tool treats Mr as a word and not as a sentence.

Punctuation marks

Li, Zheng, and Chen argue that "incorporating punctuation frequency as a feature can improve the performance of authorship identification [style analysis]" (80). Three punctuation marks are analyzed in this dissertation: the semicolon, the comma and the hyphen. Punctuation marks analysis using corpus-based methods relies on frequency scores of punctuation marks in a given corpus. However, since text size could affect this kind of analysis, and given that the three corpora in this study are not equal in size, the frequency of each punctuation mark is calculated per 1000 words (standardized punctuation marks analysis). Text size as a variable therefore has no effect on the punctuation marks analysis.

Statistical Testing
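As a rough illustration of the one-way ANOVA applied in this section, the sketch below computes the F statistic from the between-group and within-group sums of squares. It is not the SPSS procedure used in the study: the function name `one_way_anova_f` and the sample values are hypothetical, and the sketch stops at the F statistic without deriving a p-value or running a post-hoc test.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    # Between-group variation: group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: observations around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical scores of one style marker under three conditions.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df_b, df_w)
```

A large F value indicates that the group means differ more than the variation inside the groups would suggest; a statistics package is still needed to obtain the significance level.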

Statistical analyses were conducted using the Statistical Package for the Social Sciences (SPSS version 22; SPSS Inc., Chicago, IL) to determine if there was a significant difference between the mean scores of the three style markers under three conditions (Translation Before Creative Writing, Creative Writing, and Translation After Creative Writing). To do so, one-way independent samples Analysis of Variance (ANOVA) was used. ANOVA is a statistical test used to analyze the difference between the means of groups in a specific study to determine if there is any statistically significant difference between them. When there is a significant difference between the means of the compared groups, ANOVA does not provide information on which group is different from the other. In this case, a post-hoc statistical analysis should be used to determine which means are significantly different from each other. The present study applies the Tukey HSD (Honestly Significant Difference) test, a post-hoc statistical test applied when the ANOVA test reveals that there is a significant difference between the means of the groups. The Tukey HSD test helps reveal which groups are significantly different from each other.

Machine Learning Approach

The third method used in this study to analyze J-D's translational and authorial style makes use of machine learning stylometry. The machine learning approach to style analysis is one of the most advanced methods in this field. This method goes beyond the statistical methods of style analysis, such as corpus stylistics. It relies on artificial intelligence and what is called automatic pattern recognition, which is the

ability of the machine to learn the style of a text producer by training the machine on a corpus of text written by a specific author (Ramyaa, Rasheed, and He). Machine-learning stylometric analysis is conducted in five main stages. The first stage involves feeding the machine, i.e. the computer, with a training corpus of texts written by a specific author. In the second stage, the machine is trained on the style of the training corpus author by analyzing his/her style based on machine learning algorithms and a predefined set of style marker(s). This training stage results in building the authorial stylistic profile of the training corpus author 25. The third stage involves querying the machine on whether or not a specific text, written by an unknown author, is written by the training corpus author. At this stage, the machine analyzes the text written by the unknown author and builds a stylistic profile of this text. Finally, the machine compares the two stylistic profiles (of the training data author and of the text with unknown author) and decides whether the author of the training data is also the author of the query text.

The machine learning approach to style analysis has been used and tested in several studies investigating the personal style of authors (Luyckx and Daelemans; Elayidom et al.), and most studies reported that this method is viable and very useful for investigating authorial style. Machine learning stylometry is used in this study to determine which of J-D's text production activities (translation before creative writing, creative writing and translation after creative writing) is stylistically

25 This stage is also called automated pattern recognition, which involves training the machine to learn the style of a text producer based on some stylistic patterns in the training corpus (Ramyaa, Rasheed, and He).

close to the other. In the machine learning experiments, J-D's translations before creative writing (TBCRW) and after creative writing (TACRW) are used as training corpora. The creative writing short stories (CRW) are used as query documents in order to reveal the extent to which the authorial stylistic profile of J-D in his creative writing is close to his translational stylistic profiles in the translations produced before and after creative writing (See Figure 13 on page 109). The stylometric analysis of J-D's translations and creative writings is based on the analysis of three style markers:

Character n-grams

Oakes & Ji (2012) define a character n-gram as a sub-sequence of n characters from a given word (153). For example, the 2-grams (sequences of two consecutive characters) of the word happy are ha, ap, pp, py. Jack Grieve compared the feasibility of word frequency, punctuation marks and character n-grams for style analysis purposes. He reported that the character n-gram is one of the most effective measures for revealing the style of a document's author. John Houvardas and Stamatatos indicated that character n-grams are able to capture complicated stylistic information on the lexical, syntactic, or structural level (78). They pointed out that character 3-gram analysis could reveal lexical information such as [/the/ /_to/], word-class [/ing/, /ed] or punctuation usage (/._X/, /_ X/) information (Houvardas and Stamatatos 78). The current study uses character n-grams with n=3 in order to allow a deeper stylistic analysis. Since the character n-gram method is able to reveal

information related to punctuation mark usage, this method would help validate the corpus analysis of punctuation marks.

Part of Speech (POS) n-grams

Part of Speech (POS) analysis depends on analyzing the part-of-speech tags assigned to the words of a text. For this particular type of analysis two tools are needed. The first is a POS tagger, a piece of software that reads texts in a certain language and assigns a part of speech to each word. The second is a corpus analysis tool that analyzes the tagged text depending on the research questions. POS analysis tends to capture the syntactic style of the document writer. Patrick Juola indicates that there is general agreement in the authorship attribution literature that POS tags are good features to include in any style-related study. He also argues that "methods that do not use syntax in one form or another, either through the use of word n-grams or explicit syntactic coding tend to perform poorly" (320). The importance of using POS tags stems from the fact that authors tend to use similar syntactic patterns unintentionally in their different writings. Several researchers, such as Argamon-Engelson, Koppel, and Avneri, and Zhao and Zobel, have used POS tag n-grams as a way to study authorial style and have reported that this style marker gives quite accurate results. As for the analysis of POS tags in this study, 2-, 3- and 4-gram POS tags were considered. Since the syntactic conventions of the target language could impact the syntactic style of J-D in his translations and creative writing, a fourth control

corpus, containing fifteen short stories written originally in English during the period ( ), is used. The control corpus is used to ensure that the syntactic conventions of the target language have no influence on the analysis of J-D's syntactic style. In order to perform POS analysis, this study automatically tagged the three initial corpora using the Stanford Log-linear Part-Of-Speech Tagger. This tagger gives 97.24% accuracy in tagging texts (Toutanova et al.). In producing the tags, this tagger uses the Penn Treebank tag set, shown in Table 2 below:

TABLE 2: PENN TREEBANK TAG SET (Adapted from Marcus, Santorini, and Marcinkiewicz)
POS Tag | Description | Example
CC | coordinating conjunction | and
CD | cardinal number | 1, three
DT | determiner | the
EX | existential there | there is
FW | foreign word | d'hoevre
IN | preposition/subordinating conjunction | in, of, like, after, that
JJ | adjective | green
JJR | adjective, comparative | greener
JJS | adjective, superlative | greenest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular or mass | table
NNS | noun, plural | tables
NP | proper noun, singular | John
NPS | proper noun, plural | Vikings
PDT | predeterminer | both the boys
POS | possessive ending | friend's
PP | personal pronoun | I, he, it
PP$ | possessive pronoun | my, his
RB | adverb | however, usually, here, good
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | give up
SYM | symbol | $, %
TO | to | to go, to him
UH | interjection | uhhuhhuhh
VB | verb, base form | take
VBD | verb, past tense | took
VBG | verb, gerund/present participle | taking
VBN | verb, past participle | taken
VBP | verb, sing. present, non-3d | take
VBZ | verb, 3rd person sing. present | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when

Word n-grams

Word n-gram analysis shows the most frequent n-grams in a document depending on the size of n. For instance, a bigram list consists of the most frequent sequences of two words that occur together. Stamatatos, Fakotakis, and Kokkinakis and Kim and Walter use word n-grams, reporting that word n-gram analysis is a good method for studying the style of a document's author. In the same way, Raghavan, Kovashka, and Mooney apply the n-gram model to study authorship attribution. In their study, they report that the best performing n-gram is the 3-gram model with an accuracy of 98.34%. Within translation studies, the idea of using word n-grams in analyzing translated texts is discussed by Dorothy Kenny. She refers to word n-grams as word clusters, indicating that retrieving and analyzing two- or three-word clusters helps reveal different types of patterns in translated texts. These patterns can be related to the style of a translated text or to the style of the translator (42, 138). Similarly, Michaela Mahlberg uses the term word cluster to refer to word n-grams. She uses word clusters to analyze Dickens's style in his fiction. The present study uses the term word n-gram to refer to word cluster analysis. When applying word n-gram analysis, the researcher should decide on the size of n, which varies from one study to another. As for the present study, n-gram is applied with n=, 3 and 4 to see which n-gram model works better in the case of authorial and translational style. Word n-grams have also been used for topic discovery or topic modeling, a natural language processing technique used to reveal the topic or theme of a specific set of documents and classify them accordingly (Wang, McCallum, and Wei). Word

n-gram analysis would help reveal lexical patterns in J-D's translations and creative writing and would also help validate the Latent Semantic Analysis (LSA) results 26.

26 It is worth mentioning that LSA analyzes the topic in a text by following what is called the bag of words technique, an NLP technique in which word order is not important. Word n-gram analysis, by contrast, relies on word order and the co-occurrences of words in a text. The two techniques use different analysis methods to reveal the thematic content of a text.

Tools Used in the Dissertation

Three tools are used to analyze the data in this study. The first is Gensim, an open source topic-modeling tool implemented in the Python programming language. The second is WordSmith, a Windows-based tool used in corpus linguistics. The third is JGAAP (Java Graphical Authorship Attribution Program), a free machine-learning stylometry tool. Table 3 below provides a description of these tools along with their main features:

TABLE 3: TOOLS USED IN THE DISSERTATION
Tool | Features | About the tool
WordSmith 6.0 | Concord; Key Words list; Word Lists; N-gram lists | Corpus tool created by Mike Scott at the University of Liverpool. The tool is available for sale in the tool's webpage:
Gensim tool | LSA; Topic modeling; Document categorization; Similarity Query | An open source topic-modeling tool that is implemented in the Python programming language. The Python code for the tool is available online for researchers on the tool's website (
JGAAP Tool | Machine Learning; Style analysis | Java Graphical Authorship Attribution Program (JGAAP), a free machine-learning stylometry tool. JGAAP is a Java-based program for stylometric analysis developed by the Evaluating Variation in Language Laboratory (EVL Lab) at Duquesne University, Pennsylvania.

Conclusion

This chapter has provided detailed information about data collection and about corpus pre-processing, compilation, control and analysis. It has also discussed the three computational methods used in this study (Latent Semantic Analysis, Corpus Stylistics and Machine Learning Stylometry) and the purpose behind using each method, as well as the set of style markers analyzed with each method. The following chapter discusses the processes and the results of applying Latent Semantic Analysis to the three corpora in this study: the corpus of translated short stories produced before the production of creative writing (TBCRW_raw), the corpus of creative writing short stories (CRW) and the corpus of translated short stories produced after the production of creative writing (TACRW_raw).

CHAPTER 4: LATENT SEMANTIC ANALYSIS RESULTS

4.1. Introduction

This chapter presents the results of the LSA experiments on the three short story corpora produced by J-D: the corpus of translated short stories produced before the production of creative writing (TBCRW_raw), the corpus of creative writing short stories (CRW) and the corpus of translated short stories produced after the production of creative writing (TACRW_raw). The chapter is divided into three sections. The first section provides a general overview of the LSA experiment applied to the three corpora in this study (TBCRW_raw, CRW and TACRW_raw). The second section shows the results of the first LSA experiment, concerning the thematic similarity of the translated short stories in (TBCRW_raw) to their counterparts in J-D's creative writing (CRW), while the third section discusses the thematic similarity of the creative writing short stories (CRW) to the short stories translated after the production of the creative writing (TACRW_raw). This chapter relies on data visualization as a method for data representation. A number of charts are provided in each section to visualize the LSA experiments, processes and results.

The first LSA experiment described and reported in this chapter is meant to test the third hypothesis, which claims that the short stories J-D translated before the production of creative writing are close in theme to his creative writing short stories. The second LSA experiment tests the fourth hypothesis, which claims that J-D's own short stories are close in theme to the short stories he translated after the production

of his creative writing. As discussed in the methodology chapter, LSA Similarity analysis is used to reveal if the themes J-D developed in his creative writing were influenced by the themes of the short stories he translated before the production of his creative writing. This analysis is also used to determine if the themes in J-D's creative writing impacted his choice of the short stories translated after the production of creative writing.

LSA Similarity Analysis

As pointed out in the methodology chapter, LSA is a fully automated computational method that applies mathematical algorithms to retrieve and represent the contextual meaning and usage of words. The power of LSA analysis lies in the fact that it "approximate[s] human judgments of meaning similarity between words and can objectively predict the consequence of overall word-based similarity between passages" (Landauer, Foltz, and Laham). Several scholars, such as Landauer, Foltz, and Laham; Hofmann; T. L. Griffiths and Steyvers; and T. Griffiths and Steyvers ("Prediction and Semantic Association"), indicate that automated document similarity analysis is very useful in many cases, such as structuring a huge corpus of texts based on topics. The applicability of LSA to similarity queries has been demonstrated empirically with excellent rates of accuracy (Deerwester et al.; Ahat, Amor, and Bui). When it comes to the study of translator style using computational methods, LSA similarity analysis can be effectively used to build a corpus of translated texts that is controlled for topic. As mentioned earlier, LSA similarity analysis is used in

this study to retrieve the short stories J-D translated before the production of his creative writing (TBCRW_raw) and the short stories he translated after the production of creative writing (TACRW_raw) that are most thematically relevant to their counterparts in the creative writing corpus (CRW). The following sections explain this process and report the results of applying this method to the three corpora in the present study.

LSA Similarity Query on J-D's Translation before Creative Writing

In this experiment, the fifteen short stories written by J-D (CRW corpus) are used as query documents to retrieve the most relevant short stories in the first translational corpus (TBCRW_raw), which contains the short stories J-D translated before writing his own short stories. The five short stories most relevant to each query document (creative writing short story) were retrieved to build (TBCRW), which contains short stories translated by J-D and is relevant in topic to the creative writing corpus. The following figure shows this process:

FIGURE 5: LSA ANALYSIS OF TBCRW_RAW (steps: 1 Query Document (Doc 1 to Doc 15 in CRW), 2 Target Corpus (TBCRW_raw), 3 LSA Processing, 4 Query Results (TBCRW))

As mentioned in the methodology chapter, the present study uses the Gensim[27] tool to run the LSA experiment. LSA analysis using Gensim requires preprocessing the corpus in a specific way: each corpus is saved in one txt file containing as many lines as there are short stories in the corpus. In other words, each line contains one short story. The first corpus, CRW, which contains fifteen creative writing short stories, is converted into a 15-line txt file. Similarly, TBCRW_raw is converted into one txt file containing two hundred and sixteen lines, which reflects the number of short stories J-D translated before writing his own collection of short stories, as Table 4 below shows:

TABLE 4: CRW AND TBCRW_RAW SIZE
Corpus | Number of lines/documents
CRW | 15
TBCRW_raw | 216

[27] An open-source topic-modeling tool implemented in the Python programming language (see the methodology chapter).

The second step in the LSA analysis requires pre-processing the corpus by applying a stoplist to the data. The stoplist contains words such as prepositions and articles that carry less lexical meaning than content words such as verbs or nouns. Such a list is important because LSA should capture the relations between meaningful words in order to reach a good level of accuracy. After the corpus is pre-processed using the stoplist, LSA represents the corpus documents in a vector format within a dimensional space (see Figure 2, page 59). The size of the vector, that is, the number of latent themes on which the similarity experiment is based, must be determined. The thematic similarity or relevance between documents in a specific corpus is determined by calculating the cosine of the angle between their vectors. In the literature, different researchers have used different dimensionalities in their experiments (Nakov). Typically, to choose the vector size, one runs experiments with different values (e.g., 50, 100, 200 or 300), depending on the research questions and the corpus size in the study, and then selects the value that gives the most accurate results. In this experiment, LSA was run with V = 75, 100, [...].[28] A manual evaluation of the three LSA outputs showed that the overall results did not change when the V value was changed. Thus, the output of the LSA with V = 100 was selected, since it is the value between 75 and [...].

[28] There is no existing consensus on the rules determining the vector size in LSA experiments. The researcher determines the V value based on the corpus size along with a manual evaluation of the LSA output for the different V values tested.

LSA Results with V=100

Table 5 below displays the LSA output of the experiment, featuring the five short stories translated before the production of creative writing that are most thematically relevant to each creative writing short story in (CRW). The table also shows the cosine of the angle between the vectors of the translated short stories from (TBCRW_raw) and the vector of each creative writing short story (CRW). The creative writing short stories are represented as Q (1-15). The Doc column lists the number of the short story that was translated by J-D before the production of his creative writing in the first translational corpus (TBCRW_raw). The similarity column shows the cosine of the angle between the vector of the translated short story (Doc column) and that of the creative writing short story (Q1-15). For example, Q-1 is short story number one in the creative writing corpus. Doc number 166 represents translated short story number one hundred and sixty-six in the translational corpus (TBCRW_raw). The similarity column shows the cosine of the angle between the vector of translated short story number one hundred and sixty-six in (TBCRW_raw) and the vector of short story number one in the creative writing corpus (CRW), which is [...]

TABLE 5: LSA OUTPUT OF SIMILARITY QUERY ANALYSIS ON TBCRW_RAW As explained in the methodology chapter, the similarity cutoff for this experiment is set to 0.70.
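Before any similarity value can be computed, the corpus has to be prepared as described earlier in this section: one short story per line, followed by stoplist filtering. The following is a minimal sketch of that preparation, not the study's actual pipeline. The stoplist, the regular-expression tokenizer and the two sample "stories" are invented for illustration, and plain Python is used instead of Gensim so that the example has no dependencies.

```python
import re
from collections import Counter

# Hypothetical mini stoplist; the stoplist actually used in the study is not shown here.
STOPLIST = {"the", "a", "an", "of", "in", "on", "to", "and", "at"}

def load_documents(text):
    """Treat every non-empty line as one short story (one document per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def preprocess(document):
    """Lowercase, tokenize on letters and apostrophes, and drop stoplist words."""
    tokens = re.findall(r"[a-z']+", document.lower())
    return [token for token in tokens if token not in STOPLIST]

def bag_of_words(tokens):
    """Count how often each remaining token occurs in the document."""
    return Counter(tokens)

# Invented two-line corpus standing in for a corpus file such as the CRW txt file.
sample = "The old man sat in the garden.\nA boy ran to the river and the bridge.\n"
docs = load_documents(sample)
bows = [bag_of_words(preprocess(doc)) for doc in docs]

print(len(docs))   # number of documents = number of lines
print(bows[0])     # content-word counts for the first story
```

The bag-of-words counts produced here are the kind of input from which an LSA implementation derives its reduced V-dimensional vectors; that reduction step is deliberately left out of the sketch.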

The similarity cutoff value determines the cosine value between two vectors at or above which a translated short story is considered thematically similar to a short story in the creative writing corpus. Thus, any short story with a cosine value below 0.70 is considered less thematically relevant to the creative writing short story. Based on the pre-defined cosine value cutoff, Table 5 shows that none of the translated short stories that J-D produced before engaging in his creative writing has a significant thematic similarity to the themes of his own writing. Because a table may not offer an easy reading of the LSA findings, the following graphs visualize the results as spatial graphs:

FIGURE 6: LSA EXPERIMENT 1 RESULTS (Q1-Q5)
[Graph: the five short stories translated by J-D from TBCRW_raw retrieved for each of Q1-Q5, plotted against the similarity cutoff space (0.70)]

FIGURE 7: LSA EXPERIMENT 1 RESULTS (Q6-Q10)
[Graph: the five short stories translated by J-D from TBCRW_raw retrieved for each of Q6-Q10, plotted against the similarity cutoff space (0.70)]

FIGURE 8: LSA EXPERIMENT 1 RESULTS (Q11-Q15)
[Graph: the five short stories translated by J-D from TBCRW_raw retrieved for each of Q11-Q15, plotted against the similarity cutoff space (0.70)]
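The retrieval and cutoff logic behind these graphs can be sketched as follows: each query vector is compared with every target vector by cosine similarity, the five best matches are kept, and only matches at or above 0.70 count as thematically similar. This is an illustrative sketch under stated assumptions: the three-dimensional vectors and the document labels are invented, whereas the study's vectors come from the LSA space with V dimensions.

```python
import math

CUTOFF = 0.70  # similarity cutoff stated in the text
TOP_N = 5      # number of results kept per query document

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_matches(query_vec, target_vecs, n=TOP_N):
    """Return the n (doc_id, similarity) pairs most similar to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in target_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

def above_cutoff(matches, cutoff=CUTOFF):
    """Keep only the matches that reach the similarity cutoff."""
    return [(doc_id, sim) for doc_id, sim in matches if sim >= cutoff]

# Invented vectors; the labels D1..D6 are hypothetical, not documents from the study.
targets = {
    "D1": [1.0, 0.0, 0.0], "D2": [0.9, 0.1, 0.0], "D3": [0.0, 1.0, 0.0],
    "D4": [0.5, 0.5, 0.0], "D5": [0.0, 0.0, 1.0], "D6": [0.2, 0.9, 0.1],
}
query = [1.0, 0.2, 0.0]

matches = top_matches(query, targets)   # five best candidates for this query
relevant = above_cutoff(matches)        # candidates inside the cutoff space
for doc_id, sim in matches:
    print(doc_id, round(sim, 3), "similar" if sim >= CUTOFF else "below cutoff")
```

Separating ranking from thresholding mirrors the two-stage reading of the results: a story can be among the five closest to a query and still fall outside the cutoff space.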

Figures 6-8 visualize the results of the LSA queries and demonstrate the relation between the short stories J-D translated before the production of his creative writing (TBCRW_raw) and the creative writing short stories in (CRW). Q1-Q15 refer to the fifteen creative writing short stories used as query documents. The blue rectangles represent the translated short stories from (TBCRW_raw). For each query, the graphs present the first five results, that is, the five translated short stories most thematically relevant to the creative writing short story. None of the short stories translated before the production of J-D's creative writing falls within the similarity cutoff space in the three graphs; however, some translated short stories in (TBCRW_raw) are more thematically relevant to the creative writing stories in (CRW) than others in the same corpus.

LSA Similarity Query on J-D's Translation after Creative Writing

The second LSA experiment, reported in this section, attempts to reveal the thematic similarity between J-D's creative writing and the short stories he translated after the production of the creative writing short stories. For each creative writing short story (CRW), the five most thematically similar short stories are retrieved from the second translational corpus (TACRW_raw), which contains the short stories translated after the production of creative writing. The output of this LSA experiment is used to build the (TACRW) corpus, which contains the fifteen most thematically relevant short stories translated by J-D after the production of his creative writing. The following figure shows this process:

FIGURE 9: LSA ANALYSIS OF TACRW_RAW
[Diagram, steps 1-4: query document (Doc1..15 in CRW), target corpus (TACRW_raw), LSA processing, query results (TACRW)]

As with the previous experiment, TACRW_raw was saved in a txt file containing one hundred and seven short stories, each on a separate line (see Table 6 below). The stoplist was then applied to the two corpora (CRW and TACRW_raw).

TABLE 6: CRW AND TACRW_RAW SIZE
Corpus | Number of lines/short stories
CRW | 15
TACRW_raw | 107

As for the vector value, the second LSA experiment was run with two vector values, V = 25 and V = 50, considering the size of the corpus. Determining the V value depends on the size of the corpus: since (TACRW_raw) is smaller than (TBCRW_raw), the V value is set lower than in the first LSA experiment. After a manual evaluation of the LSA analysis results with V = 50 and V = 100, I noticed that the

overall results did not change when the V value was changed. However, the cosine values decreased when the experiment was run with a higher V value, although the number of relevant short stories remained the same. For consistency, I selected V = 50, which represents almost half the number of short stories in (TACRW_raw). The same principle was applied in the first experiment, where the V value was set to 100, which is about half the number of short stories in (TBCRW_raw).

LSA Results with V=50

The following table displays the LSA output of the experiment, featuring the five short stories translated after the production of creative writing that are most thematically relevant to each creative writing short story (CRW). The table also shows the cosine of the angle between the vectors of the translated short stories from (TACRW_raw) and the vector of each creative writing short story (CRW). The creative writing short stories are represented as Q (1-15). The Doc column lists the number of the short story that was translated by J-D after the production of his creative writing in the second translational corpus (TACRW_raw). The similarity column shows the cosine of the angle between the vector of the translated short story (Doc column) and that of the creative writing short story (Q1-15). For example, Q-1 represents short story number one in the creative writing corpus. Doc number 28 represents translated short story number twenty-eight in the translational corpus (TACRW_raw). The similarity column shows the cosine of the angle between the vector of translated short story number twenty-eight in (TACRW_raw) and the vector of short story number one in the creative writing corpus (CRW), which is [...]

TABLE 7: LSA OUTPUT OF SIMILARITY QUERY ANALYSIS ON TACRW_RAW

Based on the pre-defined similarity cutoff (cosine value 0.70), Table 7 above shows that many of the translated short stories that J-D produced after engaging in his creative writing have a significant thematic similarity to themes in his own writing. The following graphs demonstrate the thematic similarity between the creative writing short stories and the short stories that were translated after the production of the creative writing short stories (in TACRW_raw):

FIGURE 10: LSA EXPERIMENT 2 RESULTS (Q1-Q5)
[Graph: the five short stories translated by J-D from TACRW_raw retrieved for each of Q1-Q5, plotted against the similarity cutoff space (0.70)]

FIGURE 11: LSA EXPERIMENT 2 RESULTS (Q6-Q10)
[Graph: the five short stories translated by J-D from TACRW_raw retrieved for each of Q6-Q10, plotted against the similarity cutoff space (0.70)]

FIGURE 12: LSA EXPERIMENT 2 RESULTS (Q11-Q15)
[Graph: the five short stories translated by J-D from TACRW_raw retrieved for each of Q11-Q15, plotted against the similarity cutoff space (0.70)]

Figures 10-12 visualize the results of the LSA queries and demonstrate the thematic relation between the short stories in both corpora (CRW and TACRW_raw). Each graph represents five creative writing short stories (Q1-Q5, Q6-Q10, Q11-Q15) and shows their most thematically relevant counterparts in the translational corpus (TACRW_raw). The triangles in the graphs represent the translated short stories from (TACRW_raw). A number of translated short stories fall within the similarity cutoff space in the three graphs. It can also be observed that only four translated short stories are relevant to the fifteen creative writing short stories.

Conclusion

According to the findings, the first LSA experiment, on the relation between (TBCRW_raw) and (CRW), revealed no significant thematic similarity between the short stories translated by J-D before the production of his creative writing and his creative writing short stories. The second LSA experiment, by contrast, revealed a significant thematic similarity between the short stories translated after the production of the creative writing short stories (TACRW_raw) and the creative writing ones in (CRW). These findings are discussed further in Chapter 6 in light of the research hypotheses that motivated this study. The following chapter offers a second set of experiments that use corpus stylistic and machine learning stylometry methods to investigate a selection of style markers specific to J-D's translations and creative writings.

CHAPTER 5: CORPUS STYLISTICS AND MACHINE LEARNING ANALYSIS RESULTS

5.1. Introduction

This chapter is divided into two parts. The first part reports the corpus analysis results as derived from the WordSmith tool. It also reports the results of the one-way independent-samples Analysis of Variance (ANOVA). The second part reports the results of the machine learning experiments. The corpus and the machine learning experiments described in this chapter are meant to test the third hypothesis, which claims that the short stories J-D translated before the production of creative writing display some stylistic markers that are also found in his creative writing. These two experiments also address the fourth hypothesis, i.e., that J-D's creative writing short stories display some stylistic markers that are also found in the short stories he translated after the production of creative writing.

The corpora analyzed in this chapter are built from the results of the two LSA experiments conducted in the previous chapter. That is, the chapter analyzes a set of short stories translated by J-D before and after the production of his creative writing that are thematically relevant to his creative writing short stories. This includes the following corpora:

TABLE 8: STUDY CORPORA FROM THE LSA RESULTS
Corpus | Description | Size in words | Text size range in words
TBCRW | A translational corpus containing the fifteen translated short stories most thematically relevant to J-D's creative writing, translated before the production of J-D's creative writing | [...] | [...]
CRW | Creative writing corpus | [...] | [...]
TACRW | A translational corpus containing the fifteen translated short stories most thematically relevant to J-D's creative writing, translated after the production of J-D's creative writing | [...] | [...]

Corpus Analysis

This section applies a corpus-based approach to translator style and analyzes J-D's style in the corpus of short stories translated before the production of creative writing (TBCRW), the corpus of creative writing short stories (CRW) and the corpus of short stories translated after the production of creative writing (TACRW). The stylistic analysis in this section focuses on three style markers: Standardized Type-Token Ratio (STTR), mean sentence length and punctuation marks (commas, hyphens and semicolons). The goal of this analysis is to trace the stylistic impact of J-D's translating activity on his creative writing activity and, in turn, to see whether the creative writing activity has any stylistic impact on J-D's translating activity. The following sections report the corpus analysis results for the three style markers.

Textual Analysis

Standardized Type-Token Ratio

STTR reveals the degree of vocabulary diversity in a text, or the vocabulary richness of the text producer. As mentioned earlier, STTR has been used

as a style marker to analyze translators' and authors' style. This section provides the STTR results of three corpora produced by J-D. The following table displays the STTR analysis results of the three corpora in this study (TBCRW, CRW, and TACRW). The STTR results in Table 9 below are based on textual chunks of 500 words.[29]

TABLE 9: STTR SCORE IN THE THREE CORPORA

[29] The STTR analysis in this study was set to 500 because one of the short stories J-D translated after the production of his creative writing contains fewer than 1,000 words, and the WordSmith tool did not provide the STTR for this specific short story unless the STTR was measured for each 500 words.

Table 9 shows the mean STTR in J-D's translation before creative writing (TBCRW), in J-D's creative writing (CRW) and in J-D's translation after creative writing (TACRW). There is no marked difference between the mean STTR scores of the translations that J-D produced before and after the production of his creative writing (TBCRW and TACRW). However, there is a noticeable difference between the mean STTR scores of the two translational corpora (TBCRW and TACRW) and that of the creative writing corpus (CRW). The difference between the mean STTR scores in the three corpora is investigated further using statistical significance analysis.

Mean Sentence Length

As explained in the methodology chapter, mean sentence length analysis calculates the average number of words in a sentence. A sentence is defined as the full-stop, question-mark or exclamation-mark (.?!) and immediately followed by one or more word separators and then a number or a currency symbol, or a letter in the current language which isn't lower-case (Scott 317). Table 10 below shows the mean sentence length in each short story in the three corpora along with the overall mean sentence length in each corpus:

TABLE 10: MEAN SENTENCE LENGTH SCORE IN THE THREE CORPORA
TBCRW text | Mean sentence length | CRW text | Mean sentence length | TACRW text | Mean sentence length
[...] | [...] | [...] | [...] | [...] | [...]
Mean±SD | 21.37±7.23 | Mean±SD | 20.68±2.50 | Mean±SD | 18.85±7.45

As displayed in Table 10, the mean sentence length in the first translational corpus, translation before creative writing (TBCRW), ranges from [...] to 12.23, while the range in the creative writing corpus (CRW) starts from a lower score, [...] to [...] words. The overall mean sentence lengths of these two corpora are close. The mean sentence length in the second translational corpus, translation after creative writing (TACRW), ranges from [...] to [...]. The overall mean sentence length scores of the TBCRW and CRW corpora are closer to each other than to that of the TACRW corpus. However, this does not mean that the mean sentence length in TACRW is significantly different from the mean sentence length in the other two corpora (TBCRW and CRW). The statistical significance of mean sentence length is tested in the second section of this chapter.

Punctuation Marks Analysis

Several studies in stylometry consider the use of punctuation marks as a

viable style marker for authorial style analysis. In this regard, Li, Zheng, and Chen argue that incorporating punctuation frequency as a feature can improve the performance of authorship identification (80). As mentioned in the methodology chapter, this study analyzes three punctuation marks: hyphens, semicolons and commas. This analysis relies on punctuation frequency scores. However, since text size could affect this kind of analysis, and given that the three corpora in this study are not equal in size, the frequency of each punctuation mark is calculated per 1,000 words.[30] Text size as a variable therefore has no effect on the punctuation marks analysis.

Standardized Hyphen Analysis

Several studies rely on hyphens as a style marker, either to determine the authorship of a particular text or to analyze the authorial style of the text producer. For instance, Narayanan et al. and Chaski have used the hyphen as a style marker and reported that the hyphen, among other punctuation marks, does help identify authors. Hyphen analysis was applied to the three corpora. The following table shows the hyphen frequency ratio per 1,000 words in each text in the three corpora (TBCRW, CRW and TACRW):

[30] The WordSmith tool by default calculates punctuation marks in textual chunks of 1,000 words. The WordSmith tool provided the punctuation mark ratio for the short story that contains fewer than 1,000 words in this study; however, the tool did not calculate the STTR of the same short story in textual chunks of 1,000 words.

TABLE 11: STANDARDIZED HYPHEN SCORE IN THE THREE CORPORA
TBCRW text | Hyphens per 1,000 | CRW text | Hyphens per 1,000 | TACRW text | Hyphens per 1,000
[...] | [...] | [...] | [...] | [...] | 1.97
Mean±SD | 5.76±3.44 | Mean±SD | 9.38±2.54 | Mean±SD | 4.74±2.42

As Table 11 shows, the mean standardized hyphen score in the first translational corpus, translation Before Creative Writing (TBCRW), ranges from 2.6 to [...]. It ranges from 5.27 to [...] in the Creative Writing corpus (CRW) and from [...] to 8.96 in the translation After Creative Writing (TACRW) corpus. The overall mean standardized hyphen score in the first translational corpus (TBCRW) is 5.76, which is lower than the mean standardized hyphen score in the creative writing corpus (CRW), 9.38. The mean standardized hyphen score in the Creative Writing corpus is the highest, while the scores in the two translation corpora are lower and close to each other.

Standardized Comma Analysis

Another punctuation mark that is widely used in stylometry as a style marker is the comma (Li, Zheng, and Chen). The following table shows the standardized comma score in each short story in the three corpora (TBCRW, CRW, TACRW) along with the overall mean standardized comma score for each corpus.

TABLE 12: STANDARDIZED COMMA SCORE IN THE THREE CORPORA
TBCRW text | Commas per 1,000 | CRW text | Commas per 1,000 | TACRW text | Commas per 1,000
[...] | [...] | [...] | [...] | [...] | [...]
Mean±SD | 51.12±12.57 | Mean±SD | 41.36±7.21 | Mean±SD | 48.90±[...]
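The per-1,000-word normalization behind the hyphen and comma scores can be illustrated with a short sketch. It is only an approximation under stated assumptions: words are taken to be whitespace-separated tokens, every occurrence of the character is counted, and the sample sentence is invented; WordSmith's own tokenization and counting rules may differ.

```python
def per_thousand(text, mark):
    """Frequency of a punctuation mark per 1,000 whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    return text.count(mark) * 1000 / len(words)

# Invented ten-word sample with two commas, no hyphen and no semicolon.
sample = "He came, he saw, and he left the old town."

for name, mark in (("comma", ","), ("hyphen", "-"), ("semicolon", ";")):
    print(name, per_thousand(sample, mark))
```

Because the count is divided by the number of words, a short story and a long story with the same punctuation habits receive the same score, which is what allows texts of unequal size to be compared.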

Table 12 displays the results of the standardized comma analysis in the three corpora. The mean standardized comma score in the first translational corpus, translation Before Creative Writing (TBCRW), is the highest, 51.12, followed by that of the second translational corpus, translation After Creative Writing (TACRW), 48.90. The mean standardized comma scores of the two translational corpora are relatively close to each other compared with the mean standardized comma score of the Creative Writing corpus (41.36).

Standardized Semicolon Analysis

The semicolon has been used as a style marker in many studies investigating authorial style (Hänlein; Ramyaa, Rasheed, and He). Ramyaa, Rasheed, and He pointed out that semicolons indicate the reluctance of an author to stop a sentence where (s)he could. Semicolon analysis might therefore reveal the idiosyncratic style of authors in using this specific punctuation mark. Table 13 below provides the standardized semicolon score for each short story along with the overall score of each corpus:

TABLE 13: STANDARDIZED SEMICOLON SCORE IN THE THREE CORPORA
TBCRW text | Semicolons per 1,000 | CRW text | Semicolons per 1,000 | TACRW text | Semicolons per 1,000
[...] | [...] | [...] | [...] | [...] | 1.13
Mean±SD | 2.46±1.55 | Mean±SD | 2.12±2.0 | Mean±SD | 1.08±.54

Table 13 displays the mean standardized semicolon score in the three corpora. The mean semicolon score in the first translational corpus, translation Before Creative Writing (TBCRW), is the highest, with a mean score of 2.46. The mean standardized semicolon score in the Creative Writing corpus (CRW) is 2.12, which is not very different from that of the first translational corpus. However, the mean standardized semicolon score in the second translational corpus, translation After Creative Writing (TACRW), is 1.08, which is lower than that of the other two corpora.

SPSS Statistical Analysis

The results of the corpus analysis in the above section were then verified using a one-way ANOVA test in order to determine whether there is a significant difference

between the mean scores of the style markers discussed above under the three conditions (Translation Before Creative Writing, Creative Writing, and Translation After Creative Writing). Where there was a significant difference between the mean scores of a style marker under the three conditions, a post hoc group comparison using the Tukey HSD test was also run in order to determine which groups (translation before, creative writing and translation after) differ significantly from each other and which do not. This analysis helps solidify any conclusion drawn about the possible effects of J-D's translating activity on his creative writing activity and vice versa. The following sections report the ANOVA and Tukey HSD test results.

Textual Analysis

Standardized Type-Token Ratios (STTRs)

Mean Standardized Type-Token Ratios (STTRs) for the three conditions (Before Creative Writing, Creative Writing, and After Creative Writing) of the independent variable (Text Production Activity) were submitted to a one-way independent-samples Analysis of Variance (ANOVA). There was a significant effect of Text Production Activity, F(2, 42) = 4.338, p = .019, on the mean STTRs of the three conditions. Thus, at least two of the mean STTRs, for Before Creative Writing [M = 49.27, SD = 2.62], Creative Writing [M = 51.19, SD = 1.48], and After Creative Writing [M = 49.06, SD = 2.28], were significantly different.
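As a plausibility check, the F statistic of a one-way ANOVA with equal group sizes can be recomputed from the group means and standard deviations alone. The sketch below assumes n = 15 texts per condition (consistent with the fifteen texts per corpus and with the reported degrees of freedom, 2 and 42) and treats the reported SDs as sample standard deviations. It uses the rounded values quoted above, so it can only approximate the reported F; the p-value is not computed.

```python
def f_from_summary(means, sds, n):
    """One-way ANOVA F statistic for k groups of equal size n, from summary statistics."""
    k = len(means)
    grand_mean = sum(means) / k
    # Between-groups mean square: spread of the group means around the grand mean.
    ss_between = n * sum((m - grand_mean) ** 2 for m in means)
    ms_between = ss_between / (k - 1)
    # Within-groups mean square: with equal n, the average of the group variances.
    ms_within = sum(sd ** 2 for sd in sds) / k
    df_between, df_within = k - 1, k * (n - 1)
    return ms_between / ms_within, df_between, df_within

# Means and SDs for Before, Creative Writing and After, as reported above for STTR.
f_sttr, df1, df2 = f_from_summary([49.27, 51.19, 49.06], [2.62, 1.48, 2.28], n=15)
print(f"F({df1}, {df2}) = {f_sttr:.2f}")
```

With these rounded inputs the function returns roughly 4.35, close to the reported 4.338; the small gap is what one would expect from rounding in the published means and SDs.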

Post hoc comparisons using the Tukey HSD test indicated a significant difference between the mean STTR of Creative Writing and the mean STTR of After Creative Writing (p = .028). The difference between the mean STTR of Before Creative Writing and the mean STTR of Creative Writing approached significance (p = .052). However, there was no significant difference between the mean STTR of Before Creative Writing and the mean STTR of After Creative Writing (p = .963).

Mean Sentence Length

Sentence length means for the three conditions (Before Creative Writing, Creative Writing, and After Creative Writing) of the independent variable (Text Production Activity) were submitted to a one-way independent-samples Analysis of Variance (ANOVA). There was no significant effect of Text Production Activity, F(2, 42) = 0.665, p = .520, on the three conditions (Before Creative Writing, Creative Writing, and After Creative Writing).

Punctuation Marks Analysis

Standardized Comma Analysis

Mean standardized comma scores for the three conditions (Before Creative Writing, Creative Writing, and After Creative Writing) of the independent variable (Text Production Activity) were submitted to a one-way independent-samples Analysis of Variance (ANOVA). There was a significant effect of Text Production Activity, F(2, 42) = 0.665, p = .013, on the standardized comma means in the three conditions. Thus, at least two of the mean standardized comma scores, for Before Creative Writing [M = 4.186, SD = .191], Creative Writing [M = 4.29, SD = .072], and After Creative Writing [M = 4.186, SD = .191], were significantly different.

Post hoc comparisons using the Tukey HSD test indicated a significant difference between the mean standardized comma score of Creative Writing and that of Before Creative Writing (p = .013). However, there was no significant difference between the mean standardized comma score of After Creative Writing and that of Creative Writing (p = .067), nor between the mean standardized comma score of After Creative Writing and that of Before Creative Writing (p = .136).

Standardized Hyphen Analysis

Mean standardized hyphen ratios for the three conditions (Before Creative Writing, Creative Writing, and After Creative Writing) of the independent variable (Text Production Activity) were submitted to a one-way independent-samples Analysis of Variance (ANOVA). There was a significant effect of Text Production Activity, F(2, 42) = 11.03, p < .001, on the mean standardized hyphen ratios for the three conditions. Thus, at least two of the mean standardized hyphen ratios, for Before Creative Writing [M = 5.76, SD = 3.44], Creative Writing [M = 9.38, SD = 2.54], and After Creative Writing [M = 4.74, SD = 2.42], were significantly different.

Post hoc comparisons using the Tukey HSD test indicated a significant difference between the mean standardized hyphen score of Creative Writing and that of Before Creative Writing (p = .003). There was also a significant difference between the mean standardized hyphen score of After Creative Writing and that of Creative Writing (p < .001). However, there was no significant difference between the mean standardized hyphen score of After Creative Writing and that of Before Creative Writing (p = .593).

Standardized Semicolon Analysis

Mean standardized semicolon scores for the three conditions (Before Creative Writing, Creative Writing, and After Creative Writing) of the independent variable (Text Production Activity) were submitted to a one-way independent-samples Analysis of Variance (ANOVA). There was a significant effect of Text Production Activity, F(2, 42) = 3.457, p = .041, on the mean standardized semicolon scores of the three conditions. Thus, at least two of the mean standardized semicolon scores, for Before Creative Writing [M = 2.46, SD = .40], Creative Writing [M = 2.11, SD = 2.00], and After Creative Writing [M = 1.08, SD = .54], were significantly different.

Post hoc comparisons using the Tukey HSD test indicated a significant difference between the mean standardized semicolon score of Before Creative Writing and that of After Creative Writing (p = .040). However, there was no significant difference between the mean standardized semicolon score of Before Creative Writing and that of Creative Writing (p = .802), nor between the mean standardized semicolon score of Creative Writing and that of After Creative Writing (p = .153).

Machine Learning Stylometry

Machine learning stylometry is used in this section to determine which of J-D's text production activities (translation Before Creative Writing, Creative Writing and translation After Creative Writing) are stylistically closer to one another. The stylometric analysis of J-D's translations and creative writings is based on three style markers: character n-grams,[31] Part of Speech (POS) n-grams and word n-grams. In the machine learning stylometry experiments reported in this chapter, J-D's translations before creative writing (TBCRW) and after creative writing (TACRW) are used as training corpora. The creative writing short stories (CRW) are used as query documents in order to reveal the extent to which J-D's authorial stylistic profile in his creative writing is close to his translational stylistic profile in the translations produced before and after creative writing. Figure 13 below shows how machine-learning stylometric analysis is applied in this study:

[31] An n-gram is a sequence of n textual units. For example, a word 3-gram is a cluster of three words. In the same manner, a character 3-gram is a cluster of three characters; for example, the 3-grams of the word happy would be hap, app, and ppy.
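The character n-gram marker and the training/query setup can be sketched briefly. The example below builds a relative-frequency profile of character 3-grams for each training corpus and reports which profile lies closest to a query text. It is a sketch under stated assumptions, not the procedure of the tool used in the study: the cosine distance and the helper names are choices made here for illustration, and the three short texts are invented.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def profile(texts, n=3):
    """Relative frequency of each character n-gram across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(char_ngrams(text.lower(), n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}

def cosine_distance(p, q):
    """1 minus cosine similarity of two sparse frequency profiles (an assumed measure)."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)

def closest_profile(query, candidates):
    """Name of the candidate profile with the smallest distance to the query."""
    return min(candidates, key=lambda name: cosine_distance(query, candidates[name]))

print(char_ngrams("happy"))  # the footnote's example: hap, app, ppy

# Invented stand-ins for the two training corpora and one query story.
training = {
    "TBCRW": profile(["the cat sat on the mat"]),
    "TACRW": profile(["a dog ran in fog"]),
}
query_profile = profile(["the cat sat"])
print(closest_profile(query_profile, training))
```

The same structure carries over to word or POS n-grams by replacing the n-gram extractor; only the unit being counted changes.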

FIGURE 13: MACHINE LEARNING TRANSLATOR STYLE DETECTION (ADOPTED FROM EFSTATHIOS STAMATATOS)

[Figure: Step 1, the training translational corpus TBCRW (TT 1 + TT 2 + TT ... n) yields J-D's translational profile in TBCRW (S_TB). Step 2, the training translational corpus TACRW (TT C + TT D + TT ... n) yields J-D's translational profile in TACRW (S_TA). Step 3, the creative writing short stories (CRW_1 ... CRW_15) yield J-D's authorial profile in CRW (S_CRW). Step 4, style comparison. Step 5, result.]

First, the machine is trained on the style of J-D in his translations before the production of creative writing (TBCRW) and in his translations after creative writing (TACRW), based on a predefined set of style markers such as character n-grams or word n-grams (steps 1 and 2). The machine internally analyzes and recognizes the stylistic patterns of J-D in TBCRW (S_TB 32) (step 1). In the same manner, the machine takes the texts in the second translational corpus (TACRW) as input to learn the style of J-D in those texts (S_TA) (step 2). After that, the machine is given J-D's creative writing short stories. It analyzes his style, using the same methods and the same style markers that were used to analyze the style in the two training corpora, and produces J-D's authorial stylistic profile (S_CRW) (step 3). As a final step, the machine compares the three stylistic profiles (S_TA, S_TB and S_CRW) (step 4) and determines which translational stylistic profile (S_TA or S_TB) is closer to the authorial stylistic profile (S_CRW) (step 5).

32 S stands for style.

JGAAP Tool

As pointed out in the methodology chapter, the stylometric analysis in this chapter makes use of the Java Graphical Authorship Attribution Program (JGAAP), a free machine-learning stylometry tool. JGAAP is a Java-based program for stylometric analysis developed by the Evaluating Variation in Language Laboratory (EVL Lab) at Duquesne University, Pennsylvania. Figure 14 below shows a screenshot of the JGAAP tool interface:


FIGURE 14: JGAAP TOOL INTERFACE

The above screenshot displays the interface of the JGAAP tool and shows that J-D's creative writing short stories are used as the query documents. These queries will reveal which translational stylistic profile (translation before or after creative writing) is closer to the authorial stylistic profile of J-D in his creative writing. The screenshot also shows that the two translational corpora (translation before and translation after creative writing) are used as training corpora. It is worth mentioning that the term stylistic profile refers to the internal pattern recognition that the machine builds from the style-marker analysis applied to the data. The machine uses these recognized patterns to determine how close they are to the patterns found in the other sets of data (the other stylistic profiles). Note, however, that the user cannot see the stylistic profile (the recognized patterns) of the authors during the processing stage of the analysis.

Corpus Pre-processing

The JGAAP tool allows users to conduct automatic corpus pre-processing based on the user's input. Before the stylometric analysis was conducted, the three corpora in this study (TBCRW, CRW and TACRW) were pre-processed using two canonicalization methods. Canonicalization is a normalization process that converts data that have more than one possible representation into one standard representation. An example of canonicalization is converting all capital letters in a corpus to lowercase letters. The first canonicalization method applied to the three corpora in this study is whitespace normalization, a process that converts all whitespace characters, such as newlines, spaces and tabs, to a single space. This ensures that any spacing introduced by the text conversion processes is normalized across texts. The second method is normalizing the textual data based on the American Standard Code for Information Interchange (ASCII) 33. This process guarantees that the texts analyzed do not contain any non-printing characters. It also removes any characters that are not included in the ASCII table, which includes the printable characters a-z and A-Z, the digits 0-9, punctuation marks, and some other symbols.

33 A character-encoding scheme.
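The two canonicalization steps described above can be illustrated with a short Python sketch. This is not JGAAP's own implementation, which is written in Java and may differ in detail. The function names, the choice to collapse runs of whitespace into one space, and the sample string are all assumptions made for illustration.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space.

    Assumption: runs are collapsed; JGAAP's canonicizer may behave differently.
    """
    return re.sub(r"\s+", " ", text)

def normalize_ascii(text: str) -> str:
    """Keep only printable ASCII characters (codes 32 to 126)."""
    return "".join(ch for ch in text if 32 <= ord(ch) <= 126)

def canonicalize(text: str) -> str:
    # Whitespace is normalized first, so tabs and newlines become spaces
    # before the ASCII filter drops the remaining non-printing characters.
    return normalize_ascii(normalize_whitespace(text))

# Hypothetical sample with a double space, a tab, a newline and two non-ASCII characters
sample = "The  caf\u00e9\twas\nclosed\u2026 again."
print(canonicalize(sample))
```

Applying the same canonicalization to the training corpora and to the query documents matters more than the exact rules chosen, since any difference in pre-processing would itself surface as a spurious stylistic difference.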

JGAAP Analysis Method

JGAAP provides several analysis methods (analysis algorithms). The analysis method used in the machine learning experiments reported in this chapter is the Nearest Neighbor Driver with Cosine Distance as its metric. Nearest Neighbor is an algorithm that matches a query document to the training document whose vector lies closest to it in a multidimensional space. Cosine Distance is the metric used to measure that closeness: it is computed from the cosine of the angle between two document vectors, so the similarity between documents is determined by the distance of their vectors from each other. Figure 15 below exemplifies this process:

FIGURE 15: VECTORS OF MADE UP DOCUMENTS

In the above dimensional space Dim1, Doc 1 and Doc 2 are the vectors of two different documents, and Q Doc is the vector of a query document. If the purpose is to determine which document (Doc 1 or Doc 2) is more similar to the query document Q Doc, the Cosine Distance approach shows that Doc 1 is more like Q Doc, as can be seen from the angles between the vectors. The smaller the angle, the closer the vectors and the more similar the documents are.

Style Markers Analysis

Character n-gram Analysis

As pointed out in the methodology chapter, the current study uses character n-grams with n=3 in order to allow a deeper stylistic analysis. This helps reveal important stylistic information, such as the use of punctuation marks, and lexical information, such as word class. Figure 16 below shows a screenshot of the first two results of the character 3-gram analysis. It also shows creative writing short stories number one and ten and their most stylistically relevant short stories in the translational corpora (translation before or after creative writing), along with the canonicalization and analysis methods applied.

FIGURE 16: MACHINE LEARNING CHARACTER 3-GRAM ANALYSIS
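To make the combination of character 3-grams, Cosine Distance and Nearest Neighbor more concrete, the following Python sketch builds a 3-gram frequency vector for each text and assigns a query text to the closest training text. It is a simplified illustration under stated assumptions, not a reproduction of JGAAP's internals: the function names are hypothetical, the toy sentences are invented, and each training label is represented by a single text rather than a whole corpus.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count every overlapping character n-gram in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_distance(a, b):
    """Return 1 minus the cosine of the angle between two count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbor(query, training, n=3):
    """Return the label of the training text closest to the query text."""
    q = char_ngrams(query, n)
    return min(training, key=lambda label: cosine_distance(q, char_ngrams(training[label], n)))

# Invented toy texts standing in for the two training corpora
training = {
    "TBCRW": "the old man walked slowly to the river",
    "TACRW": "she said nothing; she only smiled",
}
query = "the old man walked to the river"
print(nearest_neighbor(query, training))
```

A distance of 0 means the two vectors point in the same direction, and a distance of 1 means they share no n-grams at all. Because the measure depends on the angle and not on vector length, texts of different sizes can be compared directly.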

The character n-gram analysis with n=3 revealed that J-D's translational stylistic profile 34 in the short stories he translated after the production of creative writing (TACRW) is closer to his authorial stylistic profile 35 in his creative writing (CRW). The analysis showed that the authorial stylistic profile of J-D in thirteen of the fifteen short stories was closer to his translational stylistic profile in the short stories he translated after the production of creative writing (TACRW). The analysis also revealed that J-D's authorial stylistic profile in only two of his creative writing short stories is closer to his translational stylistic profile in the short stories he translated before the production of creative writing.

34 Translational stylistic profile refers to the machine's internal pattern recognition of the character n-grams in J-D's translations.
35 Authorial stylistic profile refers to the machine's internal pattern recognition of the character n-grams in J-D's creative writing.

Part-of-Speech (POS) Analysis

POS n-gram analysis reveals syntactic patterns related to the author of a text. In this experiment, the goal is to investigate whether or not the syntactic style of J-D in the translations produced before creative writing impacted his syntactic style in his creative writing. In order to perform this type of analysis, this study uses an automatic POS tagger that is embedded in the JGAAP tool. Using JGAAP, POS n-gram analysis is conducted with n = 2, 3 and 4. Figure 17 below shows a screenshot of the first two results of the POS 3-gram analysis:

FIGURE 17: MACHINE LEARNING POS N-GRAM ANALYSIS

The POS n-gram analysis with n=3 revealed that J-D's translational stylistic profile 36 in the short stories he translated before the production of creative writing (TBCRW) is closer to his authorial stylistic profile 37 in his creative writing (CRW). The analysis showed that the authorial stylistic profile of J-D in all fifteen short stories was closer to his translational stylistic profile in the short stories he translated before the production of creative writing (TBCRW). It also revealed that none of the creative writing short stories is close to J-D's translational stylistic profile in the short stories he produced after the production of creative writing. The same experiment was conducted with different n sizes (n = 2, 3 and 4) and the results did not change. The results also showed that the syntactic style of none of the creative writing short stories is close to that of the control corpus. This confirms that the syntactic style of the creative writing short stories was impacted by J-D's own style in the short stories translated before the production of creative writing. That is, the syntactic style or conventions of the target language did not have any impact on J-D's personal syntactic style in his creative writing.

36 Translational stylistic profile refers to the machine's internal pattern recognition of the POS n-grams in J-D's translations.
37 Authorial stylistic profile refers to the machine's internal pattern recognition of the POS n-grams in J-D's creative writing.

Word n-gram Analysis

The last style marker analyzed in this study using machine learning is the word n-gram. As pointed out in the methodology chapter, word n-gram analysis reveals lexical patterns related to the author's own lexical choice and to the document theme or topic. This study sets n to 3 and 4 in the word n-gram analysis. Figure 18 below shows a screenshot of the first two outputs of the word 3-gram analysis:

FIGURE 18: MACHINE LEARNING WORD N-GRAM ANALYSIS
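As a rough illustration of what a word n-gram style marker looks like, the Python sketch below counts word 3-grams and 4-grams in a short text. It rests on assumptions that are not taken from the study: tokens are obtained by lowercasing and splitting on whitespace, the function name is hypothetical, and the example sentence is invented. JGAAP's own tokenization may differ.

```python
from collections import Counter

def word_ngrams(text, n):
    """Count overlapping word n-grams, using lowercase whitespace tokens.

    Assumption: simple whitespace tokenization; punctuation stays attached to words.
    """
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Invented example sentence
sentence = "He sat by the river and he sat by the fire"

for n in (3, 4):
    counts = word_ngrams(sentence, n)
    print(n, counts.most_common(2))
```

Repeated sequences such as "he sat by" receive higher counts, which is how recurring lexical habits, and also recurring topics, become visible in this style marker.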
