Building the Australian Legislative Corpus 2023: Combatting Issues and Highlighting Applications of General Legislative Corpora

Abstract

This article introduces and details the construction of the Australian Legislative Corpus 2023 (‘ALC23’). The ALC23 includes relevant legislation from each Australian jurisdiction that was in force as at 30 June 2023, and is intended to act as a general corpus of a specialised nature. The article begins with a brief overview of corpus linguistic applications in the legal sphere and a summary of current legal corpora. It then details the composition of the ALC23 and discusses how issues in its construction were overcome. The article also notes potential applications of the ALC23, including a case study of how the word “gender” is used in the corpus. It concludes with some potential limitations that may be of note to both corpus linguists and legal scholars. Crucially, this article is written from the perspective of a legal scholar, so the creation and use of the ALC23 are intended to be accessible to scholars who have a limited background in linguistic theory, method, and programming.

Cite as: Genovese, JLL 14 (2025), 30–62, DOI: 10.14762/jll.2025.030

Keywords

legal corpora, corpus linguistics, legal linguistics, law and language

