
Research Hive

Insights into tools and methods to implement your research, written by CCSS staff and Data Science Fellows!


How to Tame Text Data: An Introduction to Regular Expressions

January 17, 2023

By: Remy Stewart, CCSS Data Science Fellow 

While text data has been growing in popularity within social science research, it is also a particularly challenging data type to clean, process, and extract information from. Regular expressions, commonly referred to as regexes, are a special syntax grounded in the strategy of pattern matching that enhances programmers' ability to work efficiently with text data. This guide provides a brief overview of the power of regular expressions through examples in both Python and R.

From SPSS to R: A Personal Voyage

A personal journey from a clinical setting and SPSS to a research laboratory and hands-on R.

June 6, 2022

By: Samantha De Leon Sautu, CCSS Data Science Fellow



    “Exploration is in our nature. We began as wanderers, and we are wanderers still. (…) We are ready at last to set sail for the stars.” – Carl Sagan

    Eight months ago, I embarked on several transitions: I moved from Panama to the United States, stopped practicing clinical medicine to pursue a Ph.D., and abandoned survey-based research to start looking at fMRI data. So far, the English language has been welcoming. The scientific principles in hospitals and labs remain the same; however, my ship for navigating data was swept from under my feet.

    [Photo: Headshot of Samantha De Leon Sautu]

    Back home, for community research (mostly Knowledge, Attitudes, and Practices studies), I relied on SPSS (Statistical Package for the Social Sciences), an IBM-owned software. In my new lab, we scrutinize our data with R, an open-source software environment. From beginner to beginner, I will share with you the first highlights of this transition in case you are interested in taking the same jump. Of course, for some researchers and students, SPSS will meet their needs.

    At the station: Where is my data?

    As Excel and SPSS users, we expect some sort of spreadsheet to pop up first thing upon opening an XLS, CSV, SAS, or Stata file. There is no such thing with R! Still, it is pretty simple to display your data in this style in R too, whether you want to see the headings and a few data points or your whole matrices.

    SPSS has a user-friendly interactive interface that makes you feel like you are using just another Office program. R instead has no graphical user interface, only an environment for you to input command lines.

    First stop: Where are my buttons?

    When coming from the land of clicks, it may be disorienting to find no drop-down menus and no buttons in R. But fear not, we have command lines instead. R is an open-source programming language with complex syntax. It is not beginner-friendly, but it will let you do better modeling and data manipulation than SPSS. Because R is used by a much larger audience and is open source, documentation and packages for endless functions are available for free online. Take my research, for example: I want to look at Heart Rate Variability (HRV), but I don't want to write my own code to eliminate outliers and unacceptable physiological values. Freely accessible online libraries contain the RHRV package, designed for that and a lot more. I download it, install it, and voilà, I build on the shoulders of giants.

    Second stop: What do you mean, load the data?

    Data management in R may make you miss SPSS. SPSS can handle all the data you have input at the same time, while R requires you to load the data into memory before executing commands. Another thing you may miss from SPSS is decision trees. Such algorithms in R require specific packages implementing CART (Classification and Regression Trees), such as rpart, and some training.

    The ride: Wow, these graphics!

    Did you ever try to customize your graphs in SPSS? If you did, I feel for you. Popular R packages for data visualization are ggplot2 and Shiny. These will let you build personally tailored visualizations and even make them interactive.

    The ticket: RStudio is your business class ride

    R is free and open-source; to use SPSS, you must pay for a license. And the good news is that there is a way to enjoy a graphical user interface for R: RStudio. RStudio is an environment that will let you see your console, edit your syntax, view your history, and manage your packages and datasets. Moreover, it will allow you to code in both R and Python.

    Destination: To self-correcting science

    Both of these tools for statistical data analysis have their ideal scenarios. R may bring more versatility but requires a steeper learning curve, while SPSS may offer a user-friendly interface better adapted for beginners. Whichever ship you choose to navigate your data, we hope you discover all its secrets and share them with us.

OpenRefine: A Lightweight Data Cleaning Tool to Add to Your Workflow

March 14, 2022

By: Aspen Russell, CCSS Data Science Fellow

When working with big data, the issue of data management and cleaning is always at the forefront. OpenRefine is a free and open-source software (FOSS) tool to add to your toolkit that can help with data exploration, organization, and cleaning. Importantly, it's lightweight and can export right into Python, R, or other data cleaning tools to help your workflow. You can even save and export your cleaning steps, which can be used for reproducibility, optimizing your process, or sharing with other researchers.


    When your research includes data, especially a lot of it, you will almost always encounter issues with data cleanliness and structure. Whether these issues are with incompleteness or inconsistencies, you probably have been in search of tools and techniques to help you fix these issues to aid in the quality of your analysis.

    [Photo: Headshot of Aspen Russell]


    OpenRefine is one of these tools.

    I research online communities, which means I work with APIs quite a bit. Reddit, Facebook, YouTube, Twitter, and even a few odd ones like arXiv. Often, messy data. A lot of messy data. Most of my data cleaning and analysis happens in R or Python, but I’ve recently been adding OpenRefine into the data cleaning workflow.

    In my opinion, OpenRefine is a powerful tool for three reasons. First, it allows you to clearly document and export your cleaning steps. This makes reproducibility much easier and will save your sanity at the end of a project. Second, anything you do in OpenRefine can be exported into another space (like R or Python) for analysis, with no headaches over proprietary file types. Even if you are a great programmer, there are situations where a tool just makes things *easier,* so having interoperability is great. Lastly, it is fast and it is free. I understand being cautious about changing your data workflow, so an addition has to add to efficiency and ease. OpenRefine doesn't eat up a lot of time booting up and running a short list of standard checks.

    What is OpenRefine?

    OpenRefine, formerly Google Refine, is a free and open-source tool that is a part of Code for Science & Society. Importantly, OpenRefine isn’t for *data creation,* it is for data exploration and cleaning. If your data is local to your computer, then that’s where it stays for cleaning. OpenRefine runs as a local server on your computer.

    But what does it *do*? The best definition may be the one on their website: “a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.”

    Cleaning, Transforming, and Extending Your Data

    OpenRefine is powerful in that learning a few basic skills can add up quickly to tackling much more complex problems. Here are some common use cases below:

    Do you need to isolate entries based on specific criteria for removal or revision? Create a text filter.

    Do you need to double-check your variables are all named uniformly? Create a text facet.

    Do you need to remove pesky remnants from scraped data such as links or HTML while still keeping the original data intact? Transform and select your removal criteria. One command, all of the text you want gone is now gone.

    Do you need to split your data based on some criteria? Add a column and select your specific criteria to split the data (could be text or symbols for example).

    Need to export your data for a particular setting? Use the templating Export feature. It defaults to JSON, but can be exported to other settings like Wikipedia! See the OpenRefine extensions for more.

    Worried that you spent a long time cleaning and have to do a similar task? Extract the operations from your last project, apply the operations to the new dataset, and you have finished cleaning!

    And of course, you can always combine any and all of the above! In general, OpenRefine works best with a solid understanding of filters and facets. These allow you to isolate the data you want to explore, change, remove, or add to. I hope that the list above revealed at least one case where another tool you use is frustrating and that OpenRefine could help out.

    For those who are now onboard, you may be thinking “This all sounds well and good, but what happens when you are sitting with a messy Excel sheet in front of you? Where do you even begin?”
    Well, it just so happens we have a tutorial for you below!

    Installation and Setup

    You don't need the internet to use OpenRefine, but you do need it for the download. There are three requirements to run OpenRefine on your computer: you are running Mac/Windows/Linux, you have Java installed, and you are using one of the major browsers (Chrome, Opera, Edge, etc.). Firefox does work, but can be wonky at times. There is even a Windows version with Java built in if you don't know the status of Java on your computer.

    Go to OpenRefine.org and click download. I personally would pick a stable release that matches my operating system and go ahead and download, unzip, and open the program.
    If you are running into trouble, full instructions are at https://docs.openrefine.org/manual/installing.

    Extensions

    Just like your browser, OpenRefine has a list of extensions that add specific functionality to the tool: extensions for working with geolocated data, for making your data Findable, Accessible, Interoperable, and Reusable (FAIR), and more.

    There are even packages to use some of OpenRefine’s functionality in Python, R, Java, and a few other languages.

    You can find extensions on the OpenRefine website under download.

    Running OpenRefine

    Once you have downloaded and installed OpenRefine, you will have an executable file (that blue gem icon) that you can open to start OpenRefine on your computer.

    Two things will happen when you open the program. First, a command-line window will open and start filling up with text. Don't worry, this is normal. When they talk about a server running on your computer, this is that server booting up. All this means is that OpenRefine needs to create a space, somewhere on your computer, to run the program. In this case, it runs in your browser. That is the second window that will open. You can safely minimize the command-line window and turn your attention to your browser.
    If you don’t get to spend much time in command lines, take a peek if you’re curious. You aren’t quite Neo or Trinity, but you are closer!

    An Example

    At this point you should have an instance of OpenRefine running on your computer. A window should appear in your web browser prompting you to create a project and import some data. If that didn't happen, try copying and pasting one of these into your browser: http://127.0.0.1:3333/ or http://localhost:3333/. These addresses are where the program tells the server to set up shop in a browser.

    Data

    Feel free to open a data file you have on hand. You should be able to make the tutorial commands work with most datasets. Remember that OpenRefine doesn’t mess with your original file, so it is a great space to play around.

    If you don't have something on hand, here is the training data we use in workshops: https://figshare.com/ndownloader/files/30848404.

    This dataset is from the Studying African Farm-Led Irrigation (SAFI) database. Entries have been programmatically messed with to represent some of the more common data issues you will run into.

    Your data doesn't *have* to be local; you can also use a web address, the clipboard, a database connection, or Google Data.
    Browse for your file of choice and click next.

    Review

    This step can be a bit overwhelming, but really is just: did we import your data correctly? Are the variables correct? Did we split up the columns right? Are all of the entries there?

    Below the preview window of your spreadsheet, there are a series of options that you can tweak if your preview doesn’t look right. Make sure the file type is correct and that the other minor settings line up. Most of the time this will work with no tweaking necessary, but it is always good to know what it is asking of you.

    Give the project a name and a tag if you want to sort things later and hit “Create Project.”

    Faceting

    Now we will go through a suite of common techniques that are useful in OpenRefine, first off is one of the most important: Facets.
    Basically, Facets group data by like values in a column. Then, you can manipulate the data from there. It will make more sense when you create one, so let’s do that.

    1. Select a column using the drop-down arrow, in this case I will use the column “village.”
    2. Select Facet → Text Facet as the entries are text.
    3. You should see a window populate with some familiar entries on the left-hand side. These are the unique entries in the column and the number of each.
    4. Sorting by name, an immediate problem emerges: entry inconsistency. A very common problem working with either user-collected or scraped data.
    5. Here you can hover over entries and edit them. This will correct the naming issue across all misnamed entries. You can also click on a set of values to inspect the other information for their entries. Click again on the entry in the Facet tab and you will return to the full dataset.

    Notice that two things happened when you corrected the spelling issue. First, the data entries consolidated into one cohesive label (yay!). But the second thing that happened was that an entry was added under the “Undo/Redo” tab.

    Click over and inspect how OpenRefine logs your changes. If you need to review, export, or change what steps you have taken, that’s the place to do so!

    Facets are quite powerful in tandem with the logging of changes. If you have a column with a different data type, like numeric or temporal data, try a different type of Facet!

    Clustering

    While faceting is nice for viewing categories and filtering data, clustering zeroes in on the issue of misspellings and stray white space in entries, which may be misrepresenting your outputs. With a programming approach it can take a few steps to identify the likely suspects for merging and then do the merge itself. In OpenRefine, it's a little simpler. Let's try some clustering:

    1. Within the open Text Facet you have, in our case the “village” column, click the Cluster button.
    2. Next, pick out a Method and Keying Function (if you are really into learning more, check out this article on what all of the different methods do and which data they work best on: https://docs.openrefine.org/manual/cellediting#cluster-and-edit). For the purposes of our data, we will use the “key collision” Method and the “metaphone3” Keying Function. If you don't know what to use, it is pretty easy to just start picking different options and seeing if your data starts popping up.
    3. Check the “Merge?” box and select “Merge Selected and Recluster” to finish off the merge. Done!

    Getting familiar with your data and having an internal sense of what could be wrong with it helps point you to the right feature of OpenRefine.

    Transforming Data

    Here is another one of those big three (clean, transform, and extend) features of OpenRefine. We are going to take a column of multiple entries and make it more analysis-friendly. In our dataset, we are going to work with the “items_owned” column (go ahead and ctrl/cmd+f if you are having trouble locating it).

    1. Click the drop-down arrow for “items_owned” and select Edit Cells → Transform...
    2. A window will pop up asking you for an “expression.” The expression they are asking for is written in General Refine Expression Language, or GREL, which isn't as technical as it sounds. We just have to write something that the computer will understand and execute. With a little practice, it will become a lot easier! In this case, we want to remove the brackets ("[" and "]") from the entries.
    3. Enter value.replace("[", "") into the expression box. Let's translate this directly. value.replace should feel self-explanatory; this part says what we want to do. We could also add a value, trim entries to a certain max length, or edit specific entries (more in the GREL manual: https://docs.openrefine.org/manual/grel). Okay, back to the expression. Within the parentheses, the first set of quotes says what we are looking for and the second set says what to do with it when we find it. Here, we want to replace left brackets with nothing. Not a space, but nothing, essentially deleting it.
    4. Try replacing that other bracket on your own! OpenRefine has a preview window, so double-check the output looks as you expect and hit Ok.

    At this point, you should have the same list of entries, but this time without pesky brackets. Now what? Well, what if we wanted to see which items are owned by a lot of people? If your answer was a Text Facet, then well done!

    1. Try creating a Text Facet like we did in the first example. Not quite right, is it? No, there is an issue: all of the entries are bundled up in each cell. That won't work; we need to split them up.
    2. This time, clicking again on the drop-down for “items_owned,” select Facet → Custom text facet...
    3. Interesting, this looks more like a transform than a Facet, right? Here we are once more asked for an expression, and this time we want to split. Try this: value.split(";") and click Ok. Just like the command above, we are asking the program: “every time you see a semicolon, separate the entries.” Voila, a list of *individual* items owned.
    4. Create a standard Text Facet and sort by “count” to answer our original question.

    GREL is very powerful, and this simple expression barely scratches the surface of the data massaging you can do. You can also combine multiple commands into one expression, speeding up the process!
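
    If you later pull an OpenRefine export into Python, the same massaging is only a few lines of pandas. Here is a minimal sketch, assuming a CSV export with the same items_owned column (the file name is hypothetical):

        import pandas as pd

        df = pd.read_csv("safi_export.csv")      # hypothetical OpenRefine export
        items = (
            df["items_owned"]
            .str.replace("[", "", regex=False)   # like value.replace("[", "")
            .str.replace("]", "", regex=False)   # ...and the right bracket
            .str.split(";")                      # like value.split(";")
            .explode()                           # one row per individual item
        )
        print(items.value_counts())              # our sort-by-count Text Facet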

    Filtering

    Here is a simple one that is used quite often: Text filter.

    1. Click the drop-down on a column (we will use “respondent_roof_type”) and select “Text filter”.
    2. You should have another box on the left-hand side pop up, but this time with a search box. You can use regular expressions if you have that skill, otherwise just enter something you want to filter by, such as “mabat.”
    3. Make sure to check how many rows you are displaying on the top of the spreadsheet, but otherwise you should see every entry in that column with the search term you entered!

    This can be a great and simple tool for performing early data exploration.

    Saving/Exporting Project

    OpenRefine saves as you go, so there is no need to obsessively check your save state. In the top right of the OpenRefine window there is the option to export. You can export your project to a myriad of file types and places. You can also just export the steps you took in OpenRefine, which is great for sharing with colleagues or for open science.

    The Final Word

    I hope this short explanation and walk-through of OpenRefine has been helpful. If you end up using OpenRefine a lot and want to take your skills to the next level, consider pulling its functionality into a robust programming language like Python or writing scripts to automate some of the more reliable checks.

    If there is one thing I want readers to take away from this overview, it is this: don't use Excel for data work!

    Okay, there are a lot of other things, but there isn't an excuse anymore not to use tools that are meant for data exploration, cleaning, and transforming. OpenRefine isn't the end-all-be-all for data cleaning, but it does many things exceptionally fast, in a transparent way, and for free!

    I find myself running my data through OpenRefine to get an overview of what problems it may have. It could be qualitative data that has inconsistent labeling, or social media data that has a lot of embedded links or HTML that slipped through. I solve what I can in OpenRefine before throwing it into a dataframe in R or Python for further cleaning and/or analysis. Another tool in the toolbox.

    Happy cleaning!

    Resource Dump + Acknowledgements

    This blog post builds on work done by Lynda Kellam, Head of Research Data Services at the University of Pennsylvania. The applied aspects, like the dataset, were adapted from The Carpentries' “OpenRefine for Social Science Data” workshop, licensed under CC-BY 4.0.
    For those interested in learning more, that workshop and the OpenRefine manual (https://docs.openrefine.org) are great places to start.

Expectation vs. Reality: Cultivating Transparency in Social Science Research

March 3, 2022

By: Sabrina Porcelli, CCSS Data Science Fellow


[Image: Preregistered badge]


    As social scientists, we all know about the reproducibility crisis.1 We are told over and over again how numerous findings fail to replicate when existing studies are recreated by other researchers.1 This issue has driven the open science movement,2 which strives to make research more open and transparent with the goal of creating more reproducible science. One prominent tool that has emerged from this movement is the Center for Open Science's (COS)3 Open Science Framework (OSF).4 OSF is an open-source tool for developing and storing research project workflows, including data, code, and much more.

    [Photo: Headshot of Sabrina Porcelli]

    OSF offers three types of badges for research that is open: (1) open data, (2) open materials, and (3) preregistration.5,6 Open data badges are given when researchers share their data in an online repository, with one option being through the OSF platform. When sharing data openly, researchers have to be extremely cognizant of the privacy and confidentiality of their participants. Open materials badges are awarded when researchers provide materials such as measures, stimuli, or code. When sharing code openly, researchers need to write their code in a way that can be easily reproduced by other researchers, including adding comments to enhance clarity and organizing analyses in a clear and logical way. Preregistration badges are earned by providing a written record of the researcher's plan for their research study before they actually conduct it. This can take many forms, including "AsPredicted," "OSF Prereg," and "Secondary Data Preregistration."7 While researchers should and most often do go into their projects with a plan, writing a preregistration involves taking these ideas a step further by fully outlining the key variables, measures, and plans for analyses and publishing these in an open forum.

    These tasks are necessary to create greater credibility and reproducibility within social science research. Without such transparency in research practices, it is difficult to really know what went on behind the scenes compared to what is included in a final publication. However, all of these tasks can be incredibly time consuming. To incentivize researchers to engage in these practices, many social science journals have begun awarding OSF badges to publications, which are then displayed next to the article for readers to see.6 Even so, researchers may not see open science as worth the time investment or may have concerns about what to do if they need to deviate from their originally preregistered protocol.

    The way the open science system is built makes it difficult to enter and intimidating to deviate from. Researchers may feel incentivized to stick to a preregistered protocol in order to uphold their initial (and public) promise regarding their study. However, this is not always what is best for a particular study. Changes in circumstances or researcher knowledge that occur throughout the research process may warrant deviations from preregistered protocols. For instance, in my own research, I worked with a team on a preregistered study exploring purpose in life and trait-urgency in relation to risky and self-destructive behavior.8,9 Our initial plan was to use either zero-inflated negative binomial or zero-inflated Poisson models to examine interactions between purpose in life and urgency in predicting behavior. However, we received advice from a statistical consultant that this may not be the best approach given the extreme skew of our data, and that we should switch to a multinomial approach where we created groups of no/low risk, moderate risk, and high risk behavior cutoffs. So, we chose to deviate from our initial protocol, which resulted in the creation of a document to track any and all deviations. Then, upon submitting our paper for review, we learned from reviewers with strong statistical knowledge in this specific area that our initial strategy of negative binomial regression would have been best, particularly when paired with a new strategy for testing interaction effects at different points within such models.10 After conducting further research into this approach, we agreed with reviewers and decided to learn this new method and completely redo our statistical analyses, which led to essentially rewriting our entire paper. This led to even more deviations and a great deal of explanation both in-text and in our deviation document.

    Since I was part of a team that highly valued both open science and conducting the most appropriate analyses given our questions and variables, we all agreed to be completely transparent about every step of this process and to alter our analytic strategy multiple times for the sake of the study. However, I can imagine that not all researchers would have reacted the same way. We could simply have taken our paper to a different journal, stuck with the strategy suggested by the statistical consultant, and gotten the paper published as it was. Alternatively, we could have ignored the advice of the statistical consultant and stuck with our original preregistered plan, but left out the tests of interactions at different levels of each variable, since we originally were not aware of these methods. Each choice would have painted an entirely different picture of our data and would not have been in the interest of doing the best research possible. However, the process of being transparent about all of these changes was a serious struggle and an immense time commitment under the previous preregistration system.

    In response to such concern over protocol deviations, on December 3rd, 2021, OSF announced an update to their preregistration system which allows researchers to “transparently report changes in [their] preregistrations.”11 Previously, such transparency could only be achieved through notation within a manuscript or in supplemental materials. This additional step of determining how and when to disclose deviations may have deterred some researchers from engaging in preregistration practices at all. The shift to a modification process that is integrated into the OSF system makes it much easier to report changes to an existing preregistered protocol and may increase researchers’ willingness to engage in open science practices. This OSF update is not just a huge shift in the process of preregistration, but is also a shift in how we think of preregistration. My hope is that the shift to an integrated modification system will enable researchers to feel more comfortable both reporting changes to preregistration and choosing to make changes to begin with when the study warrants.

    Another difficulty with the open science system is that researchers may be penalized for not conducting their research in the “perfect” open way in the eyes of reviewers. On this same project, my research team submitted our preregistration after the data were already collected, as the study was part of a larger thesis project for a master's student. In the "AsPredicted"12 template for preregistration, the first question asks if any data have been collected and allows “Yes, at least some data have been collected for this study already” as one answer option. However, our editor took issue with this. They felt it was difficult to know when we developed our research plan given that the data were already collected at the time of preregistration. This is a very difficult predicament to be in. We completely understood the point of our editor: it is difficult to prove when we developed our plan, and many researchers might take advantage of this option to develop a protocol after looking at their data. However, such a stringent view of preregistration brings into question whether preregistering any studies with already existing data is worthwhile, and it would deter a plethora of researchers from preregistering, especially any who conduct research with secondary data. In my view, preregistering at any stage is worthwhile as it allows for a record of researchers' intent, which can then be compared to the final product. Further, research with secondary data is highly valuable, and creating transparency in such studies would be highly beneficial to any field.

    As open science continues to grow, I hope it will also continue to adapt. Being open about our research should not be contingent on being “perfect.” Science is not perfect, and continuing to add this pressure and stigma to the open science process will alienate researchers from conducting open research. This will only contribute to and exacerbate the reproducibility crisis, rather than reduce it. The recent update to the OSF preregistration system is a step in a positive direction for increasing adaptability in open science. Hopefully this shift in protocol is followed by an equal shift in mindset from reviewers, journals, and other researchers so that scientists feel comfortable being open about any deviations from their existing preregistrations.

    References:

    1. Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251). https://doi.org/10.1126/science.aac4716
    2. United Nations Educational, Scientific and Cultural Organization. (n.d.). Open Science Movement. http://www.unesco.org/new/en/communication-and-information/portals-and-platforms/goap/open-science-movement/ 
    3. Center for Open Science. https://www.cos.io/ 
    4. Open Science Framework. https://osf.io/ 
    5. Open Science Framework. (n.d.). Badges to Acknowledge Open Practices. https://osf.io/tvyxz/wiki/home/ 
    6. Center for Open Science. (n.d.). Open Science Badges. https://www.cos.io/initiatives/badges 
    7. Open Science Framework. (n.d.). Templates of OSF Registration Forms. https://osf.io/zab38/wiki/home/ 
    8. Ratner, K., Porcelli, S. E., & Burrow, A. L. (2021). Purpose in life, urgency, and the propensity to engage in risky and self-destructive behaviors. Motivation and Emotion. https://doi.org/10.1007/s11031-021-09915-0 
    9. Project repository for Purpose in life, urgency, and the propensity to engage in risky and self-destructive behaviors. https://osf.io/bgcsm/ 
    10. McCabe, C. J., Halvorson, M. A., King, K. M., Cao, X., & Kim, D. S. (2020, April 8). Estimating and interpreting interaction effects in generalized linear models of binary and count data. PsyArXiv. https://doi.org/10.31234/osf.io/th94c
    11. Brian Nosek [@BrianNosek]. (2021, December 3). Even the best laid plans sometimes don't go as expected in reality. [Video attached][Tweet]. Twitter. https://twitter.com/briannosek/status/1466814366469218317?s=12
    12. AsPredicted. https://aspredicted.org/

Fine-Tuning A Transformer Language Model to Summarize Legislative Documents

March 3, 2022

By: Remy Stewart, CCSS Data Science Fellow

Milestone innovations in deep learning models for natural language processing (handling text as data) are leading the way for applying language-based AI across academic, industry, and government use cases. In this coding demo, I provide an accessible introduction to a text summarization model designed to make government legislative documents more concise and simpler to read through summaries generated by machine learning.

[Image: Signing of a congressional document]


    Check out the Google Colab notebook this post is based on here!

    Documents produced across all levels of government are notoriously lengthy, can be confusingly formatted, and are often filled with legal jargon. While there's a long history of government and legal precedent that guides the creation of legislative documents, it's undeniable that said standards keep critical information about policy initiatives that can directly impact individuals' daily lives inaccessible to the general public. My research explores how technology and digital platforms are reshaping social dynamics relevant to policymaking in American cities. As a computational social scientist passionate about developing machine learning methods that have a direct social benefit for diverse stakeholders, I’ve been investigating how to use state-of-the-art tools to work towards a viable solution to this ongoing disconnect between government and the communities directly impacted by policy design.

    Headshot of Remy Stewart

    Text summarization is a specialized task within natural language processing (NLP) that can aid with streamlining government texts into more concise and simplified formats. The technique has flourished with the rise of transformer-based language models within NLP. I won't go too far into the details of transformers within this post and instead refer you to other great resources (1, 2, 3) to learn their mechanics. In brief, transformers are a subset of deep learning model “architectures” with a comparatively innovative structure for how the model learns the context of how language is used within a text dataset.

    Transformer models trained for summarization take long text passages and create original summaries of the core themes learned from the original document. They have been used to provide accessible synopses of complex text sources such as lengthy news reports, Wikipedia articles, and academic publications. One of the many perks of transformers is that they can be accessed as downloadable pre-trained models, previously developed on massive text datasets, through data science platforms such as Hugging Face. Hugging Face is a fantastic open-source community that provides the transformers library in Python for easy implementation of these models, as well as both model and data repositories for direct downloads.5 I will therefore use an available pre-trained text summarization model that I will subsequently fine-tune for the specific use case of interpreting government legislation.

    BillSum is a dataset introduced by Kornilova and Eidelman4 that provides US Congressional documents paired with summaries of the proposed legislation across a range of policy topics. There are almost 19,000 document records that I can use for model fine-tuning. The following proof-of-concept model has a range of potential applications for increasing the transparency of proposed legislation beyond regulatory legalese across government bodies and their various legislative documents.

    Throughout this walkthrough, I’ll be keeping technical details brief and instead focus on the simple workflow that the transformers library in Python offers to fine-tune your own language models for a wide range of uses. We'll start by loading in our necessary libraries and the BillSum data set as publicly available from Hugging Face's data repository.

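    As a rough sketch of that step (the exact code lives in the linked Colab notebook), the datasets library can pull BillSum directly from Hugging Face's data repository:

        # A minimal sketch, assuming the public "billsum" dataset on the Hugging Face Hub.
        from datasets import load_dataset

        billsum = load_dataset("billsum")
        print(billsum)  # a DatasetDict with "train", "test", and "ca_test" splits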

    The above output demonstrates that BillSum comes pre-split into training and testing sets, reflecting the core protocol within supervised machine learning of having your model first learn underlying relationships within a data set before testing its performance on unseen data. "ca_test" refers to a selection of California legislative records, in contrast to the US Congress as the primary document source for training and testing. We'll focus on the congressional records for model fine-tuning and then draw from the California documents for final model testing. Let's get a closer look at one of the records within the training dataset to understand its key features.

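    Peeking at a single record might look like the sketch below; each entry pairs the full bill text with a hand-written summary (and a title):

        # Inspect the first training record; BillSum exposes 'text', 'summary', and 'title'.
        example = billsum["train"][0]
        print(example.keys())
        print(example["summary"][:300])  # the opening of the ground-truth summary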

    The 'text' records capture the congressional report in its entirety, while the summary focuses on the key point of business liability when working with nonprofits. These are, respectively, the input data that the model trains on and the “ground-truth” hand-written summary to compare the model's generated summary against. There is still a fair amount of meaningless character noise throughout the full legislative document records, which is quite common when working with text data. I therefore create a function with a series of regular expressions (regexes) tailored to remove the types of noise present within the full documents, such as HTML line break indicators, excessive white space, and unnecessary punctuation. While regexes can certainly look like character gibberish as well, each featured string captures specific components to remove within the text through the regex package's sub command.

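    A minimal sketch of such a cleaning function is below; the specific patterns are illustrative stand-ins rather than the notebook's exact regexes, here using Python's built-in re module:

        import re

        def clean_text(text):
            text = re.sub(r"<[^>]+>", " ", text)  # HTML remnants such as line breaks
            text = re.sub(r"\s+", " ", text)      # collapse excessive white space
            return text.strip()

        # Apply the cleaning to every document and summary in each split.
        billsum = billsum.map(
            lambda record: {
                "text": clean_text(record["text"]),
                "summary": clean_text(record["summary"]),
            }
        )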

    Now that our legislative bill dataset is cleaned and preprocessed into a format ready to be read into our transformer language model, let's dig into the details of my model of choice for the text summarization task at hand. I'll be fine-tuning a DistilBART model that was originally trained on a large data set featuring summaries of news articles. I'll explore two key concepts behind why DistilBART is an ideal model for building my legislative document summarizer: the novel architectural strengths of the original BART models, and the advantages of distilled language models for resource-limited environments such as my own.

    BART features a sequence-to-sequence (Seq2Seq) model architecture, which makes the framework ideal for NLP tasks where the sequence of words is of particular importance, such as generating sensible summary paragraphs of an input document where sentences should follow a natural order.6 BART's unique design allows the model to accurately create a logical sequence of the core components of the original text that should be included within its generated summaries. Distillation, in turn, is a technique to mitigate the computational resource demands of complex language models. The widespread creation of distilled versions of popular transformers, such as DistilBART, allows these models to be used by a wider audience within the machine learning community that often has limited access to computational resources.

    DistilBART requires text data to be “tokenized”, that is, prepared in a set format, to be successfully read into the starting layer of the model's architecture. I therefore instantiate both the DistilBART model itself and its complementary tokenizer, conveniently provided through calls to Hugging Face's fantastic model library.

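    A sketch of that instantiation; sshleifer/distilbart-cnn-12-6 is assumed here as the checkpoint, a DistilBART model pre-trained on news summaries:

        from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

        checkpoint = "sshleifer/distilbart-cnn-12-6"  # assumed pre-trained checkpoint
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)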

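    Tokenizing a short, hypothetical example sentence might look like:

        sample = "This bill amends the Internal Revenue Code."
        encoded = tokenizer(sample)
        print(encoded["input_ids"])       # the token id values for each (sub)word
        print(encoded["attention_mask"])  # all 1s: every token is considered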

    You can see via the example sentence how the input text is tokenized into respective token id values. The ‘attention_mask’ output is a binary indicator for whether the model will consider the specific token when learning its text summarization task, with the sequence of 1s meaning that all of the words are indeed included. I'll skip delineating the tokenization process for our BillSum data due to its technicality, but for interested readers, I'll point you to the Google Colab to learn more.

    With the BillSum documents and summaries now prepared to be read into the DistilBART model, I instantiate both the training model and its associated arguments. Transformers, like deep learning models generally, feature a range of hyperparameters that allow data scientists to predetermine the model's behavior on core actions, such as how many times it will conduct a complete review of the data set, and how often and how much it will adjust the importance of certain components of the text for achieving its task of accurate document summarization. Hugging Face transformers come with default values for each hyperparameter, but there are a few I'd personally like to adjust as best suited for this demo. I determine those by creating a “training_args” object, and then finalize our model with its associated hyperparameters, training data, and testing data.

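    A sketch of those pieces is below. The tokenization step mirrors what the Colab does in spirit, and the hyperparameter values are illustrative choices rather than the notebook's exact settings:

        from transformers import (
            DataCollatorForSeq2Seq,
            Seq2SeqTrainer,
            Seq2SeqTrainingArguments,
        )

        def preprocess(batch):
            # Tokenize the documents as model inputs and the summaries as target labels.
            model_inputs = tokenizer(batch["text"], max_length=1024, truncation=True)
            labels = tokenizer(text_target=batch["summary"], max_length=256, truncation=True)
            model_inputs["labels"] = labels["input_ids"]
            return model_inputs

        tokenized = billsum.map(preprocess, batched=True)

        training_args = Seq2SeqTrainingArguments(
            output_dir="distilbart-billsum",  # hypothetical output folder
            num_train_epochs=1,
            per_device_train_batch_size=2,
            weight_decay=0.01,
            evaluation_strategy="epoch",
        )

        trainer = Seq2SeqTrainer(
            model=model,
            args=training_args,
            train_dataset=tokenized["train"],
            eval_dataset=tokenized["test"],
            data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
        )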

    And with that, we're ready to train the model! Our loss metric steadily decreases across batches and obtains its lowest score on the test set, which is an encouraging sign that our model has learned how to better summarize the legislative documents across the individual records. There are specific metrics we can use to measure our text summarization performance, such as ROUGE similarity metrics,7 but I'll leave exploring those to a more resource-abundant computing environment than Google Colab.

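    Kicking off fine-tuning is then a single call, which logs the training loss across batches and the evaluation loss on the test split:

        trainer.train()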

    The best way to understand our fine-tuned model's performance is to directly compare its generated summaries to human-written summaries on a set of previously unseen documents. I therefore test our fine-tuned model on three legislative records sourced from the California state government. This additionally allows us to test the model's performance on documents similar to the US Congressional files it was trained on, but that feature a slightly different format and language use unique to California government compared to the federal level. After creating another function to generate model predictions, as featured in the Colab file, I combine the original documents (column “0”), the ground-truth summaries (column “1”), and the generated predictions (column “2”) into a single data frame.

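    A sketch of that prediction step, generating beam-search summaries for a few California bills and collecting the documents, ground-truth summaries, and predictions into columns 0, 1, and 2:

        import pandas as pd

        def generate_summary(text):
            inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
            output_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=150)
            return tokenizer.decode(output_ids[0], skip_special_tokens=True)

        ca_records = billsum["ca_test"].select(range(3))
        results = pd.DataFrame(
            [[r["text"], r["summary"], generate_summary(r["text"])] for r in ca_records]
        )
        print(results.head())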

    The legislation itself, the ground-truth summaries, and the predicted summaries are all quite long, so I'll stick to just showing the initial few words of each. Overall, the model generates good-faith predictions that reflect the overarching theme of the primary legislative document. While it certainly struggles with some core points, such as missing sections of the lengthy legislation by only focusing on a few clauses within its summary, this serves as promising evidence that this framework can be built upon to craft increasingly accurate summarization models. Incorporating more task-specific performance metrics, doing more complete passes (known as epochs) over the entire data set, and tinkering with hyperparameter options would all be great next steps to improve this baseline model.

    As transformer models continue to advance in accuracy and in their capability to handle complex NLP tasks such as generating domain-specific summaries, their potential benefits for initiatives such as increasing government transparency and streamlining communication with the general public will become even more of a reality. These resources are becoming increasingly accessible to data scientists, facilitated by open-source initiatives such as Hugging Face and the transformers Python library. Transformer models are undeniably powerful, and as with all AI, they must be handled with care to ensure they provide more collective good than harm. There are many potential applications for automated document summarization, and it's my belief that computational social scientists are in a unique position to ensure that said innovations are cultivated and implemented for use cases aimed specifically at widespread social benefit.

    References:

    1. David Mimno, Melanie Walsh, and Maria Antoniak. 2022. “BERT for Humanists.” www.bertforhumanists.org/

    2. Jay Alammar. 2018. “The Illustrated Transformer” jalammar.github.io/illustrated-transformer/

    3. Hugging Face. 2022. “The Hugging Face Course.” huggingface.co/course/chapter1/1

    4. Anastassia Kornilova and Vlad Eidelman. 2019. “BillSum: A Corpus for Automatic Summarization of US Legislation.” ACL EMNLP 2019. github.com/FiscalNote/BillSum

    5. Thomas Wolf et al. 2020. “Transformers: State-of-the-Art Natural Language Processing.” ACL EMNLP 2020. github.com/huggingface/transformers

    6. Mike Lewis et al. 2020. “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.” ACL Proceedings (58).

    7. Kavita Ganesan. 2018. “ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks.” github.com/kavgan/ROUGE-2.0
