As I was writing my thesis (nearly 2 years ago now), I came across the following requirement in our school’s thesis guidelines:
I needed a way to generate a list of abbreviations in the text of my thesis. I didn’t want to do this manually for ~280 pages, as it would take a long time and there was a chance that I would miss some abbreviations. Furthermore, I kept revising my thesis until close to the submission deadline, which meant that I was adding and deleting abbreviations that then needed to be updated in the master abbreviation list. In the end, I just wanted a way to automatically generate an alphabetized list of abbreviations from the text of my thesis.
I wrote my thesis using Microsoft Word. There are some existing approaches to creating a list of abbreviations in Word (using macros), but they require some manual annotation (see the end of this post for existing tools), and I didn’t have the foresight to set this up from the beginning. Word can search with regex-like wildcard patterns (using “Find and Replace”), but the difficulty was in collecting all the matches and generating an alphabetized list.
I first copy-pasted the text of my thesis into a plain text file in the Sublime Text editor. I played around with a few regular expressions that I found on StackOverflow to capture the abbreviations. Here is a screenshot of one of the earliest regexes that I tried:
I used “Find All” after entering the regex to highlight all matches, and then copy-pasted the matches into a new text file. Each match was listed as a new line:
Then, I used the “Sort Lines” option, followed by “Permute Lines -> Unique” in Sublime to generate an alphabetized list with unique entries:
So there we have it! An alphabetized list of abbreviations.
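For anyone who would rather script these steps than click through an editor, here is a minimal Python sketch of the same find, sort, and de-duplicate workflow. The pattern is a hypothetical starter (runs of two or more capital letters), not the regex from my screenshot, and the function name and sample text are made up for illustration.

```python
import re

# Hypothetical starter pattern: runs of two or more capital letters.
SIMPLE_ABBREVIATION = re.compile(r"\b[A-Z]{2,}\b")


def sorted_unique_matches(text: str) -> list[str]:
    """Find every match, drop duplicates, and sort alphabetically."""
    matches = SIMPLE_ABBREVIATION.findall(text)  # similar to "Find All"
    return sorted(set(matches))                  # similar to "Sort Lines" + "Unique"


sample = "RNA was extracted and DNA was sequenced. RNA quality was checked by PCR."
print(sorted_unique_matches(sample))  # ['DNA', 'PCR', 'RNA']
```

Using a set for the de-duplication step means the order of the matches is lost, which is fine here because the list is sorted right afterwards.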
I then modified the regex to identify abbreviations with multiple capital letters, different punctuation (e.g. periods and dashes), and different string lengths, and ended up with the following expression:
The expression is a bit complicated, but it mostly works. I’ll refer you to tutorials on regex if you’d like to learn more.
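To give a feel for what a broader expression can look like, here is a hedged Python sketch. It is not the expression I ended up with; the pattern, the two-capital-letter rule, and the sample sentence are all assumptions chosen for illustration. It collects dotted forms and plain or dashed tokens, then keeps only tokens with at least two capital letters.

```python
import re

# Hypothetical pattern, written in verbose mode so each part can be commented.
CANDIDATE = re.compile(
    r"""
    \b(?:[A-Z]\.){2,}                     # dotted forms such as U.S.A.
    |
    \b[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*\b    # plain or dashed tokens such as GTEx, RT-PCR
    """,
    re.VERBOSE,
)


def find_abbreviations(text: str) -> list[str]:
    """Return candidate tokens with at least two capitals, alphabetized."""
    candidates = CANDIDATE.findall(text)
    # Assumption: an abbreviation contains two or more capital letters.
    kept = {tok for tok in candidates if sum(ch.isupper() for ch in tok) >= 2}
    return sorted(kept, key=str.lower)  # case-insensitive alphabetical order


sample = "Samples from the U.S.A. were processed by RT-PCR using GTEx data and mRNA counts."
print(find_abbreviations(sample))  # ['GTEx', 'mRNA', 'RT-PCR', 'U.S.A.']
```

A rule this loose will also pick up capitalized hyphenated phrases (for example "Genome-Wide"), so the output still needs a quick manual review.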
While looking at the output after doing “Unique Lines,” I noticed that there were a few entries that were duplicated, but with minor differences in capitalization (e.g. ‘GTEX’ vs ‘GTEx’).
I then went back into my thesis, did “Ctrl + F” to find all entries that matched ‘GTEX’, and changed them to the desired ‘GTEx’. This list of abbreviations also doubled as a spell-check!
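Spotting these near-duplicates by eye works for a short list, but it can also be scripted. The following Python sketch groups entries that differ only in capitalization; the function name and the sample list are hypothetical.

```python
from collections import defaultdict


def case_variants(abbreviations: list[str]) -> dict[str, list[str]]:
    """Group entries that are identical apart from capitalization."""
    groups = defaultdict(set)
    for abbr in abbreviations:
        groups[abbr.upper()].add(abbr)  # the upper-cased form is the grouping key
    # Keep only the groups with more than one spelling.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


found = ["DNA", "GTEX", "GTEx", "RNA"]
print(case_variants(found))  # {'GTEX': ['GTEX', 'GTEx']}
```

Each group that comes back is a candidate inconsistency to fix in the source document.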
Now that I had this semi-automated workflow, I wanted to go one step further. I wanted to automatically generate an alphabetized list of abbreviations using any text input (Word, Excel) by directly copy-pasting the text from a document into an app. I basically wanted this app to do the following:
- Accept pasted text from any document type (e.g. Word, Excel, PDF)
- Return an alphabetized list of abbreviations
- Have a user-friendly interface
The end result of this quest was abbreviatoR, a Shiny app that does exactly this. Here’s a screenshot of the UI:
Here’s how to use abbreviatoR:
- Modify the regex (if desired) to match specific types of abbreviations.
- Set the maximum number of characters in a result (e.g. 5 means that the returned abbreviations will be no more than 5 characters long).
- Check the box if you are pasting from a Word document with in-text citations (the app includes some code to remove the abbreviations specific to Microsoft Word).
- Paste your text into the box that says “Paste text here”.
abbreviatoR runs automatically as soon as it receives input in the text box, and updates automatically whenever the text input or the parameters change.
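To make the two main parameters concrete, here is a rough Python sketch of a function that takes a regex and a maximum length. It is not the app's code (abbreviatoR is a Shiny app); the function name, the default pattern, and the sample text are assumptions, and the citation checkbox is left out because its exact behaviour is not described here.

```python
import re

DEFAULT_PATTERN = r"\b[A-Z]{2,}\b"  # hypothetical default, not the app's regex


def list_abbreviations(text: str, pattern: str = DEFAULT_PATTERN, max_chars: int = 5) -> list[str]:
    """Return unique matches of `pattern` no longer than `max_chars`, alphabetized."""
    # finditer + group(0) returns the whole match even if the pattern has groups.
    matches = {m.group(0) for m in re.finditer(pattern, text)}
    short_enough = {m for m in matches if len(m) <= max_chars}
    return sorted(short_enough, key=str.lower)


sample = "The WHO and UNESCO reviewed MRI and CT protocols."
print(list_abbreviations(sample))               # ['CT', 'MRI', 'WHO']
print(list_abbreviations(sample, max_chars=6))  # ['CT', 'MRI', 'UNESCO', 'WHO']
```

Raising the maximum length lets longer abbreviations through, at the cost of more false positives such as headings typed in capitals.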
Copy-pasting the text of my ~280-page thesis into this app returns a list of abbreviations in under 5 seconds!
I copy-pasted the results from the app back into my thesis under the “List of Abbreviations” section, and manually added a definition for each abbreviation. This saved me a lot of time, as all of the searching for abbreviations was automated.
That’s all for now! Future directions with this app are to automatically guess the meaning of the abbreviation given its context and nearby words, and to flag abbreviations that are undefined.
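As a starting point for those future directions, here is one naive heuristic sketched in Python. None of this exists in the app: the function names, the rule that a definition looks like "long form (ABBR)" with matching initials, and the sample text are all assumptions.

```python
import re


def guess_definitions(text: str) -> dict[str, str]:
    """Guess meanings from 'long form (ABBR)' where the word initials spell ABBR."""
    definitions = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        words_before = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words_before[-len(abbr):]  # one word per letter
        initials = "".join(word[0].upper() for word in candidate)
        if initials == abbr:
            definitions.setdefault(abbr, " ".join(candidate))  # keep the first definition
    return definitions


def undefined_abbreviations(text: str) -> list[str]:
    """List abbreviations that are used but never defined by the heuristic above."""
    used = set(re.findall(r"\b[A-Z]{2,}\b", text))
    return sorted(used - set(guess_definitions(text)))


sample = "We used polymerase chain reaction (PCR) to amplify DNA. PCR products were then sequenced."
print(guess_definitions(sample))        # {'PCR': 'polymerase chain reaction'}
print(undefined_abbreviations(sample))  # ['DNA']
```

The initials rule misses definitions that contain small words ("of", "and") or abbreviations built from letters inside a word, so it would need refinement before being useful on a real thesis.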
The source code for abbreviatoR, which can also be run locally, is available on GitHub.
If this post was helpful, or if you have any improved regexes that you’d like to share, please feel free to comment below. If abbreviatoR saves you time, you can buy me a coffee as a thank-you. Thanks!
Here is a list of existing tools for generating lists of abbreviations, along with their use cases and limitations:
- https://intelligentediting.com/apps/abbreviation-list/ - for Google Docs and Office 2013
- https://www.dcode.fr/abbreviation-list-generator - This tool was close to what I wanted: text is copy-pasted into a text box. However, it does not detect acronyms that contain periods (e.g. it detects “USA” but not “U.S.A.”), and the back-end code is not open source.
- https://www.thedoctools.com/word-macros-tips/word-macros/extract-acronyms-to-new-document/ - specific to creating an acronym list from a Word document
- https://word.tips.net/T000446_Auto_Creation_of_an_Acronym_List.html - also specific to creating an acronym list from a Word document