Workshops

Peeter Tinits - Introduction to R and Tidyverse

Lecturer: Peeter Tinits (peeter.tinits@ut.ee), University of Tartu, Tallinn University

Date: 24.08.

 

Description

R is a scripting language often used for data processing in humanities and social sciences. It provides the means to produce analyses as a reproducible workflow that is transparent to readers and easy to update. We will start with the very basics of R and RStudio, and quickly work our way through to simple data processing via tidyverse packages. Tidyverse is a set of packages that aims to make R easy to use especially for beginners. We will learn 1) basic R syntax, 2) reading data into R, 3) selecting data points and features, 4) making quick summaries of data, 5) creating variables, 6) transforming data and data frames, 7) joining datasets together.

This is a very practical introduction to R. We will focus more on how to do these things in R, and less on the research questions that drive these needs. All the data processing is done in tidyverse, so if you know R but not tidyverse, it may be interesting for you too.

We will rely on personal laptops in this tutorial, so you will need to install R (https://www.r-project.org) and RStudio (https://www.rstudio.com) a few days beforehand. Short instructions will be shared.

If you have no previous experience in R, this tutorial is a requirement for attending other workshops using R in this summer school.

 

References:

- Grolemund, Garrett, and Wickham, Hadley (2017) R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly Media.

 

About the instructor

Peeter Tinits is a digital humanities specialist at the University of Tartu, where he teaches various digital humanities courses. His own research has been on spelling standardization of Estonian, the rise of environmentalism in the 20th century, and structural changes in film production crews. He is a firm believer that anyone can learn to code, and that the humanities have a lot to gain from adopting reproducible research practices.

Kristiina Vaik - Introduction to natural language processing using Pandas and spaCy

Lecturer: Kristiina Vaik (kristiina.vaik@ut.ee), University of Tartu

Date: 24.08.

Room: Jakobi 2-106

 

Description

This workshop aims to introduce an alternative programming language used in natural language processing: Python. Python has a simple syntax and transparent semantics and is widely used for analyzing, understanding, and deriving information from structured and unstructured data. The course will start with a basic introduction to Python, quickly going through topics such as syntax, variables, data structures, conditionals, loops, and I/O. This will be followed by an introduction to Pandas, a powerful Python data analysis toolkit used for data exploration and manipulation. Finally, we will move on to spaCy, a free open-source library with many built-in capabilities for text processing. We will use spaCy for data (pre)processing, e.g. noise removal, tokenization, and lemmatization. Additionally, we shall see how to apply pre-built models to different downstream tasks, e.g. morphological and syntactic parsing and named entity recognition.

Students will be provided with Jupyter notebooks containing the code used in this tutorial. Knowledge of Python is not mandatory but highly recommended. I also recommend using your own laptop; instructions on which packages to download will be shared beforehand.

 

About the instructor

Kristiina Vaik is a Ph.D. student at the University of Tartu. She has worked as a programmer in the Natural Language Processing Research Group at the University of Tartu and as a data analyst at TEXTA.

Martin Mölder - Web-scraping with R

Lecturer: Martin Mölder (martin.molder@ut.ee), University of Tartu

Date: 25.08.

 

Description

The Internet is full of data, both textual and numerical, that we could use and analyse; it is up to us to collect it, systematise and clean it, and finally analyse it. This workshop focusses mostly on the first step in this process: how to automatically collect information from the Internet using R. This process is generally called web-scraping. R connects to the Internet, and you can download web pages into R. By familiarising yourself with the structure and functioning of a website, you can write R code that systematically goes through the content of a web page and downloads the information you need: blog posts, comments, articles, etc. There are packages in R that can make this process rather smooth and streamlined, but some knowledge of what to look for and where on a web page is still necessary. In this workshop we will go through simpler and more complicated examples of how to construct and automate this process of information gathering in R.

 

About the instructor

Martin Mölder is a researcher at the Johan Skytte Institute of Political Studies at the University of Tartu. He teaches quantitative methods and party politics. Much of his current (and future) research interests and activities revolve around quantitative text analysis.

Iza Romanowska - Practical introduction to agent-based modelling (participating remotely)

Lecturer: Iza Romanowska (participating remotely)

Co-lecturer: Andres Kimber

Date: 25.08.

Room: Jakobi 2-106

 

Description

The goal of the workshop is to provide a quick and easy introduction to the methodology of agent-based modelling and the software most commonly used in archaeological simulation: NetLogo. Agent-based modelling is the easiest, most user-friendly and fun simulation technique enabling even non-coders to develop an artificial world and test their ideas on it. The workshop will focus on explaining the process of developing a simulation as well as providing a practical hands-on introduction to NetLogo. It will consist of a practical session demonstrating the basics of modelling through an archaeological example. An extensive list of will enable anyone who would like to consult their ideas for a simulation, needs help developing a model, or would like directions towards relevant resources.

NetLogo was chosen for its versatility as an open-source platform for building agent-based models. It combines a user-friendly interface, a simple coding language and a vast library of model examples, making it an ideal starting point for entry-level agent-based modellers as well as a useful prototyping tool for more experienced programmers. It was developed with schoolchildren in mind but is widely used in the social sciences and ecology. No previous experience in coding or simulation is required to join the workshop, but please install NetLogo in advance (https://ccl.northwestern.edu/netlogo/download.shtml).

 

About the instructor

Iza Romanowska is a complexity scientist working at the interface between social sciences and computer science. She originally trained and worked as an archaeologist before switching to computer-based research. Currently, she is a senior researcher and the head of the Social Simulation and Digital Humanities Research Group at the Barcelona Supercomputing Center, leading a team of engineers and computer scientists who develop solutions for agent-based modelling (ABM) using high-performance computing (for example, the centre's supercomputer MareNostrum). The group creates models of mobility in ancient cities, looks for patterns in demographic data, and creates platforms for real-time pedestrian flow modelling. Dr Romanowska is a vocal advocate for the wider use of simulation in archaeological research, training the next generations of ABM modellers through courses, workshops and published tutorials (e.g., tinyurl.com/y7hhqc4d). She is also a co-author of an upcoming textbook on archaeological ABM.

Cornelius Puschmann - Sentiment analysis with R (participating remotely)

Lecturer: Cornelius Puschmann (puschmann@uni-bremen.de), University of Bremen (participating remotely)

Co-lecturer: Sander Salvet (sander.salvet@ut.ee), University of Tartu

Date: 25.08.

 

Description

Audience:
MA/PhD students and faculty in all fields interested in quantitative social media research, especially doctoral students in media & communication research and related fields.

Learning outcomes:
Participants will learn how to obtain and analyze large-scale social media data sets to answer questions relevant to the textual expression of sentiment/emotions. In order to achieve this goal, they will be introduced to the use of R for content analysis with quanteda and additional software packages. They will also learn the fundamentals of interacting with social media platform APIs, as well as managing data and visualizing results.

Prerequisites:
The course will assume familiarity with R (r-project.org) and RStudio, especially R Notebooks. Participants should be able to read datasets into R, work with vectors and data frames, and run basic statistical analyses, such as linear regression.

Content:
This class focuses on the types of questions that are relevant to communication and media studies, as well as political science, sociology and other fields interested in leveraging digital data from social media platforms in combination with innovative computational methods for content analysis (“big data” research). The platforms used as examples include Twitter and Facebook, and the techniques covered will include sentiment analysis through the use of dictionaries and third-party APIs.
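Dictionary-based sentiment analysis can be sketched roughly as follows. The example is plain Python rather than the R and quanteda used in the course; the tiny word lists, the example posts and the scoring rule are all invented for illustration and do not represent a real sentiment lexicon.

```python
import re

# Hypothetical mini-dictionary (assumption: a real lexicon would be far larger).
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "awful", "hate", "sad"}

def sentiment_score(text):
    """Return (positive hits - negative hits) divided by the number of tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

# Invented example posts, not real social media data.
posts = [
    "I love this great summer school",
    "The weather was awful and I hate rain",
    "The lecture starts at ten",
]
scores = [sentiment_score(p) for p in posts]
print(scores)
```

A score above zero suggests mostly positive wording, below zero mostly negative, and zero means no dictionary words matched or the hits cancel out.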

Study materials and literature:
The course will use the open-source software R and the development environment RStudio, which greatly facilitates coding with R. Both R and RStudio are freely available. Each participant should have access to a laptop computer on which current versions of R and RStudio are preinstalled and on which they have the necessary permissions to install packages.

Modes of study:
The course will follow a hands-on approach, with short theoretical sessions followed by coding challenges where participants will need to apply new methods.

 

About the instructor

Cornelius Puschmann is a professor of media and communication at ZeMKI, University of Bremen and an affiliate researcher at the Leibniz Institute for Media Research, as well as the author of a popular German-language introduction to content analysis with R. His interests include digital media usage, online aggression, the role of algorithms for the selection of media content, and automated content analysis.

Andres Karjus - Visualizing your data using R

Lecturer: Andres Karjus (andres.karjus@hotmail.com), University of Edinburgh

Date: 26.08.

 

Description

In this workshop, we’ll be focusing on visualizing different kinds of data using R, an excellent programming language for doing anything related to stats and data science. We will mostly be using ggplot2 and its add-ons, starting out with basic examples like scatterplots and time series and looking at how to balance legibility against the amount of information on a plot. We will also look into a few other packages for creating networks and maps, as well as interactive plots and animations that can be published on the web and included in slide presentations. Some time will also be dedicated to discussing the ethics of data visualization, or how to make sure you are not misleading your audience (and how to spot bad-faith graphs in the wild).

 

About the instructor

Andres Karjus is a PhD student at the Centre for Language Evolution at the University of Edinburgh, and a tutor at the School of Philosophy, Psychology & Language Sciences. He uses R daily in his research and has been teaching occasional R workshops since 2015. He holds degrees in linguistics (BA, MA) and computer science (MSc). Personal website: andreskarjus.github.io

Simon Hengchen - Introduction to diachronic word embeddings with Python (participating remotely)

Lecturer: Simon Hengchen (simon.hengchen@gu.se), University of Gothenburg (participating remotely)

Co-lecturers: Peeter Tinits, Artjoms Šela

Date: 26.08.

Room: Jakobi 2-106

 

Description

The increasing availability of textual data gives new opportunities for humanities and social sciences that we are only beginning to explore. The nature of the data can vary quite a bit ranging from old digitized newspapers to Twitter or forum posts that are born and live digitally. Provided that we can access the data, they allow quite diverse questions to be answered. Concurrently, these past years have seen the rise of computational methods to detect, track, qualify, and quantify how a word’s sense – or senses – change over time.

In this tutorial, we will learn how to get and prepare textual content to build word embedding models with Python. Word embeddings are a rough approximation of the distributional hypothesis (Harris 1954), which states that words occurring in similar contexts tend to have similar meanings. Using such models means that we represent words as vectors (a one-row table, filled with numbers) in multi-dimensional space -- which in turn allows us to go beyond simple string comparison: we now have easy access to a word's sense(s), among other things.

To reuse a famous example: vector_king - vector_man + vector_woman ≃ vector_queen.
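The analogy can be illustrated with a toy sketch in Python. The three-dimensional vectors below are invented by hand for illustration (real embeddings are learned from a corpus and typically have hundreds of dimensions), and the helper names `cosine` and `analogy` are hypothetical, not part of any library.

```python
import math

# Hand-made toy vectors (assumption: not trained on any corpus).
# The dimensions loosely stand for: royalty, maleness, femaleness.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vector_a - vector_b + vector_c, excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
print(result)
```

With these toy values the nearest remaining word is "queen". Excluding the three input words from the candidates is a common convention, since the target vector often stays close to one of them.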

When trained on diachronic data, these models allow for the detection and quantification of changes in a word's sense over time.

In this tutorial, students will learn about:

- diachronic textual data and where to find them

- training different embedding models, as well as how to use them

- different ways of tackling time

Students will be provided with Jupyter notebooks containing the bulk of the code used in this tutorial -- as such, they do not need to be proficient in Python, although this is recommended. Students should at least have read and understood the material in Sinclair and Rockwell (2015).

 

Requirements:  

- The workshop will take place in a computer class where the software is preinstalled

- If you use your own computer, you must be able to run Jupyter notebooks, as well as install Python packages. We will be using Python 3. Instructions to do so are available in Sinclair and Rockwell (2015).

- If you want to have a go with your own data, please email Simon by August 10 for a go-ahead.

- Please read the readings

 

Readings:

- Tahmasebi, Nina and Hengchen, Simon, 2019. The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies. Samlaren: tidskrift för svensk litteraturvetenskaplig forskning, 140, pp.198-227. http://uu.diva-portal.org/smash/get/diva2:1415010/FULLTEXT01.pdf

 

References:

- Sinclair, Stéfan and Rockwell, Geoffrey, 2015. The Art of Literary Text Analysis: https://github.com/sgsinclair/alta/blob/915579fc1c6926b8fcb2a38f95349a2d6cba00b5/ipynb/GettingSetup.ipynb

- Harris, Zellig S., 1954. Distributional structure. Word, 10(2-3), pp.146-162.

 

About the instructor

Simon Hengchen holds degrees in language (MA) and information science (MSc, PhD). In his short career, he has been involved in and employed by digital humanities groups, most recently the Computational History group (COMHIS) in Helsinki.

He is currently working at the Swedish Language Bank (Språkbanken Text) at the University of Gothenburg, where he focuses on his main research interest, computational lexical semantic change, within the Language Change project.

More information about his NLP and DH work, as well as current projects, can be found at https://hengchen.net.

Artjoms Šela - Introduction to stylometry and multivariate text analysis in R

Lecturer: Artjoms Šela (artjoms.sela@ut.ee), University of Tartu, Institute of Polish Language

Date: 27.08.

 

Description

Stylometry, a discipline that measures the variation of features within a text or a set of texts, appeared much earlier than computers, but the age of computation made it possible to see style as a clearly distributed phenomenon: hundreds of textual features taken simultaneously seemed to describe individuality much better than a handful of hand-picked examples. The usual and well-documented applications of stylometric techniques have always been authorship attribution and forensics. In this workshop we will use the general principles behind the multivariate analysis of style and authorial identity to follow the workflow of almost any textual analysis: extracting features, dealing with texts as vectors of these features, and surfing the multidimensional space of these vectors.
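The idea of treating texts as vectors of features can be sketched in a few lines. The example is in Python rather than the R used in the workshop, the three one-sentence "texts" are invented, and the choice of relative frequencies of the most frequent words compared by Manhattan distance is an assumption made for illustration, not necessarily the method taught.

```python
from collections import Counter

# Invented toy "corpus" (assumption: real texts would be far longer).
texts = {
    "A": "the cat sat on the mat and the dog sat on the rug",
    "B": "the dog sat on the rug and the cat lay on the mat",
    "C": "a storm of swords and a feast of crows and a dance of dragons",
}

def relative_frequencies(text):
    """Map each word to its share of all tokens in the text."""
    words = text.lower().split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

profiles = {name: relative_frequencies(t) for name, t in texts.items()}

# Features: the five most frequent words across the whole toy corpus.
corpus_counts = Counter(" ".join(texts.values()).lower().split())
features = [w for w, _ in corpus_counts.most_common(5)]

def to_vector(profile):
    """Represent a text as the relative frequencies of the chosen features."""
    return [profile.get(w, 0.0) for w in features]

def manhattan(u, v):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))

vectors = {name: to_vector(p) for name, p in profiles.items()}
dist_ab = manhattan(vectors["A"], vectors["B"])
dist_ac = manhattan(vectors["A"], vectors["C"])
print(features, dist_ab, dist_ac)
```

In this toy setting texts A and B, which share most of their frequent words, end up much closer to each other than A and C do.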

The workshop starts by introducing the “stylo” package for R (Eder, Rybicki, Kestemont 2016), which is simple to use yet powerful enough to be customized to research needs. After covering the basics we will move on to building our own simple stylometric tool using the “tidyverse” and “tidytext” packages, which will allow us to demystify the process. Finally, we will discuss how to use stylometry beyond authorship attribution and will run a small experiment on supervised classification of text genres. Participants are encouraged to bring their own datasets, text collections and research questions!

 

About the instructor

Artjoms Šela is a research fellow at the University of Tartu and is currently doing postdoctoral research at the Methodology Department of the Institute of Polish Language (Krakow). In 2018 he received his PhD in Russian literature from the University of Tartu. He teaches courses focusing on digital humanities, computational methods and literature.

Ülo Maiväli, Taavi Päll - Introduction to Bayesian inference in RStan & brms

Lecturer: Ülo Maiväli (ulo.maivali@ut.ee), University of Tartu

Lecturer: Taavi Päll (taavi.pall@ut.ee), University of Tartu

Date: 27.08.

Room: Jakobi 2-106

 

Description

In this workshop, we introduce statistical applications of probability theory that are based on Bayes' theorem. We will learn to work with posterior samples using the simple example of the bootstrap, after which we will apply Bayes' theorem to binomial models. Then we will enter the world of Markov chain Monte Carlo (MCMC) simulation and briefly foray into the Stan programming language to fit some flexible models (binomial and others). However, we will mainly use the ‘brms’ package in R, which allows for the specification of a large array of regression models in the common R modelling language. Finally, we will stick our snouts into the truffles of multilevel shrinkage models.
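As a rough illustration of applying Bayes' theorem to a binomial model, here is a minimal grid-approximation sketch. It is written in plain Python rather than the R, Stan and brms used in the workshop; the data (6 successes in 9 trials), the flat prior and the function name `grid_posterior` are assumptions chosen for illustration.

```python
from math import comb

def grid_posterior(successes, trials, grid_size=101):
    """Approximate the posterior of a binomial probability p on an evenly spaced grid."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # flat prior over p (assumption)
    likelihood = [
        comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)
        for p in grid
    ]
    # Bayes' theorem: posterior is proportional to likelihood times prior.
    unnormalised = [lik * pri for lik, pri in zip(likelihood, prior)]
    total = sum(unnormalised)
    posterior = [u / total for u in unnormalised]
    return grid, posterior

# Hypothetical data: 6 successes in 9 trials.
grid, posterior = grid_posterior(successes=6, trials=9)
posterior_mean = sum(p * w for p, w in zip(grid, posterior))
posterior_mode = grid[posterior.index(max(posterior))]
print(round(posterior_mean, 3), posterior_mode)
```

With a flat prior the exact posterior is Beta(7, 4), whose mean is 7/11, so the grid result can be checked against a known answer. MCMC replaces the grid when models have too many parameters for this brute-force approach.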

 

About the instructor

Ülo Maiväli works at the Institute of Technology, University of Tartu. He is interested in the molecular biology of protein synthesis, biomedical data analysis, metascience, and small dogs, albeit not necessarily in that order.