This course introduces basic and advanced concepts in text mining and natural language processing. Students will learn how to apply text mining methods to text data and analyse the data in a pipeline with machine learning and deep learning algorithms. The course has a strongly practical hands-on focus, and students will gain experience in using text mining on real data from the social sciences, humanities, and healthcare, and in interpreting the results.
Given the rapid rate at which text data are being digitally gathered in many domains of science, there is a growing need for automated tools that can analyse, classify, and interpret this kind of data. Text mining techniques can be applied to create a structured representation of text, making its content more accessible for researchers. Applications of text mining are everywhere: social media, web search, advertising, emails, customer service, healthcare, marketing, etc. This course offers an extensive exploration of text mining with Python. Through lectures and practicals, students will learn the skills needed to design, implement, and understand their own text mining pipeline. The topics in this course include preprocessing text, text classification, topic modeling, word embedding, deep learning models, and responsible text mining.
After completing the course, participants will be able to:
- Review the fundamental approaches to text mining;
- Understand and apply current methods for analysing texts;
- Define a text mining pipeline given a practical data science problem;
- Implement all steps in a text mining pipeline: feature extraction, feature selection, model learning, model evaluation;
- Understand and apply state-of-the-art methods in text mining;
- Implement word embedding and advanced deep learning techniques.
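The four pipeline steps named above (feature extraction, feature selection, model learning, model evaluation) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not course material: the toy corpus, the `min_df` threshold, and all function names are assumptions made for this example, and a naive Bayes classifier stands in for whichever model a real pipeline would use.

```python
import math
from collections import Counter

# Toy labelled corpus, invented for illustration only.
TRAIN = [
    ("the treatment worked well and the patient recovered", "pos"),
    ("great results and a quick recovery", "pos"),
    ("the patient felt great after the treatment", "pos"),
    ("severe side effects and no improvement", "neg"),
    ("the treatment failed and symptoms got worse", "neg"),
    ("no recovery and severe pain", "neg"),
]
TEST = [
    ("quick recovery and great results", "pos"),
    ("severe symptoms and no improvement", "neg"),
]


def tokenize(text):
    """Feature extraction: lowercase bag-of-words tokens."""
    return text.lower().split()


def select_features(docs, min_df=2):
    """Feature selection: keep tokens found in at least min_df documents."""
    df = Counter(tok for text, _ in docs for tok in set(tokenize(text)))
    return {tok for tok, n in df.items() if n >= min_df}


def train_naive_bayes(docs, vocab):
    """Model learning: multinomial naive Bayes with add-one smoothing."""
    class_docs = Counter(label for _, label in docs)
    token_counts = {label: Counter() for label in class_docs}
    for text, label in docs:
        token_counts[label].update(t for t in tokenize(text) if t in vocab)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        log_lik = {
            t: math.log((token_counts[label][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_lik)
    return model


def predict(model, text, vocab):
    """Return the label with the highest log posterior score."""
    tokens = [t for t in tokenize(text) if t in vocab]
    scores = {
        label: prior + sum(lik[t] for t in tokens)
        for label, (prior, lik) in model.items()
    }
    return max(scores, key=scores.get)


def accuracy(model, docs, vocab):
    """Model evaluation: fraction of documents labelled correctly."""
    hits = sum(predict(model, text, vocab) == label for text, label in docs)
    return hits / len(docs)


vocab = select_features(TRAIN)
model = train_naive_bayes(TRAIN, vocab)
test_accuracy = accuracy(model, TEST, vocab)
print(f"vocabulary size: {len(vocab)}, test accuracy: {test_accuracy:.2f}")
```

In practice, each step would typically be handled by a dedicated library rather than hand-written code; the sketch only shows how the steps connect.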
The course starts by reviewing basic concepts of text mining and then moves on to implementing advanced concepts in natural language processing. By the end of the week, participants will have mastered advanced text mining skills in Python.
Participants should have basic knowledge of scripting and programming in Python and be motivated to use it.
This course is part of a series of five courses in the Summer School Data Science specialisation taught by UU's Department of Methodology & Statistics. Please see here for more information about the full specialisation. This course can also be taken separately.
Summer School Data Science specialisation:
- Data science: Statistical Programming with R (S24: 5 - 9 July)
- Data science: Introduction to Text Mining with R (S41: 12 - 14 July)
- Data science: Multiple Imputation in Practice (S28: 12 - 15 July)
- Data science: Data analysis (S31: 19 - 23 July)
- Data science: Applied Text Mining (this course)
Upon completing 3 out of 5 courses in the specialisation (no more than one text mining course), students can obtain a certificate. Each course may also be taken separately.
Please note that we may have to change the course depending on COVID-19-related developments. The exact details, including a day-to-day programme, will be communicated six weeks before the start of the course.
Dr. Ayoub Bagheri
Dr. Berit Janssen
Dr. Dong Nguyen
This course works best for learners who are comfortable programming in Python, want to acquire skills in text mining approaches, and have a basic knowledge of machine learning.
Participants from a variety of fields, including sociology, psychology, education, human development, marketing, business, biology, medicine, political science, and communication sciences, will benefit from the course. A maximum of 80 participants will be admitted to this course. Please note that places are allocated on a first-come, first-served basis.
Aim of the course
The course teaches students basic and advanced text mining techniques using Python, applied to a variety of problems in many domains of science. The skills addressed in this course are:
- Python environment;
- Preprocessing text and feature extraction;
- NLTK, Gensim, spaCy;
- Text classification;
- Sentiment classification;
- Text clustering;
- Topic modeling;
- Word embedding;
- CBOW vs Skip-gram;
- Convolutional neural networks;
- Recurrent neural networks;
- Attention models;
- Responsible text mining;
- Text summarisation.
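To illustrate the "CBOW vs Skip-gram" item in the list above: the two word embedding approaches differ mainly in how training examples are built from a sliding context window. The sketch below only generates those examples and trains no model; the function names, the window size, and the sample sentence are assumptions made for this illustration.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-gram: one (centre, context) pair per neighbouring word."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """CBOW: one (context words, centre) example per position."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, centre))
    return pairs


tokens = "text mining with python".split()
sg = skipgram_pairs(tokens)
cb = cbow_pairs(tokens)
print(sg)
print(cb)
```

Skip-gram uses the centre word to predict each neighbour separately, whereas CBOW uses all neighbours together to predict the centre word.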
For an overview of all summer school courses offered by the Department of Methodology and Statistics, please click here.
Four full days. A typical course day starts at 9.00 and ends at 17.00 with breaks for coffee, lunch and tea.
Please note that there are no graded activities included in this course. Therefore, we are not able to provide students with a transcript of grades. You will obtain a certificate upon completion of this course.
Extra information about housing
Subject to COVID-19-related developments, which may require us to change the course, you can choose between two options for participating:
- If you choose the livestream option, you will get a discount on the course fee, since we will not provide lunch. The lectures will be broadcast in Central European Summer Time via a livestream (not recorded). Participants can ask questions via the chat, which will be moderated by a second lecturer who will either answer your questions directly in the chat or put them to the first lecturer during class. You will also receive online support from our team during the group computer labs. Additionally, Q&A sessions will be organised, so you will benefit from our usual high level of expertise while following the class from the comfort of your own chair.
- If you choose the campus option, you will be able to attend the lectures and computer labs at our campus. We will follow all COVID-19 guidelines in force at the start of your course and will keep you updated about the latest developments (see also https://www.uu.nl/en/information-coronavirus). Note that, at the moment, it is unclear how many participants will be allowed in our lecture rooms. Therefore, if you register for the campus option, we will also register you for the livestream option, so that you are guaranteed a spot via the livestream, and we will initially send an invoice for the livestream option only. We will put you 'on hold' for the campus option until we have more information about how many participants are allowed in our lecture rooms. As soon as we hear from the university, we will contact you and send you a second invoice for the part of the fee related to catering and campus registration.
If you are interested in the campus option, let us know via a message in the application form under ‘Student Comment’.
The on-campus course costs €615; if you participate via the livestream, you will get an €80 discount. Note that if you choose the campus option, you will first be asked to pay the livestream fee (€535) and, once we have permission from the university to organise classes on location, we will send a second invoice for the remainder of the fee. This way, you are guaranteed at least a spot in the livestream.
The tuition fee for PhD students of the Faculty of Social and Behavioural Sciences at Utrecht University will be funded by the Graduate School of Social and Behavioural Sciences.
There are no scholarships available for this course.
Irma Reyersen | E: MS.firstname.lastname@example.org
For frequently asked questions about how we organise our summer school courses during the pandemic, please click here.