Czechitas Digital Academy Final Project

Authors: Karolína Horká, Eva Klímová
Mentors: Markéta Špinková, Petr Hamerník (Geneea)

Introduction

We are Eva and Karolína, and the aim of this blog post is to present our final project, carried out as part of the “Digital Academy: Data” program organised by Czechitas.

We both have a linguistic background, and we chose to work together because in the field of data, we are both interested in web scraping, cleaning textual data, and text analytics. When it was time to choose the topic for our project, we discovered that we also share a common interest in international relations. Given the context of recent events that seem to indicate a deterioration in relations between the Czech Republic and Russia, we decided to explore the image of Russia in Czech online media. In a broader context, the annexation of Crimea at the beginning of 2014 marked a serious rupture in the relationship between the EU and the Russian Federation. We therefore established a timeframe from January 2014 to mid-October 2020 (start of our project) for our analysis.

The goal of our project is to create a publicly accessible set of dashboards/diagrams where anyone interested in the topic can find answers to the following questions:
1. What events/topics do journalists write about?
2. Who are the most quoted and most frequently mentioned people in articles about Russia?
3. Are there any differences between mainstream and other media?

Workflow

This is a simplified scheme of our project’s workflow with the tools we used. In practice, the workflow was not linear, mainly because we repeatedly redefined the selection of articles in our dataset based on the results of text analysis or visualization.

Web scraping: Apify

We decided to obtain our data by scraping articles from news websites selected from the Media Map compiled by Josef Šlerka:
● 5 mainstream news websites: Aktualne.cz, iDNES.cz, iRozhlas.cz, Lidovky.cz, Novinky.cz;
● 1 political tabloid website: ParlamentniListy.cz;
● 1 anti-system news website: Aeronet.cz (AE News);
● 1 Russian state-owned Czech news website: cz.sputniknews.com (Sputnik Česká republika) (not listed in the Media Map).

Notes:
1) iRozhlas.cz was launched in April 2017. However, the website also contains a small number of older articles from its predecessor, the news section of the website of Czech Radio (Český rozhlas).
2) cz.sputniknews.com was founded in March 2015 and contains only articles from then on.
3) On websites whose content is divided into sections, we excluded the sport, lifestyle, and culture sections.

In order to scrape articles from these media, we had to learn to work with the Apify platform, study the basics of web scraping and of the HTML structure of websites, and write simple JavaScript code (see example of code). On the advice of Kačka Hroníková from Apify, we used the Cheerio Scraper, a fast crawler suitable for scraping static websites. We would like to thank Kačka and Jakub Balada from Apify for their help.

Transformation: Snowflake SQL and Python in Keboola

When all the scrapers finished running, we had 8 datasets containing a total of over 7 million rows, and we were able to start cleaning and transforming our data. We used SQL transformations in Keboola mainly to: remove records without article text, remove duplicates, unify the date format, complete missing data where possible, and connect data from different media into one dataset with over 1 million rows and with the following 9 columns: article ID, source, date, section, author, title, opener, text, and URL. See part of SQL script here.
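The cleaning steps listed above were written in Snowflake SQL; the following is only a minimal Python sketch of the same idea, using the standard library. The date formats, the field names (`url`, `date`, `text`) and the choice of the URL as the duplicate key are assumptions made for illustration, not details taken from our scripts.

```python
from datetime import datetime

# Hypothetical date formats; the formats actually used by each site are not listed here.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d. %m. %Y")


def unify_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_and_merge(datasets):
    """Merge per-site row lists into one list of cleaned rows.

    `datasets` maps a source name to a list of dicts with the
    (assumed) keys "url", "date" and "text".
    """
    seen_urls = set()
    merged = []
    for source, rows in datasets.items():
        for row in rows:
            text = (row.get("text") or "").strip()
            if not text:
                continue  # remove records without article text
            if row["url"] in seen_urls:
                continue  # remove duplicates (assumption: same URL means same article)
            seen_urls.add(row["url"])
            merged.append(
                {**row, "source": source, "text": text, "date": unify_date(row["date"])}
            )
    return merged
```

Rows whose date cannot be parsed keep `None` in the date field, so they can be inspected and completed by hand later.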

We also used Python, mainly for cleaning the text of the articles: removing excess line terminators, spaces, and irrelevant parts of the text (for example, links to other articles included in article texts). We used regular expressions for some of these tasks. See code here.
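As a rough illustration of this kind of cleaning, here is a small sketch built on Python's `re` module. The "Čtěte také:" marker for embedded links is a hypothetical example; the patterns we really used depended on each website and are in the linked code.

```python
import re

# Hypothetical marker for lines that only link to other articles.
RELATED_LINK = re.compile(r"^[ \t]*(?:Čtěte také|Čtěte více):.*$", re.MULTILINE)


def clean_text(text):
    """Normalise whitespace and drop embedded 'read also' lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line terminators
    text = RELATED_LINK.sub("", text)                       # remove link-only lines
    text = re.sub(r"[ \t\u00a0]+", " ", text)               # collapse spaces, tabs, no-break spaces
    text = re.sub(r" *\n *", "\n", text)                    # trim spaces around line breaks
    text = re.sub(r"\n{2,}", "\n", text)                    # collapse empty lines
    return text.strip()
```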

Text analysis: Geneea

For the text analysis part, we used Geneea’s NLP platform The Interpretor. Based on embedded algorithms, this platform identifies elements such as keywords, phrases, locations, or persons and detects sentiment.

Markéta Špinková from Geneea ran many analyses for us throughout the project. We used their results to modify the “training file”, an Excel file in which we defined a set of rules for subsequent analyses. For example, if we want to count how often different media mention the Kremlin Press Secretary Dmitry Peskov, we ideally need to group all the forms under which he occurs in our articles under one entity. (The Interpretor uses a knowledge base which allows it to recognize names, but it does not automatically group all the different forms under which a single person can occur under one item.) We proceeded analogously with relevant locations, organizations, and custom entities. The final training file contains over 1000 rows.
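The grouping itself was done inside Geneea's platform through the training file, so the following is not how the platform works internally. It is only a simplified sketch of the idea of mapping several surface forms to one entity; the alias table and the inflected Czech forms in it are hypothetical examples.

```python
import re
from collections import Counter

# Hypothetical alias table standing in for a few rows of the training file.
ALIASES = {
    "Dmitry Peskov": ["Dmitrij Peskov", "Peskov", "Peskova", "Peskovem"],
    "Vladimir Putin": ["Vladimir Putin", "Putin", "Putina"],
}


def build_pattern(aliases):
    """Compile one regex for all forms and a lookup from form to entity."""
    lookup = {form: entity for entity, forms in aliases.items() for form in forms}
    # Longest forms first, so "Dmitrij Peskov" wins over the shorter "Peskov".
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")\b")
    return pattern, lookup


def count_entities(text, pattern, lookup):
    """Count mentions per entity, regardless of the surface form used."""
    return Counter(lookup[m.group(0)] for m in pattern.finditer(text))
```

With a table like this, "Dmitrij Peskov", "Peskova" and "Peskovem" all add to the same counter, which is the behaviour we wanted from the training file.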

In order to make sure that the main topic of the selected articles is actually Russia, we ran an analysis of article headlines only. Based on relevant keywords from headlines, we defined a new condition selecting only articles about Russia from our dataset (see the relevant part of SQL script here). The resulting dataset contains 52,564 articles.
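The real condition is part of the SQL script linked above. Purely as an illustration, a headline filter of this kind could look like the Python sketch below; the keyword stems are hypothetical and much shorter than the list we derived from the headline analysis.

```python
import re

# Hypothetical keyword stems; \b keeps "rusk" from matching inside "Bělorusko".
RUSSIA_KEYWORDS = re.compile(r"\b(?:rusk|putin|kreml|moskv)\w*", re.IGNORECASE)


def is_about_russia(title):
    """True if the headline contains at least one of the keyword stems."""
    return RUSSIA_KEYWORDS.search(title) is not None


def select_articles(articles):
    """Keep only the articles whose (assumed) "title" field matches."""
    return [article for article in articles if is_about_russia(article["title"])]
```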

This dataset was used for all the following analyses. Based on extensive exploration of the output from Frida, we selected 11 important topics/events related to Russia that are present in our articles and linked relevant keywords to them in the training file:

  1. The Crimea Crisis
  2. Russia/Syria
  3. Sanctions Against Russia
  4. Russian Hackers
  5. Russian Trolls
  6. The Konev Affair (info)
  7. Alexei Navalny
  8. Sergei Skripal
  9. Boris Nemtsov
  10. The Řeporyje Affair (info)
  11. Alexander Litvinenko and murdered Russian activists

The set of these topics contains 25 000 articles. In the remaining 27 000 articles, the topic was different.

To find out who is quoted in our articles, we ran several analyses of quotes that we extracted from the texts with regular expressions. We decided to work only with direct speech in our project. Later we used regular expressions to split the quotes into two parts (the part outside the quotation marks and the part inside them, i.e. the quote itself), and we ran separate analyses to get data that is as precise as possible about the quoted and mentioned people. Each of these datasets contains 120 728 rows. From hundreds of quoted speakers, we selected the 20 most quoted for further work.
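The splitting step can be sketched in Python as follows. This is a simplified example and not one of the two patterns we actually used: it assumes that direct speech is marked either with Czech typographic quotation marks („ … “) or with straight double quotes, and it ignores nested quotes.

```python
import re

# Assumption: quotes are written as „...“ (or „...”) or as "...".
QUOTE = re.compile(r'„([^„“”]+)[“”]|"([^"]+)"')


def split_quotes(sentence):
    """Return (quotes, remainder).

    quotes    -- list of the texts inside quotation marks
    remainder -- the sentence with the quoted parts removed
    """
    quotes = [m.group(1) or m.group(2) for m in QUOTE.finditer(sentence)]
    remainder = QUOTE.sub(" ", sentence)
    remainder = re.sub(r"\s+", " ", remainder).strip()
    return quotes, remainder
```

The remainder typically contains the speaker ("said the spokesman ..."), while the quoted part contains the people being talked about, which is why the two parts are worth analysing separately.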

Visualization and results

A) Results of the analysis of topics

Our dataset contains over 52 000 articles related to Russia. These articles cover thousands of topics, and we managed to detect one of the 11 big topics (events) in almost half of them. We were able to categorize the articles better in 2014 than in 2019. The following diagram shows why.

A large portion of the media’s attention in 2014 was drawn to the recent annexation of Crimea. A year later, Russia intervened in the Syrian Civil War, and a high number of articles concerned this topic. While large events dominated the beginning of our time frame, no comparable international affair involving Russia has occurred recently, and the topics of the last two years have been more fragmented.

B) Results of the analysis of quotations

The dataset we obtained after extracting quotes from the articles consists of approximately 120 000 quotes. Each quotation has an author; however, we did not identify the author in every case. This would be possible, but we decided against it because of limited time. From the thousands of known authors, we selected the 20 most quoted Czech and Russian politicians and other public figures. These 20 men account for 10% of the quotes in our dataset.

We can take a more detailed look at the 15 most quoted people from our selection in different types of media (mainstream versus other media) and in every single medium:

We can see that there are no major differences between the mainstream media. On the other hand, the anti-system news website Aeronet is dominated by Vladimir Putin, and the Russian state-owned Czech news website Sputnik gives more space to Dmitry Peskov and Sergei Lavrov than other media do.

We can take a look at the 15 most quoted Czech and Russian persons and how many times they commented on each topic/event:

Thousands of politicians, public figures and common people were mentioned in the quotes in our dataset. This diagram presents the 30 most mentioned people:

Interactive visualizations: Tableau public

Conclusion

We were able to find answers to the questions we asked at the beginning. We detected the events/topics related to Russia that Czech news websites write about the most, we found out who the most quoted and most mentioned people are, and we discovered that there are differences between who is quoted in mainstream and anti-system media.

In order to get more detailed and precise results, the following steps could be taken:

● enlarge the dataset: find more relevant websites with a large number of available articles and scrape more data;
● detect more topics and connect them with corresponding keywords: the ideal would be to categorize as many articles as possible;
● add indirect speech to the analysis of quotes;
● try to find relations between speakers and mentioned persons;
● explore and interpret the results of sentiment analysis in Frida in more depth.

Our project allowed us to learn how to use many new tools and gain experience with the entire process of text analysis. Given the nature of the data used, it also helped to improve some features of Geneea’s tool Frida. Last but not least, we obtained a large and valuable dataset of news articles during this project. Since the dataset is not limited to articles about Russia, it can be used to analyze a different topic in the future.

Karolína Horká

Eva and I were both fully engaged in all the work on our project. With the help of Apify’s tutorials, I wrote the scripts used for scraping four websites (parlamentnilisty.cz, lidovky.cz, idnes.cz, cz.sputniknews.com). During the initial transformations of our data, I mainly used SQL to clean the articles from “my” media: unifying the date format, trimming the texts, and removing unnecessary data. In the following phase, I focused mainly on regular expressions, the extraction of quotes from the articles, and Python transformations of our data in general. During these transformations, I learned how to use the Pandas library. After each text analysis in Frida, I worked on the selection and definition of topics, persons, and keywords in our training file. When our data was ready for visualization, I used Tableau to make different kinds of charts using groups, sets, and calculated fields.

Eva Klímová

Both Karolína and I wanted to thoroughly learn all the tools and methods used in our project, and we tried to split the work as equally as possible. We were in daily contact and shared all newly acquired skills with each other. I wrote scripts and defined settings in the Apify platform for scraping articles from four news websites (aeronet.cz, aktualne.cz, irozhlas.cz, novinky.cz). Then I used a series of Snowflake SQL transformations to clean and combine the obtained data (unifying the date format, splitting and trimming texts, removing duplicates, joining tables). I used basic functions of the Pandas library and worked with regular expressions in Python to clean article texts. I also designed one of the two regex patterns matching quotes. I became familiar with Keboola, which I used mainly for SQL transformations. I used the Frida platform to explore text analysis results and define parameters for further analyses (persons, custom entities, and categories). I worked with Tableau to visualize the results of our work.

Acknowledgements

We would like to thank our mentors Markéta Špinková and Petr Hamerník for providing guidance and feedback throughout this project, the entire Czechitas team for their engagement, and our families for their patient support during our work.