Author: Admin | 2025-04-28
Let’s work with a different corpus to see which terms are important in another set of works. We’ll download some classic science texts from Project Gutenberg and see which terms matter most in them, as measured by tf-idf. We’ll use three classic physics texts and a classic Darwin text:

- Discourse on Floating Bodies by Galileo Galilei: http://www.gutenberg.org/ebooks/37729
- Treatise on Light by Christiaan Huygens: http://www.gutenberg.org/ebooks/14725
- Experiments with Alternate Currents of High Potential and High Frequency by Nikola Tesla: http://www.gutenberg.org/ebooks/13476
- On the Origin of Species By Means of Natural Selection by Charles Darwin: http://www.gutenberg.org/ebooks/5001

These might all be science classics, but they were written across a 300-year timespan, and some of them were first written in other languages and then translated to English.

    library(dplyr)
    library(tidytext)
    library(gutenbergr)

    # Download the four texts by the Project Gutenberg IDs in the URLs above,
    # keeping the author as a metadata column
    sci <- gutenberg_download(c(37729, 14725, 13476, 5001),
                              meta_fields = "author")

Now that we have the texts, let’s use unnest_tokens() and count() to find out how many times each word was used in each text, and add the tf-idf with bind_tf_idf(). We’ll assign the result to an object called sciwords.

    # One row per word occurrence
    scitidy <- sci %>%
      unnest_tokens(word, text)

    # Word counts per author, plus tf, idf, and tf-idf
    sciwords <- scitidy %>%
      count(word, author, sort = TRUE) %>%
      bind_tf_idf(word, author, n)

    sciwords

    ## # A tibble: 16,992 x 6
    ##    word  author                  n     tf   idf tf_idf
    ##    <chr> <chr>               <int>  <dbl> <dbl>  <dbl>
    ##  1 the   Darwin, Charles     10287 0.0656     0      0
    ##  2 of    Darwin, Charles      7849 0.0501     0      0
    ##  3 and   Darwin, Charles      4439 0.0283     0      0
    ##  4 in    Darwin, Charles      4016 0.0256     0      0
    ##  5 the   Galilei, Galileo     3760 0.0935     0      0
    ##  6 to    Darwin, Charles      3605 0.0230     0      0
    ##  7 the   Tesla, Nikola        3604 0.0913     0      0
    ##  8 the   Huygens, Christiaan  3553 0.0928     0      0
    ##  9 a     Darwin, Charles      2470 0.0158     0      0
    ## 10 that  Darwin, Charles      2083 0.0133     0      0
    ## # ... with 16,982 more rows

The most frequent words appear in all four texts, so their idf, and therefore their tf-idf, is zero. Now let’s do the same thing we did before with Jane Austen’s novels and plot the 15 words with the highest tf-idf for each author:

    library(ggplot2)

    sciwords %>%
      group_by(author) %>%
      top_n(15, tf_idf) %>%
      ungroup() %>%
      mutate(word = reorder(word, tf_idf)) %>%
      ggplot(aes(word, tf_idf)) +
      geom_col() +
      labs(x = NULL, y = "tf-idf") +
      facet_wrap(~author, ncol = 2, scales = "free") +
      coord_flip()

We see some weird things here. We see “fig” for Tesla, but I doubt he was writing about a fruit tree. We see things like ab, ac, rc,