Abstract
This was a research project in which I embedded my private writings (questions, writing fragments, notes, and my reading wishlist) into a semantic space using a language model, creating an emergent map of a mind. Such a platform would have many interesting applications, particularly if you shared it with others:
- SYNTOPIC READING: focus on one piece at a time. Read about it from all angles.
- WRITING: I find that writing is an indispensable part of thinking. I also have to ruminate on a subject for several days to do my best writing, especially for a new topic. I make notes of thought fragments, and they're added to a writing inbox automatically.
- REMEMBERING: arbitrary subsets of semantic space can be exported to Anki.
- IDEATING: one generator is combining ideas. Great combinations are not too similar and not too distant - there's a Goldilocks zone that's most fertile. This corresponds to a range of distances in semantic space.
- Pairing two complementary topics, like writing and cognition or love and friendship, can also be generative: DOUBLE SYNTOPIC READ-IATING.
- IDENTIFY BLIND SPOTS: multiple projections can exist: one optimized for laying out your content, one for you and a group of friends, and another for everyone's content. If you identify a gap in your semantic space, you could get recommendations from friends who have read things in that area.
- DEEPEN FRIENDSHIPS: with a more complete understanding of where your mind graphs overlap, you can give and receive better recommendations and start more and deeper conversations.
- IMPROVE INTELLECTUAL DIVERSITY: find people with totally complementary knowledge.
- IDENTIFY COLLABORATORS: find people with important overlaps, and connect with them as study buddies, paper reviewers, cofounders, or in relationships.
- IDENTITY SHARING: I might even generate an old-timey map from this, where place names correspond to the dominant topics of each area and mountain ranges correspond to the depth of your writing.
This was fun, but I got distracted by AI research and then the Archive, and it fell down my priority list. If you'd like to pick up where I left off, DM me! This document is my stream-of-consciousness Captain's Log.
Table of contents
- Abstract
- Table of contents
- RETIRED in favor of a new research log in Roam.
- Break: off to Australia until mid-January
- MILESTONE: possible use cases
- 12/13: better document representations and/or clustering
RETIRED in favor of a new research log in Roam.
Break: off to Australia until mid-January
Amazing Twitter scraper: https://github.com/twintproject/twint
Contact Misha from the hierarchical topic modeling optimal transport (HOTT) paper
Ways to build document representations:
Complete app ideas:
- Ways to test the social hypotheses: I can scrape the Goodreads accounts of my friends! In fact, I can get my whole friend list.
- Chrome extension that lets you build up research lists. Then, you can read about whole topics, grouped together.
- Topic pairings. Choose word clouds to combine.
- ...
The goal is to reliably and quickly accumulate wisdom.
What I really want is to think about one topic at a time, from many sources (Twitter is a source too, many great thinkers aren't book writers), collect/recombine/internalize my perspectives, and integrate my conclusions into a worldview.
Observations:
- Forgetting is the default.
- Write about what you read.
- Articles are great too, not just books.
- Great sources accumulate constantly, and perfect curriculum generation ahead of time is difficult, especially if you consider articles.
- Context switching happens at many granularities.
- Great essays are composed of many great thought fragments; articulating them allows you to test them.
- Great writing requires days of ruminating unless you already have a deep understanding of the topic.
Query: type a few words, then find similar docs and generate reading/remembering/writing inboxes (see the sketch below). Let these mix together for a week or two and you'll produce excellent ideas and writing.
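A minimal sketch of that query step, assuming doc_vecs (one embedding row per document) and an already-embedded query vector; the names are mine:

import numpy as np

def query_inbox(query_vec, doc_vecs, k=20):
    # Cosine similarity between the query and every document embedding.
    docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = docs @ q
    return np.argsort(-sims)[:k]  # indices of the k most similar docs

The returned indices would seed the reading/remembering/writing inboxes.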
The best thinkers on many topics have not written books. They communicate through blogs and Twitter.
Keeping abreast of this content and integrating it into durable wisdom is overwhelming.
Solutions so far: follow a curriculum (books), or tag everything manually (too much friction).
Project it all into semantic space, and only engage with one selection at a time. As you focus on that, your clippings and favorites will accumulate safely in other parts of your mind graph.
Useful platform. A few ideas:
Each dot will be a Tweet, a book from my Goodreads, an article I clipped into Evernote, or a document I wrote in Notion. In effect, my whole knowledge graph will be in here in one ZUI (zoomable user interface).
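A minimal sketch of the projection behind that map, using umap-learn and assuming doc_vecs stacks embeddings from all four sources (the name is mine):

import umap  # umap-learn

# doc_vecs: (n_docs, dim) embeddings pooled from Tweets, Goodreads books,
# Evernote clippings, and Notion docs. Each row becomes one dot's (x, y)
# position in the ZUI.
coords = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine").fit_transform(doc_vecs)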
MILESTONE: possible use cases
SYNTOPIC READING: focus on one piece at a time. Read about it from all angles.
WRITING: I find that writing is an indispensable part of thinking. I also have to ruminate on a subject for several days to do my best writing, especially for a new topic. I make notes of thought fragments, and they're added to a writing inbox automatically.
REMEMBERING: arbitrary subsets of semantic space can be exported to Anki (see the export sketch after this list).
IDEATING: one generator is combining ideas. Great combinations are not too similar and not too distant; there's a Goldilocks zone that's most fertile, corresponding to a range of distances in semantic space (see the pairing sketch after this list).
Pairing two complementary topics, like writing and cognition or love and friendship, can also be generative: DOUBLE SYNTOPIC READ-IATING.
IDENTIFY BLIND SPOTS: multiple projections can exist: one optimized for laying out your content, one for you and a group of friends, and another for everyone's content. If you identify a gap in your semantic space, you could get recommendations from friends who have read things in that area.
DEEPEN FRIENDSHIPS: with a more complete understanding of where your mind graphs overlap, you can give and receive better recommendations and start more and deeper conversations.
IMPROVE INTELLECTUAL DIVERSITY: find people with totally complementary knowledge.
IDENTIFY COLLABORATORS: find people with important overlaps, and connect with them as study buddies, paper reviewers, cofounders, or in relationships.
An emergent organization system of your mind: a platform with many interesting applications.
IDENTITY SHARING: I might even generate an old-timey map from this, where place names correspond to the dominant topics of each area and mountain ranges correspond to the depth of your writing.
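A hedged sketch of the Anki export from REMEMBERING, using the genanki library; the IDs, field layout, and selected_fragments are placeholders I chose, not part of the project:

import genanki

# Hypothetical: export one selected region of semantic space as a deck.
model = genanki.Model(
    1607392319,  # arbitrary but stable model ID
    'Fragment',
    fields=[{'name': 'Front'}, {'name': 'Back'}],
    templates=[{'name': 'Card 1',
                'qfmt': '{{Front}}',
                'afmt': '{{FrontSide}}<hr id="answer">{{Back}}'}],
)
deck = genanki.Deck(2059400110, 'Semantic selection')  # arbitrary deck ID
for frag in selected_fragments:  # hypothetical: docs inside the chosen region
    deck.add_note(genanki.Note(model=model, fields=[frag.prompt, frag.text]))
genanki.Package(deck).write_to_file('selection.apkg')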
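And a minimal sketch of the Goldilocks-zone generator from IDEATING, using cosine distance over the same hypothetical doc_vecs; the 0.3-0.6 band is an illustrative guess, not a measured value:

import numpy as np
from itertools import combinations

def goldilocks_pairs(doc_vecs, lo=0.3, hi=0.6):
    # Cosine distance = 1 - cosine similarity on normalized rows.
    v = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    pairs = []
    for i, j in combinations(range(len(v)), 2):
        d = 1.0 - float(v[i] @ v[j])
        if lo <= d <= hi:  # not too similar, not too distant
            pairs.append((i, j, d))
    return pairs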
Compared to Roam: this is a meta layer over all document apps. Anything that Roam can export will work. I'm very careful not to build a note app; I've written a suite of data migration tools instead. All links open in their respective apps, and this app updates seamlessly where possible.
PROBLEMS: documents are often related in a graph, not just semantically. How do I create embeddings that represent this? Just concatenate a topic model to a graph representation and project it?
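A hypothetical sketch of that concatenation, with topic_vecs from the topic model and graph_vecs from some link-graph embedding such as node2vec; the names and the per-block normalization are my choices:

import numpy as np

def combine_reps(topic_vecs, graph_vecs):
    # L2-normalize each block so neither semantics nor graph structure
    # dominates distances in the concatenated space.
    t = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    g = graph_vecs / np.linalg.norm(graph_vecs, axis=1, keepdims=True)
    return np.hstack([t, g])  # then project with UMAP as elsewhere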
Modules approach? Amazing Marvin style? Teach people to use each one?
12/13: better document representations and/or clustering
https://github.com/IBM/HOTT/blob/master/distances.py
https://github.com/IBM/HOTT/blob/master/hott.py
https://github.com/IBM/HOTT/blob/master/knn_classifier.py
https://github.com/IBM/HOTT/blob/master/main.py
Questions:
- what is a topic model, and how do I use it?
- can I project topic models down with UMAP? (see the sketch below)
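On the second question, a minimal sketch, assuming a umap-learn version that supports the hellinger metric (a natural distance between histograms like topic distributions):

import umap

# doc_topics: (n_docs, n_topics), each row a topic distribution summing to 1
emb2d = umap.UMAP(metric="hellinger").fit_transform(doc_topics)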
HOTT paper description outline:
- Topic model: a topic is a distribution over a vocabulary
- Document reps: distribution over topics.
- WMD, but between topics in a doc instead of words in the doc, makes sense.
ot.emd2(p, q, C)
p, q are 1D histograms (sum to 1 and positive). C is the ground cost matrix.
In POT, most functions that solve OT or regularized OT problems have two versions that return the OT matrix or the value of the optimal solution. For instance, ot.emd returns the OT matrix and ot.emd2 returns the Wasserstein distance. This approach has been implemented in practice for all solvers that return an OT matrix (even Gromov-Wasserstein).
[method(doc, x, C) for x in X_train.T]  # from the HOTT repo's knn_classifier: distances from one doc to every training doc (docs are columns of X_train)
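Putting the outline together, a minimal sketch of the HOTT idea under my own array names: topics holds topic-word distributions, word_cost holds pairwise word-embedding distances (the paper truncates each topic to its top words first to keep the inner EMDs cheap):

import numpy as np
import ot  # POT

def topic_ground_cost(topics, word_cost):
    # Cost between two topics = WMD between their word distributions.
    k = topics.shape[0]
    C = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            C[i, j] = C[j, i] = ot.emd2(topics[i], topics[j], word_cost)
    return C

def hott_distance(p, q, C):
    # HOTT: optimal transport between two docs' topic histograms,
    # with topic-to-topic WMD as the ground cost.
    return ot.emd2(p, q, C)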
Word compositionality
Gensim. Try to detect bigrams and trigrams.
from gensim.models.phrases import Phrases, Phraser

tokenized_train = [t.split() for t in x_train]  # whitespace tokenization
phrases = Phrases(tokenized_train)  # learn bigram statistics from the corpus
bigram = Phraser(phrases)  # freeze into a fast, memory-light Phraser
trigram = Phraser(Phrases(bigram[tokenized_train]))  # second pass merges bigrams + unigrams into trigrams
tokenized_train = [trigram[bigram[t]] for t in tokenized_train]  # apply both passes
Oooh, I really ought to use sense2vec to preprocess this stuff, not word2vec.