Abstract
This was a research project in which I embedded my private writings, questions, writing fragments, notes, and reading wishlist into a semantic space using a language model, creating an emergent map of a mind. Such a platform would have many interesting applications, particularly if you shared it with others:
- SYNTOPIC READING: focus on one piece at a time. Read about it from all angles.
- WRITING: I find that writing is an indispensable part of thinking. I also have to ruminate on a subject for several days to do my best writing, especially for a new topic. I make notes of thought fragments, and they're added into a writing inbox automatically.
- REMEMBERING: arbitrary subsets of semantic space can be exported to Anki.
- IDEATING: one generator is combining ideas. Great combinations are not too similar and not too distant - there's a Goldilocks zone that's most fertile. This corresponds to a range of distances in semantic space.
- DOUBLE SYNTOPIC READ-IATING: pairing two complementary topics, like writing and cognition or love and friendship, can also be generative.
- IDENTIFY BLIND SPOTS: multiple projections can exist: one optimized for laying out your content, one for you and a group of friends, and another for everyone's content. If you identify a gap in your semantic space, you could find recommendations from friends who have read things in that space.
- DEEPEN FRIENDSHIPS: with a more complete understanding of where your mind graphs overlap, you can give and receive better recommendations and start more and deeper conversations.
- IMPROVE INTELLECTUAL DIVERSITY: find people with totally complementary knowledge.
- IDENTIFY COLLABORATORS: find people with important overlaps, and connect with them as study buddies, paper reviewers, cofounders, or in relationships.
- IDENTITY SHARING: I might even generate an old-timey map from this, where place names correspond to dominant topic models from that area, and where mountain ranges correspond to the depth of your writing.
This was fun, but I became distracted by AI research and then the Archive, and it fell down my priority list. If you'd like to pick up where I left off, DM me! This document is my stream-of-consciousness Captain's Log.
Table of contents
- Abstract
- Table of contents
- RETIRED in favor of a new research log in Roam.
- Break: off to Australia until mid-January
- MILESTONE: possible use cases
- 12/13: better document representations and/or clustering
- 12/9-12/12: NeurIPS 2019 main conference
- MILESTONE: working semantic projections + visual explorer
- 12/6: executable writing system brainstorm
- 12/5: ship templates for Anki
- 12/2/19: Anki templates!
- 11/30/19: Backlinks
- 11/28-11/30/19: Similarity suggestions!
- MILESTONE: BERT-powered semantic similarity
- 11/26/19: Andy's tweet, knowledge categories, Idea Stream, valid goals
- 11/22/19: streams of highlights like Gmail
- 11/20/19: NLP research areas
- 11/17/19: design ideas: notes app + similarity
- 11/15/19: embeddings for everything, interested people
- 11/14/19: exploring Goodreads API
- 11/13/19: Bert → UMAP
- MILESTONE: BERT & UMAP projections
- MILESTONE: dashboard for my writing funnel
- 11/12/19: text extractors/classifiers
- 11/11/19: Notion → Anki
- 11/10/19: Evernote memex, backlinker, Anki
- 11/9/19: Anki MVPs
RETIRED in favor of a new research log in Roam.
Break: off to Australia until mid-January
Amazing Twitter scraper: https://github.com/twintproject/twint
Contact Misha from the hierarchical topic modeling optimal transport paper.
Ways to build document representations:
Complete app ideas:
- Ways to test the social hypotheses: I can scrape the Goodreads accounts of my friends! In fact, I can get my whole friendlist.
- Chrome extension that lets you build up research lists. Then, you can read about whole topics, grouped together.
- Topic pairings. Choose word clouds to combine.
- ...
The goal is to reliably and quickly accumulate wisdom.
What I really want is to think about one topic at a time, from many sources (Twitter is a source too, many great thinkers aren't book writers), collect/recombine/internalize my perspectives, and integrate my conclusions into a worldview.
Observations:
- Forgetting is the default.
- Write about what you read.
- Articles are great too, not just books.
- Great sources accumulate constantly, and perfect curriculum generation ahead of time is difficult, especially if you consider articles.
- Context switching happens at many granularities.
- Great essays are composed of many great thought fragments; articulating them allows you to test them.
- Great writing requires days of ruminating, unless you already have a deep understanding of the topic.
Query: a few words, then find similar docs and generate reading/remembering/writing inboxes. Let these mix together for a week or two and you'll produce excellent ideas and writing.
The best thinkers on many topics have not written books. They communicate through blogs and Twitter.
Keeping abreast of this content and integrating it into durable wisdom is overwhelming.
Solutions: a curriculum (books), manual tagging (frictionful)
Project it all into semantic space, and only engage with one selection at a time. As you focus on that, your clippings and favorites will accumulate safely in other parts of your mind graph.
Useful platform. A few ideas:
Each dot will either be a Tweet, a book from my Goodreads, an article I clipped into Evernote, or a document I wrote in Notion. In effect my whole knowledge graph will be in here in one ZUI.
MILESTONE: possible use cases
SYNTOPIC READING: focus on one piece at a time. Read about it from all angles.
WRITING: I find that writing is an indispensable part of thinking. I also have to ruminate on a subject for several days to do my best writing, especially for a new topic. I make notes of thought fragments, and they're added into a writing inbox automatically.
REMEMBERING: arbitrary subsets of semantic space can be exported to Anki.
IDEATING: one generator is combining ideas. Great combinations are not too similar and not too distant - there's a Goldilocks zone that's most fertile. This corresponds to a range of distances in semantic space.
DOUBLE SYNTOPIC READ-IATING: pairing two complementary topics, like writing and cognition or love and friendship, can also be generative.
IDENTIFY BLIND SPOTS: multiple projections can exist: one optimized for laying out your content, one for you and a group of friends, and another for everyone's content. If you identify a gap in your semantic space, you could find recommendations from friends who have read things in that space.
DEEPEN FRIENDSHIPS: with a more complete understanding of where your mind graphs overlap, you can give and receive better recommendations and start more and deeper conversations.
IMPROVE INTELLECTUAL DIVERSITY: find people with totally complementary knowledge.
IDENTIFY COLLABORATORS: find people with important overlaps, and connect with them as study buddies, paper reviewers, cofounders, or in relationships.
Emergent organization system of your mind. Platform with many interesting applications.
IDENTITY SHARING: I might even generate an old-timey map from this, where place names correspond to dominant topic models from that area, and where mountain ranges correspond to the depth of your writing.
Compared to Roam: this is a meta layer over all document apps; anything Roam can export will work. I'm very careful not to build a notes app - I've written a suite of data migration tools instead. All links open in their respective apps, and this app updates seamlessly where possible.
PROBLEMS: documents are often related in a graph, not just semantically - how to create embeddings that represent this? Just concatenate a topic model to a graph representation and project it?
Modules approach? Amazing Marvin style? Teach people to use each one?
12/13: better document representations and/or clustering
https://github.com/IBM/HOTT/blob/master/distances.py
https://github.com/IBM/HOTT/blob/master/hott.py
https://github.com/IBM/HOTT/blob/master/knn_classifier.py
https://github.com/IBM/HOTT/blob/master/main.py
Questions:
- what is a topic model, and how do I use it?
- can i project topic models down with UMAP?
HOTT paper description outline:
- Topic model: a topic is a distribution over a vocabulary
- Document reps: distribution over topics.
- WMD, but between topics in a doc instead of words in the doc, makes sense.
ot.emd2(p, q, C)
p, q are 1D histograms (sum to 1 and positive). C is the ground cost matrix.
In POT, most functions that solve OT or regularized OT problems have two versions that return either the OT matrix or the value of the optimal solution. For instance, ot.emd returns the OT matrix and ot.emd2 returns the Wasserstein distance. This approach has been implemented in practice for all solvers that return an OT matrix (even Gromov-Wasserstein).
[method(doc, x, C) for x in X_train.T]
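To make that concrete, a minimal sketch of the HOTT distance with POT (assuming p and q are topic histograms and C is a precomputed topic-to-topic cost matrix):

import ot  # POT: Python Optimal Transport

def hott_distance(p, q, C):
    # p, q: 1D topic histograms (positive, sum to 1); C: ground cost between topics
    return ot.emd2(p, q, C)  # returns the optimal transport cost as a float

# distances to every training doc, as above: [hott_distance(doc, x, C) for x in X_train.T]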
Word compositionality
Gensim. Try to detect bigrams and trigrams.
from gensim.models.phrases import Phrases, Phraser
tokenized_train = [t.split() for t in x_train]
phrases = Phrases(tokenized_train)
bigram = Phraser(phrases)  # detects bigrams
trigram = Phraser(Phrases(bigram[tokenized_train]))  # a second pass catches trigrams
Oooh, I really ought to use sense2vec to preprocess this stuff, not word2vec.
I really want to do sense2vec → topic modeling → umap.
After that, might try sense2vec → topic modeling → WMD but for topic senses → umap!
For visuals: an activation atlas of topic model senses, and make it a ZUI. Whoa, this will be amazing. Then: submit to ACL?
Wow: to generate
spacy pretrain /path/to/data.jsonl en_vectors_web_lg /path/to/output
--batch-size 3000 --max-length 256 --depth 8 --embed-rows 10000
--width 128 --use-vectors -gpu 0
sense2vec: download other parts:
https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2019_lg.tar.gz.001
https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2019_lg.tar.gz.002
https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2019_lg.tar.gz.003
cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz
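After untarring, loading it should look roughly like this (a sketch based on the sense2vec README; the path is wherever the tarball extracts):

from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("./s2v_reddit_2019_lg")
query = "natural_language_processing|NOUN"
assert query in s2v
print(s2v.most_similar(query, n=5))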
topics = lda_model.show_topics(formatted=False)
OK. My corpus is far too small to generate realistic topics. I need to build a more comprehensive topic model on a very large dataset. Wonder if I could find a pretrained one.
Wrong - I just had the number of topics set to 4!
Seems like it's about 20, but my dataset was real noisy. Get more data, run again later.
[(0,
'0.000*"cansing" + 0.000*"code" + 0.000*"jasonm" + 0.000*"anddrawing" + '
'0.000*"aprivacy" + 0.000*"arcalineahttps" + 0.000*"artificialwomb" + '
'0.000*"atime" + 0.000*"jamesv" + 0.000*"ibuilthappene"'),
(1,
'0.000*"cansing" + 0.000*"code" + 0.000*"jasonm" + 0.000*"anddrawing" + '
'0.000*"aprivacy" + 0.000*"arcalineahttps" + 0.000*"artificialwomb" + '
'0.000*"atime" + 0.000*"jamesv" + 0.000*"ibuilthappene"'),
(2,
'0.000*"cansing" + 0.000*"code" + 0.000*"jasonm" + 0.000*"anddrawing" + '
'0.000*"aprivacy" + 0.000*"arcalineahttps" + 0.000*"artificialwomb" + '
'0.000*"atime" + 0.000*"jamesv" + 0.000*"ibuilthappene"'),
...
Whelp, looks like coherence score isn't everything. There's a bunch of junk in here. Hmm, they're all zero probas, too.
To clean these up: check plaintext of https://www.notion.so/jasonbenn/Jay-G-5901c87500d34106894bf15e50f359a1, clean out any HTML/link junk, then rerun topic modeling and see what else looks weird. James V, too.
Hmm. All these topics are pretty bad. I wonder what's happened. Probably just a ton of noise.
Will need to capture higher LDA coherence before my map makes much sense, unfortunately.
But let's proceed to sense2vec anyway.
Another thing to consider trying: TopMine.
Actually, this deserves more research.
Automated Phrase Mining from Massive Text Corpora: https://github.com/shangjingbo1226/AutoPhrase
Research Jiawei Han's publications: he's got an H-Index of 171 and works on graphs and text representations.
To do this, I just need to manually make them myself, then pass them, along with other lemmatized concepts, to corpora.Dictionary. (From https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/, part 11.)
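For reference, a minimal version of that flow (docs here is assumed to be a list of token lists, after the Phraser pass):

from gensim import corpora
from gensim.models import LdaModel

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=2, no_above=0.5)  # drop rare junk and near-ubiquitous terms
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=10)
topics = lda_model.show_topics(formatted=False)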
Goal:
Documents: should be connected semantically and by graphs. Some of them reference each other. 3m to find papers.
Enriching BERT with Knowledge Graph Embeddings for Document Classification: classifying books with cover blurbs, metadata, and knowledge graph embeddings for author information! Holy shit!
Message Passing Attention Networks for Document Understanding: we represent documents as word co-occurrence networks and propose an application of the message passing framework to NLP, the Message Passing Attention network for Document understanding (MPAD). Includes code.
Document representations for docs - could be tweets, could be 100pg articles
Current: sense2vec → topic model (divide long articles/books to chapters? make artificial summaries using quotes?)
Alternative: anything using language models?
Seeking: long document representations?
Graph representation library DGL. Timo
Guided Similarity Separation for Image Retrieval: they used graphs to identify related images, and used those in embeddings, somehow.
Cluster documents.
UMAP straight on topic models
WMD → UMAP
People to talk to about document representations:
- Luke de Oliveira: summarization with language models at Twilio.
- Any recommendations for large-scale document summaries?
- doc2vec, topic models,
- Timo Denk: BERTGrid, saves structure. Char-level, embedding for chars.
- Language models are trained on flat text, would be nice to train one on Markdown. There's so much structure thrown away. Fine tuning BERT?
- Where's your code?
How does UMAP/t-SNE straight on topic models look?
5m: close all tabs.
Compress outstanding questions:
- Is t-SNE alone the cause of that beautiful clustering?
- Why are my topic models bad? Where are bad topics coming from?
- What topics have each of my documents learned? Can that help me make better topics?
- How do I learn senses relevant to my dataset?
- Should I make a pitch deck first, or prototype the focusing thing?
Another idea: write the history of what I've wanted to build, and summarize how it's evolved over time, and what happened to each prototype.
Right now, it helps manage cognitive overload by letting you track open/closed loops in your brain without having to manage classification.
https://plot.ly/javascript/zoom-events/
https://plot.ly/javascript/lasso-selection/
On the right: newest content. For now, just make it the last 5 entries.
Nicer hovertemplate in plotly.
Man, huge progress today.
Let's try rescraping a page - why are properties coming through like this? {')qh{': [['Yes']], "rh!'": [['Life']], 'title': [['Wisdom is mostly an aesthetic']]}
OK, they were slugified.
Writing inbox
Reading inbox
Written
Worldview content that's "presentable"
Read
Read vs unread: whether it had any highlights.
12/9-12/12: NeurIPS 2019 main conference
Roadmap:
I want documents from Notion, Evernote, and Goodreads.
Then, I want document embeddings (weighted average BERT reps, tf-idf).
Then, document embeddings to UMAP.
Finally, a web interface to view all this.
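A sketch of the whole roadmap in a few lines, with TF-IDF standing in for BERT reps for now (documents is a placeholder for everything scraped so far):

import umap
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [doc.to_plaintext() for doc in documents]
X = TfidfVectorizer(max_features=5000).fit_transform(texts)
coords = umap.UMAP(n_components=2, metric="cosine").fit_transform(X)
# coords is (n_docs, 2): feed it to the plotly scatter in the web interface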
Once I have small-scale control over my thoughts and ideas, what high-level structures feel best? Focusing on a single area for an entire month? 1 per week? Or rotating through 4, 2 days each? So many possibilities!
What ideas might be paired? Could intentionally create bridges between important fields, and fill in gaps in my knowledge graph!
Program your attention.
Knowledge is zoomable. XY is projection, Z is density. PhD is achieving some peak. Almost like you're growing a mountain range of knowledge. Could fly through terrain, haha. Old, foundational readings make up base of mountains.
Idea: stick with a content area for as long as you're having good ideas.
Good ideas = ones rated highly in a distributed science sense.
Well, it's interesting. Sometimes the document is made of many components (Goodreads books = author + series + book description + 3 highlights), sometimes it's just an entire document.
I think it'd be interesting to store information as closely to its original form as possible - which is different entities that are related to each other - and then provide APIs that produce documents from these representations. And there can be many APIs for making documents. The Goodreads API set could include various combinations of sources. Notion might include or exclude the title or the backlinks. Anything is possible! All that matters is that document APIs take no arguments, read from the DB, and produce lists of strings.
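A sketch of what one such document API could look like (GoodreadsBook is a hypothetical model name):

def goodreads_book_documents():
    # no arguments, reads the DB, returns a list of strings
    docs = []
    for book in GoodreadsBook.objects.all():
        parts = [book.title, book.description] + [h.text for h in book.highlights.all()]
        docs.append("\n".join(p for p in parts if p))
    return docs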
So the Notion APIs are backwards: store the whole document, return them whole as documents, and let my BERT transformer transform things (or not) to Text. I might delete the whole Text concept now, actually. Well - things that can be embedded need a mixin
Shoot. Notion is tough because documents are really collections of blocks, and it's not obvious when each are updated. So I think my best bet, for now, is just to make each document a list of.
OK, I figured out how to serialize and deserialize Notion API responses to the DB - they're just JSON.
OK. Now my problem is that I would like to save JSON somewhere into the DB. This JSON can become an embeddable document, but because it could become one in multiple ways, it doesn't really make sense to make Embeddable have a JSONField. Also, other entities clearly only have one way to embed things... or do they? Maybe they don't. I'd honestly also prefer to have many embeddings. Honestly, I don't have that much text, and it would be convenient to store it all, even in a way that's duplicated... hundreds of Evernote notes, hundreds of Notion documents, hundreds of Goodreads books - it's really not that much. I can handle it. Pocket content should go to Goodreads if I really want to savor it.
How do I elegantly access notion-py JSON things? Init into object, add a bunch of accessors? Yeah, probably. Better than passing a bunch of junk around.
Source.objects.all()
Source.objects.instance_of(NotionDocument)
Project.objects.instance_of(ArtProject) | Project.objects.instance_of(ResearchProject)
Project.objects.filter(Q(ArtProject___artist='T. Turner') | Q(ResearchProject___supervisor='T. Turner'))
doc = NotionDocument.objects.get(url=url)
print(doc.to_plaintext())
Embeddable.objects.create(text=text, source=doc)
make_card_html(doc)
I've come real far, but now I need to scrape and store all nested documents in a doc JSON - but not nested pages.
{'id': 'd010fa67-abf5-4cda-9d98-7d1365032145',
'version': 82,
'type': 'text',
'properties': {'title': [['by 2030 most people will be mindless']]},
'content': ['3f86a76a-5c41-4899-875b-462359bbe00f',
'506ca819-ade9-4d6f-86f5-f7f978c7be4f',
'fb7fe0fd-ab2f-49bf-a9ac-456ed1660a10'],
'created_by': '3c6b94c0-680d-4476-90bd-7b76b5af5fb0',
'created_time': 1575909060000,
'last_edited_by': '3c6b94c0-680d-4476-90bd-7b76b5af5fb0',
'last_edited_time': 1575909300000,
'parent_id': 'cef7fe8f-39bd-4d6f-a24d-244cb0252f75',
'parent_table': 'block',
'alive': True,
'created_by_table': 'notion_user',
'created_by_id': '3c6b94c0-680d-4476-90bd-7b76b5af5fb0',
'last_edited_by_table': 'notion_user',
'last_edited_by_id': '3c6b94c0-680d-4476-90bd-7b76b5af5fb0'}
^ Example child with "content". If I fetch all this content, what are the types?
Hmm. If I flatten the content, it'll look less like its source, and eventually I'll have to rewrite all this. I won't be able to handle complicated pages with multiple columns, for example.
If I do flatten it, I simplify logic for myself in the next method, but... I'm not reflecting the original structure very well, which is a principle I'd like to stick to if possible.
What's the best way to unfold a tree structure? Build the tree with DFS? Explore each node? Then topo sort to lay them all out.
I want to go to the bathroom, get a bite, and then retry this.
What I do now: unfold and flatten the tree with BFS.
What I want to do: it's a bunch of list trees. Those will be easy to flatten once I have the whole structure.
How to do it: add nodes to visit in DFS order to a list, if I hit a string node insert its content right afterwards and continue, figure out how to build up tree structure after. Maybe pass 1 builds dictionary of content to replace,
fuck yeahhhh!! got it, and did it the right way!!
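Roughly, the shape of it (a reconstruction, not the exact code; block attributes are notion-py style):

def flatten_blocks(block, out):
    # DFS: collect this block's text, then recurse into its children,
    # skipping nested pages (those are separate documents)
    title = getattr(block, "title", None)
    if title:
        out.append(title)
    for child in getattr(block, "children", []):
        if child.type != "page":
            flatten_blocks(child, out)
    return out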
https://arxiv.org/abs/1909.01066
http://dev.evernote.com/doc/start/python.php
https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py
https://django-polymorphic.readthedocs.io/en/stable/quickstart.html
OK. Free. Now, find a relaxing spot for a couple hours, then catch the last bits of papers, then dinner at 7, then W&B drinks after, then maybe a party or home to program. What a nice night! Caffeine??? Yes.
{'id': '688cf6ee-9753-4555-9cdc-fb8d28a50e08', 'version': 55, 'type': 'text', 'properties': {'title': [['‣', [['p', 'cab1d75a-1e31-4306-8907-f8bd11a9d140']]]]}}
{'id': '30397187-ce48-4cf3-bec6-4ee065fac2e8', 'version': 16, 'type': 'page', 'properties': {'title': [['Copy of '], ['TOP LEVEL PAGE']]}, 'content': ['38b23acb-e304-4252-9915-4613f3defc98']}
Real world problem: there are many types of things. Some don't work.
Goodreads: API for book, author, series, quotes, save to DB.
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> vectorizer.fit_transform(corpus)
Plotly docs: https://plot.ly/javascript/reference/#scatter-marker-line-colorscale
Hey, it's not looking half bad!
MILESTONE: working semantic projections + visual explorer
Random intermediate notes...
interface:
to initialize the command, you need to specify the DocumentContainers.
scrape_notion (dbs)
scrape_evernote (notebooks)
scrape_goodreads (shelves)
Systematically generate great ideas.
For what? I want to be wise. For what? I want to be sure I'm spending my life well!
Worldview idea: collect facts, anecdotes, make it easy to build opinions that have more or less support from facts/anecdotes
Support building idea on top of a collection of fragments, and building other, even competing ideas on top of similar fragments. Should be able to see underlying fragments too.
Graph-based approximate NN search, implemented in Rust with Python bindings:
12/6: executable writing system brainstorm
My long-term goal: improve wisdom. To that end, I want a reading and writing queue, and a writing system. I would feel good about this morning if I get that done.
Hmm. A few branches:
- Try the system out! Write out some thought fragments from The Art of Loving.
- Smoother interface for querying related content while writing.
- Goodreads integration.
- UMAP everything.
- Backlinks.
- Export into Anki more smoothly.
- Task processor. (Spark?!) Make everything happen more quickly and reliably.
Conversation on creativity and how great people think:
Austin: dialog between stages. Research, synthesize, repeat.
IDEO "scales creativity' by harshly delimiting things.
Groups: people naturally go through diff phases at diff times. Wastes time to force everyone to be in sync.
Feynman. "When you come across problems, or tools that will solve your problem, keep it in your head. And when you get a new problem, try your whole collection of tools."
Austin: same, but for systems. How do they all fit together. Try to apply those to new things. Some things are obvious only through the lens of a new system.
Paul Graham. Defining things in opposition to each other. Asks the right question, explores. Probably writes a ton, then edits down a ton.
Scott Alexander, on writing. "Try so many things that you eventually do something without willpower or exertion."
Names: overview effect (astronauts, LSD, monks reaching nirvana), wisdom system, worldview
12/5: ship templates for Anki
Features wishlist: scrape Goodreads lists & quotes, the Twitter Idea Stream, the Kindle highlights Idea Stream (extract with this?).
What do I want? In my free time, I want confidence that I am spending my time well, and growing wiser. To that end, I'd like a process that works with me.
Major obstacle now: Anki is no fun. Make those templates work.
Next obstacle: making writing a habit. Where do I write my book notes? In bed? No, it should be more serious than that. Maybe in the morning, before my shower? But then I won't be thinking about work. Perhaps that's OK on days where I'm doing straightforward things.
For now, let's just get that Anki feature working, then test it on myself and with Taylor.
Bonus goal: export into Anki automatically.
Start: 3:48pm.
Interface: export to Anki command. Preloads templates. Queries all pages, populates templates, makes Anki cards, writes to document for now.
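A minimal sketch of that command with genanki (one way to write .apkg files; pages is a placeholder for the queried Notion pages):

import genanki

model = genanki.Model(
    1607392319, "Worldview",
    fields=[{"name": "Front"}, {"name": "Back"}],
    templates=[{"name": "Card 1", "qfmt": "{{Front}}",
                "afmt": "{{FrontSide}}<hr id=answer>{{Back}}"}])
deck = genanki.Deck(2059400110, "Worldview")
for page in pages:
    deck.add_note(genanki.Note(model=model, fields=[page.title, page.body]))
genanki.Package(deck).write_to_file("worldview.apkg")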
Hmm. I currently open a lot of loops and don't really close them.
What should I do with potential learning materials that I don't need right now? Could add them to learning docs.
12/2/19: Anki templates!
Improve the Anki experience a bit.
I wonder - could I extract all of this to its own program?
11/30/19: Backlinks
Goal 1: backlinks between Worldview notes.
Goal 2: backlinks into People notes!!
First of all: can I detect links? If not, fork this program, or edit it directly. Use this URL: https://www.notion.so/jasonbenn/Status-is-zero-sum-but-we-can-invent-many-status-hierarchies-36c730c458dd4355837fe747baa2991f
What are the types of tables in Notion? The tables we currently support are block (via the Block class and its subclasses, corresponding to different types of blocks), space (via the Space class), collection (via the Collection class), collection_view (via CollectionView and subclasses), and notion_user (via the User class).
Does my link exist in these classes? Explore the block class.
Print the raw data from the block table for my mystery ID.
What other kinds of tables exist?
Nice! Found the link ID buried in the blocks with the right caret character!
Backlinks should always be text. Can I write these into another note? Text is better because it doesn't create elements in the sidebar dropdowns. It also is easier to incorporate into sentences.
This would be a nontrivial amount of work...
11/28-11/30/19: Similarity suggestions!
19-11-27 W: similarity suggestions
Holy COW, I can't believe how much better cosine similarity is than a simple dot product (for NLP - image interpretability research uses dot products). Cosine similarity normalizes their lengths, right?
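Yes - it's just the dot product after normalizing lengths, so only direction matters:

import numpy as np

def cosine_similarity(a, b):
    # long documents no longer win just by having bigger vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))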
These suggestions are just so awesome.
Graph embeddings.
How do you surface unique things?
You could probably quantify how unique it is.
Larger context.
What do I think of "see everything all the time"?
I want supporting data, opposing data, extensions of the idea, opposing viewpoint, trends that inform it. Without filtering, all the other information would be overwhelming.
Large shaded area of everything this article touches on.
NLP will lead to totally different UI/UX:
solicit reasonable inputs
"Reasonable interface to fuzziness"
Theories of the amount of notes I want to see:
- See everything all the time: Andy, David. Presentation/design problem.
- See everything, except the obviously irrelevant stuff.
- See only the relevant stuff.
?: Not convinced this is helpful.
?: How would it feel? Overwhelming?
+: Helps you write the note best. Especially if forced to reconcile opposing connections. You might gloss them over if you were focusing on superfluous, weak connections.
+: Similar to Cal Newport's/Ryan Holiday's blitz-blogging strategies. They bring the notes.
-: miss some especially creative connections.
+: that's what Anki is for.
Tweet: I'm glad you guys are digging into this, because it's the path I feel most uncertain about.
What is thought? Read Wikipedia summaries of great books on thought. Cicero, etc. Philosophy of thought?
Build argument mapping dataset! Classify as supporting/opposing.
https://paperswithcode.com/task/link-prediction
https://paperswithcode.com/area/graphs/representation-learning
https://paperswithcode.com/paper/interacte-improving-convolution-based
https://arxiv.org/pdf/1911.00219.pdf
https://arxiv.org/pdf/1903.12287v3.pdf
https://arxiv.org/pdf/1906.04239v1.pdf
https://github.com/Sujit-O/pykg2vec
Potential datasets for finding opposing viewpoint directions: argument mapping, rumor classification research, RumourEval, rumour stance prediction and rumour verification,
https://www.psychologytoday.com/us/blog/thoughts-thinking
https://www.psychologytoday.com/us/blog/thoughts-thinking/201811/improving-critical-thinking-through-argument-mapping
https://paperswithcode.com/task/relation-extraction
https://paperswithcode.com/task/citation-intent-classification
https://paperswithcode.com/task/text-categorization
https://paperswithcode.com/task/document-classification
https://paperswithcode.com/task/topic-models
https://paperswithcode.com/task/textual-analogy-parsing
https://paperswithcode.com/task/entity-linking
https://paperswithcode.com/task/relation-classification
https://paperswithcode.com/task/opinion-mining
https://paperswithcode.com/task/argument-mining
https://paperswithcode.com/task/relational-reasoning
https://paperswithcode.com/task/joint-entity-and-relation-extraction
https://arxiv.org/pdf/1704.07221v1.pdf
https://www.aclweb.org/anthology/S19-2147.pdf
/Users/jasonbenn/.pyenv/versions/worldview/lib/python3.7/site-packages/bert_serving/client/__init__.py:290:
UserWarning: server does not put a restriction on "max_seq_len", it will determine
"max_seq_len" dynamically according to the sequences in the batch. you can restrict
the sequence length on the client side for better efficiency
warnings.warn('server does not put a restriction on "max_seq_len", '
Wow, these similarities are great. Really, really strong.
Getting doc
SIMILAR TO: If I am the boldest version of myself, what would I be doing with my life?
similarity range: 0.975 - 0.850
-------------------
similarity: 0.975
# Worldview > An autonomous & highly satisfying career archetype: independent research aimed at influencing big companies
-------------------
similarity: 0.975
# Questions > What is my best work self?
-------------------
similarity: 0.975
# Processes > Ideas: recording & sharing
-------------------
similarity: 0.975
# Processes > Prioritizing
-------------------
similarity: 0.973
# Processes > Staying focused
MILESTONE: BERT-powered semantic similarity
11/26/19: Andy's tweet, knowledge categories, Idea Stream, valid goals
Existing landscape of tools for thought.
Most relevant tweet of all time, responded:
Question: when and how do I translate a thought into Twitter to get feedback from the hivemind? Do I create a boolean field? Do I write a program to automatically post them to Twitter sometimes?
Buster Benson has been building and revising his worldview incrementally since 2012! I love how he classifies them by strength and how they might be falsified.
All knowledge you wish to incorporate into your Worldview can be categorized as a Fact, a Belief, a Process, or a Question. Questions grow up into Facts and Beliefs. All can reference each other.
Facts must be Strongly Connected to land in your Anki (2+ backlinks). Otherwise, they're not notable enough to be worth memorizing? Not sure yet about this one - perhaps wait until I encounter lots of noise I don't care about before implementing it.
Beliefs, processes, and questions always export to your Anki.
People must also be Strongly Connected (2+ references in my journal) to land in Anki.
What about Ideas? Is that worth elevating to a concept I think about repeatedly? I imagine flipping through my ideas deck would be really fun and inspiring. Putting them in Anki gives me more confidence that I won't forget them... otherwise building all these different processes to check various parts of my Notion would be too hard/I'd get lost.
The Idea Stream:
- classified automatically into fact, opinion, or belief
- connections (of any of the above types) are suggested to the right, with a typeahead at the top to filter. Because there will be at most tens of thousands, just display them all, and sort by relevance. You should also be able to scan through every note yourself if you desire, and add connections as they come up without losing your place.
- clicking on a connection takes you to that note, where either it will be added as a "Source" or it'll let you smoothly start describing the connection.
- optional "retweet" text field on each piece of content lets you describe why you think it's interesting
- snooze button lets you revisit that content at a random future date
- archive button lets you decline to incorporate that content into your worldview
Make connection similarity a tuneable parameter: how similar does content have to be before it's suggested? Could stimulate creativity.
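A sketch of the tuneable band (note_vecs assumed unit-normalized, so the matrix product is cosine similarity):

import numpy as np

def suggest_connections(query_vec, note_vecs, notes, lo=0.80, hi=0.95):
    # widen the band downward to surface more distant, creativity-stimulating connections
    sims = note_vecs @ query_vec
    return [(notes[i], s) for i, s in enumerate(sims) if lo <= s <= hi]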
Valid possible goals:
- Build a worldview
- Connect with the people in your life
- Manage information overload
- Improve conversations with people ("Most people have just 5-10 ideas, only 1 is new or interesting, and those are the main things they talk about")
Possible pain points:
- I don't remember what I read
- I come across good ideas and don't know where to put them, feels like losing them
- My opinions are poorly thought through compared to others
- I'm not interesting enough
- I don't think about other people
Beautifully designed landing page, I know exactly what it does and why I'd want it:
11/22/19: streams of highlights like Gmail
Core parts of my system:
- mapping from content to representation in my DB.
- exporting Notion to Anki. Make this more user-friendly? I would like this to happen more often. Ideally from a web interface.
Let's get the ml-box online more reliably, so that we can do 1. Move keyboard, find a mouse, get the projector hooked up. Gah. That sounds annoying. How bout: make random vectors, save those instead lol.
Work on 1.
Done at 9:47pm.
Definitely should not feel like chores, like email. Or even FB notifs.
Should feel more like Instagram. They're rewards! And maybe they fade after a week or so. Or they come up again, later, randomly, so you don't have to worry about connecting them when they come up.
An Edges app.
Solves the problem of worrying about where to put stuff as you find it.
Roam also solves this problem: you just drop a note.
But: "the more I use Roam, the less organized I feel." (Taylor).
You promise yourself you'll organize everything, but you never do.
I wonder if revisiting them periodically with Anki would change this? Seems unlikely.
Roam's problem: enables too many connections, doesn't force you to prioritize or actually think about them. Also doesn't make you evaluate quality of source or quality of connection.
Oh snap - my Evernote Tweet inbox is this! Similarity on the side:
It's pretty interesting. Definitely feels good to review the thoughts I liked earlier. More and better context would be good, though. Iframes? Kindle context? That would be insane, though - it'd be like implementing a whole Kindle reader, a whole Twitter reader, all in the same app.
I have no desire to connect things, though, I'm too passive atm. Maybe if it were easier...? Eh, idk. Might just be about having to move my hand to click the mouse.
Also, doesn't really feel like I'm building a web of knowledge. When would I review this?
Maybe I would review connections in addition to the whole ideas? Might be interesting.
Stream of highlights since you last checked.
For each, suggest similar opinions. Hotkeys let you either establish links or write something about a link. Can send to either a Thought doc or an Anki flashcard (suggest the deck!).
Great inbox processing software: gmail. Superhuman.
Great software for forwarding/making connections: Notion, Bear, etc.
Mindmaps? Using arrow keys to select/highlight possible connections?
Force diagrams? Each highlight has a constellation of possible connections?
You've got a snippet, you use arrows to page up and down and quickly glance at the contents, and you just start typing, and you're already in the document, and the new snippet is already copied in and linked. And you can keep going and link it to multiple documents, too. Wait - do I actually want to type about some of these things, or just make a connection? Could batch process connections later.
Zoomable UI?
Oh snap, Zapier supports webhooks! That covers: Twitter, Evernote, Pocket.
I can export manually from: MarginNote, Otter, Kindle.
import requests

def handle_webhook(input_data):  # illustrative wrapper; the original note was a fragment
    requests.get("http://157.245.238.241:8909", params=input_data)
    print(input_data)
    return 0
If a piece of information isn't incorporated into an opinion, a process, or my facts base, then what's the point?
Might want to share it.
Might want to connect it in the future to other things.
Might be gathering it for some potential future opinion.
Might be collecting things like that for some purpose. Cool jobs, for example.
Perhaps the ideal process is to just immediately suggest the connection as you're reading it. After all, that is when you have the context about that thing.
Examples of great explanations of tools for thought:
Knowledge platforms:
Trustory: so, so bad. So many uninformed opinions. No credibility whatsoever. I disagree with literally every opinion in their programming community. This is where Quora shines: subject matter experts respond to questions, and questions and answers are heavily moderated and cleaned up.
Allsides: presenting opposing viewpoints. How does it feel to read both at the same time? Confusing - don't know where to click. Nudges me towards center when sometimes facts really do support one side or another. It's pure opinion, but it should be more fact.
Debate is often framed by the phrasing of the question. Case in point: RationalWiki, which purports to offer objective, rational analysis of issues and individuals in the world of politics, business, and media, but which in reality has turned into a Marxist/leftist extremist dunkfest where they shit on anyone to the right of Bernie.
These are all trying to mix the opinions of others. I just want to collect my opinions and make sense of them all, and support/refute them as appropriate.
What we need is a boxscore for each piece of evidence. Maybe that's the layer that could go over everything. Reputation as a service. Reviews, trustworthiness. Different from reach. Relies on inventors, not influencers.
Tweet from BalajiS
As I think about it, this is a new kind of boxscore for a news story.
- First party confirmation or denial
- Primary sources
- Independent confirmations
- Anonymous sources
And if you log in, it shows whether these sources are economically or otherwise aligned with you.
Algorithmic tools to evaluate truth. PageRank was an incredibly good first attempt, because it works like humans do (credibility of the source = how many links).
"Modeled after academic citations" which, as a system, has plenty of problems.
https://twitter.com/c4mer0n/status/1197683865369755648
https://changeaview.com/
https://felonvoting.procon.org/
http://www.parli.co/
https://www.qutee.com/
https://perspectroscope.com/
https://twitter.com/LogicallyHQ
https://changeaview.com/
https://twitter.com/balajis/status/1197696369210904576
https://en.wikipedia.org/wiki/Knowledge_Graph
11/20/19: NLP research areas
Flashcards should have multiple priority (abstraction?) levels, and when you're not prioritizing a deck, it should only show you the highest-level cards. You should be able to shift your focus.
Research interests
- How do I train on structured (Markdown, graphs) data? Melody Dye suggestion: learning from graphs (TransE). Retrain
- I don't just want to identify similar documents, but also those with opposing views. How do I identify (or create) contrary directions in representation space? Do these datasets exist? Scrape AllSides.com? Scrape kialo.com, DAG arguments.
- How do I produce representations of long sequences (books)? Transformer-XL?
Scrape allsides?
Finetune model.
Two classifiers: multiclass vs positive/negative
Collaborative filtering
You could prime it with hints
Goal features
When I stake out an opinion on ML, relevant papers I've saved surface. I can toss papers into my system and know they'll come later when I'm most curious about them. (Scrape abstracts?)
Understands the relationship between papers and people (entity resolution).
Goal tonight: get Goodreads into my DB.
11/17/19: design ideas: notes app + similarity
Design!
11/15/19: embeddings for everything, interested people
Observation: it's really exciting to hear a great idea and put it into my system. I have confidence that I'll think about it more, even if I don't have time right now. I don't have to be preoccupied with forgetting it. I think this system enables me to search more widely for ideas, in the same way that Anki enabled wider learning.
Things that can be embedded: text, embedding
Things that can have embeddings:
Goodreads book: title, description, FKs to authors, FK to series
Goodreads author: name, URL, about
Goodreads series: title, description
Highlight? Source document, highlighted text...?
Some types are composite embeddings:
A book is a weighted average of the book description, author description, and highlights
An article is all highlights
These are themselves records, and could be exploded into their parts
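As a sketch, the composite is just a weighted average of the parts' vectors (weights are guesses to tune):

import numpy as np

def book_embedding(desc_vec, author_vec, highlight_vecs, w=(0.5, 0.2, 0.3)):
    # explode the book into parts, embed each, recombine
    return w[0] * desc_vec + w[1] * author_vec + w[2] * np.mean(highlight_vecs, axis=0)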
What about PDFs/other docs that could live on S3? Should I host those myself, or just reference links that could go down?
Notion. Scrape via API.
Goodreads. Scrape via API.
Kindle Notes & Highlights. Can export through web interface.
Otter transcripts. Can export through web interface!
Pictures of text with my finger pointing at a passage? Eh, just
Interface:
Admin page for each type of parent model. NotionAccount, GoodreadsAccount, KindleAccount, OtterAccount, S3Account. All belong to User.
NotionDatabase belongs to NotionAccount. You can add a DB by URL, then a bunch of NotionDocuments are scraped from that.
GoodreadsAccount: you can add GoodreadsShelf, then click scrape, and it'll find a bunch of books and their highlights.
KindleAccount: just configure, then click scrape. Ruby gem.
ScrapeActions table keeps track of when things were scraped - won't do it again if too recent.
Everything gets
Quick experiment in polymorphism: can I quickly grab all highlights? Can I quickly GROUP_BY/average on source?
What is next? What is the next goal that excites me?
Having a DB full of Goodreads highlights. For now, just have highlights have nullable FKs to Book or NotionDoc.
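The model could be as simple as this (Django sketch; model names hypothetical):

from django.db import models

class Highlight(models.Model):
    text = models.TextField()
    # nullable FKs for now - exactly one should be set per highlight
    book = models.ForeignKey("GoodreadsBook", null=True, blank=True, on_delete=models.CASCADE)
    notion_doc = models.ForeignKey("NotionDocument", null=True, blank=True, on_delete=models.CASCADE)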
Next
- Get Goodread shelves read/to-read into the same space as writing
- Basic queries: read/unread books like this opinion, opinions like this opinion
- Write up my vision, early learnings and circulate to get feedback.
Tommy feels like he has "no thoughts", wants to have more.
Danny works best by talking before writing, buys books even if he's not sure if he'll read them, and reads lots of snippets.
Patrick C buys thousands of books, reads 5 pages at a time at random.
Taylor decides what to read next by being inspired by random quotes from his reading list.
Armand has 3 progressively more specific to-read lists on Goodreads.
Kanjun listens to audiobooks while running.
Allie listens to audiobooks during 5 minute commutes.
Taylor thinks little snippets lower friction of running.
Tiger, James, Tina will think it's cool, maybe have feedback on the approach.
Theme: nobody wants to read whole books.
What if you could zoom to a popular quote, then progressively highlight?
What if you summarized books at varying levels of granularity until you get to your quote? Chapter/section/page/paragraph summaries, up to your chosen quote, then progressively zooming out again afterwards. Zoom in and out on the content.
Danny wants to have a conversation with e.g. Patrick about the quote. That makes me think of this - oh cool, that's related to this that you haven't heard...
Eventually: Andy Matuschak and Michael Nielsen may have feedback.
Melody Dye, NLP scientist at Netflix. Also working on NLP for sequence similarity, also doesn't know how to handle long sequences or structure. She's ingesting scripts, tagging them into genres.
11/14/19: exploring Goodreads API
Pipeline: will need to train, then save UMAP.
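i.e., fit once, persist, then transform new docs into the same map (a sketch; joblib for persistence is an assumption):

import joblib
import umap

reducer = umap.UMAP(metric="cosine").fit(X_train)  # X_train: existing document embeddings
joblib.dump(reducer, "umap_reducer.joblib")
# later: project new documents without refitting
reducer = joblib.load("umap_reducer.joblib")
coords = reducer.transform(X_new)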
Most important next thing
I want to see things I've already read - Goodreads read and unread - show up in the map.
popular_shelves, which includes genre tags.
similar_books, could use as labels.
Probably concat/average with about-author and about-series?
I really like the concept of exploding an embedding into highlights, author, series, book description...! That'd be so fucking cool! On hover, show/highlight an area covering the exploded parts of the book?!
https://www.goodreads.com/review/list/6481511.xml?key=YaQV1WrXtgkB9KYHvrDQ&v=2&shelf=read&per_page=200&page=1
works, but from Python it doesn't always work - seems to return only one result?
11/13/19: Bert → UMAP
I've got a solution - what's the problem?
I forget everything I read, and they mostly don't impact what I believe.
I want a tool to help me understand how all my knowledge fits together.
Main app: put everything into massive DB, documents in S3, compress them with BERT, map them with UMAP.
Uses: reading recommendations, explore the fog of war, find writing you should link, etc. Queryable. Cosine similarity happens via procedure.
MVP:
BERT-as-a-service running on ml-box!
MILESTONE: BERT & UMAP projections
All reading is broken. I read things and barely remember a single fact a week later. The only fix is to explain or talk about the things you read, but most content isn't like that.
The solution is to have all read and written content mapped, such that similar content is next to each other.
Next project: connection suggestions!
Notion-to-text program: for any Notion page, convert all contents to one string.
MVP vectorizer: just use sklearn's TF-IDF, see how it goes.
Later: run DistilBERT on my machine? Its job is to ingest documents of any length and produce an activations vector.
BERT-as-a-service on ml-box.
DB: sqlite. Load everything, do similarity search in memory. Will be a while before I have many thousands of documents and their vectors won't fit in memory.
MVP: copy-paste text into inputs.
Later: feed it just a URL. It'll GET and parse with MercuryParser.
Later: Chrome extension. One-click to compare. Backend responds with Notion links.
Should also work for Notion ←→ Notion connections. Can either pass raw text or a Notion link.
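The whole MVP in a few lines (a sketch; corpus is the plaintext of every Notion page, small enough for memory):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

def most_similar(pasted_text, top_k=5):
    sims = cosine_similarity(vectorizer.transform([pasted_text]), X).ravel()
    return sims.argsort()[::-1][:top_k]  # indices of the closest documents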
Goal: to build an insight machine: that means integrating content that I've saved and written.
collection.get_rows() results in empty rows. Resolves itself when rerunning repeatedly.
Ben recommends GPU ML algorithms, and newspaper3k for parsing.
Worldview metrics (add to a dashboard!)
- The number of ideas made presentable
- The number of new ideas added
- Target: 3 per day
Maybe these should go to separate decks?
MILESTONE: dashboard for my writing funnel
Improving my UMAP projection
11/12/19: text extractors/classifiers
idea: the distant future?
Inputs:
Record more conversations with Otter.
Evernote collects my readings and highlights.
Classifiers:
A program identifies the information worth saving, and/or its type - is it about a person, an idea or opinion, a great question, or a joke for my journal? Fact v opinion v experience classifier.
Another program suggests which Notion content it should be merged into. Language model activations of each new piece of content · language model activations (in DB) of each worldview and/or great question. No similarities? Great - prompt to create a new note.
Attention:
Things to think about more are included into Anki: open questions, my friends and acquaintances.
Content to read is suggested based on opinions that need fleshing out.
Outcomes:
High trajectory: the next time we have a conversation, there's a good chance I've internalized your insights and further developed them
Sample efficiency: nothing worthwhile is forgotten
11/11/19: Notion → Anki
https://www.facebook.com/photo.php?fbid=10101121905470403&set=a.616032894143&type=3
https://graph.facebook.com/10101121905470403/picture?type=normal
https://graph.facebook.com/10101121905470403/picture?type=large
Maybe upload images to my own S3 bucket?
11/10/19: Evernote memex, backlinker, Anki
v1?
- Notion is where I craft my worldview. Evernote can be linked into Notion as sources. These things will spill over into public writing frequently. I've started here: .
- Backlinker is a program that links Notion documents to each other. Just like Andy.
Done
Is this a thing that I could imagine committing to and deriving a lot of value from? I think so.
Would it push me to develop a fuller worldview? Yes, absolutely. I have so much previous writing and content that's actually quite good!
How much time would I need to spend in Anki on the Worldview deck per day for this to work? Maybe I could just sit down and flesh out one at a time every now and then? Having a hard time imagining doing this constantly. But over a long period of time, perhaps this could be really useful.
When would I invest that time? Definitely nights after work. It'd be very satisfying to absorb knowledge from e.g. conversations into my worldview. Perhaps my weekly review could include a brief review of new Evernote content?? That'd be pretty neat!
Would this make me a more interesting person? Oh hell yes, it already has!
I definitely like rereading the note to remind myself of the thought before asking any questions.
Linking Notion documents to each other is actually critical.
MILESTONE: wished for backlinks in Notion, Notion → Anki
Really awesome. It exports the whole thing, and it's all searchable in EN. Better than I could have hoped.
What kind of content should I clip? Full articles? Simplified or complete?
Do highlights help at all?
What about little summaries?
Would mind maps from margin note actually be helpful?
How long would it take to do this, and when would I? Evening commute?
When would I sit down to write and first read related content?? Could I work these in to Anki?
Are there other great streams/sources of information that aren't Anki? Suggestions based on your writing?
11/9/19: Anki MVPs
Main goals:
Hmm. Not compelling. I don't want to think about great research organizations. The times I want to think about this is actually when I'm reading related content, or when I'm wondering what to read next.
Opinions as note titles is a breakthrough. Forces me to defend it, articulate it, explain it.
Opinions as Anki cards is slightly disappointing.
Lots of things I've recently thought about and weakly believe I breeze through.
Some things I actually realize I don't care about. Updated my belief that everyone learns in the same way after a conversation last night with Austin.
Anki is out of date quickly.
What I want is to combine this with my reading list. Building a worldview is to read widely and critically.
RSS feed could be scraped and made sensible with Mercury Parser.
Related notes from Evernote could be inserted at the bottom of notes. Test the MVP. Implementation would copy content from Notion into Evernote, then use the Evernote suggestion API, which can take any number of notes.
Pretty great.
Implications: to get content to appear in my reading list, it has to first be motivated by an opinion.
Related notes from Notion as well.
Long term schedule, where I look at each one seriously: get to review in 1 day, 5 days, 10 days. Max reviews: every 30 days. Ehhh, but could just put them at 1 day.
Short term schedule, where I feel OK about skipping through em: 1 day. Not as good.
Pretty cool. Definitely makes things more interesting to read. Great for sharing.
Not great until I clean up Evernote.
Yep, now these are excellent suggestions... from mostly 2015. I want my suggestions to be high quality, and news is too topical to be high quality.
Perhaps my Evernote should be conversations, books, and timeless articles.
Going out to do research was fun and effective, but I'm not in that kind of mood this weekend.
Outstanding questions:
- Where do I source further reading from? A book list? Could my "reading list" just be me clipping book blurbs/Goodreads summaries into Evernote? One problem is that Evernote tends to suggest longer content over shorter content.
- When? This is a night activity, perhaps my default one, and it can motivate my reading and conversations.
- What's the endgame?
This could also be the way I collect and update thoughts after conversations, which is very satisfying. Eventually, I wish to record my whole day, use a NN to label the facts/interesting thoughts from the transcript, and simply review a queue of those at the end of a day and integrate them into my worldview.
Permanent high status, which helps me surround myself with the world's most brilliant and interesting people, which will help me achieve any goal.
It will also help me raise the aspirations of others and teach them how to do so.
I also just want to be right about the world - to understand it and make genuine insights. It's probably never been possible to have so complex and interconnected a worldview before - I could be at the front lines of thought technology development, and demonstrate what it's capable of.
Compare representations to find the direction.