Abstract
This was a research project in which I embedded my private writings, questions, writing fragments, notes, and reading wishlist into a semantic space using a language model, creating an emergent map of a mind. Such a platform would have many interesting applications, particularly if you shared it with others:
- SYNTOPIC READING: focus on one piece at a time. Read about it from all angles.
- WRITING: I find that writing is an indispensable part of thinking. I also have to ruminate on a subject for several days to do my best writing, especially for a new topic. I make notes of thought fragments, and they're added into a writing inbox automatically.
- REMEMBERING: arbitrary subsets of semantic space can be exported to Anki.
- IDEATING: one generator is combining ideas. Great combinations are not too similar and not too distant - there's a Goldilocks zone that's most fertile. This corresponds to a range of distances in semantic space.
- DOUBLE SYNTOPIC READ-IATING: pairing two complementary topics, like writing and cognition or love and friendship, can also be generative.
- IDENTIFY BLIND SPOTS: multiple projections can exist: one optimized for laying out your content, one for you and a group of friends, and another for everyone's content. If you identify a gap in your semantic space, you could find recommendations from friends who have read things in that space.
- DEEPEN FRIENDSHIPS: with a more complete understanding of where your mind graphs overlap, you can give and receive better recommendations and start more and deeper conversations.
- IMPROVE INTELLECTUAL DIVERSITY: find people with totally complementary knowledge.
- IDENTIFY COLLABORATORS: find people with important overlaps, and connect with them as study buddies, paper reviewers, cofounders, or in relationships.
- IDENTITY SHARING: I might even generate an old-timey map from this, where place names correspond to dominant topic models from that area, and where mountain ranges correspond to the depth of your writing.
This was fun, but I became distracted by AI research and then the Archive, and it fell down my priority list. If you'd like to pick up where I left off, DM me! This document is my stream-of-consciousness Captain's Log.
Table of contents
- Abstract
- Table of contents
- RETIRED in favor of a new research log in Roam.
- Break: off to Australia until mid-January
- MILESTONE: possible use cases
- 12/13: better document representations and/or clustering
- 12/9-12/12: NeurIPS 2019 main conference
- MILESTONE: working semantic projections + visual explorer
- 12/6: executable writing system brainstorm
- 12/5: ship templates for Anki
- 12/2/19: Anki templates!
- 11/30/19: Backlinks
- 11/28-11/30/19: Similarity suggestions!
- MILESTONE: BERT-powered semantic similarity
- 11/26/19: Andy's tweet, knowledge categories, Idea Stream, valid goals
- 11/22/19: streams of highlights like Gmail
- 11/20/19: NLP research areas
- 11/17/19: design ideas: notes app + similarity
- 11/15/19: embeddings for everything, interested people
- 11/14/19: exploring Goodreads API
- 11/13/19: Bert → UMAP
- MILESTONE: BERT & UMAP projections
- MILESTONE: dashboard for my writing funnel
- 11/12/19: text extractors/classifiers
- 11/11/19: Notion → Anki
- 11/10/19: Evernote memex, backlinker, Anki
- 11/9/19: Anki MVPs
RETIRED in favor of a new research log in Roam.
Break: off to Australia until mid-January
Amazing Twitter scraper: https://github.com/twintproject/twint
Contact Misha from the hierarchical topic modeling optimal transport paper.
Ways to build document representations:
Complete app ideas:
- Ways to test the social hypotheses: I can scrape the Goodreads accounts of my friends! In fact, I can get my whole friendlist.
- Chrome extension that lets you build up research lists. Then, you can read about whole topics, grouped together.
- Topic pairings. Choose word clouds to combine.
- ...
The goal is to reliably and quickly accumulate wisdom.
What I really want is to think about one topic at a time, from many sources (Twitter is a source too, many great thinkers aren't book writers), collect/recombine/internalize my perspectives, and integrate my conclusions into a worldview.
Observations:
- Forgetting is the default.
- Write about what you read.
- Articles are great too, not just books.
- Great sources accumulate constantly, and perfect curriculum generation ahead of time is difficult, especially if you consider articles.
- Context switching happens at many granularities.
- Great essays are composed of many great thought fragments; articulating them allows you to test them.
- Great writing requires days of ruminating, unless you already have a deep understanding of the topic.
Query: a few words, then find similar docs and generate reading/remembering/writing inboxes. Let these mix together for a week or two and you'll produce excellent ideas and writing.
The best thinkers on many topics have not written books. They communicate through blogs and Twitter.
Keeping abreast of this content and integrating it into durable wisdom is overwhelming.
Solutions: a curriculum (books), manual tagging (frictionful)
Project it all into semantic space, and only engage with one selection at a time. As you focus on that, your clippings and favorites will accumulate safely in other parts of your mind graph.
Useful platform. A few ideas:
Each dot will either be a Tweet, a book from my Goodreads, an article I clipped into Evernote, or a document I wrote in Notion. In effect my whole knowledge graph will be in here in one ZUI.
MILESTONE: possible use cases
SYNTOPIC READING: focus on one piece at a time. Read about it from all angles.
WRITING: I find that writing is an indispensable part of thinking. I also have to ruminate on a subject for several days to do my best writing, especially for a new topic. I make notes of thought fragments, and they're added into a writing inbox automatically.
REMEMBERING: arbitrary subsets of semantic space can be exported to Anki.
IDEATING: one generator is combining ideas. Great combinations are not too similar and not too distant - there's a Goldilocks zone that's most fertile. This corresponds to a range of distances in semantic space.
DOUBLE SYNTOPIC READ-IATING: pairing two complementary topics, like writing and cognition or love and friendship, can also be generative.
IDENTIFY BLIND SPOTS: multiple projections can exist: one optimized for laying out your content, one for you and a group of friends, and another for everyone's content. If you identify a gap in your semantic space, you could find recommendations from friends who have read things in that space.
DEEPEN FRIENDSHIPS: with a more complete understanding of where your mind graphs overlap, you can give and receive better recommendations and start more and deeper conversations.
IMPROVE INTELLECTUAL DIVERSITY: find people with totally complementary knowledge.
IDENTIFY COLLABORATORS: find people with important overlaps, and connect with them as study buddies, paper reviewers, cofounders, or in relationships.
Emergent organization system of your mind. Platform with many interesting applications.
IDENTITY SHARING: I might even generate an old-timey map from this, where place names correspond to dominant topic models from that area, and where mountain ranges correspond to the depth of your writing.
Compared to Roam: this is a meta layer over all document apps; anything Roam can export will work. I'm very careful not to build a notes app - I've written a suite of data migration tools instead. All links open in their respective apps, and this app updates seamlessly where possible.
PROBLEMS: documents are often related in a graph, not just semantically - how to create embeddings that represent this? Just concatenate a topic model to a graph representation and project it?
Modules approach? Amazing Marvin style? Teach people to use each one?
12/13: better document representations and/or clustering
https://github.com/IBM/HOTT/blob/master/distances.py
https://github.com/IBM/HOTT/blob/master/hott.py
https://github.com/IBM/HOTT/blob/master/knn_classifier.py
https://github.com/IBM/HOTT/blob/master/main.py
Questions:
- what is a topic model, and how do I use it?
- can i project topic models down with UMAP?
HOTT paper description outline:
- Topic model: a topic is a distribution over a vocabulary
- Document reps: distribution over topics.
- WMD, but between topics in a doc instead of words in the doc, makes sense.
ot.emd2(p, q, C)
p, q are 1D histograms (sum to 1 and positive). C is the ground cost matrix.
In POT, most functions that solve OT or regularized OT problems have two versions that return either the OT matrix or the value of the optimal solution. For instance, ot.emd returns the OT matrix and ot.emd2 returns the Wasserstein distance. This approach has been implemented in practice for all solvers that return an OT matrix (even Gromov-Wasserstein).
[method(doc, x, C) for x in X_train.T]
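To make that concrete, a minimal sketch of the HOTT distance with POT (assuming p and q are topic histograms and C is a precomputed topic-to-topic cost matrix):

import ot  # POT: Python Optimal Transport

def hott_distance(p, q, C):
    # p, q: 1D topic histograms (positive, sum to 1); C: ground cost between topics
    return ot.emd2(p, q, C)  # returns the optimal transport cost as a float

# distances to every training doc, as above: [hott_distance(doc, x, C) for x in X_train.T]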
Word compositionality
Gensim. Try to detect bigrams and trigrams.
from gensim.models.phrases import Phrases, Phraser
tokenized_train = [t.split() for t in x_train]
phrases = Phrases(tokenized_train)
bigram = Phraser(phrases)  # detects bigrams
trigram = Phraser(Phrases(bigram[tokenized_train]))  # a second pass catches trigrams
Oooh, I really ought to use sense2vec to preprocess this stuff, not word2vec.
I really want to do sense2vec → topic modeling → umap.
After that, might try sense2vec → topic modeling → WMD but for topic senses → umap!
For visuals: an activation atlas of topic model senses, and make it a ZUI. Whoa, this will be amazing. Then: submit to ACL?
Wow: to generate
spacy pretrain /path/to/data.jsonl en_vectors_web_lg /path/to/output
--batch-size 3000 --max-length 256 --depth 8 --embed-rows 10000
--width 128 --use-vectors -gpu 0
sense2vec: download other parts:
https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2019_lg.tar.gz.001
https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2019_lg.tar.gz.002
https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2019_lg.tar.gz.003
cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz
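After untarring, loading it should look roughly like this (a sketch based on the sense2vec README; the path is wherever the tarball extracts):

from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("./s2v_reddit_2019_lg")
query = "natural_language_processing|NOUN"
assert query in s2v
print(s2v.most_similar(query, n=5))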
topics = lda_model.show_topics(formatted=False)
OK. My corpus is far too small to generate realistic topics. I need to build a more comprehensive topic model on a very large dataset. Wonder if I could find a pretrained one.
Wrong - I just had the number of topics set to 4!
Seems like it's about 20, but my dataset was real noisy. Get more data, run again later.
[(0,
'0.000*"cansing" + 0.000*"code" + 0.000*"jasonm" + 0.000*"anddrawing" + '
'0.000*"aprivacy" + 0.000*"arcalineahttps" + 0.000*"artificialwomb" + '
'0.000*"atime" + 0.000*"jamesv" + 0.000*"ibuilthappene"'),
(1,
'0.000*"cansing" + 0.000*"code" + 0.000*"jasonm" + 0.000*"anddrawing" + '
'0.000*"aprivacy" + 0.000*"arcalineahttps" + 0.000*"artificialwomb" + '
'0.000*"atime" + 0.000*"jamesv" + 0.000*"ibuilthappene"'),
(2,
'0.000*"cansing" + 0.000*"code" + 0.000*"jasonm" + 0.000*"anddrawing" + '
'0.000*"aprivacy" + 0.000*"arcalineahttps" + 0.000*"artificialwomb" + '
'0.000*"atime" + 0.000*"jamesv" + 0.000*"ibuilthappene"'),
...
Whelp, looks like coherence score isn't everything. There's a bunch of junk in here. Hmm, they're all zero probas, too.
To clean these up: check plaintext of https://www.notion.so/jasonbenn/Jay-G-5901c87500d34106894bf15e50f359a1, clean out any HTML/link junk, then rerun topic modeling and see what else looks weird. James V, too.
Hmm. All these topics are pretty bad. I wonder what's happened. Probably just a ton of noise.
Will need to capture higher LDA coherence before my map makes much sense, unfortunately.
But let's proceed to sense2vec anyway.
Another thing to consider trying: TopMine.
Actually, this deserves more research.
Automated Phrase Mining from Massive Text Corpora: https://github.com/shangjingbo1226/AutoPhrase
Research Jiawei Han's publications: he's got an H-Index of 171 and works on graphs and text representations.
To do this, I just need to manually make them myself, then pass them, along with other lemmatized concepts, to corpora.Dictionary. (From https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/, part 11.)
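For reference, a minimal version of that flow (docs here is assumed to be a list of token lists, after the Phraser pass):

from gensim import corpora
from gensim.models import LdaModel

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=2, no_above=0.5)  # drop rare junk and near-ubiquitous terms
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=10)
topics = lda_model.show_topics(formatted=False)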
Goal:
Documents: should be connected semantically and by graphs. Some of them reference each other. 3m to find papers.
Enriching BERT with Knowledge Graph Embeddings for Document Classification: classifying books with cover blurbs, metadata, and knowledge graph embeddings for author information! Holy shit!
Message Passing Attention Networks for Document Understanding: we represent documents as word co-occurrence networks and propose an application of the message passing framework to NLP, the Message Passing Attention network for Document understanding (MPAD). Includes code.
Document representations for docs - could be tweets, could be 100pg articles
Current: sense2vec → topic model (divide long articles/books to chapters? make artificial summaries using quotes?)
Alternative: anything using language models?
Seeking: long document representations?
Graph representation library DGL. Timo
Guided Similarity Separation for Image Retrieval: they used graphs to identify related images, and used those in embeddings, somehow.
Cluster documents.
UMAP straight on topic models
WMD → UMAP
People to talk to about document representations:
- Luke de Oliveira: summarization with language models at Twilio.
- Any recommendations for large-scale document summaries?
- doc2vec, topic models,
- Timo Denk: BERTGrid, saves structure. Char-level, embedding for chars.
- Language models are trained on flat text, would be nice to train one on Markdown. There's so much structure thrown away. Fine tuning BERT?
- Where's your code?
How does UMAP/t-SNE straight on topic models look?
5m: close all tabs.
Compress outstanding questions:
- Is t-SNE alone the cause of that beautiful clustering?
- Why are my topic models bad? Where are bad topics coming from?
- What topics have each of my documents learned? Can that help me make better topics?
- How do I learn senses relevant to my dataset?
- Should I make a pitch deck first, or prototype the focusing thing?
Another idea: write the history of what I've wanted to build, and summarize how it's evolved over time, and what happened to each prototype.
Right now, it helps manage cognitive overload by letting you track open/closed loops in your brain without having to manage classification.
https://plot.ly/javascript/zoom-events/
https://plot.ly/javascript/lasso-selection/
On the right: newest content. For now, just make it the last 5 entries.
Nicer hovertemplate in plotly.
Man, huge progress today.
Let's try rescraping a page - why are properties coming through like this? {')qh{': [['Yes']], "rh!'": [['Life']], 'title': [['Wisdom is mostly an aesthetic']]}
OK, they were slugified.
Writing inbox
Reading inbox
Written
Worldview content that's "presentable"
Read
Read vs unread: whether it had any highlights.
12/9-12/12: NeurIPS 2019 main conference
Roadmap:
I want documents from Notion, Evernote, and Goodreads.
Then, I want document embeddings (weighted average BERT reps, tf-idf).
Then, document embeddings to UMAP.
Finally, a web interface to view all this.
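A sketch of the whole roadmap in a few lines, with TF-IDF standing in for BERT reps for now (documents is a placeholder for everything scraped so far):

import umap
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [doc.to_plaintext() for doc in documents]
X = TfidfVectorizer(max_features=5000).fit_transform(texts)
coords = umap.UMAP(n_components=2, metric="cosine").fit_transform(X)
# coords is (n_docs, 2): feed it to the plotly scatter in the web interface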
Once I have small-scale control over my thoughts and ideas, what high-level structures feel best? Focusing on a single area for an entire month? 1 per week? Or rotating through 4, 2 days each? So many possibilities!
What ideas might be paired? Could intentionally create bridges between important fields, and fill in gaps in my knowledge graph!
Program your attention.
Knowledge is zoomable. XY is projection, Z is density. PhD is achieving some peak. Almost like you're growing a mountain range of knowledge. Could fly through terrain, haha. Old, foundational readings make up base of mountains.
Idea: stick with a content area for as long as you're having good ideas.
Good ideas = ones rated highly in a distributed science sense.
Well, it's interesting. Sometimes the document is made of many components (Goodreads books = author + series + book description + 3 highlights), sometimes it's just an entire document.
I think it'd be interesting to store information as closely to its original form as possible - which is different entities that are related to each other - and then provide APIs that produce documents from these representations. And there can be many APIs for making documents. The Goodreads API set could include various combinations of sources. Notion might include or exclude the title or the backlinks. Anything is possible! All that matters is that document APIs take no arguments, read from the DB, and produce lists of strings.
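A sketch of what one such document API could look like (GoodreadsBook is a hypothetical model name):

def goodreads_book_documents():
    # no arguments, reads the DB, returns a list of strings
    docs = []
    for book in GoodreadsBook.objects.all():
        parts = [book.title, book.description] + [h.text for h in book.highlights.all()]
        docs.append("\n".join(p for p in parts if p))
    return docs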
So the Notion APIs are backwards: store the whole document, return them whole as documents, and let my BERT transformer transform things (or not) to Text. I might delete the whole Text concept now, actually. Well - things that can be embedded need a mixin
Shoot. Notion is tough because documents are really collections of blocks, and it's not obvious when each are updated. So I think my best bet, for now, is just to make each document a list of.
OK, I figured out how to serialize and deserialize Notion API responses to the DB - they're just JSON.
OK. Now my problem is that I would like to save JSON somewhere into the DB. This JSON can become an embeddable document, but because it could become one in multiple ways, it doesn't really make sense to make Embeddable have a JSONField. Also, other entities clearly only have one way to embed things... or do they? Maybe they don't. I'd honestly also prefer to have many embeddings. Honestly, I don't have that much text, and it would be convenient to store it all, even in a way that's duplicated... hundreds of Evernote notes, hundreds of Notion documents, hundreds of Goodreads books - it's really not that much. I can handle it. Pocket content should go to Goodreads if I really want to savor it.
How do I elegantly access notion-py JSON things? Init into object, add a bunch of accessors? Yeah, probably. Better than passing a bunch of junk around.
Source.objects.all()
Source.objects.instance_of(NotionDocument)
Project.objects.instance_of(ArtProject) | Project.objects.instance_of(ResearchProject)
Project.objects.filter(Q(ArtProject___artist='T. Turner') | Q(ResearchProject___supervisor='T. Turner'))
doc = NotionDocument.objects.get(url=url)
print(doc.to_plaintext())
Embeddable.objects.create(text=text, source=doc)
make_card_html(doc)
I've come real far, but now I need to scrape and store all nested documents in a doc JSON - but not nested pages.
{'id': 'd010fa67-abf5-4cda-9d98-7d1365032145',
'version': 82,
'type': 'text',
'properties': {'title': [['by 2030 most people will be mindless']]},
'content': ['3f86a76a-5c41-4899-875b-462359bbe00f',
'506ca819-ade9-4d6f-86f5-f7f978c7be4f',
'fb7fe0fd-ab2f-49bf-a9ac-456ed1660a10'],
'created_by': '3c6b94c0-680d-4476-90bd-7b76b5af5fb0',
'created_time': 1575909060000,
'last_edited_by': '3c6b94c0-680d-4476-90bd-7b76b5af5fb0',
'last_edited_time': 1575909300000,
'parent_id': 'cef7fe8f-39bd-4d6f-a24d-244cb0252f75',
'parent_table': 'block',
'alive': True,
'created_by_table': 'notion_user',
'created_by_id': '3c6b94c0-680d-4476-90bd-7b76b5af5fb0',
'last_edited_by_table': 'notion_user',
'last_edited_by_id': '3c6b94c0-680d-4476-90bd-7b76b5af5fb0'}
^ Example child with "content". If I fetch all this content, what are the types?
Hmm. If I flatten the content, it'll look less like its source, and eventually I'll have to rewrite all this. I won't be able to handle complicated pages with multiple columns, for example.
If I do flatten it, I simplify logic for myself in the next method, but... I'm not reflecting the original structure very well, which is a principle I'd like to stick to if possible.
What's the best way to unfold a tree structure? Build the tree with DFS? Explore each node? Then topo sort to lay them all out.
I want to go to the bathroom, get a bite, and then retry this.
What I do now: unfold and flatten the tree with BFS.
What I want to do: it's a bunch of list trees. Those will be easy to flatten once I have the whole structure.
How to do it: add nodes to visit in DFS order to a list, if I hit a string node insert its content right afterwards and continue, figure out how to build up tree structure after. Maybe pass 1 builds dictionary of content to replace,
fuck yeahhhh!! got it, and did it the right way!!
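Roughly, the shape of it (a reconstruction, not the exact code; block attributes are notion-py style):

def flatten_blocks(block, out):
    # DFS: collect this block's text, then recurse into its children,
    # skipping nested pages (those are separate documents)
    title = getattr(block, "title", None)
    if title:
        out.append(title)
    for child in getattr(block, "children", []):
        if child.type != "page":
            flatten_blocks(child, out)
    return out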
https://arxiv.org/abs/1909.01066
http://dev.evernote.com/doc/start/python.php
https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py
https://django-polymorphic.readthedocs.io/en/stable/quickstart.html
OK. Free. Now, find a relaxing spot for a couple hours, then catch the last bits of papers, then dinner at 7, then W&B drinks after, then maybe a party or home to program. What a nice night! Caffeine??? Yes.
{'id': '688cf6ee-9753-4555-9cdc-fb8d28a50e08', 'version': 55, 'type': 'text', 'properties': {'title': [['‣', [['p', 'cab1d75a-1e31-4306-8907-f8bd11a9d140']]]]}}
{'id': '30397187-ce48-4cf3-bec6-4ee065fac2e8', 'version': 16, 'type': 'page', 'properties': {'title': [['Copy of '], ['TOP LEVEL PAGE']]}, 'content': ['38b23acb-e304-4252-9915-4613f3defc98']}
Real world problem: there are many types of things. Some don't work.
Goodreads: API for book, author, series, quotes, save to DB.
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> vectorizer.fit_transform(corpus)
Plotly docs: https://plot.ly/javascript/reference/#scatter-marker-line-colorscale
Hey, it's not looking half bad!
MILESTONE: working semantic projections + visual explorer
Random intermediate notes...
interface:
to initialize the command, you need to specify the DocumentContainers.
scrape_notion (dbs)
scrape_evernote (notebooks)
scrape_goodreads (shelves)
Systematically generate great ideas.
For what? I want to be wise. For what? I want to be sure I'm spending my life well!
Worldview idea: collect facts, anecdotes, make it easy to build opinions that have more or less support from facts/anecdotes
Support building idea on top of a collection of fragments, and building other, even competing ideas on top of similar fragments. Should be able to see underlying fragments too.
Graph-based approximate NN search, implemented in Rust with Python bindings:
12/6: executable writing system brainstorm
My long-term goal: improve wisdom. To that end, I want a reading and writing queue, and a writing system. I would feel good about this morning if I get that done.
Hmm. A few branches:
- Try the system out! Write out some thought fragments from The Art of Loving.
- Smoother interface for querying related content while writing.
- Goodreads integration.
- UMAP everything.
- Backlinks.
- Export into Anki more smoothly.
- Task processor. (Spark?!) Make everything happen more quickly and reliably.
Conversation on creativity and how great people think:
Austin: dialog between stages. Research, synthesize, repeat.
IDEO "scales creativity' by harshly delimiting things.
Groups: people naturally go through diff phases at diff times. Wastes time to force everyone to be in sync.
Feynman. "When you come across problems, or tools that will solve your problem, keep it in your head. And when you get a new problem, try your whole collection of tools."
Austin: same, but for systems. How do they all fit together. Try to apply those to new things. Some things are obvious only through the lens of a new system.
Paul Graham. Defining things in opposition to each other. Asks the right question, explores. Probably writes a ton, then edits down a ton.
Scott Alexander, on writing. "Try so many things that you eventually do something without willpower or exertion."
Names: overview effect (astronauts, LSD, monks reaching nirvana), wisdom system, worldview
12/5: ship templates for Anki
Features wishlist: scrape Goodreads lists & quotes, the Twitter Idea Stream, the Kindle highlights Idea Stream (extract with this?).
What do I want? In my free time, I want confidence that I am spending my time well, and growing wiser. To that end, I'd like a process that works with me.
Major obstacle now: Anki is no fun. Make those templates work.
Next obstacle: making writing a habit. Where do I write my book notes? In bed? No, it should be more serious than that. Maybe in the morning, before my shower? But then I won't be thinking about work. Perhaps that's OK on days where I'm doing straightforward things.
For now, let's just get that Anki feature working, then test it on myself and with Taylor.
Bonus goal: export into Anki automatically.
Start: 3:48pm.
Interface: export to Anki command. Preloads templates. Queries all pages, populates templates, makes Anki cards, writes to document for now.
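A minimal sketch of that command with genanki (one way to write .apkg files; pages is a placeholder for the queried Notion pages):

import genanki

model = genanki.Model(
    1607392319, "Worldview",
    fields=[{"name": "Front"}, {"name": "Back"}],
    templates=[{"name": "Card 1", "qfmt": "{{Front}}",
                "afmt": "{{FrontSide}}<hr id=answer>{{Back}}"}])
deck = genanki.Deck(2059400110, "Worldview")
for page in pages:
    deck.add_note(genanki.Note(model=model, fields=[page.title, page.body]))
genanki.Package(deck).write_to_file("worldview.apkg")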
Hmm. I currently open a lot of loops and don't really close them.
What should I do with potential learning materials that I don't need right now? Could add them to learning docs.
12/2/19: Anki templates!
Improve the Anki experience a bit.
I wonder - could I extract all of this to its own program?
11/30/19: Backlinks
Goal 1: backlinks between Worldview notes.
Goal 2: backlinks into People notes!!
First of all: can I detect links? If not, fork this program, or edit it directly. Use this URL: https://www.notion.so/jasonbenn/Status-is-zero-sum-but-we-can-invent-many-status-hierarchies-36c730c458dd4355837fe747baa2991f
What are the types of tables in Notion? The tables we currently support are block (via the Block class and its subclasses, corresponding to different types of blocks), space (via the Space class), collection (via the Collection class), collection_view (via CollectionView and subclasses), and notion_user (via the User class).
Does my link exist in these classes? Explore the block class.
Print the raw data from the block table for my mystery ID.
What other kinds of tables exist?
Nice! Found the link ID buried in the blocks with the right caret character!
Backlinks should always be text. Can I write these into another note? Text is better because it doesn't create elements in the sidebar dropdowns. It also is easier to incorporate into sentences.
This would be a nontrivial amount of work...
11/28-11/30/19: Similarity suggestions!
19-11-27 W: similarity suggestions
Holy COW, I can't believe how much better cosine similarity is than a simple dot product (for NLP - image interpretability research uses dot products). Cosine similarity normalizes their lengths, right?
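Yes - it's just the dot product after normalizing lengths, so only direction matters:

import numpy as np

def cosine_similarity(a, b):
    # long documents no longer win just by having bigger vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))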
These suggestions are just so awesome.
Graph embeddings.
How do you surface unique things?
You could probably quantify how unique it is.
Larger context.
What do I think of "see everything all the time"?
I want supporting data, opposing data, extensions of the idea, opposing viewpoint, trends that inform it. Without filtering, all the other information would be overwhelming.
Large shaded area of everything this article touches on.
NLP will lead to totally different UI/UX:
solicit reasonable inputs
"Reasonable interface to fuzziness"
Theories of the amount of notes I want to see:
- See everything all the time: Andy, David. Presentation/design problem.
- See everything, except the obviously irrelevant stuff.
- See only the relevant stuff.
?: Not convinced this is helpful.
?: How would it feel? Overwhelming?
+: Helps you write the note best. Especially if forced to reconcile opposing connections. You might gloss them over if you were focusing on superfluous, weak connections.
+: Similar to Cal Newport's/Ryan Holiday's blitz-blogging strategies. They bring the notes.
-: miss some especially creative connections.
+: that's what Anki is for.
Tweet: I'm glad you guys are digging into this, because it's the path I feel most uncertain about.
What is thought? Read Wikipedia summaries of great books on thought. Cicero, etc. Philosophy of thought?
Build argument mapping dataset! Classify as supporting/opposing.
https://paperswithcode.com/task/link-prediction
https://paperswithcode.com/area/graphs/representation-learning
https://paperswithcode.com/paper/interacte-improving-convolution-based
https://arxiv.org/pdf/1911.00219.pdf
https://arxiv.org/pdf/1903.12287v3.pdf
https://arxiv.org/pdf/1906.04239v1.pdf
https://github.com/Sujit-O/pykg2vec
Potential datasets for finding opposing viewpoint directions: argument mapping, rumor classification research, RumourEval, rumour stance prediction and rumour verification,
https://www.psychologytoday.com/us/blog/thoughts-thinking
https://www.psychologytoday.com/us/blog/thoughts-thinking/201811/improving-critical-thinking-through-argument-mapping
https://paperswithcode.com/task/relation-extraction
https://paperswithcode.com/task/citation-intent-classification
https://paperswithcode.com/task/text-categorization
https://paperswithcode.com/task/document-classification
https://paperswithcode.com/task/topic-models
https://paperswithcode.com/task/textual-analogy-parsing
https://paperswithcode.com/task/entity-linking
https://paperswithcode.com/task/relation-classification
https://paperswithcode.com/task/opinion-mining
https://paperswithcode.com/task/argument-mining
https://paperswithcode.com/task/relational-reasoning
https://paperswithcode.com/task/joint-entity-and-relation-extraction
https://arxiv.org/pdf/1704.07221v1.pdf
https://www.aclweb.org/anthology/S19-2147.pdf
/Users/jasonbenn/.pyenv/versions/worldview/lib/python3.7/site-packages/bert_serving/client/__init__.py:290:
UserWarning: server does not put a restriction on "max_seq_len", it will determine
"max_seq_len" dynamically according to the sequences in the batch. you can restrict
the sequence length on the client side for better efficiency
warnings.warn('server does not put a restriction on "max_seq_len", '
Wow, these similarities are great. Really, really strong.
Getting doc
SIMILAR TO: If I am the boldest version of myself, what would I be doing with my life?
similarity range: 0.975 - 0.850
-------------------
similarity: 0.975
# Worldview > An autonomous & highly satisfying career archetype: independent research aimed at influencing big companies
-------------------
similarity: 0.975
# Questions > What is my best work self?
-------------------
similarity: 0.975
# Processes > Ideas: recording & sharing
-------------------
similarity: 0.975
# Processes > Prioritizing
-------------------
similarity: 0.973
# Processes > Staying focused
MILESTONE: BERT-powered semantic similarity
11/26/19: Andy's tweet, knowledge categories, Idea Stream, valid goals
Existing landscape of tools for thought.
Most relevant tweet of all time, responded:
Question: when and how do I translate a thought into Twitter to get feedback from the hivemind? Do I create a boolean field? Do I write a program to automatically post them to Twitter sometimes?
Buster Benson has been building and revising his worldview incrementally since 2012! I love how he classifies them by strength and how they might be falsified.
All knowledge you wish to incorporate into your Worldview can be categorized as a Fact, a Belief, a Process, or a Question. Questions grow up into Facts and Beliefs. All can reference each other.
Facts must be Strongly Connected to land in your Anki (2+ backlinks). Otherwise, they're not notable enough to be worth memorizing? Not sure yet about this one - perhaps wait until I encounter lots of noise I don't care about before implementing it.
Beliefs, processes, and questions always export to your Anki.
People must also be Strongly Connected (2+ references in my journal) to land in Anki.
What about Ideas? Is that worth elevating to a concept I think about repeatedly? I imagine flipping through my ideas deck would be really fun and inspiring. Putting them in Anki gives me more confidence that I won't forget them... otherwise building all these different processes to check various parts of my Notion would be too hard/I'd get lost.
The Idea Stream:
- classified automatically into fact, opinion, or belief
- connections (of any of the above types) are suggested to the right, with a typeahead at the top to filter. Because there will be at most tens of thousands, just display them all, and sort by relevance. You should also be able to scan through every note yourself if you desire, and add connections as they come up without losing your place.
- clicking on a connection takes you to that note, where either it will be added as a "Source" or it'll let you smoothly start describing the connection.
- optional "retweet" text field on each piece of content lets you describe why you think it's interesting
- snooze button lets you revisit that content at a random future date
- archive button lets you decline to incorporate that content into your worldview
Make connection similarity a tuneable parameter: how similar does content have to be before it's suggested? Could stimulate creativity.
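A sketch of the tuneable band (note_vecs assumed unit-normalized, so the matrix product is cosine similarity):

import numpy as np

def suggest_connections(query_vec, note_vecs, notes, lo=0.80, hi=0.95):
    # widen the band downward to surface more distant, creativity-stimulating connections
    sims = note_vecs @ query_vec
    return [(notes[i], s) for i, s in enumerate(sims) if lo <= s <= hi]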
Valid possible goals:
- Build a worldview
- Connect with the people in your life
- Manage information overload
- Improve conversations with people ("Most people have just 5-10 ideas, only 1 is new or interesting, and those are the main things they talk about")
Possible pain points:
- I don't remember what I read
- I come across good ideas and don't know where to put them, feels like losing them
- My opinions are poorly thought through compared to others
- I'm not interesting enough
- I don't think about other people
Beautifully designed landing page, I know exactly what it does and why I'd want it:
11/22/19: streams of highlights like Gmail
Core parts of my system:
- mapping from content to representation in my DB.
- exporting Notion to Anki. Make this more user-friendly? I would like this to happen more often. Ideally from a web interface.
Let's get the ml-box online more reliably, so that we can do 1. Move keyboard, find a mouse, get the projector hooked up. Gah. That sounds annoying. How bout: make random vectors, save those instead lol.
Work on 1.
Done at 9:47pm.
Definitely should not feel like chores, like email. Or even FB notifs.
Should feel more like Instagram. They're rewards! And maybe they fade after a week or so. Or they come up again, later, randomly, so you don't have to worry about connecting them when they come up.
An Edges app.
Solves the problem of worrying about where to put stuff as you find it.
Roam also solves this problem: you just drop a note.
But: "the more I use Roam, the less organized I feel." (Taylor).
You promise yourself you'll organize everything, but you never do.
I wonder if revisiting them periodically with Anki would change this? Seems unlikely.
Roam's problem: enables too many connections, doesn't force you to prioritize or actually think about them. Also doesn't make you evaluate quality of source or quality of connection.
Oh snap - my Evernote Tweet inbox is this! Similarity on the side:
It's pretty interesting. Definitely feels good to review the thoughts I liked earlier. More and better context would be good, though. Iframes? Kindle context? That would be insane, though - it'd be like implementing a whole Kindle reader, a whole Twitter reader, all in the same app.
I have no desire to connect things, though, I'm too passive atm. Maybe if it were easier...? Eh, idk. Might just be about having to move my hand to click the mouse.
Also, doesn't really feel like I'm building a web of knowledge. When would I review this?
Maybe I would review connections in addition to the whole ideas? Might be interesting.
Stream of highlights since you last checked.
For each, suggest similar opinions. Hotkeys let you either establish links or write something about a link. Can send to either a Thought doc or an Anki flashcard (suggest the deck!).
Great inbox processing software: gmail. Superhuman.
Great software for forwarding/making connections: Notion, Bear, etc.
Mindmaps? Using arrow keys to select/highlight possible connections?
Force diagrams? Each highlight has a constellation of possible connections?
You've got a snippet, you use arrows to page up and down and quickly glance at the contents, and you just start typing, and you're already in the document, and the new snippet is already copied in and linked. And you can keep going and link it to multiple documents, too. Wait - do I actually want to type about some of these things, or just make a connection? Could batch process connections later.
Zoomable UI?
Oh snap, Zapier supports webhooks! That covers: Twitter, Evernote, Pocket.
I can export manually from: MarginNote, Otter, Kindle.
import requests

def handle_webhook(input_data):  # illustrative wrapper; the original note was a fragment
    requests.get("http://157.245.238.241:8909", params=input_data)
    print(input_data)
    return 0
If a piece of information isn't incorporated into an opinion, a process, or my facts base, then what's the point?
Might want to share it.
Might want to connect it in the future to other things.
Might be gathering it for some potential future opinion.
Might be collecting things like that for some purpose. Cool jobs, for example.
Perhaps the ideal process is to just immediately suggest the connection as you're reading it. After all, that is when you have the context about that thing.
Examples of great explanations of tools for thought:
Knowledge platforms:
Trustory: so, so bad. So many uninformed opinions. No credibility whatsoever. I disagree with literally every opinion in their programming community. This is where Quora shines: subject matter experts respond to questions, and questions and answers are heavily moderated and cleaned up.
Allsides: presenting opposing viewpoints. How does it feel to read both at the same time? Confusing - don't know where to click. Nudges me towards center when sometimes facts really do support one side or another. It's pure opinion, but it should be more fact.
Debate is often framed by the phrasing of the question. Case in point: RationalWiki, which purports to offer objective, rational analysis of issues and individuals in the world of politics, business, and media, but which in reality has turned into a Marxist/leftist extremist dunkfest where they shit on anyone to the right of Bernie.
These are all trying to mix the opinions of others. I just want to collect my opinions and make sense of them all, and support/refute them as appropriate.
What we need is a boxscore for each piece of evidence. Maybe that's the layer that could go over everything. Reputation as a service. Reviews, trustworthiness. Different from reach. Relies on inventors, not influencers.
Tweet from BalajiS
As I think about it, this is a new kind of boxscore for a news story.
- First party confirmation or denial
- Primary sources
- Independent confirmations
- Anonymous sources
And if you log in, it shows whether these sources are economically or otherwise aligned with you.
Algorithmic tools to evaluate truth. PageRank was an incredibly good first attempt, because it works like humans do (credibility of the source = how many links).
"Modeled after academic citations" which, as a system, has plenty of problems.
https://twitter.com/c4mer0n/status/1197683865369755648
https://changeaview.com/
https://felonvoting.procon.org/
http://www.parli.co/
https://www.qutee.com/
https://perspectroscope.com/
https://twitter.com/LogicallyHQ
https://changeaview.com/
https://twitter.com/balajis/status/1197696369210904576
https://en.wikipedia.org/wiki/Knowledge_Graph
11/20/19: NLP research areas
Flashcards should have multiple priority (abstraction?) levels, and when you're not prioritizing a deck, it should only show you the highest-level cards. You should be able to shift your focus.
Research interests
- How do I train on structured (Markdown, graphs) data? Melody Dye suggestion: learning from graphs (TransE). Retrain
- I don't just want to identify similar documents, but also those with opposing views. How do I identify (or create) contrary directions in representation space? Do these datasets exist? Scrape AllSides.com? Scrape kialo.com, DAG arguments.
- How do I produce representations of long sequences (books)? Transformer-XL?
Scrape allsides?
Finetune model.
Two classifiers: multiclass vs positive/negative
Collaborative filtering
You could prime it with hints
Goal features
When I stake out an opinion on ML, relevant papers I've saved surface. I can toss papers into my system and know they'll come later when I'm most curious about them. (Scrape abstracts?)
Understands the relationship between papers and people (entity resolution).
Goal tonight: get Goodreads into my DB.
11/17/19: design ideas: notes app + similarity
Design!
11/15/19: embeddings for everything, interested people
Observation: it's really exciting to hear a great idea and put it into my system. I have confidence that I'll think about it more, even if I don't have time right now. I don't have to be preoccupied with forgetting it. I think this system enables me to search more widely for ideas, in the same way that Anki enabled wider learning.
Things that can be embedded: text, embedding
Things that can have embeddings:
Goodreads book: title, description, FKs to authors, FK to series
Goodreads author: name, URL, about
Goodreads series: title, description
Highlight? Source document, highlighted text...?
Some types are composite embeddings:
A book is a weighted average of the book description, author description, and highlights
An article is all highlights
These are themselves records, and could be exploded into their parts
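As a sketch, the composite is just a weighted average of the parts' vectors (weights are guesses to tune):

import numpy as np

def book_embedding(desc_vec, author_vec, highlight_vecs, w=(0.5, 0.2, 0.3)):
    # explode the book into parts, embed each, recombine
    return w[0] * desc_vec + w[1] * author_vec + w[2] * np.mean(highlight_vecs, axis=0)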
What about PDFs/other docs that could live on S3? Should I host those myself, or just reference links that could go down?
Notion. Scrape via API.
Goodreads. Scrape via API.
Kindle Notes & Highlights. Can export through web interface.
Otter transcripts. Can export through web interface!
Pictures of text with my finger pointing at a passage? Eh, just
Interface:
Admin page for each type of parent model. NotionAccount, GoodreadsAccount, KindleAccount, OtterAccount, S3Account. All belong to User.
NotionDatabase belongs to NotionAccount. You can add a DB by URL, then a bunch of NotionDocuments are scraped from that.
GoodreadsAccount: you can add GoodreadsShelf, then click scrape, and it'll find a bunch of books and their highlights.
KindleAccount: just configure, then click scrape. Ruby gem.
ScrapeActions table keeps track of when things were scraped - won't do it again if too recent.
Everything gets
Quick experiment in polymorphism: can I quickly grab all highlights? Can I quickly GROUP_BY/average on source?
What is next? What is the next goal that excites me?
Having a DB full of Goodreads highlights. For now, just have highlights have nullable FKs to Book or NotionDoc.
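The model could be as simple as this (Django sketch; model names hypothetical):

from django.db import models

class Highlight(models.Model):
    text = models.TextField()
    # nullable FKs for now - exactly one should be set per highlight
    book = models.ForeignKey("GoodreadsBook", null=True, blank=True, on_delete=models.CASCADE)
    notion_doc = models.ForeignKey("NotionDocument", null=True, blank=True, on_delete=models.CASCADE)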
Next
- Get Goodread shelves read/to-read into the same space as writing
- Basic queries: read/unread books like this opinion, opinions like this opinion
- Write up my vision, early learnings and circulate to get feedback.
Tommy feels like he has "no thoughts", wants to have more.
Danny works best by talking before writing, buys books even if he's not sure if he'll read them, and reads lots of snippets.
Patrick C buys thousands of books, reads 5 pages at a time at random.
Taylor decides what to read next by being inspired by random quotes from his reading list.
Armand has 3 progressively more specific to-read lists on Goodreads.
Kanjun listens to audiobooks while running.
Allie listens to audiobooks during 5 minute commutes.
Taylor thinks little snippets lower friction of running.
Tiger, James, Tina will think it's cool, maybe have feedback on the approach.
Theme: nobody wants to read whole books.
What if you could zoom to a popular quote, then progressively highlight?
What if you summarized books at varying levels of granularity until you get to your quote? Chapter/section/page/paragraph summaries, up to your chosen quote, then progressively zooming out again afterwards. Zoom in and out on the content.
Danny wants to have a conversation with e.g. Patrick about the quote. That makes me think of this - oh cool, that's related to this that you haven't heard...
Eventually: Andy Matuschak and Michael Nielsen may have feedback.
Melody Dye, NLP scientist at Netflix. Also working on NLP for sequence similarity, also doesn't know how to handle long sequences or structure. She's ingesting scripts, tagging them into genres.
11/14/19: exploring Goodreads API
Pipeline: will need to train, then save UMAP.
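i.e., fit once, persist, then transform new docs into the same map (a sketch; joblib for persistence is an assumption):

import joblib
import umap

reducer = umap.UMAP(metric="cosine").fit(X_train)  # X_train: existing document embeddings
joblib.dump(reducer, "umap_reducer.joblib")
# later: project new documents without refitting
reducer = joblib.load("umap_reducer.joblib")
coords = reducer.transform(X_new)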
Most important next thing
I want to see things I've already read - Goodreads read and unread - show up in the map.
popular_shelves, which includes genre tags.
similar_books, could use as labels.
Probably concat/average with about-author and about-series?
I really like the concept of exploding an embedding into highlights, author, series, book description...! That'd be so fucking cool! On hover, show/highlight an area covering the exploded parts of the book?!
https://www.goodreads.com/review/list/6481511.xml?key=YaQV1WrXtgkB9KYHvrDQ&v=2&shelf=read&per_page=200&page=1
works, but from Python it doesn't always work - seems to return only one result?
11/13/19: Bert → UMAP
I've got a solution - what's the problem?
I forget everything I read, and they mostly don't impact what I believe.
I want a tool to help me understand how all my knowledge fits together.
Main app: put everything into massive DB, documents in S3, compress them with BERT, map them with UMAP.
Uses: reading recommendations, explore the fog of war, find writing you should link, etc. Queryable. Cosine similarity happens via procedure.
MVP:
BERT-as-a-service running on ml-box!
MILESTONE: BERT & UMAP projections
All reading is broken. I read things and barely remember a single fact a week later. The only fix is to explain or talk about the things you read, but most content isn't like that.
The solution is to have all read and written content mapped, such that similar content is next to each other.
Next project: connection suggestions!
Notion-to-text program: for any Notion page, convert all contents to one string.
MVP vectorizer: just use sklearn's TF-IDF, see how it goes.
Later: run DistilBERT on my machine? Its job is to ingest documents of any length and produce an activations vector.
BERT-as-a-service on ml-box.
DB: sqlite. Load everything, do similarity search in memory. Will be a while before I have many thousands of documents and their vectors won't fit in memory.
MVP: copy-paste text into inputs.
Later: feed it just a URL. It'll GET and parse with MercuryParser.
Later: Chrome extension. One-click to compare. Backend responds with Notion links.
Should also work for Notion ←→ Notion connections. Can either pass raw text or a Notion link.
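The whole MVP in a few lines (a sketch; corpus is the plaintext of every Notion page, small enough for memory):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

def most_similar(pasted_text, top_k=5):
    sims = cosine_similarity(vectorizer.transform([pasted_text]), X).ravel()
    return sims.argsort()[::-1][:top_k]  # indices of the closest documents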
Goal: to build an insight machine: that means integrating content that I've saved and written.
collection.get_rows() results in empty rows. Resolves itself when rerunning repeatedly.
Ben recommends GPU ML algorithms, and newspaper3k for parsing.
Worldview metrics (add to a dashboard!)
- The number of ideas made presentable
- The number of new ideas added
- Target: 3 per day
Maybe these should go to separate decks?
MILESTONE: dashboard for my writing funnel
Improving my UMAP projection
11/12/19: text extractors/classifiers
idea: the distant future?
Inputs:
Record more conversations with Otter.
Evernote collects my readings and highlights.
Classifiers:
A program identifies the information worth saving, and/or its type - is it about a person, an idea or opinion, a great question, or a joke for my journal? Fact v opinion v experience classifier.
Another program suggests which Notion content it should be merged into. Language model activations of each new piece of content · language model activations (in DB) of each worldview and/or great question. No similarities? Great - prompt to create a new note.
Attention:
Things to think about more are included into Anki: open questions, my friends and acquaintances.
Content to read is suggested based on opinions that need fleshing out.
Outcomes:
High trajectory: the next time we have a conversation, there's a good chance I've internalized your insights and further developed them
Sample efficiency: nothing worthwhile is forgotten
11/11/19: Notion → Anki
https://www.facebook.com/photo.php?fbid=10101121905470403&set=a.616032894143&type=3
https://graph.facebook.com/10101121905470403/picture?type=normal
https://graph.facebook.com/10101121905470403/picture?type=large
Maybe upload images to my own S3 bucket?
11/10/19: Evernote memex, backlinker, Anki
v1?
- Notion is where I craft my worldview. Evernote can be linked into Notion as sources. These things will spill over into public writing frequently. I've started here: .
- Backlinker is a program that links Notion documents to each other. Just like Andy.
Done
Is this a thing that I could imagine committing to and deriving a lot of value from? I think so.
Would it push me to develop a fuller worldview? Yes, absolutely. I have so much previous writing and content that's actually quite good!
How much time would I need to spend in Anki on the Worldview deck per day for this to work? Maybe I could just sit down and flesh out one at a time every now and then? Having a hard time imagining doing this constantly. But over a long period of time, perhaps this could be really useful.
When would I invest that time? Definitely nights after work. It'd be very satisfying to absorb knowledge from e.g. conversations into my worldview. Perhaps my weekly review could include a brief review of new Evernote content?? That'd be pretty neat!
Would this make me a more interesting person? Oh hell yes, it already has!
I definitely like rereading the note to remind myself of the thought before asking any questions.
Linking Notion documents to each other is actually critical.
MILESTONE: wished for backlinks in Notion, Notion → Anki
Really awesome. It exports the whole thing, and it's all searchable in EN. Better than I could have hoped.
What kind of content should I clip? Full articles? Simplified or complete?
Do highlights help at all?
What about little summaries?
Would mind maps from margin note actually be helpful?
How long would it take to do this, and when would I? Evening commute?
When would I sit down to write and first read related content?? Could I work these in to Anki?
Are there other great streams/sources of information that aren't Anki? Suggestions based on your writing?
11/9/19: Anki MVPs
Main goals:
Hmm. Not compelling. I don't want to think about great research organizations. The times I want to think about this is actually when I'm reading related content, or when I'm wondering what to read next.
Opinions as note titles is a breakthrough. Forces me to defend it, articulate it, explain it.
Opinions as Anki cards is slightly disappointing.
Lots of things I've recently thought about and weakly believe I breeze through.
Some things I actually realize I don't care about. Updated my belief that everyone learns in the same way after a conversation last night with Austin.
Anki is out of date quickly.
What I want is to combine this with my reading list. Building a worldview is to read widely and critically.
RSS feed could be scraped and made sensible with Mercury Parser.
Related notes from Evernote could be inserted at the bottom of notes. Test the MVP. Implementation would copy content from Notion into Evernote, then use the Evernote suggestion API, which can take any number of notes.
Pretty great.
Implications: to get content to appear in my reading list, it has to first be motivated by an opinion.
Related notes from Notion as well.
Long term schedule, where I look at each one seriously: get to review in 1 day, 5 days, 10 days. Max reviews: every 30 days. Ehhh, but could just put them at 1 day.
Short term schedule, where I feel OK about skipping through em: 1 day. Not as good.
Pretty cool. Definitely makes things more interesting to read. Great for sharing.
Not great until I clean up Evernote.
Yep, now these are excellent suggestions... from mostly 2015. I want my suggestions to be high quality, and news is too topical to be high quality.
Perhaps my Evernote should be conversations, books, and timeless articles.
Going out to do research was fun and effective, but I'm not in that kind of mood this weekend.
Outstanding questions:
- Where do I source further reading from? A book list? Could my "reading list" just be me clipping book blurbs/Goodreads summaries into Evernote? One problem is that Evernote tends to suggest longer content over shorter content.
- When? This is a night activity, perhaps my default one, and it can motivate my reading and conversations.
- What's the endgame?
This could also be the way I collect and update thoughts after conversations, which is very satisfying. Eventually, I wish to record my whole day, use a NN to label the facts/interesting thoughts from the transcript, and simply review a queue of those at the end of a day and integrate them into my worldview.
Permanent high status, which helps me surround myself with the world's most brilliant and interesting people, which will help me achieve any goal.
It will also help me raise the aspirations of others and teach them how to do so.
I also just want to be right about the world - to understand it and make genuine insights. It's probably never been possible to have so complex and interconnected a worldview before - I could be at the front lines of thought technology development, and demonstrate what it's capable of.
Compare representations to find the direction.