Paper Reading



List of papers we cover during our weekly paper reading session. For past and missing links/notes, check out the (private) wiki.

28 August, 2020

Overlapping experiment infrastructure: More, better, faster experimentation

CUSTOM_ID: tang2010overlapping
YEAR: 2010
AUTHOR: Tang, Diane and Agarwal, Ashish and O'Brien, Deirdre and Meyer, Mike

Optimal testing in the experiment-rich regime

CUSTOM_ID: schmit2019optimal
YEAR: 2019
AUTHOR: Schmit, Sven and Shah, Virag and Johari, Ramesh

21 August, 2020

Bridging Anaphora Resolution as Question Answering

CUSTOM_ID: hou2020bridging
YEAR: 2020
AUTHOR: Hou, Yufang

A framework for understanding unintended consequences of machine learning

CUSTOM_ID: suresh2019framework
YEAR: 2019
AUTHOR: Suresh, Harini and Guttag, John V

14 August, 2020

Reformer: The Efficient Transformer

CUSTOM_ID: kitaev2020reformer
YEAR: 2020
AUTHOR: Nikita Kitaev and Łukasz Kaiser and Anselm Levskaya

24 July, 2020

Adversarial examples that fool both computer vision and time-limited humans

CUSTOM_ID: elsayed2018adversarial
YEAR: 2018
AUTHOR: Elsayed, Gamaleldin and Shankar, Shreya and Cheung, Brian and Papernot, Nicolas and Kurakin, Alexey and Goodfellow, Ian and Sohl-Dickstein, Jascha

Language models are few-shot learners

CUSTOM_ID: brown2020language
YEAR: 2020
AUTHOR: Brown, Tom B and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and others

10 July, 2020

DIET: Lightweight Language Understanding for Dialogue Systems

CUSTOM_ID: bunk2020diet
YEAR: 2020
AUTHOR: Bunk, Tanja and Varshneya, Daksh and Vlasov, Vladimir and Nichol, Alan

DialogueRNN: An attentive RNN for emotion detection in conversations

CUSTOM_ID: majumder2019dialoguernn
YEAR: 2019
AUTHOR: Majumder, Navonil and Poria, Soujanya and Hazarika, Devamanyu and Mihalcea, Rada and Gelbukh, Alexander and Cambria, Erik

3 July, 2020

StarSpace: Embed All The Things!

CUSTOM_ID: wu2018starspace
YEAR: 2018
AUTHOR: Wu, Ledell Yu and Fisch, Adam and Chopra, Sumit and Adams, Keith and Bordes, Antoine and Weston, Jason

Learning ASR-Robust Contextualized Embeddings for Spoken Language Understanding

CUSTOM_ID: huang2020learning
YEAR: 2020
AUTHOR: Huang, Chao-Wei and Chen, Yun-Nung

12 June, 2020

Sentence-BERT: Sentence embeddings using Siamese BERT-networks

CUSTOM_ID: reimers2019sentence
YEAR: 2019
AUTHOR: Reimers, Nils and Gurevych, Iryna

PyTorch: An imperative style, high-performance deep learning library

CUSTOM_ID: paszke2019pytorch
YEAR: 2019
AUTHOR: Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and others

What's Hidden in a Randomly Weighted Neural Network?

CUSTOM_ID: ramanujan2020s
YEAR: 2020
AUTHOR: Ramanujan, Vivek and Wortsman, Mitchell and Kembhavi, Aniruddha and Farhadi, Ali and Rastegari, Mohammad

Weakly Supervised Attention Networks for Entity Recognition

CUSTOM_ID: patra2019weakly
YEAR: 2019
AUTHOR: Patra, Barun and Moniz, Joel Ruben Antony

Improving BERT with Self-Supervised Attention

CUSTOM_ID: kou2020improving
YEAR: 2020
AUTHOR: Kou, Xiaoyu and Yang, Yaming and Wang, Yujing and Zhang, Ce and Chen, Yiren and Tong, Yunhai and Zhang, Yan and Bai, Jing

5 June, 2020

Hierarchical attention networks for document classification

CUSTOM_ID: yang2016hierarchical
YEAR: 2016
AUTHOR: Yang, Zichao and Yang, Diyi and Dyer, Chris and He, Xiaodong and Smola, Alex and Hovy, Eduard

Training classifiers with natural language explanations

CUSTOM_ID: hancock2018training
YEAR: 2018
AUTHOR: Hancock, Braden and Bringmann, Martin and Varma, Paroma and Liang, Percy and Wang, Stephanie and Ré, Christopher

ERD'14: entity recognition and disambiguation challenge

CUSTOM_ID: carmel2014entity
YEAR: 2014
AUTHOR: Carmel, David and Chang, Ming-Wei and Gabrilovich, Evgeniy and Hsu, Bo-June and Wang, Kuansan

Audio adversarial examples: Targeted attacks on speech-to-text

CUSTOM_ID: carlini2018adversarial
YEAR: 2018
AUTHOR: Carlini, Nicholas and Wagner, David

29 May, 2020

Differentiable Reasoning over a Virtual Knowledge Base

CUSTOM_ID: dhingra2020differentiable
YEAR: 2020
AUTHOR: Dhingra, Bhuwan and Zaheer, Manzil and Balachandran, Vidhisha and Neubig, Graham and Salakhutdinov, Ruslan and Cohen, William W

Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions

CUSTOM_ID: shen2018naturaltts
YEAR: 2018
AUTHOR: Shen, Jonathan and Pang, Ruoming and Weiss, Ron J and Schuster, Mike and Jaitly, Navdeep and Yang, Zongheng and Chen, Zhifeng and Zhang, Yu and Wang, Yuxuan and Skerry-Ryan, RJ and others

15 May, 2020

NBDT: Neural-Backed Decision Trees

CUSTOM_ID: wan2020nbdt
YEAR: 2020
AUTHOR: Wan, Alvin and Dunlap, Lisa and Ho, Daniel

Faster Neural Network Training with Data Echoing

CUSTOM_ID: choi2019echoing
YEAR: 2019
AUTHOR: Choi, Dami and Passos, Alexandre and Shallue, Christopher J. and Dahl, George E.

Universal Language Model Fine-tuning for Text Classification

CUSTOM_ID: howard2018universal
YEAR: 2018
AUTHOR: Howard, Jeremy and Ruder, Sebastian

8 May, 2020

Designing and Deploying Online Field Experiments

CUSTOM_ID: bakshy2014designing
YEAR: 2014
AUTHOR: Bakshy, Eytan and Eckles, Dean and Bernstein, Michael S.

Intelligent Selection of Language Model Training Data

CUSTOM_ID: moore2010intelligent
YEAR: 2010
AUTHOR: Moore, Robert C. and Lewis, William

Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead

CUSTOM_ID: rudin2019stop
YEAR: 2019
AUTHOR: Rudin, Cynthia

25 April, 2020

Gmail Smart Compose: Real-Time Assisted Writing

CUSTOM_ID: andrew2019gmail
YEAR: 2019
AUTHOR: Andrew Dai and Benjamin Lee and Gagan Bansal and Jackie Tsay and Justin Lu and Mia Chen and Shuyuan Zhang and Tim Sohn and Yinan Wang and Yonghui Wu and Yuan Cao and Zhifeng Chen

Supervised Learning with Quantum-Inspired Tensor Networks

CUSTOM_ID: stoudenmire2017supervised
YEAR: 2017
AUTHOR: Stoudenmire, Miles and Schwab, David

The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning

CUSTOM_ID: corbett2020measure
YEAR: 2018
AUTHOR: Corbett-Davies, Sam and Goel, Sharad

Understanding deep learning requires rethinking generalization

CUSTOM_ID: zhang2017understanding
YEAR: 2017
AUTHOR: Zhang, Chiyuan and Bengio, Samy and Hardt, Moritz and Recht, Benjamin and Vinyals, Oriol

17 April, 2020

Zoom In: An Introduction to Circuits

CUSTOM_ID: olah2020zoom
YEAR: 2020
AUTHOR: Olah, Chris and Cammarata, Nick and Schubert, Ludwig and Goh, Gabriel and Petrov, Michael and Carter, Shan

3 April, 2020

Sideways: Depth-Parallel Training of Video Models

CUSTOM_ID: malinowski2020sideways
YEAR: 2020
AUTHOR: Malinowski, Mateusz and Swirszcz, Grzegorz and Carreira, Joao and Patraucean, Viorica

Speech2Face: Learning the face behind a voice

CUSTOM_ID: oh2019speech2face
YEAR: 2019
AUTHOR: Oh, Tae-Hyun and Dekel, Tali and Kim, Changil and Mosseri, Inbar and Freeman, William T and Rubinstein, Michael and Matusik, Wojciech

Dialog Methods for Improved Alphanumeric String Capture

CUSTOM_ID: peters2011dialog
YEAR: 2011
AUTHOR: Peters, Doug and Stubley, Peter

Presents a way to collect alphanumeric strings at the dialog level through an ASR system. Two main ideas:

  1. Skip-listing over n-best hypotheses across turns (attempts), so that a hypothesis the user has already rejected is not offered again
  2. Chunking the string and confirming the pieces one by one
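
The first idea can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: it assumes the ASR returns a ranked n-best list on every attempt and that the user answers yes or no to a confirmation prompt. The function name `pick_hypothesis` and the sample strings are hypothetical.

```python
def pick_hypothesis(n_best, skip_list):
    """Return the best-ranked hypothesis that the user has not already rejected."""
    for hypothesis in n_best:
        if hypothesis not in skip_list:
            return hypothesis
    return None  # every candidate was rejected; the caller should re-prompt


# Hypothetical n-best lists (best first) from two attempts at the same string.
attempts = [
    ["A8C1", "ABC1", "H8C1"],
    ["A8C1", "ABC1", "APC1"],
]
truth = "ABC1"  # what the user actually said (stands in for the yes/no answer)

skip_list = set()
confirmed = None
for n_best in attempts:
    candidate = pick_hypothesis(n_best, skip_list)
    if candidate == truth:       # user says "yes" to the confirmation
        confirmed = candidate
        break
    skip_list.add(candidate)     # user says "no": never offer this one again

print(confirmed, skip_list)
```

Without the skip list, the second attempt would offer the same top-ranked wrong hypothesis again.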

28 February, 2020

Self-supervised dialogue learning

CUSTOM_ID: wu2019self
YEAR: 2019
AUTHOR: Wu, Jiawei and Wang, Xin and Wang, William Yang

The self-supervision signal here comes from a model that tries to predict whether a given tuple of turns is in order or not. Plugging this model in as the discriminator in generative-discriminative dialog systems, they report better results.
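
A rough sketch of how training pairs for such an order classifier could be built from raw dialogues. The sampling scheme (consecutive windows of `k` turns, one random shuffle per window) and the name `make_order_examples` are assumptions made for illustration; the paper's exact sampling may differ.

```python
import random


def make_order_examples(dialogue, k=3, rng=None):
    """Build (turns, label) pairs: label 1 if the turns are in order, 0 if shuffled."""
    rng = rng or random.Random(0)
    examples = []
    for start in range(len(dialogue) - k + 1):
        ordered = tuple(dialogue[start:start + k])
        if len(set(ordered)) < 2:
            continue  # identical turns cannot be put out of order
        examples.append((ordered, 1))
        shuffled = list(ordered)
        while tuple(shuffled) == ordered:
            rng.shuffle(shuffled)
        examples.append((tuple(shuffled), 0))
    return examples


dialogue = [
    "hi",
    "hello, how can I help?",
    "I want to book a cab",
    "sure, where to?",
]
examples = make_order_examples(dialogue, k=3)
for turns, label in examples:
    print(label, turns)
```

No human labels are needed: the ordering of the original dialogue is the supervision.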

7 February, 2020

Learning from Dialogue after Deployment: Feed Yourself, Chatbot!

CUSTOM_ID: hancock2019learning
YEAR: 2019
AUTHOR: Hancock, Braden and Bordes, Antoine and Mazare, Pierre-Emmanuel and Weston, Jason

This is an approach to collect supervision signal from deployment data. There are three tasks for the system (which is a chat bot doing ranking on candidate responses):

  1. Dialogue. The main task. Given the turns till now, the bot ranks which response to utter.
  2. Satisfaction. Given turns till now, last being user utterance, predict whether the user is satisfied.
  3. Feedback. After asking for feedback from the user, predict user's response (feedback) based on the turns till now.

The models share weights, mostly between tasks 1 and 3.

31 January, 2020

Modeling Sequences with Quantum States: A Look Under the Hood

CUSTOM_ID: bradley2019modeling
YEAR: 2019
AUTHOR: Bradley, Tai-Danae and Stoudenmire, E Miles and Terilla, John

This paper explores a new direction in language modelling. The idea is still to learn the underlying distribution of sequences of characters, but here they do it by learning the quantum analogue of the classical probability distribution function. Unlike the classical case, marginal distributions in the quantum setting carry enough information to reconstruct the joint distribution. This is the central idea of the paper, and is explained in the first half. The second half of the paper explains the theory and implementation of the training algorithm, with a simple example. Future work would be to apply this algorithm to a more complicated example, and even adapt it to variable-length sequences.
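
A toy illustration of why the quantum marginal holds more than the classical one. It assumes a state whose amplitudes are the square roots of the joint probabilities of a (prefix, suffix) pair, and computes the reduced density on the prefix. The two joint distributions and the helper `reduced_density` are made up for this sketch.

```python
import math


def reduced_density(joint):
    """rho_A = M M^T, where M[a][b] = sqrt(p(a, b)) is the amplitude matrix."""
    amplitudes = [[math.sqrt(p) for p in row] for row in joint]
    size = len(amplitudes)
    return [
        [sum(x * y for x, y in zip(amplitudes[a], amplitudes[a2])) for a2 in range(size)]
        for a in range(size)
    ]


# Two joint distributions p(a, b) with the same classical marginal p(a) = (0.5, 0.5).
correlated = [[0.5, 0.0], [0.0, 0.5]]       # the prefix fully determines the suffix
independent = [[0.25, 0.25], [0.25, 0.25]]  # the prefix says nothing about the suffix

rho_corr = reduced_density(correlated)
rho_ind = reduced_density(independent)

print(rho_corr)
print(rho_ind)
```

The diagonals agree (they are the classical marginal), but the off-diagonal entries differ: they measure how much two prefixes overlap in the suffixes that follow them, which is information the classical marginal discards.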

Deep Voice 2: Multi-speaker neural text-to-speech

CUSTOM_ID: gibiansky2017deep
YEAR: 2017
AUTHOR: Gibiansky, Andrew and Arik, Sercan and Diamos, Gregory and Miller, John and Peng, Kainan and Ping, Wei and Raiman, Jonathan and Zhou, Yanqi

This paper suggests improvements to DeepVoice and Tacotron, and also proposes a way to add trainable speaker embeddings. The speaker embeddings are initialized randomly and trained jointly through backpropagation. The paper lists some patterns that lead to better performance:

  1. Transforming the speaker embedding to the appropriate dimension and form for every place where it is added to the model. The transformed speaker embeddings are called site-specific speaker embeddings.
  2. Initializing recurrent layer hidden states with the site-specific speaker embeddings.
  3. Concatenating the site-specific speaker embedding to the input at every timestep of the recurrent layer.
  4. Multiplying layer activations element-wise with the site-specific speaker embeddings.
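
The four patterns can be sketched in plain Python as below. This is only a shape-level illustration: the dimensions, the random weights standing in for trained projections, and the choice of `tanh` and sigmoid nonlinearities are all assumptions, not details taken from the paper.

```python
import math
import random

rng = random.Random(0)


def dense(in_dim, out_dim):
    """Random (out_dim x in_dim) matrix standing in for a trained projection."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, vector, activation=math.tanh):
    """Site-specific embedding: linear map followed by a nonlinearity."""
    return [activation(sum(w * v for w, v in zip(row, vector))) for row in weights]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


EMB_DIM, HIDDEN_DIM, CONCAT_DIM = 4, 6, 2

# One low-dimensional embedding per speaker (trained by backpropagation in practice).
speaker_embedding = [rng.uniform(-0.1, 0.1) for _ in range(EMB_DIM)]

# Pattern 1: a separate projection for every site where the embedding is used.
init_site = project(dense(EMB_DIM, HIDDEN_DIM), speaker_embedding)
concat_site = project(dense(EMB_DIM, CONCAT_DIM), speaker_embedding)
gate_site = project(dense(EMB_DIM, HIDDEN_DIM), speaker_embedding, activation=sigmoid)

# Pattern 2: initialize the recurrent hidden state from the embedding.
hidden_state = list(init_site)

# Pattern 3: concatenate the embedding to the input at every timestep.
frames = [[0.5, -0.2, 0.1], [0.0, 0.3, 0.7]]
rnn_inputs = [frame + concat_site for frame in frames]

# Pattern 4: multiply layer activations element-wise with the embedding.
activations = [1.0] * HIDDEN_DIM
scaled = [a * g for a, g in zip(activations, gate_site)]

print(len(hidden_state), len(rnn_inputs[0]), len(scaled))
```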

A credit assignment compiler for joint prediction

CUSTOM_ID: chang2016credit
YEAR: 2016
AUTHOR: Chang, Kai-Wei and He, He and Ross, Stephane and Daume III, Hal and Langford, John

This talks about an API for framing L2S (learning to search) style problems as an imperative program, which allows for two optimizations:

  1. memoization
  2. forced path collapse, getting losses without going to the last state

The main reduction that happens here is to a cost-sensitive classification problem.

17 January, 2020

Learning language from a large (unannotated) corpus

CUSTOM_ID: vepstas2014learning
YEAR: 2014
AUTHOR: Vepstas, Linas and Goertzel, Ben

Introductory paper on the general approach used in opencog's learn project. The idea is to learn various generalizable syntactic and semantic relations from an unannotated corpus. The relations are expressed using graphs sitting on top of link grammar and meaning text theory (MTT). While the general approach is sketched out decently enough, there are details to be filled in at various steps and experiments to run (as of the writing in 2014).

On another note, the document is a nice read because of the many interesting ways of looking at various ideas in understanding languages and going from syntax to reasoning via semantics.

10 January, 2020

Parsing English with a link grammar

CUSTOM_ID: sleator1995parsing
YEAR: 1995
AUTHOR: Sleator, Daniel DK and Temperley, Davy

We came here via opencog's learn project. This is also a nice perspective setup if you are missing a formal introduction to grammars. Overall, a link grammar defines connectors on the left and right side of a word, with disjunctions and conjunctions incorporated, which then link together to form a sentence under certain constraints.

This specific paper shows the formulation and creates a parser for English, covering many (not all) linguistic phenomena.

20 December, 2019

Generalized end-to-end loss for speaker verification

CUSTOM_ID: wan2018generalized
YEAR: 2018
AUTHOR: Wan, Li and Wang, Quan and Papir, Alan and Moreno, Ignacio Lopez

This paper builds on their previous work, the tuple-based end-to-end (TE2E) loss, for speaker verification. They generalize the cosine similarity used in TE2E by building a similarity matrix over the utterances of each user. They suggest two losses in the paper:

  1. Softmax loss
  2. Contrast loss

Both loss functions have two components: one that brings the utterances of a user together and another that separates the utterances of different users. Of the two, the contrast loss is more rigorous.
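
A small sketch of the two losses for a single utterance, written from the loss definitions as we understand them: a scaled cosine similarity to every speaker centroid, with the utterance left out of its own centroid. The scale `w`, the offset `b` (learned in practice), and the toy embeddings are placeholder assumptions.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def similarity_row(batch, j, i, w=10.0, b=-5.0):
    """Similarity of utterance i of speaker j to every speaker centroid."""
    embedding = batch[j][i]
    row = []
    for k, utterances in enumerate(batch):
        if k == j:
            # Leave the utterance itself out of its own speaker's centroid.
            utterances = utterances[:i] + utterances[i + 1:]
        row.append(w * cosine(embedding, centroid(utterances)) + b)
    return row


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax_loss(row, j):
    """Pull towards the true centroid, push away from all the others."""
    return -row[j] + math.log(sum(math.exp(s) for s in row))


def contrast_loss(row, j):
    """Pull towards the true centroid, push away from the closest impostor only."""
    hardest = max(sigmoid(s) for k, s in enumerate(row) if k != j)
    return 1.0 - sigmoid(row[j]) + hardest


# Toy batch: 2 speakers x 3 utterances, 2-dimensional embeddings.
batch = [
    [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]],
    [[0.1, 1.0], [0.0, 0.9], [0.2, 1.0]],
]
row = similarity_row(batch, j=0, i=0)
print(row, softmax_loss(row, 0), contrast_loss(row, 0))
```

Computing `similarity_row` for every utterance in the batch gives the full similarity matrix; the total loss is the sum over all utterances.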

13 December, 2019

Towards end-to-end spoken language understanding

CUSTOM_ID: serdyuk2018towards
YEAR: 2018
AUTHOR: Serdyuk, Dmitriy and Wang, Yongqiang and Fuegen, Christian and Kumar, Anuj and Liu, Baiyang and Bengio, Yoshua

This paper talks about developing an end-to-end model for intent recognition from speech. Currently, all the models have several components like ASR and NLU, each of which has some errors of its own, degrading the quality of the speech-to-intent pipeline. Experiments for two tasks, speech to domain and speech to intent, were performed using the model. The model's architecture is mostly inspired from end to end speech synthesis models. A unique feature of the architecture is that they perform sub-sampling after the first GRU layer to reduce the size of the vector and to tackle the problem of vanishing gradients.

Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One

CUSTOM_ID: grathwohl2019classifier
YEAR: 2019
AUTHOR: Will Grathwohl and Kuan-Chieh Wang and Jörn-Henrik Jacobsen and David Duvenaud and Mohammad Norouzi and Kevin Swersky

They take a regular classifier, pick out the logits before softmax and try to formulate an energy based model able to give \(P(x, y)\) and \(P(x)\). The formulation itself is pretty simple, with the energy function being \(E(x) = -\text{LogSumExp}_y f_\Theta(x)[y]\). The final loss sums cross entropy (for the discriminative part) and the negative log likelihood of \(P(x)\), approximated using SGLD. Check out the repo here.
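
The energy formulation can be checked with a few lines of standard-library Python. This is a sketch of the formula only (no SGLD sampling or training), and the logits are made-up numbers.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def energy(logits):
    """E(x) = -LogSumExp_y f(x)[y]; lower energy means higher unnormalized p(x)."""
    return -log_sum_exp(logits)


def class_posterior(logits):
    """p(y | x): the usual softmax over the same logits."""
    lse = log_sum_exp(logits)
    return [math.exp(v - lse) for v in logits]


logits = [2.0, 0.5, -1.0]                 # hypothetical f(x) for one input
shifted = [v + 3.0 for v in logits]       # same logits plus a constant

print(class_posterior(logits), energy(logits))
print(class_posterior(shifted), energy(shifted))
```

Adding a constant to all logits leaves the softmax output unchanged but lowers the energy by that constant. This unused degree of freedom in a standard classifier is what the energy based view puts to work for modelling \(P(x)\).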

Although the learning mechanism is a little fragile and needs work to be generally stable, the results are neat.

29 November, 2019

Overton: A Data System for Monitoring and Improving Machine-Learned Products

CUSTOM_ID: re2019overton
YEAR: 2019
AUTHOR: Ré, Christopher and Niu, Feng and Gudipati, Pallavi and Srisuwananukorn, Charles

This is more about managing supervision than the model. There are 3 problems that they are trying to solve:

  1. Fine grained quality monitoring,
  2. Support for multi-component pipelines, and
  3. Updating supervision

For this, they build easy-to-use abstractions for describing supervision and developing models. They also do a lot of multitask learning and Snorkel-style weak supervision, including the recent slicing abstractions for fine grained quality control.

While you have to adapt a few pieces for your own case (and scale), Overton is a nice testimony to the success of things like weak supervision and higher level development abstractions in production.

Slice-based learning: A programming model for residual learning in critical data slices

CUSTOM_ID: chen2019slice
YEAR: 2019
AUTHOR: Chen, Vincent and Wu, Sen and Ratner, Alexander J and Weng, Jen and Ré, Christopher

This takes Snorkel's labelling function idea and uses it to group data instances into slices, segments which are interesting to us from an overall quality perspective. These slicing functions are important not only for identifying and narrowing down to specific kinds of data instances, but also for learning slice specific representations, which works out as a computationally cheap way (there are other benefits too) of replicating a Mixture of Experts style model.

Like with labelling functions, the slice membership is predicted using heuristics, which are noisy. This membership value, along with slice representations (and slice prediction confidences), helps create the slice aware representation to be used for the final task. The appendix has a few good examples of slicing functions.
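
A minimal sketch of slicing functions used for per-slice monitoring, in plain Python rather than the Snorkel API. The slice definitions, example data, and helper names are hypothetical; the learned slice-aware representation from the paper is not shown.

```python
def sf_short_utterance(example):
    """Heuristic slice: utterances of at most three words."""
    return len(example["text"].split()) <= 3


def sf_contains_digit(example):
    """Heuristic slice: utterances that mention a number."""
    return any(ch.isdigit() for ch in example["text"])


SLICING_FUNCTIONS = {"short": sf_short_utterance, "has_digit": sf_contains_digit}


def slice_membership(examples):
    """Map each slice name to the indices of the examples that fall in it."""
    return {
        name: [i for i, ex in enumerate(examples) if sf(ex)]
        for name, sf in SLICING_FUNCTIONS.items()
    }


def per_slice_accuracy(examples, predictions, membership):
    """Accuracy inside each slice; None for an empty slice."""
    report = {}
    for name, indices in membership.items():
        if not indices:
            report[name] = None
            continue
        correct = sum(predictions[i] == examples[i]["label"] for i in indices)
        report[name] = correct / len(indices)
    return report


examples = [
    {"text": "book a table", "label": "booking"},
    {"text": "call 911 now", "label": "emergency"},
    {"text": "please cancel my order number 42 today", "label": "cancel"},
    {"text": "what is the weather like in Delhi", "label": "weather"},
]
predictions = ["booking", "booking", "cancel", "weather"]

membership = slice_membership(examples)
report = per_slice_accuracy(examples, predictions, membership)
print(membership, report)
```

Overall accuracy here is 0.75, while both slices sit at 0.5. This is the kind of gap that slice level monitoring is meant to surface.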

21 September, 2019

3 August, 2019

27 July, 2019

20 July, 2019

13 July, 2019

6 July, 2019

1 July, 2019

25 June, 2019

15 June, 2019

8 June, 2019

1 June, 2019

  • Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., Wen, Z., & others, A tutorial on Thompson sampling, Foundations and Trends® in Machine Learning, 11(1), 1–96 (2018). (cite:russo2018tutorial)

18 May, 2019

13 May, 2019

Created: 2020-09-24 Thu 22:09
