Paper Reading
@ Vernacular.ai
List of papers we cover during our weekly paper reading session. For past and missing links/notes, check out the (private) wiki.
28 August, 2020
21 August, 2020
14 August, 2020
24 July, 2020
10 July, 2020
3 July, 2020
12 June, 2020
Sentence-bert: Sentence embeddings using siamese bert-networks
CUSTOM_ID: reimers2019sentence YEAR: 2019 AUTHOR: Reimers, Nils and Gurevych, Iryna
PyTorch: An imperative style, high-performance deep learning library
CUSTOM_ID: paszke2019pytorch YEAR: 2019 AUTHOR: Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and others
What's Hidden in a Randomly Weighted Neural Network?
CUSTOM_ID: ramanujan2020s YEAR: 2020 AUTHOR: Ramanujan, Vivek and Wortsman, Mitchell and Kembhavi, Aniruddha and Farhadi, Ali and Rastegari, Mohammad
5 June, 2020
Hierarchical attention networks for document classification
CUSTOM_ID: yang2016hierarchical YEAR: 2016 AUTHOR: Yang, Zichao and Yang, Diyi and Dyer, Chris and He, Xiaodong and Smola, Alex and Hovy, Eduard
Training classifiers with natural language explanations
CUSTOM_ID: hancock2018training YEAR: 2018 AUTHOR: Hancock, Braden and Bringmann, Martin and Varma, Paroma and Liang, Percy and Wang, Stephanie and Ré, Christopher
29 May, 2020
Differentiable Reasoning over a Virtual Knowledge Base
CUSTOM_ID: dhingra2020differentiable YEAR: 2020 AUTHOR: Dhingra, Bhuwan and Zaheer, Manzil and Balachandran, Vidhisha and Neubig, Graham and Salakhutdinov, Ruslan and Cohen, William W
Natural tts synthesis by conditioning wavenet on mel spectrogram predictions
CUSTOM_ID: shen2018naturaltts YEAR: 2018 AUTHOR: Shen, Jonathan and Pang, Ruoming and Weiss, Ron J and Schuster, Mike and Jaitly, Navdeep and Yang, Zongheng and Chen, Zhifeng and Zhang, Yu and Wang, Yuxuan and Skerry-Ryan, RJ and others
15 May, 2020
NBDT: Neural-Backed Decision Trees
CUSTOM_ID: wan2020nbdt YEAR: 2020 AUTHOR: Wan, Alvin and Dunlap, Lisa and Ho, Daniel
8 May, 2020
Designing and Deploying Online Field Experiments
CUSTOM_ID: bakshy2014designing YEAR: 2014 AUTHOR: Bakshy, Eytan and Eckles, Dean and Bernstein, Michael S.
25 April, 2020
Gmail Smart Compose: Real-Time Assisted Writing
CUSTOM_ID: andrew2019gmail YEAR: 2019 AUTHOR: Andrew Dai and Benjamin Lee and Gagan Bansal and Jackie Tsay and Justin Lu and Mia Chen and Shuyuan Zhang and Tim Sohn and Yinan Wang and Yonghui Wu and Yuan Cao and Zhifeng Chen
Supervised Learning with Quantum-Inspired Tensor Networks
CUSTOM_ID: stoudenmire2017supervised YEAR: 2017 AUTHOR: Stoudenmire, Miles and Schwab, David
17 April, 2020
3 April, 2020
Sideways: Depth-Parallel Training of Video Models
CUSTOM_ID: malinowski2020sideways YEAR: 2020 AUTHOR: Malinowski, Mateusz and Swirszcz, Grzegorz and Carreira, Joao and Patraucean, Viorica
Speech2face: Learning the face behind a voice
CUSTOM_ID: oh2019speech2face YEAR: 2019 AUTHOR: Oh, Tae-Hyun and Dekel, Tali and Kim, Changil and Mosseri, Inbar and Freeman, William T and Rubinstein, Michael and Matusik, Wojciech
Dialog Methods for Improved Alphanumeric String Capture
CUSTOM_ID: peters2011dialog YEAR: 2011 AUTHOR: Peters, Doug and Stubley, Peter
Presents a way to collect alphanumeric strings at the dialog level via an ASR. Two main ideas (a sketch follows the list):
- Skip-listing over n-best hypotheses across turns (attempts)
- Chunking and confirming pieces one by one
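Below is a minimal Python sketch of how these two ideas could fit together. The chunking scheme, n-best format, and confirmation callback are assumptions for illustration, not the paper's implementation.

```python
def collect_alphanumeric(chunk_nbest_lists, confirm):
    """Collect a long alphanumeric string chunk by chunk, keeping a skip list of
    hypotheses the user has already rejected so the same wrong n-best entry is
    not offered again on a retry. `chunk_nbest_lists` holds one n-best list of
    (hypothesis, confidence) pairs per chunk; `confirm` stands in for the
    confirmation turn with the user."""
    rejected, collected = set(), []
    for nbest in chunk_nbest_lists:
        for hyp, _conf in nbest:
            if hyp in rejected:
                continue                     # skip-listed from an earlier attempt
            if confirm(hyp):
                collected.append(hyp)        # chunk accepted, move on to the next one
                break
            rejected.add(hyp)                # remember the rejection for later retries
    return "".join(collected)

# Toy usage with a mock confirmation turn
nbest_lists = [[("AB1", 0.6), ("AB7", 0.3)], [("29X", 0.7)]]
print(collect_alphanumeric(nbest_lists, confirm=lambda hyp: hyp != "AB1"))  # -> AB729X
```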
28 February, 2020
Self-supervised dialogue learning
CUSTOM_ID: wu2019self YEAR: 2019 AUTHOR: Wu, Jiawei and Wang, Xin and Wang, William Yang
The self-supervision signal here comes from a model that tries to predict whether a provided tuple of turns is in order or not. Plugging this in as the discriminator in generative-discriminative dialog systems, they find better results.
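A minimal sketch of how such order/misorder training pairs could be built from a dialogue; the window size and sampling scheme are illustrative assumptions, not the paper's exact setup.

```python
import random

def make_order_examples(dialogue, window=2, seed=0):
    """Build (turn-window, label) pairs for the order predictor:
    label 1 for the natural order, 0 for a shuffled copy."""
    rng = random.Random(seed)
    examples = []
    for i in range(len(dialogue) - window + 1):
        turns = dialogue[i:i + window]
        examples.append((tuple(turns), 1))          # positive: original order
        shuffled = turns[:]
        rng.shuffle(shuffled)
        if shuffled != turns:
            examples.append((tuple(shuffled), 0))   # negative: permuted order
    return examples

dialogue = ["hi", "hello, how can I help?", "I want to book a cab", "sure, where to?"]
print(make_order_examples(dialogue))
```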
7 February, 2020
Learning from Dialogue after Deployment: Feed Yourself, Chatbot!
CUSTOM_ID: hancock2019learning YEAR: 2019 AUTHOR: Hancock, Braden and Bordes, Antoine and Mazare, Pierre-Emmanuel and Weston, Jason
This is an approach to collect a supervision signal from deployment data. There are three tasks for the system (a chatbot that ranks candidate responses):
- Dialogue. The main task. Given the turns so far, the bot ranks which response to utter.
- Satisfaction. Given the turns so far, the last being a user utterance, predict whether the user is satisfied.
- Feedback. After asking the user for feedback, predict the user's response (the feedback) based on the turns so far.
The models share weights, mostly between tasks 1 and 3.
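A rough sketch of the harvesting logic this implies at deployment time; the threshold, function names, and wording of the feedback prompt are assumptions for illustration, not the paper's code.

```python
def harvest_examples(context, user_utterance, satisfaction_model, ask_user, threshold=0.5):
    """Depending on predicted satisfaction (task 2), either keep the user's reply
    as a new Dialogue example (task 1) or ask for feedback and keep a Feedback
    example (task 3). `ask_user` stands in for an extra dialogue turn."""
    score = satisfaction_model(context, user_utterance)
    if score >= threshold:
        # User seems satisfied: their reply is good supervision for the main task.
        return {"task": "dialogue", "context": context, "target": user_utterance}
    # User seems unsatisfied: ask what the bot should have said instead.
    feedback = ask_user("Oops, I messed up. What should I have said instead?")
    return {"task": "feedback", "context": context, "target": feedback}

# Toy usage with stub models
example = harvest_examples(
    context=["hi", "hello! how can I help?"],
    user_utterance="that's not what I asked",
    satisfaction_model=lambda ctx, utt: 0.1,
    ask_user=lambda prompt: "you should have asked for my order number",
)
print(example)
```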
31 January, 2020
Modeling Sequences with Quantum States: A Look Under the Hood
CUSTOM_ID: bradley2019modeling YEAR: 2019 AUTHOR: Bradley, Tai-Danae and Stoudenmire, E Miles and Terilla, John
This paper explores a new direction in language modelling. The idea is still to learn the underlying distribution of sequences of characters, but here they do it by learning the quantum analogue of the classical probability distribution function. Unlike in the classical case, the marginal distributions there carry enough information to reconstruct the joint distribution. This is the central idea of the paper and is explained in the first half. The second half explains the theory and implementation of the training algorithm, with a simple example. Future work would be to apply the algorithm to more complicated examples and even adapt it to variable-length sequences.
Deep voice 2: Multi-speaker neural text-to-speech
CUSTOM_ID: gibiansky2017deep YEAR: 2017 AUTHOR: Gibiansky, Andrew and Arik, Sercan and Diamos, Gregory and Miller, John and Peng, Kainan and Ping, Wei and Raiman, Jonathan and Zhou, Yanqi
This paper suggests improvements to Deep Voice and Tacotron, and also proposes a way to add trainable speaker embeddings. The speaker embeddings are initialized randomly and trained jointly through backpropagation. The paper lists some patterns that lead to better performance (see the sketch after this list):
- Transforming the speaker embedding to an appropriate dimension and form for every place it is added to the model; the transformed embeddings are called site-specific speaker embeddings.
- Initializing recurrent-layer hidden states with the site-specific speaker embeddings.
- Concatenating the site-specific speaker embedding to the input at every timestep of the recurrent layer.
- Multiplying layer activations element-wise with the site-specific speaker embeddings.
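A minimal PyTorch sketch of these patterns around a single GRU layer; the layer sizes, the softsign projection, and the module names are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiteSpecificSpeaker(nn.Module):
    """Projects the shared speaker embedding to a site-specific one: an affine
    map to the dimension needed at that site, followed by a bounded nonlinearity."""
    def __init__(self, speaker_dim, site_dim):
        super().__init__()
        self.proj = nn.Linear(speaker_dim, site_dim)

    def forward(self, speaker_emb):
        return F.softsign(self.proj(speaker_emb))

class SpeakerConditionedGRU(nn.Module):
    """One recurrent layer showing the three usage patterns from the list above."""
    def __init__(self, input_dim, hidden_dim, speaker_dim):
        super().__init__()
        self.init_site = SiteSpecificSpeaker(speaker_dim, hidden_dim)   # hidden-state init
        self.input_site = SiteSpecificSpeaker(speaker_dim, hidden_dim)  # concatenated input
        self.gate_site = SiteSpecificSpeaker(speaker_dim, hidden_dim)   # output gating
        self.gru = nn.GRU(input_dim + hidden_dim, hidden_dim, batch_first=True)

    def forward(self, x, speaker_emb):
        # x: (batch, time, input_dim); speaker_emb: (batch, speaker_dim)
        B, T, _ = x.shape
        h0 = self.init_site(speaker_emb).unsqueeze(0)                    # init hidden state
        s_in = self.input_site(speaker_emb).unsqueeze(1).expand(B, T, -1)
        out, _ = self.gru(torch.cat([x, s_in], dim=-1), h0)              # concat at every step
        return out * self.gate_site(speaker_emb).unsqueeze(1)            # element-wise multiply
```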
A credit assignment compiler for joint prediction
CUSTOM_ID: chang2016credit YEAR: 2016 AUTHOR: Chang, Kai-Wei and He, He and Ross, Stephane and Daume III, Hal and Langford, John
This talks about an API for framing learning-to-search (L2S) style joint prediction problems in the style of an imperative program, which allows for two optimizations:
- memoization
- forced path collapse, i.e. getting losses without rolling out to the last state
The main reduction here is to a cost-sensitive classification problem.
17 January, 2020
Learning language from a large (unannotated) corpus
CUSTOM_ID: vepstas2014learning YEAR: 2014 AUTHOR: Vepstas, Linas and Goertzel, Ben
Introductory paper on the general approach used in OpenCog's learn project. The idea is to learn various generalizable syntactic and semantic relations from an unannotated corpus. The relations are expressed as graphs sitting on top of link grammar and meaning-text theory (MTT). While the general approach is sketched out decently, there are details to be filled in at various steps and experiments to run (as of the writing in 2014).
On another note, the document is a nice read for its many interesting ways of looking at ideas in language understanding and in going from syntax to reasoning via semantics.
10 January, 2020
Parsing English with a link grammar
CUSTOM_ID: sleator1995parsing YEAR: 1995 AUTHOR: Sleator, Daniel DK and Temperley, Davy
We came to this via OpenCog's learn project. It is also a nice perspective-setter if you have missed a formal introduction to grammars. Overall, a link grammar defines connectors on the left and right sides of each word, with disjunctions and conjunctions incorporated, which then link together to form a sentence under certain constraints.
This specific paper presents the formulation and builds a parser for English, covering many (though not all) linguistic phenomena.
20 December, 2019
Generalized end-to-end loss for speaker verification
CUSTOM_ID: wan2018generalized YEAR: 2018 AUTHOR: Wan, Li and Wang, Quan and Papir, Alan and Moreno, Ignacio Lopez
This paper is a development over the authors' previous work, the tuple-based end-to-end (TE2E) loss, for speaker verification. They generalize the cosine similarity used in TE2E by building a similarity matrix between each utterance embedding and every speaker's centroid. They suggest two losses in the paper:
- Softmax loss
- Contrast loss
Both loss functions have two components: one pulls a speaker's utterances towards their own centroid, and the other pushes them away from other speakers' centroids. Of the two, the contrast loss is the more rigorous one.
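A rough PyTorch sketch of the two losses on a batch of utterance embeddings; the scale/bias values are illustrative, and, unlike the paper, each utterance is not excluded from its own centroid.

```python
import torch
import torch.nn.functional as F

def ge2e_losses(embeddings, w=10.0, b=-5.0):
    """embeddings: (n_speakers, n_utterances, dim). w and b play the role of the
    learnable scale and bias from the paper but are fixed here for illustration."""
    N, M, _ = embeddings.shape
    e = F.normalize(embeddings, dim=-1)
    centroids = F.normalize(e.mean(dim=1), dim=-1)                  # (N, dim)
    # similarity of every utterance to every speaker centroid: (N, M, N)
    sim = w * torch.einsum("nmd,kd->nmk", e, centroids) + b

    target = torch.arange(N).unsqueeze(1).repeat(1, M)              # true speaker id
    # Softmax loss: cross entropy against the true speaker
    softmax_loss = F.cross_entropy(sim.reshape(N * M, N), target.reshape(-1))

    # Contrast loss: pull towards own centroid, push away from the closest other centroid
    pos = torch.sigmoid(sim.gather(-1, target.unsqueeze(-1)).squeeze(-1))
    own = torch.eye(N, dtype=torch.bool).unsqueeze(1).expand(N, M, N)
    neg = torch.sigmoid(sim.masked_fill(own, float("-inf"))).max(dim=-1).values
    contrast_loss = (1.0 - pos + neg).mean()
    return softmax_loss, contrast_loss

# Toy usage: 4 speakers, 5 utterances each, 64-dim embeddings
print(ge2e_losses(torch.randn(4, 5, 64)))
```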
13 December, 2019
Towards end-to-end spoken language understanding
CUSTOM_ID: serdyuk2018towards YEAR: 2018 AUTHOR: Serdyuk, Dmitriy and Wang, Yongqiang and Fuegen, Christian and Kumar, Anuj and Liu, Baiyang and Bengio, Yoshua
This paper talks about developing an end-to-end model for intent recognition from speech. Current systems have several components, such as ASR and NLU, each of which introduces errors of its own and degrades the quality of the speech-to-intent pipeline. Experiments were performed on two tasks, speech-to-domain and speech-to-intent. The model's architecture is mostly inspired by end-to-end speech synthesis models. A distinctive feature of the architecture is that they sub-sample the hidden sequence after the first GRU layer, both to reduce the sequence length and to tackle the vanishing gradient problem.
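A minimal sketch of that sub-sampling idea in PyTorch, assuming log-mel input features; the layer sizes, bidirectionality, sub-sampling factor, and max-pooling readout are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Speech2Intent(nn.Module):
    """Log-mel frames in, intent logits out, with temporal sub-sampling after
    the first GRU layer."""
    def __init__(self, n_mels=40, hidden=256, n_intents=10, subsample=2):
        super().__init__()
        self.subsample = subsample
        self.gru1 = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.gru2 = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_intents)

    def forward(self, x):
        # x: (batch, frames, n_mels)
        h, _ = self.gru1(x)
        h = h[:, ::self.subsample, :].contiguous()   # drop frames: shorter sequence,
                                                     # shorter gradient paths through time
        h, _ = self.gru2(h)
        return self.classifier(h.max(dim=1).values)  # pool over time, classify

# Toy usage: batch of 2 utterances, 300 frames each
print(Speech2Intent()(torch.randn(2, 300, 40)).shape)  # -> torch.Size([2, 10])
```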
Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One
CUSTOM_ID: grathwohl2019classifier YEAR: 2019 AUTHOR: Will Grathwohl and Kuan-Chieh Wang and Jörn-Henrik Jacobsen and David Duvenaud and Mohammad Norouzi and Kevin Swersky
They take a regular classifier, pick out the logits before the softmax, and formulate an energy-based model that can give \(P(x, y)\) and \(P(x)\). The formulation itself is pretty simple, with the energy function being \(E(x) = -\mathrm{LogSumExp}_y f_\theta(x)[y]\). The final loss sums the cross entropy (for the discriminative part) and the negative log likelihood of \(P(x)\), approximated using SGLD. Check out the repo here.
Although the learning mechanism is a little fragile and needs work to be generally stable, the results are neat.
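A minimal sketch of the resulting training step, with the SGLD inner loop written out; the step count, step size, and noise scale are illustrative, and the paper's actual recipe adds a replay buffer and other stabilization tricks.

```python
import torch
import torch.nn.functional as F

def jem_loss(model, x, y, sgld_steps=20, sgld_lr=1.0, sgld_noise=0.01):
    """`model` maps inputs to logits; returns cross entropy plus an estimate of -log p(x)."""
    ce_loss = F.cross_entropy(model(x), y)            # discriminative part, p(y|x)

    # E(x) = -LogSumExp_y f(x)[y]; draw approximate samples from p(x) with SGLD,
    # starting from uniform noise and descending the energy with injected noise.
    x_hat = torch.rand_like(x, requires_grad=True)
    for _ in range(sgld_steps):
        energy = -model(x_hat).logsumexp(dim=1).sum()
        grad, = torch.autograd.grad(energy, x_hat)
        x_hat = (x_hat - sgld_lr * grad
                 + sgld_noise * torch.randn_like(x_hat)).detach().requires_grad_(True)

    # Maximize log p(x): push down the energy of data, push up the energy of samples.
    gen_loss = (-model(x).logsumexp(dim=1) + model(x_hat.detach()).logsumexp(dim=1)).mean()
    return ce_loss + gen_loss

# Toy usage with a tiny classifier on flattened 8x8 "images"
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 10))
jem_loss(net, torch.rand(4, 1, 8, 8), torch.randint(0, 10, (4,))).backward()
```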
29 November, 2019
Overton: A Data System for Monitoring and Improving Machine-Learned Products
CUSTOM_ID: re2019overton YEAR: 2019 AUTHOR: Ré, Christopher and Niu, Feng and Gudipati, Pallavi and Srisuwananukorn, Charles
This is more about managing supervision than models. There are three problems they are trying to solve:
- Fine grained quality monitoring,
- Support for multi-component pipelines, and
- Updating supervision
For this, they build easy-to-use abstractions for describing supervision and developing models. They also do a lot of multitask learning and Snorkel-style weak supervision, including the recent slicing abstractions for fine-grained quality control.
While you have to adapt a few pieces for your own case (and scale), Overton is a nice testimony to the success of things like weak supervision and higher-level development abstractions in production.
Slice-based learning: A programming model for residual learning in critical data slices
CUSTOM_ID: chen2019slice YEAR: 2019 AUTHOR: Chen, Vincent and Wu, Sen and Ratner, Alexander J and Weng, Jen and Ré, Christopher
This takes Snorkel's labelling-function idea and uses it to group data instances into slices: segments that are interesting to us from an overall quality perspective. These slicing functions matter not only for identifying and narrowing down to specific kinds of data instances, but also for learning slice-specific representations, which works out as a computationally cheap way (there are other benefits too) of replicating a Mixture-of-Experts style model.
As with labelling functions, slice membership is predicted by noisy heuristics. The membership value, along with the slice representations (and slice prediction confidences), helps build the slice-aware representation used for the final task. The appendix has a few good examples of slicing functions.
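For flavour, here are a couple of hypothetical slicing functions in this spirit; the decorator and field names are made up for illustration and are not the paper's (or Snorkel's) actual API.

```python
def slicing_function(f):
    """Marks a heuristic as a slicing function (stand-in for the real decorator)."""
    f.is_slicing_function = True
    return f

@slicing_function
def short_utterance(example):
    # Very short user turns, a segment the base model often gets wrong.
    return len(example["text"].split()) <= 3

@slicing_function
def contains_digits(example):
    # Utterances carrying alphanumeric content (order ids, phone numbers, ...).
    return any(ch.isdigit() for ch in example["text"])

dataset = [{"text": "yes"}, {"text": "my order id is 4521"}, {"text": "I want to change my address"}]
slices = [short_utterance, contains_digits]
# Noisy slice membership per example; downstream, each slice gets its own
# representation, combined into a slice-aware representation weighted by
# membership and confidence.
memberships = [[sf(x) for sf in slices] for x in dataset]
print(memberships)   # [[True, False], [False, True], [False, False]]
```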
21 September, 2019
- Moody, C. E., Mixing dirichlet topic models and word embeddings to make lda2vec, arXiv preprint arXiv:1605.02019 (2016). (cite:moody2016mixing)
- Ren, L., Xie, K., Chen, L., & Yu, K., Towards universal dialogue state tracking, arXiv preprint arXiv:1810.09587 (2018). (cite:ren2018towards)
- Coucke, A., Saade, A., Ball, A., Théodore Bluche, Caulier, A., Leroy, D., Clément Doumouro, …, Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces, CoRR, abs/1805.10190 (2018). (cite:DBLP:journals/corr/abs-1805-10190)
3 August, 2019
- Kim, S., Eriksson, T., Kang, H., & Hee Youn, D., A pitch synchronous feature extraction method for speaker recognition (2004). (cite:PSMFCC)
- Chen, J., Elements of human voice (2016). (cite:HumanVoice)
- Ghorbani, A., & Zou, J., Data shapley: equitable valuation of data for machine learning, arXiv preprint arXiv:1904.02868 (2019). (cite:ghorbani2019data)
- Shen, G., Horikawa, T., Majima, K., & Kamitani, Y., Deep image reconstruction from human brain activity, PLoS computational biology, 15(1), 1006633 (2019). (cite:shen2019deep)
- Daumé III, Hal, Frustratingly easy domain adaptation, arXiv preprint arXiv:0907.1815 (2009). (cite:daume2009frustratingly)
27 July, 2019
- Belkin, M., Hsu, D., Ma, S., & Mandal, S., Reconciling modern machine learning and the bias-variance trade-off, arXiv preprint arXiv:1812.11118 (2018). (cite:belkin2018reconciling)
20 July, 2019
- Locatello, F., Bauer, S., Lucic, M., Gelly, S., Schölkopf, Bernhard, & Bachem, O., Challenging common assumptions in the unsupervised learning of disentangled representations, arXiv preprint arXiv:1811.12359 (2018). (cite:locatello2018challenging)
13 July, 2019
- Advani, M. S., & Saxe, A. M., High-dimensional dynamics of generalization error in neural networks, arXiv preprint arXiv:1710.03667 (2017). (cite:advani2017high)
6 July, 2019
- Friedman, J., Hastie, T., & Tibshirani, R., The elements of statistical learning (pp. 51–61) (2001). Springer series in statistics, New York. (cite:friedman2001elements)
- Barham, P., & Isard, M., Machine learning systems are stuck in a rut, In Proceedings of the Workshop on Hot Topics in Operating Systems (pp. 177–183) (2019). New York, NY, USA: ACM. (cite:barham2019machine)
- Hastie, T., Montanari, A., Rosset, S., & Tibshirani, R. J., Surprises in high-dimensional ridgeless least squares interpolation, arXiv preprint arXiv:1903.08560 (2019). (cite:hastie2019surprises)
- Levitan, S. I., Mishra, T., & Bangalore, S., Automatic identification of gender from speech, In Proceedings of Speech Prosody (pp. 84–88) (2016). (cite:levitan2016automatic)
1 July, 2019
- Friedman, J., Hastie, T., & Tibshirani, R., The elements of statistical learning (pp. 51–61) (2001). Springer series in statistics, New York. (cite:friedman2001elements)
- Graf, S., Herbig, T., Buck, M., & Schmidt, G., Features for voice activity detection: a comparative analysis, EURASIP Journal on Advances in Signal Processing, 2015(1), 91 (2015). (cite:graf2015features)
- Welling, M., & Teh, Y. W., Bayesian learning via stochastic gradient langevin dynamics, In Proceedings of the 28th international conference on machine learning (ICML-11) (pp. 681–688) (2011). (cite:welling2011bayesian)
- Goodman, J., A bit of progress in language modeling, arXiv preprint arXiv:cs/0108005 (2001). (cite:goodman2001progress)
- Cotterell, R., Mielke, S. J., Eisner, J., & Roark, B., Are all languages equally hard to language-model?, In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 536–541) (2018). New Orleans, Louisiana: Association for Computational Linguistics. (cite:cotterell-etal-2018-languages)
25 June, 2019
- Reynolds, D. A., Quatieri, T. F., & Dunn, R. B., Speaker verification using adapted gaussian mixture models, Digital signal processing, 10(1-3), 19–41 (2000). (cite:reynolds2000speaker)
- Snoek, J., Larochelle, H., & Adams, R. P., Practical bayesian optimization of machine learning algorithms, arXiv preprint arXiv:1206.2944 (2012). (cite:snoek2012practical)
- Breck, E., Zinkevich, M., Polyzotis, N., Whang, S., & Roy, S., Data validation for machine learning, In Proceedings of SysML (2019). (cite:breck2019data)
- Carbonell, J. G., Learning by analogy: formulating and generalizing plans from past experience, In Machine learning (pp. 137–161) (1983). Springer. (cite:carbonell1983learning)
- Liu, B., Wang, L., Liu, M., & Xu, C., Lifelong federated reinforcement learning: a learning architecture for navigation in cloud robotic systems, abs/1901.06455 (2019). (cite:Liu2019LifelongFR)
15 June, 2019
- Mohri, M., Pereira, F., & Riley, M., Weighted finite-state transducers in speech recognition, Computer Speech & Language, 16(1), 69–88 (2002). (cite:MOHRI200269)
- Ueffing, N., Bisani, M., & Vozila, P., Improved models for automatic punctuation prediction for spoken and written text, In Interspeech (pp. 3097–3101) (2013). (cite:ueffing2013improved)
- Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., & Yu, S. X., Large-scale long-tailed recognition in an open world, arXiv preprint arXiv:1904.05160 (2019). (cite:liu2019large)
- Iyer, A., Jonnalagedda, M., Parthasarathy, S., Radhakrishna, A., & Rajamani, S. K., Synthesis and machine learning for heterogeneous extraction, In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (pp. 301–315) (2019). (cite:iyer2019synthesis)
8 June, 2019
- Dehak, N., Kenny, P. J., Dehak, Réda, Dumouchel, P., & Ouellet, P., Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788–798 (2010). (cite:dehak2010front)
- Dehak, N., Dehak, R., Kenny, P., Brümmer, Niko, Ouellet, P., & Dumouchel, P., Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification, In Tenth Annual conference of the international speech communication association (2009). (cite:dehak2009support)
- Sutton, C., & McCallum, A., An introduction to conditional random fields for relational learning, In Introduction to Statistical Relational Learning (2006). (cite:sutton06introduction)
- Mendis, C., Droppo, J., Maleki, S., Musuvathi, M., Mytkowicz, T., & Zweig, G., Parallelizing wfst speech decoders, In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5325–5329) (2016). (cite:mendis2016parallelizing)
1 June, 2019
- Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., Wen, Z., & others, A tutorial on thompson sampling, Foundations and Trends® in Machine Learning, 11(1), 1–96 (2018). (cite:russo2018tutorial)
18 May, 2019
- Gravano, A., Jansche, M., & Bacchiani, M., Restoring punctuation and capitalization in transcribed speech, In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4741–4744) (2009). (cite:gravano2009restoring)
- Mintz, M., Bills, S., Snow, R., & Jurafsky, D., Distant supervision for relation extraction without labeled data, In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 (pp. 1003–1011) (2009). (cite:mintz2009distant)
- Beygelzimer, A., Daumé, Hal, Langford, J., & Mineiro, P., Learning reductions that really work, Proceedings of the IEEE, 104(1), 136–147 (2016). (cite:beygelzimer2016learning)
13 May, 2019
- Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., …, Hidden technical debt in machine learning systems, In Advances in neural information processing systems (pp. 2503–2511) (2015). (cite:sculley2015hidden)
- Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., …, Google's neural machine translation system: bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144 (2016). (cite:wu2016google)
- Ghahramani, Z., Unsupervised learning, In Summer School on Machine Learning (pp. 72–112) (2003). (cite:ghahramani2003unsupervised)
- Hundman, K., Constantinou, V., Laporte, C., Colwell, I., & Soderstrom, T., Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding, In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 387–395) (2018). (cite:hundman2018detecting)