In this post, I implement a simple word completion model, in the spirit of Karpathy's char-RNN, but using supervised, linear, online learning on word embeddings. More precisely, I use scikit-learn's SGDClassifier, a simple linear classifier that can be updated incrementally.
Keep in mind that this is an illustrative example, based on a few words and a small vocabulary. There are many ways to improve the model, and many other configurations could be envisaged, so feel free to experiment and extend this example. Nonetheless, the grammatical structure of the generated text (don't generalize from this result yet) is surprisingly good.
My two-cents, non-scientific extrapolation is that artificial neural networks are not intrinsically better than other methods: what matters is having a model with high capacity, capable of learning and generalizing well.
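Before the full scripts, here is a minimal, self-contained sketch of the core mechanism. The toy sentence and the random stand-in embeddings are my own simplifications for brevity (the actual example below uses gensim Word2Vec vectors): each (context, next word) pair is turned into a feature vector by concatenating the context words' embeddings, and is fed to SGDClassifier.partial_fit one example at a time, so the classifier keeps learning as new text arrives.

# Minimal sketch, not the full word-online.py script: predict the next word
# from the concatenated embeddings of the two previous words, updating the
# linear classifier one example at a time with partial_fit.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
toy_text = "the cat sat on the mat and the cat slept on the mat".split()
vocab = sorted(set(toy_text))
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Stand-in embeddings (random here for brevity; the post uses gensim Word2Vec)
dim = 8
emb = {w: rng.normal(size=dim) for w in vocab}

clf = SGDClassifier(loss="hinge", learning_rate="constant", eta0=0.01)
classes = np.arange(len(vocab))

# One online pass: each (context, target) pair updates the model immediately
for i in range(len(toy_text) - 2):
    x = np.concatenate([emb[toy_text[i]], emb[toy_text[i + 1]]]).reshape(1, -1)
    y = [word_to_idx[toy_text[i + 2]]]
    clf.partial_fit(x, y, classes=classes)

# Predict which word is most likely to follow "the cat"
x_new = np.concatenate([emb["the"], emb["cat"]]).reshape(1, -1)
print(vocab[int(clf.predict(x_new)[0])])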
Here is how to reproduce the example, assuming you named the file word-online.py (the repository is named word-online):
uv venv venv --python=3.11
source venv/bin/activate
uv pip install -r requirements.txt
python word-online.py
word-online.py contains the following code:
Python version
import numpy as np
import gensim
import time  # Added for the delay parameter
from collections import deque
from tqdm import tqdm
from sklearn.linear_model import SGDClassifier

# Sample text
text = """Hello world, this is an online learning example with word embeddings.
It learns words and generates text incrementally using an SGD classifier."""


def debug_print(x):
    print(f"{x}")


# Tokenization (simple space-based)
words = text.lower().split()
vocab = sorted(set(words))
vocab.append("<UNK>")  # Add unknown token for OOV words

# Train Word2Vec model (or load pretrained embeddings)
embedding_dim = 50  # Change to 100/300 if using a larger model
word2vec = gensim.models.Word2Vec([words], vector_size=embedding_dim,
                                  window=5, min_count=1, sg=0)

# Create word-to-index mapping
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

# Hyperparameters
context_size = 12  # Default 10, words used for prediction context
learning_rate = 0.005
epochs = 10

# Prepare training data
X_train, y_train = [], []
for i in tqdm(range(len(words) - context_size)):
    context = words[i:i + context_size]
    target = words[i + context_size]
    # Convert context words to embeddings
    context_embedding = np.concatenate([word2vec.wv[word] for word in context])
    X_train.append(context_embedding)
    y_train.append(word_to_idx[target])

X_train, y_train = np.array(X_train), np.array(y_train)

# Initialize SGD-based classifier
clf = SGDClassifier(loss="hinge", max_iter=1, learning_rate="constant", eta0=learning_rate)

# Online training (stochastic updates, multiple passes)
for epoch in tqdm(range(epochs)):
    for i in range(len(X_train)):
        clf.partial_fit([X_train[i]], [y_train[i]], classes=np.arange(len(vocab)))


# 🔥 Softmax function for probability scaling
def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))  # Stability trick
    return exp_logits / np.sum(exp_logits)


def sample_from_logits(logits, k=5, temperature=1.0, random_seed=123):
    """Applies Top-K sampling & temperature scaling."""
    logits = np.array(logits) / temperature  # Apply temperature scaling
    probs = softmax(logits)  # Convert logits to probabilities
    # Select top-K indices
    top_k_indices = np.argsort(probs)[-k:]
    top_k_probs = probs[top_k_indices]
    top_k_probs /= top_k_probs.sum()  # Normalize
    # Sample from Top-K distribution
    np.random.seed(random_seed)
    return np.random.choice(top_k_indices, p=top_k_probs)


def generate_text(seed="this is", length=20, k=5, temperature=1.0, random_state=123, delay=3):
    seed_words = seed.lower().split()
    # Ensure context has `context_size` words (pad with "<PAD>", mapped to zero vectors, if needed)
    while len(seed_words) < context_size:
        seed_words.insert(0, "<PAD>")
    context = deque(
        [word_to_idx[word] if word in word_to_idx else -1 for word in seed_words[-context_size:]],
        maxlen=context_size
    )
    generated = seed
    previous_word = seed
    for _ in range(length):
        # Generate embeddings, use a zero vector if word is missing
        context_embedding = np.concatenate([
            word2vec.wv[idx_to_word[idx]] if idx in idx_to_word else np.zeros(embedding_dim)
            for idx in context
        ])
        logits = clf.decision_function([context_embedding])[0]  # Get raw scores
        # Sample next word using Top-K & temperature scaling
        pred_idx = sample_from_logits(logits, k=k, temperature=temperature)
        next_word = idx_to_word.get(pred_idx, "<PAD>")
        print(f"Generating next word: {next_word}")  # Added this line
        time.sleep(delay)  # Added this line
        # Capitalize after a sentence-ending period
        if previous_word not in ("", seed) and previous_word.endswith("."):
            generated += " " + next_word.capitalize()
        else:
            generated += " " + next_word
        previous_word = next_word
        context.append(pred_idx)
    return generated


# 🔥 Generate text
print("\n\n Generated Text:")
seed = "This is a"
print(seed)
print(generate_text(seed, length=12, k=1, delay=0))  # delay: seconds between generated words; delay=0 for no pause
100%|████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 12164.45it/s]
100%|███████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 8.34it/s]

Generated Text:
This is a
Generating next word: classifier.
Generating next word: an
Generating next word: sgd
Generating next word: classifier.
Generating next word: and
Generating next word: generates
Generating next word: text
Generating next word: incrementally
Generating next word: using
Generating next word: an
Generating next word: sgd
Generating next word: classifier.
This is a classifier. An sgd classifier. And generates text incrementally using an sgd classifier.
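Because the classifier is trained with partial_fit, it does not have to stop learning after this first run: you can keep feeding it new (context, target) pairs as more text arrives. The following snippet is a hedged sketch of such a continuation, reusing clf, word2vec, word_to_idx, vocab and context_size from word-online.py above; it is my own illustration, not part of the repository.

# Hypothetical continuation (not in word-online.py): keep updating the
# already-trained classifier on newly observed text.
new_text = ("it learns words and generates text incrementally using an sgd "
            "classifier. this is an online learning example with word embeddings.")
new_words = new_text.lower().split()

for i in range(len(new_words) - context_size):
    window = new_words[i:i + context_size + 1]
    # Skip windows containing words unknown to the embeddings or the vocabulary
    if not all(w in word2vec.wv and w in word_to_idx for w in window):
        continue
    x = np.concatenate([word2vec.wv[w] for w in window[:-1]]).reshape(1, -1)
    clf.partial_fit(x, [word_to_idx[window[-1]]], classes=np.arange(len(vocab)))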
R version
%%R
library(reticulate)
library(progress)
library(stats)

# Initialize Python modules through reticulate
np <- import("numpy")
gensim <- import("gensim")
time <- import("time")  # Added for the delay parameter

# Sample text
text <- "This is a model used for classification purposes. It applies continuous learning on word vectors, converting words into embeddings, learning from those embeddings, and gradually producing text through the iterative process of an SGD classifier."

debug_print <- function(x) {
  print(paste0(x))
}

# Tokenization (simple space-based)
words <- strsplit(tolower(text), "\\s+")[[1L]]
vocab <- sort(unique(words))
vocab <- c(vocab, "<UNK>")  # Add unknown token for OOV words

# Train Word2Vec model (or load pretrained embeddings)
embedding_dim <- 50L  # Change to 100/300 if using a larger model
word2vec <- gensim$models$Word2Vec(list(words), vector_size=embedding_dim,
                                   window=5L, min_count=1L, sg=0L)

# Ensure "<UNK>" is in the Word2Vec vocabulary
# This is the crucial step to fix the KeyError
if (!("<UNK>" %in% word2vec$wv$index_to_key)) {
  word2vec$wv$add_vector("<UNK>", rep(0, embedding_dim))  # Add "<UNK>" with a zero vector
}

# Create word-to-index mapping
word_to_idx <- setNames(seq_along(vocab) - 1L, vocab)  # 0-based indexing to match Python
idx_to_word <- setNames(vocab, as.character(word_to_idx))

# Hyperparameters
context_size <- 12L  # Default 10, words used for prediction context
learning_rate <- 0.005
epochs <- 10L

# Prepare training data
X_train <- list()
y_train <- list()
pb <- progress_bar$new(total = length(words) - context_size)
for (i in 1L:(length(words) - context_size)) {
  context <- words[i:(i + context_size - 1L)]
  target <- words[i + context_size]
  # Convert context words to embeddings
  context_vectors <- lapply(context, function(word) as.array(word2vec$wv[word]))
  context_embedding <- np$concatenate(context_vectors)
  X_train[[i]] <- context_embedding
  y_train[[i]] <- word_to_idx[target]
  pb$tick()
}

# Initialize SGD-based classifier
sklearn <- import("sklearn.linear_model")
clf <- sklearn$SGDClassifier(loss="hinge", max_iter=1L, learning_rate="constant", eta0=learning_rate)

# Online training (stochastic updates, multiple passes)
pb <- progress_bar$new(total = epochs)
for (epoch in 1L:epochs) {
  for (i in 1L:length(X_train)) {
    # Use the list version for indexing individual samples
    clf$partial_fit(
      np$array(list(X_train[[i]])),
      np$array(list(y_train[[i]])),
      classes=np$arange(length(vocab))
    )
  }
  pb$tick()
}

# Softmax function for probability scaling
softmax_fn <- function(logits) {
  exp_logits <- exp(logits - max(logits))  # Stability trick
  return(exp_logits / sum(exp_logits))
}

sample_from_logits <- function(logits, k=5L, temperature=1.0, random_seed=123L) {
  # Applies Top-K sampling & temperature scaling
  logits <- as.numeric(logits) / temperature  # Apply temperature scaling
  probs <- softmax_fn(logits)  # Convert logits to probabilities
  # Select top-K indices - ensure k doesn't exceed the length of logits
  k <- min(k, length(logits))
  sorted_indices <- order(probs)
  top_k_indices <- sorted_indices[(length(sorted_indices) - k + 1L):length(sorted_indices)]
  # Handle case where k=1 specially
  if (k == 1L) {
    return(top_k_indices)
  }
  top_k_probs <- probs[top_k_indices]
  # Ensure probabilities sum to 1
  top_k_probs <- top_k_probs / sum(top_k_probs)
  # Check if all probabilities are valid
  if (any(is.na(top_k_probs)) || length(top_k_probs) != length(top_k_indices)) {
    # If there are issues with probabilities, just return the highest probability item
    return(top_k_indices[which.max(probs[top_k_indices])])
  }
  # Sample from Top-K distribution
  set.seed(random_seed)
  return(sample(top_k_indices, size=1L, prob=top_k_probs))
}

generate_text <- function(seed="this is", length=20L, k=5L, temperature=1.0,
                          random_state=123L, delay=3L) {
  seed_words <- strsplit(tolower(seed), "\\s+")[[1L]]
  # Ensure context has `context_size` words (pad with zero vectors if needed)
  while (length(seed_words) < context_size) {
    seed_words <- c("<PAD>", seed_words)
  }
  # Use a fixed-size list as a ring buffer
  context <- vector("list", context_size)
  for (i in 1L:context_size) {
    word <- tail(seed_words, context_size)[i]
    if (word %in% names(word_to_idx)) {
      context[[i]] <- word_to_idx[word]
    } else {
      context[[i]] <- -1L
    }
  }
  # Track position in the ring buffer
  context_pos <- 1L
  generated <- seed
  previous_word <- seed
  for (i in 1L:length) {
    # Generate embeddings, use a zero vector if word is missing
    context_vectors <- list()
    for (idx in unlist(context)) {
      if (as.character(idx) %in% names(idx_to_word)) {
        word <- idx_to_word[as.character(idx)]
        context_vectors <- c(context_vectors, list(as.array(word2vec$wv[word])))
      } else {
        context_vectors <- c(context_vectors, list(np$zeros(embedding_dim)))
      }
    }
    context_embedding <- np$concatenate(context_vectors)
    logits <- clf$decision_function(np$array(list(context_embedding)))[1L, ]
    # Sample next word using Top-K & temperature scaling
    pred_idx <- sample_from_logits(logits, k=k, temperature=temperature, random_seed=random_state + i)
    next_word <- if (as.character(pred_idx) %in% names(idx_to_word)) {
      idx_to_word[as.character(pred_idx)]
    } else {
      "<PAD>"
    }
    print(paste0("Generating next word: ", next_word))
    if (delay > 0) {
      time$sleep(delay)  # Added delay
    }
    if (substr(previous_word, nchar(previous_word), nchar(previous_word)) == "." &&
        previous_word != "" && previous_word != seed) {
      generated <- paste0(generated, " ", toupper(substr(next_word, 1, 1)),
                          substr(next_word, 2, nchar(next_word)))
    } else {
      generated <- paste0(generated, " ", next_word)
    }
    previous_word <- next_word
    # Update context (ring buffer style)
    context[[context_pos]] <- pred_idx
    context_pos <- (context_pos %% context_size) + 1L
  }
  return(generated)
}

cat("\n\n Generated Text:\n")
seed <- "This classifier is"
cat(seed, "\n")
result <- generate_text(seed, length=2L, k=3L, delay=0L)  # delay seconds for next word generation
print(result)
Generated Text:
This classifier is
[1] "Generating next word: for"
[1] "Generating next word: text"
[1] "This classifier is for text"