Overview
Welcome to the mlX 2.0 Regression Challenge!
In this competition, your mission is to predict a song’s popularity score (0-100) using real-world music data—from audio features and artist stats to track metadata. Using machine learning, you’ll analyze trends in danceability, energy, valence, and more to train a model that forecasts how listeners will react before a song even hits the charts.
The dataset includes real Billboard-tracked tracks, and your goal is to build the most accurate regression model possible. Can your algorithm decode the secret formula behind viral hits, or will it flop harder than a one-hit wonder?
The Challenge:
Predict continuous popularity scores (not just hits/flops).
Uncover hidden patterns in audio and artist data.
Good luck, and may your model top the (leaderboard) charts!
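Before diving into the snippets below, a minimal end-to-end baseline for this kind of popularity regression might look like the following sketch. The feature and target names (danceability, energy, valence, popularity) are assumptions based on the description above, and the data is synthetic so the snippet runs standalone:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# synthetic stand-in for the competition data (column names are assumed)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "danceability": rng.random(n),
    "energy": rng.random(n),
    "valence": rng.random(n),
})
df["popularity"] = (60 * df["danceability"] + 30 * df["energy"]
                    + rng.normal(0, 5, n)).clip(0, 100)

X = df[["danceability", "energy", "valence"]]
y = df["popularity"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
print("Validation RMSE:", round(rmse, 2))
```

On the real dataset you would load the competition CSV instead of generating data, then predict on the held-out test set for submission.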
# extract the first name from the Name column
df["first_name"] = df.Name.str.split(" ").map(lambda x: x[0])
# drop columns that are unlikely to help the model
df.drop(['Name', 'Cabin', 'Ticket'], axis=1, inplace=True)
#### label encoding for target variable and others
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df['Sex'] = labelencoder.fit_transform(df['Sex'])
#### separate X and y in the train data set
data = df.values
X = data[:,0:8]
Y = data[:,8]
#### separate the numerical and categorical columns
# Numerical columns
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
# Categorical columns
cat_cols = df.select_dtypes(include=['object']).columns
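The categorical columns identified above still need to be converted to numbers before modelling. One common option (not shown in the original snippet, so treat this as an illustrative sketch with made-up data) is one-hot encoding via `pd.get_dummies`:

```python
import pandas as pd

# made-up data for illustration only
df = pd.DataFrame({
    "age": [22, 38, 26],
    "fare": [7.25, 71.28, 7.92],
    "embarked": ["S", "C", "S"],
})
num_cols = df.select_dtypes(include=["int64", "float64"]).columns
cat_cols = df.select_dtypes(include=["object"]).columns

# one-hot encode the categorical columns, leaving numerical ones as-is
encoded = pd.get_dummies(df, columns=list(cat_cols))
print(list(encoded.columns))  # ['age', 'fare', 'embarked_C', 'embarked_S']
```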
#### feature selection with recursive feature elimination (RFE)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
model_lr = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges
recur_fe = RFE(model_lr, n_features_to_select=4)
features = recur_fe.fit(X, Y)
print(features.n_features_)
print(features.support_)
print(features.ranking_)
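The boolean `support_` mask printed above can be used directly to subset the feature matrix. A self-contained sketch, with synthetic data standing in for the 8-feature array above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

# synthetic stand-in for the 8-column X above
X, Y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)

recur_fe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
recur_fe.fit(X, Y)

# keep only the columns RFE selected
X_selected = X[:, recur_fe.support_]
print(X_selected.shape)  # (200, 4)
```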
#### feature selection with Ridge coefficients
from sklearn.linear_model import Ridge
ridge_re = Ridge(alpha=1.0)
ridge_re.fit(X, Y)
import numpy as np

def print_coefs(coef, names=None, sort=False):
    if names is None:
        names = ["X%s" % x for x in range(len(coef))]
    lst = zip(coef, names)
    if sort:
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(c, 3), name) for c, name in lst)

print(print_coefs(ridge_re.coef_, names=df.columns[:-1], sort=True))
#### visualization of feature importance
import matplotlib.pyplot as plt
import pandas as pd
feature_importance = pd.Series(ridge_re.coef_, index=df.columns[:-1])
feature_importance.nlargest(10).plot(kind='barh')
plt.show()
#### K-NN classification
from sklearn.neighbors import KNeighborsClassifier
2️⃣ Prepare data
# X = features, y = target
X = train.drop('target', axis=1)
y = train['target']
3️⃣ Split data (for testing)
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
4️⃣ Create K-NN model
model = KNeighborsClassifier(n_neighbors=5) # K = 5 is common
5️⃣ Train model
model.fit(X_train, y_train)
6️⃣ Make predictions
pred = model.predict(X_val)
7️⃣ Evaluate accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_val, pred))
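A single train/validation split can be noisy; cross-validation gives a more stable estimate and is a simple way to choose K. A sketch on synthetic data (the real features would replace `make_classification` here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the competition features
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# try a few odd values of K and keep the one with the best mean CV accuracy
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in [3, 5, 7, 9]}
best_k = max(scores, key=scores.get)
print("best K:", best_k)
```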
########################################################
Feature scaling is CRUCIAL
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
Without scaling, features with large ranges dominate the distance calculation.
Distance metric
Default: Euclidean
Can try Manhattan (metric='manhattan') if it fits the data
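A quick way to check whether the metric matters is to fit the same scaled pipeline with both metrics and compare validation accuracy; synthetic data stands in for the competition features here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# synthetic stand-in for the competition features
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# scale first: K-NN distances are meaningless across mixed feature ranges
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

for metric in ["euclidean", "manhattan"]:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train, y_train)
    acc = accuracy_score(y_val, knn.predict(X_val))
    print(metric, round(acc, 3))
```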
🏎 4. Predict on test CSV
# scale test data
test_scaled = scaler.transform(test)
predictions = model.predict(test_scaled)
# save predictions
import pandas as pd
submission = pd.DataFrame({'Prediction': predictions})
submission.to_csv('submission.csv', index=False)
##############################
#### scaling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
# scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
#### logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges
model.fit(X_train, y_train)
#### RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
